Commit 302f56eb authored by Igor Merkulow

added some suggestions from Fabian and Azat

parent 0fad6f8d
@@ -49,7 +49,7 @@ After that, the InfluxDB access needs to be configured:
cp conf/influxdb.sample conf/influxdb.py
```
In the `conf/influxdb.py` file, the user name, password, database address, and the name of the database itself have to be set. For demonstration purposes, the default values can be used if the database is running on the same machine.
To get the formatted data from the database, the following command is needed (replace JOBID with a real job ID):
......
# Extended metrics set
IMPORTANT: here we are using a different naming scheme to clearly distinguish between the metrics collected and the metrics used in the reports. The metrics in the report are defined independently of all collectors to avoid ambiguities and errors. We will provide an explanation or an example of how these values are calculated, so that they have an identical meaning in all reports. For example, to get an impression of how much data is read from the disk, we have a parameter called `pfit_fs_read_bytes` - this value will probably never exist as a collected metric, since most file systems report their data separately. But for a basic understanding of the program's IO behavior, a value aggregated over all file systems should be sufficient.
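The aggregation described above can be sketched as follows; the per-filesystem input names are made up purely for illustration and are not collector metrics defined by this specification:

```python
# Hypothetical sketch: deriving the report metric `pfit_fs_read_bytes`
# from values reported separately by each file system. The keys of the
# input dict are illustrative, not part of the specification.

def aggregate_fs_read_bytes(per_fs_read_bytes: dict[str, int]) -> int:
    """Sum the read bytes reported separately by every file system."""
    return sum(per_fs_read_bytes.values())

# Example: three file systems reporting their reads separately.
per_fs = {"ext4:/": 1_048_576, "nfs:/home": 524_288, "lustre:/scratch": 2_097_152}
print(aggregate_fs_read_bytes(per_fs))  # 3670016
```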
The data in the extended set will have an additional marker denoting whether a value is required or not. Required values should be easy to collect and sufficient to get a high-level picture of the job. The optional values represent specific metrics that may not be available everywhere. Should the values be available, they will be included in the report; otherwise the section is skipped.
Additional values that are not covered by this specification are allowed, but they will currently not be included in the report. This decision should allow developers to parse all the metrics from a source without needing to eliminate the "unnecessary" results. Additionally, it should simplify the extension of the specification in the future.
@@ -10,13 +10,13 @@ Additional values, that are not covered by this specification, are allowed, but
In the combined interface for all the reports, we have to deal with multiple kinds of data, each used in a different context. At the moment we can identify five of them:
- global job data - valid for the job and unchanged during the entire job runtime (e.g. job-ID, user name etc.).
- static node data - node-related data that doesn't change for the duration of the job (e.g. node name, CPU model, RAM amount etc.).
- dynamic node data - data samples (e.g. disk reads, memory usage etc., converted to our format specification).
- aggregates per node - data aggregated over the job runtime for every node separately (e.g. maximum CPU load, total number of packets sent over the network etc.).
- aggregates per job - data aggregated over all nodes (e.g. if swap was used).
Global job-related and static node-related data is displayed in the reports more or less as is, either as part of the header or as a baseline for calculations. Dynamic data samples are used for the time series plots in the PDF report and for aggregates, and may be used for calculating recommendations, but most probably will not be displayed in raw form (due to the amount of data). Aggregated values are used to estimate job efficiency and can also be shown in the reports.
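As a rough sketch, the five kinds of data could be grouped in the combined interface like this; all class and field names are illustrative and not part of the specification:

```python
# Illustrative structure only - the actual interface is defined by the
# metric tables in this specification, not by these names.
from dataclasses import dataclass, field

@dataclass
class NodeData:
    static: dict                                     # node name, CPU model, RAM amount, ...
    samples: list = field(default_factory=list)      # dynamic data samples
    aggregates: dict = field(default_factory=dict)   # per-node aggregates

@dataclass
class JobReportData:
    global_job: dict                                 # job-ID, user name, ... (read once)
    nodes: dict = field(default_factory=dict)        # node name -> NodeData
    job_aggregates: dict = field(default_factory=dict)  # e.g. "was swap used"
```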
TODO: network data is still missing. We need to define the supported network types, what kind of data is needed, how it should be aggregated, and what kind of information can be derived from it.
@@ -36,7 +36,7 @@ TODO: do we already have metrics, supported by all tools/plugins, but not included
- INT-POS0: value is zero or greater.
- INT-TS: value is a UNIX timestamp, representing the number of seconds since 01.01.1970; the minimum value is 1,000,000,000 (reached on 09.09.2001, so we should only see larger values). If this constraint is not met, there are probably other issues on that machine.
- FLOAT-POS0: floating point value, greater than or equal to 0.0
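These constraints can be expressed as simple checks; the function names below are illustrative, not part of the specification:

```python
# Sketch of the value constraints above as plain predicate functions.

def check_int_pos0(v) -> bool:
    """INT-POS0: integer, zero or greater."""
    return isinstance(v, int) and v >= 0

def check_int_ts(v) -> bool:
    """INT-TS: UNIX timestamp, at least 1,000,000,000 (09.09.2001)."""
    return isinstance(v, int) and v >= 1_000_000_000

def check_float_pos0(v) -> bool:
    """FLOAT-POS0: floating point value >= 0.0."""
    return isinstance(v, (int, float)) and v >= 0.0

assert check_int_ts(1_600_000_000)    # plausible timestamp (year 2020)
assert not check_int_ts(999_999_999)  # pre-2001: probably a broken clock
```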
@@ -48,7 +48,7 @@ TODO: do we already have metrics, supported by all tools/plugins, but not included
Question mark means that the data is not yet there, minus sign means "not applicable".
## Metrics representing global job information
Most of this information is provided by the job management system and can probably be taken over without modifications. Special care is needed if identifier strings use non-ASCII charsets; these are currently disallowed by the validator.
@@ -61,17 +61,17 @@ Most of this information is provided by the job management system and can probab
|`pfit_submit_time`|integer|INT-TS|y|Job submitted|
|`pfit_start_time`|integer|INT-TS|y|Job started execution|
|`pfit_end_time`|integer|INT-TS|y|Job finished/terminated|
|`pfit_requested_time`|integer|SEC, Range [1, 31,556,952]|y|Total requested job walltime - should roughly be equal to (end_time - start_time). The upper limit is set to one year.|
|`pfit_requested_cores`|integer|INT-POS|y|Total number of cores (sum over all nodes) requested|
|`pfit_num_used_nodes`|integer|INT-POS|y|Number of nodes the job ran on - it also has to be equal to the number of node-related data blocks in the set|
|`pfit_sampling_interval`|string|"[0-9]+ [h\|m\|s]"|y|How often metrics are generated. If there are multiple time intervals for different metrics, the shortest should be stated here.|
|`pfit_return_value`|integer|either 0 or 1|n|Should signal if the job has finished correctly. It is not necessarily the result delivered by the job management system, since programs can also have negative exit codes.|
`pfit_sampling_interval` is a value that we set in the configuration, so it should be identical for all nodes, but it can also be aggregated if necessary. TODO: define how exactly the interval is specified (e.g. whether "1m30s" should be allowed or only "90s") and adapt the RegEx. Maybe an integer value in seconds would be better.
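A minimal sketch of parsing the current single-unit format (`"[0-9]+ [h|m|s]"`) into seconds; whether compound values like "1m30s" will be allowed is still the open TODO above:

```python
import re

# Parses the current `pfit_sampling_interval` format, e.g. "30 s" or
# "2 m", into an integer number of seconds. Compound values such as
# "1m30s" are deliberately rejected, pending the TODO above.
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}

def interval_to_seconds(value: str) -> int:
    m = re.fullmatch(r"([0-9]+) ([hms])", value)
    if m is None:
        raise ValueError(f"invalid sampling interval: {value!r}")
    return int(m.group(1)) * _UNIT_SECONDS[m.group(2)]

print(interval_to_seconds("30 s"))  # 30
print(interval_to_seconds("2 m"))   # 120
```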
## Metrics per node
### Static data - Characteristics of the node and values valid for the entire job on this node
|Report metric|Data type|Constraint|Required|Information|
| --- | --- | --- | --- | --- |
@@ -81,15 +81,15 @@ Most of this information is provided by the job management system and can probab
|`pfit_mem_latency`|float|Nanoseconds|n|RAM latency, default value|
|`pfit_mem_bw`|float|MB per second|n|RAM bandwidth, default value|
|`pfit_sockets_per_node`|integer|Range [1, 16]|y|Number of CPU sockets for this node|
|`pfit_cores_per_socket`|integer|Range [1, 1024]|y|Number of actual CPU cores for every socket (assumes node-wide identical number for every socket)|
|`pfit_phys_threads_per_core`|integer|Range [1, 1024]|y|How many physical/HW threads can be executed on a CPU core|
|`pfit_virt_threads_per_core`|integer|Range [0, 1024]|y|Number of _additional_, virtual threads per core. If this value is non-zero, Hyperthreading or a similar technology is active.|
|`pfit_cache_l1i_size`|integer|BYTE, INT-POS|y|L1 Instruction Cache size per core|
|`pfit_cache_l1d_size`|integer|BYTE, INT-POS|y|L1 Data Cache size per core|
|`pfit_cache_l2_size`|integer|BYTE, INT-POS|y|L2 Cache size per core|
|`pfit_cache_l3_size`|integer|BYTE, INT-POS0|y|L3 Cache size per core, zero if not available|
|`pfit_assigned_cpus`|integer|INT-POS|y|CPU is a physical or logical thread in Intel's parlance. We need the number of CPUs, assigned to this job on this node. Has to be less or equal to the total number of CPUs available on the node.|
|`pfit_used_mem`|integer|BYTE, Plausibility range [10MB, 10TB]|y|Maximum memory size allocated by job per node. It includes all memory that the process can access, including memory that is swapped out, memory that is allocated, but not used, and memory that is from shared libraries.|
|`pfit_requested_mem`|integer|BYTE, INT-POS0|y|Amount of RAM on this node, requested by the job|
- `pfit_node_name` - can be reported by the job management system or extracted with `uname -n`. Its uniqueness in the context of the job has to be guaranteed. Used to distinguish the node-related data sets.
@@ -97,12 +97,12 @@ Most of this information is provided by the job management system and can probab
- `pfit_available_main_mem` - the total amount of physically available memory, e.g. from `free` or `/proc/meminfo` (but converted to bytes). If there are limits imposed on the processes' memory usage, this has to be the limit instead (e.g. from `ulimit -a`). Collected to analyse the memory usage.
- `pfit_mem_latency` and `pfit_mem_bw` - The values here are the best possible / upper limits, ideally measured on an empty machine (e.g. using `lmbench3` or other microbenchmarks) or even theoretical values derived from installed hardware. They can be used to estimate the efficiency of the memory usage.
- `pfit_sockets_per_node`, `pfit_cores_per_socket`, `pfit_phys_threads_per_core`, and `pfit_virt_threads_per_core` describe the CPU configuration - they can basically be generated with `lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('`, but the threads per core need to be split into physical and virtual. E.g. if on an Intel CPU the "threads per core" value is 2, then 1 physical and 1 virtual thread is configured (which also means Hyperthreading is active). If the value is just 1, then there is 1 physical and 0 virtual threads and HT is disabled.
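The split described above can be sketched as a small parser over `lscpu` output. The physical/virtual split follows the Intel rule stated in the text (at most one physical thread per core); CPUs with higher SMT levels may need a different rule:

```python
# Sketch: deriving the four CPU topology metrics from `lscpu` output.
# Assumes one physical thread per core, as in the Intel example above.

def parse_lscpu(text: str) -> dict:
    values = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        values[key.strip()] = value.strip()
    threads_per_core = int(values["Thread(s) per core"])
    return {
        "pfit_sockets_per_node": int(values["Socket(s)"]),
        "pfit_cores_per_socket": int(values["Core(s) per socket"]),
        "pfit_phys_threads_per_core": 1,
        "pfit_virt_threads_per_core": threads_per_core - 1,
    }

# Usage: parse_lscpu(subprocess.check_output(["lscpu"], text=True))
sample = "Thread(s) per core:  2\nCore(s) per socket:  8\nSocket(s):           2\n"
print(parse_lscpu(sample))
```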
- `pfit_cache_l1i_size`, `pfit_cache_l1d_size`, `pfit_cache_l2_size`, and `pfit_cache_l3_size` values are available through multiple Linux tools, e.g. `getconf -a | grep -i cache`. Additional cache-related parameters (like cache line size and associativity) are not used at the moment. These values can be used later to identify sub-optimal memory access patterns. Additionally, `cache misses` and `tlb misses` will probably be necessary. Systems with L4 and L5 cache are considered too exotic at the moment.
- `pfit_assigned_cpus` - we need the maximum number of CPUs assigned to the job. It's used to calculate job-specific CPU usage efficiency. TODO: do we need number of processes per node? is it the same value for the job management? which value does the job management report?
- `pfit_used_mem` - what we need is the maximum amount of memory requested by a process from the OS (including libraries, swap, and so on), so maybe this has to be moved to the "aggregates" section. In combination with the requested memory and RSS values, it can help to identify a memory-bound job and also possible memory-related problems.
- `pfit_requested_mem` - this is the amount of memory (per node) requested by the user from the job management system. TODO: is this value delivered by all job management systems?
### Dynamic data - time samples per node
|Report metric|Data type|Constraint|Required|Information|
| --- | --- | --- | --- | --- |
@@ -117,7 +117,7 @@ Most of this information is provided by the job management system and can probab
|`pfit_fs_read_count`|integer|INT-POS0|n|Number of disk accesses for reading at this node|
|`pfit_fs_write_count`|integer|INT-POS0|n|Number of disk accesses for writing at this node|
|`pfit_num_threads`|integer|INT-POS|n|TODO: Not sure what this means - threads total on the node? User threads? Relevant process' threads?|
|`pfit_num_processes`|integer|INT-POS0|y|Number of processes on the node. A value of zero indicates a possible problem.|
|`pfit_total_context_switches`|integer|INT-POS0|n|Total amount of voluntary + involuntary switches combined. TODO: this is a highly OS-dependent value and there is no baseline or threshold to derive anything from it - why is it interesting?|
|`pfit_load1`|float|FLOAT-POS0|n|Weighted average number of processes waiting for execution on all the cores in the last 1 minute, can probably be derived from num_processes. On some systems, processes waiting for IO are counted, on others not. Additionally, some systems count threads as processes (and others do not). Aggregating this value is not easy.|
|`pfit_frequency_per_cpuX`|integer|Value in Hertz, INT-POS0|n|current CPU frequency for every CPU available, X has to be replaced with the CPU number|
@@ -129,7 +129,7 @@ Most of this information is provided by the job management system and can probab
- `pfit_fs_read_bytes`, `pfit_fs_write_bytes`, `pfit_fs_read_count`, and `pfit_fs_write_count` are used to track disk accesses. Correlation between disk access and cpu load can help find out if the disk access is synchronous, which in turn can be an indicator of inefficient behavior.
- `pfit_num_processes` and `pfit_load1` basically show the same information in a different way (so maybe we should only keep one of them). The load1 value adds its own magic, being a weighted moving average, thus making it much more complex to interpret. The process count seems to me to be the more interesting number, especially if broken down into "running", "waiting", "sleeping" etc. (can be achieved with `ps aux` or the like - the processes with the status "D", meaning "uninterruptible sleep", which is most often equivalent to "waiting for IO", can be used in combination with other IO-related metrics to get a better picture). These values can help in understanding the job behavior and finding possible bottlenecks.
- `pfit_num_threads` - if the total number is needed, it can be derived from the `ps -A` vs. `ps -AL` output. But need to discuss if this value is meaningful at all.
- `pfit_frequency_per_cpuX` - since this value changes depending on the load, the thermal situation, and other parameters, it can be an indicator of some general problems of the node (e.g. combined with the maximum possible frequency or with the current load value). Can be obtained with `cat /proc/cpuinfo | grep -i "cpu mhz"` or by using `i7z` (both results need to be converted to Hz).
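The `/proc/cpuinfo` route and the MHz-to-Hz conversion can be sketched as follows; the returned metric names follow the table above:

```python
# Sketch: collecting `pfit_frequency_per_cpuX` values from
# /proc/cpuinfo, converting the reported MHz floats to integer Hz.

def parse_cpu_frequencies(cpuinfo_text: str) -> dict:
    freqs = {}
    cpu = 0
    for line in cpuinfo_text.splitlines():
        if line.lower().startswith("cpu mhz"):
            mhz = float(line.split(":")[1])
            freqs[f"pfit_frequency_per_cpu{cpu}"] = int(mhz * 1_000_000)
            cpu += 1
    return freqs

# Usage: parse_cpu_frequencies(open("/proc/cpuinfo").read())
sample = "cpu MHz\t\t: 2400.000\ncpu MHz\t\t: 1800.500\n"
print(parse_cpu_frequencies(sample))
```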
### Aggregates per node
@@ -139,11 +139,11 @@ IMPORTANT: Average values are often floats. Since the floating point value of e.
|Report metric|Data type|Constraint|Aggregation|Required|Information|
| --- | --- | --- | --- | --- | --- |
|`pfit_num_processes_node`|integer|INT-POS|min, max, avg, median|y|Number of processes on this node|
|`pfit_num_threads_node`|integer|INT-POS|min, max, avg, median|n|TODO: (the basic value needs to be defined properly first)|
|`pfit_mem_rss_node`|integer|BYTE, Plausibility range [10MB, 10TB]|max, avg|y|RSS memory statistics for this node|
|`pfit_used_swap_node`|integer|BYTE, INT-POS0|max|y|Max used swap on this node|
|`pfit_frequency_per_cpuX_node`|integer|Value in Hertz, INT-POS0|min, max, avg, median|n|Frequency aggregated per CPU (replace X with the CPU number)|
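The min/max/avg/median aggregations listed in the table can be sketched with the standard library. As the IMPORTANT remark above notes, avg and median of integer samples are often floats, so a rounding policy still needs to be defined:

```python
import statistics

# Sketch: computing the per-node aggregates (min, max, avg, median)
# from a list of dynamic samples. Note that avg and median may be
# floats even for integer samples.

def aggregate(samples: list) -> dict:
    return {
        "min": min(samples),
        "max": max(samples),
        "avg": statistics.mean(samples),
        "median": statistics.median(samples),
    }

print(aggregate([2, 3, 5, 8]))  # {'min': 2, 'max': 8, 'avg': 4.5, 'median': 4.0}
```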
### Aggregates per job
......