Commit 83c69cb3 authored by Igor Merkulow

alpha of the new spec

parent 82b6d6c4
Please tell us if this convention is not sufficient for you.
|Metric|Data type|Constraint|Additional Information|
| --- | --- | --- | --- |
| `mem_rss_avg` | `integer` in bytes | [10MB, 10TB] plausibility range | average RSS memory used by job per node |
| `mem_swap_max` | `integer` in bytes | >= 0 | maximum swap memory used by job per node |
| `mem_vms` | `integer` in bytes | [10MB, 10TB] plausibility range | Virtual Memory size allocated by job per node. It includes all memory that the process can access, including memory that is swapped out, memory that is allocated, but not used, and memory that is from shared libraries. |
# Extended metrics set
IMPORTANT: here we use a different naming scheme to clearly distinguish between the metrics collected and the metrics used in the reports. The metrics in the reports are defined independently of all collectors to avoid ambiguities and errors. We will provide an explanation or an example of how these values are calculated so that they have identical meaning in all reports.
The data in the extended set will have an additional marker denoting whether a value is required. Required values should be easy to collect and sufficient to represent a high-level picture of the job. Optional values represent specific metrics that may not be available everywhere. If the values are available, they are included in the report; otherwise the section is skipped.
Additional values that are not covered by this specification are allowed, but they are currently not included in the report. This decision allows developers to parse all metrics from a source without having to eliminate the "unnecessary" results. Additionally, it should simplify future extensions of the specification.
## Data subsets
In the combined interface for all the reports, we have to deal with multiple kinds of data, each used in a different context. At the moment we can identify five of them:
- global job data - read once; stays unchanged during the entire job runtime (e.g. job ID, user name etc.)
- static node data - node-related data that doesn't change for the duration of the job (e.g. node name, CPU model, RAM amount etc.)
- dynamic node data - measured data samples (converted to our format specification)
- aggregates per node - data aggregated over the job runtime for every node separately (e.g. maximum CPU load, total number of packets sent over the network etc.)
- aggregates per job - data aggregated over all nodes (e.g. whether swap was used)
Global job-related and static node-related data is displayed in the reports more or less as is. Dynamic data samples are used for the time series plots in the PDF report and may be used for calculating recommendations, but will most probably not be displayed in raw form (due to the data volume). The different aggregated values are used to estimate job efficiency and can also be shown in the reports.
TODO: network data is still missing. We need to define the supported network types, what kind of data is needed, how it should be aggregated, and what kind of information can be derived from it.
TODO: GPU data is not included yet.
TODO: how to deal with partial data? E.g. if not all of the file systems in use are supported by us.
TODO: do we already have metrics that are supported by all tools/plugins but are not included in the reports?
## Value Constraints
- STR: string is limited to upper- and lower-case ASCII characters, numbers, and underscores. It can be between 3 and 45 characters long.
- STR-EXT: string can also contain whitespace and other characters (e.g. punctuation, '@' or parentheses). Length still between 3 and 45 characters.
- INT-POS: value is strictly greater than zero.
- INT-POS0: value is zero or greater.
- INT-TS: value is a UNIX timestamp, representing the number of seconds since 01.01.1970; the minimum value is 1,000,000,000 (which corresponds to 09.09.2001, so we should only see larger values)
- FLOAT-POS0: floating point value, greater than or equal to 0.0
- FLOAT-PERCENT: floating point value, between 0 and 1 (both inclusive)
- BYTE: value is in bytes
- SEC: value is in seconds
A question mark means that the data is not there yet; a minus sign means "not applicable".
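The constraints above can be checked mechanically. A minimal validator sketch in Python (the function names are illustrative and not part of the specification):

```python
import re

# Illustrative checks for the value constraints above; not part of the spec.
def check_str(value):
    # STR: ASCII letters, digits, and underscores; 3 to 45 characters.
    return bool(re.fullmatch(r"[A-Za-z0-9_]{3,45}", value))

def check_int_ts(value):
    # INT-TS: seconds since 01.01.1970, at least 1,000,000,000.
    return isinstance(value, int) and value >= 1_000_000_000

def check_float_percent(value):
    # FLOAT-PERCENT: 0.0 to 1.0, both inclusive.
    return isinstance(value, float) and 0.0 <= value <= 1.0
```

The remaining constraints (INT-POS, INT-POS0, FLOAT-POS0) reduce to simple comparisons in the same style.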
## Metrics representing global job data
Most of this information is provided by the job management system and can probably be taken over without modifications. Special care is needed if identifier strings use non-ASCII charsets; these are currently disallowed by the validator.
|Report metric|Data type|Constraint|Required|Information|
| --- | --- | --- | --- | --- |
|`pfit_job_id`|string|STR|y|Job identifier|
|`pfit_user_name`|string|STR|y|User name/handle|
|`pfit_project_account`|string|STR|n|Project identifier|
|`pfit_used_queue`|string|STR|y|Used job queue identifier|
|`pfit_submit_time`|integer|INT-TS|y|Job submitted|
|`pfit_start_time`|integer|INT-TS|y|Job started execution|
|`pfit_end_time`|integer|INT-TS|y|Job finished/terminated|
|`pfit_requested_time`|integer|SEC, Range [1, 31,556,952]|y|Total requested job walltime - should roughly be equal to (end_time - start_time)|
|`pfit_requested_cores`|integer|Range [1, 1,000,000,000]|y|Total number of cores (sum over all nodes) requested|
|`pfit_num_used_nodes`|integer|Range [1, 100,000,000]|y|Number of nodes the job ran on - it also has to be equal to the number of node-related data blocks in the set|
|`pfit_sampling_interval`|string|"[0-9]+ [hms]"|y|How often metrics are generated. If there are multiple time intervals for different metrics, the shortest should be stated here.|
|`pfit_return_value`|integer|either 0 or 1|n|Signals whether the job finished correctly. Not necessarily the value delivered by the job management system, since programs can also have negative exit codes.|
`pfit_sampling_interval` is a value that we set in the configuration, so it should be identical for all nodes, but it can also be aggregated if necessary.
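If the interval ever needs to be aggregated, it helps to normalize the "number unit" string form of `pfit_sampling_interval` into seconds first. A hypothetical helper (not part of the spec):

```python
import re

# Convert an interval like "30 s", "5 m", or "1 h" into seconds.
# The accepted form follows the pfit_sampling_interval constraint.
def interval_seconds(text):
    m = re.fullmatch(r"([0-9]+) ([hms])", text)
    if m is None:
        raise ValueError(f"bad interval: {text!r}")
    return int(m.group(1)) * {"s": 1, "m": 60, "h": 3600}[m.group(2)]
```

With the intervals in seconds, taking the shortest one across metrics is a plain `min()`.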
## Metrics per node
### Static data - doesn't change over the runtime of the job (so it needs to be measured only once)
|Report metric|Data type|Constraint|Required|Information|
| --- | --- | --- | --- | --- |
|`pfit_node_name`|string|STR|y|Identifier of a node, has to be unique for the job|
|`pfit_cpu_model`|string|STR-EXT|y|CPU vendor and model name|
|`pfit_available_main_mem`|integer|BYTE, Plausibility range [10MB, 10TB]|y|Node's total RAM amount|
|`pfit_mem_latency`|float|Nanoseconds|n|RAM latency, default value|
|`pfit_mem_bw`|float|MB per second|n|RAM bandwidth, default value|
|`pfit_sockets_per_node`|integer|Range [1, 16]|y|Number of CPU sockets for this node|
|`pfit_cores_per_socket`|integer|Range [1, 256]|y|Number of CPU cores for every socket (assumes node-wide identical number for every socket)|
|`pfit_phys_threads_per_core`|integer|Range [1, 256]|y|How many physical/HW threads can be executed on a CPU core|
|`pfit_virt_threads_per_core`|integer|Range [0, 256]|y|Number of _additional_, virtual threads per core. If this value is non-zero, Hyperthreading or a similar technology is active.|
|`pfit_cache_l1i_size`|integer|BYTE, INT-POS|y|L1 Instruction Cache size per core|
|`pfit_cache_l1d_size`|integer|BYTE, INT-POS|y|L1 Data Cache size per core|
|`pfit_cache_l2_size`|integer|BYTE, INT-POS|y|L2 Cache size per core|
|`pfit_cache_l3_size`|integer|BYTE, INT-POS0|y|L3 Cache size per core, zero if not available|
|`pfit_assigned_cpus`|integer|INT-POS|y|A CPU is a physical or logical thread in Intel's parlance. We need the number of CPUs assigned to this job on this node. Has to be less than or equal to the total number of CPUs available on the node.|
|`pfit_used_mem`|integer|BYTE, Plausibility range [10MB, 10TB]|y|Virtual Memory size allocated by job per node. It includes all memory that the process can access, including memory that is swapped out, memory that is allocated, but not used, and memory that is from shared libraries.|
|`pfit_requested_mem`|integer|BYTE, INT-POS0|y|Amount of RAM on this node, requested by the job|
- `pfit_node_name` - can be reported by the job management system or extracted with `uname -n`. Its uniqueness in the context of the job has to be guaranteed. Used to distinguish the node-related data sets.
- `pfit_cpu_model` - CPU vendor string; collected just for statistical purposes and currently not processed further.
- `pfit_available_main_mem` - the total amount of physically available memory, e.g. from `free` or `/proc/meminfo` (converted to bytes). If limits are imposed on the processes' memory usage, this has to be the limit instead (e.g. from `ulimit -a`). Collected to analyse the memory usage.
- `pfit_mem_latency` and `pfit_mem_bw` - The values here are the best possible / upper limits, ideally measured on an empty machine (e.g. using `lmbench3` or other microbenchmarks) or even theoretical values derived from installed hardware. They can be used to estimate the efficiency of the memory usage.
- `pfit_sockets_per_node`, `pfit_cores_per_socket`, `pfit_phys_threads_per_core`, and `pfit_virt_threads_per_core` describe the CPU configuration - they can basically be generated with `lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('`, but the threads per core need to be split into physical and virtual. E.g. if the "threads per core" value on an Intel CPU is 2, then 1 physical and 1 virtual thread are configured (which also means Hyperthreading is active). If the value is just 1, then there is 1 physical and 0 virtual threads, and HT is disabled.
- `pfit_cache_l1i_size`, `pfit_cache_l1d_size`, `pfit_cache_l2_size`, and `pfit_cache_l3_size` values are available through multiple Linux tools, e.g. `getconf -a | grep -i cache`. Additional cache-related parameters (like cache line size and associativity) are not used at the moment. These values can be used later to identify sub-optimal memory access patterns. Additionally, cache misses and TLB misses will probably be necessary.
- `pfit_assigned_cpus` - we need the maximum number of CPUs assigned to the job. It is used to calculate job-specific CPU usage efficiency. TODO: do we need the number of processes per node? Is it the same value for the job management? Which value does the job management report?
- `pfit_used_mem` - what we need is the maximum amount of memory requested by a process from the OS (including libraries, swap, and so on), so maybe this has to be moved to the "aggregates" section. In combination with the requested memory and RSS values, it can help to identify a memory-bound job and also possible memory-related problems.
- `pfit_requested_mem` - the amount of memory (per node) requested by the user from the job management system. TODO: is this value delivered by all job management systems?
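The threads-per-core split described above can be sketched as follows (an illustration only; it assumes one physical thread per core, which holds for common x86 SMT systems but not necessarily everywhere):

```python
# Split lscpu's "Thread(s) per core" value into the spec's
# physical/virtual pair. Assumes 1 physical thread per core.
def split_threads_per_core(threads_per_core):
    phys = 1
    virt = threads_per_core - phys  # 0 means SMT/Hyperthreading is disabled
    return phys, virt

# Upper bound for pfit_assigned_cpus on one node, derived from the
# static topology metrics.
def total_cpus(sockets, cores_per_socket, phys, virt):
    return sockets * cores_per_socket * (phys + virt)
```

For example, a 2-socket node with 64 cores per socket and Hyperthreading active yields 256 CPUs, so `pfit_assigned_cpus` must not exceed 256 there.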
### Dynamic data - different for every sample
|Report metric|Data type|Constraint|Required|Information|
| --- | --- | --- | --- | --- |
|`pfit_cpu_time_user`|integer|SEC, INT-POS0|y|The part of the total walltime (over all cores) spent in the user code|
|`pfit_cpu_time_system`|integer|SEC, INT-POS0|y| --''-- system calls or kernel processes|
|`pfit_cpu_time_idle`|integer|SEC, INT-POS0|y| --''-- in the idle task (not doing anything)|
|`pfit_cpu_time_iowait`|integer|SEC, INT-POS0|n| --''-- waiting for IO (see below)|
|`pfit_mem_rss`|integer|BYTE, Plausibility range [10MB, 10TB]|y|RSS memory used by job per node|
|`pfit_used_swap`|integer|BYTE, INT-POS0|y|Used swap size|
|`pfit_fs_read_bytes`|integer|BYTE, INT-POS0|n|Total amount of data read from all disk filesystems on this node|
|`pfit_fs_write_bytes`|integer|BYTE, INT-POS0|n| --''-- written to all disk filesystems on this node|
|`pfit_fs_read_count`|integer|INT-POS0|n|Number of disk accesses for reading at this node|
|`pfit_fs_write_count`|integer|INT-POS0|n| --''-- for writing at this node|
|`pfit_num_threads`|integer|INT-POS|n|TODO: Not sure what this means - threads total on the node? User threads? Relevant process' threads?|
|`pfit_num_processes`|integer|INT-POS|y|Number of processes on the node|
|`pfit_total_context_switches`|integer|INT-POS0|n|Total amount of voluntary + involuntary switches combined. TODO: this is a highly OS-dependent value and there is no baseline or threshold to derive anything from it - why is it interesting?|
|`pfit_load1`|float|FLOAT-POS0|n|Weighted average number of processes waiting for execution on all the cores in the last 1 minute; can probably be derived from num_processes. On some systems, processes waiting for IO are counted, on others not. Additionally, some systems count threads as processes (and others do not). Aggregating this value is not easy.|
|`pfit_frequency_per_cpuX`|integer|Value in Hertz, INT-POS0|n|current CPU frequency for every CPU available, X has to be replaced with the CPU number|
- `pfit_cpu_time_user`, `pfit_cpu_time_system`, and `pfit_cpu_time_idle` are the values that can be obtained by timing the running program, in the format similar to the `time` utility. In combination with number of requested cores, they can help find stall times, thus maybe indicating some efficiency problems.
- `pfit_cpu_time_iowait` is a more complex value - it is part of the idle value, so it is usually only measured/displayed on Linux if there is significant idle time. The reason is that idle is a CPU state, while iowait is a process state: if the CPU is working (e.g. executing a different process), there is no CPU idle time, so the iowait of a stalled process is ignored.
- `pfit_mem_rss` is the part of the process data (incl. the process executable itself, libraries, memory allocations etc.) that is currently active in RAM. Data that is swapped out or unused is not included. This value can help track the real-time memory usage and may help identify memory-related problems in combination with the `pfit_used_mem` value.
- `pfit_used_swap` - size of the used swap in bytes. Generally speaking, swap usage is sub-optimal, usually leads to a significant performance decrease, and can also indicate memory-related problems.
- `pfit_fs_read_bytes`, `pfit_fs_write_bytes`, `pfit_fs_read_count`, and `pfit_fs_write_count` are used to track disk accesses. Correlation between disk accesses and CPU load can help find out whether the disk access is synchronous, which in turn can be an indicator of inefficient behavior.
- `pfit_num_processes` and `pfit_load1` basically show the same information in different ways (so maybe we should keep only one of them). The load1 value adds its own magic, being a weighted moving average, which makes it much more complex to interpret. The process count seems to be the more interesting number, especially if broken down into "running", "waiting", "sleeping" etc. (this can be achieved with `ps aux` or the like - processes with status "D", meaning "uninterruptible sleep", which is most often equivalent to "waiting for IO", can be used in combination with other IO-related metrics to get a better picture). These values can help understand the job behavior and find possible bottlenecks.
- `pfit_num_threads` - if the total number is needed, it can be derived from the `ps -A` vs. `ps -AL` output. But we need to discuss whether this value is meaningful at all.
- `pfit_frequency_per_cpuX` - since this value changes depending on the load, the thermal situation, and other parameters, it can be an indicator of general problems on the node (e.g. combined with the maximal possible frequency or with the current load value). Can be obtained with `cat /proc/cpuinfo | grep -i "cpu mhz"` or by using `i7z` (both need to be converted to Hz).
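The per-state process breakdown suggested above could be collected with a small helper like this (a sketch; the input lines are assumed to come from something like `ps -eo state=`, one state letter per process):

```python
from collections import Counter

# Count processes per state letter: "R" running, "S" sleeping,
# "D" uninterruptible sleep (usually waiting for IO), etc.
# Only the first character is used, since ps may append flags like "+".
def count_states(ps_state_lines):
    return Counter(line.strip()[:1] for line in ps_state_lines if line.strip())
```

The "D" count can then be correlated with the `pfit_fs_*` metrics to see whether processes are actually stalled on IO.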
### Aggregates per node
The column "Aggregation" shows which aggregates are needed. In the output, the method has to be appended to the metric name, like `pfit_num_processes_node_avg` to ensure uniqueness.
IMPORTANT: average values are often floats. Since fractional process counts or byte counts don't really make sense, these values should be rounded mathematically to the nearest integer.
|Report metric|Data type|Constraint|Aggregation|Required|Information|
| --- | --- | --- | --- | --- | --- |
|`pfit_num_processes_node`|integer|INT-POS|min, max, avg, mean|y|Number of processes on this node|
|`pfit_num_threads_node`|integer|INT-POS|min, max, avg, mean|n|TODO: (the basic value needs to be defined properly first)|
|`pfit_mem_rss_node`|integer|BYTE, Plausibility range [10MB, 10TB]|max, avg|y|RSS memory statistics for this node|
|`pfit_used_swap_node`|integer|BYTE, INT-POS0|max|y|Max used swap on this node|
|`pfit_frequency_per_cpuX_node`|integer|Value in Hertz, INT-POS0|min, max, avg, mean|n|Frequency aggregated per CPU (replace X with the CPU number)|
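Putting the naming and rounding rules together, per-node aggregation over the samples of one metric could look like this (a sketch; the suffix convention follows the naming rule above, half-up rounding is used for the non-negative values in question):

```python
# Aggregate raw per-node samples into the suffixed per-node metrics.
# The aggregation method is appended to the metric name, e.g.
# pfit_num_processes_node_avg; averages are rounded to integers.
def aggregate_node(name, samples):
    return {
        f"{name}_node_min": min(samples),
        f"{name}_node_max": max(samples),
        # int(x + 0.5) rounds half up for the non-negative values used here.
        f"{name}_node_avg": int(sum(samples) / len(samples) + 0.5),
    }
```

For example, process-count samples `[3, 5, 4]` yield `pfit_num_processes_node_min` = 3, `_node_max` = 5, and `_node_avg` = 4.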
### Aggregates per job
|Report metric|Data type|Constraint|Aggregation|Required|Information|
| --- | --- | --- | --- | --- | --- |
|`pfit_used_swap_job`|integer|either 0 or 1|if sum_nodes(used_swap) == 0, then 0, else 1|y|Is swap used on any node during the job|
- `pfit_used_swap_job` - currently we only evaluate if swap was used during the job to give a hint to the user, thus a boolean is sufficient.
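The per-job swap flag defined above reduces to a one-liner, given each node's maximum used swap:

```python
# 1 if any node used swap at any point during the job, else 0
# (matches the aggregation rule: sum over nodes compared to zero).
def pfit_used_swap_job(used_swap_per_node):
    return 0 if sum(used_swap_per_node) == 0 else 1
```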