spec-metrics2.md 18.8 KB
Igor Merkulow committed
# Extended metrics set

IMPORTANT: we use a different naming scheme here to clearly distinguish between the metrics collected and the metrics used in the reports. The metrics in the report are defined independently of all collectors to avoid ambiguities and errors. We will provide an explanation or an example of how these values are calculated, so that they have an identical meaning in all reports. For example, to get an impression of how much data is read from disk, we have a parameter called `pfit_fs_read_bytes`. This value will probably never exist as a collected metric, since most file systems report their data separately. But for a basic understanding of the program's IO behavior, an aggregated value over all file systems should be sufficient.

The data in the extended set will have an additional marker denoting whether a value is required. Required values should be easy to collect and sufficient to get a high-level picture of the job. Optional values represent specific metrics that may not be available everywhere. If these values are present, they will be included in the report; otherwise the section is skipped.

Additional values not covered by this specification are allowed, but are currently not included in the report. This decision should allow developers to parse all the metrics from a source without the need to eliminate the "unnecessary" results. Additionally, it should simplify future extensions of the specification.

## Data subsets

In the combined interface for all the reports, we have to deal with multiple kinds of data, each used in a different context. At the moment we can identify five of them:

|JSON name|Explanation|
| --- | --- |
|general|global job data - is valid for the job and doesn't change during the entire job runtime (e.g. job-ID, user name etc.).|
|static|static node data - node-related data that doesn't change for the duration of the job (e.g. node name, CPU model, RAM amount etc.).|
|dynamic|dynamic node data - data samples (e.g. disk reads, memory usage etc., but converted to our format specification).|
|aggregates|aggregates per node - data, aggregated over the job runtime for every node separately (e.g. maximum CPU load, total number of packets sent over network etc.)|
|totals|aggregates per job - data, aggregated over all nodes (e.g. if swap was used)|

Global job-related and static node-related data is displayed in the reports more or less as is, either as a part of the header or as a baseline for calculations. Dynamic data samples are used for the time series plots in the PDF report, for aggregates, and may be used for calculating recommendations, but will most probably not be displayed in raw form (due to the amount of data). Aggregated and total values are used to estimate job efficiency and can also be shown in the reports.
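
As a rough illustration of how the five subsets could fit together in one document, the following sketch builds a minimal report as a Python dictionary and serializes it to JSON. Only the five subset names come from the table above; the `nodes` list and the way `static`, `dynamic`, and `aggregates` are nested inside each node block are assumptions made for this example, and all values are made up.

```python
import json

# Hypothetical layout: the "nodes" key and the per-node nesting are
# assumptions; only the five subset names are taken from the table.
report = {
    "general": {"pfit_job_id": "job_12345", "pfit_user_name": "alice"},
    "nodes": [
        {
            "static": {"pfit_node_name": "node001"},
            "dynamic": [
                # one sample; the timestamp is given in nanoseconds
                {"pfit_timestamp": 1600000000000000000,
                 "pfit_mem_rss": 104857600},
            ],
            "aggregates": {"pfit_mem_rss_node_max": 104857600},
        }
    ],
    "totals": {"pfit_used_swap_job": 0},
}

serialized = json.dumps(report, indent=2)
print(serialized)
```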

TODO: network data is still missing. We need to define the supported network types, what kind of data is needed, how it should be aggregated, and what kind of information can be derived from it.

TODO: GPU data is not included yet.

TODO: how to deal with partial data? E.g. if not all file systems used are supported by us.

TODO: do we already have metrics, supported by all tools/plugins, but not included in the reports?

## Value Constraints and Units

Table of constraints:

|Abbreviation|Constraint|Explanation|
| --- | --- | --- |
|STR|regex='[a-zA-Z0-9_]{3,45}'|string is limited to upper- and lower-case ASCII characters, numbers, and underscores. It can be between 3 and 45 characters long.|
|STR-EXT|regex=r'[a-zA-Z0-9_ \(\)@\.,\-]{3,45}'|string can also contain whitespace and other characters (e.g. punctuation, '@' or parentheses). Length still between 3 and 45 characters.|
|INT-POS|min=1|value is strictly greater than zero.|
|INT-POS0|min=0|value is zero or greater.|
|INT-TS|min=1000000000|value is a UNIX timestamp, representing the number of seconds since 01.01.1970. The minimum value is 1,000,000,000 (which was on 09.09.2001, so we should get larger values). If this constraint is not met, there are probably other issues on that machine.|
|INT-01|min=0, max=1|value is either 0 or 1|
|FLOAT-POS0|min=0.0|floating point value, greater than or equal to 0.0|
|FLOAT-PERCENT|min=0.0, max=1.0|floating point value, between 0 and 1 (both incl.)|
|MEM-PR|min=10485760, max=10995116277760|Plausibility check for memory amounts - default is [10 MiB, 10 TiB]|
|SAMPL|regex='[0-9]{1,5}[hms]'|sampling interval in the form "1s" or "24h" etc.|
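
The constraints above can be checked with a few lines of Python. This is a minimal sketch, not the project's validator: the helper names `check_str` and `check_range` are hypothetical. The quantifier is written without a space (`{3,45}`) because Python's `re` module would otherwise treat it as literal text.

```python
import re

# Patterns follow the constraint table; the quantifier has no space.
STR = re.compile(r"[a-zA-Z0-9_]{3,45}")
STR_EXT = re.compile(r"[a-zA-Z0-9_ \(\)@\.,\-]{3,45}")
SAMPL = re.compile(r"[0-9]{1,5}[hms]")

# Bounds for MEM-PR and INT-TS, copied from the table.
MEM_PR = (10485760, 10995116277760)
INT_TS_MIN = 1000000000


def check_str(value, pattern):
    """Return True if value is a string that matches pattern completely."""
    return isinstance(value, str) and pattern.fullmatch(value) is not None


def check_range(value, minimum=None, maximum=None):
    """Return True if value is a number within the given inclusive bounds."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True


print(check_str("job_12345", STR))           # valid STR
print(check_str("1m30s", SAMPL))             # not a valid SAMPL value
print(check_range(16 * 1024**3, *MEM_PR))    # 16 GiB passes MEM-PR
```

Using `fullmatch` instead of `match` ensures that trailing characters outside the allowed set are rejected as well.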

Used unit abbreviations:

|Abbreviation|Unit|Explanation|
| --- | --- | --- |
|BYTE|byte(s)|value is in bytes, thus only positive integers or zero are allowed|
|SEC|s|value is in seconds, thus only positive integers or zero are allowed|
|NSEC|ns|value is in nanoseconds|
|HZ|Hz|value is in Hertz|
|MBS|MB/s|Megabytes per second|

A question mark means that the data is not yet available; a minus sign means "not applicable".

## Metrics representing global job information

Most of this information is provided by the job management system and can probably be taken over without modifications. Special care is needed if identifier strings contain non-ASCII characters. Currently they are disallowed by the validator.

|Report metric|Data type|Constraint|Unit|Required|Label / Caption|
| --- | --- | --- | --- | --- | --- |
|`pfit_job_id`|string|STR|-|y|Job identifier|
|`pfit_user_name`|string|STR|-|y|User name/handle|
|`pfit_project_account`|string|STR|-|n|Project identifier|
|`pfit_used_queue`|string|STR|-|y|Used job queue name|
|`pfit_submit_time`|integer|INT-TS|SEC|y|Job submitted|
|`pfit_start_time`|integer|INT-TS|SEC|y|Job started execution|
|`pfit_end_time`|integer|INT-TS|SEC|y|Job finished/terminated|
|`pfit_requested_time`|integer|Range [1, 31556952]|SEC|y|Requested job walltime|
|`pfit_requested_cores`|integer|INT-POS|-|y|Number of requested cores|
|`pfit_num_used_nodes`|integer|INT-POS|-|y|Number of nodes the job ran on|
|`pfit_sampling_interval`|string|SAMPL|-|y|Metrics sampling interval|
|`pfit_return_value`|integer|INT-01|-|n|Job exit status|

Additional explanations:

- `pfit_requested_time` should ideally be roughly equal to (end_time - start_time). An actual runtime significantly lower than the requested time can indicate a problem or over-requesting of resources. Both situations should be investigated and avoided. The upper limit is currently set to one year.
- `pfit_requested_cores` is aggregated over all nodes (total sum).
- `pfit_num_used_nodes` has to be equal to the number of node-related data blocks in the set.
- `pfit_sampling_interval` is a value that we set in the configuration, so it should be identical for all nodes, but it can also be aggregated if necessary (e.g. the shortest interval should be stated here). TODO: define how exactly the interval is specified (e.g. if "1m30s" should be allowed or only "90s") and adapt the RegEx. Maybe an integer value in seconds would be better. Currently, only string lengths between 2 and 6 are allowed by the validator.
- `pfit_return_value` should indicate whether the job has finished correctly. It is not necessarily the result delivered by the job management system, since programs can also have negative exit codes.
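
The consistency rules in the list above can be sketched as follows. The function names are hypothetical, the interval parser only handles the current single-unit form (e.g. "90s", not "1m30s"), and the way a consumer would interpret the usage ratio is left open, since the specification does not define a threshold.

```python
import re

# Seconds per unit for the SAMPL suffixes h, m, and s.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def sampling_interval_seconds(interval):
    """Convert a SAMPL string such as "90s" or "24h" to seconds."""
    match = re.fullmatch(r"([0-9]{1,5})([hms])", interval)
    if match is None:
        raise ValueError(f"invalid sampling interval: {interval!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def walltime_usage(general):
    """Return the fraction of the requested walltime that was actually used."""
    runtime = general["pfit_end_time"] - general["pfit_start_time"]
    return runtime / general["pfit_requested_time"]


def node_count_matches(general, node_blocks):
    """Check that pfit_num_used_nodes equals the number of node blocks."""
    return general["pfit_num_used_nodes"] == len(node_blocks)


general = {
    "pfit_start_time": 1600000000,
    "pfit_end_time": 1600001800,
    "pfit_requested_time": 3600,
    "pfit_num_used_nodes": 2,
}
print(walltime_usage(general))
print(sampling_interval_seconds("90s"))
```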

## Metrics per node

### Static data - Characteristics of the node and values valid for the entire job on this node

|Report metric|Data type|Constraint|Unit|Required|Label / Caption|
| --- | --- | --- | --- | --- | --- |
|`pfit_node_name`|string|STR|-|y|Identifier of a node, has to be unique for the job|
|`pfit_cpu_model`|string|STR-EXT|-|y|CPU vendor and model name|
|`pfit_available_main_mem`|integer|MEM-PR|BYTE|y|Node's total RAM amount|
|`pfit_mem_latency`|float|FLOAT-POS0|NSEC|n|RAM latency, default value|
|`pfit_mem_bw`|float|FLOAT-POS0|MBS|n|RAM bandwidth, default value|
|`pfit_sockets_per_node`|integer|Range [1, 16]|-|y|Number of CPU sockets for this node|
|`pfit_cores_per_socket`|integer|Range [1, 1024]|-|y|Number of actual CPU cores for every socket|
|`pfit_phys_threads_per_core`|integer|Range [1, 1024]|-|y|How many HW threads can be executed on a CPU core|
|`pfit_virt_threads_per_core`|integer|Range [0, 1024]|-|y|Number of _additional_, virtual threads per core|
|`pfit_cache_l1i_size`|integer|INT-POS|BYTE|y|L1 Instruction Cache size per core|
|`pfit_cache_l1d_size`|integer|INT-POS|BYTE|y|L1 Data Cache size per core|
|`pfit_cache_l2_size`|integer|INT-POS|BYTE|y|L2 Cache size per core|
|`pfit_cache_l3_size`|integer|INT-POS0|BYTE|y|L3 Cache size per core, zero if not available|
|`pfit_assigned_cpus`|integer|INT-POS|-|y|Number of CPUs assigned to this job on this node|
|`pfit_max_alloc_mem`|integer|MEM-PR|BYTE|y|Maximum memory size allocated by job per node|
|`pfit_requested_mem`|integer|INT-POS0|BYTE|y|Amount of RAM on this node, requested by the job|

Additional explanations:

- `pfit_node_name` - can be reported by the job management or extracted with `uname -n`. Its uniqueness in the context of the job has to be guaranteed. Used to distinguish the node-related data sets.
- `pfit_cpu_model` - CPU vendor string, collected for statistical purposes only and currently not processed further.
- `pfit_available_main_mem` - a total amount of physically available memory, e.g. from `free` or `/proc/meminfo` (but converted to bytes). But if there are limits imposed on the processes' memory usage, this has to be the limit (e.g. from `ulimit -a`). Collected to analyse the memory usage.
- `pfit_mem_latency` and `pfit_mem_bw` - The values here are the best possible / upper limits, ideally measured on an empty machine (e.g. using `lmbench3` or other microbenchmarks) or even theoretical values derived from installed hardware. They can be used to estimate the efficiency of the memory usage.
- `pfit_sockets_per_node`, `pfit_cores_per_socket`, `pfit_phys_threads_per_core`, and `pfit_virt_threads_per_core` describe the CPU configuration. They can basically be generated with `lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('`, but the threads per core need to be split into physical and virtual. E.g. if the "threads per core" value on an Intel CPU is 2, then 1 physical and 1 virtual thread are configured (which also means that Hyperthreading is active). If the value is just 1, then there is 1 physical and 0 virtual threads, and HT is disabled. Cores per socket assumes that the number is identical for every socket on the node.
- `pfit_cache_l1i_size`, `pfit_cache_l1d_size`, `pfit_cache_l2_size`, and `pfit_cache_l3_size` values are available through multiple Linux tools, e.g. `getconf -a | grep -i cache`. Additional cache-related parameters (like cache line size and associativity) are not used at the moment. These values can be used later to identify sub-optimal memory access patterns. Additionally, `cache misses` and `tlb misses` will probably be necessary. Systems with L4 and L5 cache are considered too exotic at the moment.
- `pfit_assigned_cpus` - a CPU is a physical or logical thread in Intel's parlance. We need the maximum number of CPUs assigned to the job. It has to be less than or equal to the total number of CPUs available on the node. It's used to calculate job-specific CPU usage efficiency. TODO: do we need the number of processes per node? Is it the same value for the job management? Which value does the job management report?
- `pfit_max_alloc_mem` - what we need is the maximum amount of memory requested by a process from the OS (including libraries, swap, and so on), so maybe this has to be moved to the "aggregates" section. In combination with the requested memory and RSS values, it can help to identify a memory-bound job and also possible memory-related problems. It includes all memory that the process can access, including memory that is swapped out, memory that is allocated but not used, and memory that comes from shared libraries.
- `pfit_requested_mem` - this is the amount of memory (per node) requested by the user from the job management system. TODO: is this value delivered by all job management systems?
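
The split of "threads per core" into physical and virtual threads described above can be sketched like this. The sample `lscpu` text is made up, the function names are hypothetical, and the rule "exactly one physical thread per core, the rest are virtual" is an assumption that follows the Intel example in the text and may not hold for other architectures.

```python
# Made-up excerpt in the "key: value" form that lscpu prints.
SAMPLE_LSCPU = """\
CPU(s):                16
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             2
"""


def parse_lscpu(text):
    """Parse "key: value" lines into a dictionary of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def cpu_layout(text):
    """Build the four CPU configuration metrics from lscpu output."""
    fields = parse_lscpu(text)
    threads = int(fields["Thread(s) per core"])
    if threads < 1:
        raise ValueError("threads per core must be at least 1")
    return {
        "pfit_sockets_per_node": int(fields["Socket(s)"]),
        "pfit_cores_per_socket": int(fields["Core(s) per socket"]),
        # Assumption: one physical thread per core, the rest are virtual.
        "pfit_phys_threads_per_core": 1,
        "pfit_virt_threads_per_core": threads - 1,
    }


def total_cpus(layout):
    """Total number of CPUs (hardware threads) on the node."""
    threads = (layout["pfit_phys_threads_per_core"]
               + layout["pfit_virt_threads_per_core"])
    return (layout["pfit_sockets_per_node"]
            * layout["pfit_cores_per_socket"] * threads)


layout = cpu_layout(SAMPLE_LSCPU)
print(layout, total_cpus(layout))
```

The result of `total_cpus` gives the upper bound that `pfit_assigned_cpus` must not exceed.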

### Dynamic data - time samples per node

|Report metric|Data type|Constraint|Unit|Required|Label / Caption|
| --- | --- | --- | --- | --- | --- |
|`pfit_timestamp`|integer|INT-POS0|NSEC|y|Time stamp identifying the metrics|
|`pfit_cpu_time_user`|integer|INT-POS0|SEC|y|Part of the total walltime spent in the user code|
|`pfit_cpu_time_system`|integer|INT-POS0|SEC|y|Part of the total walltime in system calls or kernel processes|
|`pfit_cpu_time_idle`|integer|INT-POS0|SEC|y|Part of the total walltime in the idle task|
|`pfit_cpu_time_iowait`|integer|INT-POS0|SEC|n|Part of the total walltime waiting for IO|
|`pfit_mem_rss`|integer|MEM-PR|BYTE|y|RSS memory used by job per node|
|`pfit_used_swap_size`|integer|INT-POS0|BYTE|y|Used swap size|
|`pfit_fs_read_bytes`|integer|INT-POS0|BYTE|n|Total amount of data read from all disk filesystems on this node|
|`pfit_fs_write_bytes`|integer|INT-POS0|BYTE|n|Total amount of data written to all disk filesystems on this node|
|`pfit_fs_read_count`|integer|INT-POS0|-|n|Number of disk accesses for reading at this node|
|`pfit_fs_write_count`|integer|INT-POS0|-|n|Number of disk accesses for writing at this node|
|`pfit_num_threads`|integer|INT-POS|-|n|Number of threads|
|`pfit_num_processes`|integer|INT-POS0|-|y|Number of processes on the node|
|`pfit_load1`|float|FLOAT-POS0|-|n|Weighted average number of processes waiting|
|`pfit_total_context_switches`|integer|INT-POS0|-|n|Total amount of context switches|
|`pfit_frequency_per_cpuX`|integer|INT-POS0|HZ|n|Current CPU frequency for CPU X|

Additional explanations:

- `pfit_cpu_time_user`, `pfit_cpu_time_system`, and `pfit_cpu_time_idle` are the values that can be obtained by timing the running program, in a format similar to that of the `time` utility. They are aggregated over all cores. In combination with the number of requested cores, they can help find stall times, thus possibly indicating efficiency problems.
- `pfit_cpu_time_iowait` is a more complex value. It is a part of the idle value, so it's usually only measured/displayed on Linux if there is significant idle time. The reason is that idle is a CPU state, whereas iowait is a process state. Thus, if the CPU is working (e.g. executing a different process), there is no CPU idle time, so the iowait of a stalled process is ignored.
- `pfit_mem_rss` is the part of the process data (incl. the process executable itself, libraries, memory allocations etc.) that is currently active in RAM. Data that is swapped out or unused is not included. This value can help track the real-time memory usage and, in combination with the `pfit_used_mem` value, may help identify memory-related problems.
- `pfit_used_swap_size` - size of used swap in bytes. Generally speaking, swap usage is sub-optimal: it usually leads to a significant performance decrease and can also indicate memory-related problems.
- `pfit_fs_read_bytes`, `pfit_fs_write_bytes`, `pfit_fs_read_count`, and `pfit_fs_write_count` are used to track disk accesses. Correlation between disk access and cpu load can help find out if the disk access is synchronous, which in turn can be an indicator of inefficient behavior.
- `pfit_num_processes` and `pfit_load1` basically show the same information in a different way (so maybe we should only keep one of them). The load1 value adds its own magic, being a weighted moving average, which makes it much more complex to interpret. The process count seems to me to be the more interesting number, especially if broken down into "running", "waiting", "sleeping" etc. This can be achieved with `ps aux` or the like: the processes with the status "D", meaning "uninterruptible sleep", which is most often equivalent to "waiting for IO", can be used in combination with other IO-related metrics to get a better picture. These values can help to understand the job behavior and find possible bottlenecks. A process count of zero indicates a possible problem.
- `pfit_num_threads` - TODO: Not sure what this means - threads total on the node? User threads? Relevant process' threads? If the total number is needed, it can be derived from the `ps -A` vs. `ps -AL` output. But we need to discuss if this value is meaningful at all.
- `pfit_total_context_switches` - voluntary and involuntary switches combined. TODO: this is a highly OS-dependent value and there is no baseline or threshold to derive anything from it - why is it interesting?
- `pfit_frequency_per_cpuX` - since this value changes depending on the load, the thermal situation, and other parameters, it can be an indicator of some general problems of the node (e.g. combined with the maximum possible frequency or with the current load value). It can be obtained with `cat /proc/cpuinfo | grep -i "cpu mhz"` or by using `i7z` (both results need to be converted to Hz). This value is currently not validated because the number of CPUs is not known in advance. Converting this value to a nested structure does not seem feasible yet.
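
The conversion of the `/proc/cpuinfo` frequencies from MHz to Hz, with X replaced by the CPU number, might look like the sketch below. The sample text is made up and heavily shortened, and the function name is hypothetical.

```python
# Shortened, made-up excerpt of /proc/cpuinfo with two logical CPUs.
SAMPLE_CPUINFO = (
    "processor\t: 0\n"
    "cpu MHz\t\t: 2400.000\n"
    "\n"
    "processor\t: 1\n"
    "cpu MHz\t\t: 1199.804\n"
)


def frequencies_from_cpuinfo(text):
    """Return one pfit_frequency_per_cpuX entry (in Hz) per logical CPU."""
    result = {}
    cpu = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip().lower()
        if key == "processor":
            cpu = int(value)
        elif key == "cpu mhz" and cpu is not None:
            # MHz -> Hz, rounded to the nearest integer
            hz = round(float(value) * 1_000_000)
            result[f"pfit_frequency_per_cpu{cpu}"] = hz
    return result


freqs = frequencies_from_cpuinfo(SAMPLE_CPUINFO)
print(freqs)
```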

### Aggregates per node

IMPORTANT: Average values are often floats. Since a fractional number of e.g. processes or bytes doesn't really make sense, these values should be rounded mathematically to the nearest integer.

|Report metric|Data type|Constraint|Unit|Required|Label / Caption|
| --- | --- | --- | --- | --- | --- |
|`pfit_num_processes_node_min`|integer|INT-POS|-|y|Min number of processes on this node|
|`pfit_num_processes_node_max`|integer|INT-POS|-|y|Max number of processes on this node|
|`pfit_num_processes_node_avg`|integer|INT-POS|-|y|Average number of processes on this node|
|`pfit_num_processes_node_median`|integer|INT-POS|-|y|Median number of processes on this node|
|`pfit_num_threads_node_min`|integer|INT-POS|-|n|Min number of threads|
|`pfit_num_threads_node_max`|integer|INT-POS|-|n|Max number of threads|
|`pfit_num_threads_node_avg`|integer|INT-POS|-|n|Average number of threads|
|`pfit_num_threads_node_median`|integer|INT-POS|-|n|Median number of threads|
|`pfit_mem_rss_node_max`|integer|MEM-PR|BYTE|y|Max RSS memory statistics for this node|
|`pfit_mem_rss_node_avg`|integer|MEM-PR|BYTE|y|Average RSS memory statistics for this node|
|`pfit_used_swap_node_max`|integer|INT-POS0|BYTE|y|Max used swap on this node|
|`pfit_frequency_per_cpuX_node_min`|integer|INT-POS0|HZ|n|Min frequency aggregated per CPU|
|`pfit_frequency_per_cpuX_node_max`|integer|INT-POS0|HZ|n|Max frequency aggregated per CPU|
|`pfit_frequency_per_cpuX_node_avg`|integer|INT-POS0|HZ|n|Average frequency aggregated per CPU|
|`pfit_frequency_per_cpuX_node_median`|integer|INT-POS0|HZ|n|Median frequency aggregated per CPU|

Additional explanations:

- `pfit_num_threads_node` TODO: the basic value needs to be defined properly before it can be used
- `pfit_frequency_per_cpuX_node` is currently not validated because the number of CPUs is not fixed and adding another nesting level just for one value is probably overkill. X in the name has to be replaced with the actual CPU number.
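
A minimal sketch of how the per-node aggregates could be derived from the dynamic samples is shown below. It reads "rounded mathematically" as rounding half up, which is an assumption; note that Python's built-in `round()` rounds halves to the nearest even number and would give a different result for values such as 2.5. The function names are hypothetical.

```python
import math
import statistics


def round_half_up(value):
    """Round a non-negative number to the nearest integer, halves go up."""
    return math.floor(value + 0.5)


def node_aggregates(samples, metric):
    """Compute the min/max/avg/median aggregates of one metric for a node."""
    values = [sample[metric] for sample in samples]
    return {
        f"{metric}_node_min": min(values),
        f"{metric}_node_max": max(values),
        f"{metric}_node_avg": round_half_up(statistics.mean(values)),
        f"{metric}_node_median": round_half_up(statistics.median(values)),
    }


# Two made-up samples of the process count on one node.
samples = [{"pfit_num_processes": 2}, {"pfit_num_processes": 3}]
aggregates = node_aggregates(samples, "pfit_num_processes")
print(aggregates)
```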

### Aggregates per job

|Report metric|Data type|Constraint|Unit|Required|Label / Caption|
| --- | --- | --- | --- | --- | --- |
|`pfit_used_swap_job`|integer|INT-01|-|y|Is swap used on any node during the job|

Additional explanations:

- `pfit_used_swap_job` - currently we only evaluate if swap was used during the job to give a hint to the user, thus a boolean is sufficient. Calculation: if sum_nodes(used_swap) == 0, then 0, else 1
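
The calculation stated above translates directly into code. In this sketch the input is assumed to be the list of `pfit_used_swap_node_max` values, one per node; the function name is hypothetical.

```python
def used_swap_job(node_swap_maxima):
    """Return 1 if any node used swap during the job, otherwise 0."""
    return 0 if sum(node_swap_maxima) == 0 else 1


print(used_swap_job([0, 0, 0]))
print(used_swap_job([0, 4096, 0]))
```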