Commit 1953b9c2 authored by akhuziy

added the DB spec

parent cc52d827
# InfluxDB specification
This file contains specifications for values stored in InfluxDB.
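The Grafana, ASCII and PDF columns indicate whether a metric is used for the corresponding output.
In these columns, `MIN` marks the minimal set of measurements and `EXT` the extended set; an empty cell means the metric is not used for that output.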
## Batch system data
Collected by the post-execution script `exportjobinfo.py` as soon as a job finishes.
Measurement: `pfit-jobinfo`.
| Key | Value | Info | Grafana | ASCII | PDF |
| -------------- | -------------- | --------------------------------------------------------------------------- | ------- | ----- | --- |
| jobid | string\* | Job ID in the batch system | MIN | MIN | MIN |
| uname | string\* | user name in HPC | MIN | | |
| user_name | string | user name in HPC | MIN | MIN | MIN |
| used_queue | string | used queue in the batch system | MIN | MIN | MIN |
| num_nodes | integer | number of nodes the job runs on | MIN | MIN | MIN |
| requested_cu | integer | requested compute units (slots/cores/...) | MIN | MIN | MIN |
| requested_time | seconds | requested time | MIN | MIN | MIN |
| submit_time | UNIX timestamp | submitted time | MIN | MIN | MIN |
| start_time | UNIX timestamp | start time | MIN | MIN | MIN |
| end_time | UNIX timestamp | end time | MIN | MIN | MIN |
| run_time | seconds | duration of the job | MIN | | |
| alloc_cu | string | allocation of the job on nodes. Format: `CU1*NODE1:CU2*NODE2:...:CU#*NODE#` | | MIN | MIN |
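To show how the tag keys (marked \*) and the fields of `pfit-jobinfo` end up in InfluxDB, here is a minimal Python sketch that serializes one job as a line-protocol record. It is not taken from `exportjobinfo.py`; the function names and sample values are hypothetical, and the timestamp is assumed to be written with second precision.

```python
# Hypothetical sketch: serialize one job as an InfluxDB line-protocol record
# for the `pfit-jobinfo` measurement. Not the actual exportjobinfo.py code.

TAG_KEYS = ("jobid", "uname")  # keys marked with * in the table


def escape_key(text: str) -> str:
    # Commas, equals signs and spaces must be backslash-escaped in
    # tag keys, tag values and field keys.
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def format_field(value) -> str:
    # bool is checked first because it is a subclass of int in Python.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integer fields carry an `i` suffix
    if isinstance(value, float):
        return repr(value)
    # String fields are double-quoted; quotes and backslashes are escaped.
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"'


def jobinfo_line(job: dict, timestamp_s: int) -> str:
    tags = ",".join(f"{k}={escape_key(str(job[k]))}" for k in TAG_KEYS)
    fields = ",".join(
        f"{escape_key(k)}={format_field(v)}"
        for k, v in job.items()
        if k not in TAG_KEYS
    )
    # The timestamp is in seconds, so the write must use precision=s.
    return f"pfit-jobinfo,{tags} {fields} {timestamp_s}"


job = {  # sample values, hypothetical
    "jobid": "4711",
    "uname": "alice",
    "num_nodes": 2,
    "used_queue": "medium",
    "alloc_cu": "24*node1:24*node2",
}
print(jobinfo_line(job, 1500003700))
# pfit-jobinfo,jobid=4711,uname=alice num_nodes=2i,used_queue="medium",alloc_cu="24*node1:24*node2" 1500003700
```

With InfluxDB 1.x, such lines can be posted to the HTTP `/write` endpoint with `precision=s`; the database name is not part of this specification.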

Measurement: `pfit-jobinfo-alloc`.
There should be exactly one entry in this measurement for every node allocated to the job.
| Key | Value | Info | Grafana | ASCII | PDF |
| --------- | -------- | -------------------------- | ------- | ----- | --- |
| host | string\* | Hostname of the node | MIN | MIN | MIN |
| jobid | string\* | Job ID in the batch system | MIN | MIN | MIN |
| alloc_cu  | integer  | number of allocated CUs    | MIN     | MIN   | MIN |
| alloc_mem | integer  | amount of allocated memory | MIN     | MIN   | MIN |
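Since `alloc_cu` in `pfit-jobinfo` already lists the CUs per node, the `pfit-jobinfo-alloc` entries can be derived from it. The following is a minimal sketch under that assumption; the function name is hypothetical, and `alloc_mem` is omitted because it is not contained in `alloc_cu`.

```python
# Hypothetical sketch: expand the `alloc_cu` string of `pfit-jobinfo`
# (format CU1*NODE1:CU2*NODE2:...) into one record per node, shaped like the
# `pfit-jobinfo-alloc` entries. alloc_mem is not part of alloc_cu.
from typing import Dict, List


def parse_alloc_cu(jobid: str, alloc_cu: str) -> List[Dict[str, object]]:
    entries: List[Dict[str, object]] = []
    for part in alloc_cu.split(":"):
        if not part:
            continue  # tolerate an empty segment, e.g. a trailing ':'
        cu, host = part.split("*", 1)  # "<CUs>*<hostname>"
        entries.append({"host": host, "jobid": jobid, "alloc_cu": int(cu)})
    return entries


entries = parse_alloc_cu("4711", "24*node1:16*node2")
print(entries)
# [{'host': 'node1', 'jobid': '4711', 'alloc_cu': 24}, {'host': 'node2', 'jobid': '4711', 'alloc_cu': 16}]
```

Because there is one entry per allocated node, `len(entries)` can serve as a consistency check against `num_nodes`.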
## Node data
Collected by the Telegraf plugin `pfit-nodeinfo`.
Collected at a coarse interval (e.g. once per minute, hour or day).
Measurement: `pfit-nodeinfo`.
| Key | Value | Info | Grafana | ASCII | PDF |
| ---------------- | -------- | -------------------- | ------- | ----- | --- |
| host | string\* | Hostname of the node | | MIN | MIN |
| cores_per_socket | integer | cores per socket | | MIN | MIN |
| cpu_model | string | CPU model | | MIN | MIN |
| main_mem | bytes | total amount of RAM | | MIN | MIN |
| sockets | integer | number of sockets | | MIN | MIN |
| threads_per_core | integer | threads per core | | MIN | MIN |
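The node properties combine into per-node totals by simple multiplication. A minimal sketch with hypothetical function names and sample values; whether a compute unit (CU) in the batch data corresponds to a core or a hardware thread depends on the batch system configuration and is not fixed by this specification.

```python
# Hypothetical sketch: derive per-node totals from one `pfit-nodeinfo` record.

def physical_cores(node: dict) -> int:
    return node["sockets"] * node["cores_per_socket"]


def hardware_threads(node: dict) -> int:
    # sockets x cores per socket x threads per core
    return physical_cores(node) * node["threads_per_core"]


def main_mem_gib(node: dict) -> float:
    return node["main_mem"] / 2**30  # main_mem is stored in bytes


node = {  # sample values, hypothetical
    "host": "node1",
    "sockets": 2,
    "cores_per_socket": 12,
    "threads_per_core": 2,
    "main_mem": 137438953472,
}
print(physical_cores(node), hardware_threads(node), main_mem_gib(node))  # 24 48 128.0
```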
## Process data
Collected by the Telegraf plugin `pfit-uprocstat`.
Collected at a higher frequency (10-second interval by default) for every user process on the node; threads are not recorded separately.
Measurement: `pfit-uprocstat`.
| Key | Value | Info | Grafana | ASCII | PDF |
| ---------------------------- | -------- | ---------------------------- | ------- | ----- | --- |
| host | string\* | Hostname of the node | MIN | MIN | MIN |
| jobid1 | string\* | Job ID | MIN | MIN | MIN |
| jobid2 | string\* | Job ID | | | |
| pid | string\* | PID (Process ID) | MIN | MIN | MIN |
| process_name | string\* | process name | MIN | | |
| uid | string\* | UID of process owner | | | |
| user | string\* | Username of process owner | MIN | MIN | MIN |
| cpu_time_idle | float | CPU idle time | | | |
| cpu_time_iowait | float | CPU iowait time | | | |
| cpu_time_system | float | CPU system time | MIN | MIN | MIN |
| cpu_time_user | float | CPU user time | MIN | MIN | MIN |
| cpu_time_guest | float | CPU guest time | | | |
| cpu_time_guest_nice | float | CPU guest nice time | | | |
| cpu_time_irq | float | CPU irq time | | | |
| cpu_time_nice | float | CPU nice time | | | |
| cpu_time_soft_irq | float | CPU soft irq time | | | |
| cpu_time_steal | float | CPU steal time | | | |
| cpu_time_stolen | float | CPU stolen time | | | |
| cpu_usage | float | CPU usage | MIN | MIN | MIN |
| involuntary_context_switches | integer | involuntary context switches | | | MIN |
| voluntary_context_switches | integer | voluntary context switches | | | MIN |
| memory_rss                   | bytes    | resident set size (RSS)      | MIN     | MIN   | MIN |
| memory_swap                  | bytes    | swapped-out memory           | MIN     | MIN   | MIN |
| memory_vms                   | bytes    | virtual memory size (VMS)    | MIN     | MIN   | MIN |
| num_fds | integer | number of file descriptors | | | MIN |
| num_threads | integer | number of threads | | | |
| read_bytes | bytes | bytes read | MIN | MIN | MIN |
| read_count | integer | read count | MIN | MIN | MIN |
| write_bytes | bytes | bytes written | MIN | MIN | MIN |
| write_count | integer | write count | MIN | MIN | MIN |
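Because `pfit-uprocstat` stores one point per process, job-level values have to be aggregated. A minimal sketch that sums `memory_rss` per job and timestamp over all processes and hosts; it assumes the samples are already aligned to common timestamps (e.g. by a `GROUP BY time(...)` query) and uses `jobid1` as the job key, since the role of `jobid2` is not spelled out here. Function name and data are hypothetical.

```python
# Hypothetical sketch: total resident memory of each job over time, summed
# over all of its processes on all hosts. Samples are assumed to be aligned
# to common timestamps; jobid1 is used as the job key.
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def job_rss(samples: Iterable[dict]) -> Dict[Tuple[str, int], int]:
    totals: Dict[Tuple[str, int], int] = defaultdict(int)
    for s in samples:
        totals[(s["jobid1"], s["time"])] += s["memory_rss"]
    return dict(totals)


samples = [  # hypothetical pfit-uprocstat points
    {"time": 0, "host": "node1", "jobid1": "4711", "pid": "100", "memory_rss": 1000},
    {"time": 0, "host": "node2", "jobid1": "4711", "pid": "200", "memory_rss": 3000},
    {"time": 10, "host": "node1", "jobid1": "4711", "pid": "100", "memory_rss": 1500},
]
print(job_rss(samples))  # {('4711', 0): 4000, ('4711', 10): 1500}
```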
## Swap data
Collected by the Telegraf plugin `swap`.
Collected at a higher frequency (10-second interval by default) for every node.
Measurement: `swap`.
| Key | Value | Info | Grafana | ASCII | PDF |
| ------------ | -------- | ------------------------------------------------------------ | ------- | ----- | --- |
| host | string\* | Hostname of the node | | | |
| free | bytes | free swap memory | | | MIN |
| in           | bytes    | data swapped in since last boot, derived from page counts    |         |       |     |
| out          | bytes    | data swapped out since last boot, derived from page counts   |         |       |     |
| total | bytes | total swap memory | | | |
| used | bytes | used swap memory | | | MIN |
| used_percent | float | percentage of swap memory used | | | |
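Since `in` and `out` accumulate since the last boot, activity over an interval is the difference between two samples. A minimal sketch with hypothetical names; it treats a decreasing counter as a reboot and reports no rate for that interval.

```python
# Hypothetical sketch: average swap-in/swap-out rates between two consecutive
# `swap` samples of the same host. `in` and `out` count bytes since last boot,
# so a decrease indicates a reboot and no rate is reported for that interval.
from typing import Dict, Optional


def swap_rates(prev: dict, curr: dict) -> Dict[str, Optional[float]]:
    dt = curr["time"] - prev["time"]  # seconds between the samples
    if dt <= 0:
        raise ValueError("samples must be in chronological order")
    rates: Dict[str, Optional[float]] = {}
    for key in ("in", "out"):
        delta = curr[key] - prev[key]
        rates[key + "_bytes_per_s"] = delta / dt if delta >= 0 else None
    return rates


prev = {"time": 0, "in": 0, "out": 4096}
curr = {"time": 10, "in": 40960, "out": 4096}
print(swap_rates(prev, curr))  # {'in_bytes_per_s': 4096.0, 'out_bytes_per_s': 0.0}
```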
## CPU data
Collected by the Telegraf plugin `cpu`.
Collected at a higher frequency (10-second interval by default) for every CPU of the node and for the node total.
Measurement: `cpu`.
| Key | Value | Info | Grafana | ASCII | PDF |
| ---------------- | -------- | --------------------- | ------- | ----- | --- |
| host | string\* | Hostname of the node | MIN | | MIN |
| cpu              | string\* | `cpu<N>` or `cpu-total` | MIN     |       |     |
| usage_guest | float | CPU usage guest | MIN | | |
| usage_guest_nice | float | CPU usage guest nice | | | |
| usage_idle | float | CPU usage idle | MIN | | MIN |
| usage_iowait | float | CPU usage iowait | MIN | | MIN |
| usage_irq | float | CPU usage irq | MIN | | |
| usage_nice | float | CPU usage nice | MIN | | |
| usage_softirq | float | CPU usage softirq | | | |
| usage_steal | float | CPU usage steal | | | |
| usage_system | float | CPU usage system | MIN | | MIN |
| usage_user | float | CPU usage user | MIN | | MIN |
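Telegraf's `cpu` plugin reports the `usage_*` fields as percentages of time, so a common derived value is the share of non-idle time per CPU. A minimal sketch with hypothetical names; counting iowait as idle is a choice made here, not something this specification prescribes.

```python
# Hypothetical sketch: per-CPU busy percentage from one set of `cpu` samples,
# taken as 100 minus idle (and, optionally, minus iowait).
from typing import Dict, List


def busy_percent(sample: dict, iowait_as_idle: bool = True) -> float:
    idle = sample["usage_idle"]
    if iowait_as_idle:
        idle += sample["usage_iowait"]
    return max(0.0, 100.0 - idle)


def per_cpu_busy(samples: List[dict]) -> Dict[str, float]:
    # Keep cpu0, cpu1, ... and skip the aggregated cpu-total row.
    return {s["cpu"]: busy_percent(s) for s in samples if s["cpu"] != "cpu-total"}


samples = [
    {"cpu": "cpu0", "usage_idle": 90.0, "usage_iowait": 5.0},
    {"cpu": "cpu1", "usage_idle": 20.0, "usage_iowait": 0.0},
    {"cpu": "cpu-total", "usage_idle": 55.0, "usage_iowait": 2.5},
]
print(per_cpu_busy(samples))  # {'cpu0': 5.0, 'cpu1': 80.0}
```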
## Memory data
Collected by the Telegraf plugin `mem`.
Collected at a higher frequency (10-second interval by default) for every node.
Measurement: `mem`.
| Key | Value | Info | Grafana | ASCII | PDF |
| ----------------- | -------- | --------------------------- | ------- | ----- | --- |
| host | string\* | Hostname of the node | MIN | | MIN |
| active | bytes | active memory | MIN | | |
| available | bytes | available memory | MIN | | MIN |
| available_percent | float | available percent of memory | MIN | | MIN |
| buffered | bytes | buffered memory | | | |
| cached | bytes | cached memory | | | |
| free | bytes | free memory | MIN | | |
| inactive | bytes | inactive memory | | | |
| total | bytes | total memory | MIN | | MIN |
| used | bytes | used memory | MIN | | MIN |
| used_percent | float | used percent of memory | MIN | | MIN |
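The percentage fields can be recomputed from the byte counters, assuming (as to our understanding Telegraf's `mem` plugin does) that `used_percent` is 100 × `used` / `total` and `available_percent` is 100 × `available` / `total`. Note that `used` and `available` need not add up to `total`. Minimal sketch, hypothetical names and values:

```python
# Hypothetical sketch: recompute the percentage fields of a `mem` sample from
# its byte counters, e.g. to sanity-check stored values.
import math


def mem_percentages(sample: dict) -> dict:
    total = sample["total"]
    return {
        "used_percent": 100.0 * sample["used"] / total,
        "available_percent": 100.0 * sample["available"] / total,
    }


sample = {"total": 1000, "used": 250, "available": 700, "used_percent": 25.0}
recomputed = mem_percentages(sample)
print(math.isclose(recomputed["used_percent"], sample["used_percent"]))  # True
```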
## System data
Collected by the Telegraf plugin `system`.
Collected at a higher frequency (10-second interval by default) for every node.
Measurement: `system`.
| Key | Value | Info | Grafana | ASCII | PDF |
| ------------- | -------- | --------------------------- | ------- | ----- | --- |
| host | string\* | Hostname of the node | MIN | | MIN |
| load1         | float    | 1-minute load average       | MIN     |       | MIN |
| load5         | float    | 5-minute load average       |         |       |     |
| load15        | float    | 15-minute load average      |         |       |     |
| n_cpus | integer | Number of CPUs | MIN | | |
| n_users | integer | Number of users on the host | | | |
| uptime | seconds | Uptime | | | |
| uptime_format | string | Formatted uptime | | | |
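The raw load average only becomes comparable across nodes once it is related to `n_cpus`. A minimal sketch with a hypothetical function name; on Linux the load average counts runnable tasks plus tasks in uninterruptible sleep (often waiting on I/O).

```python
# Hypothetical sketch: 1-minute load average per logical CPU from a `system`
# sample. Values above 1.0 mean more runnable or uninterruptibly waiting
# tasks than CPUs.

def load_per_cpu(sample: dict) -> float:
    if sample["n_cpus"] <= 0:
        raise ValueError("n_cpus must be positive")
    return sample["load1"] / sample["n_cpus"]


print(load_per_cpu({"host": "node1", "load1": 36.0, "n_cpus": 24}))  # 1.5
```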
## BeeGFS data
Collected by the Telegraf plugin script `beegfs_clients.sh`.
Measurement: `beegfs_clients`.
| Key | Value | Info | Grafana | ASCII | PDF |
| ------ | -------- | ------------------------ | ------- | ----- | --- |
| host | string\* | Hostname of the node | EXT | EXT | EXT |
| MiB-rd | float    | MiB read by the host     | EXT     | EXT   | EXT |
| MiB-wr | float    | MiB written by the host  | EXT     | EXT   | EXT |
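To look at a job's file-system traffic, the node data can be restricted to the job's run time from `pfit-jobinfo`. A minimal sketch that builds an InfluxQL query string; the function name is hypothetical. Identifiers containing `-` must be double-quoted in InfluxQL, and the epoch literals get an `s` suffix because `start_time`/`end_time` are UNIX seconds. Whether `MiB-rd`/`MiB-wr` are per-interval or cumulative values is not stated here, so the query returns raw points only.

```python
# Hypothetical sketch: InfluxQL query for the BeeGFS traffic of one host during
# a job. Identifiers containing '-' are double-quoted; string literals use
# single quotes; epoch literals use the `s` precision suffix.

def beegfs_query(host: str, start_time: int, end_time: int) -> str:
    safe_host = host.replace("\\", "\\\\").replace("'", "\\'")
    return (
        'SELECT "MiB-rd", "MiB-wr" FROM "beegfs_clients" '
        f"WHERE \"host\" = '{safe_host}' "
        f"AND time >= {start_time}s AND time <= {end_time}s"
    )


print(beegfs_query("node1", 1500000100, 1500003700))
```

The hosts of a job can be taken from its `pfit-jobinfo-alloc` entries, issuing one such query per host.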
## InfiniBand data
Collected by the Telegraf plugin script `infiniband.sh`.
Measurement: `infiniband`.
| Key | Value | Info | Grafana | ASCII | PDF |
| ------------ | -------- | ---------------------------- | ------- | ----- | --- |
| host | string\* | Hostname of the node | EXT | EXT | EXT |
| PortXmitData | float | Data transmitted by the host | EXT | EXT | EXT |
| PortRcvData  | float    | Data received by the host    | EXT     | EXT   | EXT |
## Nvidia GPU data
Collected by the Telegraf plugin script `nvidiatotal.sh`.
Measurement: `nvidia_gpu`.
| Key | Value | Info | Grafana | ASCII | PDF |
| --------------- | -------- | --------------------------------- | ------- | ----- | --- |
| host | string\* | Hostname of the node | EXT | EXT | EXT |
| bus | string\* | The Bus ID of the GPU on the node | EXT | EXT | EXT |
| gpu_name | string\* | GPU type | EXT | EXT | EXT |
| power.limit | float | The power limit of the GPU | EXT | | |
| memory.total    | float    | Total memory of the GPU           | EXT     |       | EXT |
| temperature.gpu | float    | The temperature of the GPU        | EXT     |       | EXT |
| power.draw      | float    | Power draw of the GPU             | EXT     |       | EXT |
| memory.used     | float    | Used memory of the GPU            | EXT     | EXT   | EXT |
| utilization.gpu | float | GPU utilization | EXT | EXT | EXT |
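The field names of `nvidia_gpu` match nvidia-smi's `--query-gpu` property names, which suggests that `nvidiatotal.sh` reads nvidia-smi output; that, and the exact command in the comment, are assumptions. With `nounits`, nvidia-smi reports memory in MiB, power in W, temperature in °C and utilization in %. A minimal parsing sketch with hypothetical names and sample data:

```python
# Hypothetical sketch: map nvidia-smi CSV output to `nvidia_gpu` records,
# assuming it was produced with
#   nvidia-smi --query-gpu=pci.bus_id,name,power.limit,memory.total,temperature.gpu,power.draw,memory.used,utilization.gpu --format=csv,noheader,nounits
import csv
import io
from typing import Dict, List, Optional

COLUMNS = ["bus", "gpu_name", "power.limit", "memory.total",
           "temperature.gpu", "power.draw", "memory.used", "utilization.gpu"]
TAG_KEYS = {"bus", "gpu_name"}


def to_float(text: str) -> Optional[float]:
    # nvidia-smi prints e.g. "[N/A]" or "[Not Supported]" for missing values
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_csv(text: str, host: str) -> List[Dict[str, object]]:
    records = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue
        record: Dict[str, object] = {"host": host}
        for key, value in zip(COLUMNS, row):
            record[key] = value if key in TAG_KEYS else to_float(value)
        records.append(record)
    return records


sample = "00000000:3B:00.0, Example GPU, 250.00, 32510, 35, [N/A], 1024, 7\n"
print(parse_gpu_csv(sample, "node1"))
```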
## Nvidia GPU process data
Collected by the Telegraf plugin script `nvidiatotal.sh`.
Measurement: `nvidia_proc`.
| Key | Value | Info | Grafana | ASCII | PDF |
| -------- | -------- | ----------------------------------- | ------- | ----- | --- |
| host | string\* | Hostname of the node | EXT | EXT | EXT |
| bus | string\* | The Bus ID of the GPU on the node | EXT | EXT | EXT |
| gpu_name | string\* | GPU type | EXT | EXT | EXT |
| JOBID    | string\* | Job ID the process belongs to       | EXT     | EXT   | EXT |
| username | string\* | Username the process belongs to | EXT | EXT | EXT |
| cpu_pcpu | float | CPU usage percentage of the process | EXT | EXT | EXT |
| cpu_rss | float | Memory used by the process | EXT | EXT | EXT |
| pid      | float    | The process PID                     | EXT     | EXT   | EXT |

\* Tag keys; all other keys are stored as fields.