Commit fadfbd5a authored by Azat Khuziyakhmetov

further fixes of the rcm documentation

parent 05eea0d2
......@@ -95,8 +95,233 @@ This attribute contains the object of `NodeData`. The data collected from the no
#### JobData.id
**type**: string
The id of the job given by the batch system.
#### JobData.user_name
**type**: string
The username of the user who submitted the job.
#### JobData.used_queue
**type**: string
The queue / partition name where the job was running
#### JobData.submit_time
**type**: integer. UNIX timestamp
Time when the job was submitted
#### JobData.start_time
**type**: integer. UNIX timestamp
Time when the job was started
#### JobData.end_time
**type**: integer. UNIX timestamp
Time when the job ended
#### JobData.run_time
**type**: integer. seconds
The runtime of the job
#### JobData.requested_time
**type**: integer. seconds
Walltime requested by user for the job
#### JobData.num_nodes
**type**: integer
Amount of requested nodes
#### JobData.alloc_cu
**type**: list of [AllocCUData](#AllocCUData)
List of Compute Units (CU) allocation data
## AllocCUData
#### AllocCUData.node_id
**type**: string
The hostname of the node
#### AllocCUData.cu_count
**type**: integer
The allocated Compute Units (CU) on the corresponding node.
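The `JobData` and `AllocCUData` attributes listed above can be pictured as plain Python data classes. This is only a sketch based on the documented names and types, not the project's actual class definitions; the helper `total_alloc_cu` is hypothetical and is included only to show how the `alloc_cu` list might be consumed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AllocCUData:
    node_id: str   # hostname of the node
    cu_count: int  # Compute Units (CU) allocated on that node


@dataclass
class JobData:
    id: str              # job id given by the batch system
    user_name: str       # user who submitted the job
    used_queue: str      # queue / partition name
    submit_time: int     # UNIX timestamp
    start_time: int      # UNIX timestamp
    end_time: int        # UNIX timestamp
    run_time: int        # seconds
    requested_time: int  # requested walltime in seconds
    num_nodes: int       # requested nodes
    alloc_cu: List[AllocCUData] = field(default_factory=list)


def total_alloc_cu(job: JobData) -> int:
    """Hypothetical helper: sum the allocated CUs over all nodes of the job."""
    return sum(cu.cu_count for cu in job.alloc_cu)
```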
## NodeData
#### NodeData.name
**type**: string
The hostname of the node
#### NodeData.cpu_model
**type**: string
The CPU model of the node
#### NodeData.sockets
**type**: integer
Amount of sockets on the node
#### NodeData.cores_per_socket
**type**: integer
Amount of cores per socket
#### NodeData.virt_thr_core
**type**: integer
Amount of virtual threads per core
#### NodeData.phys_thr_core
**type**: integer
Amount of physical threads per core
#### NodeData.main_mem
**type**: integer. Bytes
Amount of RAM on the node
#### NodeData.alloc_cu
**type**: integer
Amount of allocated CU on the node
#### NodeData.ib_rcv_max
**type**: integer. Bytes per second
The maximum (high water mark) rate of received data over infiniband in bytes per second
#### NodeData.ib_xmit_max
**type**: integer. Bytes per second
The maximum (high water mark) rate of transmitted data over infiniband in bytes per second
#### NodeData.proc
**type**: [ProcData](#ProcData)
Aggregated data of job's processes on the node.
#### NodeData.seq_cpu_usage
**type**: [SeqVals](#SeqVals)
Sequential data of CPU usage, aggregated as an average over each interval.
#### NodeData.seq_load_avg
**type**: [SeqVals](#SeqVals)
Sequential data of load1 of the node, aggregated as an average over each interval.
#### NodeData.seq_load_max
**type**: [SeqVals](#SeqVals)
Sequential data of load1 of the node, aggregated as the maximum in each interval.
#### NodeData.seq_ib_rcv_max
**type**: [SeqVals](#SeqVals)
Sequential data of the received data rate over InfiniBand, aggregated as the maximum bytes per second in each interval.
#### NodeData.seq_ib_xmit_max
**type**: [SeqVals](#SeqVals)
Sequential data of the transmitted data rate over InfiniBand, aggregated as the maximum bytes per second in each interval.
## SeqVals
#### SeqVals.delta
**type**: integer. seconds
The time difference between consecutive points in the sequence.
#### SeqVals.seq
**type**: list of arbitrary data types
The list of data points. The first point corresponds to the data point at the time [JobData.start_time](#JobData.start_time) and the last point corresponds to the data point at the time [JobData.end_time](#JobData.end_time)
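Since the first point sits at `JobData.start_time` and consecutive points are `delta` seconds apart, the timestamp of each point can be derived from its index. The sketch below assumes evenly spaced points; the function name `seq_timestamps` is hypothetical, and the last timestamp coincides with `JobData.end_time` only when the job duration is a multiple of `delta`.

```python
def seq_timestamps(start_time, delta, seq):
    """Return one UNIX timestamp per data point in a SeqVals sequence.

    Assumes point i was taken at start_time + i * delta.
    """
    return [start_time + i * delta for i in range(len(seq))]


# Pair every value with its (assumed) timestamp
values = [0.5, 0.9, 0.7]
points = list(zip(seq_timestamps(1000, 60, values), values))
```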
## ProcData
#### ProcData.cpu_time_user
**type**: integer. seconds
The sum of the time CPU spent on running user processes/calls.
#### ProcData.cpu_time_system
**type**: integer. seconds
The sum of the time CPU spent on running system processes/calls.
#### ProcData.cpu_time_idle
**type**: integer. seconds
The sum of the time CPU was idling.
#### ProcData.cpu_time_iowait
**type**: integer. seconds
The sum of the time CPU spent waiting for IO.
#### ProcData.write_bytes
**type**: integer. bytes
The maximum (high water mark) amount of bytes processes wrote to FS.
#### ProcData.read_bytes
**type**: integer. bytes
The maximum (high water mark) amount of bytes processes read from FS.
#### ProcData.write_count
**type**: integer
The maximum (high water mark) amount of write accesses to FS.
#### ProcData.read_count
**type**: integer
The maximum (high water mark) amount of read accesses to FS.
#### ProcData.mem_rss_max
**type**: integer. bytes
The maximum (high water mark) amount of RSS memory occupied by the job on the node.
#### ProcData.mem_rss_avg
**type**: integer. bytes
The average amount of RSS memory occupied by the job on the node.
#### ProcData.mem_swap_max
**type**: integer. bytes
The maximum (high water mark) amount of swapped memory on the node.
#### ProcData.mem_vms
**type**: integer. bytes
The average amount of VMS memory on the node.
#### ProcData.cpu_usage
**type**: integer
The maximum (high water mark) of cpu usage sum of all processes running in the job on the node.
#### ProcData.seq_cpu_usage
**type**: [SeqVals](#SeqVals)
Sequential data of cpu usage sum of all processes running in the job on the node aggregated as an average on the interval.
#### ProcData.seq_mem_rss_sum
**type**: [SeqVals](#SeqVals)
Sequential data of RSS memory usage sum of all processes running in the job on the node aggregated as an average on the interval.
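As an illustration of how the four `ProcData.cpu_time_*` counters might be used together, the sketch below turns them into fractions of their sum. It assumes the four counters cover all accounted CPU time, which the documentation above does not state; the function name `cpu_time_shares` is hypothetical.

```python
def cpu_time_shares(user, system, idle, iowait):
    """Return each CPU time counter as a fraction of their sum.

    All arguments are in seconds. Returns zeros if nothing was recorded.
    """
    counters = {"user": user, "system": system, "idle": idle, "iowait": iowait}
    total = sum(counters.values())
    if total == 0:
        return {name: 0.0 for name in counters}
    return {name: value / total for name, value in counters.items()}
```

A large `iowait` share relative to `user` could hint that the job spends much of its time waiting on the file system rather than computing.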
......@@ -10,7 +10,6 @@ exec(open(rules_path).read())
 def print_header():
     print("# Rules \n"
         "\n"
         "A rule is a set of [attributes](./attributes.md) with specified values."
         "\n")
......@@ -19,20 +18,21 @@ def format_attr_link(attr):
 def print_rule(rule):
     print("---")
-    print("#### recommendation\n")
+    print("#### recommendation")
     print("```")
     print(rule["msg"])
     print("```")
-    print("#### attributes\n")
+    print("#### attributes")
     print("attribute name | value(s)")
     print("--- | ---")
     for attr_name, attr_value in rule["attrs"].items():
-        print("[{an}]({al}) | `{av}`\n".format(
+        print("[{an}]({al}) | `{av}`".format(
             an = attr_name,
             al = format_attr_link(attr_name),
             av = attr_value,
         ))
+    print("")

 def main():
     print_header()
     for arule in RULES:
......
# Rules
A rule is a set of [attributes](./attributes.md) with specified values.
---
#### recommendation
```
Some nodes are overloaded, probably by other processes
```
#### attributes
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) | `NORM`
[overloaded_node_exists](./attributes.md#overloaded_node_exists) | `True`
---
#### recommendation
```
The requested walltime is too high. Try to request less time
```
#### attributes
attribute name | value(s)
--- | ---
[req_walltime](./attributes.md#req_walltime) | `HIGH`
---
#### recommendation
```
The CPU usage on some nodes is low. Please request fewer cores or increase the number of processes
```
#### attributes
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) | `LOW`
[node_cpu_usage_min](./attributes.md#node_cpu_usage_min) | `LOW`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `NORM`
---
#### recommendation
```
Some nodes were not used during the runtime
```
#### attributes
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[node_cpu_usage_min](./attributes.md#node_cpu_usage_min) | `ZERO`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `NORM`
---
#### recommendation
```
The CPU usage is not distributed equally among the nodes. Try to use the nodes evenly
```
#### attributes
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) | `NORM`
[node_cpu_usage_min](./attributes.md#node_cpu_usage_min) | `LOW`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `NORM`
---
#### recommendation
```
The CPU usage of the job is low on all nodes. Please request an appropriate amount of resources
```
#### attributes
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[mem_usage_total](./attributes.md#mem_usage_total) | `LOW`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) | `LOW`
[node_cpu_usage_min](./attributes.md#node_cpu_usage_min) | `LOW`
---
#### recommendation
```
The CPU usage of the node is low. This might indicate that the job is not running at full power
```
#### attributes
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `ONE`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `LOW`
---
#### recommendation
```
The node is overloaded, probably by other processes on the node
```
#### attributes
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `ONE`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `NORM`
[overloaded_node_exists](./attributes.md#overloaded_node_exists) | `True`
---
#### recommendation
```
The node is overloaded and the CPU usage of the job was high
```
#### attributes
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `ONE`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `HIGH`
[overloaded_node_exists](./attributes.md#overloaded_node_exists) | `True`
---
#### recommendation
```
Swap was used on one of the nodes
```
#### attributes
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[mem_swap_used](./attributes.md#mem_swap_used) | `True`
---
#### recommendation
```
Swap was used on the node
```
#### attributes
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `ONE`
[mem_swap_used](./attributes.md#mem_swap_used) | `True`
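
Because a rule is a set of attributes with specified values, selecting recommendations for a job can be sketched as a simple comparison. The `msg` and `attrs` keys follow the generator script in this commit; the matching logic itself (a rule applies when every listed attribute equals the job's value) is an assumption, and `matching_recommendations` is a hypothetical name, not the project's implementation.

```python
# Two of the rules listed above, in the structure the generator script reads
RULES = [
    {"msg": "Swap was used on one of the nodes",
     "attrs": {"job_nodes_amount": "MULT", "mem_swap_used": True}},
    {"msg": "Swap was used on the node",
     "attrs": {"job_nodes_amount": "ONE", "mem_swap_used": True}},
]


def matching_recommendations(job_attrs, rules):
    """Return the messages of all rules whose attributes all match the job.

    Assumption: attributes missing from job_attrs never match.
    """
    missing = object()  # sentinel so that a missing attribute never equals a rule value
    return [
        rule["msg"]
        for rule in rules
        if all(job_attrs.get(name, missing) == value
               for name, value in rule["attrs"].items())
    ]
```

For example, a single-node job with `mem_swap_used` set to `True` would match only the second rule in this sketch.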