Commit 05eea0d2 authored by Azat Khuziyakhmetov

fixed documentations

parent 92ae80a4
......@@ -4,6 +4,7 @@ The tool aggregates data in a format which is suitable for generating reports.
*Currently also contains the script to export `jobinfo` data into DB from the post execution script*.
## Requirements
The aggregator needs particular software packages to be installed along with data available in DB.
- Python 3.
......@@ -17,6 +18,7 @@ python3 -m pip install -r requirements.txt
```
## Configuration
Sample configurations can be stored in the repository. Rename the corresponding templates from the `.sample` extension to `.py` and change the placeholder values accordingly.
Real configs should be listed in the `.gitignore` file so that they are not committed.
......@@ -26,6 +28,7 @@ Sample configs must not be used in the code.
**Example**: `influxdb.sample` -> `influxdb.py`
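
The renaming step can be scripted. The following is a minimal sketch, not part of the repository: the function name `activate_sample_configs` and its `conf_dir` argument are hypothetical. It copies each template instead of renaming it, on the assumption that the `.sample` file should stay in the repository, and it never overwrites an existing `.py` config.

```python
import shutil
from pathlib import Path


def activate_sample_configs(conf_dir):
    """Copy every `*.sample` template in conf_dir to a `.py` file of the same name.

    Hypothetical helper: existing `.py` files are left untouched so that
    real settings are never overwritten. Returns the newly created paths.
    """
    created = []
    for sample in sorted(Path(conf_dir).glob("*.sample")):
        target = sample.with_suffix(".py")  # influxdb.sample -> influxdb.py
        if target.exists():
            continue
        shutil.copyfile(sample, target)
        created.append(target)
    return created
```

The placeholder values in the copied files still have to be edited by hand afterwards.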
## Usage
The main executable of the aggregator module is `data.py`. You can type `./data.py -h` for more help.
```
......@@ -45,6 +48,7 @@ optional arguments:
```
## Get test output
In order to get `json` output from the test data, located at `test/data`, a Docker container with an InfluxDB instance and the imported data must be running. To run the InfluxDB container, simply execute:
```
test/docker/influxdb/run_influxdb.sh
......@@ -70,8 +74,13 @@ As an example the following command will output the test data in the `json` form
You can use other `JOBID`s, which you can find in the `test/data` directory.
## Export job info
In order to export the job info with `JOBID` you should call `export.py`:
```bash
export.py JOBID
```
The aggregator then gathers the information for the job with ID `JOBID` from the batch system configured in `/conf/config.py` and saves it into the configured database as the `pfit-jobinfo` measurement and, additionally, as the `pfit-jobinfo-alloc` measurement.
## Recommendation system
Currently the recommendation system is a module used by the aggregator and located in this repository under the [rcm](./rcm) directory. Please see the [documentation](./rcm/docs) and the [README](./rcm/README.md) for more information.
......@@ -19,13 +19,15 @@ def format_attr_link(attr):
def print_rule(rule):
print("---")
print("**recommendation**:\n")
print("#### recommendation\n")
print("```")
print(rule["msg"])
print("```")
print("**attributes**:\n")
print("#### attributes\n")
print("attribute name | value(s)")
print("--- | ---")
for attr_name, attr_value in rule["attrs"].items():
print("[{an}]({al}) = `{av}`\n".format(
print("[{an}]({al}) | `{av}`\n".format(
an = attr_name,
al = format_attr_link(attr_name),
av = attr_value,
......
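
The hunk above is truncated, so here is a self-contained sketch of the table-style output it introduces. Several parts are assumptions rather than the repository's code: `render_rule` is a hypothetical helper, the body of `format_attr_link` is inferred from the links in the generated rules document, and the rule is assumed to be a dict with `msg` and `attrs` keys as used in the hunk. Unlike the hunk, the sketch does not append `\n` to each row, because a blank line between rows ends a table in GitLab-flavored Markdown.

```python
FENCE = "`" * 3  # Markdown code fence, built this way to keep this example readable


def format_attr_link(attr):
    # Assumption: links point at anchors in attributes.md,
    # as seen in the generated rules document.
    return "./attributes.md#{}".format(attr)


def render_rule(rule):
    """Return the Markdown lines for one rule: message block plus attribute table."""
    lines = [
        "---",
        "#### recommendation\n",
        FENCE,
        rule["msg"],
        FENCE,
        "#### attributes\n",
        "attribute name | value(s)",
        "--- | ---",
    ]
    for attr_name, attr_value in rule["attrs"].items():
        # One table row per attribute, with no blank line between rows.
        lines.append("[{an}]({al}) | `{av}`".format(
            an=attr_name,
            al=format_attr_link(attr_name),
            av=attr_value,
        ))
    return lines


def print_rule(rule):
    print("\n".join(render_rule(rule)))
```

Returning the lines before printing them keeps the formatting testable without capturing standard output.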
......@@ -3,166 +3,188 @@
A rule is a set of [attributes](./attributes.md) with specified values.
---
**recommendation**:
#### recommendation
```
Some nodes are overloaded. Probably by other processes
```
**attributes**:
#### attributes
[job_nodes_amount](./attributes.md#job_nodes_amount) = `MULT`
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) = `NORM`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) | `NORM`
[overloaded_node_exists](./attributes.md#overloaded_node_exists) = `True`
[overloaded_node_exists](./attributes.md#overloaded_node_exists) | `True`
---
**recommendation**:
#### recommendation
```
The requested walltime is too high. Try to request less time
```
**attributes**:
#### attributes
[req_walltime](./attributes.md#req_walltime) = `HIGH`
attribute name | value(s)
--- | ---
[req_walltime](./attributes.md#req_walltime) | `HIGH`
---
**recommendation**:
#### recommendation
```
The CPU usage on some nodes is low, please request less cores or increase amount of processes
```
**attributes**:
#### attributes
[job_nodes_amount](./attributes.md#job_nodes_amount) = `MULT`
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) = `NORM`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) = `LOW`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) | `LOW`
[node_cpu_usage_min](./attributes.md#node_cpu_usage_min) = `LOW`
[node_cpu_usage_min](./attributes.md#node_cpu_usage_min) | `LOW`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) = `NORM`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `NORM`
---
**recommendation**:
#### recommendation
```
Some nodes were not used during the runtime
```
**attributes**:
#### attributes
[job_nodes_amount](./attributes.md#job_nodes_amount) = `MULT`
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) = `NORM`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[node_cpu_usage_min](./attributes.md#node_cpu_usage_min) = `ZERO`
[node_cpu_usage_min](./attributes.md#node_cpu_usage_min) | `ZERO`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) = `NORM`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `NORM`
---
**recommendation**:
#### recommendation
```
The CPU usage is not distributed equally among the nodes. Try to use the nodes evenly
```
**attributes**:
#### attributes
[job_nodes_amount](./attributes.md#job_nodes_amount) = `MULT`
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) = `NORM`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) = `NORM`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) | `NORM`
[node_cpu_usage_min](./attributes.md#node_cpu_usage_min) = `LOW`
[node_cpu_usage_min](./attributes.md#node_cpu_usage_min) | `LOW`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) = `NORM`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `NORM`
---
**recommendation**:
#### recommendation
```
The CPU usage of the job is low on all nodes, please request appropriate amount of resources
```
**attributes**:
#### attributes
[job_nodes_amount](./attributes.md#job_nodes_amount) = `MULT`
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) = `NORM`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[mem_usage_total](./attributes.md#mem_usage_total) = `LOW`
[mem_usage_total](./attributes.md#mem_usage_total) | `LOW`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) = `LOW`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) | `LOW`
[node_cpu_usage_min](./attributes.md#node_cpu_usage_min) = `LOW`
[node_cpu_usage_min](./attributes.md#node_cpu_usage_min) | `LOW`
---
**recommendation**:
#### recommendation
```
The CPU usage of the node is low. It might indicate that the job is not running in full power
```
**attributes**:
#### attributes
[job_nodes_amount](./attributes.md#job_nodes_amount) = `ONE`
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `ONE`
[req_walltime](./attributes.md#req_walltime) = `NORM`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) = `LOW`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `LOW`
---
**recommendation**:
#### recommendation
```
The node is overloaded. Probably by other processes on the node
```
**attributes**:
#### attributes
[job_nodes_amount](./attributes.md#job_nodes_amount) = `ONE`
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `ONE`
[req_walltime](./attributes.md#req_walltime) = `NORM`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) = `NORM`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `NORM`
[overloaded_node_exists](./attributes.md#overloaded_node_exists) = `True`
[overloaded_node_exists](./attributes.md#overloaded_node_exists) | `True`
---
**recommendation**:
#### recommendation
```
The node is overloaded and cpu usage of job was high
```
**attributes**:
#### attributes
[job_nodes_amount](./attributes.md#job_nodes_amount) = `ONE`
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `ONE`
[req_walltime](./attributes.md#req_walltime) = `NORM`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) = `HIGH`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `HIGH`
[overloaded_node_exists](./attributes.md#overloaded_node_exists) = `True`
[overloaded_node_exists](./attributes.md#overloaded_node_exists) | `True`
---
**recommendation**:
#### recommendation
```
Swap was used on one of the nodes
```
**attributes**:
#### attributes
[job_nodes_amount](./attributes.md#job_nodes_amount) = `MULT`
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[mem_swap_used](./attributes.md#mem_swap_used) = `True`
[mem_swap_used](./attributes.md#mem_swap_used) | `True`
---
**recommendation**:
#### recommendation
```
Swap was used on the node
```
**attributes**:
#### attributes
[job_nodes_amount](./attributes.md#job_nodes_amount) = `ONE`
attribute name | value(s)
--- | ---
[job_nodes_amount](./attributes.md#job_nodes_amount) | `ONE`
[mem_swap_used](./attributes.md#mem_swap_used) = `True`
[mem_swap_used](./attributes.md#mem_swap_used) | `True`
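
Since a rule is a set of attributes with specified values, one plausible way to apply the rules is to compare them with the attributes computed for a job. The sketch below is an illustration under assumptions, not the `rcm` module's actual matching logic: `rule_matches` and `recommendations` are hypothetical names, rules are assumed to be dicts with `msg` and `attrs` keys, and each attribute is assumed to hold a single value even though the table header reads "value(s)".

```python
def rule_matches(rule, job_attrs):
    """True if every attribute required by the rule has the same value for the job.

    Assumption: attributes that the rule does not mention are ignored.
    """
    return all(
        job_attrs.get(name) == value
        for name, value in rule["attrs"].items()
    )


def recommendations(rules, job_attrs):
    """Collect the messages of all rules that match the job's attributes."""
    return [rule["msg"] for rule in rules if rule_matches(rule, job_attrs)]


# Two of the rules listed above, written out as dicts.
RULES = [
    {"msg": "Swap was used on one of the nodes",
     "attrs": {"job_nodes_amount": "MULT", "mem_swap_used": True}},
    {"msg": "Swap was used on the node",
     "attrs": {"job_nodes_amount": "ONE", "mem_swap_used": True}},
]

job = {"job_nodes_amount": "ONE", "mem_swap_used": True, "req_walltime": "NORM"}
print(recommendations(RULES, job))
```

With this reading, a single-node job that used swap only triggers the second rule, because the first one requires `job_nodes_amount` to be `MULT`.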