Commit 642e547f authored by Azat Khuziyakhmetov

added attributes and rules for GPUs

parent 725e067c
@@ -42,7 +42,7 @@ The optimal CPU usage depends on the amount of cores requested. When the number
`NORM` - The maximum `cpu_usage` is in the tolerable range
`LOW` - The maximum `cpu_usage` is lower than the tolerable value
`ZERO` - The maximum `cpu_usage` is 0
@@ -77,7 +77,7 @@ Similar to the attribute `node_cpu_usage_max` but indicates the minimum of `cpu_
`NORM` - The minimum `cpu_usage` is in the tolerable range
`LOW` - The minimum `cpu_usage` is lower than the tolerable value
`ZERO` - The minimum `cpu_usage` is 0
@@ -89,7 +89,7 @@ Similar to the attribute `node_cpu_usage_max` but indicates the minimum of `cpu_
`ZERO` = if a node exists such that `U = 0`
`LOW` = !`ZERO` & a node exists such that `U < 50` & `R <= 8` | `U < 80` & `R > 8` holds
`NORM` = !`ZERO` & !`LOW` & a node exists such that `U > 50` & `R <= 8` | `U > 80` & `R > 8` holds
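The clauses above are a cascade of existence checks over the per-node usage values `U` and the requested cores `R`. A minimal sketch of the cascade (the function name and input shapes are hypothetical, and any case not covered by the three clauses is returned as `None`):

```python
def classify_node_cpu_usage(usages, requested_cores):
    """Classify per-node cpu_usage values per the ZERO/LOW/NORM clauses above.

    usages: list of per-node cpu_usage values U (percent)
    requested_cores: R, the number of requested cores
    """
    R = requested_cores

    def below(u):  # the LOW threshold: U < 50 for R <= 8, U < 80 for R > 8
        return (u < 50 and R <= 8) or (u < 80 and R > 8)

    def above(u):  # the NORM threshold: U > 50 for R <= 8, U > 80 for R > 8
        return (u > 50 and R <= 8) or (u > 80 and R > 8)

    if any(u == 0 for u in usages):
        return "ZERO"
    if any(below(u) for u in usages):
        return "LOW"
    if any(above(u) for u in usages):
        return "NORM"
    return None  # not covered by the three clauses
```

Note the clauses are checked in order, so `ZERO` wins over `LOW`, and `LOW` over `NORM`, exactly as the `!`-guards express.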
@@ -111,7 +111,7 @@ The attribute is very similar to **node_cpu_usage_max** but instead of global ma
`NORM` - The maximum `cpu_usage` is in the tolerable range
`LOW` - The maximum `cpu_usage` is lower than the tolerable value
**Calculation**:
@@ -129,7 +129,7 @@ The attribute is very similar to **node_cpu_usage_max** but instead of global ma
The attribute for the number of nodes allocated for the job.
It simply divides the jobs into 2 categories: running on a single node and running on multiple nodes.
**Values**:
@@ -171,9 +171,7 @@ It is `True` if the node has a high load during the interval `D`.
**Calculation**:
If during the interval `D` (300 seconds) consecutive measurements of `load1` on the node exceed the number of cores the node has, then the value is `True`, `False` otherwise.
## mem_swap_used
@@ -188,3 +186,96 @@ It is `True` if there is a node where swap memory was used.
**Calculation**:
For every node, check whether the `mem_swap_max` value is non-zero. If it is non-zero for any node, set the attribute to `True`, otherwise `False`.
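This calculation is a plain existence test over the nodes; as a two-line sketch (the per-node data shape is hypothetical):

```python
def mem_swap_used(nodes):
    # nodes: iterable of per-node dicts carrying a "mem_swap_max" entry
    return any(n["mem_swap_max"] != 0 for n in nodes)
```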
## gpu_job
It is `True` if the job used at least one GPU.
**Values**:
`True` - if GPU was used
`False` - otherwise
**Calculation**:
For every node, check whether a GPU was used. If a GPU was used on any node, set the attribute to `True`, otherwise `False`.
## gpu_usage_max
This attribute indicates the maximum GPU usage among all GPUs that were running processes of the job.
**Values**:
`HIGH` - The maximum GPU usage is almost 100%
`NORM` - The maximum GPU usage is in the tolerable range
`LOW` - The maximum GPU usage is lower than tolerable value
`ZERO` - The maximum GPU usage is 0
**Calculation**:
`U` - the GPU usage of a particular GPU (max 100%).
`ZERO` = if a GPU exists such that `U <= 0.5`
`LOW` = !`ZERO` & a GPU exists such that `U < 50` holds
`NORM` = !`ZERO` & !`LOW` & a GPU exists such that `U < 90` holds
`HIGH` = !`ZERO` & !`LOW` & !`NORM`
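The four clauses read as a cascade over the per-GPU usage values, with `HIGH` as the fall-through case. A minimal sketch (function name hypothetical):

```python
def classify_gpu_usage(usages):
    """Classify per the ZERO/LOW/NORM/HIGH clauses above.

    usages: GPU utilisation values U in percent, one per GPU of the job.
    """
    if any(u <= 0.5 for u in usages):
        return "ZERO"   # at least one GPU effectively idle
    if any(u < 50 for u in usages):
        return "LOW"    # at least one GPU below the tolerable range
    if any(u < 90 for u in usages):
        return "NORM"
    return "HIGH"       # !ZERO & !LOW & !NORM
```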
## gpu_usage_min
This attribute indicates the minimum GPU usage among all GPUs that were running processes of the job.
It is similar to `gpu_usage_max` but indicates the minimum GPU usage.
**Values**:
`HIGH` - The minimum GPU usage is almost 100%
`NORM` - The minimum GPU usage is in the tolerable range
`LOW` - The minimum GPU usage is lower than tolerable value
`ZERO` - The minimum GPU usage is 0
**Calculation**:
`U` - the GPU usage of a particular GPU (max 100%).
`ZERO` = if a GPU exists such that `U <= 0.5`
`LOW` = !`ZERO` & a GPU exists such that `U < 50` holds
`NORM` = !`ZERO` & !`LOW` & a GPU exists such that `U < 90` holds
`HIGH` = !`ZERO` & !`LOW` & !`NORM`
## gpus_amount
The attribute for the number of GPUs used during the runtime of the job.
It simply divides the jobs into 2 categories: using a single GPU and using multiple GPUs.
**Values**:
`ONE` - if the number of GPUs equals 1
`MULT` - if the number of GPUs is greater than 1
## gpus_overcrowded_exist
It is `True` if any GPU has multiple processes running on it during the interval `D`.
**Values**:
`True` - if such GPU exists
`False` - otherwise
**Calculation**:
If during the interval `D` (1800 seconds) consecutive measurements of the number of processes on the GPU exceed 1, then the value is `True`, `False` otherwise.
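With samples taken every `delta` seconds, covering the interval `D` takes at least `ceil(D / delta)` consecutive samples showing 2 or more processes. A sketch of that run-length check (names hypothetical; missing samples are skipped without resetting the run):

```python
import math

def gpus_overcrowded_exist(proc_counts, delta, window=1800):
    """True if >= 2 processes were seen in enough consecutive samples
    to span `window` seconds.

    proc_counts: per-sample process count on one GPU (None = missing sample)
    delta: seconds between samples
    """
    need = math.ceil(window / delta)  # samples required to cover the interval
    run = 0
    for p in proc_counts:
        if p is None:
            continue  # missing measurement: skip without resetting the run
        if p >= 2:
            run += 1
            if run >= need:
                return True
        else:
            run = 0  # streak broken
    return False
```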
@@ -9,6 +9,7 @@ Some nodes are overloaded. Probably by other processes
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `False`
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) | `NORM`
[overloaded_node_exists](./attributes.md#overloaded_node_exists) | `True`
@@ -31,6 +32,7 @@ The CPU usage on some nodes is low, please adjust the amount of requested resour
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `False`
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) | `LOW`
@@ -49,6 +51,7 @@ attribute name | value(s)
[req_walltime](./attributes.md#req_walltime) | `NORM`
[node_cpu_usage_min](./attributes.md#node_cpu_usage_min) | `ZERO`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `NORM`
[gpu_usage_min](./attributes.md#gpu_usage_min) | `ZERO`
---
#### recommendation
@@ -58,6 +61,7 @@ The CPU usage is not distributed equally among the nodes. Try to use the nodes e
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `False`
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) | `NORM`
@@ -72,6 +76,7 @@ The CPU usage of the job is low on all nodes, please request appropriate amount
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `False`
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[mem_usage_total](./attributes.md#mem_usage_total) | `LOW`
@@ -87,6 +92,7 @@ The CPU usage of the job is low on all nodes. Please check if the job is running
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `False`
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[mem_usage_total](./attributes.md#mem_usage_total) | `NORM`
@@ -102,6 +108,7 @@ The CPU usage of the node is low. It might indicate that the job is not running
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `False`
[job_nodes_amount](./attributes.md#job_nodes_amount) | `ONE`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `LOW`
@@ -154,3 +161,88 @@ attribute name | value(s)
[job_nodes_amount](./attributes.md#job_nodes_amount) | `ONE`
[mem_swap_used](./attributes.md#mem_swap_used) | `True`
---
#### recommendation
```
The single GPU has low usage
```
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `True`
[gpus_amount](./attributes.md#gpus_amount) | `ONE`
[gpu_usage_max](./attributes.md#gpu_usage_max) | `LOW`
---
#### recommendation
```
All GPUs have low usage
```
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `True`
[gpus_amount](./attributes.md#gpus_amount) | `MULT`
[gpu_usage_max](./attributes.md#gpu_usage_max) | `LOW`
---
#### recommendation
```
Some of the GPUs have low usage which might be caused by workload imbalance
```
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `True`
[gpus_amount](./attributes.md#gpus_amount) | `MULT`
[gpu_usage_max](./attributes.md#gpu_usage_max) | `NORM, HIGH`
[gpu_usage_min](./attributes.md#gpu_usage_min) | `LOW`
---
#### recommendation
```
Some of the GPUs were not used
```
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `True`
[gpus_amount](./attributes.md#gpus_amount) | `MULT`
[gpu_usage_min](./attributes.md#gpu_usage_min) | `ZERO`
---
#### recommendation
```
None of the GPUs were used
```
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `True`
[gpus_amount](./attributes.md#gpus_amount) | `MULT`
[gpu_usage_max](./attributes.md#gpu_usage_max) | `ZERO`
---
#### recommendation
```
Some GPUs have more than 1 process using them at the same time
```
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `True`
[gpus_amount](./attributes.md#gpus_amount) | `MULT`
[gpus_overcrowded_exist](./attributes.md#gpus_overcrowded_exist) | `True`
---
#### recommendation
```
The GPU has been used by more than 1 process simultaneously
```
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `True`
[gpus_amount](./attributes.md#gpus_amount) | `ONE`
[gpus_overcrowded_exist](./attributes.md#gpus_overcrowded_exist) | `True`
@@ -3,6 +3,7 @@ import math
from db.aggrstruct import *

LONG_DUR_SEC = 300
GPU_LONG_DUR_SEC = 1800

def accepts(*types, **kw):
    '''Function decorator. Checks decorated function's arguments are
@@ -53,7 +54,7 @@ def node_has_high_load_interval(node):
    seq = node.seq_load_max
    max_points = math.ceil(LONG_DUR_SEC / seq.delta)
    max_load = node.sockets * node.cores_per_socket * (node.phys_thr_core + node.virt_thr_core)

    for p in seq.seq:
@@ -70,13 +71,26 @@ def node_has_high_load_interval(node):
    return has_high_load

@accepts(GPUData)
def gpu_has_overcrowded_interval(gpu):
    has_overcrowded = False
    conseq_points = 0

    seq = gpu.seq_cpu_proc_count
    max_points = math.ceil(GPU_LONG_DUR_SEC / seq.delta)

    for p in seq.seq:
        if conseq_points >= max_points:
            has_overcrowded = True
            break
        if p is None: continue
        if p >= 2:
            conseq_points += 1
        else:
            conseq_points = 0

    return has_overcrowded
RULES = [
    {
        "attrs": {
            "gpu_job": False,
            "job_nodes_amount": "MULT",
            "cpu_usage_total_max": "NORM",
            "overloaded_node_exists": True},
@@ -11,6 +12,7 @@ RULES = [
        "msg": "The requested walltime is too high. Try to request less time"},
    {
        "attrs": {
            "gpu_job": False,
            "job_nodes_amount": "MULT",
            "req_walltime": "NORM",
            "cpu_usage_total_max": "LOW",
@@ -22,10 +24,12 @@ RULES = [
            "job_nodes_amount": "MULT",
            "req_walltime": "NORM",
            "node_cpu_usage_min": "ZERO",
            "node_cpu_usage_max": "NORM",
            "gpu_usage_min": "ZERO"},
        "msg": "Some nodes were not used during the runtime"},
    {
        "attrs": {
            "gpu_job": False,
            "job_nodes_amount": "MULT",
            "req_walltime": "NORM",
            "cpu_usage_total_max": "NORM",
@@ -34,6 +38,7 @@ RULES = [
        "msg": "The CPU usage is not distributed equally among the nodes. Try to use the nodes evenly"},
    {
        "attrs": {
            "gpu_job": False,
            "job_nodes_amount": "MULT",
            "req_walltime": "NORM",
            "mem_usage_total": "LOW",
@@ -43,6 +48,7 @@ RULES = [
        "msg": "The CPU usage of the job is low on all nodes, please request appropriate amount of resources"},
    {
        "attrs": {
            "gpu_job": False,
            "job_nodes_amount": "MULT",
            "req_walltime": "NORM",
            "mem_usage_total": "NORM",
@@ -52,6 +58,7 @@ RULES = [
        "msg": "The CPU usage of the job is low on all nodes. Please check if the job is running properly"},
    {
        "attrs": {
            "gpu_job": False,
            "job_nodes_amount": "ONE",
            "req_walltime": "NORM",
            "node_cpu_usage_max": "LOW"},
@@ -80,4 +87,47 @@ RULES = [
            "job_nodes_amount": "ONE",
            "mem_swap_used": True},
        "msg": "Swap was used on the node"},
    {
        "attrs": {
            "gpu_job": True,
            "gpus_amount": "ONE",
            "gpu_usage_max": "LOW"},
        "msg": "The single GPU has low usage"},
    {
        "attrs": {
            "gpu_job": True,
            "gpus_amount": "MULT",
            "gpu_usage_max": "LOW"},
        "msg": "All GPUs have low usage"},
    {
        "attrs": {
            "gpu_job": True,
            "gpus_amount": "MULT",
            "gpu_usage_max": ["NORM", "HIGH"],
            "gpu_usage_min": "LOW"},
        "msg": "Some of the GPUs have low usage which might be caused by workload imbalance"},
    {
        "attrs": {
            "gpu_job": True,
            "gpus_amount": "MULT",
            "gpu_usage_min": "ZERO"},
        "msg": "Some of the GPUs were not used"},
    {
        "attrs": {
            "gpu_job": True,
            "gpus_amount": "MULT",
            "gpu_usage_max": "ZERO"},
        "msg": "None of the GPUs were used"},
    {
        "attrs": {
            "gpu_job": True,
            "gpus_amount": "MULT",
            "gpus_overcrowded_exist": True},
        "msg": "Some GPUs have more than 1 process using them at the same time"},
    {
        "attrs": {
            "gpu_job": True,
            "gpus_amount": "ONE",
            "gpus_overcrowded_exist": True},
        "msg": "The GPU has been used by more than 1 process simultaneously"},
]
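Each entry in `RULES` pairs an attribute pattern with a message, and a pattern value may be a single value or a list of acceptable values (e.g. `["NORM", "HIGH"]`). The matcher itself is not part of this commit; a hypothetical sketch of how such matching might work:

```python
def match_rules(job_attrs, rules):
    """Return the messages of every rule whose attrs all match the job.

    A list-valued pattern matches if the job's value is any of its elements.
    """
    msgs = []
    for rule in rules:
        for key, want in rule["attrs"].items():
            have = job_attrs.get(key)
            matched = have in want if isinstance(want, list) else have == want
            if not matched:
                break  # one mismatching attribute rejects the rule
        else:
            msgs.append(rule["msg"])  # all attributes matched
    return msgs
```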