Commit 642e547f authored by akhuziy's avatar akhuziy

added attributes and rules for GPUs

parent 725e067c
......@@ -42,7 +42,7 @@ The optimal CPU usages depends on the amount of cores requested. When the number
`NORM` - The maximum `cpu_usage` is in the tolerable range
`LOW` - The maximum `cpu_usage` is lower than expected
`LOW` - The maximum `cpu_usage` is lower than tolerable value
`ZERO` - The maximum `cpu_usage` is 0
......@@ -77,7 +77,7 @@ Similar to the attribute `node_cpu_usage_max` but indicates the minimum of `cpu_
`NORM` - The minimum `cpu_usage` is in the tolerable range
`LOW` - The minimum `cpu_usage` is lower than expected
`LOW` - The minimum `cpu_usage` is lower than tolerable value
`ZERO` - The minimum `cpu_usage` is 0
......@@ -89,7 +89,7 @@ Similar to the attribute `node_cpu_usage_max` but indicates the minimum of `cpu_
`ZERO` = if such node exists, that `U = 0`
`LOW` = !`ZERO` if such node exists, that `U < 50` & `R <= 8` | `U < 80` & `R > 8` holds
`LOW` = !`ZERO` & if such node exists, that `U < 50` & `R <= 8` | `U < 80` & `R > 8` holds
`NORM` = !`ZERO` & !`LOW` & such node exists, that `U > 50` & `R <= 8` | `U > 80` & `R > 8` holds
......@@ -111,7 +111,7 @@ The attribute is very similar to **node_cpu_usage_max** but instead of global ma
`NORM` - The maximum `cpu_usage` is in the tolerable range
`LOW` - The maximum `cpu_usage` is lower than expected
`LOW` - The maximum `cpu_usage` is lower than tolerable value
**Calculation**:
......@@ -129,7 +129,7 @@ The attribute is very similar to **node_cpu_usage_max** but instead of global ma
The attribute for the amount of nodes allocated for the job.
It simply divides the job into 2 categories: running on a single node and running on multiple nodes.
It simply divides the jobs into 2 categories: running on a single node and running on multiple nodes.
**Values**:
......@@ -171,9 +171,7 @@ It is `True` if the node has a high load during the interval `D`.
**Calculation**:
If during the interval `D` (300 seconds) consequent measurements of `load1` on the node exceeds the amount of cores the node has, then the value is `True`.
`False` otherwise.
If during the interval `D` (300 seconds) consecutive measurements of `load1` on the node exceed the number of cores the node has, then the value is `True`, otherwise `False`.
## mem_swap_used
......@@ -188,3 +186,96 @@ It is `True` if there is a node where swap memory was used.
**Calculation**:
For every node check if the `mem_swap_max` value is non-zero. If it is non-zero for any node, then set the attribute to `True`, otherwise `False`.
## gpu_job
It is `True` if the job used at least one GPU.
**Values**:
`True` - if GPU was used
`False` - otherwise
**Calculation**:
For every node, check whether a GPU was used. If any GPU was used, set the attribute to `True`, otherwise `False`.
## gpu_usage_max
This attribute indicates the maximum GPU usage among all GPUs that were running processes of the job.
**Values**:
`HIGH` - The maximum GPU usage is almost 100%
`NORM` - The maximum GPU usage is in the tolerable range
`LOW` - The maximum GPU usage is lower than tolerable value
`ZERO` - The maximum GPU usage is 0
**Calculation**:
`U` - the GPU usage of a particular GPU (max 100%).
`ZERO` = if such GPU exists, that `U <= 0.5`
`LOW` = !`ZERO` & if such GPU exists, that `U < 50` holds
`NORM` = !`ZERO` & !`LOW` & such GPU exists, that `U < 90` holds
`HIGH` = !`ZERO` & !`LOW` & !`NORM`
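The thresholds above can be sketched as a small classifier (an illustrative sketch, not the project's code; the function name and the input list are assumptions):

```python
def classify_gpu_usage_max(gpu_usages):
    """Classify per-GPU usage values (0-100) into the categories above.

    gpu_usages: maximum usage observed for each GPU running job processes.
    """
    if any(u <= 0.5 for u in gpu_usages):   # ZERO: some GPU was effectively idle
        return "ZERO"
    if any(u < 50 for u in gpu_usages):     # LOW: some GPU below the tolerable value
        return "LOW"
    if any(u < 90 for u in gpu_usages):     # NORM: some GPU in the tolerable range
        return "NORM"
    return "HIGH"                           # HIGH: all GPUs at almost 100%
```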
## gpu_usage_min
This attribute is similar to `gpu_usage_max` but indicates the minimum GPU usage among all GPUs that were running processes of the job.
**Values**:
`HIGH` - The minimum GPU usage is almost 100%
`NORM` - The minimum GPU usage is in the tolerable range
`LOW` - The minimum GPU usage is lower than tolerable value
`ZERO` - The minimum GPU usage is 0
**Calculation**:
`U` - the GPU usage of a particular GPU (max 100%).
`ZERO` = if such GPU exists, that `U <= 0.5`
`LOW` = !`ZERO` & if such GPU exists, that `U < 50` holds
`NORM` = !`ZERO` & !`LOW` & such GPU exists, that `U < 90` holds
`HIGH` = !`ZERO` & !`LOW` & !`NORM`
## gpus_amount
The attribute for the number of GPUs used during the runtime of the job.
It simply divides the jobs into 2 categories: using a single GPU and using multiple GPUs.
**Values**:
`ONE` - if the number of GPUs equals 1
`MULT` - if the number of GPUs is greater than 1
## gpus_overcrowded_exist
It is `True` if any GPU has multiple processes running on it during the interval `D`.
**Values**:
`True` - if such GPU exists
`False` - otherwise
**Calculation**:
If during the interval `D` (1800 seconds) consecutive measurements of the number of processes on the GPU exceed 1, then the value is `True`, otherwise `False`.
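The check can be sketched as follows (an illustrative simplification; the names are assumptions, and `delta` is the sampling period in seconds):

```python
import math

GPU_LONG_DUR_SEC = 1800  # the interval D

def gpu_overcrowded(proc_counts, delta):
    """True if the process count on a GPU exceeds 1 for D consecutive seconds.

    proc_counts: process-count samples taken every `delta` seconds
    (None marks a missing measurement).
    """
    max_points = math.ceil(GPU_LONG_DUR_SEC / delta)  # samples needed to cover D
    consecutive = 0
    for p in proc_counts:
        if consecutive >= max_points:
            return True
        if p is None:
            continue
        consecutive = consecutive + 1 if p >= 2 else 0
    return consecutive >= max_points
```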
......@@ -9,6 +9,7 @@ Some nodes are overloaded. Probably by other processes
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `False`
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) | `NORM`
[overloaded_node_exists](./attributes.md#overloaded_node_exists) | `True`
......@@ -31,6 +32,7 @@ The CPU usage on some nodes is low, please adjust the amount of requested resour
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `False`
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) | `LOW`
......@@ -49,6 +51,7 @@ attribute name | value(s)
[req_walltime](./attributes.md#req_walltime) | `NORM`
[node_cpu_usage_min](./attributes.md#node_cpu_usage_min) | `ZERO`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `NORM`
[gpu_usage_min](./attributes.md#gpu_usage_min) | `ZERO`
---
#### recommendation
......@@ -58,6 +61,7 @@ The CPU usage is not distributed equally among the nodes. Try to use the nodes e
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `False`
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[cpu_usage_total_max](./attributes.md#cpu_usage_total_max) | `NORM`
......@@ -72,6 +76,7 @@ The CPU usage of the job is low on all nodes, please request appropriate amount
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `False`
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[mem_usage_total](./attributes.md#mem_usage_total) | `LOW`
......@@ -87,6 +92,7 @@ The CPU usage of the job is low on all nodes. Please check if the job is running
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `False`
[job_nodes_amount](./attributes.md#job_nodes_amount) | `MULT`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[mem_usage_total](./attributes.md#mem_usage_total) | `NORM`
......@@ -102,6 +108,7 @@ The CPU usage of the node is low. It might indicate that the job is not running
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `False`
[job_nodes_amount](./attributes.md#job_nodes_amount) | `ONE`
[req_walltime](./attributes.md#req_walltime) | `NORM`
[node_cpu_usage_max](./attributes.md#node_cpu_usage_max) | `LOW`
......@@ -154,3 +161,88 @@ attribute name | value(s)
[job_nodes_amount](./attributes.md#job_nodes_amount) | `ONE`
[mem_swap_used](./attributes.md#mem_swap_used) | `True`
---
#### recommendation
```
The single GPU has low usage
```
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `True`
[gpus_amount](./attributes.md#gpus_amount) | `ONE`
[gpu_usage_max](./attributes.md#gpu_usage_max) | `LOW`
---
#### recommendation
```
All GPUs have low usage
```
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `True`
[gpus_amount](./attributes.md#gpus_amount) | `MULT`
[gpu_usage_max](./attributes.md#gpu_usage_max) | `LOW`
---
#### recommendation
```
Some of the GPUs have low usage which might be caused by workload imbalance
```
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `True`
[gpus_amount](./attributes.md#gpus_amount) | `MULT`
[gpu_usage_max](./attributes.md#gpu_usage_max) | `NORM, HIGH`
[gpu_usage_min](./attributes.md#gpu_usage_min) | `LOW`
---
#### recommendation
```
Some of the GPUs were not used
```
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `True`
[gpus_amount](./attributes.md#gpus_amount) | `MULT`
[gpu_usage_min](./attributes.md#gpu_usage_min) | `ZERO`
---
#### recommendation
```
None of the GPUs were used
```
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `True`
[gpus_amount](./attributes.md#gpus_amount) | `MULT`
[gpu_usage_max](./attributes.md#gpu_usage_max) | `ZERO`
---
#### recommendation
```
Some GPUs have more than 1 process using them at the same time
```
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `True`
[gpus_amount](./attributes.md#gpus_amount) | `MULT`
[gpus_overcrowded_exist](./attributes.md#gpus_overcrowded_exist) | `True`
---
#### recommendation
```
The GPU has been used by more than 1 process simultaneously
```
#### attributes
attribute name | value(s)
--- | ---
[gpu_job](./attributes.md#gpu_job) | `True`
[gpus_amount](./attributes.md#gpus_amount) | `ONE`
[gpus_overcrowded_exist](./attributes.md#gpus_overcrowded_exist) | `True`
......@@ -3,6 +3,7 @@ import math
from db.aggrstruct import *
LONG_DUR_SEC = 300
GPU_LONG_DUR_SEC = 1800
def accepts(*types, **kw):
'''Function decorator. Checks decorated function's arguments are
......@@ -53,7 +54,7 @@ def node_has_high_load_interval(node):
seq = node.seq_load_max
max_points = math.ceil(seq.delta / LONG_DUR_SEC)
max_points = math.ceil(LONG_DUR_SEC / seq.delta)
max_load = node.sockets * node.cores_per_socket * (node.phys_thr_core + node.virt_thr_core)
for p in seq.seq:
......@@ -70,13 +71,26 @@ def node_has_high_load_interval(node):
return has_high_load
@accepts(GPUData)
def gpu_has_overcrowded_interval(gpu):
    has_overcrowded = False
    conseq_points = 0
    seq = gpu.seq_cpu_proc_count
    max_points = math.ceil(GPU_LONG_DUR_SEC / seq.delta)
    for p in seq.seq:
        if conseq_points >= max_points:
            has_overcrowded = True
            break
        if p is None: continue
        if p >= 2:
            conseq_points += 1
        else:
            conseq_points = 0
    return has_overcrowded

@accepts(Aggregator)
def has_high_load_interval(aggr):
    has_high_load = False
    for node in aggr.nodes:
        if node_has_high_load_interval(node):
            has_high_load = True
    return has_high_load
RULES = [
{
"attrs": {
"gpu_job": False,
"job_nodes_amount": "MULT",
"cpu_usage_total_max": "NORM",
"overloaded_node_exists": True},
......@@ -11,6 +12,7 @@ RULES = [
"msg": "The requested walltime is too high. Try to request less time"},
{
"attrs": {
"gpu_job": False,
"job_nodes_amount": "MULT",
"req_walltime": "NORM",
"cpu_usage_total_max": "LOW",
......@@ -22,10 +24,12 @@ RULES = [
"job_nodes_amount": "MULT",
"req_walltime": "NORM",
"node_cpu_usage_min": "ZERO",
"node_cpu_usage_max": "NORM"},
"node_cpu_usage_max": "NORM",
"gpu_usage_min": "ZERO"},
"msg": "Some nodes were not used during the runtime"},
{
"attrs": {
"gpu_job": False,
"job_nodes_amount": "MULT",
"req_walltime": "NORM",
"cpu_usage_total_max": "NORM",
......@@ -34,6 +38,7 @@ RULES = [
"msg": "The CPU usage is not distributed equally among the nodes. Try to use the nodes evenly"},
{
"attrs": {
"gpu_job": False,
"job_nodes_amount": "MULT",
"req_walltime": "NORM",
"mem_usage_total": "LOW",
......@@ -43,6 +48,7 @@ RULES = [
"msg": "The CPU usage of the job is low on all nodes, please request appropriate amount of resources"},
{
"attrs": {
"gpu_job": False,
"job_nodes_amount": "MULT",
"req_walltime": "NORM",
"mem_usage_total": "NORM",
......@@ -52,6 +58,7 @@ RULES = [
"msg": "The CPU usage of the job is low on all nodes. Please check if the job is running properly"},
{
"attrs": {
"gpu_job": False,
"job_nodes_amount": "ONE",
"req_walltime": "NORM",
"node_cpu_usage_max": "LOW"},
......@@ -80,4 +87,47 @@ RULES = [
"job_nodes_amount": "ONE",
"mem_swap_used": True},
"msg": "Swap was used on the node"},
{
"attrs": {
"gpu_job": True,
"gpus_amount": "ONE",
"gpu_usage_max": "LOW"},
"msg": "The single GPU has low usage"},
{
"attrs": {
"gpu_job": True,
"gpus_amount": "MULT",
"gpu_usage_max": "LOW"},
"msg": "All GPUs have low usage"},
{
"attrs": {
"gpu_job": True,
"gpus_amount": "MULT",
"gpu_usage_max": ["NORM", "HIGH"],
"gpu_usage_min": "LOW"},
"msg": "Some of the GPUs have low usage which might be caused by workload imbalance"},
{
"attrs": {
"gpu_job": True,
"gpus_amount": "MULT",
"gpu_usage_min": "ZERO"},
"msg": "Some of the GPUs were not used"},
{
"attrs": {
"gpu_job": True,
"gpus_amount": "MULT",
"gpu_usage_max": "ZERO"},
"msg": "None of the GPUs were used"},
{
"attrs": {
"gpu_job": True,
"gpus_amount": "MULT",
"gpus_overcrowded_exist": True},
"msg": "Some GPUs have more than 1 process using them at the same time"},
{
"attrs": {
"gpu_job": True,
"gpus_amount": "ONE",
"gpus_overcrowded_exist": True},
"msg": "The GPU has been used by more than 1 process simultaneously"},
]
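Each rule's `attrs` are matched against the job's computed attributes; a list value such as `["NORM", "HIGH"]` matches any of its entries. A minimal matcher sketch (a hypothetical helper, not part of this commit):

```python
def matching_messages(job_attrs, rules):
    """Return the messages of all rules whose attrs match job_attrs.

    A list value in a rule matches if the job's value is any of its entries.
    """
    msgs = []
    for rule in rules:
        ok = True
        for name, expected in rule["attrs"].items():
            actual = job_attrs.get(name)
            # Lists act as "any of"; scalars require exact equality.
            ok = actual in expected if isinstance(expected, list) else actual == expected
            if not ok:
                break
        if ok:
            msgs.append(rule["msg"])
    return msgs
```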