Nvidia + Zabbix

Zabbix

What is it

Zabbix is an open-source monitoring tool for diverse IT components, including networks, servers, virtual machines (VMs) and cloud services. It collects metrics such as network utilization, CPU load and disk space consumption. Monitoring can be configured with XML-based templates that list the elements to monitor. Zabbix monitors Linux, Hewlett Packard Unix (HP-UX), Mac OS X, Solaris and other operating systems (OSes); Windows monitoring, however, is only possible through agents.

Requirements

We have a server with two Nvidia GPUs that we use for neural networks. We want to monitor how heavily the GPUs are loaded.

How to do it

First, we have to install the required packages and configure how the statistics are collected.

Installing gpustat

Nvidia's stock software is awful, so we install gpustat instead. There are two ways to install it; choose either one.

  • First way:
sudo apt install gpustat
  • Second way:
sudo pip install gpustat

Configure cron task

Zabbix checks are subject to timeouts, so instead of running gpustat from the agent we run it from cron and write its output to a file. The job is scheduled to run every minute:

crontab -l
* * * * * /usr/local/bin/gpustat --json > /storage/docker-zabbix-agent/zbx_env/var/lib/zabbix/scripts/log/gpu_all.log

After that we have a file with JSON output:

{
    "hostname": "serverwithgpu",
    "query_time": "2020-08-25T11:30:01.756647",
    "gpus": [
        {
            "index": 0,
            "uuid": "GPU-b25d4db2-6730-ed49-394d-27e72110a700",
            "name": "GeForce RTX 2080 Ti",
            "temperature.gpu": 46,
            "fan.speed": 37,
            "utilization.gpu": 0,
            "power.draw": 56,
            "enforced.power.limit": 250,
            "memory.used": 9368,
            "memory.total": 11019,
            "processes": [
            ]
        },
        {
            "index": 1,
            "uuid": "GPU-e7f907fc-4d00-4f4f-dca3-1663ff9616d8",
            "name": "GeForce RTX 2080 Ti",
            "temperature.gpu": 45,
            "fan.speed": 35,
            "utilization.gpu": 0,
            "power.draw": 64,
            "enforced.power.limit": 250,
            "memory.used": 8058,
            "memory.total": 11019,
            "processes": [
            ]
        }
    ]
}
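To illustrate how this output can be consumed, here is a minimal Python sketch that pulls the per-GPU load and memory usage out of the structure shown above. The function name `summarize` and the derived `memory.percent` field are illustrative assumptions, not part of gpustat or Zabbix; the sample values are copied from the output above.

```python
import json

# Trimmed copy of the gpustat output shown above (only the keys used here).
SAMPLE = """
{
    "hostname": "serverwithgpu",
    "gpus": [
        {"index": 0, "utilization.gpu": 0, "memory.used": 9368, "memory.total": 11019},
        {"index": 1, "utilization.gpu": 0, "memory.used": 8058, "memory.total": 11019}
    ]
}
"""


def summarize(doc):
    """Return one dict per GPU with its load and memory usage in percent."""
    result = []
    for gpu in doc["gpus"]:
        result.append({
            "index": gpu["index"],
            "utilization.gpu": gpu["utilization.gpu"],
            # memory.percent is a derived, hypothetical field.
            "memory.percent": 100.0 * gpu["memory.used"] / gpu["memory.total"],
        })
    return result


if __name__ == "__main__":
    for row in summarize(json.loads(SAMPLE)):
        print(row)
```

With the sample values, the first GPU has roughly 85% of its memory in use and the second roughly 73%, even though both report 0% utilization at that moment.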

Some changes

I wanted to change one thing: the time format in the "query_time" key. I wrote a Python script which converts the ISO time to a UNIX timestamp.

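#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import json
import sys
from datetime import datetime

if __name__ == "__main__":
    # Read the gpustat JSON document from stdin.
    data = json.load(sys.stdin)
    d = data['query_time']
    # Parse the ISO time (it has no time zone, so it is treated as local time)
    # and replace it with a UNIX timestamp rounded to whole seconds.
    data['query_time'] = round(datetime.strptime(d, '%Y-%m-%dT%H:%M:%S.%f').timestamp())
    print(json.dumps(data))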
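
As a quick check of the conversion, the same parsing can be wrapped in a function and applied to the query_time value from the first output. This is only a sketch: the function name `iso_to_unix` is hypothetical, and because the ISO string carries no time zone it is interpreted in the machine's local time zone, so the resulting number differs between hosts.

```python
from datetime import datetime


def iso_to_unix(value):
    """Convert gpustat's ISO time string to a UNIX timestamp in whole seconds."""
    # The string has no time zone, so strptime returns a naive datetime and
    # timestamp() interprets it in the local time zone.
    parsed = datetime.strptime(value, '%Y-%m-%dT%H:%M:%S.%f')
    return round(parsed.timestamp())


if __name__ == "__main__":
    ts = iso_to_unix("2020-08-25T11:30:01.756647")
    # Converting back shows the value rounded to the nearest second.
    print(ts, datetime.fromtimestamp(ts))
```

Converting the result back with `datetime.fromtimestamp` gives 11:30:02 local time, because the fractional part .756647 is rounded up.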

I call the Python script and pass it the data through a pipe. I put the whole command into a shell script because it was too long for the crontab.

#!/bin/sh

# Directory with the scripts and the log file read by the Zabbix agent.
export ZSPATH="/storage/docker-zabbix-agent/zbx_env/var/lib/zabbix/scripts"
/usr/local/bin/gpustat --json | "${ZSPATH}/gpu.py" > "${ZSPATH}/log/gpu_all.log"

And I call the shell script from cron.

crontab -l
* * * * * /storage/docker-zabbix-agent/zbx_env/var/lib/zabbix/scripts/crongpy.sh > /dev/null 2>&1

After that I have the JSON with query_time as a plain UNIX timestamp.

[
  {
    "hostname": "serverwithgpu",
    "query_time": 1599202023,
    "gpus": [
      {
        "index": 0,
        "uuid": "GPU-b25d4db2-6730-ed49-394d-27e72110a700",
        "name": "GeForce RTX 2080 Ti",
        "temperature.gpu": 29,
        "fan.speed": 32,
        "utilization.gpu": 0,
        "power.draw": 51,
        "enforced.power.limit": 250,
        "memory.used": 0,
        "memory.total": 11019,
        "processes": []
      },
      {
        "index": 1,
        "uuid": "GPU-e7f907fc-4d00-4f4f-dca3-1663ff9616d8",
        "name": "GeForce RTX 2080 Ti",
        "temperature.gpu": 28,
        "fan.speed": 35,
        "utilization.gpu": 0,
        "power.draw": 31,
        "enforced.power.limit": 250,
        "memory.used": 0,
        "memory.total": 11019,
        "processes": []
      }
    ]
  }
]
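
One thing an integer query_time makes easy is checking how old the snapshot is, for example to notice that the cron job has stopped. The sketch below is an assumption about how such a check might look; the function names `snapshot_age` and `is_stale` and the 120-second limit (two missed cron runs) are hypothetical and not taken from the setup above.

```python
import time


def snapshot_age(query_time, now=None):
    """Return the age of a snapshot in seconds, given its UNIX query_time."""
    if now is None:
        now = time.time()
    return now - query_time


def is_stale(query_time, max_age=120, now=None):
    """True if the snapshot is older than max_age seconds (hypothetical limit)."""
    return snapshot_age(query_time, now) > max_age


if __name__ == "__main__":
    # query_time copied from the output above.
    print(is_stale(1599202023))
```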

Configure Zabbix Agent

cat /storage/docker-zabbix-agent/zbx_env/etc/zabbix/zabbix_agentd.d/gpusetj.conf
UserParameter=gpuset[*],cat /var/lib/zabbix/scripts/log/gpu_all.log
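
Before wiring the item into Zabbix, it can help to confirm that the text the UserParameter prints is valid JSON with the expected keys. The following sketch is one possible check. The function name `check_payload` and the lists of required keys are assumptions based on the gpustat output shown earlier, and it assumes the file holds the single JSON object that gpu.py prints.

```python
import json

# Keys seen in the sample output; treated here as required (an assumption).
REQUIRED_TOP = ("hostname", "query_time", "gpus")
REQUIRED_GPU = ("index", "utilization.gpu", "memory.used", "memory.total")


def check_payload(text):
    """Return a list of problems found in the text; empty means it looks fine."""
    try:
        doc = json.loads(text)
    except ValueError as exc:
        return ["not valid JSON: %s" % exc]
    if not isinstance(doc, dict):
        return ["top-level value is not an object"]
    problems = ["missing key: %s" % k for k in REQUIRED_TOP if k not in doc]
    for gpu in doc.get("gpus", []):
        for key in REQUIRED_GPU:
            if key not in gpu:
                problems.append("GPU %s: missing key %s" % (gpu.get("index"), key))
    return problems


if __name__ == "__main__":
    # A deliberately incomplete payload to show what the report looks like.
    print(check_payload('{"hostname": "serverwithgpu", "gpus": []}'))
```

To check the real file, read /var/lib/zabbix/scripts/log/gpu_all.log inside the agent container and pass its contents to `check_payload`.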