cgroups/Slurm exporter for telegraf?

volhpc · July 17, 2024, 2:39pm

Hello,

I’d like to get the CPU usage per job using Telegraf > InfluxDB > Grafana.
Let’s say a user requested 20 cores, but the job is only using 4 cores.
In this case I’m expecting to see 20% usage.
However, I’m seeing 100% usage for that job in Grafana.

This is the relevant part of my telegraf.conf:

[[inputs.cgroup]]
  paths = [
    "/sys/fs/cgroup/memory/slurm/uid_*/job_*",
    "/sys/fs/cgroup/cpu/slurm/uid_*/job_*"
  ]
  files = ["memory.usage_in_bytes", "cpuacct.usage"]
 

[[processors.regex]]
  namepass = ["cgroup"]
  [[processors.regex.tags]]
    key = "path"
    pattern = "/sys/fs/cgroup/.+/slurm/uid_(?P<uid>[^/]+)/job_(?P<job_id>[^/]+)"
    replacement = "${uid}"
    result_key = "uid"

  [[processors.regex.tags]]
    key = "path"
    pattern = "/sys/fs/cgroup/.+/slurm/uid_(?P<uid>[^/]+)/job_(?P<job_id>[^/]+)"
    replacement = "${job_id}"
    result_key = "jobid"

This is the query from Grafana:

SELECT MEAN("cpuacct.usage") / max("cpuacct.usage") * 100 AS "CPU Usage (%)"
FROM "cgroup"
WHERE time >= now() - 5m
GROUP BY time(5m), "jobid", "cluster"

Is this possible using Telegraf or do I need to use Prometheus-slurm-exporter?

Thank you

jpowers · July 17, 2024, 3:01pm

My first step would be to look at your actual data and see what you are getting from Telegraf. The value reported by cpuacct.usage is the total CPU time consumed by the cgroup (a cumulative counter, in nanoseconds); it is not a utilization percentage.
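To illustrate why the ratio in the query tends to sit near 100%, here is a rough Python sketch. The sample values are made up (a hypothetical job that has already run for an hour on 4 busy cores, sampled every 10 seconds for 5 minutes); they are not taken from your data. Because the counter only grows, its mean over a short window is always close to its max, regardless of how many cores were requested.

```python
from statistics import mean


def mean_over_max_percent(samples_ns):
    """Reproduce the MEAN(x) / MAX(x) * 100 expression from the query."""
    return mean(samples_ns) / max(samples_ns) * 100


# Hypothetical cumulative counter: 1 hour already consumed on 4 cores,
# then 31 samples taken 10 seconds apart with 4 cores still busy.
NS_PER_S = 1_000_000_000
start_ns = 3600 * 4 * NS_PER_S
samples = [start_ns + i * 10 * 4 * NS_PER_S for i in range(31)]

ratio = mean_over_max_percent(samples)
print(f"mean/max = {ratio:.1f}%")
```

The longer the job has been running, the closer this ratio gets to 100%, which would match what you see in Grafana.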

I would have expected a calculation where you take the maximum minus the minimum and divide by the elapsed time. See: Calculating CPU usage of a cgroup over a period of time - Unix & Linux Stack Exchange
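As a minimal sketch of that calculation in Python (not a Telegraf feature): take two cpuacct.usage readings, divide the difference by the elapsed wall-clock time, and scale by the number of cores the job was allocated. The function name and parameters are hypothetical, and the allocated core count is an assumption here; it is not in the cgroup files shown above and would have to come from Slurm (for example from `scontrol show job`).

```python
NS_PER_S = 1_000_000_000  # cpuacct.usage is reported in nanoseconds


def cpu_usage_percent(usage_start_ns, usage_end_ns, interval_s, allocated_cores):
    """Percent of the allocated CPU capacity used between two samples.

    usage_start_ns / usage_end_ns: cpuacct.usage at the start and end of
    the window (the min and max of the counter in that window).
    interval_s: elapsed wall-clock seconds between the two samples.
    allocated_cores: cores requested for the job (assumed to be supplied
    by the caller, e.g. looked up in Slurm).
    """
    if interval_s <= 0 or allocated_cores <= 0:
        raise ValueError("interval_s and allocated_cores must be positive")
    delta_ns = usage_end_ns - usage_start_ns
    capacity_ns = interval_s * allocated_cores * NS_PER_S
    return 100 * delta_ns / capacity_ns


# Example from the question: 20 cores requested, 4 kept busy for 5 minutes.
busy_ns = 4 * 300 * NS_PER_S
print(cpu_usage_percent(0, busy_ns, 300, 20))  # prints 20.0
```

Dropping the `allocated_cores` factor gives the number of cores in use on average (4 in this example) instead of a percentage of the allocation.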
