cgroups/Slurm exporter for Telegraf?

Hello,

I’d like to get the CPU usage per job using Telegraf > InfluxDB > Grafana.
Let’s say a user requested 20 cores, but the job is only using 4 cores.
In this case I’m expecting to see 20% usage.
However, I’m seeing 100% usage for that job in Grafana.

This is the relevant part of my telegraf.conf:

[[inputs.cgroup]]
  paths = [
    "/sys/fs/cgroup/memory/slurm/uid_*/job_*",
    "/sys/fs/cgroup/cpu/slurm/uid_*/job_*"
  ]
  files = ["memory.usage_in_bytes", "cpuacct.usage"]
 

[[processors.regex]]
  namepass = ["cgroup"]
  [[processors.regex.tags]]
    key = "path"
    pattern = "/sys/fs/cgroup/.+/slurm/uid_(?P<uid>[^/]+)/job_(?P<job_id>[^/]+)"
    replacement = "${uid}"
    result_key = "uid"

  [[processors.regex.tags]]
    key = "path"
    pattern = "/sys/fs/cgroup/.+/slurm/uid_(?P<uid>[^/]+)/job_(?P<job_id>[^/]+)"
    replacement = "${job_id}"
    result_key = "jobid"

This is the query from Grafana:

SELECT MEAN("cpuacct.usage") / max("cpuacct.usage") * 100 AS "CPU Usage (%)"
FROM "cgroup"
WHERE time >= now() - 5m
GROUP BY time(5m), "jobid", "cluster"

Is this possible using Telegraf or do I need to use Prometheus-slurm-exporter?

Thank you

My first step would be to look at your actual data and see what you are getting from Telegraf. The value reported by cpuacct.usage is the total CPU time consumed by the cgroup, in nanoseconds; it is a cumulative counter, not a utilization percentage.

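I would have expected a calculation where you take the maximum minus the minimum over the window and divide by the elapsed time. To express that as a share of the allocation, you would also divide by the number of requested cores. See: Calculating CPU usage of a cgroup over a period of time - Unix & Linux Stack Exchange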
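To make the arithmetic concrete, here is a minimal Python sketch of that calculation. It is not something Telegraf or Grafana runs for you; the function names, the sample values, and the idea of passing the allocated core count in by hand are all assumptions for illustration. It also shows why a mean/max ratio on a cumulative counter tends to sit near 100% regardless of how busy the job is.

```python
from typing import Sequence, Tuple

NS_PER_SECOND = 1_000_000_000  # cpuacct.usage is reported in nanoseconds


def cpu_utilization_percent(
    samples: Sequence[Tuple[float, int]], allocated_cores: int
) -> float:
    """Utilization of the allocation over the sampled window.

    samples: (timestamp_seconds, cpuacct_usage_ns) pairs, sorted by time.
    allocated_cores: cores requested by the job (hypothetical input; it is
    not part of the cgroup metrics shown in this thread).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    if allocated_cores <= 0:
        raise ValueError("allocated_cores must be positive")
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must span a positive time interval")
    # The counter only grows, so last minus first equals max minus min.
    cpu_seconds = (u1 - u0) / NS_PER_SECOND
    return cpu_seconds / (elapsed * allocated_cores) * 100.0


def naive_ratio_percent(samples: Sequence[Tuple[float, int]]) -> float:
    """mean / max * 100 on the raw counter, as in the original query."""
    values = [u for _, u in samples]
    return (sum(values) / len(values)) / max(values) * 100.0


# Made-up data: a job that has already accumulated a lot of CPU time,
# then keeps 4 cores busy for 300 seconds.
start_ns = 50_000 * NS_PER_SECOND
samples = [(0.0, start_ns), (300.0, start_ns + 4 * 300 * NS_PER_SECOND)]

print(round(cpu_utilization_percent(samples, allocated_cores=20), 2))
print(round(naive_ratio_percent(samples), 2))
```

With these made-up numbers the first value comes out at about 20, matching the "4 of 20 cores" expectation, while the mean/max ratio stays close to 100 because both the mean and the max are dominated by the CPU time the job accumulated before the window started.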
