Hello,
I’d like to get the CPU usage per job using Telegraf > InfluxDB > Grafana.
Let’s say a user requested 20 cores, but the job is only using 4 cores.
In this case I’m expecting to see 20% usage.
However, I’m seeing 100% usage for that job in Grafana.
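To make the expected number concrete, here is a minimal Python sketch of the calculation I have in mind. It assumes cpuacct.usage is the cumulative CPU time of the cgroup in nanoseconds and that two samples are available; the function name and the sample values are hypothetical.

```python
def cpu_usage_percent(usage_start_ns, usage_end_ns, interval_s, requested_cores):
    """Share of the requested cores actually used between two samples."""
    # CPU time consumed in the interval, converted from ns to core-seconds.
    used_core_seconds = (usage_end_ns - usage_start_ns) / 1e9
    # CPU time the job could have consumed with all requested cores busy.
    available_core_seconds = interval_s * requested_cores
    return used_core_seconds / available_core_seconds * 100


# Hypothetical samples: 4 cores fully busy for 10 s, 20 cores requested.
start_ns = 0
end_ns = 4 * 10 * 1_000_000_000
print(cpu_usage_percent(start_ns, end_ns, interval_s=10, requested_cores=20))
```

With these made-up samples the job used 40 core-seconds out of 200 available, which is the 20% I would expect to see.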
This is the relevant part of my telegraf.conf:
[[inputs.cgroup]]
  paths = [
    "/sys/fs/cgroup/memory/slurm/uid_*/job_*",
    "/sys/fs/cgroup/cpu/slurm/uid_*/job_*"
  ]
  files = ["memory.usage_in_bytes", "cpuacct.usage"]

[[processors.regex]]
  namepass = ["cgroup"]

  [[processors.regex.tags]]
    key = "path"
    pattern = "/sys/fs/cgroup/.+/slurm/uid_(?P<uid>[^/]+)/job_(?P<job_id>[^/]+)"
    replacement = "${uid}"
    result_key = "uid"

  [[processors.regex.tags]]
    key = "path"
    pattern = "/sys/fs/cgroup/.+/slurm/uid_(?P<uid>[^/]+)/job_(?P<job_id>[^/]+)"
    replacement = "${job_id}"
    result_key = "jobid"
This is the query from Grafana:
SELECT MEAN("cpuacct.usage") / max("cpuacct.usage") * 100 AS "CPU Usage (%)"
FROM "cgroup"
WHERE time >= now() - 5m
GROUP BY time(5m), "jobid", "cluster"
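My guess at why this reads close to 100%, assuming cpuacct.usage is a cumulative counter: within a short window the mean of the samples is almost equal to the maximum, so the ratio says little about how many cores are busy. The Python sketch below illustrates this with made-up samples; the helper names, the one hour of prior run time, and the 10 s sampling interval are all hypothetical.

```python
NS_PER_S = 1_000_000_000


def mean_over_max_percent(samples):
    """Mimic MEAN(x) / MAX(x) * 100 over one query window."""
    return sum(samples) / len(samples) / max(samples) * 100


def cumulative_samples(busy_cores, prior_runtime_s=3600, step_s=10, count=30):
    """Hypothetical cpuacct.usage readings (ns) for a job using busy_cores."""
    # CPU time already accumulated before the query window starts.
    start_ns = prior_runtime_s * busy_cores * NS_PER_S
    # The counter only grows: each sample adds step_s seconds on every busy core.
    return [start_ns + i * step_s * busy_cores * NS_PER_S for i in range(count)]


# The ratio barely depends on how many cores are busy.
for cores in (4, 20):
    print(cores, round(mean_over_max_percent(cumulative_samples(cores)), 1))
```

Both cases come out well above 90%, which would match the panel never showing anything like 20%. If that is the cause, the query would need the rate of change of the counter divided by the number of requested cores, rather than the mean divided by the maximum.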
Is this possible using Telegraf, or do I need to use Prometheus-slurm-exporter?
Thank you