Hello there,
It has come to my attention that the inputs “cpu” and “system” are reporting strange values for the CPU usage of a system of mine.
In particular, I’ve compared the values I get from Telegraf with those from two other tools: Zabbix and sysstat.
Let me offer some context. There’s a timeframe where a machine had an unusual spike in CPU usage, as reported by sysstat:
23:00:01  CPU  %user  %nice  %system  %iowait  %steal  %idle
23:05:01  all   0,20   0,00     0,14     0,00    0,00  99,65
23:10:33  all  29,55   0,00    30,46    18,66    0,04  21,29
23:15:24  all  27,77   0,00    36,37    33,68    0,06   2,12
23:20:25  all  29,56   0,00    37,97    32,33    0,06   0,08
23:25:02  all  28,70   0,00    35,46    32,92    0,06   2,86
23:30:34  all  29,77   0,00    37,64    32,37    0,05   0,16
23:35:42  all  30,28   0,00    36,53    33,05    0,05   0,09
23:40:46  all  31,07   0,00    37,12    31,65    0,06   0,10
23:45:02  all  32,16   0,00    36,79    30,99    0,05   0,02
23:50:01  all   4,60   0,00     3,20     2,96    0,01  89,24
23:55:01  all   0,13   0,00     0,11     0,01    0,00  99,74
00:00:01  all   0,17   0,00     0,12     0,01    0,00  99,69
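To be explicit about what I mean by "CPU usage" here: I'm reading it as 100 minus %idle. Here is a minimal Python sketch of that reading, using three rows copied from the sar output above. The helper name busy_percent is just something I made up for illustration, and the comma decimal separator comes from my locale.

```python
# Rows copied from the sar output above (comma used as decimal separator).
SAR_ROWS = """\
23:05:01 all 0,20 0,00 0,14 0,00 0,00 99,65
23:20:25 all 29,56 0,00 37,97 32,33 0,06 0,08
23:50:01 all 4,60 0,00 3,20 2,96 0,01 89,24
"""


def busy_percent(row: str) -> float:
    """Return 100 - %idle for one sar line (hypothetical helper)."""
    fields = row.split()
    # %idle is the last column; convert "99,65" to "99.65" before parsing.
    idle = float(fields[-1].replace(",", "."))
    return round(100.0 - idle, 2)


# Map each timestamp to its non-idle percentage.
busy = {row.split()[0]: busy_percent(row) for row in SAR_ROWS.splitlines()}

for timestamp, pct in busy.items():
    print(timestamp, pct)
```

By this reading the machine is almost completely busy during the spike, which is the figure I'm trying to find in the Telegraf data.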
Zabbix reports similar CPU usage, as shown below:
However, when I compare that information with the data in InfluxDB, which is collected by Telegraf, I find some discrepancies. See the mean CPU usage graph in Grafana for that timeframe (I can’t embed more than one image in the post, so I’m linking it; I hope that’s OK):
As you can see, it’s around 10-15%, which doesn’t match what the other tools report. At first I thought it had something to do with the mean() transformation that I’m applying in that graph, but I’ve checked the data and I still can’t make sense of it.
Let me show you. Let’s start with the CPU usage as reported by the cpu input [1]. If I check the max() values obtained from Telegraf [2], I get this:
time user system softirq steal nice irq iowait guest guest_nice kk
---- ---- ------ ------- ----- ---- --- ------ ----- ---------- --
1596397231000000000 13.293634152433443 3.044535771184912 0.05871005619344671 0.5770195684807924 0 0 0.00837731423297606 0 0 99.90818030099894
The mean() of that timeframe [3] reports this:
time user system softirq steal nice irq iowait guest guest_nice kk
---- ---- ------ ------- ----- ---- --- ------ ----- ---------- --
1596397231000000000 7.900581297676124 1.8773078727536534 0.02697589764955647 0.05810101980085044 0 0 0.00017586437696013338 0 0 90.13685804774548
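As a sanity check on the Telegraf data itself, the mean() values above can be summed to see whether any CPU time is unaccounted for. This is only a sketch: the numbers are copied from the query result, and it assumes the ten usage_* fields are meant to add up to 100% (my assumption, not something I've confirmed in the Telegraf documentation).

```python
# mean() values copied from the query result above.
mean_usage = {
    "user": 7.900581297676124,
    "system": 1.8773078727536534,
    "softirq": 0.02697589764955647,
    "steal": 0.05810101980085044,
    "nice": 0.0,
    "irq": 0.0,
    "iowait": 0.00017586437696013338,
    "guest": 0.0,
    "guest_nice": 0.0,
    "idle": 90.13685804774548,  # aliased as "kk" in the query
}

# Everything except idle counts as "busy" under the assumption above.
busy = sum(value for name, value in mean_usage.items() if name != "idle")
total = busy + mean_usage["idle"]

print(f"busy={busy:.2f}% total={total:.2f}%")
```

The non-idle fields add up to just under 10%, and adding idle brings the total to about 100%, so the Telegraf values are consistent with each other. They just don't agree with sysstat and Zabbix.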
The load1, load5, and load15 values also do not fit what the rest of the monitoring tools are saying (a load of ~30). These are gathered by the Telegraf system input [4]:
Could someone shed some light on why this is happening? If I understand correctly, the cpu input reports CPU utilization as a percentage, and the load1, load5, and load15 values from the system input return the system load as a plain number rather than a percentage. Is this correct?
If it is, why is there such a disparity among these tools?
The machine is a KVM VM and I’m using Telegraf 1.4.5, in case it is relevant.
Thank you all in advance.
[1]: My “cpu” input configuration
[[inputs.cpu]]
collect_cpu_time = false
totalcpu = true
percpu = true
[2]: Query to get the max values from the influxdb backend that stores the CPU input data.
SELECT max(usage_user) AS "user", max(usage_system) AS "system",
max(usage_softirq) AS "softirq", max(usage_steal) AS "steal", max(usage_nice) AS "nice",
max(usage_irq) AS "irq", max(usage_iowait) AS "iowait", max(usage_guest) AS "guest",
max(usage_guest_nice) AS "guest_nice", max(usage_idle) AS "kk"
FROM "cpu"
WHERE "host" =~ /^[REDACTED]$/
AND cpu = 'cpu-total'
AND time >= [REDACTED]ms AND time <= [REDACTED]ms
[3]: Same as [2], but with mean values.
SELECT mean(usage_user) AS "user", mean(usage_system) AS "system",
mean(usage_softirq) AS "softirq", mean(usage_steal) AS "steal",
mean(usage_nice) AS "nice", mean(usage_irq) AS "irq", mean(usage_iowait) AS "iowait",
mean(usage_guest) AS "guest", mean(usage_guest_nice) AS "guest_nice",
mean(usage_idle) AS "kk"
FROM "cpu"
WHERE "host" =~ /^[REDACTED]$/
AND cpu = 'cpu-total'
AND time >= 1596397231000ms AND time <= 1596400797000ms
[4]: My system input configuration
[[inputs.system]]
# no extra configuration