Telegraf reporting strange CPU metrics

Hello there,

I’ve noticed that the “cpu” and “system” inputs are reporting strange values for the CPU usage of one of my systems.

In particular, I’ve compared the values I get from Telegraf against two other tools: Zabbix and Sysstat.

Let me offer some context. There’s a timeframe where a machine had an unusual spike in CPU usage, as reported by sysstat:

23:00:01        CPU     %user     %nice   %system   %iowait    %steal     %idle
23:05:01        all      0,20      0,00      0,14      0,00      0,00     99,65
23:10:33        all     29,55      0,00     30,46     18,66      0,04     21,29
23:15:24        all     27,77      0,00     36,37     33,68      0,06      2,12
23:20:25        all     29,56      0,00     37,97     32,33      0,06      0,08
23:25:02        all     28,70      0,00     35,46     32,92      0,06      2,86
23:30:34        all     29,77      0,00     37,64     32,37      0,05      0,16
23:35:42        all     30,28      0,00     36,53     33,05      0,05      0,09
23:40:46        all     31,07      0,00     37,12     31,65      0,06      0,10
23:45:02        all     32,16      0,00     36,79     30,99      0,05      0,02
23:50:01        all      4,60      0,00      3,20      2,96      0,01     89,24
23:55:01        all      0,13      0,00      0,11      0,01      0,00     99,74
00:00:01        all      0,17      0,00      0,12      0,01      0,00     99,69

Zabbix reports similar CPU usage, as shown below:

But when I try to compare that information with the data in InfluxDB collected by Telegraf, I find a few discrepancies. See the mean CPU usage graph in Grafana for that timeframe (I can’t embed more than one image in the post, so I’m linking it; hope that’s OK):

As you can see, it’s around ~10-15%, which doesn’t look like the same information the other tools are reporting. At first I thought it had something to do with the mean() transformation I’m applying in that graph, but I’ve checked the raw data and I still can’t make sense of it.

Let me show you. Let’s start with the CPU usage, as reported by the cpu input [1]. If I check the max() values obtained from Telegraf [2], I get this:

time                user               system            softirq             steal              nice irq iowait              guest guest_nice kk
----                ----               ------            -------             -----              ---- --- ------              ----- ---------- --
1596397231000000000 13.293634152433443 3.044535771184912 0.05871005619344671 0.5770195684807924 0    0   0.00837731423297606 0     0          99.90818030099894

The mean() over that timeframe [3] reports this:

time                user              system             softirq             steal               nice irq iowait                 guest guest_nice kk
----                ----              ------             -------             -----               ---- --- ------                 ----- ---------- --
1596397231000000000 7.900581297676124 1.8773078727536534 0.02697589764955647 0.05810101980085044 0    0   0.00017586437696013338 0     0          90.13685804774548

Checking the load1, load5 and load15 values also gives numbers that do not fit what the rest of the monitoring tools are saying (a load of ~30). These are also gathered by the system Telegraf input [4].
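For reference, something along these lines should pull those values out of the system measurement (load1, load5, load15 and n_cpus are the field names the system input writes; I’m reusing the host filter and time bounds from [3]):

SELECT max(load1) as "load1", max(load5) as "load5", max(load15) as "load15", 
max(n_cpus) as "n_cpus" FROM "system" 
WHERE 
"host" =~ /^[REDACTED]$/ 
AND time >= 1596397231000ms 
and time <= 1596400797000ms

Dividing load1 by n_cpus gives a rough per-core load, which is the closest thing to a percentage that the load values offer.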

Could someone shed some light on why this is happening? If I understand correctly, the cpu input reports CPU utilization as a percentage, while the load1, load5 and load15 values from the system input return the raw system load average (an absolute value, not a percentage). Is this correct?

If it is, why is there such a disparity between these tools?
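To make the comparison with sar more direct, the number to put next to sar’s 100 - %idle would be 100 - usage_idle over the same window, bucketed like sar’s ~5-minute samples. Something like this (untested, just to illustrate what I’m comparing):

SELECT 100 - mean(usage_idle) as "busy" FROM "cpu" 
WHERE 
"host" =~ /^[REDACTED]$/ 
and cpu = 'cpu-total' 
AND time >= 1596397231000ms 
and time <= 1596400797000ms 
GROUP BY time(5m)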

The machine is a KVM VM and I’m using Telegraf 1.4.5, in case that’s relevant.

Thank you all in advance.


[1]: My “cpu” input configuration

[[inputs.cpu]]
  ## Don't collect raw CPU time metrics, only usage percentages
  collect_cpu_time = false
  ## Report aggregated stats (the cpu-total series)
  totalcpu = true
  ## Also report per-CPU stats
  percpu = true

[2]: Query to get the max values from the InfluxDB backend that stores the cpu input data.

SELECT max(usage_user) as "user", max(usage_system) as "system", 
max(usage_softirq) as "softirq",  max(usage_steal) as "steal", max(usage_nice) as "nice", 
max(usage_irq) as "irq", max(usage_iowait) as "iowait", max(usage_guest) as "guest", 
max(usage_guest_nice) as "guest_nice", max(usage_idle) as "kk" FROM "cpu" 
WHERE 
"host" =~ /^[REDACTED]$/ 
and cpu = 'cpu-total' 
AND time >= [REDACTED]ms and time <= [REDACTED]ms

[3]: As [2], but mean values.

SELECT mean(usage_user) as "user", mean(usage_system) as "system", 
mean(usage_softirq) as "softirq", mean(usage_steal) as "steal", 
mean(usage_nice) as "nice", mean(usage_irq) as "irq", mean(usage_iowait) as "iowait", 
mean(usage_guest) as "guest", mean(usage_guest_nice) as "guest_nice",  
mean(usage_idle) as "kk" FROM "cpu" 
WHERE 
"host" =~ /^[REDACTED]$/ 
and cpu = 'cpu-total' 
AND time >= 1596397231000ms 
and time <= 1596400797000ms

[4]: My system input configuration

[[inputs.system]]
  # no extra configuration

I’ve never found any benchmarking to be very accurate inside a VM. The only way I could be certain was to measure the time taken to execute a task from outside the VM. But that was under VMware, so this discrepancy you’ve found is worth exploring.
