How to correctly query metrics from telegraf inputs.cpu?

On Windows, I use Telegraf to collect CPU metrics with the following configuration:

[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = false
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states
  report_active = false

When I run telegraf --test, the output looks like this:

cpu,cpu=cpu-total,host=someServer usage_guest=0,usage_guest_nice=0,usage_idle=94.86301369863014,usage_iowait=0,usage_irq=0,usage_nice=0,usage_softirq=0,usage_steal=0,usage_system=2.984344422700587,usage_user=2.152641878669276 1770795008000000000

However, when I query ‘usage_system’ and ‘usage_user’ in the InfluxDB Data Explorer (and in Grafana), the returned values are under 1%.

from(bucket: "hires-90d")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "cpu")
  |> filter(fn: (r) => r["cpu"] == "cpu-total")
  |> filter(fn: (r) => r["_field"] == "usage_system" or r["_field"] == "usage_user")
  |> filter(fn: (r) => r["host"] == "someServer")
  |> aggregateWindow(every: v.windowPeriod, fn: last, createEmpty: false)
  |> yield(name: "last")

This does not match the output from Telegraf, which is usually more than 2% for each metric. What am I doing wrong?

Try including usage_idle in your graph. Does usage_idle + usage_user + usage_system = 100 for each datapoint? If so, that’s a sign the data is valid, and CPU utilization is simply higher while you, as a user, are interacting with the machine and running Telegraf tests. It’s also possible that the aggregation, which selects the last value in each window, happens to sample moments when nothing is running, so the reported CPU usage is lower. Using fn: mean instead of fn: last may help as well.
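That sanity check could be expressed directly in Flux. This is a sketch only, reusing the bucket, host, and window variables from the question; it pivots the three fields into columns and adds a hypothetical total column that should hover around 100 if the data is consistent:

```flux
from(bucket: "hires-90d")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "cpu" and r.cpu == "cpu-total" and r.host == "someServer")
  |> filter(fn: (r) => r._field == "usage_idle" or r._field == "usage_system" or r._field == "usage_user")
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
  // put each field in its own column so the per-timestamp sum can be computed
  |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
  |> map(fn: (r) => ({r with total: r.usage_idle + r.usage_system + r.usage_user}))
```

If total stays near 100 but usage_system and usage_user are small, the points themselves are fine and the machine was mostly idle during those windows.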

I’m pretty sure the --test output matches reality; it’s similar to the CPU activity shown in Windows Task Manager. Using ‘mean’ has no effect on the result.

Including ‘usage_idle’ brings the total to 100, no matter whether the calculation is performed on the ‘--test’ output or on a query inside Data Explorer.

The issue is the discrepancy between the data Telegraf collects and the data written to the database. To debug it, I tried querying the raw data, but it returned the same result as the Data Explorer. Could it be a bug in Telegraf?
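For reference, a raw-data query of the kind described here would drop the aggregateWindow step entirely, so every stored point is returned as written. This is a sketch with an assumed one-hour range; the bucket and tag values are taken from the question:

```flux
from(bucket: "hires-90d")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu" and r.cpu == "cpu-total" and r.host == "someServer")
  |> filter(fn: (r) => r._field == "usage_system" or r._field == "usage_user")
```

If even this unaggregated result differs from what --test prints, the discrepancy must be introduced somewhere between collection and write, not by the query.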

What I’m trying to suggest is that this may not be a bug at all. If the data looks sane when it’s written, it may be that something you do while testing drives CPU utilization higher than when the machine is simply running on its own.