Confused about telegraf health metrics

Hello,

In my setup, I have 50k devices posting data to Influx every 5 minutes (via a bunch of stuff, then ultimately rabbitmq and telegraf). Both the number of fields in the data and the destination measurement vary.
To check everything is working, I’d like to know how many lines of line protocol telegraf has processed every 5 minutes. If the answer is close to 50k, I can sleep at night.

I’m a bit confused about the metrics, especially measurement “internal_write”, field “metrics_written”.

If I send this (and only this) data to Influx via telegraf:

some_measurement,tag1=a,tag2=b,tag3=c,tag4=d marker_1=38,marker_2=6,marker_3=68,marker_4=44,marker_5=12,marker_6=14,marker_7=46,marker_8=97,marker_9=21,id="xyz"

The output of the following query is 21:

import "experimental/aggregate"
from(bucket: "some_bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "internal_write")
  |> filter(fn: (r) => r["_field"] == "metrics_written")
  |> filter(fn: (r) => r["host"] == "telegraf")
  |> filter(fn: (r) => r["output"] == "influxdb_v2")
  |> aggregate.rate(
    every: 1m,
    unit: 1m,
  )

My data has 10 fields and 4 tags. Does each field/tag count as a "metric"? If so, perhaps I'm seeing 14 + 7 metrics written to something else, like telegraf's own internal_ measurements?
For my purposes, I’d be more interested in knowing the number of lines of line protocol telegraf has sent - is it possible to get that?

Secondly, is there a way to add a host tag to only the health metrics?
If I set hostname="" in my agent config, the host tag in Influx contains the telegraf pod name (I'm running telegraf on kubernetes), but it's applied to all measurements, which is going to blow up my cardinality; if I have 10 telegraf pods, my cardinality is 10x what it was with hostname="telegraf".

I was hoping to monitor metrics (ideally lines of line protocol) written to Influx on a per pod basis.

Thanks,
Tom

Hello @thopewell,
Here’s some info on the internal_write fields:

  • buffer_limit : maximum number of metrics the buffer can hold. This is the same as the metric_buffer_limit in your configuration file.
  • buffer_size : number of metrics currently in the buffer. A metric is essentially one line of line protocol.
  • metrics_written : cumulative number of metrics written since process start.
  • metrics_filtered : cumulative number of metrics filtered since process start.
  • gather_time_ns : cumulative time in nanoseconds that the tagged input has spent gathering its input since process start (this one is reported under internal_gather rather than internal_write).
  • write_time_ns : cumulative time in nanoseconds that the tagged output has spent writing its output since process start.
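
For example, if you want to keep an eye on buffer pressure, something roughly like this (reusing the bucket and output filter from your query above; the 1h range is arbitrary) plots buffer_size against buffer_limit:

from(bucket: "some_bucket")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "internal_write")
  // buffer_size should stay well below buffer_limit if the output is keeping up
  |> filter(fn: (r) => r["_field"] == "buffer_size" or r["_field"] == "buffer_limit")
  |> filter(fn: (r) => r["output"] == "influxdb_v2")
  |> aggregateWindow(every: 1m, fn: max, createEmpty: false)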

When you say the output of the query is 21, do you mean the value is 21, or that 21 points are returned? If you remove the aggregate.rate() function and query over a short period of time, do you see the expected number of results?
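
For example, something like this (your query above with the aggregate.rate() call removed and an arbitrary short range) will show the raw values, which are a cumulative counter:

from(bucket: "some_bucket")
  |> range(start: -15m)
  |> filter(fn: (r) => r["_measurement"] == "internal_write")
  |> filter(fn: (r) => r["_field"] == "metrics_written")
  |> filter(fn: (r) => r["host"] == "telegraf")
  |> filter(fn: (r) => r["output"] == "influxdb_v2")
  // metrics_written only ever increases; the interesting number is how much it grows between flushes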

Hmm, I'm not sure how to add a host tag to the health metrics only, at least not directly; you could apply tagdrop everywhere else.

Hi @Anaisdg ,

Thanks for the info. After some further tests, I worked it out: metrics_written is the cumulative number of lines of line protocol written. However, the figure includes the metrics telegraf collects about itself, which at scale will be insignificant, but which caused confusion when I dialled everything back yesterday.
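
In case it helps anyone else: because the field is a cumulative counter, a query roughly like this (same bucket and filters as my first post; the 5m window matches how often my devices post) gives the per-interval line count directly, without aggregate.rate's smoothing:

from(bucket: "some_bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "internal_write")
  |> filter(fn: (r) => r["_field"] == "metrics_written")
  |> filter(fn: (r) => r["host"] == "telegraf")
  |> filter(fn: (r) => r["output"] == "influxdb_v2")
  // non-negative difference copes with the counter resetting when telegraf restarts
  |> difference(nonNegative: true)
  |> aggregateWindow(every: 5m, fn: sum, createEmpty: false)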

If I stop everything before rabbitmq (no data to Influx apart from telegraf health metrics) and set the agent.flush_interval to a round number like 20s, I see 60 metrics sent to Influx every ~20s, which I assume is all the metrics telegraf is collecting about itself and posting to Influx:

2021-01-27T17:57:02Z D! [outputs.health] Wrote batch of 60 metrics in 19.412µs
2021-01-27T17:57:02Z D! [outputs.health] Buffer fullness: 0 / 20000 metrics
2021-01-27T17:57:03Z D! [outputs.influxdb_v2] Wrote batch of 60 metrics in 1.069056341s
2021-01-27T17:57:03Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 20000 metrics
2021-01-27T17:57:22Z D! [outputs.health] Wrote batch of 60 metrics in 13.155µs
2021-01-27T17:57:22Z D! [outputs.health] Buffer fullness: 0 / 20000 metrics
2021-01-27T17:57:23Z D! [outputs.influxdb_v2] Wrote batch of 60 metrics in 685.686464ms
2021-01-27T17:57:23Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 20000 metrics

This then correlates with what I see in Influx.

If I publish a message directly onto the influx queue in rabbitmq:

some_measurement,tag1=a,tag2=b,tag3=c,tag4=d marker_1=38,marker_2=6,marker_3=68,marker_4=44,marker_5=12,marker_6=14,marker_7=46,marker_8=97,marker_9=21,id="xyz"

This figure increases by 1 (the batch goes from 60 to 61 metrics), which is exactly what I hoped for!

2021-01-27T17:58:42Z D! [outputs.health] Wrote batch of 61 metrics in 15.6µs
2021-01-27T17:58:42Z D! [outputs.health] Buffer fullness: 0 / 20000 metrics
2021-01-27T17:58:43Z D! [outputs.influxdb_v2] Wrote batch of 61 metrics in 1.242752496s
2021-01-27T17:58:43Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 20000 metrics

Previously, the flush interval was set to 15s. I think the uneven results from aggregate.rate kind of hid what was really going on.

Thanks,
Tom

@thopewell glad it’s making sense now. Anytime!

@Anaisdg, incidentally, I still need to explore the tagdrop feature because once I have more than one telegraf pod running, it's very hard (impossible?) to work out how many metrics have been sent, since the aggregate.rate function can't distinguish between the separate counters. I get figures in the thousands when I'm not sending any data, which is really just the difference between the counters as a result of the pods starting at different times.

For the sake of completeness, this gets me exactly what I need:

[agent]
  ... stuff
  hostname = ""
  omit_hostname = false

# this one sends the telegraf metrics, including host=podname
[[outputs.influxdb_v2]]
  ... stuff
  tagexclude = ["source"]
  [outputs.influxdb_v2.tagdrop]
    source = ["rabbitmq"]

# this one sends the data pulled off rabbitmq and drops the host tag, saving cardinality
[[outputs.influxdb_v2]]
  ... stuff
  tagexclude = ["source", "host"]
  [outputs.influxdb_v2.tagpass]
    source = ["rabbitmq"]

[[inputs.amqp_consumer]]
  ... stuff
  [inputs.amqp_consumer.tags]
    source = "rabbitmq"

I can now use aggregate.rate to plot metrics_written per telegraf pod with the numbers making sense!
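
For reference, the per-pod query is roughly this (same bucket as my first post; host now holds the pod name because of hostname = "", and groupColumns keeps each pod's counter separate):

import "experimental/aggregate"

from(bucket: "some_bucket")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "internal_write")
  |> filter(fn: (r) => r["_field"] == "metrics_written")
  |> filter(fn: (r) => r["output"] == "influxdb_v2")
  // group on host so each pod's cumulative counter is rated on its own
  |> aggregate.rate(every: 5m, unit: 5m, groupColumns: ["host"])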

Hello @thopewell,
looks good! Thank you for sharing!