Telegraf time to dectect a output is down and start buffering

Hello Influx Community,

i have a question if this normal or i made mistake with the config from telegraf.


Current Setup:

Telegraf version: Telegraf 1.29.2 (git: HEAD@d92d7073)

3 VMs

1 VM Telegraf for generating metrics
1 VM with nats for pub/sub
1 VM with victoriametrcs / grafana for visualizing

Dataflow:

Telegraf → Nats → victoriametrics / grafana


Config files

/etc/telegraf/telegraf.conf

[agent]
  interval = "1s"
  round_interval = true
  metric_batch_size = 10000
  metric_buffer_limit = 250000
  flush_interval = "2"
  precision = "1ms"
  hostname = ""
  omit_hostname = false

/etc/telegraf/telegraf.d/node-exporter.conf

[[outputs.nats]]
  servers = ["nats://dev.nats01.int.mon.vm.webapp.krz:6000"]
  subject = "dev-test"
  data_format = "prometheus"

/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d --debug

2024-01-22T09:31:43Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:45Z D! [outputs.nats] Wrote batch of 892 metrics in 4.501546ms
2024-01-22T09:31:45Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:47Z D! [outputs.nats] Wrote batch of 892 metrics in 3.71167ms
2024-01-22T09:31:47Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:49Z D! [outputs.nats] Wrote batch of 892 metrics in 3.626572ms
2024-01-22T09:31:49Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:51Z D! [outputs.nats] Wrote batch of 892 metrics in 4.228643ms
2024-01-22T09:31:51Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:53Z D! [outputs.nats] Wrote batch of 892 metrics in 6.883217ms
2024-01-22T09:31:53Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:55Z D! [outputs.nats] Wrote batch of 892 metrics in 4.012743ms
2024-01-22T09:31:55Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:57Z D! [outputs.nats] Wrote batch of 892 metrics in 3.966055ms
2024-01-22T09:31:57Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:59Z D! [outputs.nats] Buffer fullness: 892 / 250000 metrics
2024-01-22T09:31:59Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:01Z D! [outputs.nats] Buffer fullness: 1784 / 250000 metrics
2024-01-22T09:32:01Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:03Z D! [outputs.nats] Buffer fullness: 2676 / 250000 metrics
2024-01-22T09:32:03Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:05Z D! [outputs.nats] Buffer fullness: 3568 / 250000 metrics
2024-01-22T09:32:05Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:07Z D! [outputs.nats] Buffer fullness: 4460 / 250000 metrics
2024-01-22T09:32:07Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:09Z D! [outputs.nats] Buffer fullness: 5352 / 250000 metrics
2024-01-22T09:32:09Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:11Z D! [outputs.nats] Buffer fullness: 6244 / 250000 metrics
2024-01-22T09:32:11Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:13Z D! [outputs.nats] Buffer fullness: 7136 / 250000 metrics
2024-01-22T09:32:13Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:15Z D! [outputs.nats] Buffer fullness: 8028 / 250000 metrics
2024-01-22T09:32:15Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:17Z D! [outputs.nats] Buffer fullness: 8920 / 250000 metrics
2024-01-22T09:32:17Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:19Z D! [outputs.nats] Buffer fullness: 9812 / 250000 metrics
2024-01-22T09:32:19Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded

We have set the interval and flush_intervall very low to test what happens if the nats output is not reachable. we have testet it. The telegraf output needs 60-75 seconds in our case to recognize the the nats endpoint is down and start to buffer.
Between output down and starting to buffer, we lost the metrics. Is the behavior normal or is it a config failure?

Our expectation is that it should buffer after the first try which the output is not reachable

regards

That is not the case based on your logs:

# This is a successful send of 892 metrics
2024-01-22T09:31:57Z D! [outputs.nats] Wrote batch of 892 metrics in 3.966055ms
# the buffer is empty since the send was sucessful
2024-01-22T09:31:57Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
# The buffer now has 892 metrics in it as something has gone wrong
2024-01-22T09:31:59Z D! [outputs.nats] Buffer fullness: 892 / 250000 metrics
# This is why the buffer has metrics, it failed to send.
2024-01-22T09:31:59Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded

The only way we know to buffer is if the call to publish returns any type of error.

Hello jpowers,


i tested it again, maybe i dont understand something?

Telegraf is normal running and i see metrics in grafana

Stopped nats: Mon Jan 22 03:58:49 PM CET 2024

Telegraf sending now data to the dead endpoint

journalctl -fu telegraf
...
Jan 22 15:58:57 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:58:57Z D! [outputs.nats] Wrote batch of 982 metrics in 4.822462ms
Jan 22 15:58:57 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:58:57Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:58:59 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:58:59Z D! [outputs.nats] Wrote batch of 982 metrics in 4.387049ms
Jan 22 15:58:59 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:58:59Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:01 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:01Z D! [outputs.nats] Wrote batch of 982 metrics in 5.54675ms
Jan 22 15:59:01 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:01Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:03 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:03Z D! [outputs.nats] Wrote batch of 982 metrics in 4.673452ms
Jan 22 15:59:03 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:03Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:05 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:05Z D! [outputs.nats] Wrote batch of 982 metrics in 6.804667ms
Jan 22 15:59:05 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:05Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:07 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:07Z D! [outputs.nats] Wrote batch of 982 metrics in 4.344286ms
Jan 22 15:59:07 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:07Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:09 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:09Z D! [outputs.nats] Wrote batch of 982 metrics in 4.049769ms
Jan 22 15:59:09 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:09Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:11 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:11Z D! [outputs.nats] Wrote batch of 982 metrics in 3.962ms
Jan 22 15:59:11 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:11Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:13 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:13Z D! [outputs.nats] Wrote batch of 982 metrics in 4.443938ms
Jan 22 15:59:13 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:13Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:15 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:15Z D! [outputs.nats] Wrote batch of 982 metrics in 4.184204ms
Jan 22 15:59:15 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:15Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:17 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:17Z D! [outputs.nats] Wrote batch of 982 metrics in 4.238351ms
Jan 22 15:59:17 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:17Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:19 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:19Z D! [outputs.nats] Wrote batch of 982 metrics in 4.627999ms
Jan 22 15:59:19 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:19Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:21 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:21Z D! [outputs.nats] Wrote batch of 982 metrics in 9.981734m

till

Jan 22 15:59:53 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:53Z D! [outputs.nats] Buffer fullness: 982 / 250000 metrics
Jan 22 15:59:53 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:53Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
Jan 22 15:59:55 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:55Z D! [outputs.nats] Buffer fullness: 1964 / 250000 metrics
Jan 22 15:59:55 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:55Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
Jan 22 15:59:57 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:57Z D! [outputs.nats] Buffer fullness: 2946 / 250000 metrics
Jan 22 15:59:57 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:57Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
Jan 22 15:59:59 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:59Z D! [outputs.nats] Buffer fullness: 3928 / 250000 metrics
Jan 22 15:59:59 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:59Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
Jan 22 16:00:01 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T15:00:01Z D! [outputs.nats] Buffer fullness: 4910 / 250000 metrics
Jan 22 16:00:01 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T15:00:01Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
Jan 22 16:00:03 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T15:00:03Z D! [outputs.nats] Buffer fullness: 5892 / 250000 metrics

waited 4 minutes and started nats again, telegraf pushed the buffered messages into nats



Nats stopped = 03:58:34
Till Telegraf notice the endpoint is down 1 minute passes by:

Jan 22 15:59:53 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:53Z D! [outputs.nats] Buffer fullness: 982 / 250000 metrics
Jan 22 15:59:53 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:53Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded

Telegraf starts buffering
i wait 4 minutes
start nats
telegraf sends the buffered metrics to nats


If the output / endpoint e.g nats is not reachable it should buffer after the first flush and buffer it. With the current behavior it takes around 1 minute till telegraf nats output reconizes the output / endpoint / nats is dead

It looks like the nats library used to communicate with the server has a built-in buffer and will buffer metrics untill the server is back or it runs out of space. This explains why the error message you got is around buffer limit exceeded. The confusion around this behavior is covered in the upstream issue: Better capture pending writes when getting disconnected · Issue #86 · nats-io/nats.go · GitHub

There does appear to be a SubChanLen (65536) option in the library to set the size of the buffered channel, but when set to 0, it would be set to the default. There are also the MaxPingsOut (2) and PingInterval (2min) options to determine if the connection is down.

Some level of tuning could help here, but I’m not sure how we could completely avoid sending to the library’s buffer successfully, but not actually sending the metric.

thank you for the explanation :+1: