Hello Influx Community,
i have a question if this normal or i made mistake with the config from telegraf.
Current Setup:
Telegraf version: Telegraf 1.29.2 (git: HEAD@d92d7073)
3 VMs
1 VM Telegraf for generating metrics
1 VM with nats for pub/sub
1 VM with victoriametrcs / grafana for visualizing
Dataflow:
Telegraf → Nats → victoriametrics / grafana
Config files
/etc/telegraf/telegraf.conf
[agent]
interval = "1s"
round_interval = true
metric_batch_size = 10000
metric_buffer_limit = 250000
flush_interval = "2"
precision = "1ms"
hostname = ""
omit_hostname = false
/etc/telegraf/telegraf.d/node-exporter.conf
[[outputs.nats]]
servers = ["nats://dev.nats01.int.mon.vm.webapp.krz:6000"]
subject = "dev-test"
data_format = "prometheus"
/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d --debug
2024-01-22T09:31:43Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:45Z D! [outputs.nats] Wrote batch of 892 metrics in 4.501546ms
2024-01-22T09:31:45Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:47Z D! [outputs.nats] Wrote batch of 892 metrics in 3.71167ms
2024-01-22T09:31:47Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:49Z D! [outputs.nats] Wrote batch of 892 metrics in 3.626572ms
2024-01-22T09:31:49Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:51Z D! [outputs.nats] Wrote batch of 892 metrics in 4.228643ms
2024-01-22T09:31:51Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:53Z D! [outputs.nats] Wrote batch of 892 metrics in 6.883217ms
2024-01-22T09:31:53Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:55Z D! [outputs.nats] Wrote batch of 892 metrics in 4.012743ms
2024-01-22T09:31:55Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:57Z D! [outputs.nats] Wrote batch of 892 metrics in 3.966055ms
2024-01-22T09:31:57Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
2024-01-22T09:31:59Z D! [outputs.nats] Buffer fullness: 892 / 250000 metrics
2024-01-22T09:31:59Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:01Z D! [outputs.nats] Buffer fullness: 1784 / 250000 metrics
2024-01-22T09:32:01Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:03Z D! [outputs.nats] Buffer fullness: 2676 / 250000 metrics
2024-01-22T09:32:03Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:05Z D! [outputs.nats] Buffer fullness: 3568 / 250000 metrics
2024-01-22T09:32:05Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:07Z D! [outputs.nats] Buffer fullness: 4460 / 250000 metrics
2024-01-22T09:32:07Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:09Z D! [outputs.nats] Buffer fullness: 5352 / 250000 metrics
2024-01-22T09:32:09Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:11Z D! [outputs.nats] Buffer fullness: 6244 / 250000 metrics
2024-01-22T09:32:11Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:13Z D! [outputs.nats] Buffer fullness: 7136 / 250000 metrics
2024-01-22T09:32:13Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:15Z D! [outputs.nats] Buffer fullness: 8028 / 250000 metrics
2024-01-22T09:32:15Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:17Z D! [outputs.nats] Buffer fullness: 8920 / 250000 metrics
2024-01-22T09:32:17Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
2024-01-22T09:32:19Z D! [outputs.nats] Buffer fullness: 9812 / 250000 metrics
2024-01-22T09:32:19Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
We have set the interval and flush_intervall very low to test what happens if the nats output is not reachable. we have testet it. The telegraf output needs 60-75 seconds in our case to recognize the the nats endpoint is down and start to buffer.
Between output down and starting to buffer, we lost the metrics. Is the behavior normal or is it a config failure?
Our expectation is that it should buffer after the first try which the output is not reachable
regards
That is not the case based on your logs:
# This is a successful send of 892 metrics
2024-01-22T09:31:57Z D! [outputs.nats] Wrote batch of 892 metrics in 3.966055ms
# the buffer is empty since the send was sucessful
2024-01-22T09:31:57Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
# The buffer now has 892 metrics in it as something has gone wrong
2024-01-22T09:31:59Z D! [outputs.nats] Buffer fullness: 892 / 250000 metrics
# This is why the buffer has metrics, it failed to send.
2024-01-22T09:31:59Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
The only way we know to buffer is if the call to publish returns any type of error.
Hello jpowers,
i tested it again, maybe i dont understand something?
Telegraf is normal running and i see metrics in grafana
Stopped nats: Mon Jan 22 03:58:49 PM CET 2024
Telegraf sending now data to the dead endpoint
journalctl -fu telegraf
...
Jan 22 15:58:57 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:58:57Z D! [outputs.nats] Wrote batch of 982 metrics in 4.822462ms
Jan 22 15:58:57 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:58:57Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:58:59 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:58:59Z D! [outputs.nats] Wrote batch of 982 metrics in 4.387049ms
Jan 22 15:58:59 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:58:59Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:01 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:01Z D! [outputs.nats] Wrote batch of 982 metrics in 5.54675ms
Jan 22 15:59:01 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:01Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:03 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:03Z D! [outputs.nats] Wrote batch of 982 metrics in 4.673452ms
Jan 22 15:59:03 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:03Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:05 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:05Z D! [outputs.nats] Wrote batch of 982 metrics in 6.804667ms
Jan 22 15:59:05 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:05Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:07 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:07Z D! [outputs.nats] Wrote batch of 982 metrics in 4.344286ms
Jan 22 15:59:07 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:07Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:09 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:09Z D! [outputs.nats] Wrote batch of 982 metrics in 4.049769ms
Jan 22 15:59:09 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:09Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:11 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:11Z D! [outputs.nats] Wrote batch of 982 metrics in 3.962ms
Jan 22 15:59:11 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:11Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:13 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:13Z D! [outputs.nats] Wrote batch of 982 metrics in 4.443938ms
Jan 22 15:59:13 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:13Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:15 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:15Z D! [outputs.nats] Wrote batch of 982 metrics in 4.184204ms
Jan 22 15:59:15 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:15Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:17 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:17Z D! [outputs.nats] Wrote batch of 982 metrics in 4.238351ms
Jan 22 15:59:17 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:17Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:19 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:19Z D! [outputs.nats] Wrote batch of 982 metrics in 4.627999ms
Jan 22 15:59:19 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:19Z D! [outputs.nats] Buffer fullness: 0 / 250000 metrics
Jan 22 15:59:21 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:21Z D! [outputs.nats] Wrote batch of 982 metrics in 9.981734m
till
Jan 22 15:59:53 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:53Z D! [outputs.nats] Buffer fullness: 982 / 250000 metrics
Jan 22 15:59:53 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:53Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
Jan 22 15:59:55 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:55Z D! [outputs.nats] Buffer fullness: 1964 / 250000 metrics
Jan 22 15:59:55 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:55Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
Jan 22 15:59:57 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:57Z D! [outputs.nats] Buffer fullness: 2946 / 250000 metrics
Jan 22 15:59:57 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:57Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
Jan 22 15:59:59 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:59Z D! [outputs.nats] Buffer fullness: 3928 / 250000 metrics
Jan 22 15:59:59 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:59Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
Jan 22 16:00:01 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T15:00:01Z D! [outputs.nats] Buffer fullness: 4910 / 250000 metrics
Jan 22 16:00:01 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T15:00:01Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
Jan 22 16:00:03 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T15:00:03Z D! [outputs.nats] Buffer fullness: 5892 / 250000 metrics
waited 4 minutes and started nats again, telegraf pushed the buffered messages into nats
Nats stopped = 03:58:34
Till Telegraf notice the endpoint is down 1 minute passes by:
Jan 22 15:59:53 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:53Z D! [outputs.nats] Buffer fullness: 982 / 250000 metrics
Jan 22 15:59:53 dev.telegrafproducer01.int.mon.vm.webapp.krz telegraf[24962]: 2024-01-22T14:59:53Z E! [agent] Error writing to outputs.nats: FAILED to send NATS message: nats: outbound buffer limit exceeded
Telegraf starts buffering
i wait 4 minutes
start nats
telegraf sends the buffered metrics to nats
If the output / endpoint e.g nats is not reachable it should buffer after the first flush and buffer it. With the current behavior it takes around 1 minute till telegraf nats output reconizes the output / endpoint / nats is dead
It looks like the nats library used to communicate with the server has a built-in buffer and will buffer metrics untill the server is back or it runs out of space. This explains why the error message you got is around buffer limit exceeded. The confusion around this behavior is covered in the upstream issue: Better capture pending writes when getting disconnected · Issue #86 · nats-io/nats.go · GitHub
There does appear to be a SubChanLen
(65536) option in the library to set the size of the buffered channel, but when set to 0, it would be set to the default. There are also the MaxPingsOut
(2) and PingInterval
(2min) options to determine if the connection is down.
Some level of tuning could help here, but I’m not sure how we could completely avoid sending to the library’s buffer successfully, but not actually sending the metric.
thank you for the explanation