Telegraf Unable to Recover if LB is stopped for a couple of minutes and then restarted

Arjun_kochhar · June 28, 2022, 9:35pm

Hi Team, we are running into an issue with telegraf.
We have a telegraf statsd input that reads in certain data and then writes to wavefront.
There is a containerized load balancer (traefik) between telegraf and wavefront (multiple instances), so the flow is essentially like this:

telegraf → traefik(LB) → wavefront.

The traffic flow works fine. But if we bring traefik down for a couple of minutes, to the point where telegraf starts reporting Metric buffer overflow, and then bring traefik back up, telegraf is unable to recover and can no longer push metrics to wavefront:

2022-06-27T17:07:29Z W! [outputs.wavefront] Metric buffer overflow; 3506 metrics have been dropped
2022-06-27T17:07:29Z E! [agent] Error writing to outputs.wavefront: wavefront sending error: buffer full, dropping line: “some proprietary text”

Any thoughts on how telegraf can detect such broken connections and then auto-recover?

Posting here as I haven’t received a response from the Slack community. Thanks.

jpowers · June 29, 2022, 1:21pm

Hi,

When you say unable to recover, does that mean Telegraf stops sending metrics unless it is restarted?

Thanks

Arjun_kochhar · June 29, 2022, 1:35pm

Telegraf keeps reporting Error writing to outputs.wavefront:, the buffer keeps overflowing, and metrics keep getting dropped (even though I have restarted the LB, i.e. traefik, that sits between telegraf and wavefront).

Only when telegraf is restarted is it able to send metrics again.

jpowers · June 29, 2022, 2:40pm

There are two different buffers in the messages you provided:

wavefront sending error: buffer full, dropping line: “some proprietary text”

This error message is coming directly from Wavefront, not from Telegraf, which means that Telegraf is successfully connected to Wavefront.

At that point Telegraf is attempting to send more metrics, but Wavefront is saying its buffer is full. As a result, Telegraf assumes the metrics did not get there and keeps them in its own buffer, which overfills and prints out:

Metric buffer overflow; 3506 metrics have been dropped

Based on the two lines of log messages you have provided, it sounds like Telegraf is doing the right thing and sending metrics again. The problem lies with Wavefront’s buffer not accepting metrics.

As a possible bandaid, you could also increase Telegraf’s metric buffer so that it can store more data before overflowing.
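
As a rough sketch of that bandaid: the buffer size is controlled by metric_buffer_limit in the [agent] section of telegraf.conf (the default is 10000 metrics per output). The value below is only an assumed example, not a recommendation. A larger buffer uses more memory, and it only delays the overflow if the output never recovers.

```toml
[agent]
  ## Maximum number of unwritten metrics kept per output.
  ## 100000 is an assumed example value; size it to cover the outage you expect.
  metric_buffer_limit = 100000
```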

edit: it also sounds like there is an upstream issue, "ability to disable buffering" (Issue #4 in wavefrontHQ/wavefront-sdk-go on GitHub), which was filed by a user of Telegraf. This led to "[output.wavefront] Introduced "immediate_flush" flag" by prydin (Pull Request #8165 in influxdata/telegraf on GitHub), which added a config option in Telegraf. Have you looked at that?

Arjun_kochhar · June 29, 2022, 5:28pm

Thanks @jpowers, we already have immediate_flush = true set in the wavefront output, but it doesn’t seem to help:

[[outputs.wavefront]]
url = "pre-configured wavefront url"
metric_separator = "."
source_override = ["hostname", "agent_host", "node_host"]
convert_paths = true
immediate_flush = true

Arjun_kochhar · August 2, 2022, 1:32pm

@jpowers Ultimately, the fix for us was to upgrade telegraf to version 1.21.4, which resolved our issues.
