Telegraf Unable to Recover if LB is stopped for couple of minutes and then restarted

Hi Team, we are running into an issue with Telegraf.
We have a Telegraf statsd input that reads in certain data and then writes to Wavefront.
There is a containerized load balancer (traefik) between Telegraf and Wavefront (multiple instances), so the flow is essentially like this:

telegraf → traefik(LB) → wavefront.

The traffic flow works fine. But if we bring traefik down for a couple of minutes, to the point where Telegraf starts reporting "Metric buffer overflow", and then bring traefik back up, Telegraf is unable to recover and can no longer push metrics to Wavefront:

2022-06-27T17:07:29Z W! [outputs.wavefront] Metric buffer overflow; 3506 metrics have been dropped
2022-06-27T17:07:29Z E! [agent] Error writing to outputs.wavefront: wavefront sending error: buffer full, dropping line: “some proprietary text”

Any thoughts on how Telegraf can detect such broken connections and then recover automatically?

Posting here as I haven’t received a response from the Slack community. Thanks.

Hi,

When you say unable to recover, does that mean Telegraf stops sending metrics unless it is restarted?

Thanks

Telegraf keeps reporting "Error writing to outputs.wavefront", the buffer keeps overflowing, and metrics keep getting dropped, even though I have restarted the LB (traefik) between Telegraf and Wavefront.

Telegraf is only able to send metrics again after it is restarted.

There are two different buffers in the messages you provided:

wavefront sending error: buffer full, dropping line: “some proprietary text”

This error message is coming directly from Wavefront, not Telegraf, which means that Telegraf is successfully connected to Wavefront.

At that point Telegraf is attempting to send more metrics, but Wavefront is saying its buffer is full. As a result, Telegraf assumes the metrics did not get there and keeps them in its own buffer, which overfills and prints out:

Metric buffer overflow; 3506 metrics have been dropped

Based on the two lines of log messages you have provided, it sounds like Telegraf is doing the right thing and sending metrics again. The problem lies with Wavefront’s buffer not accepting metrics.

As a possible band-aid, you could also increase Telegraf’s metric buffer so it can store more data before overflowing.
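For reference, a minimal sketch of what that could look like in the agent section of telegraf.conf. The value 50000 is only an illustration, not a recommendation; size it to your metric volume and the outage length you want to ride out.

```toml
[agent]
  # Maximum number of unwritten metrics Telegraf keeps per output.
  # The default is 10000; 50000 here is an arbitrary example value.
  metric_buffer_limit = 50000
```

Keep in mind that a larger buffer only delays the overflow; it does not help if the output never starts accepting metrics again.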

Edit: it also sounds like there is an upstream issue about disabling buffering, "ability to disable buffering" (Issue #4, wavefrontHQ/wavefront-sdk-go on GitHub), and it was filed by a user of Telegraf. This led to a config option in Telegraf: "[output.wavefront] Introduced "immediate_flush" flag" by prydin (Pull Request #8165, influxdata/telegraf on GitHub). Have you looked at that?

Thanks @jpowers, we already have immediate_flush = true set in the wavefront output, but it doesn’t seem to help:

[[outputs.wavefront]]
url = "pre-configured wavefront url"
metric_separator = "."
source_override = ["hostname", "agent_host", "node_host"]
convert_paths = true
immediate_flush = true

@jpowers Ultimately, the fix for us was to upgrade Telegraf to version 1.21.4, which resolved our issues.