Error writing to outputs.http: 401 Error repeatedly

When telegraf is having trouble connecting to an output URL, we repeatedly see this in the telegraf.log. Is there any way to detect this condition and stop the retry attempts? I’m presuming that in our case, where flush_interval is set to 1 sec, we are making a TCP connection every second and the telegraf queue is slowly filling up. Is there a backoff mechanism in place to limit these attempts, then retry after some set amount of time, rather than retrying every flush_interval? Is there a way to notify an external process when telegraf is in this condition?

2024-03-13T20:32:40Z E! [agent] Error writing to outputs.http: when writing to [https://va2.xyz.com/telemetry/v1/state/platform] received status code: 401. body: {"code":-3,"message":null}
2024-03-13T20:32:41Z E! [agent] Error writing to outputs.http: when writing to [https://va2.xyz.com/telemetry/v1/state/platform] received status code: 401. body: {"code":-3,"message":null}
2024-03-13T20:32:43Z E! [agent] Error writing to outputs.http: when writing to [https://va2.xyz.com/telemetry/v1/stats/platform] received status code: 401. body: {"code":-3,"message":null}

These are not retry attempts; this is the agent triggering a flush on your flush_interval of 1 second. Every second, the agent tells the http output to try to send metrics.

Is there a backoff mechanism in place to limit these attempts, then retry after some set amount of time, rather than retrying every flush_interval?

Not at this time.

Is there a way to notify an external process when telegraf is in this condition?

If you are not able to send metrics, then your buffer must be growing. There is an internal input plugin (inputs.internal) that you can enable, which provides metrics about the output buffers. You can then watch those metrics and even alert on them.
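As a minimal sketch of what enabling it could look like, assuming an otherwise stock config: the plugin reports an internal_write measurement per output, with fields such as buffer_size and buffer_limit. Which threshold to alert on is left to you.

```toml
# Collect Telegraf's own runtime metrics, including per-output buffer stats
# (reported in the internal_write measurement).
[[inputs.internal]]
  # Go runtime memory stats are not needed for watching output buffers.
  collect_memstats = false
```

Keep in mind that these metrics are written through your configured outputs like any other metric, so they need to reach an output that is still working for an alert to fire while the http output is failing.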

Thanks for the response and thanks for the tip on the internal plugin. I’ll play around with that.

WRT HTTP output, do we only purge the queue when data is sent successfully and we get back a 200 response? I’m wondering what HTTP error responses, if any, would cause data to not get flushed and the queue to continue to grow.

do we only purge the queue when data is sent successfully

Correct - Success is defined as:

  1. Any 2xx return code
  2. A return code that matches a value found in the non_retryable_statuscodes config option. This option lets you name specific return codes for which you would rather drop the metrics, lose the data, and continue on.
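For illustration only, here is a sketch of how that option might be set for the case in this thread. The URL is taken from the log lines above; listing 401 is an assumption about what you want, not a recommendation, since every batch that gets a 401 would then be discarded rather than kept in the buffer.

```toml
[[outputs.http]]
  # URL copied from the log lines earlier in this thread.
  url = "https://va2.xyz.com/telemetry/v1/state/platform"

  # Status codes listed here are treated as handled: the batch is dropped
  # instead of staying in the buffer for the next flush.
  # 401 is an illustrative choice; it means data is lost while auth is failing.
  non_retryable_statuscodes = [401]
```

Without this option, a 401 leaves the metrics in the buffer, and the buffer keeps growing until it reaches the agent's metric_buffer_limit, at which point the oldest metrics are dropped first.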