Telegraf fails to write to InfluxDB 3.0 Serverless due to retention period error

smith84 · November 8, 2023, 8:00am

We’re currently trialling Cloud 2 / InfluxDB 3 Serverless to replace an InfluxDB 2.0 OSS instance we’re managing ourselves. We use Telegraf with the AMQP consumer input and InfluxDB 2.0 output plugins.

Our system uses data buckets with 6 months’ retention, but the data source will occasionally generate samples with timestamps well before the start of the retention period.

With InfluxDB 2.4 OSS we get the following message from Telegraf when it writes a batch with some samples outside the range. Telegraf then moves on to the next batch of points, which is fine for our application.
2023-10-20T10:07:49Z E! [outputs.influxdb_v2] Failed to write metric to BUCKET_NAME_REDACTED (will be dropped: 422 Unprocessable Entity): unprocessable entity: failure writing points to database: partial write: points beyond retention policy dropped=3

On InfluxDB 3 Serverless we see the following messages instead. Telegraf then retries the same batch of points over and over again.
2023-10-20T10:07:48Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
2023-10-20T10:07:48Z E! [outputs.influxdb_v2] When writing to [https://eu-central-1-1.aws.cloud2.influxdata.com/]: failed to write metric to BUCKET_NAME_REDACTED (403 Forbidden): forbidden: dml handler error: data in table sensor_sample is outside of the retention period: minimum acceptable timestamp is 2023-09-20T10:07:48.930235814+00:00, but observed timestamp 2013-01-01T00:03:38.221+00:00 is older.

The HTTP response code is 422 in the first case and 403 in the second. From reviewing the source, the Telegraf InfluxDB v2 output plugin handles these two codes differently.

What’s the correct behaviour from InfluxDB? Is there a simple way to make the cloud setup behave more like the OSS setup? (This would be the easiest migration path for us.)

Anaisdg · November 16, 2023, 8:34am

@jpowers have you encountered this?

jpowers · November 16, 2023, 1:16pm

I have, and the difference in behavior is something getting reviewed by the influxdb team currently. It is not yet clear to me what, if anything, Telegraf may need to change.

A temporary solution would be to use the client libraries instead of telegraf so you can change how you handle each return code.
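
For anyone taking the client-library route, the per-return-code decision could look something like the sketch below. It is plain Python with no InfluxDB client involved. The function name `decide_write_action` and the action names are hypothetical, and the mapping reflects only what is described in this thread (422 is dropped, the Serverless retention rejection arrives as 403). It is not Telegraf's actual implementation.

```python
def decide_write_action(status_code: int, body: str = "") -> str:
    """Return "ok", "drop", or "retry" for a write response.

    Hypothetical policy for a custom writer, based on the responses
    quoted in this thread.
    """
    if 200 <= status_code < 300:
        return "ok"
    # A retention rejection is permanent: resending the same batch
    # cannot succeed, so drop it instead of retrying forever.
    if status_code == 403 and "outside of the retention period" in body:
        return "drop"
    # Bad or unprocessable payloads will not succeed on retry either.
    if status_code in (400, 422):
        return "drop"
    # Anything else is treated as possibly transient in this sketch;
    # a real writer would likely cap retries and alert.
    return "retry"
```

Matching on the response body is fragile if the server's wording changes, so treat the substring check as a stopgap until the return code itself is changed.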

smith84 · November 17, 2023, 2:25am

Even if we were to change the return code handling either by using the client libraries or by patching telegraf, will InfluxDB Serverless accept and write the other records in the request, or will the whole batch (probably 10,000 records) be rejected? Is this the “something getting reviewed by the influxdb team currently”?

jpowers · November 17, 2023, 1:31pm

I believe this is the current behavior and something that is looking to get changed in certain situations.

smith84 · November 20, 2023, 5:46am

Thanks. I’ll keep a watch out for an update, though I’m not aware of any published release notes for InfluxDB 3.0 / Serverless…

smith84 · June 26, 2024, 4:03am

@jpowers It’s been a little while on this, has there been any update to InfluxDB 3.0 serverless behaviour? I can’t find any published release notes for Influx 3.0, so all I’m going on here is whatever documentation and source I can find.

The docs seem to suggest that if all points are outside the retention period InfluxDB 3 will still return a 403 Forbidden error… (IMHO this is not good semantics for HTTP; 422 is much more sensible.)

Telegraf handles a 403 error as one where it should retry…

jpowers · June 26, 2024, 12:47pm

Last I saw this was still the behavior, but there is agreement to change the return code to 400. This will prevent Telegraf from retrying.

thopewell · July 6, 2024, 3:39am

FYI, I just hit this today and landed here via Google.
Some data with a timestamp outside the retention time of our InfluxDB 3 Cloud Serverless bucket found its way into our pipeline and caused Telegraf to seemingly lock up. If I restarted Telegraf, it would pull a few messages from RabbitMQ until it pulled a message with a “bad” timestamp, and then RabbitMQ delivery went to 0. Telegraf didn’t log anything to say it was retrying (might be our log level).
I had to change the retention time of the bucket to clear the queue.
Thanks!
Tom
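
One way to avoid the stuck batch described in this thread is to keep out-of-retention points from ever reaching the output. The sketch below assumes you control the code that publishes or writes the points. The function `split_by_retention` and the `(timestamp, payload)` tuple shape are hypothetical, and the retention length is something you would need to match to your own bucket.

```python
from datetime import datetime, timedelta, timezone


def split_by_retention(points, retention, now=None):
    """Split (timestamp, payload) pairs into (writable, too_old).

    Timestamps must be timezone-aware datetimes. `retention` is a
    timedelta matching the bucket's retention period (an assumption
    the caller has to supply).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - retention
    writable, too_old = [], []
    for ts, payload in points:
        # Points older than the cutoff would be rejected by the server.
        (writable if ts >= cutoff else too_old).append((ts, payload))
    return writable, too_old


# Example using the timestamps from the error message earlier in the thread.
now = datetime(2023, 10, 20, 10, 7, 48, tzinfo=timezone.utc)
points = [
    (datetime(2013, 1, 1, 0, 3, 38, tzinfo=timezone.utc), "stale sample"),
    (datetime(2023, 10, 20, 10, 0, 0, tzinfo=timezone.utc), "fresh sample"),
]
writable, too_old = split_by_retention(points, timedelta(days=30), now=now)
```

Because the server computes its own cutoff at write time, it may be worth using a retention value slightly shorter than the bucket's so that borderline points are dropped locally rather than rejected remotely.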
