We’re currently trialling Cloud 2 / InfluxDB 3 Serverless to replace an InfluxDB 2.0 OSS instance we’re managing ourselves. We use Telegraf with the AMQP consumer input and InfluxDB 2.0 output plugins.
Our system uses data buckets with 6 months’ retention, but the data source will occasionally generate samples with timestamps well before the start of the retention period.
With InfluxDB 2.4 OSS we get the following message from Telegraf when it writes a batch with some samples outside the range. Telegraf then moves on to the next batch of points which is fine for our application.
2023-10-20T10:07:49Z E! [outputs.influxdb_v2] Failed to write metric to BUCKET_NAME_REDACTED (will be dropped: 422 Unprocessable Entity): unprocessable entity: failure writing points to database: partial write: points beyond retention policy dropped=3
On InfluxDB 3 Serverless we see the following messages instead. Telegraf then retries the same batch of points over and over again.
2023-10-20T10:07:48Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
2023-10-20T10:07:48Z E! [outputs.influxdb_v2] When writing to [https://eu-central-1-1.aws.cloud2.influxdata.com/]: failed to write metric to BUCKET_NAME_REDACTED (403 Forbidden): forbidden: dml handler error: data in table sensor_sample is outside of the retention period: minimum acceptable timestamp is 2023-09-20T10:07:48.930235814+00:00, but observed timestamp 2013-01-01T00:03:38.221+00:00 is older.
The HTTP response code is 422 in the first case and 403 in the second. From reviewing the source, the Telegraf InfluxDB v2 output plugin handles these differently: it drops the batch on a 422 but retries it on a 403.
What’s the correct behaviour from InfluxDB? Is there a simple way to make the cloud setup behave more like the OSS setup? (This would be the easiest migration path for us.)
I have and the difference in behavior is something getting reviewed by the influxdb team currently. It is not clear to me yet what, if anything, telegraf may need to change.
A temporary solution would be to use the client libraries instead of telegraf so you can change how you handle each return code.
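As a rough illustration of that workaround, here is a minimal Python sketch of per-status handling in your own writer. The function name and the action labels are made up for this example, and matching on the error text is an assumption based only on the messages quoted earlier in this thread. It is not how Telegraf or any client library behaves.

```python
# Hypothetical action labels for this sketch.
OK = "ok"        # batch accepted
DROP = "drop"    # give up on this batch and move on
RETRY = "retry"  # transient problem, try the same batch again
FAIL = "fail"    # needs a human (for example a bad token)


def classify_write_response(status: int, body: str = "") -> str:
    """Map the HTTP status of a write request to an action."""
    if 200 <= status < 300:
        return OK
    # 422 is what InfluxDB OSS returned in this thread for points
    # beyond the retention policy.
    if status == 422:
        return DROP
    # Serverless returned 403 with this wording in this thread. Matching
    # on the text is an assumption and breaks if the message changes.
    if status == 403 and "outside of the retention period" in body:
        return DROP
    # Rate limiting and server-side errors are usually transient.
    if status == 429 or status >= 500:
        return RETRY
    # Any other client error (including other 403s) will not fix itself
    # on retry, so surface it instead of looping or silently dropping.
    return FAIL
```

With something like this, a retention-related 403 is treated the same way as the 422 from OSS, while a 403 caused by a permissions problem is still reported rather than dropped.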
Even if we were to change the return code handling either by using the client libraries or by patching telegraf, will InfluxDB Serverless accept and write the other records in the request, or will the whole batch (probably 10,000 records) be rejected? Is this the “something getting reviewed by the influxdb team currently”?
@jpowers It’s been a little while on this, has there been any update to InfluxDB 3.0 serverless behaviour? I can’t find any published release notes for Influx 3.0, so all I’m going on here is whatever documentation and source I can find.
The docs seem to suggest that if all points are outside the retention period InfluxDB 3 will still return a 403 Forbidden error… (IMHO this is not good semantics for HTTP; 422 is much more sensible.)
Telegraf handles a 403 error as one where it should retry…
FYI, I just hit this today and landed here via Google.
Some data with a timestamp outside the retention period of our Influx3 cloud serverless bucket found its way into our pipeline and caused Telegraf to seemingly lock up. If I restarted Telegraf, it would pull a few messages from RabbitMQ until it pulled a message with a “bad” timestamp, and then RabbitMQ delivery went to 0. Telegraf didn’t log anything to say it was retrying (might be our log level).
I had to change the retention time of the bucket to clear the queue.
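If you control the producer, or write through your own code rather than Telegraf, another option is to drop expired points before they ever reach the bucket. A minimal Python sketch, assuming each point is a dict with a timezone-aware "time" field (that shape, the function name and the safety margin are all hypothetical):

```python
from datetime import datetime, timedelta, timezone


def drop_expired(points, retention, now=None, margin=timedelta(minutes=5)):
    """Return only the points whose timestamp is inside the retention window.

    The margin keeps points that are about to expire from being sent,
    since they could fall out of the window before the write lands.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - retention + margin
    return [p for p in points if p["time"] >= cutoff]


# Example using the timestamps from the error message earlier in the thread.
now = datetime(2023, 10, 20, 10, 7, 48, tzinfo=timezone.utc)
points = [
    {"time": datetime(2013, 1, 1, 0, 3, 38, tzinfo=timezone.utc), "value": 1},
    {"time": datetime(2023, 10, 19, tzinfo=timezone.utc), "value": 2},
]
kept = drop_expired(points, timedelta(days=30), now=now)
```

This does not answer whether the server should reject the whole batch, but it would keep one stale sample from blocking the queue.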
Thanks!
Tom