InfluxDB Output error

I have Telegraf configured to read input from a Kafka topic and write output to InfluxDB. It had been working for months, but lately we have seen multiple issues with the agent, and a few blog posts suggested that a newer version fixed this issue. I was running Telegraf v1.3.2 and upgraded to 1.11.5, then started to see warnings like these:

Mar 12 02:06:39 ip-10-204-29-53 telegraf: 2020-03-12T02:06:39Z W! [outputs.influxdb] Metric buffer overflow; 34514 metrics have been dropped
Mar 12 02:06:40 ip-10-204-29-53 telegraf: 2020-03-12T02:06:40Z W! [outputs.influxdb] Metric buffer overflow; 7732 metrics have been dropped

telegraf config:

[agent]
metric_buffer_limit = 15000

[[outputs.influxdb]]
urls = ["http://monitoring.amgen.com:18086"]
database = "aggregator"
retention_policy = ""
write_consistency = "any"
timeout = "5s"

[[outputs.influxdb]]
urls = ["http://monitoring.amgen.com:28086"]
database = "aggregator"
retention_policy = ""
write_consistency = "any"
timeout = "5s"

[[inputs.kafka_consumer]]
brokers = ["bk1.monitoring.amgen.com:9092", "bk2.monitoring.amgen.com:9093"]
topics = ["telegraf"]
consumer_group = "telegraf_metrics_consumers"
offset = "oldest"
data_format = "influx"
max_message_len = 65536

[[inputs.kafka_consumer_legacy]]
topics = ["telegraf"]
zookeeper_peers = ["zk4.monitoring.devops.amgen.com:2181", "zk5.monitoring.devops.amgen.com:2181", "zk6.monitoring.devops.amgen.com:2181"]
zookeeper_chroot = ""
consumer_group = "telegraf_metrics_consumers"
offset = "newest"
data_format = "influx"
max_message_len = 6553600

I am getting the same error when trying to scrape metrics from Prometheus. Are there any updates on this, or any configuration changes that would resolve it?

Here is what I get as output:

2020-09-22T11:58:13Z W! [outputs.influxdb] Metric buffer overflow; 55040 metrics have been dropped
2020-09-22T11:58:14Z W! [outputs.influxdb] Metric buffer overflow; 16189 metrics have been dropped
2020-09-22T11:58:14Z W! [outputs.influxdb] Metric buffer overflow; 9665 metrics have been dropped
2020-09-22T11:58:15Z W! [agent] [“outputs.influxdb”] did not complete within its flush interval
2020-09-22T11:58:15Z W! [outputs.influxdb] Metric buffer overflow; 31218 metrics have been dropped
2020-09-22T11:58:15Z W! [outputs.influxdb] Metric buffer overflow; 24285 metrics have been dropped

Thanks,

-Sreeni

Hello! The metrics are stored in a fixed-size ring buffer. If the outputs aren't keeping up with the inputs, older metrics will be overwritten ("dropped") from the buffer. You can increase the buffer size with the metric_buffer_limit config option, at the cost of higher Telegraf memory usage.

The newer versions of Telegraf surface buffer overflow warnings better than earlier ones did.
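Something like this in the [agent] section of telegraf.conf would raise the buffer well above the default; the values below are only illustrative, so size them to how far your outputs fall behind:

[agent]
## Maximum number of unwritten metrics held per output before the oldest
## metrics are overwritten ("dropped"); larger values trade memory for headroom.
metric_buffer_limit = 100000
## Metrics are written in batches of at most this many points per request.
metric_batch_size = 5000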

@philjb Thanks for the reply. I tried increasing metric_buffer_limit from 30000 to 100000, but metrics are still being dropped. Someone suggested using the influxdb_v2 output plugin with the content_encoding = gzip option. We are running InfluxDB 1.7.6, and I am not sure whether the v2 plugin would be compatible. While looking through its options just to try it out, I also can't find where to specify the database name in the v2 plugin… any ideas? Can I use the same syntax as the outputs.influxdb (v1) plugin?

Thanks,

-Sreeni

The v2 output plugin is for InfluxDB 2. I suspect you are using a 1.x version. The v1 output plugin supports gzip already.

Using gzip will help push write requests out of Telegraf faster. If you are still running into problems, I would increase the buffer size again. It is also worth checking the network connection speed between Telegraf and InfluxDB.
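For example, gzip can be enabled on your existing v1 output with the content_encoding option, roughly like this (URL and database taken from the config posted above):

[[outputs.influxdb]]
urls = ["http://monitoring.amgen.com:18086"]
database = "aggregator"
retention_policy = ""
write_consistency = "any"
timeout = "5s"
## Compress the body of write requests; "gzip" or "identity" (no compression).
content_encoding = "gzip"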