Context Deadline Error Telegraf

Hi, I am running Telegraf and InfluxDB as Docker containers deployed on two separate EC2 instances in the same subnet; the instance types are r5.large and r5a.4xlarge respectively. The input I am using is Kafka, and the frequency of data is every 5 minutes. I am running two Telegraf containers for two different Kafka brokers. One Telegraf consumes 7 Kafka topics and the other 25/26 topics.

For the second Telegraf, I am getting a “context deadline exceeded” error; however, after restarting, everything works fine until the next time. I want to know what the issue might be and how to debug it. I am using an HTTP timeout of “15s”. I will be shifting the topics to another Telegraf, but before that, I wanted to see some logs explaining why the “context deadline” message appears.

On Telegraf we are doing quite heavy processing, but the code itself is well optimized for our use case. Below are the processor plugin configurations (a rough config sketch follows the list):

  1. Starlark processor - extracts 3 different metrics from each incoming metric
  2. Converter - converts the data type of a field from string to float
  3. Regex - adds some new tags based on regular expressions
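
A minimal TOML sketch of what that chain could look like; the metric names, field names, tag keys, and patterns are placeholders, not our actual configuration:

```toml
# Hypothetical processor chain matching the three steps above.
[[processors.starlark]]
  source = '''
def apply(metric):
    # Derive 3 metrics from each incoming metric (placeholder logic).
    out = []
    for suffix in ["a", "b", "c"]:
        m = deepcopy(metric)
        m.name = metric.name + "_" + suffix
        out.append(m)
    return out
'''

[[processors.converter]]
  [processors.converter.fields]
    float = ["value"]          # convert the string field "value" to float

[[processors.regex]]
  [[processors.regex.tags]]
    key = "topic"              # placeholder tag key
    pattern = "^prod_(.*)$"    # placeholder pattern
    replacement = "${1}"
    result_key = "service"     # new tag created from the regex match
```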

I am using jitter as well to make sure both Telegraf instances are not pushing data to InfluxDB at the same time.
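
For reference, the jitter lives in the agent settings; the intervals below are illustrative, not necessarily our real values:

```toml
[agent]
  interval = "5m"         # matches the 5-minute Kafka data frequency
  flush_interval = "30s"  # illustrative
  flush_jitter = "15s"    # randomizes each flush so the two Telegrafs
                          # do not write to InfluxDB at the same moment
```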

Context Deadline exceeded error

This is generally due to a connection to your input having issues. It could be DNS, network blips, disconnects, a proxy, etc.

what the issue might be and how to debug it

Telegraf’s handling of Kafka connections is all done by sarama. There is no way for us to give insight without the sarama logs. You need to run Telegraf with --debug enabled and see what sarama is reporting.
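
If changing the container’s command line is awkward, the same thing can be set in the config file; the logfile path here is just an example:

```toml
[agent]
  debug = true                                # equivalent to the --debug flag;
                                              # surfaces sarama's client logs
  logfile = "/var/log/telegraf/telegraf.log"  # optional: persist logs across restarts
```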

This is the Telegraf Log -

2023-02-13T06:14:00Z E! [outputs.influxdb_v2] When writing to [http://influxdb_url:8086/]: Post "http://influxdb_url:8086/api/v2/write?bucket=bucket_name&org=org_name": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2023-02-13T06:14:00Z D! [outputs.influxdb_v2] Buffer fullness: 5755 / 300000 metrics
2023-02-13T06:14:00Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)

So, until I restarted the container, we kept seeing the same error; it went on for about 24 hours. Since restarting the container, everything has been working fine for the last 12 hours.
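
For context, here is the output section this timeout applies to; the URL, org, and bucket match the log line above, and the token is a placeholder:

```toml
[[outputs.influxdb_v2]]
  urls = ["http://influxdb_url:8086"]
  token = "${INFLUX_TOKEN}"   # placeholder
  organization = "org_name"
  bucket = "bucket_name"
  timeout = "15s"             # the write is aborted with "context deadline
                              # exceeded" if InfluxDB has not responded by then
```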

Ah, from the InfluxDB output, not Kafka. My comment above about this usually being related to networking issues still stands, though. I am surprised to see this if they are on the same subnet; are they in the same availability zone?

Yes, same availability zone. On InfluxDB, we are running some 100-110 tasks which are separated by 30s/10s offsets, taking 40-45 in total to calculate; apart from that, the load on InfluxDB is not much.

What do your InfluxDB logs show?

Telegraf log and InfluxDB logs - [attached as images]

Do you track the load of the Telegraf system? That system you are using only has 2-3 GB of memory, right?

Every time I have seen this issue it has come down to:

  1. Networking issues, whether DNS, firewall, a network blip, and/or a proxy causing issues
  2. Load-related issues on the Telegraf system (a quick way to track this is sketched below)
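
One way to track that load from Telegraf itself, assuming these inputs are not already enabled:

```toml
[[inputs.cpu]]
  totalcpu = true
  percpu = false

[[inputs.mem]]

[[inputs.internal]]
  collect_memstats = true   # also exposes Telegraf's own buffer and write stats
```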

No, actually the EC2 instance we are using for Telegraf is r5.large; the specs are 2 vCPUs and 16 GB RAM. What should I check for both of those points? I want a solution similar to the one mentioned here - Azure output plugin - properly recover after context deadline exceeded · Issue #10950 · influxdata/telegraf · GitHub. Can you help?

File an issue and we can look at making that change.

Done - InfluxDB output plugin - recover after Context Deadline · Issue #12685 · influxdata/telegraf · GitHub. Thanks. Also, I am seeing the internal metrics flowing throughout the timeline, while the metrics from Kafka are not reaching InfluxDB. What could be the reason for that?