Need help triaging a potential telegraf 1.3.0 bug with kafka_consumer

nirvine_xmatters · May 30, 2017, 10:58pm

Since upgrading to telegraf 1.3.0, we’re encountering a recurring problem. Our setup:

- pulling with kafka consumer into influxdb
- 6 telegraf consumers pulling from 6 kafka partitions

After a few hours of running fine, 1 random telegraf instance out of 6 starts complaining in a loop about basic local collectors like cpu, memory, etc. not being able to collect within the time limit (20s):

2017-05-30T22:07:04Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (20s)
2017-05-30T22:07:04Z E! Error in plugin [inputs.kernel_vmstat]: took longer to collect than collection interval (20s)
2017-05-30T22:07:04Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (20s)
2017-05-30T22:07:06Z E! Error in plugin [inputs.netstat]: took longer to collect than collection interval (20s)
...

At this point, telegraf stops consuming from kafka and writing to influxdb (including from the internal plugin).

A restart of telegraf fixes the problems, and it churns away for hours more before the problem crops up again.

The log messages just before the error loop starts show nothing out of the ordinary.
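For anyone triaging something similar, here is a minimal sketch of how the error loop could be summarized from a telegraf log: it finds the first "took longer to collect" line and counts occurrences per plugin, which helps pin down when the stall began. The function name `summarize_slow_collections` and the regex are my own assumptions, based only on the log lines quoted above; they are not part of telegraf.

```python
import re
from collections import Counter

# Assumed pattern, derived from the log lines quoted in this thread, e.g.:
# 2017-05-30T22:07:04Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (20s)
SLOW_COLLECT = re.compile(
    r"^(?P<ts>\S+) E! Error in plugin \[(?P<plugin>[^\]]+)\]: "
    r"took longer to collect than collection interval \((?P<interval>[^)]+)\)"
)


def summarize_slow_collections(lines):
    """Return (timestamp of first matching line, Counter of plugin -> count).

    Hypothetical helper for triage; lines that do not match are ignored.
    """
    first_ts = None
    counts = Counter()
    for line in lines:
        match = SLOW_COLLECT.match(line.strip())
        if match is None:
            continue
        if first_ts is None:
            first_ts = match.group("ts")
        counts[match.group("plugin")] += 1
    return first_ts, counts


if __name__ == "__main__":
    # Sample input copied from the log excerpt above.
    sample = [
        "2017-05-30T22:07:04Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (20s)",
        "2017-05-30T22:07:04Z E! Error in plugin [inputs.kernel_vmstat]: took longer to collect than collection interval (20s)",
        "2017-05-30T22:07:04Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (20s)",
        "2017-05-30T22:07:06Z E! Error in plugin [inputs.netstat]: took longer to collect than collection interval (20s)",
    ]
    first_ts, counts = summarize_slow_collections(sample)
    print("first slow collection at:", first_ts)
    for plugin, n in counts.most_common():
        print(plugin, n)
```

To run it against a real log, pass an open file object instead of the sample list; the timestamp of the first match is a reasonable place to start looking for what changed.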

daniel · May 30, 2017, 11:21pm

How does Telegraf's CPU/memory usage look while it is still collecting successfully? Is it increasing or stable?

jackzampolin · May 30, 2017, 11:25pm

@nirvine_xmatters Thank you for the bug report! A couple of asks:

- There are some known issues with newer versions of Kafka. What version are you running?
- Can you post your full telegraf config?
- If your issue is not this one, can you open an issue on telegraf to help track this?
- Pull to fix this issue

nirvine_xmatters · May 30, 2017, 11:47pm

I’d say it loses about 100 MB of memory over the previous 4 hours. CPU looks pretty stable at about 99% idle.

nirvine_xmatters · May 30, 2017, 11:53pm

Hi Jack,

We’re using the Confluent package: confluent-kafka-2.11-0.10.1.1-1.noarch

I did in fact upgrade it around the same time as upgrading to telegraf 1.3.0.

[global_tags]
  dc = "den"
  env = "prd"
  role = "xm_telegraf_dataproc"

[agent]
  hostname = "redacted"
  interval = "20s"
  round_interval = true
  metric_buffer_limit = 100000
  flush_buffer_when_full = true
  collection_jitter = "15s"
  flush_interval = "20s"
  flush_jitter = "15s"
  debug = false
  quiet = false

#
# OUTPUTS:
#
[[outputs.influxdb]]
  database = "telegraf"
  password = "telegraf"
  urls = ["http://redacted:8086"]
  username = "telegraf"

#
# INPUTS:
#
[[inputs.conntrack]]
[[inputs.cpu]]
  fielddrop = ["time_*"]
  percpu = true
  totalcpu = true
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]
[[inputs.diskio]]
[[inputs.internal]]
[[inputs.kernel]]
[[inputs.kernel_vmstat]]
[[inputs.mem]]
[[inputs.net]]
  drop = ["net_icmp"]
  interfaces = ["eth0"]
[[inputs.netstat]]
[[inputs.ping]]
  count = 1
  timeout = 1.0
  urls = ["redacted"]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

Regarding the existing bug, TBH I couldn’t say: we generate a lot of metrics and, due to a number of bugs throughout the stack, a not-insignificant portion of them are malformed. But my gut says no.

I opened a bug: "Telegraf 1.3.0 stops sending any metrics after a few hours processing them ... maybe kafka's fault?" (Issue #2870, influxdata/telegraf on GitHub)
