Need help triaging a potential telegraf 1.3.0 bug with kafka_consumer

Since upgrading to telegraf 1.3.0, we’re encountering a recurring problem. Our setup:

  • pulling with kafka consumer into influxdb
  • 6 telegraf consumers pulling from 6 kafka partitions

After a few hours of running fine, 1 random telegraf instance out of 6 starts complaining in a loop about basic local collectors like cpu, memory, etc. not being able to collect within the time limit (20s):

2017-05-30T22:07:04Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (20s)
2017-05-30T22:07:04Z E! Error in plugin [inputs.kernel_vmstat]: took longer to collect than collection interval (20s)
2017-05-30T22:07:04Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (20s)
2017-05-30T22:07:06Z E! Error in plugin [inputs.netstat]: took longer to collect than collection interval (20s)
...

At this point, telegraf stops consuming from kafka and writing to influxdb (including from the internal plugin).

A restart of telegraf fixes the problems, and it churns away for hours more before the problem crops up again.

The log messages just before the error loop starts show nothing out of the ordinary.
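For anyone triaging a similar log, here is a minimal sketch (not part of telegraf) that tallies the "took longer to collect" errors per plugin and reports when the first one appeared. The function name `tally_slow_collections` is made up for illustration, and the regular expression assumes the exact line format quoted above.

```python
import re
from collections import Counter

# Assumed line format, copied from the log excerpt in this post:
# 2017-05-30T22:07:04Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (20s)
SLOW_COLLECT = re.compile(
    r"^(?P<ts>\S+) E! Error in plugin \[(?P<plugin>[^\]]+)\]: "
    r"took longer to collect than collection interval \((?P<interval>[^)]+)\)"
)


def tally_slow_collections(lines):
    """Return (timestamp of first matching error, Counter of plugin -> count).

    Hypothetical helper for illustration; lines that do not match are skipped.
    """
    first_ts = None
    counts = Counter()
    for line in lines:
        match = SLOW_COLLECT.match(line.strip())
        if match is None:
            continue
        if first_ts is None:
            first_ts = match.group("ts")
        counts[match.group("plugin")] += 1
    return first_ts, counts


sample = [
    "2017-05-30T22:07:04Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (20s)",
    "2017-05-30T22:07:04Z E! Error in plugin [inputs.kernel_vmstat]: took longer to collect than collection interval (20s)",
    "2017-05-30T22:07:04Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (20s)",
    "2017-05-30T22:07:06Z E! Error in plugin [inputs.netstat]: took longer to collect than collection interval (20s)",
]
first_ts, counts = tally_slow_collections(sample)
print(first_ts, dict(counts))
```

Feeding it the full log file (for example `tally_slow_collections(open(path))`, with `path` pointing at wherever your telegraf log lives) gives the timestamp at which the loop began, which helps when lining it up with Kafka or InfluxDB events.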

How does the CPU/memory usage by Telegraf look during the time that it is collecting successfully? Is it increasing or stable?

@nirvine_xmatters Thank you for the bug report! A couple of asks:

  • There are some known issues with newer versions of Kafka. What version are you running?
  • Can you post your full telegraf config?
  • If your issue is not this one, can you open an issue on telegraf to help track this?

I’d say it loses about 100 MB of memory over the previous 4 hours. CPU looks pretty stable at about 99% idle.
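To answer the "increasing or stable" question with numbers rather than an estimate, one option is to sample the telegraf process's resident memory at intervals. The sketch below is an assumption-laden illustration: it assumes a Linux host with `/proc`, and the helper names (`parse_vm_rss_kib`, `read_rss_kib`, `rss_growth_kib`) are hypothetical, not part of any telegraf tooling.

```python
import re


def parse_vm_rss_kib(status_text):
    """Extract VmRSS (resident set size, reported in kB) from /proc/<pid>/status text.

    Returns None if the field is absent (e.g. for kernel threads).
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current RSS of a process on Linux. `pid` is the telegraf PID."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vm_rss_kib(handle.read())


def rss_growth_kib(samples_kib):
    """Last sample minus first sample; a positive value means memory grew."""
    if len(samples_kib) < 2:
        return 0
    return samples_kib[-1] - samples_kib[0]


# Example with canned data instead of a live process:
example_status = "Name:\ttelegraf\nVmPeak:\t  900000 kB\nVmRSS:\t  123456 kB\n"
print(parse_vm_rss_kib(example_status))
print(rss_growth_kib([100000, 150000, 260000]))
```

Calling `read_rss_kib(pid)` every few minutes and passing the collected values to `rss_growth_kib` shows whether telegraf's own footprint climbs in the hours before the stall, as opposed to free memory on the host dropping for some other reason.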

Hi Jack,

We’re using the Confluent package: confluent-kafka-2.11-0.10.1.1-1.noarch

I did in fact upgrade it around the same time as upgrading to telegraf 1.3.0.

[global_tags]
  dc = "den"
  env = "prd"
  role = "xm_telegraf_dataproc"

[agent]
  hostname = "redacted"
  interval = "20s"
  round_interval = true
  metric_buffer_limit = 100000
  flush_buffer_when_full = true
  collection_jitter = "15s"
  flush_interval = "20s"
  flush_jitter = "15s"
  debug = false
  quiet = false

#
# OUTPUTS:
#
[[outputs.influxdb]]
  database = "telegraf"
  password = "telegraf"
  urls = ["http://redacted:8086"]
  username = "telegraf"

#
# INPUTS:
#
[[inputs.conntrack]]
[[inputs.cpu]]
  fielddrop = ["time_*"]
  percpu = true
  totalcpu = true
[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs"]
[[inputs.diskio]]
[[inputs.internal]]
[[inputs.kernel]]
[[inputs.kernel_vmstat]]
[[inputs.mem]]
[[inputs.net]]
  drop = ["net_icmp"]
  interfaces = ["eth0"]
[[inputs.netstat]]
[[inputs.ping]]
  count = 1
  timeout = 1.0
  urls = ["redacted"]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

Regarding the existing bug, TBH I couldn’t say: we generate a lot of metrics and, due to a number of bugs throughout the stack, a not-insignificant portion of them are malformed. But my gut says no.

I opened a bug: https://github.com/influxdata/telegraf/issues/2870
