Since upgrading to telegraf 1.3.0, we’ve been hitting a recurring problem consuming from kafka with the kafka_consumer input and writing to influxdb. The setup is 6 telegraf consumers, each pulling from one of 6 kafka partitions.
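For reference, each instance’s config is roughly the sketch below. Addresses, the topic, and the database name are placeholders, the local input list is taken from the errors further down, and the kafka_consumer options are the zookeeper-based ones that 1.3 ships with:

```toml
# telegraf.conf (sketch; hostnames, topic, and database are placeholders)
[agent]
  interval = "20s"                 # the 20s limit cited in the errors below

[[inputs.kafka_consumer]]
  topics = ["metrics"]                          # placeholder topic
  zookeeper_peers = ["zk-1:2181", "zk-2:2181"]  # placeholder peers
  consumer_group = "telegraf_metrics_consumers"
  offset = "oldest"
  data_format = "influx"

# the basic local collectors that start erroring
[[inputs.cpu]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.kernel_vmstat]]

# self-monitoring metrics, mentioned below
[[inputs.internal]]

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]               # placeholder URL
  database = "telegraf"
```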
After a few hours of running fine, one random instance out of the 6 starts complaining in a loop that basic local collectors (cpu, mem, etc.) took longer to collect than the 20s collection interval:
2017-05-30T22:07:04Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (20s)
2017-05-30T22:07:04Z E! Error in plugin [inputs.kernel_vmstat]: took longer to collect than collection interval (20s)
2017-05-30T22:07:04Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (20s)
2017-05-30T22:07:06Z E! Error in plugin [inputs.netstat]: took longer to collect than collection interval (20s)
...
At this point, telegraf stops consuming from kafka and stops writing to influxdb altogether (even the metrics from the internal plugin stop arriving).
Restarting telegraf fixes the problem, and it churns away for hours more before the problem crops up again.
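Since this smells like a deadlock, next time an instance wedges I’ll try to grab a goroutine dump before bouncing it. A rough sketch of what I have in mind, assuming the service runs under systemd and GOTRACEBACK is at its Go default (where SIGQUIT makes the process print all goroutine stacks and exit):

```sh
# SIGQUIT makes a Go binary dump every goroutine's stack to stderr and exit
# (default GOTRACEBACK behavior); under systemd the dump lands in the journal.
kill -QUIT "$(pidof telegraf)"
journalctl -u telegraf -n 500 --no-pager > telegraf-goroutines.txt
systemctl restart telegraf   # bring the instance back afterwards
```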
The log messages just before the error loop begins show nothing out of the ordinary.
As for whether this is the same as the existing bug, TBH I couldn’t say: we generate a lot of metrics, and due to a number of bugs throughout the stack, a not-insignificant portion of them are malformed. But my gut says no.
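For context, by “malformed” I mean points that fail influxdb line-protocol parsing. Purely illustrative examples (not our actual data):

```
# valid line protocol
weather,location=us-midwest temperature=82 1465839830100400200

# malformed: unescaped space in a tag value
weather,location=us midwest temperature=82 1465839830100400200

# malformed: no field set at all
weather,location=us-midwest 1465839830100400200
```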