Telegraf crashes on high load

tarunchabarwal · February 15, 2021, 1:13pm

After a certain threshold of metrics, Telegraf starts to crash and I get the following logs:

2021-02-15T12:49:26Z I! Starting Telegraf 1.11.0
2021-02-15T12:49:26Z I! Loaded inputs: kernel swap system net netstat cpu disk diskio mem processes jolokia2_agent procstat statsd
2021-02-15T12:49:26Z I! Loaded aggregators:
2021-02-15T12:49:26Z I! Loaded processors:
2021-02-15T12:49:26Z I! Loaded outputs: prometheus_client
2021-02-15T12:49:26Z I! Tags enabled: cloud_provider=aws host=***
2021-02-15T12:49:26Z I! [agent] Config: Interval:15s, Quiet:false, Hostname:"***", Flush Interval:4s
2021-02-15T12:49:26Z W! [inputs.statsd] The parse_data_dog_tags option is deprecated, use datadog_extensions instead.
2021-02-15T12:49:26Z I! [inputs.statsd] Statsd UDP listener listening on: [::]:8130
2021-02-15T12:49:26Z I! [inputs.statsd] Started the statsd service on :8130
2021-02-15T12:55:26Z I! [inputs.statsd] Stopping the statsd service
2021-02-15T12:55:26Z I! Stopped Statsd listener service on :8130
2021-02-15T12:55:26Z I! [agent] Hang on, flushing any cached metrics before shutdown
2021-02-15T12:55:26Z I! Starting Telegraf 1.11.0

Host: ubuntu-14

I also tried 1.17.2, with the same result. A lot of metrics are being written to statsd via the Datadog emitter (UDP).

What is causing these restarts?

Pooh · February 15, 2021, 1:34pm

> After a certain threshold of metrics,

What threshold?

> Telegraf starts to crash and I get the following logs:

Are the timestamps of the logs you show coincident with a crash?

> A lot of metrics are being written to statsd via the Datadog emitter (UDP).

Please define “a lot”.

> What is causing these restarts?

How long have you been running telegraf before this started happening?

How long has the problem been noticeable?

How often does it happen?

Have you increased the quantity of data coming in shortly before the problem
started occurring?

What do iotop, htop or iftop tell you about disk, CPU and network activity
shortly before a crash occurs?

Antony.
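
For the network side of these questions, one check worth considering on a Linux host is whether the kernel is dropping UDP datagrams before the statsd listener reads them. The sketch below is only an illustration and is not something anyone in this thread ran: it parses the `Udp:` section of `/proc/net/snmp` and reports how the `InErrors` and `RcvbufErrors` counters change over a short interval. The 5-second interval is an arbitrary assumption.

```python
import time


def parse_udp_counters(snmp_text: str) -> dict:
    """Parse the 'Udp:' header and value lines of /proc/net/snmp into a dict."""
    # The file holds pairs of lines per protocol: a header row of counter
    # names, then a row of values. "UdpLite:" lines are skipped by the
    # exact "Udp:" prefix match.
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        raise ValueError("no Udp section found")
    names, values = rows[0], rows[1]
    return dict(zip(names, (int(v) for v in values)))


def read_udp_counters(path: str = "/proc/net/snmp") -> dict:
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_udp_counters(f.read())


if __name__ == "__main__":
    before = read_udp_counters()
    time.sleep(5)  # assumed sampling interval
    after = read_udp_counters()
    for key in ("InDatagrams", "InErrors", "RcvbufErrors"):
        if key in before and key in after:
            print(f"{key}: +{after[key] - before[key]} in 5s")
```

A steadily growing `RcvbufErrors` count while the emitter is running would suggest datagrams are arriving faster than they are being read.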

tarunchabarwal · February 17, 2021, 2:53pm

Everything seemed normal from the system’s perspective. I have ~600 time series with ~50 histograms. I observed that one of the histograms was receiving 3.5K metrics/sec. At this rate the telegraf process was using ~25% CPU on the main thread.

Sampling this metric at a rate of 0.05 works fine for my use case.

I think statsd had trouble processing that many metrics.
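
For readers who want to try the same workaround, here is a minimal sketch of client-side sampling, assuming a plain UDP sender rather than the actual Datadog client used above. The metric name `request.latency` and the helper names are hypothetical; the port 8130 is taken from the log in the first post. At a rate of 0.05, roughly 3,500 × 0.05 = 175 datagrams per second would reach the listener instead of 3,500. The `|@0.05` suffix tells the receiver what sample rate was applied; how the receiver uses it depends on the server and metric type.

```python
import random
import socket


def format_histogram(name: str, value: float, rate: float) -> bytes:
    """Build a DogStatsD-style histogram datagram: name:value|h|@rate."""
    return f"{name}:{value}|h|@{rate}".encode("ascii")


def send_sampled(sock, addr, name, value, rate, rng=random) -> bool:
    """Send the metric with probability `rate`; return True if it was sent."""
    if rng.random() >= rate:
        return False  # dropped on the client, never reaches the network
    sock.sendto(format_histogram(name, value, rate), addr)
    return True


if __name__ == "__main__":
    # Assumption: a statsd listener on localhost, port 8130 as in the log above.
    addr = ("127.0.0.1", 8130)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent = sum(
        send_sampled(sock, addr, "request.latency", 12.5, 0.05)
        for _ in range(3500)
    )
    print(f"sent {sent} of 3500 samples")
    sock.close()
```

Official statsd client libraries generally accept a sample-rate argument that does the same thing, so hand-rolled code like this is only needed to see what happens on the wire.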
