Telegraf crashes on high load

After a certain threshold of metrics, telegraf starts to crash and I get the following logs:

2021-02-15T12:49:26Z I! Starting Telegraf 1.11.0
2021-02-15T12:49:26Z I! Loaded inputs: kernel swap system net netstat cpu disk diskio mem processes jolokia2_agent procstat statsd
2021-02-15T12:49:26Z I! Loaded aggregators:
2021-02-15T12:49:26Z I! Loaded processors:
2021-02-15T12:49:26Z I! Loaded outputs: prometheus_client
2021-02-15T12:49:26Z I! Tags enabled: cloud_provider=aws host=***
2021-02-15T12:49:26Z I! [agent] Config: Interval:15s, Quiet:false, Hostname:"***", Flush Interval:4s
2021-02-15T12:49:26Z W! [inputs.statsd] The parse_data_dog_tags option is deprecated, use datadog_extensions instead.
2021-02-15T12:49:26Z I! [inputs.statsd] Statsd UDP listener listening on: [::]:8130
2021-02-15T12:49:26Z I! [inputs.statsd] Started the statsd service on :8130
2021-02-15T12:55:26Z I! [inputs.statsd] Stopping the statsd service
2021-02-15T12:55:26Z I! Stopped Statsd listener service on :8130
2021-02-15T12:55:26Z I! [agent] Hang on, flushing any cached metrics before shutdown
2021-02-15T12:55:26Z I! Starting Telegraf 1.11.0

Host: ubuntu-14

I also tried 1.17.2 and got the same result. A lot of metrics are being written to statsd via the Datadog emitter (UDP).

What is causing these restarts?

> After a certain threshold of metrics,

What threshold?

> telegraf starts to crash and I get following logs:

Are the timestamps of the logs you show coincident with a crash?

> A lot of metrics are getting written down to statsd via datadog emitter(udp)

Please define “a lot”.

> What is causing these restarts ?

How long have you been running telegraf before this started happening?

How long has the problem been noticeable?

How often does it happen?

Did you increase the quantity of data coming in shortly before the problem
started occurring?

What do iotop, htop or iftop tell you about disk, CPU and network activity
shortly before a crash occurs?

Antony.

Everything seemed normal from the system's perspective. I have ~600 time series with ~50 histograms. I observed that one of the histograms was receiving 3.5K metrics/sec. At this rate, the telegraf process was using ~25% CPU on its main thread.

Sampling this metric at a rate of 0.05 works fine for my use case.
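
For anyone landing here with the same problem, this is roughly what client-side sampling looks like on the wire. It is only a sketch in Python, not the Datadog emitter's actual code: the metric name `request.latency`, the helper names, and the 127.0.0.1 address are made up for illustration (the port 8130 is the one from the logs above), and it assumes the `|h` histogram type with a `|@rate` suffix. At a rate of 0.05, a 3.5K/sec stream averages about 175 datagrams/sec. In practice the emitter you use probably has its own sample-rate setting, so check that first.

```python
import random
import socket

SAMPLE_RATE = 0.05  # the rate mentioned in this thread


def format_histogram(name: str, value: float, rate: float) -> str:
    """Build a statsd histogram line; append |@rate only when sampling."""
    line = f"{name}:{value}|h"
    if rate < 1.0:
        line += f"|@{rate}"
    return line


def should_send(rate: float, rng=random.random) -> bool:
    """Return True for roughly `rate` of all calls."""
    return rng() < rate


def send_sampled(sock, addr, name, value, rate=SAMPLE_RATE, rng=random.random) -> bool:
    """Send the metric over UDP only if it is selected by the sampler."""
    if not should_send(rate, rng):
        return False
    sock.sendto(format_histogram(name, value, rate).encode("ascii"), addr)
    return True


if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    addr = ("127.0.0.1", 8130)  # hypothetical host; port taken from the logs
    # Simulate one second of the busy histogram (3,500 observations).
    sent = sum(send_sampled(sock, addr, "request.latency", 12.5) for _ in range(3500))
    print(f"sent {sent} of 3500 observations")
```

The `|@0.05` suffix tells the receiver that each datagram stands for about 20 observations, so it can account for the samples that were dropped on the client.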

I think the statsd input had trouble processing that many metrics.