Hi, I’m running a Telegraf (v1.9.1) pod alongside an application in Kubernetes. We’re sending a lot of metrics and running into a “statsd message queue full” error:
2019-12-02T00:57:24Z E! Error: statsd message queue full. We have dropped 901920000 messages so far. You may want to increase allowed_pending_messages in the config
We’ve increased allowed_pending_messages to about 80000, but the problem persists. We’re only running one Telegraf pod because we thought horizontally scaling it would affect the statsd aggregation. Would it be possible to run multiple Telegraf pods, or would that make the aggregated statsd measurements inaccurate?
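For reference, this is roughly what our statsd input section looks like; the values shown here are illustrative, not our exact production config:

```toml
[[inputs.statsd]]
  ## UDP address the statsd listener binds to
  service_address = ":8125"

  ## Number of messages allowed to queue up between flushes to the
  ## collection pipeline; packets arriving beyond this are dropped
  ## (which produces the "statsd message queue full" error above).
  allowed_pending_messages = 80000
```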
Splitting the same metrics across two Telegraf instances would affect the aggregated data, but if you shard the metrics consistently, so that a given metric name always goes to the same instance, you could send to multiple Telegraf instances.
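A minimal sketch of what consistent sharding on the client side could look like: hash the metric name to pick one of several Telegraf endpoints, so every sample for that name lands on the same instance and its statsd aggregation stays correct. The endpoint names here are hypothetical; adjust them to your deployment.

```python
import socket
import zlib

# Hypothetical Telegraf statsd endpoints, e.g. per-pod Services in Kubernetes.
TELEGRAF_ENDPOINTS = [("telegraf-0", 8125), ("telegraf-1", 8125)]


def pick_endpoint(metric_name: str) -> tuple:
    """Deterministically map a metric name to one endpoint, so all samples
    for that name are aggregated by the same Telegraf instance."""
    idx = zlib.crc32(metric_name.encode("utf-8")) % len(TELEGRAF_ENDPOINTS)
    return TELEGRAF_ENDPOINTS[idx]


def send_statsd(metric_name: str, value: float, metric_type: str = "c") -> None:
    """Send one statsd datagram (e.g. "app.requests:1|c") to the shard
    chosen for this metric name."""
    host, port = pick_endpoint(metric_name)
    payload = f"{metric_name}:{value}|{metric_type}".encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
```

The key property is that `pick_endpoint` depends only on the metric name, so counters and timers are never split across instances.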
I’m curious how many statsd metrics you are sending; do you know your approximate rate?
I added the Telegraf internal plugin, and it reports the gather rate for statsd at about 130 metrics/sec.
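For anyone following along, enabling Telegraf’s self-monitoring is a one-line input section; this is a sketch, not our exact config:

```toml
[[inputs.internal]]
  ## Collect Telegraf's own runtime and per-plugin metrics
  ## (e.g. metrics_gathered per input), plus Go memstats.
  collect_memstats = true
```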
That doesn’t seem very high, but it occurs to me now that each of those metrics can be made up of any number of statsd messages due to the aggregation, so the number isn’t very helpful in its current form. I could add some additional counters to the plugin, though; would you be able to test a development build?
I opened an issue on GitHub (#6779). Right now I’m very busy finalizing the 1.13.0 release, but I’ll work on this later this week and will add links for testing on the issue.
Thanks. I was wondering: our agent interval was at 60s; would there have been fewer pending messages if the interval were 10s?
I wouldn’t expect it to make much of a difference.
Sorry about the delay, I added some build links to telegraf #6921 and some queries that I’m interested in to telegraf #6919.