Hi, I’m running a Telegraf (v1.9.1) pod alongside an application in Kubernetes. We’re sending a lot of metrics and running into a “statsd message queue full” error:
2019-12-02T00:57:24Z E! Error: statsd message queue full. We have dropped 901920000 messages so far. You may want to increase allowed_pending_messages in the config
We’ve increased allowed_pending_messages to about 80000, but the problem persists. We’re only running one Telegraf pod because we thought horizontally scaling it would affect the statsd aggregation. Would it be possible to run multiple Telegraf pods, or would that cause the statsd aggregate measurements to be inaccurate?
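For context, our statsd input config currently looks roughly like this (the listen address and the delete/percentile settings are just the defaults from the sample config; allowed_pending_messages is the value we raised):

```toml
[[inputs.statsd]]
  ## Address and port to listen on for statsd UDP packets
  service_address = ":8125"

  ## Number of messages allowed to queue up; once the queue fills,
  ## incoming messages are dropped (default is 10000, raised to 80000)
  allowed_pending_messages = 80000

  ## Reset gauges, counters, sets, and timings at every flush interval
  delete_gauges = true
  delete_counters = true
  delete_sets = true
  delete_timings = true

  ## Percentiles to calculate for timing and histogram stats
  percentiles = [90]
```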
Splitting the same metrics across two Telegraf instances would affect the aggregated data, but if you shard the metrics consistently, so that a given metric name always goes to the same instance, you could send to multiple Telegraf pods.
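As a rough sketch of what I mean, here is consistent sharding done on the sending side (Telegraf won’t do this for you; the hostnames and the hashing choice below are just placeholders). Each metric name is hashed to pick one Telegraf listener, so every sample of a given metric is aggregated by the same instance:

```python
import socket
import zlib

# Placeholder addresses for the statsd listeners of two Telegraf pods.
TELEGRAF_SHARDS = [("telegraf-0", 8125), ("telegraf-1", 8125)]

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_statsd(name: str, value: float, metric_type: str = "c") -> None:
    """Send a statsd line to the shard chosen by hashing the metric name,
    so each metric is always aggregated by the same Telegraf instance."""
    shard = TELEGRAF_SHARDS[zlib.crc32(name.encode()) % len(TELEGRAF_SHARDS)]
    sock.sendto(f"{name}:{value}|{metric_type}".encode(), shard)

# Every sample of "requests.count" lands on the same Telegraf pod.
send_statsd("requests.count", 1)
```

As long as the shard for a given name is stable, each Telegraf sees the complete stream for the metrics it owns and the aggregates stay correct.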
I’m curious how many statsd metrics you are sending; do you know your approximate rate?
That doesn’t seem very high, but it occurs to me now that each of those metrics can be made up of any number of statsd messages due to the aggregation, so that number isn’t very helpful in its current form. I could add some additional counters to the plugin, though. Would you be able to test a development build?
I opened an issue on GitHub (#6779). Right now I’m very busy finalizing the 1.13.0 release, but I’ll work on this later this week and will add links for testing to the issue.