Telegraf mqtt_consumer, how to increase consumption speed


I’m using Telegraf to consume messages from an MQTT server. Most of my messages are QoS 2, and I receive more than 1000 metrics per second.
My issue is that Telegraf is not consuming the messages fast enough, so they accumulate in the broker, where I often have about 100,000 messages waiting for an ack.

Right now my Telegraf is set up like this:

metric_batch_size: 10000
flush_interval: 30s
interval: 10s
metric_buffer_limit: 50000

max_undelivered_messages= 5000

The MQTT consumer reads messages as they become available, as long as it has room under the max_undelivered_messages limit. The interval option does not apply to this plugin.

As such, what your config is saying is: read up to 5,000 messages in total, then attempt to write 10,000 messages every 30 seconds. This means the MQTT consumer will effectively only ever read 5,000 messages every 30 seconds.
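
To make the arithmetic concrete, here is a sketch of those settings written in telegraf.conf (TOML) form. It assumes the values from the question map directly onto the `[agent]` and `[[inputs.mqtt_consumer]]` tables; the broker address, topic, `qos` and `data_format` values are placeholders that were not given in the question.

```toml
# Sketch only: the settings from the question in telegraf.conf form.
[agent]
  interval = "10s"              # ignored by mqtt_consumer (it is not polled)
  flush_interval = "30s"        # outputs are flushed every 30 s ...
  metric_batch_size = 10000     # ... or earlier, once 10,000 metrics are queued
  metric_buffer_limit = 50000

[[inputs.mqtt_consumer]]
  servers = ["tcp://broker.example:1883"]  # placeholder address
  topics = ["sensors/#"]                   # placeholder topic
  qos = 2                                  # assumption: subscribe at QoS 2
  data_format = "influx"                   # assumption: payload format
  # At most 5,000 messages may be in flight before they are written out.
  # 5,000 < metric_batch_size, so the batch never fills and every flush
  # waits for the full interval: 5000 / 30 s is roughly 166 metrics/s.
  max_undelivered_messages = 5000
```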

Thank you, Josh, for your answer. Are you saying I should match max_undelivered_messages with metric_batch_size?

It depends on what your goal is :)

If you have no other input plugins and only want to read more metrics, then yes, increasing the max_undelivered_messages option to fill the metric batch size would read more metrics. If you want to read metrics faster, another option is to flush more often, that is, to reduce the flush_interval as well.

Yes, MQTT is the main source of metrics here.
Well, that’s really interesting. With the configuration described above, I was stuck at 166 metrics/s both gathered and written, which matches 5000 metrics / 30 seconds, and some MQTT QoS 0 messages were simply lost.
The write_buffer size was also stuck at 4.88 kB.
Now that I’ve pushed max_undelivered_messages to 10k, the written rate fluctuates between 260 and 280 metrics/s, which makes more sense.
Also, the write_buffer size is now around 200 B. This one I cannot explain!

But I believe it means I will have to increase max_undelivered_messages whenever the rate of incoming messages increases.

@cyril.jean Telegraf collects messages until either flush_interval (±jitter) has elapsed or metric_batch_size metrics have arrived. In your case, you receive 5k messages (the limit set by your max_undelivered_messages setting) and then Telegraf waits for the 30 seconds (flush_interval) to pass.
So you are filling in the 5k messages in the first 5 seconds and then waiting 25 seconds for the flush, because the metric_batch_size is never reached.

As a solution, I would increase your max_undelivered_messages to, say, twice the metric_batch_size. Make sure that metric_buffer_limit is still larger than the batch size by a margin (say, a factor of 2 or more). You can additionally reduce the flush_interval to control the maximum latency of your metrics if the rate drops for some reason.
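
As a rough illustration of that suggestion, the relevant parts of telegraf.conf might look like the sketch below. The numbers simply apply the ratios mentioned above to the batch size from the question; the shorter flush_interval, the broker address, the topic, `qos` and `data_format` are assumptions for the example, not tested recommendations.

```toml
# Sketch only: values derived from the ratios suggested above.
[agent]
  metric_batch_size = 10000
  metric_buffer_limit = 50000   # 5x the batch size, above the suggested 2x margin
  flush_interval = "10s"        # assumption: shorter interval to bound latency

[[inputs.mqtt_consumer]]
  servers = ["tcp://broker.example:1883"]  # placeholder address
  topics = ["sensors/#"]                   # placeholder topic
  qos = 2                                  # assumption: subscribe at QoS 2
  data_format = "influx"                   # assumption: payload format
  # Twice metric_batch_size, so a full batch can accumulate and trigger
  # a write before flush_interval expires.
  max_undelivered_messages = 20000
```

With these values the batch size, rather than the flush timer, should decide when writes happen while the message rate is high; the flush_interval only matters when the rate drops.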