I am running Telegraf v1.29.4 (git: HEAD@4441c4ed) to consume InfluxDB line protocol formatted data from Kafka and output it into Postgres. Until today I had one Kafka topic (“topic A”) with ~4-5 machines’ worth of metrics flowing, about 14K metrics per minute. Kafka, Telegraf, and Postgres all run on a 16-core machine, and overall CPU load is very light, roughly 5-10%.
Today I added a second Kafka topic (“topic B”) with a single machine’s worth of metrics, about 600 metrics per minute. After doing this, Telegraf started to fall behind badly on topic A, by at least 20-30 minutes over the few hours it was running. If I remove topic B from the configuration and restart, Telegraf catches up within a few seconds as expected. I’ve tried both the “topics” and “topic_regexps” settings to see if there was a difference, and I’ve also played with the other configuration settings around balancing, message sizes, etc., without any noticeable change in performance. I’ve made sure the Kafka consumer group used doesn’t conflict with any other in use. FWIW, I originally encountered this issue on v1.26 and upgraded to v1.29.4 hoping the new version would fix it.
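For reference, here is a minimal sketch of the kind of two-topic kafka_consumer setup I’m describing (the broker address, topic names, and consumer group below are placeholders, not my exact settings):

```toml
# Sketch of a single kafka_consumer input subscribed to both topics.
# Broker address, topic names, and consumer group are placeholders.
[[inputs.kafka_consumer]]
  brokers = ["localhost:9092"]
  topics = ["topicA", "topicB"]        # or topic_regexps = ["topic[AB]"]
  consumer_group = "telegraf_pg_writer"
  data_format = "influx"
```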
Try running separate instances of “top -H -p <pid>” against all components in the chain: Telegraf, Postgres, and anything else.
Look for:
one of the processes sitting in the “D” state (uninterruptible sleep, typically waiting on I/O)
Compared with the good configuration, what differences do you see in CPU and memory usage? As unlikely as it may be, is one of the threads CPU-limited and unable to go faster than one core?
When the problem configuration is running, is all data lagging, or only the data for machine B?
Do you have an independent view on the Kafka side to prove the lag isn’t on that side of the fence?
Out of curiosity, do you have numbers on how fast data can be written to Postgres independently of Telegraf? I.e. scripted synthetic data with the same variation as what Telegraf writes, inserted directly on the Postgres machine?
When you talk about lag, we typically mean latency, which does not seem to be the case here. Instead, this is about the ability to consume messages in a timely manner, since you mention falling behind.
Keep in mind that the consumer plugins (kafka, mqtt, etc.) will only consume up to the max_undelivered_messages value, which defaults to 1000 messages at any given time.
This means Telegraf will only read up to 1000 messages from your Kafka instance before it has to deliver them to the output. If your flush interval is too large or your batch size too small, you may be slowing down your consumption of messages.
My suggestion is to start increasing the max_undelivered_messages value, after you have verified that the flush interval and batch size are not already the bottleneck.
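For example, a sketch of the knobs involved (the values here are arbitrary starting points to experiment with, not recommendations):

```toml
[agent]
  ## Larger batches and a shorter flush interval let the output drain
  ## metrics faster, so the consumer can acknowledge and read more.
  metric_batch_size = 5000        # default 1000
  metric_buffer_limit = 50000     # default 10000
  flush_interval = "5s"           # default 10s

[[inputs.kafka_consumer]]
  ## Allow more unacknowledged messages in flight at once (default 1000).
  max_undelivered_messages = 10000
```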
The data was lagging for topic “A” (~5 machines’ worth of data); topic “B” (1 machine) was fine. I just reconfigured my setup temporarily to have two Telegraf instances running in separate containers, one consuming from topic “A” and the other from topic “B”, both inserting into the same PG table. So far, running independently, everything stays up to date.
When I started the Telegraf instance servicing only topic “A”, it consumed several hours of data from Kafka and inserted it into PG within ~15-20 seconds. The slowdown I am observing only seems to manifest when a single Telegraf instance is configured to subscribe to 2 topics. I know others are doing this successfully, so it’s puzzling.
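Roughly, each container now runs a config along these lines (details simplified, broker and connection strings are placeholders):

```toml
# Instance 1: consumes only topic A.
# Instance 2 is the same except topics = ["topicB"] and its own consumer_group.
[[inputs.kafka_consumer]]
  brokers = ["localhost:9092"]
  topics = ["topicA"]
  consumer_group = "telegraf_topicA"
  data_format = "influx"

[[outputs.postgresql]]
  ## Both instances write into the same Postgres table.
  connection = "host=localhost user=telegraf dbname=metrics"
```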
In terms of load, this is a very lightly loaded machine at the moment: Ryzen 5950X (16 cores), 128GB RAM, WD SN850X (PCIe 4.0), system CPU load < 5%, negligible general I/O load, etc.
I will see if I can do some synthetic testing if I have time this weekend.