Best deployment strategy for production: one Telegraf per NE or one per NE hardware type?

We are working on deploying Telegraf as an SNMP and gNMI collector, but we’ve seen that running one Telegraf process per Network Element (NE) uses a lot of threads, on average 11 threads per NE. We did this so that one or more NEs that stop answering or respond slowly would not slow down collection for the other NEs.

We have around 6k NEs to monitor.
Setup: Telegraf → Kafka → Kafka-Connect → Snowflake

Or would it be better to run one Telegraf process per NE hardware type instead?

Monitoring all 6k Telegraf processes (making sure we don’t drop any metrics) is also starting to feel a bit challenging, hence my post :)

How do others deploy Telegraf? What’s the best way performance-wise?
Is it possible to use SNMP GETBULK with Telegraf?

Thank you!

Hello @pelas,
Welcome! Maybe a hybrid approach? Something like:
Use one Telegraf per hardware type or region, and assign each Telegraf instance a subset of NEs to poll. You could also use a load balancer or round robin to distribute the load across multiple Telegraf instances.

I’d also take a look at:

For SNMP GETBULK, the snmp input plugin has this setting:

  ## The GETBULK max-repetitions parameter.
  # max_repetitions = 10
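
For context, here is a rough sketch of where that setting sits in an snmp input. The agent address, the community string, and the two interface OIDs are placeholder assumptions, not something from this thread. As far as I know, GETBULK only exists in SNMP v2c/v3, and the setting mainly matters for table walks.

```toml
[[inputs.snmp]]
  ## Placeholder device address (documentation range), not from this thread
  agents = ["udp://192.0.2.10:161"]
  ## GETBULK is not available in SNMP v1, so use 2 (v2c) or 3
  version = 2
  community = "public"  # placeholder

  ## The GETBULK max-repetitions parameter.
  max_repetitions = 10

  ## Example table: one row per interface (numeric OIDs, so no MIB files needed)
  [[inputs.snmp.table]]
    name = "interface"

    [[inputs.snmp.table.field]]
      name = "ifDescr"
      oid = ".1.3.6.1.2.1.2.2.1.2"
      is_tag = true

    [[inputs.snmp.table.field]]
      name = "ifInOctets"
      oid = ".1.3.6.1.2.1.2.2.1.10"
```

A higher value means fewer round trips per table walk but larger responses, so it is probably worth testing per hardware type.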

@srebhan any thoughts here? Thank you so much!

When you say NE do you simply mean a single router, switch, firewall, etc.? Just something along the lines of one network device?

Indeed, a Network Element (NE) is a router, switch, firewall, etc.

You don’t need one Telegraf service per device; you just need a shorter timeout in your Telegraf config so that devices that are not responding fail fast. In our setup, each Telegraf server monitors everything local to the site it runs in, and we never have issues with missing metrics because of non-responsive devices. Granted, at a single site we are only monitoring 20-25 devices, but we also have almost 2000 sites. You just need to find a way to split your devices into smaller chunks that fit your environment, because no, you don’t want an agent list of 6,000 devices in Telegraf. However, you also don’t need 6,000 Telegraf services for 6,000 monitored devices.
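
A minimal sketch of what one such chunk could look like, assuming SNMP v2c. The addresses, community string, interval, and the timeout/retries values are illustrative placeholders, not our production settings.

```toml
[agent]
  ## Collection interval; pick one that leaves room for timeouts
  interval = "60s"

[[inputs.snmp]]
  ## One chunk of devices (for example, one site). Placeholder addresses.
  agents = [
    "udp://192.0.2.11:161",
    "udp://192.0.2.12:161",
    "udp://192.0.2.13:161",
  ]
  version = 2
  community = "public"  # placeholder

  ## Fail fast on non-responsive devices: short timeout, few retries
  timeout = "2s"
  retries = 1

  ## Example scalar: sysUpTime.0
  [[inputs.snmp.field]]
    name = "uptime"
    oid = ".1.3.6.1.2.1.1.3.0"
```

With values like these, a dead device should cost roughly timeout × (retries + 1) per request, so keep that well below the collection interval. Each chunk can then live in its own config file or its own Telegraf instance, whichever is easier to manage in your environment.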