Best deployment strategy for production, 1x Telegraf per NE or per NE hardware type?

pelas · November 8, 2024, 6:03pm

We are deploying Telegraf as SNMP and gNMI collectors, but we’ve seen that running one Telegraf process per Network Element (NE) uses a lot of threads, on average 11 threads per NE. We did this so that one or more NEs failing to answer, or answering slowly, would not slow down collection for the other NEs.

We have around 6k NEs to monitor.
Setup: Telegraf → Kafka → Kafka-Connect → Snowflake

Or is it better to run one Telegraf process per NE hardware type instead?

Monitoring all 6k Telegraf processes (making sure we don’t drop any metrics) is also starting to feel a bit challenging, hence my post.

How do others deploy Telegraf? What’s the best approach performance-wise?
Is it possible to use SNMP bulk requests (GETBULK) with Telegraf?

Thank you!

Anaisdg · November 18, 2024, 3:41pm

Hello @pelas,
Welcome! Maybe a hybrid approach? Something like:
Use one Telegraf instance per hardware type or region and assign each instance a subset of NEs to poll. You could perhaps also use a load balancer or round robin to distribute the load across multiple Telegraf instances.

I’d also take a look at:

For SNMPBulk there is:

  ## The GETBULK max-repetitions parameter.
  # max_repetitions = 10
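
For context, here is a minimal sketch of where that option sits in an `inputs.snmp` block. The agent addresses, the community string, and the choice of `IF-MIB::ifTable` are placeholder assumptions, and the symbolic OIDs assume the relevant MIBs are installed on the Telegraf host. GETBULK exists only in SNMP v2c and v3, so `version` should not be 1 if bulk requests are wanted.

```toml
[[inputs.snmp]]
  ## Placeholder agents (documentation address range), not real devices
  agents = ["udp://192.0.2.10:161", "udp://192.0.2.11:161"]

  ## SNMP v2c; GETBULK is not part of SNMP v1
  version = 2
  community = "public"

  ## The GETBULK max-repetitions parameter.
  max_repetitions = 10

  ## Table walks are where bulk requests matter most
  [[inputs.snmp.table]]
    oid = "IF-MIB::ifTable"
    name = "interface"

    [[inputs.snmp.table.field]]
      oid = "IF-MIB::ifDescr"
      name = "ifDescr"
      is_tag = true
```

Larger `max_repetitions` values mean fewer round trips per table walk, but some devices may respond poorly to large values, so it is worth testing per hardware type.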

github.com

influxdata/telegraf/blob/master/plugins/inputs/snmp/README.md

# SNMP Input Plugin

The `snmp` input plugin uses polling to gather metrics from SNMP agents.
Support for gathering individual OIDs as well as complete SNMP tables is
included.

## Note about Paths

Path is a global variable, separate snmp instances will append the specified
path onto the global path variable

## Global configuration options <!-- @/docs/includes/plugin_config.md -->

In addition to the plugin-specific configuration settings, plugins support
additional global and plugin configuration settings. These settings are used to
modify metrics, tags, and field or create aliases and configure ordering, etc.
See the [CONFIGURATION.md][CONFIGURATION.md] for more details.

[CONFIGURATION.md]: ../../../docs/CONFIGURATION.md#plugins


@srebhan any thoughts here? thank you so much!

npm_engineer · November 21, 2024, 3:15am

When you say NE do you simply mean a single router, switch, firewall, etc.? Just something along the lines of one network device?

pelas · November 25, 2024, 9:30am

Indeed, Network Element (NE), as in a router, switch, firewall etc.

npm_engineer · November 25, 2024, 9:55am

You don’t need one Telegraf service per device; you just need a shorter timeout in your Telegraf config for the devices that don’t respond. In our setup, each Telegraf server monitors everything local to its own site, and we never have issues with missing metrics because of non-responsive devices. Granted, we only monitor 20-25 devices at a single site, but we also have almost 2000 sites. You need to find a way to split the devices into smaller chunks that fits your environment, because no, you don’t want an agent list of 6,000 devices in one Telegraf instance. However, you also don’t need 6,000 Telegraf services for 6,000 monitored devices.
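
As a rough sketch of that idea, one Telegraf instance could poll a small chunk of devices with a short timeout and few retries, so that an unresponsive device gives up quickly. The addresses, the 2s timeout, and the single retry are illustrative assumptions, not values from this thread; tune them for your devices.

```toml
[[inputs.snmp]]
  ## One site-sized chunk of devices (placeholder addresses)
  agents = [
    "udp://192.0.2.10:161",
    "udp://192.0.2.11:161",
    "udp://192.0.2.12:161",
  ]

  ## Give up quickly on devices that do not answer (assumed values)
  timeout = "2s"
  retries = 1

  version = 2
  community = "public"

  ## A single scalar as an example metric
  [[inputs.snmp.field]]
    oid = "RFC1213-MIB::sysUpTime.0"
    name = "uptime"
```

Each chunk gets its own agent list, either in a separate Telegraf instance or in a separate `[[inputs.snmp]]` block, depending on how much isolation is needed.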

Hipska · December 18, 2024, 1:14pm

I have divided the Telegraf instances based on the interval. Most devices can handle a 60s interval, but some only work reliably at 120s or even 180s. I closely monitor that using inputs.internal metrics.

This split gives the best results when Telegraf needs to reload or restart.
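
A minimal sketch of what one such per-interval instance might look like, assuming a hypothetical config file named `telegraf-120s.conf` for the slower devices. The file name, addresses, and SNMP settings are assumptions for illustration only.

```toml
# Hypothetical file: telegraf-120s.conf
# Instance for devices that only respond reliably at a 120s interval
[agent]
  interval = "120s"

[[inputs.snmp]]
  ## Placeholder addresses for the "slow" device group
  agents = ["udp://192.0.2.50:161", "udp://192.0.2.51:161"]
  version = 2
  community = "public"

  [[inputs.snmp.field]]
    oid = "RFC1213-MIB::sysUpTime.0"
    name = "uptime"

# Self-monitoring of this Telegraf instance
[[inputs.internal]]
  collect_memstats = false
```

The 60s instance would use the same layout with `interval = "60s"` and its own agent list. The `internal_gather` measurement produced by `inputs.internal` includes per-input gather times, which can be compared against the configured interval to spot a group that is polling too slowly.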
