I want to reiterate that my suggestion above is actually the simple, low-impact solution.
Only the administrator can update the telegraf config file. The logic to replace a host = [ "1", "2", … ] type list in the configuration file only needs to be written once, not hundreds of times, and it doesn’t require telegraf to ‘speak’ additional config languages.
We could do this today, as I think many, such as fercasjr, already are, by copying the telegraf config file to a template and then replacing those lists with a variable (expanded out to be super readable, if not exact syntax):
# run a command to get the host list for snmp section A > list-snmp-A.output
sed "s/placeholder-in-config/`cat list-snmp-A.output`/" telegraf.conf.template > telegraf.conf
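The template approach above can be sketched in Python as well. This is only an illustration, not something telegraf ships: the marker name `placeholder-in-config` is taken from the example above, the function names are made up, and the host list is assumed to be a plain text file with one host per line.

```python
import os
import tempfile
from pathlib import Path

PLACEHOLDER = "placeholder-in-config"  # marker assumed to exist in the template


def to_quoted_csv(hosts):
    """Turn ['10.0.0.1', '10.0.0.2'] into '"10.0.0.1", "10.0.0.2"'."""
    cleaned = [h.strip() for h in hosts if h.strip()]
    return ", ".join(f'"{h}"' for h in cleaned)


def render_config(template_text, hosts):
    """Replace the placeholder with the double-quoted host list."""
    if PLACEHOLDER not in template_text:
        raise ValueError(f"template has no {PLACEHOLDER!r} marker")
    return template_text.replace(PLACEHOLDER, to_quoted_csv(hosts))


def write_config(template_path, hosts_path, output_path):
    """Render the template, then move the result into place in one rename."""
    template_text = Path(template_path).read_text()
    hosts = Path(hosts_path).read_text().splitlines()
    rendered = render_config(template_text, hosts)
    out = Path(output_path)
    # Write to a temp file in the same directory so os.replace is atomic.
    fd, tmp_name = tempfile.mkstemp(dir=str(out.parent), prefix=out.name + ".")
    with os.fdopen(fd, "w") as tmp:
        tmp.write(rendered)
    os.replace(tmp_name, str(out))
```

Something like `write_config("telegraf.conf.template", "list-snmp-A.output", "telegraf.conf")` would then stand in for the sed line. Writing to a temp file first means a failed render never leaves a half-written telegraf.conf behind.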
What I’m saying here is to just let us replace anything to the right of a configuration key:
current:
[[inputs.snmp]]
agents = [ … ]
change:
agents = [ `/opt/telegrafscripts/snmp-agents-in-double-quoted-csv.sh` ]
or
include /opt/telegrafvariables/snmplists
which includes snmp-agents-in-double-quoted-csv = some script or command to get content
agents = [ $snmp-agents-in-double-quoted-csv ]
Or simply read the config from a variable that can be in an includes list, so we can operate on that separate file without blowing up the config file.
Most of the dynamic config items are essentially just CSV lists. Admins already have established OIDs for snmp, for example, or registers for modbus; those aren’t really changing, but the hosts/targets of these queries are. You could make it so that ONLY CSV lists can be read, with a tiny bit of validation that the value is actually a CSV list. If that check fails, throw an info message out to the logs and blank that section so it’s safe.
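The "tiny bit of validation" could look roughly like the sketch below. It is written in Python purely for illustration (telegraf itself is not Python), the function and logger names are made up, and it assumes items are double-quoted strings with no embedded quotes or commas.

```python
import logging
import re

log = logging.getLogger("csv-list-check")  # hypothetical logger name

# One double-quoted item with no embedded quotes, commas, or newlines.
_ITEM = r'"[^",\r\n]+"'
# One or more items separated by commas, with optional whitespace around them.
_LIST = re.compile(rf"\s*{_ITEM}(\s*,\s*{_ITEM})*\s*")


def safe_csv_list(raw, section):
    """Return raw if it is a double-quoted CSV list, else '' to blank the section."""
    if _LIST.fullmatch(raw):
        return raw.strip()
    log.info("ignoring invalid list for %s; section left empty", section)
    return ""
```

With this, `safe_csv_list('"10.0.0.1", "10.0.0.2"', "inputs.snmp")` passes the list through unchanged, while a list with a missing closing quote comes back empty and only that one section loses its hosts.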
The primary problem with updating the config file is that a single error in the inputs.snmp config will also make every single other part of the configuration fail. This fragility is the problem and routinely changing the primary config file for hosts is the source of failures.
I would also add that if you malform the config file and then reload, you get a failure, and until you go back and fix your error you have zero reads of any kind. Adding a host to snmp and making a typo means no ping responses, no disk io, no cpu, no nothing, and a big gap in your charts. It’s just too fragile.
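Until something like that exists, one workaround for the gap described above is to check a candidate file before it replaces the live one. Here is a minimal Python sketch, assuming you supply your own `validate` callable; the function name and the candidate/live file layout are hypothetical.

```python
import shutil
from pathlib import Path


def install_if_valid(candidate, live, validate):
    """Copy candidate over live only when validate(candidate text) passes.

    validate is any callable that raises an exception on a bad config.
    Returns True when the live file was replaced, False when it was kept.
    """
    text = Path(candidate).read_text()
    try:
        validate(text)
    except Exception as exc:
        # The live config is untouched, so collection keeps running.
        print(f"keeping current config, candidate rejected: {exc}")
        return False
    shutil.copyfile(candidate, live)
    return True
```

On Python 3.11+ the validator could be as simple as `tomllib.loads`, which would catch a stray bracket or quote but knows nothing about telegraf's plugin options, so it is a syntax check only.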
Other systems are often used for things like snmp monitoring because telegraf is unfriendly for an admin, despite being fantastic once configured. I have a friend at a university here who uses a full TICK stack, but for snmp he runs a python script and individual snmp walks, pushing to influx via http connector, because he can put all the snmp query details in mysql. Telegraf’s snmp connector is perfect for this job, except the config method is a problem.