Update telegraf snmp agents simply and in bulk

Dan_Denson · July 31, 2022, 2:13am

There is another similar request, but it was marked answered without a great solution.

I would like to be able to populate the agent’s list via database entry and have telegraf query that. Is this possible? Google has returned nothing for me. That or an existing tool that can iterate through a list of IPs to hit with SNMP and then update the file?

Thanks

Jay_Clifford · August 1, 2022, 10:21am

Hi @Dan_Denson,
Looking at the previous post, there seem to be two parts to this discussion. The first is best practices for building an SNMP config with Telegraf. The second is how to generate a config that dynamically contains the SNMP endpoints.
For the first question I highly recommend checking out this best practice blog: Telegraf Best Practices: SNMP Plugin | InfluxData

The second question is a bit more difficult and will take some effort on your part. Telegraf is not built to dynamically generate configs; that is something you would have to build yourself. One of the primary methods developers use is the environment variable feature: a bash script updates the environment variables and restarts Telegraf so it picks up the new values. The second option is writing a script to build the Telegraf config file.
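As a rough illustration of the second option, here is a minimal sketch of a script that renders an SNMP input section from a plain list of hosts. The function name `render_snmp_input`, the `192.0.2.x` addresses, the `public` community and the single uptime field are placeholders chosen for this example, not anything taken from a real setup.

```python
import json


def render_snmp_input(agents, community="public"):
    """Return a TOML [[inputs.snmp]] block polling the given hosts (hypothetical helper)."""
    # json.dumps produces a double-quoted, comma-separated list, which is
    # also valid TOML for plain ASCII host strings.
    agent_list = json.dumps([f"udp://{host}:161" for host in agents])
    return "\n".join([
        "[[inputs.snmp]]",
        f"  agents = {agent_list}",
        "  version = 2",
        f"  community = {json.dumps(community)}",
        "",
        "  [[inputs.snmp.field]]",
        '    oid = "RFC1213-MIB::sysUpTime.0"',
        '    name = "uptime"',
        "",
    ])


if __name__ == "__main__":
    # Placeholder addresses from the documentation range.
    print(render_snmp_input(["192.0.2.10", "192.0.2.11"]))
```

Building the list with a serializer rather than string concatenation avoids the stray-quote class of mistakes, although the script still has to be run and Telegraf reloaded whenever the host list changes.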

Dan_Denson · August 1, 2022, 10:07pm

That’s what I’m seeing. I’m surprised someone hasn’t made a utility for this, seems it would be a pretty common need.

Jay_Clifford · August 3, 2022, 12:15pm

Hi @Dan_Denson,
You would be surprised: a lot of the community build their own deployment scripts to fit their individual needs. It’s hard for us to standardise on a way of building configs from other sources. If you would like to contribute code for the idea, though, that would be awesome.

Dan_Denson · May 24, 2023, 9:38pm

I hear plenty of complaints about it. I personally hate scripts that update config files and reload things like this. A bad character creeps in here or there and it takes the whole thing down.

Having some native database connectors would be ideal and the ability to do a reload of those variable list items would be incredible and would increase the utility of telegraf as the snmp collector dramatically.

srebhan · May 25, 2023, 7:52am

@Dan_Denson can you please elaborate where your config data comes from? Do you use some kind of asset manager that knows about your devices?

Dan_Denson · May 25, 2023, 6:10pm

I want to collect snmp data from network devices, usually in the thousands. Putting thousands of individual devices in the telegraf config file is… well it’s not nice.

I would love to put them in a little sql table for example, so I could update them live/programmatically and have telegraf just pick that up without trouble.

The problem with having so many items in a config file is that one mistake in entry takes not only the snmp polling out, but everything else telegraf is doing because now it’s broken.

To get this data now, I’m using snmpwalk in a sh script and pushing it in via http, because I can query the data currently in a Postgres database and iterate through it.

In short, this is a task that telegraf could very much do but the config file is so offputting that I can’t use the built-in snmp.

srebhan · May 26, 2023, 8:00am

@Dan_Denson it is hard to do a general type of config generation as each individual solution might vary a bit. Therefore I suggest having a little program read your SQL table and convert it into a Telegraf config. You know best how to query the SQL table and which information in that table maps to which field; doing this generally in Telegraf is really hard!
You can then have Telegraf read this generated config via HTTP…
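Such a little program could be sketched as follows. It is only a sketch under stated assumptions: `sqlite3` stands in for whatever SQL database is actually used, and the `devices` table with `ip` and `type` columns, as well as the helper names, are invented for this example.

```python
import json
import sqlite3


def fetch_agents(conn, device_type):
    """Return the IPs of all devices of one type, sorted for stable output."""
    rows = conn.execute(
        "SELECT ip FROM devices WHERE type = ? ORDER BY ip", (device_type,)
    ).fetchall()
    return [row[0] for row in rows]


def agents_line(ips):
    """Format the IPs as the `agents = [...]` line of an SNMP input section."""
    return "agents = " + json.dumps([f"udp://{ip}:161" for ip in ips])


if __name__ == "__main__":
    # In-memory database with a hypothetical schema, for demonstration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE devices (ip TEXT, type TEXT)")
    conn.executemany(
        "INSERT INTO devices VALUES (?, ?)",
        [("192.0.2.20", "PDU"), ("192.0.2.21", "PDU"), ("192.0.2.30", "switch")],
    )
    print(agents_line(fetch_agents(conn, "PDU")))
```

The query and the mapping from columns to config options are exactly the parts that differ between installations, which is why they live in the external program here.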

Dan_Denson · May 26, 2023, 3:13pm

I’m not so sure about that. It’s just a list of addresses. The general/generic way would be to either add a connector so that a query returns a properly formatted list, or add the ability to spawn an external command to produce that list at each run interval. It’s pretty straightforward to build that list in an external script where we could query any data source and form that into the csv listing telegraf uses.

It’s actually really unusual to have the runtime arguments for a tool live inside of the config file. One error in the snmp list and the entire config file is broken and there’s no real error handling using this method.

It’s not just snmp either, there should be some method to have the config file stay static and the variables for different modules configured externally, even if that’s just another plain text file.

srebhan · May 26, 2023, 4:19pm

It’s just a “list of addresses” for YOU but others might want to also template modbus-registers, MQTT topics dependent on the host or simply read the input from a MongoDB… That’s what I mean by “each individual solution might vary a bit”.

This being said, I’m currently thinking about reworking the config so that more “sources” are possible with more fine-grained update strategies. However, don’t expect anything soon as this is a delicate subject especially as it has the potential to completely take-down an agent if the config is malformed due to some reason.

Dan_Denson · May 26, 2023, 9:52pm

It feels like we are on the same page here. Others may very well want other details… so that’s why you allow a query. If you can put it in the config file then that can be queried to perfectly match that as well. If it’s a list of entries like snmp, ‘select column from database’. If it’s modbus registers ‘select slave_id, register, length, type from modbus_sql_table’.

The thing is that each implementation does not and cannot vary because telegraf has a strict format for any given config key. Users don’t get to decide how snmp targets are listed in the hosts list, that’s fixed by telegraf. Users will store data many different ways, and thus will query data many different ways, but it always needs to line up in telegraf in the same format.

srebhan · May 31, 2023, 7:40am

I think you are too “trapped” in your use-case… Why would I want to query the Modbus registers if I have 1000 meters of the same type? The IP address or the slave ID would be sufficient to query and fill the template… Maybe you have 100 meters of type A and 100 meters of type B. How would you do this? What if someone is using MongoDB or some other datasource?

My point is that the complexity of such a solution, covering all plugins and all possible datasources and all potential use-cases, is gigantic! I wrote a small tool taking data from a database and generating Telegraf configs for only a set of Modbus devices of a handful of different models, and the “corner-case handling” outweighed the “normal” code by far.

While I think that we should think of a better integration of Telegraf into other infrastructure like asset management, a general “generate a config from an arbitrary database” is probably out-of-scope. You can and should do this now by having a tool that generates and updates your configs from your SQL (or whatever) and provides it via HTTP. This can be done today. We should then add more ways to react to changes as a first step.
Please don’t get me wrong, I’m not saying your use-case is not important or it cannot be done, but the complexity is enormous and simply not the point I would start with. Anyway, I would love to see a feature request for Telegraf from your side @Dan_Denson with concrete examples on how you envision the configuration and what your expectations are!

Dan_Denson · May 31, 2023, 4:05pm

I feel like you’re understanding the need exactly but then not seeing the point on how much of a wreck the config file is for this.

1000 meters all behind a TCP/Modbus gateway… in a long list in the config file “meter1address”,“meter2address”,“meter3address”…

When that could be “mysql-client ... select meter_ip from table where metertype = PDU” and that’s it, or mongdb-cli -c db.meters.find({search query}). A single query fills that section of the telegraf config file instead of a list of 1000 entries. Heck, it could be cat /home/thedude/meters.csv at that point.

It’s the least complex solution here, not the most. keeping the config file clean with 1000 entries for one monitor is a mess and can break every other thing. A malformed item in querying TCP/Modbus gateways also kills off all the SNMP entries too. load the data via external script or command.

For SNMP for example, allow a query for the OIDs as well. I can do mysql-client select OID from mytable where type = PDU and then mysql-client select IP from mytable where type = PDU and that completes that SNMP section.

This request solves almost everyone’s issues. It’s the most usable solution for the most people. Just trigger a command or script on the system with the expectation it outputs data how telegraf wants it which is documented in the config. Most of the time this is a csv list of items to poll in that section so it’s just a csv list of IPs or modbus registers.

fercasjr · May 31, 2023, 9:06pm

hahahaha yes, I do manage my configuration file on telegraf with a template and a node red flow to generate the file, and I get what you are saying it happened to me before…

One question… isn’t it possible to load part of the configuration file from another file, like the custom plugins using execd?

srebhan · June 5, 2023, 9:32am

@Dan_Denson I’m not sure why you say that

> […] but then not seeing the point on how much of a wreck the config file is for this.

I have done exactly this. I wrote a python script that uses N (and believe me, N will be > 1, even > 10) templates to generate the configs… And I furthermore said before that we are thinking about making the configuration more user friendly for large installations. I also said that I’m open to suggestions on how to do this. Unfortunately

> When that could be “mysql-client ... select meter_ip from table where metertype = PDU” and that’s it. or mongdb-cli -c db.meters.find({search query}). A single query fills that section of the telegraf config file instead of a list of 1000 entries. Heck, it could be cat /home/thedude/meters.csv at that point.

does not show where to put those lines nor how Telegraf would know what to do with the output of this command. I can imagine that we have templates and some “plugin” (that also needs to be configured) that knows how to fill those templates. Or that we fill the config AST directly from the query assuming that it has a certain structure, but that’s what I want to discuss and that’s why I asked you to open an issue for documenting what was discussed…

> It’s the least complex solution here, not the most. keeping the config file clean with 1000 entries for one monitor is a mess and can break every other thing. A malformed item in querying TCP/Modbus gateways also kills off all the SNMP entries too. load the data via external script or command.

I now outlined several times why I think it is complex. Maybe my English is not good enough or I simply cannot get my point across, but let me try one more time.
My constraint in all those discussions is that Telegraf and the plugins’ code need to stay maintainable with your/a configuration solution in place. Given this constraint, we cannot simply put some arbitrary code into each of the 200+ plugins. There needs to be some general framework that deals with this. The simplest solution is to use templates and then tell Telegraf which templates to use and how to fill them. Basically that was the solution I also applied to the 1000 meters problem, but as an external script.
While this sounds easy at first, there are some caveats I experienced. Foremost, you potentially end up with special or corner cases, e.g. the meter is installed in the wrong direction and you need to invert values for X of your 1000 meters, there are different firmware versions that require slightly different configuration, you have heterogeneous setups with different meter types and even topologies, etc. We need a real strategy for dealing with those kinds of things, as this is what happens in practice!
In addition to the diversity coming from the devices or infrastructure, there is also diversity in the data source you want to use for filling the configuration. There might be SQL databases, but how do we map the columns to the configuration items? Do we constrain the name, i.e. the column has to have the same name as the config option? What about other database types like MongoDB or other NoSQL databases? What if your CSV columns are not named, how do you then map the data? Other people want to query their asset management system to automatically configure their devices (and add new ones/remove old ones)…

I could think of a Telegraf configuration that has something like

[[config.sql]]
    template = "my_modus_device_A.tmpl"
    dsn = "mysql://127.0.0.1/..."
    query = 'SELECT ip AS controller FROM modbus_devices WHERE type = "A"'
    refresh = "10m"

[[config.sql]]
    template = "my_modus_device_B.tmpl"
    dsn = "mysql://127.0.0.1/..."
    query = 'SELECT ip AS controller, slave_id, register_address AS address, data_type AS type FROM modbus_devices WHERE type = "B"'
    group_by = ["controller", "slave_id"]
    refresh = "10m"
...

The above is one possibility, so @Dan_Denson and @fercasjr, if you are interested in a constructive discussion, please open an issue and give examples like the above. I’m looking forward to seeing your ideas!

Best regards,

Sven

Dan_Denson · June 5, 2023, 3:20pm

I want to reiterate that my suggestion above is actually the simple, low-impact solution.

Only the administrator can update the telegraf config file. The logic to replace a host = [ “1”, “2”, …] type list in the configuration file only needs to be done once, not hundreds of times, and doesn’t require telegraf to ‘speak’ additional config languages.

We could do this today, as I think many such as fercasjr are, by copying the telegraf config file to a template and then replacing those lists with a variable: (expanded out to be super readable if not exact syntax)
#do command to get list for snmp hosts section A > list-snmp-A.output
#sed telegraf.conf.template ‘s/placeholder-in-config/cat list-snmp-A.output’ > telegraf.conf

What I’m saying here is to just let us replace anything to the right of a configuration key:
current:
[[inputs.snmp]]
agents = [ … ]

change:
agents = [ /opt/telegrafscripts/snmp-agents-in-double-quoted-csv.sh ]

or

include /opt/telegrafvariables/snmplists
\which includes snmp-agents-in-double-quoted-csv = some script or command to get content
agents = [ $snmp-agents-in-double-quoted-csv ]

or, simply read the config from a variable that can be in an includes list so we can operate on that separate file without blowing up the config file.

Most of the dynamic config items are essentially just csv lists. Admins already have established OIDs for snmp, for example, or registers for modbus; those aren’t really changing, but the hosts/targets of these queries are. You could make it so that ONLY csv lists can be read, with a tiny bit of validation that it’s actually a csv list, and if that breaks, throw an info message out to the logs and blank that section so it’s safe.
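To make the validate-or-blank idea concrete, here is a minimal sketch of how it could behave in an external script. This is not something Telegraf does today; the function name `load_agent_list` is made up, and the sketch assumes every entry is a bare IP address (hostnames would need a different check).

```python
import csv
import ipaddress
import logging

log = logging.getLogger("agent-list")


def load_agent_list(text):
    """Parse one line of double-quoted CSV into a list of IP strings.

    Returns an empty list (and logs an info message) if the line is empty,
    is not well-formed CSV, or contains an entry that is not an IP address.
    """
    try:
        rows = list(csv.reader([text.strip()], skipinitialspace=True, strict=True))
        entries = rows[0] if rows else []
        if not entries:
            raise ValueError("empty list")
        for entry in entries:
            ipaddress.ip_address(entry)  # raises ValueError on a bad address
        return entries
    except (csv.Error, ValueError) as exc:
        # Blank the section instead of failing, as suggested in the post.
        log.info("ignoring malformed agent list: %s", exc)
        return []


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(load_agent_list('"192.0.2.1", "192.0.2.2"'))
    print(load_agent_list('"192.0.2.1", "not-an-ip"'))
```

Whether blanking the whole list or dropping only the bad entry is the safer behaviour is a design choice; the sketch follows the post and blanks the section.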

The primary problem with updating the config file is that a single error in the inputs.snmp config will also make every single other part of the configuration fail. This fragility is the problem and routinely changing the primary config file for hosts is the source of failures.

I would also add that if you do malform the config file and then reload you get a failure, then go back and fix your error and in that time you have zero reads of any kind. adding a host to snmp and making a typo means no ping responses, no disk io, no cpu, no nothing and a big gap in your charts. It’s just too fragile.

Other systems for things like snmp monitoring are often used because telegraf is unfriendly for an admin despite being fantastic once configured. I have a friend at a university here who uses a full TICK stack, but for snmp he runs a python script and individual snmp walks pushing to influx via the http connector, because he can put all the snmp query details in mysql. Telegraf’s snmp connector is perfect for this job except the config method is a problem.

srebhan · June 7, 2023, 10:23am

@Dan_Denson your solution looks simple in a forum post, but the code to realize that for all plugins and most use-cases is complex. It’s a bit like arguing that telegraf --do-what-i-want is a simple solution without showing how to realize it. Think about a topology of multiple Modbus gateways each with multiple slaves. How would you realize this? What if you have different credentials etc for each agent? What is the benefit to run that script and produce the config file(s) from a template?

While I do agree that we should improve your situation and in general the support for large-scale configurations, I do not agree with your assessment of simplicity of code nor am I convinced that we can cover many use-cases with your suggestion. However I am convinced that your proposal would need to touch all plugins and is thus infeasible to maintain.

Anyway, as I am just repeating myself without adding new points to the discussion, I suggest either starting a discussion in an issue, also covering other use cases (besides your SNMP use-case), or, even better, submitting a pull request showing a realization of your proposal. Happy to review it.

Dan_Denson · August 10, 2023, 3:44pm

you’re convoluting what I’m saying to the extreme. I’m not asking for literally anything of what you just said.

I’m only suggesting a change to these formats:
list = [ quoted csv list ]
to
list = [ sourced from a file or from a command]

if you’re worried about running a command, then just:
listA = [ /directory/listA.txt ]
and forget the code execution, we can populate the files externally. ie, I’ll run my own database query with the output being ‘listA.txt’ so long as telegraf can read that file.

This is how 99% of config files work. telegraf is the odd one out and the most difficult and error prone to configure via scripting because I have to touch the ENTIRE configuration to update a csv list. This is the wrong way, I’m suggesting telegraf simply standardize with everything else, ie the tried and true way.

Hipska · August 22, 2023, 7:08am

> This is how 99% of config files work. telegraf is the odd one out and the most difficult and error prone to configure via scripting because I have to touch the ENTIRE configuration to update a csv list. This is the wrong way, I’m suggesting telegraf simply standardize with everything else, ie the tried and true way.

This statement is completely wrong. There is a reason why configuration management tools exist (Ansible, Puppet, Salt, Foreman, …). And exactly this scenario is where such a tool would be a perfect fit. I am doing it like this as well, to generate config for 2k+ snmp devices to be monitored by Telegraf. No need to create a custom script. Also no need to touch the ENTIRE configuration. You can put the basic config in the normal telegraf.conf and then specific generated config file(s) in the telegraf.d directory.
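For anyone generating the fragment without a configuration management tool, here is a minimal sketch of writing it into a telegraf.d-style directory so that a half-written file is never visible under its final name. The helper name `write_fragment`, the file name and the demo directory are assumptions for this example.

```python
import os
import tempfile


def write_fragment(directory, name, content):
    """Atomically write `content` to `directory/name` (hypothetical helper)."""
    # The temporary file lives in the same directory so os.replace stays on
    # one filesystem; its .tmp suffix keeps it apart from the *.conf files.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="." + name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
        os.replace(tmp_path, os.path.join(directory, name))
    except BaseException:
        # Only reached while the temporary file still exists.
        os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    # Demo against a throwaway directory rather than a real telegraf.d.
    demo_dir = tempfile.mkdtemp()
    write_fragment(demo_dir, "snmp_agents.conf",
                   '[[inputs.snmp]]\n  agents = ["udp://192.0.2.10:161"]\n')
    print(os.listdir(demo_dir))
```

The directory is the one passed to Telegraf with `--config-directory`. The main telegraf.conf is never rewritten, but a malformed fragment can still stop the configuration from loading, so it is worth validating the content before writing it.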
