Monitor telegraf with telegraf

Hi.

We currently have infrastructure where many Telegraf instances collect data and send it to a single monitoring server. There are cases when Telegraf can be down (especially on Windows machines), and I would like to monitor this as well. So from the monitoring server I want some kind of ping to all the machines to see whether Telegraf is running. From the plugins I have looked at, it seems I should use socket_listener on the nodes and ping or net_response on the monitoring node to check that it is running. Is this solution OK?

Edit: Or I can use the http_listener plugin and write some data to it from the monitoring server. If the data shows up in the DB, everything is OK; if not, something is wrong.
Thx.

Hi, I’m also interested in this topic.

–==Central server==–
grafana-4.6.3-1.x86_64
influxdb-1.4.2-1.x86_64
telegraf-1.5.0-1.x86_64

–==VMs side==–
telegraf-1.5.0-1.x86_64

It is very important to know whether data is actually flowing between Telegraf and InfluxDB.
You can check this at the CLI level, but that does not help when you have no time to track every server:
# systemctl status telegraf -l
# /usr/bin/telegraf --config /etc/telegraf/telegraf.conf --test

I’ve also looked at these three options, but which one fits best?

  1. [[inputs.socket_listener]]
  2. [[inputs.net_response]]
  3. [[inputs.ping]]
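
For reference, here is a rough sketch of how these three options could be wired together. The host name node1.example.org and port 8094 are placeholders made up for illustration, not values from this thread. Keep in mind that ping only shows that the machine is reachable, not that Telegraf is running on it.

```toml
# On each node: open a TCP port that is only reachable while Telegraf runs.
# Port 8094 is an assumed example value.
[[inputs.socket_listener]]
  service_address = "tcp://:8094"

# On the monitoring server: try to connect to that port.
# node1.example.org is a placeholder host name.
[[inputs.net_response]]
  protocol = "tcp"
  address = "node1.example.org:8094"
  timeout = "1s"

# On the monitoring server: ICMP ping. This checks the host, not Telegraf.
[[inputs.ping]]
  urls = ["node1.example.org"]
  count = 1
```

With this layout, net_response on the monitoring server is the part that actually says something about Telegraf itself, since the port is closed when the process is down.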

Kind Regards,

You might also want to look into using the internal input plugin; it reports on the activity of the Telegraf instance itself. Perhaps you can alert if these metrics stop being reported?
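
As a minimal sketch, enabling the internal plugin on a node would look roughly like this. It assumes the node already has a working output section pointing at InfluxDB.

```toml
# Add to telegraf.conf on every node that should report on itself.
[[inputs.internal]]
  # Optional: also collect Go runtime memory statistics.
  collect_memstats = true
```

The plugin emits measurements such as internal_agent and internal_write, tagged with the reporting host, so a gap in those series is a sign that the instance stopped sending.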

How can we implement this in a proper and easy way?

  • Should each server-side VM have [[inputs.ping]], [[inputs.net_response]], and [[inputs.socket_listener]] configured in /etc/telegraf/telegraf.conf and send the metrics to InfluxDB?

  • For InfluxDB I haven’t implemented a KPI dashboard yet, but I think it would be wise to add one to cover status, operations, UP/DOWN state and…

Any suggestion is appreciated, and your proposal to use “internal” certainly won’t overload the VMs.

Thank you Daniel.

I did a few tests with http_listener on each node and http_response on the monitoring node, just making a simple connection (without writing anything to the DB, so no body is set). This gives me data even when the connection attempt times out, which is something I’m interested in.
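
For anyone who wants to try the same thing, the setup looks roughly like this. The host name and port 8186 are placeholders, and the /ping path assumes the listener answers it the way InfluxDB does; please check that against your Telegraf version. Newer releases of http_response take a `urls = [...]` list instead of `address`, if I remember correctly.

```toml
# On each node: HTTP listener (port 8186 is an assumed example).
[[inputs.http_listener]]
  service_address = ":8186"

# On the monitoring server: one block per node to probe.
# node1.example.org is a placeholder host name.
[[inputs.http_response]]
  address = "http://node1.example.org:8186/ping"
  method = "GET"
  response_timeout = "5s"
  follow_redirects = false
```

Because the probe runs on the monitoring server, a result is recorded for every configured node, including those that time out.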

@daniel you mentioned internal, but I don’t see how we can use it. We may have lots of nodes running Telegraf; now imagine that on some nodes Telegraf was never started, so there will be no entries in the DB for those hosts.
How can we create an alert for that? Would I need a predefined list of nodes and then check in the alert whether each node is present? I would rather not keep a predefined list, because nodes can be added dynamically. I hope that makes sense.

Thx.

In an environment with a dynamic number of hosts, things get complicated. Perhaps you can use the exec plugin to collect the inventory of hosts and then check that a corresponding internal metric is being collected for each one? Out of curiosity, are you using any particular cloud provider or orchestration software to manage your hosts?
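
A minimal sketch of the exec idea, assuming a hypothetical script /usr/local/bin/list_hosts.sh (not something Telegraf ships) that prints one line of InfluxDB line protocol per inventory host, for example `inventory,node=node1.example.org expected=1i`:

```toml
# On the monitoring server: record the expected hosts as a measurement.
[[inputs.exec]]
  # Hypothetical script; it must print line protocol to stdout.
  commands = ["/usr/local/bin/list_hosts.sh"]
  timeout = "5s"
  data_format = "influx"
  # The inventory changes rarely, so collect it less often (assumed value).
  interval = "5m"
```

The alerting side would then compare the hosts listed in the inventory measurement with the hosts that recently reported internal metrics.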

OK, for now I don’t use any cloud provider, just a simple Ansible deployment with a custom server list. I don’t know whether it’s possible to fire an alert on a graph that uses GROUP BY host, for example, when only one series stops returning data. With your solution, would I need as many queries in the graph as there are hosts, and basically a separate alert for each of them?

Also, if Telegraf never started on some server, there is no way to show this properly on a graph or to alert on it, because an alert needs data to exist first so that it has something to compare against before it can fire. So we need a pull mechanism for this, but to be honest http_listener seems like too much for this task, and I still can’t properly use alerts for the case when there is no data (connection_timeout).

[[inputs.ping]] -> working fine for me.

I’ll come back with feedback on the rest:

  • [[inputs.socket_listener]]
  • [[inputs.net_response]]

So, how do you use inputs.ping to check whether Telegraf is running?

@geo Are you using Kapacitor for alerting? You can define a TICKscript that uses deadman. When Telegraf is added to or removed from a host by Ansible, you will need to modify the TICKscript, start the new task, and stop the old task. If you wanted to get advanced, you could define a user-defined function (UDF) that updates the host list.

I’m far from an expert with Kapacitor, but the script would look something like this:

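// Alert when internal_agent metrics stop arriving from one host.
dbrp "telegraf"."autogen"
stream
    |from()
        .measurement('internal_agent')
        // Group by host so the host tag is available to the ID template below.
        .groupBy('host')
    |where(lambda: "host" == 'telegraf.example.org')
    // Trigger if throughput drops to 1 point or fewer per 20s interval.
    |deadman(1.0, 20s)
        .id('{{ index .Tags "host" }}')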

@daniel I’m actually using Grafana; sorry, I thought I had mentioned that. I can’t use TICK at the moment because I need tables in my dashboards. Yes, I think Kapacitor is more powerful than Grafana alerts.

I haven’t used Grafana alerts yet, so I can’t help much there. One advantage of using a deadman is that it will trigger if reporting stops, while pinging Telegraf over the network could report it as up even if it were hung and not sending metrics, so long as it can reply to the request.

You could, of course, use both Grafana alerts and Kapacitor together if needed.

I’d like to know this too

@skinfrakki you are a necromancer! :wink:
The easiest approach is to create an alert on the database side in a deadman fashion. That is, if you do not receive data for a certain amount of time (interval + flush-interval + safety margin), you should issue an alert. You might want to use the internal input plugin for this.
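
To make the timing concrete, here is a small sketch with assumed example values; your intervals and margin will differ.

```toml
# Agent settings on each node (values here are assumed examples).
[agent]
  interval = "10s"        # how often inputs are gathered
  flush_interval = "10s"  # how often metrics are written to the output

# Self-monitoring metrics, for example the internal_agent measurement.
[[inputs.internal]]

# Deadman window = interval + flush_interval + safety margin
#                = 10s + 10s + 10s (assumed margin) = 30s
```

With these numbers, a host whose internal metrics have been missing for more than 30s would be reported as down.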

Additionally, there is a health output plugin that you can use…
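
A rough sketch of the health output, based on my reading of the plugin README; port 8080 and the buffer limit are example values. As far as I know this plugin was added well after the Telegraf 1.5.0 mentioned earlier in the thread, so it needs a newer release.

```toml
# Serves an HTTP endpoint that reports whether the checks below pass.
[[outputs.health]]
  service_address = "http://:8080"

  # Only look at the internal plugin's metrics for the influxdb output.
  namepass = ["internal_write"]
  tagpass = { output = ["influxdb"] }

  # Healthy while the output buffer stays below this size (example limit).
  [[outputs.health.compares]]
    field = "buffer_size"
    lt = 5000.0
```

The monitoring server can then poll that endpoint on each node, which also catches an instance that is running but no longer able to write.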

Does that help?