Monitor telegraf with telegraf

geo · January 9, 2018, 2:58pm

Hi.

We currently have an infrastructure with lots of Telegraf instances collecting data and sending it to one central server for monitoring. There are cases where Telegraf itself can be down (especially on Windows machines), and I would like to monitor this as well. So, from the monitoring server, I want some kind of ping to all the machines to see whether Telegraf is running. From the plugins I see, it seems I should use socket_listener on the nodes and ping or net_response on the monitoring node to check whether it is running. Is this solution OK?

Edit: Or I can use the http_listener plugin and write some data from the monitoring node. If the data ends up in the DB, it’s OK; if not, something is wrong.
Thx.

fchiorascu · January 9, 2018, 6:45pm

Hi, I’m also interested in this topic.

–==Central server==–
grafana-4.6.3-1.x86_64
influxdb-1.4.2-1.x86_64
telegraf-1.5.0-1.x86_64

–==VMs side==–
telegraf-1.5.0-1.x86_64

It is very important to know whether data is being sent and received between Telegraf and InfluxDB.
You can check this at the CLI level, but that does not help when you have no time to track all the servers:
# systemctl status telegraf -l
# /usr/bin/telegraf --config /etc/telegraf/telegraf.conf --test

I’ve also looked at these 3 options, but which one fits best?

[[inputs.socket_listener]]
[[inputs.net_response]]
[[inputs.ping]]

Kind Regards,

daniel · January 9, 2018, 7:05pm

You might also want to look into using the internal input plugin; it reports on the activity of the Telegraf instance itself. Perhaps you can alert if these metrics stop being reported?

fchiorascu · January 9, 2018, 7:25pm

How can we implement this in a proper and easy way?

Should each server-side VM have [[inputs.ping]], [[inputs.net_response]], and [[inputs.socket_listener]] configured in its /etc/telegraf/telegraf.conf and send the metrics to InfluxDB?
For InfluxDB I haven’t implemented a KPI dashboard yet, but I think it would be wise to build one to cover status, operations, UP/DOWN state and…

Any suggestion is appreciated, and your proposal to use “internal” will surely not overload the VMs.

Thank you Daniel.

geo · January 10, 2018, 9:18am

I did a few tests with http_listener on each node and http_response on the monitoring node to make a simple connection (without writing anything to the DB, so the body is not set). This gives me data even when the connection attempt times out, which is something I’m interested in.
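To illustrate the idea of a pull check that still produces a result on timeout, here is a minimal Python sketch. It is not Telegraf's http_response plugin, just a plain TCP connect; the function name `probe` and the result strings are assumptions made up for this example.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Try a plain TCP connect and always return a result string.

    Hypothetical helper: the name and the result values are
    illustrative and not part of Telegraf.
    """
    try:
        # create_connection raises on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "success"
    except socket.timeout:
        return "timeout"
    except OSError:
        return "connection_failed"
```

The point is that a timeout or refused connection becomes a value that can be stored and alerted on, rather than a gap in the data.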

@daniel you mention internal, but I don’t see how we can use it. We may have lots of nodes running Telegraf. Now imagine that on some nodes Telegraf was never started; in that case there will be no entries in the DB for those hosts.
How can we create such an alert? Do I need a predefined list of nodes and then, when creating the alert, check whether each node is present? I would like to avoid a predefined list because nodes can be added dynamically. I hope that makes sense.

Thx.

daniel · January 10, 2018, 9:20pm

In an environment with a dynamic number of hosts things are complicated. Perhaps you can use the exec plugin to collect the inventory of hosts and then check that there is a corresponding internal metric being collected? Out of curiosity, are you using any particular cloud provider or orchestration software to manage your hosts?
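The comparison itself could be as simple as a set difference. Below is a rough Python sketch, assuming you already have the inventory and the list of hosts that recently reported internal metrics; the function name `missing_hosts` and the sample host names are hypothetical.

```python
from typing import Iterable, Set


def missing_hosts(inventory: Iterable[str], reporting: Iterable[str]) -> Set[str]:
    """Return hosts listed in the inventory that sent no internal metrics."""
    return set(inventory) - set(reporting)


# Hypothetical data: the inventory would come from wherever the hosts are
# managed, and the reporting hosts from a query for the distinct "host"
# tags of recent internal metrics.
inventory = ["web1", "web2", "db1"]
reporting = ["web1", "db1"]
print(sorted(missing_hosts(inventory, reporting)))  # prints ['web2']
```

Because the inventory is the source of truth here, a host where Telegraf never started still shows up as missing.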

geo · January 11, 2018, 11:19am

OK, for now I don’t use any cloud provider, just a simple Ansible deployment with a custom server list. Currently I don’t know if it’s possible to fire an alert on a graph if we are using, for example, GROUP BY host and only one series stops returning data. According to your solution, I would need as many queries in the graph as there are hosts, and basically a separate alert for each of them?

geo · January 11, 2018, 3:31pm

Plus, if Telegraf never started on some server, there is no way to properly show this on a graph and use an alert, because alerts need data to exist first so that there is something to compare against before firing. So we need a pull mechanism for this, but to be honest http_listener seems like too much for this task, and I still can’t properly use alerts for the case when there is no data (connection_timeout).

fchiorascu · January 11, 2018, 6:23pm

[[inputs.ping]] -> working fine for me.

For the rest I’ll come back with feedback:

[[inputs.socket_listener]]
[[inputs.net_response]]

geo · January 11, 2018, 6:44pm

So, how do you use inputs.ping to check if Telegraf is running?

daniel · January 11, 2018, 8:21pm

@geo Are you using Kapacitor for alerting? You can define a TICKscript that uses deadman. When Telegraf is added to or removed from a host by Ansible, you will need to modify the TICKscript, start the new task, and stop the old task. If you wanted to get advanced, you could define a user defined function (UDF) that could update the host list.

I’m far from an expert with Kapacitor, but the script would look something like this:

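// Alert when telegraf.example.org stops reporting internal metrics.
dbrp "telegraf"."autogen"

stream
    |from()
        .measurement('internal_agent')
    |where(lambda: "host" == 'telegraf.example.org')
    // Trigger when throughput drops to <= 1.0 points per 20s.
    |deadman(1.0, 20s)
        .id('{{ index .Tags "host" }}')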

geo · January 12, 2018, 8:35am

@daniel I’m using Grafana actually; sorry, I thought I had mentioned that. I can’t use TICK at the moment as I need tables in my dashboards. Yes, I think Kapacitor is more powerful than Grafana alerts.

daniel · January 12, 2018, 7:40pm

I haven’t used Grafana alerts yet, so I can’t help much there. One advantage of using a deadman is that it will trigger if reporting stops, while pinging Telegraf over the network could report it as up even if it were hung and not sending metrics, so long as it can reply to the request.

You could, of course, use both Grafana alerts and Kapacitor together if needed.

skinfrakki · October 20, 2023, 8:16pm

I’d like to know this too

srebhan · November 6, 2023, 10:50am

@skinfrakki you are a necromancer!
The easiest way is to create an alert on the database side in a deadman fashion. That is, if you do not receive data for a certain amount of time (interval + flush-interval + safety margin), you should issue an alert. You might want to use the internal input plugin for this.
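As a rough sketch of that deadman logic in Python (not tied to any particular database or alerting tool): the function names and the 10s values below are assumptions for illustration, and the last-seen timestamps would come from a query for the newest point per host.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def deadman_threshold(interval_s: float, flush_interval_s: float, margin_s: float) -> timedelta:
    """Longest silence tolerated before alerting."""
    return timedelta(seconds=interval_s + flush_interval_s + margin_s)


def silent_hosts(last_seen: Dict[str, datetime], now: datetime, threshold: timedelta) -> List[str]:
    """Hosts whose newest data point is older than the threshold."""
    return sorted(h for h, ts in last_seen.items() if now - ts > threshold)


# Assumed values: 10s interval, 10s flush interval, 10s safety margin.
threshold = deadman_threshold(10, 10, 10)
now = datetime(2023, 11, 6, 10, 50, 0)
last_seen = {
    "vm1": now - timedelta(seconds=12),  # still reporting
    "vm2": now - timedelta(seconds=95),  # silent for too long
}
print(silent_hosts(last_seen, now, threshold))  # prints ['vm2']
```

Note that a host that never reported at all has no entry in `last_seen`, so this check only covers hosts that have sent data at least once.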

Additionally, there is a health output plugin that you can use…

Does that help?
