Monitoring: Time since last update for many hosts

#1

I’m trying to monitor my Telegraf -> InfluxDB setup. I have many hosts writing data into Influx (100% via Telegraf), but have found that on occasion one or more systems will have some sort of problem and will stop sending stats (yeah, I know, that’s probably a Telegraf problem). I’d like to detect when this problem occurs and thus quantify the severity.

I have previously been ‘tailing’ log files on the client boxes and looking for certain lines which indicate Telegraf sent some data. However, recent updates seem to have stopped writing those lines, and so I can’t do this client-side any more.

What I need is an Influx query, or a series of queries, or a continuous query (I guess) that I can use (in a script, probably Python) to hook into monitoring. What I want to do (in pseudocode) is something like:

SELECT last_update_time FROM telegraf.measurements GROUP BY host;

…in more Influxy language, I’ve got as far as this:

SELECT last(usage_user),host FROM cpu
(which takes ages to execute - not sure if it would ever actually return)

SELECT last(usage_user),time,host FROM cpu where time > now() - 1h GROUP BY time(60s)
(but this seems to return lots of rows about some hosts and few or none about others)

I may be able to work around or solve some of the problems I’ve got because I’m using a Python program to do this querying. I also know the names of all the hosts that should be sending in stats (as I can get that from Ansible), so I could just run this repeatedly:

for host in hosts:
    client.query("SELECT last(usage_user), host FROM cpu WHERE host = '%s'" % host)

This seems to work, but it’s impractical for many hosts, and there doesn’t seem to be a way to say "WHERE host IN ['host1','host2']". I’ve also had a few problems getting the host into Python when running queries, but I’ll worry about that later.

Has anyone got any idea how I can check to see if a (known) host amongst many stops sending stats?

#2

@coofercat You might need to use Kapacitor for this; see the thread "Server availability percentage value".

Does that post help with what you are looking to do?
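
If Kapacitor turns out to be more than you need, one query grouped by the host tag might already be enough for a Python check. Below is a minimal sketch, not a tested recipe. It assumes the influxdb-python client (an `InfluxDBClient` instance passed in as `client`) and that Telegraf tags every `cpu` point with `host`. The function names, the 1h query window, and the 300 second threshold are made up for illustration. Any host with no point inside the window simply returns no series, so it is treated the same as a host whose last point is too old.

```python
import time

# Assumed query: newest cpu point per host within the last hour.
# Bounding the time range keeps the query from scanning all history.
QUERY = "SELECT last(usage_user) FROM cpu WHERE time > now() - 1h GROUP BY host"


def fetch_last_seen(client):
    """Return {host: epoch seconds of its newest cpu point} from one query.

    `client` is assumed to behave like influxdb.InfluxDBClient:
    query(..., epoch="s") returns a result whose items() yields
    ((measurement, tags), points) pairs, one per GROUP BY series.
    """
    result = client.query(QUERY, epoch="s")
    last_seen = {}
    for (_measurement, tags), points in result.items():
        for point in points:
            # With epoch="s" the "time" field is an integer timestamp.
            last_seen[tags["host"]] = point["time"]
    return last_seen


def find_stale_hosts(expected_hosts, last_seen, max_age_s=300, now=None):
    """Return the sorted hosts that are missing or older than max_age_s."""
    now = time.time() if now is None else now
    stale = []
    for host in expected_hosts:
        seen = last_seen.get(host)
        if seen is None or now - seen > max_age_s:
            stale.append(host)
    return sorted(stale)
```

You would call `find_stale_hosts(hosts, fetch_last_seen(client))` with the host list from Ansible and alert on anything it returns. Keeping the staleness check separate from the query makes it easy to test without a running InfluxDB.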