Monitoring health of Linux box

Hi,

We have a number of Linux (RHEL) boxes that we want to monitor the health of. When I say monitor, we already have a dashboard that gives us all the information regarding network, memory, disk etc. What we want is simply a SingleStatus plugin that shows green when the box is on and red if the box goes down. Oh, and it should go green again once the box comes back up.

The biggest problem is knowing what metric to use. We’ve tried n_cpus, which seemed to work except when the network is busy.

This is what we had which looked quite promising. Telegraf is set to sample every 10s and flush every 20s.

select sum(sum_v) from (select sum(n_cpus) as sum_v from system where "host" = 'aaps-00001.ref' and time > now() - 70s and time < now() - 10s group by time(60s))

One question which comes to mind is whether we should be ‘jittering’ the Telegraf collection to prevent a boundary sample problem.

Has anyone done something similar, or can anyone point me in the right direction? There is a wealth of tables and fields available, and we need to know which field will give us the visual status we want.

TIA
Martin

When you say “… the box is on”, I presume you mean it is pingable, i.e. the server is alive, up and running. Am I right?
I just realised you did mention Telegraf, but with a typo (“telegrag”), so I did not make it out at first :slight_smile:

You may want to try the Telegraf “ping” plugin.

Hi Ashish,

“… the box is on” - yes, the box is running and performing as it should. We just need notification when it goes down so the “ping” plugin is a great call. I’ll give it a try.

On a side note, is there a complete list of the plugins available? - And thanks for the typo note :+1:

Ah - found the list in GitHub but would be nice to have a list with descriptions for each one for novices like myself :wink:.
I’ll create one for my own use and put it somewhere if there isn’t one already - if someone would like to tell me where I could put it.

Hi Martin,
Glad the suggestion gave you some direction.

Yeah, you may want to try https://github.com/influxdata/telegraf/tree/master/plugins
Remember there are input and output plugins. You could try the GitHub API to pull the content from each plugin's README file.
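
For anyone wanting to build such a list, here is a minimal Python sketch of the GitHub API idea. It assumes the plugins still live under plugins/inputs and plugins/outputs on the master branch, one directory per plugin, each with a README.md. The helper names (plugin_names, readme_url, first_paragraph, describe_plugins) are made up for illustration, and taking the first paragraph of the README as the description is only a guess at what reads well.

```python
import json
import urllib.request

# GitHub "contents" endpoint for a directory, and the raw README location.
# The repository layout is an assumption taken from the link above.
API = "https://api.github.com/repos/influxdata/telegraf/contents/plugins/{kind}"
RAW = ("https://raw.githubusercontent.com/influxdata/telegraf/master/"
       "plugins/{kind}/{name}/README.md")


def plugin_names(listing):
    """Keep only sub-directories from a contents listing (one per plugin)."""
    return sorted(e["name"] for e in listing if e.get("type") == "dir")


def readme_url(kind, name):
    """Build the raw README URL for a plugin, e.g. kind='inputs', name='ping'."""
    return RAW.format(kind=kind, name=name)


def first_paragraph(readme_text):
    """Return the first non-heading paragraph of a README as one line."""
    lines = []
    for line in readme_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            if lines:          # paragraph finished
                break
            continue           # still skipping title / blank lines
        lines.append(stripped)
    return " ".join(lines)


def describe_plugins(kind="inputs", limit=5):
    """Fetch a few plugin descriptions (needs network access)."""
    with urllib.request.urlopen(API.format(kind=kind)) as resp:
        names = plugin_names(json.load(resp))
    result = {}
    for name in names[:limit]:
        with urllib.request.urlopen(readme_url(kind, name)) as resp:
            result[name] = first_paragraph(resp.read().decode("utf-8"))
    return result
```

Calling describe_plugins("inputs") or describe_plugins("outputs") returns a small name-to-description dict. Unauthenticated GitHub API calls are rate limited, hence the limit argument.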

Hi Ashish,

Many thanks for your help on this. It works really well and does what we want. There are a couple of quirks: when result_code isn’t 0 the other fields aren’t filled, and sometimes result_code is 0 while packets_received is also 0! But all in all it is much better than what we had, and it gives us a nice green status when the box is up and running, and red when it’s not.

Thanks again! :grin:
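
For later readers, the two quirks mentioned above can be folded into one rule. This is only a sketch: the field names result_code and packets_received come from the ping plugin as described in this thread, but the function name box_status and the decision to treat "result_code is 0 with no packets received" as down are assumptions, not something the plugin prescribes.

```python
from typing import Optional


def box_status(result_code: Optional[int],
               packets_received: Optional[int]) -> str:
    """Map ping fields to a dashboard colour.

    Green only when the ping ran cleanly AND at least one reply came back.
    Missing fields (None) are treated as 'down', which covers the case
    where result_code isn't 0 and the other fields aren't filled.
    """
    if result_code != 0:          # non-zero or missing: ping itself failed
        return "red"
    if not packets_received:      # 0 or missing: no reply despite code 0
        return "red"
    return "green"
```

For example, box_status(0, 1) gives "green", while box_status(0, 0) and box_status(1, None) both give "red". The same rule could equally be expressed as thresholds on the dashboard side.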

:sunglasses: Superb!