Monitoring health of Linux box

MartinWalke · February 15, 2019, 2:24pm

Hi,

We have a number of Linux (RHEL) boxes whose health we want to monitor. When I say monitor, we already have a dashboard that gives us all the information regarding network, memory, disk etc. What we want is simply a SingleStatus plugin that shows green when the box is on and red if the box goes down. Oh, and goes green again once the box comes back up.

The biggest problem is knowing what metric to use. We’ve tried n_cpus which seemed to work except if the network is busy.

This is what we had which looked quite promising. Telegraf is set to sample every 10s and flush every 20s.


select sum(sum_v) from (select sum(n_cpus) as sum_v from system where "host" = 'aaps-00001.ref' and time > now() - 70s and time < now() - 10s group by time(60s))

One question which comes to mind is whether we should be ‘jittering’ the telegraf collection to prevent a boundary sample problem.

Has anyone done something similar, or can anyone point me in the right direction as to which field to use? There is a wealth of tables and fields available, and we need one that will give us the visual indication we want.

TIA
Martin

Ashish_Sikarwar · February 15, 2019, 3:51pm

When you say “… the box is on”, I presume you mean it is pingable, i.e. the server is alive and up and running. Am I right?
I just realised you wrote “telegrag” with a typo, so I did not make it out at first.

You may want to try the Telegraf “ping” plugin.

MartinWalke · February 18, 2019, 8:28am

Hi Ashish,

“… the box is on” - yes, the box is running and performing as it should. We just need notification when it goes down so the “ping” plugin is a great call. I’ll give it a try.

On a side note, is there a complete list of the plugins available? And thanks for the typo note.

MartinWalke · February 18, 2019, 9:00am

Ah - found the list on GitHub, but it would be nice to have a list with a description of each plugin for novices like myself.
I’ll create one for my own use and put it somewhere if there isn’t one already, if someone would like to tell me where I could put it.

Ashish_Sikarwar · February 18, 2019, 11:23am

Hi Martin,
Glad the suggestion gave you some direction.

Ashish_Sikarwar · February 18, 2019, 11:25am

Yeah, you may want to look at the plugins directory of the influxdata/telegraf repository on GitHub (telegraf/plugins at master).
Remember there are both input and output plugins. You could try the GitHub API to pull the content from each README file.
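A rough sketch of that idea in Python, in case it helps. It assumes each plugin keeps its README at plugins/<kind>/<plugin>/README.md in the repository and that the GitHub contents API returns the file base64-encoded in a "content" field. The helper names are made up for illustration, and the "first non-heading line" rule is only a guess at what makes a usable description.

```python
import base64
import json
import urllib.request

# Contents API root for the Telegraf repository.
API_ROOT = "https://api.github.com/repos/influxdata/telegraf/contents"


def readme_api_url(kind, plugin):
    """Build the contents-API URL for one plugin README.

    kind is 'inputs' or 'outputs'; the path layout is an assumption.
    """
    return f"{API_ROOT}/plugins/{kind}/{plugin}/README.md"


def decode_readme(payload):
    """Decode the base64 'content' field of a contents-API response."""
    return base64.b64decode(payload["content"]).decode("utf-8")


def first_description_line(readme_text):
    """Return the first non-empty line that is not a Markdown heading."""
    for line in readme_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            return stripped
    return ""


def fetch_description(kind, plugin):
    """Fetch one README over the network and return its first description line.

    Unauthenticated requests are rate limited by GitHub.
    """
    with urllib.request.urlopen(readme_api_url(kind, plugin)) as resp:
        return first_description_line(decode_readme(json.load(resp)))
```

Calling fetch_description("inputs", "ping") for each plugin name would build up the kind of annotated list Martin was asking about.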

MartinWalke · February 18, 2019, 12:18pm

Hi Ashish,

Many thanks for your help on this. It works really well and does what we want. There are a couple of quirks: when result_code isn’t 0 the other fields aren’t filled in, and sometimes result_code is 0 while packets_received is also 0! But all in all it is much better than what we had, and it gives us a nice green status when the box is up and running and red when it’s not.
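For anyone hitting the same quirks, here is a minimal sketch of the up/down rule in Python. It assumes each ping sample arrives as a dict of fields; box_status is a hypothetical helper rather than anything from Telegraf, and treating "result_code is 0 but packets_received is 0" as down is a judgment call, not documented behaviour.

```python
def box_status(fields):
    """Return 'up' or 'down' for one ping sample.

    fields is a dict of the ping measurement's fields. Only result_code
    and packets_received are read, because the other fields may be
    missing when result_code is not 0.
    """
    if fields.get("result_code") != 0:
        # Ping itself failed, or the field is missing entirely.
        return "down"
    if not fields.get("packets_received"):
        # The command succeeded but nothing came back: treat as down.
        return "down"
    return "up"
```

The same two conditions can be applied in the dashboard query or threshold instead, so the status panel only goes green when both hold.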

Thanks again!

Ashish_Sikarwar · February 18, 2019, 12:35pm

Superb!
