We have a number of Linux (RHEL) boxes whose health we want to monitor. We already have a dashboard that gives us all the information regarding network, memory, disk, etc. What we want now is simply a SingleStatus plugin that shows green when the box is up and red if the box goes down. Oh, and it should go green again once the box comes back up.
The biggest problem is knowing which metric to use. We’ve tried n_cpus, which seemed to work except when the network is busy.
Telegraf is set to sample every 10s and flush every 20s. This is the query we had, which looked quite promising:
select sum(sum_v) from (select sum(n_cpus) as sum_v from system where "host" = 'aaps-00001.ref' and time > now() - 70s and time < now() - 10s group by time(60s))
One question that comes to mind is whether we should be ‘jittering’ the Telegraf collection to prevent a boundary sample problem.
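To illustrate the boundary concern, here is a minimal Python sketch that counts how many 10s samples fall strictly inside the query window (now - 70s, now - 10s) and have already been written. It is only a toy model, not a description of how Telegraf or InfluxDB work internally: the function name, the `offset` parameter (where in the 10s cycle samples land) and the single fixed `write_delay` standing in for flush and network lag are all assumptions made for illustration.

```python
import math


def visible_samples(now, interval=10, oldest=70, newest=10, offset=0, write_delay=0):
    """Count samples t with now-oldest < t < now-newest that are already written.

    Hypothetical model: samples occur at offset + k*interval (seconds), and a
    sample taken at time t becomes queryable at t + write_delay.
    """
    lo = now - oldest   # exclusive lower bound, like "time > now() - 70s"
    hi = now - newest   # exclusive upper bound, like "time < now() - 10s"
    count = 0
    # Start from the last sample at or before the lower bound.
    k = math.floor((lo - offset) / interval)
    while True:
        t = offset + k * interval
        if t >= hi:
            break
        if t > lo and t + write_delay <= now:
            count += 1
        k += 1
    return count


if __name__ == "__main__":
    # Query issued exactly on a sample boundary vs. halfway between samples.
    print(visible_samples(now=1000))
    print(visible_samples(now=1005))
    # Same mid-cycle query, but writes arrive 20s late.
    print(visible_samples(now=1005, write_delay=20))
```

In this model the count, and therefore the summed n_cpus value, changes with where the query lands relative to the sample times and with how late writes arrive, so a threshold tied to an exact sum would flip even though the host never went down.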
Has anyone done something similar, or can anyone point me in the right direction on which field to use? There is a wealth of tables and fields available, but we need one that will give us the up/down visual we want.