Server Availability Monitoring

Mert · December 12, 2019, 6:21am

Hi,

I’m trying to set up a system to monitor server up/down status via ICMP. I have more than 500 servers, and more are added each week. I don’t want to manually type all the servers under [[inputs.ping]], since that won’t scale well and is prone to human error.

I would like to hear your experiences.

Regards,

rawkode · December 14, 2019, 10:42am

Hi @Mert,

Do you have any automatic provisioning of software on each of these machines that would allow you to run Telegraf on each?

Mert · December 14, 2019, 12:34pm

No, I don’t have automatic provisioning software, but I am not sure I explained the situation correctly. In the end I’m thinking of deploying Telegraf on each host, but my question here is different.

I only want to monitor and record their ping responses. So, I’m thinking of sending ping requests to each server from one agent.

The problem is filling the telegraf.conf file with server names. It’s a job that should be automated somehow. I need to create an inventory of these hosts somewhere and tell the agent to read hostnames from the inventory. Do you have any suggestions? Something like SHOW TAG VALUES WITH KEY = "host"?

rawkode · December 14, 2019, 8:08pm

You have two options here.

1. Continue as planned with a central Telegraf. You’ll need to automate generating the configuration with every host.
2. Stick a Telegraf on every machine and have it phone home. Use a deadman alert to detect machines that are offline.
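
For the central-Telegraf option, here is a minimal sketch of how the ping section could be generated from an inventory. It assumes a plain text inventory with one hostname per line; the function names, the inventory format, and the `count`/`timeout` defaults are illustrative assumptions, not something from this thread.

```python
import json


def read_inventory(path):
    """Read one hostname per line; lines starting with '#' are treated as comments.

    The file format is an assumption made for this sketch.
    """
    with open(path, encoding="utf-8") as fh:
        return [line for line in fh if not line.lstrip().startswith("#")]


def render_ping_input(hosts, count=1, timeout=1.0):
    """Return a Telegraf [[inputs.ping]] section covering the given hosts."""
    # Drop blanks and duplicates; sort so regenerated files produce small diffs.
    unique = sorted({h.strip() for h in hosts if h.strip()})
    # json.dumps produces double-quoted strings, which are valid TOML
    # basic strings for ordinary hostnames and IP addresses.
    urls = ", ".join(json.dumps(h) for h in unique)
    return (
        "[[inputs.ping]]\n"
        f"  urls = [{urls}]\n"
        f"  count = {count}\n"
        f"  timeout = {timeout}\n"
    )
```

The rendered text could be written to its own file in the directory Telegraf loads extra configuration from (often /etc/telegraf/telegraf.d on package installs, but check your setup), with Telegraf reloaded whenever the inventory changes.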

Mert · December 16, 2019, 5:49am

@rawkode If I go with the second one, how should I call home? I’m thinking that using the ping input may be unnecessary. Do you agree? Is there a way to tell Telegraf to write “1” to the database as a dummy indicator that it’s up and well, like a heartbeat?

Edit: I’m probably going to ping localhost on each server.

rawkode · December 16, 2019, 9:01am

Hi,

I’d start by enabling the CPU and Mem plugins. You’ll get useful host metrics, and it’ll provide enough to build the deadman alert too.

rawkode · December 16, 2019, 9:02am

You can also ping localhost if you want. Add a few options and see what you prefer.

Mert · December 16, 2019, 12:47pm

@rawkode

Hi, this is drifting to a different topic, but I want to ask anyway.

I’ve created a deadman alert and set it to “if no data for 1min”, and now it repeats the same alert every minute. Is there a way to stack the alerts into one instead of 10 different lines? Also, at the moment Kapacitor sends an e-mail and a Telegram message for each alert.
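
One common way to avoid a notification every minute is to notify only when the alert state changes. Kapacitor’s alert node documents a `.stateChangesOnly()` property for this purpose. The Python sketch below is not Kapacitor code; it only illustrates the idea, and the function name, the state labels, and the assumption that a host starts out healthy are made up for the example.

```python
def notifications(states, initial="OK"):
    """Yield (index, state) only when the state differs from the previous one.

    `states` is the alert level evaluated once per interval, e.g. once a minute.
    """
    previous = initial  # assumption: the host is healthy before the first check
    for i, state in enumerate(states):
        if state != previous:
            yield i, state
            previous = state


# Ten minutes of evaluations: the host is silent from minute 1 through minute 8.
per_minute = ["OK"] + ["CRITICAL"] * 8 + ["OK"]
for minute, state in notifications(per_minute):
    print(minute, state)
```

With this logic the eight consecutive CRITICAL evaluations produce a single notification when the host goes silent and a second one when it recovers.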

Mert · December 16, 2019, 1:57pm

Another question, this time related to the topic:

I want to see when a server went down, how long it stayed down, and when it came back up.

When I ping localhost, I only get successful responses; there is no down value (packet loss), since the agent isn’t running at that time.
Let’s say I ping once a minute for simplicity.

Records would be:
15:01 ping server up
15:02 ping server up
15:05 ping server up

Which means the server (agent) was down for 2 minutes.

If I ping the server from outside, I’ll get:

15:01 ping server up
15:02 ping server up
15:03 no response server down
15:04 no response server down
15:05 ping server up

In the first example, the 15:03 and 15:04 values will be null.
Can I create a table in Grafana, or get the output below
via InfluxQL, for either scenario (local ping or from outside)?

15:03 server went down
15:05 server went up

gberger · December 17, 2019, 2:29am

Hello, what are the limits of using a central Telegraf with inputs.ping? Could we configure 5k IPs? 10k? Is it fine as long as system resources are adequate? Thanks,

rawkode · December 17, 2019, 9:45am

With InfluxQL, you can use FILL(null) in a GROUP BY time() query to fill in the missing intervals. Does this provide what you require?
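
Once the data is in per-minute buckets with nulls for the missing intervals, the "went down" / "went up" rows can be derived by looking for changes between consecutive buckets. The sketch below does that client-side in Python; it is only an illustration, with a hypothetical function name and hand-written input rather than a real query result.

```python
def transitions(buckets, is_up=lambda value: value is not None):
    """Return (time, event) pairs for each change between up and down.

    `buckets` is a list of (time_label, value) pairs in time order, where
    value is None for an interval with no data. `is_up` decides whether a
    bucket counts as up; for pings sent from outside you might pass a
    predicate on packet loss instead of the default null check.
    """
    events = []
    was_up = None  # unknown before the first bucket, so no event is emitted for it
    for label, value in buckets:
        now_up = is_up(value)
        if was_up is not None and now_up != was_up:
            events.append((label, "server went up" if now_up else "server went down"))
        was_up = now_up
    return events


rows = [("15:01", 1), ("15:02", 1), ("15:03", None), ("15:04", None), ("15:05", 1)]
for time_label, event in transitions(rows):
    print(time_label, event)
```

For the sample rows this yields a "server went down" event at 15:03 and a "server went up" event at 15:05; the difference between the two times gives the length of the outage.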

rawkode · December 17, 2019, 9:51am

Hmm. I believe this would be restricted by the number of open connections your kernel allows. You’d need to test and tweak the ulimit, I think.

Telegraf uses goroutines, so the limit isn’t Telegraf but the host.

You can always use multiple “central” Telegrafs.

Though I definitely prefer the dial home approach; it’s more scalable and works well in ephemeral environments with dynamic infrastructure.

daniel · December 17, 2019, 9:40pm

Definitely agree with @rawkode on decentralized monitoring, since it takes the discovery problem out of the equation and scales better, but of course it isn’t always possible to use this architecture. I have heard of users running 5K ping targets with method = "native". For this you will need to increase the ulimit and set up permissions to use raw sockets.
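
As a rough pre-check before testing, the open-file limit of the current process can be compared with the number of ping targets. The sketch below uses Python’s standard `resource` module (Unix only). The one-descriptor-per-target figure and the reserve of 64 are guesses made for illustration, not measured Telegraf behaviour, so treat the result as a starting point for your own tests.

```python
import resource


def descriptors_needed(targets, per_target=1, reserve=64):
    """Estimate descriptors needed; per_target and reserve are assumptions."""
    return targets * per_target + reserve


def limit_is_enough(targets):
    """Compare the soft open-file limit of this process with the estimate."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = descriptors_needed(targets)
    # RLIM_INFINITY means there is no limit, so any estimate fits.
    enough = soft == resource.RLIM_INFINITY or soft >= needed
    return enough, soft, hard


enough, soft, hard = limit_is_enough(5000)
print(f"soft={soft} hard={hard} enough_for_5000_targets={enough}")
```

Note that this reports the limit for the Python process itself. The limit that matters is the one applied to the Telegraf service, which may be set separately by the service manager.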

gunter_berger · December 18, 2019, 1:51am

Thanks, we’ll run some tests to see where the limits are.
