Server Availability Monitoring

Hi,

I’m trying to set up a system to monitor server up/down status via ICMP. I have more than 500 servers, and more are added each week. I don’t want to manually type every server under [[inputs.ping]], since that won’t scale well and is prone to human error.

I would like to hear your experiences.

Regards,

Hi @Mert,

Do you have any automatic provisioning of software on each of these machines that would allow you to run Telegraf on each?

No, I don’t have automatic provisioning software, but I’m not sure I explained the situation correctly. Eventually I’m thinking of deploying Telegraf on each host, but my question here is different.

I only want to monitor and record their ping responses. So, I’m thinking of sending ping requests to each server from one agent.

The problem is filling the telegraf.conf file with server names. It’s a job that should be automated somehow. I need to create an inventory of these hosts somewhere and tell the agent to read hostnames from the inventory. Do you have any suggestions? Something like SHOW TAG VALUES WITH KEY = "host"?

You have two options here.

  1. Continue as planned with a central Telegraf. You’ll need to automate generating the configuration so that it lists every host.

  2. Stick a Telegraf on every machine and have it phone home. Use a deadman alert to detect machines that are offline.
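
For option 1, generating the configuration can be a small script. Here is a minimal sketch in Python, assuming the inventory is a plain list of hostnames (one per line, `#` for comments). The function name `render_ping_input` and the inventory format are hypothetical; `urls` and `count` are existing options of the ping input.

```python
import json


def render_ping_input(hosts, count=1):
    """Render an [[inputs.ping]] TOML fragment for the given hostnames.

    `hosts` is any iterable of strings, e.g. the lines of an inventory file
    (hypothetical format: one hostname per line, '#' starts a comment line).
    """
    names = set()
    for raw in hosts:
        name = raw.strip()
        # Skip blank lines and comment lines in the inventory.
        if name and not name.startswith("#"):
            names.add(name)

    lines = ["[[inputs.ping]]", f"  count = {count}", "  urls = ["]
    # Sort so that regenerating the file produces a stable, diff-friendly result.
    for name in sorted(names):
        # json.dumps yields a double-quoted string, which is also valid TOML
        # for ordinary hostnames and IP addresses.
        lines.append(f"    {json.dumps(name)},")
    lines.append("  ]")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    inventory = ["web01", "web02", "# decommissioned: web03", "db01", "web01"]
    print(render_ping_input(inventory), end="")
```

The output could be written to a separate file in Telegraf’s configuration directory (for example under telegraf.d) from a cron job or your inventory tool, followed by a Telegraf reload. Where the file goes and how the reload is triggered depends on your setup.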

@rawkode If I go with the second one, how should it call home? I’m thinking that the ping input may be unnecessary. Do you agree? Is there a way to tell Telegraf to write “1” to the database as a dummy indicator that it’s up and well, like a heartbeat?

Edit: I’m probably going to ping localhost on each server.

Hi,

I’d start by enabling the CPU and Mem plugins. You’ll get useful host metrics, and it’ll provide enough to build the deadman alert too.

You can also ping localhost if you want. Add a few options and see what you prefer.

@rawkode

Hi, this is drifting to a different topic, but I want to ask anyway.

I’ve created a deadman alert set to “if no data for 1min”, and now it repeats the same alert every minute. Is there a way to stack the alerts into one instead of 10 separate lines? Also, in the current setup Kapacitor sends an e-mail and a Telegram message for each alert.

Another question, this time related to the topic :)

I want to see when a server went down, how long it stayed down, and when it came back up.

When I ping localhost, I only get successful responses; there is no down value (packet loss), since the agent isn’t running while the server is down.
Let’s say I ping once a minute for simplicity.

Records would be:
15:01 ping server up
15:02 ping server up
15:05 ping server up

Which means the server (agent) was down for 2 minutes.

If I ping the server from outside, I’ll get:

15:01 ping server up
15:02 ping server up
15:03 no response server down
15:04 no response server down
15:05 ping server up

In the first example, the 15:03 and 15:04 values will be null.
Can I create a table in Grafana, or get the output below via InfluxQL, for either scenario (local ping or from outside)?

15:03 server went down
15:05 server went up
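
One way to get that output for the local-ping case is to look for gaps between consecutive “up” samples after the query. Here is a rough sketch in Python, assuming the timestamps have already been fetched and the collection interval is one minute. The function name `transitions` and the sample date are made up for illustration; this is post-processing outside the database, not an InfluxQL feature.

```python
from datetime import datetime, timedelta


def transitions(up_times, interval=timedelta(minutes=1)):
    """Infer down/up events from the gaps between 'up' samples.

    A gap longer than `interval` between two consecutive samples is treated
    as an outage that started one interval after the earlier sample and
    ended at the later sample.
    """
    events = []
    ordered = sorted(up_times)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > interval:
            events.append((prev + interval, "server went down"))
            events.append((cur, "server went up"))
    return events


if __name__ == "__main__":
    # The samples from the example above; the date itself is arbitrary.
    samples = [datetime(2020, 1, 1, 15, minute) for minute in (1, 2, 5)]
    for ts, event in transitions(samples):
        print(ts.strftime("%H:%M"), event)
```

For the outside-ping case, the same function works if you first keep only the samples where the server responded. With real data you would probably allow some slack on `interval` (say 1.5x) so that small timing jitter isn’t reported as an outage.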

Hello, what are the limits of using a central Telegraf with inputs.ping? Could we configure 5k IPs? 10k? As long as system resources are adequate? Thanks,

With InfluxQL, you can use FILL(null) in a GROUP BY time() query to return a null value for each interval that has no data. Does this provide what you require?

Hmm. I believe this would be restricted by the number of open connections allowed by your kernel. You’d need to test and tweak the ulimit, I think.

Telegraf uses goroutines, so the limit isn’t Telegraf but the host.

You can always use multiple “central” Telegrafs.
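
If you do split the targets across several central Telegrafs, the assignment can be automated as well. Here is a small sketch in Python, assuming a simple hash-based split; the function names and the shard count are hypothetical, and this is only one possible way to divide the list.

```python
import hashlib


def shard_for(host, shards):
    """Map a hostname to a shard index in [0, shards) using a stable hash."""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shards


def partition(hosts, shards):
    """Split hostnames into `shards` lists, one per central Telegraf."""
    buckets = [[] for _ in range(shards)]
    for host in sorted(set(hosts)):
        buckets[shard_for(host, shards)].append(host)
    return buckets


if __name__ == "__main__":
    hosts = [f"srv{i:03d}" for i in range(12)]
    for index, bucket in enumerate(partition(hosts, 3)):
        print(f"telegraf-{index}: {len(bucket)} hosts")
```

A given host always lands on the same instance as long as the number of shards stays the same. Changing the shard count reshuffles most hosts, which is usually fine for ping targets.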

That said, I definitely prefer the dial-home approach: it’s more scalable and works well in ephemeral environments with dynamic infrastructure.


Definitely agree with @rawkode on decentralized monitoring, since it takes the discovery problem out of the equation and scales better, but of course it isn’t always possible to use this architecture. I have heard of users running 5K ping targets with method = "native". For this you will need to increase the ulimit and set up permissions to use raw sockets.


Thanks, we’ll run some tests to see where the limits are.