I’m trying to set up a system to monitor server up/down status via ICMP. I have more than 500 servers, and more are added each week. I don’t want to manually type all the servers under [[inputs.ping]], since that won’t scale well and is very prone to human error.
No, I don’t have automatic provisioning software, but I’m not sure I explained the situation correctly. Eventually I’m thinking of deploying Telegraf on each host, but my question here is different.
I only want to monitor and record their ping responses, so I’m thinking of sending ping requests to each server from one agent.
The problem is filling the telegraf.conf file with server names. That job should be automated somehow. I need to create an inventory of these hosts somewhere and tell the agent to read the hostnames from the inventory. Do you have any suggestions? Something like SHOW TAG VALUES WITH KEY = "host"?
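One way to automate this is to generate the ping section from a plain-text inventory. The following is a minimal Python sketch, not an official Telegraf feature: the inventory format (one hostname per line, `#` for comments) and the function names are assumptions made for illustration. It relies only on the `[[inputs.ping]]` plugin accepting a list of hosts under `urls`.

```python
import json
from pathlib import Path


def load_hosts(inventory_text):
    """Return unique hostnames in file order, skipping blanks and # comments."""
    hosts = []
    seen = set()
    for line in inventory_text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name and name not in seen:
            seen.add(name)
            hosts.append(name)
    return hosts


def render_ping_config(hosts):
    """Render an [[inputs.ping]] block with one quoted host per line."""
    # json.dumps produces a double-quoted, escaped string that is also valid TOML.
    body = ",\n".join(f"    {json.dumps(h)}" for h in hosts)
    return f"[[inputs.ping]]\n  urls = [\n{body}\n  ]\n"


def write_ping_config(inventory_path, output_path):
    """Read the inventory file and write the generated config fragment."""
    hosts = load_hosts(Path(inventory_path).read_text())
    Path(output_path).write_text(render_ping_config(hosts))
    return len(hosts)
```

A call such as `write_ping_config("hosts.txt", "/etc/telegraf/telegraf.d/ping.conf")` could run from cron, followed by a Telegraf restart or reload so the new list is picked up. Both paths are examples only; the output location assumes Telegraf is started with a config directory that includes it.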
@rawkode If I go with the second option, how should the hosts call home? I’m thinking that using the ping input may be unnecessary. Do you agree? Is there a way to tell Telegraf to write “1” to the database as a dummy indicator that the host is up and well, like a heartbeat?
Edit: I’m probably going to ping localhost on each server.
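A heartbeat that writes a constant “1” could be sketched as a tiny script that prints one point in InfluxDB line protocol. This assumes Telegraf runs it through the `[[inputs.exec]]` plugin with `data_format = "influx"`; the measurement name `heartbeat` and the function name are hypothetical.

```python
def heartbeat_line(measurement="heartbeat"):
    """Return one line-protocol point whose integer field `value` is 1."""
    # No tags or timestamp are emitted here; Telegraf adds its own
    # host tag and collection time when it ingests the line.
    return f"{measurement} value=1i"


if __name__ == "__main__":
    print(heartbeat_line())
```

With this approach, a missing point for a given host means its agent did not report during that interval.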
Hi, this is drifting to a different topic, but I want to ask anyway.
I’ve created a deadman alert set to “if no data for 1min”, and now it repeats the same alert every minute. Is there a way to escalate it and stack the alerts into one, instead of 10 different lines? Also, at the moment Kapacitor sends an e-mail and a Telegram message for each alert.
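The usual idea for collapsing repeats is to notify only when the state changes, not on every evaluation. Kapacitor’s alert node has a `stateChangesOnly()` property intended for this. The Python sketch below only illustrates the logic and is not Kapacitor code; the function name and the assumption that a host starts in the “up” state are mine.

```python
def state_changes(checks, initial_state="up"):
    """Yield (time, state) only when the state differs from the previous one."""
    previous = initial_state  # assumed starting state, so a first "up" is silent
    for when, state in checks:
        if state != previous:
            yield when, state
            previous = state


checks = [
    ("15:01", "up"),
    ("15:02", "up"),
    ("15:03", "down"),
    ("15:04", "down"),
    ("15:05", "up"),
]
notifications = list(state_changes(checks))
# Five evaluations collapse to two notifications: down at 15:03, up at 15:05.
```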
Another question, this time related to the topic:
I want to see when a server went down, how long it stayed down, and when it came back up.
When I ping localhost, I only get successful responses; there is no down value (packet loss), since the agent isn’t running while the server is down.
Let’s say I ping once a minute, for simplicity.
The records would be:
15:01 ping server up
15:02 ping server up
15:05 ping server up
Which means the server (agent) was down for 2 minutes.
If I ping the server from outside, I’ll get:
15:01 ping server up
15:02 ping server up
15:03 no response server down
15:04 no response server down
15:05 ping server up
In the first example, the 15:03 and 15:04 values will be null.
Can I create a table in Grafana, or get output via InfluxQL, showing the down time, duration, and recovery time for either scenario (local ping or from outside)?
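Outside Grafana and InfluxQL, the down periods can be derived from the timestamps of the successful records alone, which covers both scenarios: for the outside ping, keep only the “up” rows first. This is a minimal Python sketch under the one-ping-per-minute assumption above; the function name and the tuple layout are hypothetical.

```python
from datetime import datetime, timedelta


def down_periods(up_times, interval=timedelta(minutes=1)):
    """Return (first_missed, back_up, downtime) for each gap longer than interval."""
    periods = []
    ordered = sorted(up_times)
    for prev, curr in zip(ordered, ordered[1:]):
        gap = curr - prev
        if gap > interval:
            # The first missed sample is one interval after the last success.
            periods.append((prev + interval, curr, gap - interval))
    return periods


def at(hhmm):
    """Helper for the example: build a datetime on an arbitrary fixed day."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2024, 1, 1, hour, minute)


example = down_periods([at("15:01"), at("15:02"), at("15:05")])
# One period: first missed sample 15:03, back up 15:05, downtime 2 minutes.
```

The resolution is limited by the ping interval, so the real outage could have started or ended up to one interval earlier than reported.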
Definitely agree with @rawkode on decentralized monitoring, since it takes the discovery problem out of the equation and scales better, but of course it isn’t always possible to use this architecture. I have heard of users running 5K ping targets with method = "native"; for this you will need to increase the ulimit and set up permissions to use raw sockets.
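A quick way to sanity-check the limit before scaling up is sketched below in Python. It assumes the relevant ulimit is the open-file limit (`RLIMIT_NOFILE`) and budgets roughly one descriptor per target plus a reserve; that ratio is a guess for illustration, not a documented Telegraf figure, and the function names are hypothetical.

```python
def current_soft_limit():
    """Return the soft open-file limit, or None if it is unlimited (Unix only)."""
    import resource  # imported lazily because the module is Unix-only

    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return None if soft == resource.RLIM_INFINITY else soft


def has_fd_headroom(soft_limit, targets, reserve=256):
    """Check the assumed budget of one descriptor per target plus a reserve."""
    if soft_limit is None:  # unlimited
        return True
    return soft_limit >= targets + reserve
```

For example, `has_fd_headroom(current_soft_limit(), 5000)` run as the Telegraf service user would indicate whether the limit needs raising under these assumptions. Raw-socket permissions are a separate matter and are not checked here.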