I’m very happy with the usability of the TICK stack, but I am missing one vital feature, which I’d call a “server availability percentage”. By that I mean an automatically calculated value for a given time range: the share of that time range during which the server was available versus unavailable. Strictly speaking, I need this for the server’s network availability, but I have not found a way to achieve that, so my attempt so far is to take the “uptime” value from Telegraf’s system plugin and compare it to the duration value in Chronograf’s alert database. On a side note: which unit does the duration value use? I already have a few alerts logged and their duration values are sometimes huge. The Telegraf uptime value seems to be in seconds; could it be that Chronograf’s alert durations are measured in milliseconds?
As I haven’t found that feature in Chronograf yet, I tried a mixed query in Grafana, with A being the server uptime value and B being the duration value of the alerts. Is there a way to combine these two values into a single (percentage) value? So far I have not found one, as I always get an N/A result whatever I try. My queries look like this:
Query1 (database: chronograf) SELECT count("message") FROM "alerts" WHERE "alertName" = 'Deadman Netzausfall' AND "level" = 'CRITICAL' AND "host" =~ /^$server$/ AND $timeFilter
Query2 (database: telegraf) SELECT "uptime" FROM "system" WHERE "host" =~ /^$server$/ AND $timeFilter
Deadman Netzausfall is my Kapacitor alert, which notifies me if a server has not sent any net_response data in the last few minutes.
Did anyone try or manage to achieve something similar?
Additionally, it would be great if I could create a table listing the time ranges in which the server was available or unavailable, but I guess that is asking too much. A percentage value would already be a huge help, e.g. “the server was 97% available and 3% unavailable”.
We’re actually doing something very similar to what you describe to power these status lights in Chronograf:
The specific query that we’re running is:
select non_negative_derivative(mean(uptime)) as deltaUptime from "system" where time > now() - 10m group by host, time(1m) fill(0)
The idea is to get the rate that uptime is changing. If the rate is greater than zero, the server is continuing to report changes to its uptime and is therefore up. If it’s zero, there haven’t been any changes reported in the 10m period we asked for, so we change the light to amber to indicate that the server may be down. Finally, if the value isn’t present at all, it means the last reported change to uptime was outside a 10m window, so we change the light to red to indicate the server is down. This all happens here: https://github.com/influxdata/chronograf/blob/master/ui/src/hosts/components/HostsTable.js#L152-L159
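To make the three cases concrete, here is a minimal Python sketch of the decision described above. It is not the code from HostsTable.js; the function name `status_light` and the use of `None` for "no value present" are assumptions made for illustration.

```python
from typing import Optional


def status_light(delta_uptime: Optional[float]) -> str:
    """Map the latest deltaUptime value of a host to a light colour.

    delta_uptime is None when the query returned no value at all
    for the host (hypothetical convention for this sketch).
    """
    if delta_uptime is None:
        # No value in the 10m window: treat the server as down.
        return "red"
    if delta_uptime > 0:
        # Uptime is still increasing: the server is reporting.
        return "green"
    # A value of zero: no change reported, the server may be down.
    return "amber"


for value in (60.0, 0.0, None):
    print(value, status_light(value))
```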
We could get this as a percentage over the 10m window using subqueries (in v1.2.0+). I think this should do the trick:
select sum("isUp") / count("isUp") from (select non_negative_derivative(mean("uptime")) / non_negative_derivative(mean("uptime")) as isUp from system where time > now() - 10m group by time(1m) fill(0));
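The idea here is to take the rate at which uptime is changing and divide it by itself to get a 1 or 0 signal (InfluxDB can’t do NaNs, so it replaces them with 0s). Then we take the sum over that window divided by the count, which gives the fraction of 1m intervals in which the server was up (a value between 0 and 1; multiply by 100 for a percentage).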
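The same arithmetic can be checked outside InfluxDB. The following Python sketch assumes you already have one uptime delta per 1m interval (0 where nothing changed); the sample values and function names are made up for illustration and do not come from a real server.

```python
def is_up(delta: float) -> int:
    """Mimic delta / delta: 1 for any non-zero delta, 0 where 0/0 is replaced by 0."""
    return 1 if delta != 0 else 0


def availability_fraction(deltas):
    """Return sum(isUp) / count(isUp), or None when there are no intervals."""
    signals = [is_up(d) for d in deltas]
    if not signals:
        return None
    return sum(signals) / len(signals)


# Ten 1m intervals; the server reported no uptime change in two of them.
deltas = [60, 60, 0, 0, 60, 60, 60, 60, 60, 60]
print(availability_fraction(deltas))  # 0.8
```

A result of 0.8 corresponds to 80% availability over the window, and a result of 1 means the server was up in every interval.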
I am happy to see that you find that feature interesting and useful as well! I have tried to implement your suggestion regarding the subquery, but so far I have not managed to get it working (which might be due to my lack of experience with subqueries; so far I have always used the query builders).
If it’s not too much to ask, could you please guide me a bit on how to use your suggested subquery? The first query works for me, but the second one always gives me a parsing error at char 41, which is where the subquery begins, I think.
edit:
I do have InfluxDB version 1.2.2 installed, but I think my database does not allow subqueries yet, and I’m currently trying to find out how to “enable” that feature.
edit2:
Never mind! It’s working now, thank you! However, I always get this result:
name: system
time sum_count
0 1
I am now trying to modify this so that it gives percentage values, and I will post the query here if I get a result.
edit3:
I got a percentage value using this query here:
SELECT sum("duration") / (10000000 * 2592000) FROM "chronograf"."autogen"."alerts" WHERE "alertName" = 'Alarm Name' AND "host" =~ /^$server$/ AND time > now() - 30d
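The duration value for alerts is apparently measured in nanoseconds, so I divided the sum by the number of nanoseconds in 30 days (2592000 seconds times 10^9) and multiplied by 100 to get a percentage, which is why the divisor is 10000000 * 2592000. The result is the percentage of the 30 days during which the alert was active.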
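As a sanity check on the divisor, here is the same conversion as a small Python sketch. It assumes, as above, that the durations are in nanoseconds; the function and constant names are only for illustration.

```python
NS_PER_SECOND = 1_000_000_000
SECONDS_PER_30_DAYS = 30 * 24 * 60 * 60  # 2_592_000


def alert_percent(total_duration_ns: int,
                  window_s: int = SECONDS_PER_30_DAYS) -> float:
    """Share of the window covered by alert durations, in percent."""
    return total_duration_ns * 100 / (NS_PER_SECOND * window_s)


# Dividing by 10^9 and multiplying by 100 equals dividing by 10^7,
# so the query's divisor and this formula agree.
assert NS_PER_SECOND * SECONDS_PER_30_DAYS // 100 == 10_000_000 * 2_592_000

# 77,760 seconds of alerts (21.6 hours) within 30 days:
print(alert_percent(77_760 * NS_PER_SECOND))  # 3.0
```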
Problems:
If the server is currently in a critical state, the value will be wrong, because the duration can only be calculated once the alarm has ended (duration is 0 as long as the alarm has not returned to the “OK” level).
If the server has never been in a critical state in the past 30 days, the result will not be 0%, but “no value”.
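One possible way around both problems is to do the final calculation outside the query. The Python sketch below assumes you can obtain the start time of each alert and, where it exists, its end time (how to extract these from the alerts measurement is not shown here). All names are hypothetical. An alert that is still open is counted up to "now", and an empty list gives 0% instead of no value.

```python
from typing import List, Optional, Tuple

# (start, end) in seconds; end is None while the alert is still active.
Alert = Tuple[float, Optional[float]]


def downtime_percent(alerts: List[Alert],
                     window_start: float,
                     now: float) -> float:
    """Percentage of [window_start, now] covered by alerts."""
    window = now - window_start
    if window <= 0:
        raise ValueError("now must be later than window_start")
    down = 0.0
    for start, end in alerts:
        if end is None:
            end = now  # open alert: count it up to the present
        # Clip the alert to the reporting window.
        start = max(start, window_start)
        end = min(end, now)
        if end > start:
            down += end - start
    return 100.0 * down / window


print(downtime_percent([], 0, 1000))                          # 0.0
print(downtime_percent([(100, 200), (900, None)], 0, 1000))   # 20.0
```

The sketch assumes the alert intervals do not overlap; overlapping intervals would be counted twice.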
Hey @tim, I am trying to get an availability report for a server on a daily basis. How could we do that if we are sending the data through Telegraf? Another question: if the check interval is 5 min and the metric is down for one interval, does that mean the application counts as down for 10 min? I am a beginner with this and need some help.