This is gonna be kind of long, but I’m unable to find fewer words to explain it.
I suggest you check the Grafana docs here, as they are changing their alerting functionality; have a look and decide which version to use.
I’ll put here a sample of what is now called the Legacy Grafana Alerts.
The first step is to decide how to check if something is wrong.
1 - Finding the key → in your case “name”
2 - Choosing a metric to identify when something is wrong → this is up to your specific case
3 - Building a query with the desired precision
4 - Configuring the alert rule
I’ll show you my case
1 - Key
My key is given by 3 tags (company, host, telegraf_instance)
I’ll use the data about telegraf internal monitoring to detect gathering issues (more info here)
3 - Building the Query
The query below returns the number of points written by a Telegraf agent:
non_negative_difference(last("metrics_written")) AS "Points Written"
When plotted, the result is something like this (unreadable due to the number of agents, but that’s not an issue, as it doesn’t have to be human-readable):
- I keep the Grafana time filter WHERE $timeFilter; the alert itself will manage the time range
- I return 0 when there is no data with fill(0), as I actually like to see the “drop” in the chart
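Putting those pieces together, the full panel query looks roughly like this. This is a sketch: I’m assuming the data comes from Telegraf’s internal plugin, whose internal_write measurement holds the metrics_written field; adjust the measurement and tag names to your setup:

```sql
-- Points written per agent, per hour; silent intervals become 0 thanks to fill(0)
SELECT non_negative_difference(last("metrics_written")) AS "Points Written"
FROM "internal_write"
WHERE $timeFilter
GROUP BY time(1h), "company", "host", "telegraf_instance" fill(0)
```

The GROUP BY tags are the “key” from step 1, so each agent gets its own series.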
Practically, when a host is not sending data, I’ll have something like this:
4 - Alert Rule
For this one I suggest you look at the docs; if you go with Legacy Alerting, you should have something like this:
I’ll give a quick explanation
Evaluate Every - how often the alert rule runs
For - This sets an optional threshold: the alert will actually fire only after the rule has been “firing” for this amount of time. Below this threshold, the status will be “Pending” (there are 3 statuses: OK, Pending, Alerting)
Conditions - Choose the query and time range, then define an aggregation and compare its value to a threshold of your choosing. Pay attention to the time range
The rule you see above does the following:
- Given the number of points received from each agent, grouped by 1h intervals
- Use 96h as the time range (time >= now()-96h → the last 4 days of data)
- Get the last value in each series
- Compare it to 500; if lower, trigger an alert
- This rule runs every 6h
- No alert will be triggered until the rule has been firing for more than 12h (this avoids alerting on temporary problems)
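To make the condition concrete: the query the alert evaluates is effectively the panel query with the Grafana time filter swapped for a fixed window. Again a sketch, with the same assumed internal_write measurement and tag names:

```sql
-- Same query the panel uses, but with a fixed 96h window instead of $timeFilter
SELECT non_negative_difference(last("metrics_written")) AS "Points Written"
FROM "internal_write"
WHERE time >= now() - 96h
GROUP BY time(1h), "company", "host", "telegraf_instance" fill(0)
```

The alert condition then takes last() of each resulting series and checks whether it is below 500.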
As you should know, when you query your data, if a “key” (tag) is not contained in the time range it won’t appear at all. The same is true for alerting: if a series does not exist, you can’t alert on it.
i.e.: if you want to check every 10m,
- do not fetch only the last 10m, because if a “key” is missing it won’t even appear, and if it does not exist it can’t be compared to anything;
- do fetch at least the last 20m; this way the key will exist, and if there is no value for the last 10m its value will be null, or 0 if you use fill(0).
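To illustrate the pitfall with the 10m example (same hypothetical measurement and a single "host" tag for brevity):

```sql
-- BAD: a series that stopped reporting simply disappears from the result,
-- so there is nothing to compare against the threshold
SELECT non_negative_difference(last("metrics_written")) AS "Points Written"
FROM "internal_write"
WHERE time >= now() - 10m
GROUP BY time(10m), "host" fill(0)

-- BETTER: the wider window keeps the series alive, and fill(0) makes the
-- silent interval show up as 0 instead of going missing entirely
SELECT non_negative_difference(last("metrics_written")) AS "Points Written"
FROM "internal_write"
WHERE time >= now() - 20m
GROUP BY time(10m), "host" fill(0)
```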
I suggest you use a much larger multiple of your alerting window. In my case I check every 6h but fetch the latest 96h of data, meaning my alert will keep firing for up to 16 executions before being “OK” again, because only then does the “key” fall completely out of range.
Hope this helps