Notifications 'sent' column is False: why?

Hi there,

We're using Influx 2 rc0 and trying to use alerts and notifications on a dataset receiving high-ish volumes of data (~200 records/s), and we are observing that a certain number of notifications are missed by the system. On the Alerts / Notification endpoints page in the UI there is a warning against every row displayed, and the underlying data query returns _sent=False.

It's clear that not all notifications are failing, but some are. How can we determine what 'false' actually means? Is there a log of individual notification executions, and what does that page really represent?
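
For reference, this is roughly what we see when we pull the notification history by hand (a sketch rather than the exact UI query; the _notification_rule_name column and the _message field are guesses from what the UI displays, and _sent appears to be a string tag, hence the comparison against "false"):

// Sketch: list recent notification attempts that were not marked as sent.
// The _monitoring bucket, notifications measurement and _sent come from what
// the UI shows; _message and _notification_rule_name are assumptions.
from(bucket: "_monitoring")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "notifications" and r._field == "_message")
  |> filter(fn: (r) => r._sent == "false")
  |> keep(columns: ["_time", "_check_name", "_notification_rule_name", "_sent", "_value"])
  |> group()
  |> sort(columns: ["_time"], desc: true)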

We have other questions about how / whether you guys are using alerts at high volumes (e.g. the web interface freezes under load), but I'll leave those for later.

thanks for any pointers

-ivan

Hello @ivanpricewaycom,
You might find this template for monitoring tasks useful. task__summary_dashboard.txt (19.8 KB)

Can you please share the exact log you’re seeing?
In the alerts you can view the alert history as well as task runs.
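Regarding a log of individual notification executions: each notification attempt is written to the notifications measurement in the _monitoring bucket, so you can also query it directly instead of going through the UI. As far as I understand, _sent is set to "true" when the call to the notification endpoint succeeds, so "false" rows are notifications that were generated but not successfully delivered. A rough sketch (the _message field name is an assumption on my part) that counts sent vs. unsent notifications per rule:

// Sketch: count notification attempts grouped by rule and _sent status.
from(bucket: "_monitoring")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "notifications" and r._field == "_message")
  |> group(columns: ["_notification_rule_name", "_sent"])
  |> count()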

Does this help at all?

Thanks very much for your response @Anaisdg. I installed the task summary dashboard and all of the graphs display "queue length exceeded".

When I try to view the history for the Alerts / checks, the query runs indefinitely (2 minutes and counting…).

It leads me to wonder whether the alert system is designed to handle the volume of data we are generating (as I said previously, > 200 observations per second)… the UI is almost unusable because the default time slices are too large.

Interestingly, when I run the query that the logs page executes:

from(bucket: "_monitoring")
  |> range(start: -1d, stop: 1603111941)
  |> filter(fn: (r) => r._measurement == "statuses" and r._field == "_message")
  |> filter(fn: (r) => exists r._check_id and exists r._value and exists r._check_name and exists r._level)
  |> keep(columns: ["_time", "_value", "_check_id", "_check_name", "_level"])
  |> rename(columns: {"_time": "time",
                      "_value": "message",
                      "_check_id": "checkID",
                      "_check_name": "checkName",
                      "_level": "level"})
  |> group()
  |> filter(fn: (r) => r["checkID"] == "0664ca60e739b000")
  |> sort(columns: ["time"], desc: true)
  |> limit(n: 10, offset: 0)

I see the same thing: the server doesn't manage to return the last 10 statuses. Is anybody using the alerts system on very large datasets?
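
One thing I notice is that the checkID filter only runs after the keep/rename/group(), so (if I understand Flux pushdown correctly) the storage engine has to read every status in the range before anything is discarded. Filtering on the raw _check_id tag up front, along these lines, might give it a chance to prune earlier:

// Sketch: the same result as the UI query (I believe), but with the check
// filter applied before the columns are reshaped, so it can reach storage.
from(bucket: "_monitoring")
  |> range(start: -1d, stop: 1603111941)
  |> filter(fn: (r) => r._measurement == "statuses" and r._field == "_message")
  |> filter(fn: (r) => r._check_id == "0664ca60e739b000")
  |> filter(fn: (r) => exists r._value and exists r._check_name and exists r._level)
  |> keep(columns: ["_time", "_value", "_check_id", "_check_name", "_level"])
  |> rename(columns: {"_time": "time",
                      "_value": "message",
                      "_check_id": "checkID",
                      "_check_name": "checkName",
                      "_level": "level"})
  |> group()
  |> sort(columns: ["time"], desc: true)
  |> limit(n: 10, offset: 0)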

For info, the (dev/test) machine is a VM with 64 GB of RAM and 12 CPUs allocated, swap is 0… Influx 2 version rc0.

If it would help, I could probably write a Docker/Python scenario to replicate this on your end; I don't think we're doing anything exotic.

thanks again for your help.

-ivan

Just to follow up: reducing the range to -1h at least returns results, but unfortunately that value is hard-coded in the web GUI, which renders it unusable for us.