Notifications 'sent' column is False: why?

Hi there,

We’re using InfluxDB 2 RC0. We are trying to use alerts and notifications on a dataset receiving fairly high volumes of data (~200 records/s) and are observing a certain number of notifications missed by the system. Looking at the Alerts / Notification endpoints page in the UI, there is a warning against every row displayed, and the underlying data query returns _sent=False.

It’s clear that not all notifications are failing, but some are. How can we determine what ‘false’ actually means? Is there a log of individual notification executions? What does that page actually mean?

We have other questions about how / whether you folks are using alerts on high volumes (e.g. the web interface freezes under load), but I’ll leave those for later.

thanks for any pointers

-ivan

Hello @ivanpricewaycom,
You might find this template for monitoring tasks useful. task__summary_dashboard.txt (19.8 KB)

Can you please share the exact log you’re seeing?
In the alerts you can view the alert history as well as task runs.
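
If the UI itself is slow for you, the raw notification results are also queryable straight from the _monitoring system bucket. Here is a rough sketch (it assumes the default monitoring schema, where notification attempts are written to the "notifications" measurement with a _sent field; adjust the range and columns to your setup):

// last 20 notification attempts and whether each was marked as sent
from(bucket: "_monitoring")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "notifications" and r._field == "_sent")
  |> keep(columns: ["_time", "_value", "_notification_rule_name"])
  |> group()
  |> sort(columns: ["_time"], desc: true)
  |> limit(n: 20)

The _value column should show whether each individual notification was recorded as sent.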

Does this help at all?

Thanks very much for your response @Anaisdg. I installed the task summary dashboard and all the graphs display ‘queue length exceeded’.

When I try to view the history for the Alerts / checks it runs indefinitely (2 mins and counting…).

It leads me to wonder whether the alert system is designed to handle the volume of data we are generating (as I said previously, > 200 records per second)… the UI is almost unusable because the default time slices are too large.

Interestingly when I execute the query that the logs page executes:

from(bucket: "_monitoring")
  |> range(start: -1d, stop: 1603111941)
  |> filter(fn: (r) => r._measurement == "statuses" and r._field == "_message")
  |> filter(fn: (r) => exists r._check_id and exists r._value and exists r._check_name and exists r._level)
  |> keep(columns: ["_time", "_value", "_check_id", "_check_name", "_level"])
  |> rename(columns: {"_time": "time",
                      "_value": "message",
                      "_check_id": "checkID",
                      "_check_name": "checkName",
                      "_level": "level"})
  |> group()
  |> filter(fn: (r) => r["checkID"] == "0664ca60e739b000")
  |> sort(columns: ["time"], desc: true)
  |> limit(n: 10, offset: 0)

I see the same thing: the server does not manage to return the last 10 statuses. Is anybody using the alerts system on very large datasets?

For info, the (dev/test) machine is a VM with 64 GB of RAM and 12 CPUs allocated, swap is 0… InfluxDB 2 RC0.

If it would help I could probably write a Docker/Python scenario to replicate this on your end; I don’t think we’re doing anything exotic.

thanks again for your help.

-ivan

Just to follow up: reducing the range to -1h at least returns results, but unfortunately that value is hard-coded in the web GUI, which renders it unusable for us.
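
For anyone hitting the same wall, the manual workaround is just the query above with the range narrowed, along these lines (a sketch: the checkID filter is moved up and the rename the UI adds is dropped):

// last 10 statuses for one check, over the last hour only
from(bucket: "_monitoring")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "statuses" and r._field == "_message")
  |> filter(fn: (r) => r._check_id == "0664ca60e739b000")
  |> keep(columns: ["_time", "_value", "_check_name", "_level"])
  |> group()
  |> sort(columns: ["_time"], desc: true)
  |> limit(n: 10)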

To continue my follow-up: I increased the --query-queue-size parameter to avoid the queue-length problem, so now I can use the dashboard that @Anaisdg provided.

The dashboard is useful because it gives us the queries we need to debug the tasks system a bit, which doesn’t seem to be documented anywhere (that I can find). Using it I can identify certain tasks that have failed, along with their taskID, but what can I do with this information? There seems to be no log of individual task runs or of why they ‘failed’.

Also, the notifications log continues to show 100% ‘not sent’, despite the vast majority of notifications actually being sent.

I really feel like we’re flying blind with the tasks system: it seems powerful in potential, but in production / reality it remains fairly opaque.

-i

Hi Ivan,

The ‘cell error list’ should show you the errors encountered by tasks over the selected time range. Notification rules will show up in that list as well; you can search for them by notification rule name.

We have internally acknowledged that tasks suffer from poor visibility around how our automation works and what the state of that automation is at any given moment. We have a major quarterly goal to improve this situation for the benefit of users. I’d be happy to listen to your feedback/opinions if you have specific ideas, and take them back to my team for discussion and design.

-Adam

Hi there,

The task summary dashboard enabled us to determine how long our task needs to run: in our case ~27-30s. This led us to increase the check interval from 30s to 60s, which has ‘solved’ our problem for now: we no longer record any missed status changes.
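
For reference, the change itself is tiny: it is just the schedule interval on the check’s underlying task, roughly the equivalent of this in the generated task options (the task name here is made up):

// schedule the check every 60s instead of every 30s
option task = {name: "my_check", every: 1m, offset: 0s}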

Regarding feedback on the tasks system, I would note a couple of things:

Alerts tab:

  1. The UI hangs on large datasets: opening the ‘Checks’ page launches database queries to determine the measurement keys for the entire bucket, which in our case is very slow and makes the page difficult to use. Memory instantly jumps through the roof (a VM with 64 GB of RAM starts allocating > 10 GB of swap).
    (While writing this comment I managed to trigger a fatal error: runtime: out of memory!)

  2. Idem for the alerts ‘log’: the page times out before the query responds.

Notifications tab:
  3. The notifications log shows a warning triangle next to (seemingly all) notifications that were sent, despite what is, as far as we know, a 100% send rate. It’s impossible to understand what the warning triangle actually means.


As I mentioned above, using the Tasks Summary dashboard (which IMHO should be included by default), we’re able to see tasks that are marked as errored, but we’re unable to grep a log or write a query to determine what actually happened. In our case it seems the system was too busy to perform all the necessary processing and hence dropped tasks/notifications ‘silently’.

May I suggest either a ‘verbose logging’ mode exposing the individual task / check lifecycle, or optionally writing that information to a stats table for debugging.

Thanks again for your response. The alerts system is indeed powerful, and we look forward to seeing it evolve with Influx 2.

-i