IoT Dynamic Alerting Feature Design

influxdb
kapacitor
#1

Hi all, I’m new to the Influx community and still getting familiar with how things work here. If my post duplicates an existing one, please let me know.

My colleague and I went to Paul Dix’s time series meetup last night at Wayfair Boston. After the talk, we chatted with Paul, Ryan and Noah about how we want to use Influx to build our new monitoring solution.

Background
To summarize, our IoT devices collect data from all kinds of sensors, forming various sensor streams. The client is responsible for setting alerts and the groups of recipients who will receive them. All sensor data will, naturally, be saved in InfluxDB, but we are not sure what the best design for alert evaluation is. Currently, we are considering two designs using InfluxDB and Kapacitor:

Design 1 - a self-contained solution
We would save everything in InfluxDB: sensor data, alert settings and recipient settings. Strictly speaking, alerts and recipients are not time series data, but whenever they are created, deleted or modified we can attach a timestamp and save them as data points, as if they were “time series” data. Conceptually, we just need to join the sensor, alert and recipient measurements, and with some magic UDFs we should be able to route each configured alert to a specific group of recipients. We would only need one TICKScript for all use cases. Here are the pros and cons:

Pros:

  1. Only need one TICKScript. Very easy to manage and maintain.
  2. Self-contained solution
  3. Whenever alerts or users are modified, we just need to create a datapoint in alerts or users measurement. No new TICKScript needed.

Cons:

  1. The biggest concern is feasibility. Is it the right way to use InfluxDB and Kapacitor?
  2. The TICKScript and UDFs can be very complicated and not easy to debug.

Design 2 - Dynamically generate TICKScripts
Let’s say we will be storing alert settings and recipients in a relational database - Postgres. And we will have a service that generates TICKScript using the data stored in Postgres and use Kapacitor HTTP API to create tasks. Here are the pros and cons:

Pros:

  1. Conceptually, the logic is easier to understand
  2. We can use TICKScript template variables when generating TICKScripts
  3. Each TICKScript will be static and tailored to each device’s sensor data. Easy to debug.

Cons:

  1. There will probably be thousands of tasks created on Kapacitor. Is that normal? Also, whenever the user updates an alert, we have to find the corresponding task and modify it.
  2. We need to implement and maintain the TICKScript generation service.
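
To make Design 2 concrete, here is a minimal sketch of what the generation service might do for a single alert row. Everything specific in it is an assumption for illustration: the row layout, the `device_id` tag, the `value` field, the `sensors`/`autogen` database and retention policy, and the log path. The payload fields follow the Kapacitor HTTP API for creating tasks (`POST /kapacitor/v1/tasks`) as I understand it, so please check them against the Kapacitor docs. Deriving the task ID from the alert ID is one possible way to find the task again when the alert is updated.

```python
import json

# Hypothetical TICKscript skeleton; measurement, tag and field names are assumptions.
TICKSCRIPT = """stream
    |from()
        .measurement('{measurement}')
        .where(lambda: "device_id" == '{device_id}')
    |alert()
        .crit(lambda: "value" > {threshold})
        .log('/tmp/alerts.log')
"""


def task_id(alert_id):
    """Deterministic task ID, so an alert update can locate its task."""
    return f"alert-{alert_id}"


def build_task(alert, db="sensors", rp="autogen"):
    """Render the script for one alert row and wrap it in a task payload.

    Note: values are interpolated without escaping, so a real service
    would need to validate them first.
    """
    script = TICKSCRIPT.format(
        measurement=alert["measurement"],
        device_id=alert["device_id"],
        threshold=float(alert["threshold"]),
    )
    return {
        "id": task_id(alert["id"]),
        "type": "stream",
        "dbrps": [{"db": db, "rp": rp}],
        "script": script,
        "status": "enabled",
    }


# Example row as it might come back from Postgres (hypothetical schema).
alert_row = {"id": 17, "device_id": "dev-1", "measurement": "temperature", "threshold": 30}
payload = build_task(alert_row)
print(json.dumps(payload, indent=2))
```

The service would then send `payload` to Kapacitor over HTTP; that step is left out here so the sketch runs without a server.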

I imagine dynamic alert configuration is very common among IoT solutions. I’m curious how people solve this problem creatively using the TICK stack. Thanks in advance!

#2

Hey, thanks for posting your question here! Nice to meet you last night.

Take a look at Kapacitor’s Template Tasks. They allow you to define a TICKscript once as a template and load its variables from a JSON file. When a user makes changes, you can update the variable file and redefine the task.
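
As a rough illustration of the variable file, here is a sketch that builds the per-alert variables and a task definition that points at a template. The `type`/`value` shape of each variable follows the Kapacitor template documentation as I understand it; the template ID `sensor_threshold`, the variable names and the alert row layout are all made up for this example.

```python
import json


def build_vars(alert):
    """Variables for one alert, in the {name: {type, value}} shape used by template vars."""
    return {
        "measurement": {"type": "string", "value": alert["measurement"]},
        "device_id": {"type": "string", "value": alert["device_id"]},
        "threshold": {"type": "float", "value": float(alert["threshold"])},
    }


def build_template_task(alert, template_id="sensor_threshold", db="sensors", rp="autogen"):
    """Task definition that references a template instead of carrying its own script."""
    return {
        "id": f"alert-{alert['id']}",
        "template_id": template_id,  # hypothetical template name
        "dbrps": [{"db": db, "rp": rp}],
        "vars": build_vars(alert),
        "status": "enabled",
    }


# Hypothetical alert row; changing the threshold only changes the vars.
alert_row = {"id": 17, "device_id": "dev-1", "measurement": "temperature", "threshold": 30}
task = build_template_task(alert_row)
vars_file = json.dumps(task["vars"], indent=2)  # contents of a vars JSON file
print(vars_file)
```

When the user edits an alert, only the `vars` portion changes, so the template itself never needs to be regenerated.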

This will also avoid having to query for additional data when you’re processing the alerts, which seems like it would cause problems with Design 1: the number of queries you are performing would scale linearly with the number of points you are ingesting.

#3

Hey Noah,

We were actually originally considering the design that you mention (i.e. storing the alert parameters in the tasks themselves and then updating the tasks when the parameters change), but wondered if we would have problems from having so many tasks active at the same time. In our use case we would be monitoring tens of thousands of sensor feeds, each of which might have its own custom alert thresholds. Would it be feasible to run that many tasks simultaneously?

#4

Hi @noahcrowley,

Thanks for the reply! I have tried Template Tasks and I think they fit our purpose. Now that we can create tasks from a template, we will have more than 1k tasks running on Kapacitor. As my colleague @eremzeit mentioned, we are not sure how many tasks Kapacitor can handle at the same time. If you can provide some guidance, that would be awesome.

In addition, I’ve tried the InfluxCloud Standard I option, but I realized it doesn’t come with Kapacitor. I can’t find any performance metrics for Kapacitor. Is that something InfluxData can provide? It seems someone asked this question before but hasn’t received a reply yet. Thanks!

How many alerts can Kapacitor scale to?
#5

Hey @eremzeit and @wkopen,

There is an instance of Kapacitor included with the InfluxCloud Standard option. It should already be connected in Chronograf.

It’s difficult to answer questions about the performance of Kapacitor because the workloads are user-defined. The amount of computation in a task, the frequency and size of the data, and whether it is a batch or stream operation all have an impact on the resources consumed.

After talking to one of my colleagues, he suggested an approach that is more similar to Design 1. As I described above, the scalability of Kapacitor depends on the workload, so large numbers of “where” nodes could cause performance issues. As a result, he recommended determining what the set of user-defined variables would be, and then injecting that data into a single task with logic to process the data against those variables.
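
To show the idea in plain code (this is not TICKscript and not a Kapacitor API), here is a sketch of the per-point logic such a single task would apply: look up the user-defined rule for the point’s device and sensor, then compare. The rule layout, key names and recipient addresses are assumptions for illustration.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    threshold: float
    recipients: tuple


# Injected user-defined variables, keyed by (device, sensor). Hypothetical data.
RULES = {
    ("dev-1", "temperature"): Rule(30.0, ("ops@example.com",)),
    ("dev-2", "humidity"): Rule(80.0, ("facilities@example.com", "ops@example.com")),
}


def evaluate(point: dict, rules: dict) -> Optional[dict]:
    """Return an alert for the point, or None if no rule matches or it is within limits."""
    rule = rules.get((point["device_id"], point["sensor"]))
    if rule is None or point["value"] <= rule.threshold:
        return None
    return {
        "device_id": point["device_id"],
        "sensor": point["sensor"],
        "value": point["value"],
        "recipients": list(rule.recipients),
    }


hot = evaluate({"device_id": "dev-1", "sensor": "temperature", "value": 31.5}, RULES)
ok = evaluate({"device_id": "dev-1", "sensor": "temperature", "value": 25.0}, RULES)
unknown = evaluate({"device_id": "dev-9", "sensor": "temperature", "value": 99.0}, RULES)
print(hot, ok, unknown)
```

With this shape, the per-point cost is one dictionary lookup regardless of how many rules exist, rather than one filter per alert.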

There are a variety of ways to inject data into the task: load it directly into Kapacitor using the sideload node, send writes to InfluxDB, use a Kapacitor UDF as you had suggested, or include thresholds from the source via Telegraf. Which of those works best for you will depend on your architecture and what’s easiest to manage.

If you are going to use a UDF to query for data, it makes sense to keep a local cache so you don’t have to continuously make network requests.
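
A local cache like that could be as simple as the sketch below: a small time-to-live cache wrapped around whatever function fetches the thresholds. The `fetch` callable, the 60 second TTL and the key format are assumptions; the clock is injectable only so the example can be run without waiting.

```python
import time


class TTLCache:
    """Caches fetch(key) results for ttl_seconds to avoid repeated network requests."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh, no fetch needed
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value


# Demo with a fake fetch and a fake clock (both stand-ins for the real thing).
calls = []


def fake_fetch(key):
    calls.append(key)
    return 30.0


now = [0.0]
cache = TTLCache(fake_fetch, ttl_seconds=60.0, clock=lambda: now[0])
cache.get("dev-1")
cache.get("dev-1")   # served from the cache
now[0] = 61.0        # entry has expired
cache.get("dev-1")   # fetched again
print(len(calls))
```

The TTL is the tradeoff to tune: a longer value means fewer requests but a longer delay before a changed threshold takes effect.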

There are always going to be tradeoffs between various approaches, and an approach that works for one user might not be ideal for another. Best recommendation? Start building and benchmarking some prototypes and see what works best for your workload.