Best monitoring strategy for production

I’ve been using InfluxDB 1.x on my production and dev platforms for 2–3 years. For the last two months I’ve been testing InfluxDB 2.x and I’m quite happy with it, but I’m not sure what the best approach is for a production-grade system that monitors metrics (via Telegraf) from about 100–200 servers.

  • Checks:
    According to all the documentation I’ve seen, there are only two types of checks: UI-generated (which are very limited) and Flux-based tasks (which are far more flexible). Unless I require a very trivial check, my only option is to use Flux-based tasks, right?
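For reference, this is roughly the kind of Flux-based check task I mean (the bucket name, thresholds, and check name/ID below are placeholders for illustration):

```flux
import "influxdata/influxdb/monitor"
import "influxdata/influxdb/schema"

option task = {name: "cpu-check", every: 1m}

from(bucket: "telegraf")
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
    // pivot fields into columns so the predicates can reference them by name
    |> schema.fieldsAsCols()
    |> monitor.check(
        crit: (r) => r.usage_idle < 5.0,
        warn: (r) => r.usage_idle < 10.0,
        ok: (r) => r.usage_idle >= 10.0,
        messageFn: (r) => "CPU idle on ${r.host} is ${string(v: r.usage_idle)}%",
        data: {_check_name: "cpu-check", _check_id: "cpucheck00000001", _type: "threshold", tags: {}},
    )
```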

  • Notifications:
    As far as I know, there are only two types of notifications: UI-generated (again, very limited) and Flux-based notifications (which seem to be far more flexible). There is very little documentation and few examples about creating notification endpoints (I only know the http/pagerduty/slack.endpoint() functions). And there’s even less about how to link those endpoints to other tasks, which makes me assume the notification endpoints we create in a task are only available within that task and can’t be used from other tasks. Repeating the same notification logic in each task seems very inefficient and doesn’t seem to scale to scenarios with hundreds of alerts. Is it safe to assume the best approach is to use UI-based notification endpoints and rules?
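To illustrate what I mean by repeating the notification logic: today every task that wants to notify has to define its endpoint inline, roughly like this (the URL and the rule/endpoint IDs are placeholders):

```flux
import "influxdata/influxdb/monitor"
import "http"
import "json"

option task = {name: "notify-crit", every: 1m}

// this endpoint definition only exists inside this task script
endpoint = http.endpoint(url: "https://example.com/alerts")

monitor.from(start: -task.every, fn: (r) => r._level == "crit")
    |> monitor.notify(
        data: {
            _notification_rule_id: "0000000000000001",
            _notification_rule_name: "crit-to-http",
            _notification_endpoint_id: "0000000000000002",
            _notification_endpoint_name: "http",
        },
        endpoint: endpoint(mapFn: (r) => ({
            headers: {"Content-Type": "application/json"},
            data: json.encode(v: r),
        })),
    )
```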

  • Check&monitor tasks vs check tasks + monitor task
    All the tasks I’m currently using are “autonomous”: each one does the checking and calls monitor.check(). I’m wondering whether it is better to limit the tasks to doing just the checking and dumping the output to a bucket, and then have a single monitor task reading from that bucket to trigger the notifications.

  • Monitoring previous statuses
    Is there any way to configure a check+monitor task that keeps the previous level (status) when undefined, instead of going to “UNKNOWN” (which I find useless)? Yes, I know that if I remove the “ok” condition on monitor.check() all undefined checks will be tagged with level “ok”, but that’s not what I mean; I’m talking about keeping the previous level. I know I can do that by reading the “statuses” measurement in the “_monitoring” bucket and then joining that table with the check data, but having to do a join in every one of my tasks seems like overkill. Is there an easier way? With InfluxDB 1.x it was much easier, because if no condition matched, alertNode didn’t modify the level and therefore kept the previous one. Any clue about how to do that with InfluxDB 2.x?
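For the record, the join-based workaround I’m describing looks roughly like this sketch (check name, bucket, and thresholds are placeholders), which is exactly why it feels like overkill to repeat per task:

```flux
import "influxdata/influxdb/monitor"

option task = {name: "keep-prev-level", every: 1m}

// last known level per host, from the statuses measurement in _monitoring
prev =
    monitor.from(start: -1h, fn: (r) => r._check_name == "cpu-check")
        |> group(columns: ["host"])
        |> sort(columns: ["_time"])
        |> last(column: "_time")
        |> keep(columns: ["host", "_level"])
        |> rename(columns: {_level: "prev_level"})

// current check data, classified as "undefined" when no condition matches
curr =
    from(bucket: "telegraf")
        |> range(start: -task.every)
        |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
        |> mean()
        |> map(fn: (r) => ({r with level:
            if r._value < 5.0 then "crit"
            else if r._value < 10.0 then "warn"
            else "undefined"}))
        |> keep(columns: ["host", "_value", "level"])

// fall back to the previous level where the current one is undefined
join(tables: {prev: prev, curr: curr}, on: ["host"])
    |> map(fn: (r) => ({r with level: if r.level == "undefined" then r.prev_level else r.level}))
```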

Please, don’t send me links to tutorials about how to configure a trivial alert. I perfectly know how to set up a task alert (I’m currently using dozens of them).


Hello @vvilaplana,
Thanks for your detailed question. It helps give a better answer. I’ll do the best I can.

  • Checks
    Yes, I would recommend using Flux-based tasks. You can use Kapacitor with InfluxDB 2.x, but I wouldn’t recommend going down that path, as it mostly exists to help 1.x users migrate to 2.x.

  • Notifications
    Without knowing specifics about the types of checks and alerts you want set up and the action you want to take after, I can only answer this question superficially. But my recommendation is to think about the following:
    – are there ways/instances where you can consolidate information from multiple data transformation tasks into a new bucket (with the to() function)? If so, then you can more easily apply one check to multiple series in a subsequent task.
    – similarly/alternatively can you consolidate all data that meets your check requirements into one bucket with the to() function? If so then you can create one alert on all your data simultaneously.
    I don’t think it’s safe to assume that the best approach is to use UI-based notification endpoints and rules, since the UI uses the same API under the hood. What about your data pipeline makes you think this? I would imagine that if you can use the UI to consolidate your notification logic, you can find a way to do the same with Flux. More detail here would be helpful.

  • Check&monitor tasks vs check tasks + monitor task
    Yes, you can do just the checking with conditional querying and then write that data to a new bucket with to(). Nice thinking! We’re on the same page here (my questions above feel redundant now; your thinking is spot on!)
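A minimal sketch of that split (bucket names, thresholds, and the tagColumns choice are made up; the two parts would be two separate task scripts, shown together here for brevity):

```flux
// Task A: check only — classify with conditional logic and write to a shared bucket
option task = {name: "disk-check-only", every: 1m}

from(bucket: "telegraf")
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "disk" and r._field == "used_percent")
    |> aggregateWindow(every: 1m, fn: mean, createEmpty: false)
    |> map(fn: (r) => ({r with level:
        if r._value > 90.0 then "crit"
        else if r._value > 80.0 then "warn"
        else "ok"}))
    |> to(bucket: "check_results", tagColumns: ["host", "level"])

// Task B (separate script): the single monitor task reads everything back
// from(bucket: "check_results")
//     |> range(start: -task.every)
//     |> filter(fn: (r) => r.level == "crit")
//     |> ... hand the rows to your notification endpoint of choice
```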

  • Monitoring previous statuses
    Are you familiar with monitor.stateChanges() function | InfluxDB OSS 2.0 Documentation?
    Last tutorial, I promise but this could be useful: TL;DR InfluxDB Tech Tips – How to Monitor States with InfluxDB | InfluxData
    If these two resources don’t help, please let me know. I appreciate your willingness to take a look in advance.
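As a quick sketch of what that looks like on the statuses stream (the check name is a placeholder):

```flux
import "influxdata/influxdb/monitor"

// only keep rows where a series transitioned into "crit"
monitor.from(start: -1h, fn: (r) => r._check_name == "cpu-check")
    |> monitor.stateChanges(toLevel: "crit")
```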

Again, thanks for your detailed question; I’m sure this will help future users. Please let me know if you have any more questions or points I can clarify.

Since the monitor.xxxxx() functions are very strict, the only solution seems to be to go low-level and directly write to/parse the _monitoring bucket. But quite frankly, I don’t think that’s going to be very performant on large-scale platforms. Feel free to correct me if I’m wrong.

Your post (InfluxDB Tech Tips | Using Tasks & Checks for Monitoring with InfluxDB) explains the monitoring workflow process and what monitor.check() internally does, but it doesn’t describe how to parse _monitoring.statuses.

Can you provide a working example (InfluxDB 2.0.4) of a trivial check task writing to the _monitoring bucket, plus a monitor/alert task that processes the _monitoring bucket? I’m sure that’s what’s going to help a lot of people.

I’ve been testing stateChangesOnly(), but I couldn’t make it work :frowning: I haven’t seen a single alert task using it. All the examples I found just show partial lines of stateChangesOnly(). The most “detailed” place I’ve found about that function is here: stateChangesOnly() does not work as expected - #17 by Anaisdg
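For reference, this is the kind of end-to-end alert task I’ve been trying to build around it (the Slack URL, look-back window, and rule/endpoint IDs are placeholders; note the range has to reach back far enough to include the previous status, or there is nothing to compare against):

```flux
import "influxdata/influxdb/monitor"
import "slack"

option task = {name: "alert-on-state-change", every: 1m}

endpoint = slack.endpoint(url: "https://hooks.slack.com/services/XXX")

monitor.from(start: -2h)
    |> monitor.stateChangesOnly()
    |> filter(fn: (r) => r._level == "crit" or r._level == "ok")
    |> monitor.notify(
        data: {
            _notification_rule_id: "0000000000000003",
            _notification_rule_name: "state-change-to-slack",
            _notification_endpoint_id: "0000000000000004",
            _notification_endpoint_name: "slack",
        },
        endpoint: endpoint(mapFn: (r) => ({
            channel: "",
            text: "${r._check_name} changed to ${r._level}",
            color: if r._level == "crit" then "danger" else "good",
        })),
    )
```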

I don’t want to steer you in the wrong direction. Let me see if I can get some help from the Flux team. Your questions have been great.

It does, but only a small bit.
The example is:

Check: ${r._check_name} is: ${string(v: r._battery_level)}
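Presumably that string is meant for the messageFn parameter of monitor.check(); in context it would look something like this sketch (the bucket, thresholds, check name/ID, and the _battery_level field are assumptions based on the example):

```flux
import "influxdata/influxdb/monitor"
import "influxdata/influxdb/schema"

from(bucket: "telegraf")
    |> range(start: -1m)
    |> filter(fn: (r) => r._measurement == "battery")
    // pivot so _battery_level becomes a column messageFn can reference
    |> schema.fieldsAsCols()
    |> monitor.check(
        crit: (r) => r._battery_level < 10.0,
        messageFn: (r) => "Check: ${r._check_name} is: ${string(v: r._battery_level)}",
        data: {_check_name: "battery", _check_id: "battery000000001", _type: "threshold", tags: {}},
    )
```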