I’m trying to design an alert that tells me when cron activity deviates from normal. I know this sounds super vague, so let me explain how my system works:
We have numerous clients that we receive FTP files from on different schedules: some daily, some multiple times per day. We collect things like number of files downloaded and time to process the files.
I want an alert that can tell me things like:
Normally we receive 50 files at 8am from client X every day, but it’s 10am and we haven’t received any files.
Normally we receive 10 files at 10am, 12pm, and 2pm, but it’s 4pm and we’ve only seen the 10am and 12pm files.
The struggle I’m having is that it’s not a constant stream of data: for most of the day we receive no files, then at a very specific time (different for every client) we download files. So I’m struggling to see how this can work with something like deadman, because a count of 0 without any context is not anomalous; a count of 0 after 10am, however, is anomalous for certain clients.
All the ideas I’ve come up with so far involve aggregating the entire day’s worth of data and using sigma / stddev to calculate whether it’s less or more than normal. The problem with this approach is that we don’t find out for, at worst, an entire day, which is way too late for us to take action before our client notices.
My goal is to develop an entire series of this style of alert: have we missed any scheduled FTP downloads? have we failed to process as many files as we usually do? is the processing taking longer or shorter than normal?
Maybe these are all slightly different. But the real alert I’m trying to focus on is to be notified when a client’s files have not been processed when usually they have. I fully believe Kapacitor has the capability to do what I want, but I’m drawing a blank on how to design the alert.
I would start with using historical data for a baseline. Something like every hour select the number of files downloaded in the current hour and the previous day at the same time and the previous week at the same time. Then compare the current value to both historical values.
// .offset() only moves the query window back in time; the points keep their
// historical timestamps, so shift them forward to line up with the current batch.
var week = batch
    |query('SELECT num_files ...')
        .period(1h)
        .offset(7d)
        .every(1h)
    |shift(7d)
var day = batch
    |query('SELECT num_files ...')
        .period(1h)
        .offset(1d)
        .every(1h)
    |shift(1d)
var current = batch
    |query('SELECT num_files ...')
        .period(1h)
        .every(1h)
current
    |join(week, day)
        .as('current', 'week', 'day')
    |alert()
        .crit(lambda: "current.num_files" < ("week.num_files" + "day.num_files") / 2.0)
The above alerts if the current value is less than the average of the two historical values.
You could even extend this to track typical counts for each hour of the day. One task would write the number of files downloaded each hour to a database, tagged with the hour of day (0-23). A second task would then compare the current hour to the average of the past n days of data for the same hour tag.
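A rough sketch of what the first (recording) task might look like. The database 'mydb', the retention policy 'autogen', and the source measurement 'ftp_downloads' are placeholders I made up, so substitute your own. The string() call is there on the assumption that the result of an eval converted to a tag has to be a string.

```
// Task 1 (sketch): once an hour, sum the files downloaded in the last hour
// and write the total back to InfluxDB, tagged with the hour of day.
batch
    |query('SELECT sum(num_files) AS num_files FROM "mydb"."autogen"."ftp_downloads"')
        .period(1h)
        .every(1h)
        // Align query windows to hour boundaries.
        .align()
    // Derive the hour of day (0-23) from the point time and store it as a tag.
    |eval(lambda: string(hour("time")))
        .as('hour')
        .tags('hour')
        .keep('num_files')
    |influxDBOut()
        .database('mydb')
        .retentionPolicy('autogen')
        .measurement('historical_num_files_by_hour')
```

Writing the hourly totals to their own measurement keeps the comparison task cheap, since it only has to average a handful of pre-aggregated points per hour tag.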
The second task would get the current hour and compare it to the historical average:
var historical = batch
    |query('SELECT num_files FROM historical_num_files_by_hour ...')
        .period(7d)
        .every(1h)
        .groupBy('hour')
    |mean('num_files')
var current = batch
    |query('SELECT num_files ...')
        .period(1h)
        .every(1h)
    // Tag values must be strings, so convert the hour before tagging.
    |eval(lambda: string(hour("time")))
        .as('hour')
        .tags('hour')
        .keep('num_files')
    |last('num_files')
        .as('num_files')
// Note: technically we are joining against all 24 hours of historical data here,
// but the join will simply drop the historical points that do not match up with
// the current data, since no .fill operation is specified.
current
    |join(historical)
        .as('current', 'historical')
    |alert()
        .crit(lambda: "current.num_files" < "historical.mean")
There are most definitely typos in the above but that should be a good starting point.
I like the strategy of storing by hour and querying on that later. I hadn’t considered keeping a separate precision within the same retention policy, but it seems perfectly doable. I’ll work on this and let you know how it goes.
Thanks for your detailed response and also all your work on Kapacitor, it’s truly an awesome product!
@nathaniel That’s a great example!
However, with some of my more complex queries, the .join operation doesn’t seem to work (since the timestamps are not identical).
I added .tolerance() and the query works, but as soon as I add the offset, the join no longer returns any values.