Nested queries in Kapacitor

tsachi · June 21, 2017, 11:32am

I am trying to run the following nested query in Kapacitor, but get an error message :
failed to parse InfluxQL query: found (, expected identifier at line 2, char 39

var email_base = batch
    |query('''
        SELECT count(*) as "cnt" FROM (
            select payment_id, email_domain from "my_db"."autogen"."payment_attempts" GROUP BY payment_id
            )         
    ''')
		.period(14d)
        .every(10s)
        .groupBy(time(1d), 'email_domain')
        .offset(75d)
        .fill(0)
    |mean('cnt_amount')
        .as('amount_mean')

email_base
    |log()

My payment_attempt measurement may have multiple points with the same payment_id tag. I need the number of unique email domains supplied in the same payment_id.

I try to adapt this InfluxDB query to kapacitor:
SELECT count(*) FROM (select * from payment_attempts GROUP BY payment_id) where time >= end_time and time <= start_time GROUP BY email_domain, TIME(1d);

How can I rewrite my TICK script to support this functionality?

Thank you very much!
Tsachi.

tsachi · June 21, 2017, 2:16pm

I created the following TICK script, but I am not sure of it’s correctness. I tried to get an item from the log, but the numbers are not 100% matching my InfluxDB query

My new TICK script:

var email_base = batch
    |query('''
        SELECT payment_id, amount, email_domain FROM "my_db"."autogen"."payment_attempts"
    ''')
        .period(14d)
        .every(10s)
        .groupBy(time(1d), 'payment_id')
        .groupBy('email_domain')
        .offset(75d)
    |count('email_domain')
        .as('email_count')
    |mean('email_count')
        .as('v_mean')

nathaniel · June 22, 2017, 3:32pm

This part doesn’t look correct, the second call will override the first.

      .groupBy(time(1d), 'payment_id')
      .groupBy('email_domain')

You probably meant something like this:

var email_base = batch
    |query('''
        SELECT payment_id, amount, email_domain FROM "my_db"."autogen"."payment_attempts"
    ''')
        .period(14d)
        .every(10s)
        .groupBy(time(1d), 'payment_id')
        .offset(75d)
    |groupBy('email_domain')
    |count('email_domain')
        .as('email_count')
    |mean('email_count')
        .as('v_mean')

And as you have discovered Kapacitor doesn’t support sub queries yet.

tsachi · June 22, 2017, 3:48pm

Thank you @nathaniel! I ran the TICK script you suggested, but got the following error:
error parsing query: GROUP BY requires at least one aggregate function

tsachi · June 25, 2017, 8:08am

@nathaniel I don’t know how to fix the groupBy issue, can you take a look please?
Problem seems to be in this operation:

var email_base = batch
    |query('''
        SELECT payment_id, amount, email_domain FROM "my_db"."autogen"."payment_attempts"
    ''')
        .period(14d)
        .every(10s)
        .groupBy(time(1d), 'payment_id')
        .offset(75d)

Thank you!

nathaniel · June 26, 2017, 3:26pm

@tsachi The query is asking for 14 days of data grouped into 1d time buckets but you haven’t specified how to group the data in to 1d time buckets. Do you want to take the mean, min, first, last, etc?

tsachi · June 26, 2017, 4:23pm

Basically, I need to rewrite the InfluxDB query I wrote in the beginning of the post.
My need is to count unique email domains with the same payment_id.
And this part of the query reflects my wish to remove duplicates of the same payment_id, email_domain.

If there is a cleaner way to rewrite my InfluxDB script, it should also be OK.

Thank you @nathaniel!
Tsachi.

nathaniel · June 28, 2017, 3:39pm

If you want the number of unique email domains with the same payment_id is that not simply a distinct query?

SELECT COUNT(DISTINCT("email_domain")) FROM "my_db"."autogen"."payment_attempts" WHERE time > now() - 14 GROUP BY time(1d), payment_id

If so that would simplify the TICKscript as all you need is that one query without the sub query.

I am not quite sure how to restructure the original query as I am not quite sure what its doing.
Could you share some example data so I can better understand?

tsachi · July 4, 2017, 3:20pm

Thank you nathaniel.

How can I share the data with you?

Best regards,
Tsachi

jackzampolin · July 5, 2017, 4:58pm

@tsachi Posting it in a Github Gist is an easy way to share.

tsachi · July 6, 2017, 1:39pm

Hi @jackzampolin. Thank you for your help.
Here is a dummy gist reflecting the scenario of:
2 measurements for payment_id = 100, with different email domains
2 measurements for payment_id = 200, with the same email domain

I want to write a query that outputs 3 results for this data, one for payment_id = 200 and 2 for payment_id = 100

Does using distinct solve this need?

nathaniel · July 19, 2017, 4:05pm

If I understand correctly you want this data back:

payment_id=100 count=2
payment_id=200 count=1

This TICKscript should do that.

// Select raw data for each payment attempt grouped by payment_id
var email_base = batch
    |query('''
        SELECT value,email_domain FROM "my_db"."autogen"."payment_attempts"
    ''')
        .period(1d)
        .every(10s)
        .groupBy('payment_id')
    // Get just he distinct email domains per payment_id
    |distinct('email_domain')
    // Count the distinct email_domains per payment_id
    |count('distinct')
        .as('email_count')

noamm · July 25, 2017, 2:41pm

Hi Nathaniel,
I work with Tsachi. thank you for this!
This seems the right direction. However our use case requires that the day grouping performed later. We are trying to:

for each email_domain, count the number of distinct payments in which it appeared in each of the past 14 days.

i.e. we want to achieve the results that this query returns:

select count(email_domain) from (select first(payment_id),first(email_domain),count(total_amount) from payment_attempts where time >= now() - 14d group by payment_id,email_domain) where time >= now() - 14d group by email_domain,time(1d)

And I did not find how to do day grouping as in independent node, not as a query node property.
How can this be done?

Thanks,
Noam

nathaniel · July 25, 2017, 3:12pm

Check out the window node.

Then change the period property of the query node to be 14 days.

noamm · July 26, 2017, 2:28pm

Hi Nathaniel, window node seems to run only on a stream, while we do batch/query. Any bypass around that?

nathaniel · July 27, 2017, 8:24pm

Window node can work in both batch and stream tasks but it most consume a batch edge. (Sorry for the confusion of the overloaded terms).

Can you share the TICKscript that gave you an error when using the window node?

tsachi · August 3, 2017, 9:01am

Thank you Nathaniel.
For now I am using the following script to give the average of email_domains.

var email_benchmark = batch
    |query('''
        SELECT value, email_domain FROM "my_db"."autogen"."payment_attempts"
    ''')
        .period(14d)
        .every(5m)
        .groupBy('payment_id','email_domain')
    |groupBy('email_domain')
    |count('email_domain')
        .as('num_of_payments')
    |eval(lambda: float("num_of_payments")/14.0)
        .as('daily_average')

I think it works.

Are there any debugging tools for Kapacitor? They’ll help me run a sanity check on the output.

tsachi · August 13, 2017, 6:10pm

Hi again @nathaniel and @jackzampolin.
I am stuck with finalizing my alert and I hope you can provide the missing block.

I have the following tick script that triggers an alarm if the last day had more than N unique email_domains compared to the last 14 days.

I would like to update the email_benchmark to include 14 days ending yesterday, instead of 14 days ending today.

Simply adding .offset(1d) does not work. My hunch is that the join fails due to the mismatch on times. Do you have any suggestion on how to proceed with that?

Thanks a lot!

Tsachi.

var email_benchmark = batch
    |query('''
        SELECT total_amount,email_domain FROM "my_db"."autogen"."payment_attempts"
    ''')
        .period(14d)
        .every(5m)
        .groupBy('payment_id','email_domain')
        .fill(1)
    |groupBy('email_domain')
    |count('email_domain')
        .as('num_of_payments')
    |eval(lambda: float("num_of_payments")/14.0)
        .as('daily_average')

var email_curr = batch
    |query('''
        SELECT total_amount,email_domain FROM "my_db"."autogen"."payment_attempts"
    ''')
        .period(1d)
        .every(5m)
        .groupBy('payment_id','email_domain')
        .fill(1)
    |groupBy('email_domain')
    |count('email_domain')
        .as('num_of_payments')
    |eval(lambda: float("num_of_payments"))
        .as('curr_num_of_payments')

email_benchmark
    |join(email_curr)
        .tolerance(1m)
        .as('benchmark', 'curr')
    |eval(lambda: float("curr.curr_num_of_payments") / float("benchmark.daily_average"))
        .as('breach_ratio')
    |alert()
        .crit(lambda: "breach_ratio" > N)
        .slack()
        .channel('#alerts')

Topic		Replies	Views
Kapacitor and subqueries? influxdb , kapacitor	7	2527	July 21, 2020
combining groupBy in tickscript kapacitor	0	528	February 27, 2020
Kapacitor - Issues with Query + Join Kapacitor	0	696	February 26, 2018
How to convert influxdb query to Kapacitor TICK script? InfluxDB 2 kapacitor , influxql	3	765	February 1, 2019
Counting number of occurrences for each tag influxdb , kapacitor	0	707	July 9, 2018

Nested queries in Kapacitor

Related topics