Distinct values counting in a join

hi there,

I was trying to get this query here replaced with flux: select count(distinct(tracking_id)) from em where time > now() -15h group by time(1h)

I created a flux query that looks like this, but is creating totally different results somehow.

from(bucket:"genios/autogen")|>range($range)|> filter(fn: (r) =>r._measurement == "em" and r.env == "production" and r._field == "tracking_id" ) |> group(columns:["tracking_id"])|>aggregateWindow(every:1h,fn:count)

We use grafana, and I want to calculate a percentage between two measurements.

Just to make sure I understand correctly, you’re trying to count the number of distinct tracking_ids per hour? If so, this should do it:

from(bucket: "db/autogen")
  |> range(start $range)
  |> filter(fn: (r) => r._measurement == "em")
  |> distinct(column: "tracking_id")
  |> group()
  |> window(every: 1h)
  |> count()

You also submitted this issue on the Flux repo that is a little further down the road and tries to join the two streams with updated field/tag names. (Is tracking_id/tid a field or a tag? If it’s a tag, it’ll change the query)

I’m not sure I fully understand your use case. Can you describe the schema of the input data and your desired output?

Thanks for getting back…

Your suggestion does not work, as I need to join it.

I have two measurements, one of which I need to count the number of unique tracking_ids (or tid - there is a in my example).
The other one just counts the number of rows more or less
Then I want to join those two and calculate a percentage

And This I want to do with several measurements

Context: Its email sending / opening events

Do you need a percentage per unique tracking ID? Or just the number of unique tracking IDs / number of total rows/events?

I just need a total numer of unique tracking ids per hour for example. Or per day maybe.

Ok, a few more questions:

  • Is tracking_id/tid a field or a tag?
  • What percentage are you trying to calculate?

Sorry,

the tracking_id is a field. I need to calculate the distinct number of tracking ids in one measurement and put it in relation of the total count of entries in another measurement.

Context: one measurement contains one entry per mail being sent. The other one contains one entry for mail being opened. But mails are opened more than once, so I need a distinct count on the 2nd measurement.

this is the complete flux query:

sm=from(bucket:"genios/autogen")|>range($range)|> filter(fn: (r) =>r._measurement == "monitoring" and r.env == "prod" and r.sent == "yes" and r._field == "tracking_id") |>group(columns:["tracking_id"]) |>aggregateWindow(every: 1h, fn: count)

op=from(bucket:"genios/autogen")|>range($range)|> filter(fn: (r) =>r._measurement == "email_opened" and r.env == "production" and r._field == "tracking_id" ) |> group(columns:["tracking_id"])|>aggregateWindow(every:1h,fn:count)

join(
    tables: {sentMails:sm, openedMails:op},
    on: ["_time", "_stop", "_start"]
  )
  >> map(fn: (r) => ({
    _time: r._time,
    _value: (float(v:r._value_openedMails) / float(v:r._value_sentMails))
  })
)

And yes, there is a differnce between prod and production…

:wink:

Cheers,

Stephan

Try this:

sent_mail = from(bucket:"genios/autogen")
  |> range($range)
  |> filter(fn: (r) => r._measurement == "monitoring" and r.env == "prod" and r.sent == "yes" and r._field == "tracking_id") 
  |> aggregateWindow(every: 1h, fn: count)
  |> truncateTimeColumn(unit: 1h)

opened_mail =from(bucket:"genios/autogen")
  |> range($range)
  |> filter(fn: (r) => r._measurement == "email_opened" and r.env == "production" and r._field == "tracking_id" ) 
  |> aggregateWindow(every: 1h, fn: (tables=<-, column) => tables |> distinct() |> count())
  |> truncateTimeColumn(unit: 1h)

join(
    tables: {sentMails: sent_mail, openedMails: opened_mail},
    on: ["_time"]
  )
  >> map(fn: (r) => ({
    _time: r._time,
    _value: (float(v:r._value_openedMails) / float(v:r._value_sentMails))
  })
)

I do a couple of things in here:

  1. I high-jack the aggregateWindow fn parameter to do both a distinct and count on each window.
  2. I normalize the timestamps in case there are small differences between the time values returned from aggregateWindow in each stream.

this does not work with influx 1.7.8 (sorry, did not mention that before):

sent_mail = from(bucket:"genios/autogen") |> range(start:-2h) |> filter(fn: (r) => r._measurement == "monitoring" and r.env == "prod" and r.sent == "yes" and r._field == "tracking_id") |> aggregateWindow(every: 1h, fn: count) |> truncateTimeColumn(unit: 1h)

Error: type error 1:230-1:248: undefined identifier "truncateTimeColumn"

Oh, ok. Try removing the truncateTimeColumn calls. You may be ok without those.

ok… tried that. the singel queries work and do return something, but the join fails:

> join(
    tables: {sentMails: sent_mail, openedMails: opened_mail},
    on: ["_time"]
  )
  >> map(fn: (r) => ({
    _time: r._time,
    _value: (float(v:r._value_openedMails) / float(v:r._value_sentMails))
  })
)
Error: loc 0:0-0:0: missing right hand side of expression

I try to run it in influx shell, via Grafana I do not get proper errors back.

This looks like a syntax error and I can’t tell, where… I also tried writing it in one line with a lot of whitespace in between (had some issues with the shell not working properly with newlines), but it did not help…

Does the join (without the map) return anything? I’m wondering if the _time values are different an you don’t actually join anything…

If join does return something, try:

join(
    tables: {sentMails: sent_mail, openedMails: opened_mail},
    on: ["_time"]
  )
  |> map(fn: (r) => ({ r with
    _value: float(v: r._value_openedMails) / float(v: r._value_sentMails)
  })
)

yes, the join itself seems to do something.

but your fix does not work, still the same error.

I am trying the last 2 hours, there is data there, mails are sent once an hour and opened… well, when they are opened :wink: might be, that now there are none, but I doubt it…

Do the _value_openedMails and _value_sentMails columns both exist in the join output?

yes… several times… the output is grouped weirdly:

olution_openedMails, solution_sentMails, type, username]
_field_openedMails:string  _field_sentMails:string  _measurement_openedMails:string  _measurement_sentMails:string         _start_openedMails:time           _start_sentMails:time          _stop_openedMails:time            _stop_sentMails:time  env_openedMails:string    env_sentMails:string             sent:string  solution_openedMails:string  solution_sentMails:string             type:string         username:string        _value_sentMails:int                      _time:time      _value_openedMails:int
-------------------------  -----------------------  -------------------------------  -----------------------------  ------------------------------  ------------------------------  ------------------------------  ------------------------------  ----------------------  ----------------------  ----------------------  ---------------------------  -------------------------  ----------------------  ----------------------  --------------------------  ------------------------------  --------------------------
              tracking_id              tracking_id                     email_opened                     monitoring  2020-03-30T20:28:50.449664670Z  2020-03-30T20:28:50.449664670Z  2020-03-30T22:28:50.449664670Z  2020-03-30T22:28:50.449664670Z              production                    prod                     yes                       WebApp1                       WebApp2                  search                user1                           0  2020-03-30T21:00:00.000000000Z                           1
              tracking_id              tracking_id                     email_opened                     monitoring  2020-03-30T20:28:50.449664670Z  2020-03-30T20:28:50.449664670Z  2020-03-30T22:28:50.449664670Z  2020-03-30T22:28:50.449664670Z              production                    prod                     yes                       WebApp1                       WebApp2                  search                user1                           0  2020-03-30T22:00:00.000000000Z                           9
              tracking_id              tracking_id                     email_opened                     monitoring  2020-03-30T20:28:50.449664670Z  2020-03-30T20:28:50.449664670Z  2020-03-30T22:28:50.449664670Z  2020-03-30T22:28:50.449664670Z              production                    prod                     yes                       WebApp1                       WebApp2                  search                user1                           0  2020-03-30T22:28:50.449664670Z                           2
Table: keys: [_field_openedMails, _field_sentMails, _measurement_openedMails, _measurement_sentMails, _start_openedMails, _start_sentMails, _stop_openedMails, _stop_sentMails, env_openedMails, env_sentMails, sent, solution_openedMails, solution_sentMails, type, username]
_field_openedMails:string  _field_sentMails:string  _measurement_openedMails:string  _measurement_sentMails:string         _start_openedMails:time           _start_sentMails:time          _stop_openedMails:time            _stop_sentMails:time  env_openedMails:string    env_sentMails:string             sent:string  solution_openedMails:string  solution_sentMails:string             type:string         username:string        _value_sentMails:int                      _time:time      _value_openedMails:int
-------------------------  -----------------------  -------------------------------  -----------------------------  ------------------------------  ------------------------------  ------------------------------  ------------------------------  ----------------------  ----------------------  ----------------------  ---------------------------  -------------------------  ----------------------  ----------------------  --------------------------  ------------------------------  --------------------------
              tracking_id              tracking_id                     email_opened                     monitoring  2020-03-30T20:28:50.449664670Z  2020-03-30T20:28:50.449664670Z  2020-03-30T22:28:50.449664670Z  2020-03-30T22:28:50.449664670Z              production                    prod                     yes                       WebApp1                       WebApp2                  search                user2                           0  2020-03-30T21:00:00.000000000Z                           1
              tracking_id              tracking_id                     email_opened                     monitoring  2020-03-30T20:28:50.449664670Z  2020-03-30T20:28:50.449664670Z  2020-03-30T22:28:50.449664670Z  2020-03-30T22:28:50.449664670Z              production                    prod                     yes                       WebApp1                       WebApp2                  search                user2                           0  2020-03-30T22:00:00.000000000Z                           9
              tracking_id              tracking_id                     email_opened                     monitoring  2020-03-30T20:28:50.449664670Z  2020-03-30T20:28:50.449664670Z  2020-03-30T22:28:50.449664670Z  2020-03-30T22:28:50.449664670Z              production                    prod                     yes                       WebApp1                       WebApp2                  search                user2                           0  2020-03-30T22:28:50.449664670Z                           2
Table: keys: [_field_openedMails, _field_sentMails, _measurement_openedMails, _measurement_sentMails, _start_openedMails, _start_sentMails, _stop_openedMails, _stop_sentMails, env_openedMails, env_sentMails, sent, solution_openedMails, solution_sentMails, type, username]
_field_openedMails:string  _field_sentMails:string  _measurement_openedMails:string  _measurement_sentMails:string         _start_openedMails:time           _start_sentMails:time          _stop_openedMails:time            _stop_sentMails:time  env_openedMails:string    env_sentMails:string             sent:string  solution_openedMails:string  solution_sentMails:string             type:string         username:string        _value_sentMails:int                      _time:time      _value_openedMails:int
-------------------------  -----------------------  -------------------------------  -----------------------------  ------------------------------  ------------------------------  ------------------------------  ------------------------------  ----------------------  ----------------------  ----------------------  ---------------------------  -------------------------  ----------------------  ----------------------  --------------------------  ------------------------------  --------------------------
              tracking_id              tracking_id                     email_opened                     monitoring  2020-03-30T20:28:50.449664670Z  2020-03-30T20:28:50.449664670Z  2020-03-30T22:28:50.449664670Z  2020-03-30T22:28:50.449664670Z              production                    prod                     yes                       WebApp1                       WebApp2                  search                user3                           0  2020-03-30T21:00:00.000000000Z                           1
              tracking_id              tracking_id                     email_opened                     monitoring  2020-03-30T20:28:50.449664670Z  2020-03-30T20:28:50.449664670Z  2020-03-30T22:28:50.449664670Z  2020-03-30T22:28:50.449664670Z              production                    prod                     yes                       WebApp1                       WebApp2                  search                user3                           0  2020-03-30T22:00:00.000000000Z                           9
              tracking_id              tracking_id                     email_opened                     monitoring  2020-03-30T20:28:50.449664670Z  2020-03-30T20:28:50.449664670Z  2020-03-30T22:28:50.449664670Z  2020-03-30T22:28:50.449664670Z              production                    prod                     yes                       WebApp1                       WebApp2                  search                user3                           0  2020-03-30T22:28:50.449664670Z

just did :wink: it’s more than that, but it all looks similar… just noticed - there are usernames in there. now its a bit more anonymous.

Ok, I think I see the problem. All the _value_sentMails are 0. You can’t divide by 0. But I think the reason you’re getting all the 0s is because your data is getting grouped by all the different tag values so you’re getting a bunch of separate output tables when you aggregate everything.

Try this:

sent_mail = from(bucket:"genios/autogen")
  |> range($range)
  |> filter(fn: (r) => r._measurement == "monitoring" and r.env == "prod" and r.sent == "yes" and r._field == "tracking_id") 
  |> group(columns: ["_field"])
  |> aggregateWindow(every: 1h, fn: count)

opened_mail =from(bucket:"genios/autogen")
  |> range($range)
  |> filter(fn: (r) => r._measurement == "email_opened" and r.env == "production" and r._field == "tracking_id" ) 
  |> group(columns: ["_field"])
  |> aggregateWindow(every: 1h, fn: (tables=<-, column) => tables |> distinct() |> count())

join(
    tables: {sentMails: sent_mail, openedMails: opened_mail},
    on: ["_time"]
  )
  |> map(fn: (r) => ({
    _time: r._time,
    _value: float(v:r._value_openedMails) / float(v:r._value_sentMails)
  })
)

Hey, this is awesome! This works… (although I truncateTimeColumn still does not work with my version of flux and I need to give the group an array as collumns).

Now I still need to understand it a bit better…

the sent_mail part is rather simple and did not change. The join also.

but what is that fn in aggregateWindow in opened_mail? I do not quite understand what it does…

Thanks a lot for your help!

well, it works kind of…

for whatever reason I get values > 1, which seems wrong. The unique part seems to fail somehow.

this is the query, that needs to be replaced: SELECT count(distinct("tracking_id")) FROM "email_opened" WHERE ("env" = 'production') AND $timeFilter GROUP BY time(1h) fill(null)

As the tracking ids in sentmail and email_opened are the same, the resulting number of dividing the number of unique number of opned_mail tids by the number of total mails sent should never be higher than 1.0 which would be a 100% opening rate…

BUT: as it turned out the problem is the 1h grouping. Mails are sent at e.g. 12:00 am and opened the next hour. If I set the aggregateWindow to 1d it looks way better. Sorry.

This topic was automatically closed 60 minutes after the last reply. New replies are no longer allowed.