Hi there,
I was trying to replace this InfluxQL query with Flux: select count(distinct(tracking_id)) from em where time > now() - 15h group by time(1h)
I wrote a Flux query that looks like this, but somehow it produces totally different results:
from(bucket: "genios/autogen")
|> range($range)
|> filter(fn: (r) => r._measurement == "em" and r.env == "production" and r._field == "tracking_id")
|> group(columns: ["tracking_id"])
|> aggregateWindow(every: 1h, fn: count)
We use Grafana, and I want to calculate a percentage between two measurements.
scott
March 30, 2020, 3:04pm
2
Just to make sure I understand correctly: you’re trying to count the number of distinct tracking_ids per hour? If so, this should do it:
from(bucket: "db/autogen")
|> range(start: $range)
|> filter(fn: (r) => r._measurement == "em")
|> distinct(column: "tracking_id")
|> group()
|> window(every: 1h)
|> count()
You also submitted this issue on the Flux repo that is a little further down the road and tries to join the two streams with updated field/tag names. (Is tracking_id/tid a field or a tag? If it’s a tag, it’ll change the query.)
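If it is a tag, the hourly distinct count would look more like this (a sketch, untested: tag values live in the group key, so the stream has to be ungrouped before deduplicating):
from(bucket: "db/autogen")
|> range(start: $range)
|> filter(fn: (r) => r._measurement == "em")
|> group() // merge the per-tag tables into one stream
|> window(every: 1h)
|> distinct(column: "tracking_id") // one row per unique tag value per window
|> count()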
I’m not sure I fully understand your use case. Can you describe the schema of the input data and your desired output?
Thanks for getting back…
Your suggestion does not work, as I need to join it.
I have two measurements; for one of them I need to count the number of unique tracking_ids (or tid, as in my example).
The other one more or less just counts the number of rows.
Then I want to join those two and calculate a percentage.
And I want to do this with several measurements.
Context: it’s email sending/opening events.
scott
March 30, 2020, 3:21pm
4
Do you need a percentage per unique tracking ID? Or just the number of unique tracking IDs / number of total rows/events?
I just need the total number of unique tracking ids per hour, for example. Or maybe per day.
scott
March 30, 2020, 3:55pm
6
Ok, a few more questions:
Is tracking_id/tid a field or a tag?
What percentage are you trying to calculate?
Sorry, the tracking_id is a field. I need to calculate the distinct number of tracking ids in one measurement and put it in relation to the total count of entries in another measurement.
Context: one measurement contains one entry per mail sent. The other contains one entry per mail opened. But mails are opened more than once, so I need a distinct count on the 2nd measurement.
This is the complete Flux query:
sm = from(bucket: "genios/autogen")
|> range($range)
|> filter(fn: (r) => r._measurement == "monitoring" and r.env == "prod" and r.sent == "yes" and r._field == "tracking_id")
|> group(columns: ["tracking_id"])
|> aggregateWindow(every: 1h, fn: count)
op = from(bucket: "genios/autogen")
|> range($range)
|> filter(fn: (r) => r._measurement == "email_opened" and r.env == "production" and r._field == "tracking_id")
|> group(columns: ["tracking_id"])
|> aggregateWindow(every: 1h, fn: count)
join(
tables: {sentMails:sm, openedMails:op},
on: ["_time", "_stop", "_start"]
)
|> map(fn: (r) => ({
_time: r._time,
_value: (float(v:r._value_openedMails) / float(v:r._value_sentMails))
})
)
And yes, there is a difference between prod and production…
Cheers,
Stephan
scott
March 30, 2020, 9:19pm
8
Try this:
sent_mail = from(bucket:"genios/autogen")
|> range($range)
|> filter(fn: (r) => r._measurement == "monitoring" and r.env == "prod" and r.sent == "yes" and r._field == "tracking_id")
|> aggregateWindow(every: 1h, fn: count)
|> truncateTimeColumn(unit: 1h)
opened_mail = from(bucket:"genios/autogen")
|> range($range)
|> filter(fn: (r) => r._measurement == "email_opened" and r.env == "production" and r._field == "tracking_id" )
|> aggregateWindow(every: 1h, fn: (tables=<-, column) => tables |> distinct() |> count())
|> truncateTimeColumn(unit: 1h)
join(
tables: {sentMails: sent_mail, openedMails: opened_mail},
on: ["_time"]
)
|> map(fn: (r) => ({
_time: r._time,
_value: (float(v:r._value_openedMails) / float(v:r._value_sentMails))
})
)
I do a couple of things in here:
I hijack the aggregateWindow fn parameter to do both a distinct and a count on each window.
I normalize the timestamps in case there are small differences between the time values returned from aggregateWindow in each stream.
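That anonymous function is equivalent to naming it like this (a sketch; any function passed to aggregateWindow’s fn parameter needs the (tables=<-, column) signature):
// deduplicate the values in each window, then count what survives
distinctCount = (tables=<-, column) =>
tables
|> distinct()
|> count()
// used as: |> aggregateWindow(every: 1h, fn: distinctCount)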
This does not work with InfluxDB 1.7.8 (sorry, did not mention that before):
sent_mail = from(bucket:"genios/autogen") |> range(start:-2h) |> filter(fn: (r) => r._measurement == "monitoring" and r.env == "prod" and r.sent == "yes" and r._field == "tracking_id") |> aggregateWindow(every: 1h, fn: count) |> truncateTimeColumn(unit: 1h)
Error: type error 1:230-1:248: undefined identifier "truncateTimeColumn"
scott
March 30, 2020, 9:44pm
10
Oh, ok. Try removing the truncateTimeColumn calls. You may be ok without those.
Ok… tried that. The single queries work and do return something, but the join fails:
> join(
tables: {sentMails: sent_mail, openedMails: opened_mail},
on: ["_time"]
)
>> map(fn: (r) => ({
_time: r._time,
_value: (float(v:r._value_openedMails) / float(v:r._value_sentMails))
})
)
Error: loc 0:0-0:0: missing right hand side of expression
I try to run it in the influx shell; via Grafana I do not get proper errors back.
This looks like a syntax error and I can’t tell where… I also tried writing it on one line with a lot of whitespace in between (had some issues with the shell not handling newlines properly), but it did not help…
scott
March 30, 2020, 10:23pm
12
Does the join (without the map) return anything? I’m wondering if the _time values are different and you don’t actually join anything…
If join does return something, try:
join(
tables: {sentMails: sent_mail, openedMails: opened_mail},
on: ["_time"]
)
|> map(fn: (r) => ({ r with
_value: float(v: r._value_openedMails) / float(v: r._value_sentMails)
})
)
Yes, the join itself seems to do something.
But your fix does not work; still the same error.
I am querying the last 2 hours and there is data there; mails are sent once an hour and opened… well, as for when they are opened, it might be that right now there are none, but I doubt it…
scott
March 30, 2020, 10:30pm
14
Do the _value_openedMails and _value_sentMails columns both exist in the join output?
yes… several times… the output is grouped weirdly:
Table: keys: [_field_openedMails, _field_sentMails, _measurement_openedMails, _measurement_sentMails, _start_openedMails, _start_sentMails, _stop_openedMails, _stop_sentMails, env_openedMails, env_sentMails, sent, solution_openedMails, solution_sentMails, type, username]
_field_openedMails:string _field_sentMails:string _measurement_openedMails:string _measurement_sentMails:string _start_openedMails:time _start_sentMails:time _stop_openedMails:time _stop_sentMails:time env_openedMails:string env_sentMails:string sent:string solution_openedMails:string solution_sentMails:string type:string username:string _value_sentMails:int _time:time _value_openedMails:int
------------------------- ----------------------- ------------------------------- ----------------------------- ------------------------------ ------------------------------ ------------------------------ ------------------------------ ---------------------- ---------------------- ---------------------- --------------------------- ------------------------- ---------------------- ---------------------- -------------------------- ------------------------------ --------------------------
tracking_id tracking_id email_opened monitoring 2020-03-30T20:28:50.449664670Z 2020-03-30T20:28:50.449664670Z 2020-03-30T22:28:50.449664670Z 2020-03-30T22:28:50.449664670Z production prod yes WebApp1 WebApp2 search user1 0 2020-03-30T21:00:00.000000000Z 1
tracking_id tracking_id email_opened monitoring 2020-03-30T20:28:50.449664670Z 2020-03-30T20:28:50.449664670Z 2020-03-30T22:28:50.449664670Z 2020-03-30T22:28:50.449664670Z production prod yes WebApp1 WebApp2 search user1 0 2020-03-30T22:00:00.000000000Z 9
tracking_id tracking_id email_opened monitoring 2020-03-30T20:28:50.449664670Z 2020-03-30T20:28:50.449664670Z 2020-03-30T22:28:50.449664670Z 2020-03-30T22:28:50.449664670Z production prod yes WebApp1 WebApp2 search user1 0 2020-03-30T22:28:50.449664670Z 2
Table: keys: [_field_openedMails, _field_sentMails, _measurement_openedMails, _measurement_sentMails, _start_openedMails, _start_sentMails, _stop_openedMails, _stop_sentMails, env_openedMails, env_sentMails, sent, solution_openedMails, solution_sentMails, type, username]
_field_openedMails:string _field_sentMails:string _measurement_openedMails:string _measurement_sentMails:string _start_openedMails:time _start_sentMails:time _stop_openedMails:time _stop_sentMails:time env_openedMails:string env_sentMails:string sent:string solution_openedMails:string solution_sentMails:string type:string username:string _value_sentMails:int _time:time _value_openedMails:int
------------------------- ----------------------- ------------------------------- ----------------------------- ------------------------------ ------------------------------ ------------------------------ ------------------------------ ---------------------- ---------------------- ---------------------- --------------------------- ------------------------- ---------------------- ---------------------- -------------------------- ------------------------------ --------------------------
tracking_id tracking_id email_opened monitoring 2020-03-30T20:28:50.449664670Z 2020-03-30T20:28:50.449664670Z 2020-03-30T22:28:50.449664670Z 2020-03-30T22:28:50.449664670Z production prod yes WebApp1 WebApp2 search user2 0 2020-03-30T21:00:00.000000000Z 1
tracking_id tracking_id email_opened monitoring 2020-03-30T20:28:50.449664670Z 2020-03-30T20:28:50.449664670Z 2020-03-30T22:28:50.449664670Z 2020-03-30T22:28:50.449664670Z production prod yes WebApp1 WebApp2 search user2 0 2020-03-30T22:00:00.000000000Z 9
tracking_id tracking_id email_opened monitoring 2020-03-30T20:28:50.449664670Z 2020-03-30T20:28:50.449664670Z 2020-03-30T22:28:50.449664670Z 2020-03-30T22:28:50.449664670Z production prod yes WebApp1 WebApp2 search user2 0 2020-03-30T22:28:50.449664670Z 2
Table: keys: [_field_openedMails, _field_sentMails, _measurement_openedMails, _measurement_sentMails, _start_openedMails, _start_sentMails, _stop_openedMails, _stop_sentMails, env_openedMails, env_sentMails, sent, solution_openedMails, solution_sentMails, type, username]
_field_openedMails:string _field_sentMails:string _measurement_openedMails:string _measurement_sentMails:string _start_openedMails:time _start_sentMails:time _stop_openedMails:time _stop_sentMails:time env_openedMails:string env_sentMails:string sent:string solution_openedMails:string solution_sentMails:string type:string username:string _value_sentMails:int _time:time _value_openedMails:int
------------------------- ----------------------- ------------------------------- ----------------------------- ------------------------------ ------------------------------ ------------------------------ ------------------------------ ---------------------- ---------------------- ---------------------- --------------------------- ------------------------- ---------------------- ---------------------- -------------------------- ------------------------------ --------------------------
tracking_id tracking_id email_opened monitoring 2020-03-30T20:28:50.449664670Z 2020-03-30T20:28:50.449664670Z 2020-03-30T22:28:50.449664670Z 2020-03-30T22:28:50.449664670Z production prod yes WebApp1 WebApp2 search user3 0 2020-03-30T21:00:00.000000000Z 1
tracking_id tracking_id email_opened monitoring 2020-03-30T20:28:50.449664670Z 2020-03-30T20:28:50.449664670Z 2020-03-30T22:28:50.449664670Z 2020-03-30T22:28:50.449664670Z production prod yes WebApp1 WebApp2 search user3 0 2020-03-30T22:00:00.000000000Z 9
tracking_id tracking_id email_opened monitoring 2020-03-30T20:28:50.449664670Z 2020-03-30T20:28:50.449664670Z 2020-03-30T22:28:50.449664670Z 2020-03-30T22:28:50.449664670Z production prod yes WebApp1 WebApp2 search user3 0 2020-03-30T22:28:50.449664670Z
Just edited it; there’s more than that, but it all looks similar… Just noticed there were usernames in there. Now it’s a bit more anonymous.
scott
March 30, 2020, 10:48pm
18
Ok, I think I see the problem. All the _value_sentMails values are 0, and you can’t divide by 0. But I think the reason you’re getting all the 0s is that your data is grouped by all the different tag values, so you get a bunch of separate output tables when you aggregate everything.
Try this:
sent_mail = from(bucket:"genios/autogen")
|> range($range)
|> filter(fn: (r) => r._measurement == "monitoring" and r.env == "prod" and r.sent == "yes" and r._field == "tracking_id")
|> group(columns: ["_field"])
|> aggregateWindow(every: 1h, fn: count)
opened_mail = from(bucket:"genios/autogen")
|> range($range)
|> filter(fn: (r) => r._measurement == "email_opened" and r.env == "production" and r._field == "tracking_id" )
|> group(columns: ["_field"])
|> aggregateWindow(every: 1h, fn: (tables=<-, column) => tables |> distinct() |> count())
join(
tables: {sentMails: sent_mail, openedMails: opened_mail},
on: ["_time"]
)
|> map(fn: (r) => ({
_time: r._time,
_value: float(v:r._value_openedMails) / float(v:r._value_sentMails)
})
)
Hey, this is awesome! This works… (although truncateTimeColumn still does not work with my version of Flux, and I need to give group an array as columns).
Now I still need to understand it a bit better…
The sent_mail part is rather simple and did not change; the join also.
But what is that fn in aggregateWindow in opened_mail? I do not quite understand what it does…
Thanks a lot for your help!
Well, it works, kind of…
For whatever reason I get values > 1, which seems wrong. The unique part seems to fail somehow.
This is the query that needs to be replaced: SELECT count(distinct("tracking_id")) FROM "email_opened" WHERE ("env" = 'production') AND $timeFilter GROUP BY time(1h) fill(null)
As the tracking ids in sent mails and email_opened are the same, the result of dividing the number of unique opened_mail tids by the total number of mails sent should never be higher than 1.0, which would be a 100% opening rate…
BUT: as it turned out, the problem is the 1h grouping. Mails are sent at e.g. 12:00 and opened the next hour. If I set the aggregateWindow to 1d it looks way better. Sorry.
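For the record, the daily variant is just the last working query with the window size changed (a sketch along the same lines, untested):
sent_mail = from(bucket: "genios/autogen")
|> range($range)
|> filter(fn: (r) => r._measurement == "monitoring" and r.env == "prod" and r.sent == "yes" and r._field == "tracking_id")
|> group(columns: ["_field"])
|> aggregateWindow(every: 1d, fn: count)
opened_mail = from(bucket: "genios/autogen")
|> range($range)
|> filter(fn: (r) => r._measurement == "email_opened" and r.env == "production" and r._field == "tracking_id")
|> group(columns: ["_field"])
|> aggregateWindow(every: 1d, fn: (tables=<-, column) => tables |> distinct() |> count())
join(tables: {sentMails: sent_mail, openedMails: opened_mail}, on: ["_time"])
|> map(fn: (r) => ({_time: r._time, _value: float(v: r._value_openedMails) / float(v: r._value_sentMails)}))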
system
Closed
March 31, 2020, 8:21am
21
This topic was automatically closed 60 minutes after the last reply. New replies are no longer allowed.