How to count new distinct values in a histogram taking into account what occured previously?

wonderer · November 22, 2023, 10:35am

Hello!

Let’s say, I have the following data structure and following state of data

shop_measurement

TIMESTAMP, CUSTOMER_NAME
('March 1', 'Mike'),
('March 1', 'Yusuf'),
('March 1', 'John'),

('March 2', 'Ajay'),
('March 2', 'Mike'),
('March 2', 'Anna')

('March 5', 'Emily')
('March 5', 'Anna')
('March 5', 'Emily')

Let’s say, it represents a simple counter that calculates the fact that a person with a certain name entered a shop. Each time a name enters a shop I get a record in InfluxDB.

What I would like to build is a histogram showing how many unique names entered a shop each date. However, the definition of “unique” is not within the date only, but the definition of “unique” is as well within the scope of all previous dates before the date in question.

To explain better the result I would like should be following:

('March 1', 3) -- because Mike, Yusuf, John entered a shop for the first time ever, no previous entries are available
('March 2', 2) -- even though we have 3 entries for March 2 we have seen Mike already on March 1, that is why his entry is not unique, we do not calculate him
('March 3', 0)
('March 4', 0)
('March 5', 1) -- even though we have 3 entries for March 5 we have seen Anna before (on March 2) and Emily entered a shop twice same day

I know how to do in in “SQL” with ROW_NUMBER() and PARTITION tricks. But in InfluxQL I am stuck. The best I could do so far

SELECT
DATE_BIN_GAPFILL(INTERVAL '24 hours', time, '1970-01-01T00:00:00Z'::TIMESTAMP) AS time,
COUNT(*)
FROM 'shop_measurement'
WHERE time >= now() - interval '7 days' AND time <= now()
GROUP BY 1
ORDER BY 1

This gives me histogram with calculation of all shop entries each date in the interval. But how to move from here to what I need I have no idea.

Any help, please? : )

FixTestRepeat · November 24, 2023, 8:25pm

This is an interesting problem

As no others have responded yet, I thought I’d add my 2cents.

A question, how many days will be queried each time this is executed after the first week? And then after the first month, then year? Will the calculation effort grow exponentially?

My honest opinion on these types of problems is that performance/efficiency will generally scale better if a partial or intermediate calculation can be stored or cached. And that in itself also helps to simplify the task too.

Assuming I understand the requirements, can you insert unique names not having been seen before into another measurement table? you can then do a diff /left-join between any names seen “today” vs the list of all unique names ever seen previously.

If the notion of a rolling 7 day window is also a mandatory requirement too, when selecting from the unique names measurement just add that condition there too.

So in essence the heavy calculation originally, gets split into 2 smaller and simpler tasks - the first identifying unique names, and then the second dealing with the count summary

wonderer · November 25, 2023, 8:36am

@FixTestRepeat, thank you for your input!

Yes, you are right, complexity will grow with each new day … I am not sure whether it is an exponential growth though. Because the way you suggest to do it suggests already that this is not an exponentially growing problem. In any non-declarative programming language this would be I guess somewhere closer to linear complexity.

Let’s split my question in two shades:

“General question about expressiveness of InfluxQL as a language.” I am very much interested to understand whether I can do this in one query and without explicitly saving intermediate measurements, because in MSSQL I can solve this problem with one query. And this is exactly the reason for my post
“Operational question how to do this in general just to solve the problem in any possible way.” The way you suggested is not bad at all. My approach would be a bit differrent though in the sense that I would not save intermediate result in a separate measurement table but I would query intermediate result (what names were seen until specific day), deliver it to, say Java, then I would make a second query (show me who entered the shop in a specific time window), deliver it to Java, and the actual logic of comparison I would already execute in Java.

But what I am really interested in is 1).

Topic		Replies	Views
Query a counter of distinct values Telegraf influxdb	5	1724	July 19, 2019
Flux Query to count unique occurrences of fields InfluxDB 2 influxdb	4	2323	April 24, 2023
Unique tag values in time interval InfluxDB 2	4	900	February 25, 2020
InfluxDB Flux - get count of distinct occurrences Fluxlang influxdb	4	8404	July 30, 2020
InfluxDB 1.8 query help (show all values with same timestamp, and specific tag) InfluxQL	2	1213	November 16, 2022

How to count new distinct values in a histogram taking into account what occured previously?

Related topics