Grouped data shows a higher Sum than expected

Hi all!
In my database, data is being inserted into an “origin” measurement with a low-duration retention policy, and some continuous queries act on it to produce other measurements with coarser aggregation. For one specific piece of information a single CQ is not enough (I’d need something like a count distinct, or a count on a group by, which as far as I’ve read can’t be done at the moment), so I’ve been chaining two CQs.
The second CQ, however, returns wrong values for the Sum aggregation: with the first measurement being “aggrdata.minly.usrs” and the second one “aggrdata.hourly.usxp”, if for each time bin of the former you count how many distinct values the tag “usr” takes for a specific “cnt”, you get one, while the latter shows count values higher than that.
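To give an idea of the chaining, the first CQ does roughly this: for every minute it writes one usrs point per (usr, tag set) combination out of the raw base points. This is a simplified sketch, not the verbatim statement (the flNm field name is taken from the base output further down; the rest of the shape is illustrative):

CREATE CONTINUOUS QUERY CQ_usrs_m ON aggrdata
RESAMPLE EVERY 1m FOR 2m
BEGIN
    SELECT count(flNm) AS n_files
    INTO aggrdata.minly_rt.usrs
    FROM aggrdata.minly_rt.base
    GROUP BY time(1m), usr, brwsr, clnt, cnt, cntry, cty, dvMn, dvOs, dvTp
END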

I’m really clueless about why this is happening. I fear I’ve been doing things in unorthodox ways, like the two chained CQs, or maybe the way I used the CQs’ advanced syntax… I must confess that, needing shorter durations for my retention policies, I even changed the MinRetentionPolicyDuration variable in InfluxDB’s source and built a custom binary. Still, I don’t see why any of this should affect how the Sum aggregation works. Can anyone shed some light on this mystery? Thanks in advance!

I’ll append the continuous query, plus some queries that show what I just tried to explain. I don’t think my config file is necessary, but if you think otherwise just tell me and I’ll attach it. (Also, at the moment the shortest duration among my retention policies is 30m, even though I plan to take it down to 1 or 2m.)
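For reference, the retention policies were created with statements of this form (a sketch, exact durations aside; stock InfluxDB rejects durations below 1h, which is why I had to patch MinRetentionPolicyDuration):

CREATE RETENTION POLICY "minly_rt" ON "aggrdata" DURATION 30m REPLICATION 1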

CREATE CONTINUOUS QUERY CQ_usexp_m ON aggrdata
RESAMPLE EVERY 1m FOR 2m
BEGIN
    SELECT count(n_files) AS n_users,
           mean(n_files) AS avg_files
    INTO aggrdata.st_m.usexp
    FROM aggrdata.minly_rt.usrs
    GROUP BY time(1m), brwsr, cnt
END
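As I understand the RESAMPLE clause, every minute the CQ service re-runs the inner query over the last two 1-minute bins, i.e. it effectively executes something like the following (time bounds are illustrative):

SELECT count(n_files) AS n_users, mean(n_files) AS avg_files
INTO aggrdata.st_m.usexp
FROM aggrdata.minly_rt.usrs
WHERE time >= '2017-06-06T14:35:00Z' AND time < '2017-06-06T14:37:00Z'
GROUP BY time(1m), brwsr, cnt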

select * from minly_rt.usrs where cnt='SOrBDPy' and brwsr='2' and time > now() - 30m
name: usrs
time                 brwsr cnt     n_files usr
----                 ----- ---     ------- ---
2017-06-06T14:09:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:10:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:11:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:12:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:13:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:14:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:15:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:16:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:17:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:18:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:19:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:20:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:21:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:22:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:23:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:24:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:25:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:26:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:27:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:28:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:29:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:30:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:31:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:32:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:33:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:34:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:35:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:36:00Z 2     SOrBDPy 13      714e70fb-f84d-4914-a6fa-ccaa5287f4e0
2017-06-06T14:37:00Z 2     SOrBDPy 12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0

select * from st_m.usexp where cnt='SOrBDPy' and brwsr='2' and time > now() - 30m
name: usexp
time                 avg_files brwsr cnt     n_users
----                 --------- ----- ---     -------
2017-06-06T14:09:00Z 12        2     SOrBDPy 1
2017-06-06T14:10:00Z 12        2     SOrBDPy 1
2017-06-06T14:11:00Z 12        2     SOrBDPy 3
2017-06-06T14:12:00Z 12        2     SOrBDPy 1
2017-06-06T14:13:00Z 12        2     SOrBDPy 3
2017-06-06T14:14:00Z 12        2     SOrBDPy 1
2017-06-06T14:15:00Z 12        2     SOrBDPy 3
2017-06-06T14:16:00Z 12        2     SOrBDPy 1
2017-06-06T14:17:00Z 12        2     SOrBDPy 3
2017-06-06T14:18:00Z 12        2     SOrBDPy 1
2017-06-06T14:19:00Z 12        2     SOrBDPy 3
2017-06-06T14:20:00Z 12        2     SOrBDPy 5
2017-06-06T14:21:00Z 12        2     SOrBDPy 1
2017-06-06T14:22:00Z 12        2     SOrBDPy 3
2017-06-06T14:23:00Z 12        2     SOrBDPy 1
2017-06-06T14:24:00Z 12        2     SOrBDPy 3
2017-06-06T14:25:00Z 12        2     SOrBDPy 1
2017-06-06T14:26:00Z 12        2     SOrBDPy 1
2017-06-06T14:27:00Z 12        2     SOrBDPy 3
2017-06-06T14:28:00Z 12        2     SOrBDPy 1
2017-06-06T14:29:00Z 12        2     SOrBDPy 3
2017-06-06T14:30:00Z 12        2     SOrBDPy 5
2017-06-06T14:31:00Z 12        2     SOrBDPy 1
2017-06-06T14:32:00Z 12        2     SOrBDPy 1
2017-06-06T14:33:00Z 12        2     SOrBDPy 3
2017-06-06T14:34:00Z 12        2     SOrBDPy 5
2017-06-06T14:35:00Z 12        2     SOrBDPy 1
2017-06-06T14:36:00Z 13        2     SOrBDPy 1
2017-06-06T14:37:00Z 12        2     SOrBDPy 3

I’m a bit confused by your use case too, but let’s see if we can figure this out! What does the data look like if you just run the inner query above directly, without the CQ (i.e. on the raw data)?

Hi Sebito, thank you very much for answering! Sorry it took me so long to reply, but I had to wait for the db to go through stress tests once again.
I did as you suggested, and the inner query shows the correct results. It looks as if the CQ service fails to run it correctly.

Here’s proof of that:

The continuous query (the second one):

CREATE CONTINUOUS QUERY CQ_usexp_m ON aggrdata
RESAMPLE EVERY 1m FOR 5m
BEGIN
    SELECT count(n_files) AS n_users,
           mean(n_files) AS avg_files
    INTO aggrdata.st_m.usexp
    FROM aggrdata.minly_rt.usrs
    GROUP BY time(1m), brwsr, clnt, cnt, cntry, cty, dvMn, dvOs, dvTp
END

Here’s the raw data (i.e. the output of the first CQ):

>select * from minly_rt.usrs where usr='714e70fb-f84d-4914-a6fa-ccaa5287f4e0' and time ='2017-06-15T09:33:00Z'

name: usrs
time                 brwsr clnt cnt     cntry   cty     dvMn dvOs dvTp n_files usr
----                 ----- ---- ---     -----   ---     ---- ---- ---- ------- ---
2017-06-15T09:33:00Z 2     cdn  SOrBDPy 3175395 3173435 0    3    1    12      714e70fb-f84d-4914-a6fa-ccaa5287f4e0

Here’s the (2nd) CQ’s inner query run on the raw data:

> SELECT count(n_files) AS n_users,
         mean(n_files) AS avg_files
  FROM aggrdata.minly_rt.usrs
  WHERE time = '2017-06-15T09:33:00Z'
    AND usr='714e70fb-f84d-4914-a6fa-ccaa5287f4e0'
  GROUP BY time(1m), cnt, cntry, cty, dvMn, dvOs, dvTp

name: usrs
tags: brwsr=2, clnt=cdn, cnt=SOrBDPy, cntry=3175395, cty=3173435, dvMn=0, dvOs=3, dvTp=1
time                 n_users avg_files
----                 ------- --------
2017-06-15T09:33:00Z 1       12

Here’s the output of the (second) CQ onto st_m.usexp (please note the difference in the field n_users between this result and the previous one):

> select * from st_m.usexp where brwsr='2' and cnt='SOrBDPy' and dvOs='3' and clnt='cdn' and time ='2017-06-15T09:33:00Z'

name: usexp
time                 avg_files brwsr clnt cnt     cntry   cty     dvMn dvOs dvTp n_users
----                 --------- ----- ---- ---     -----   ---     ---- ---- ---- -------
2017-06-15T09:33:00Z 12        2     cdn  SOrBDPy 3175395 3173435 0    3    1    7

I have no clue what’s happening, but whatever it is, it happens very often (again, look at the n_users field: it should always show a 1):
> select * from st_m.usexp where brwsr='2' and cnt='SOrBDPy' and dvOs='3' and clnt='cdn'

name: usexp
time                 avg_files brwsr clnt cnt     cntry   cty     dvMn dvOs dvTp n_users
----                 --------- ----- ---- ---     -----   ---     ---- ---- ---- -------
2017-06-15T09:30:00Z 12        2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:31:00Z 12        2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:32:00Z 12        2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:33:00Z 12        2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:34:00Z 12        2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:35:00Z 132       2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:36:00Z 12        2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:37:00Z 132       2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:38:00Z 60        2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:39:00Z 60        2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:40:00Z 180       2     cdn  SOrBDPy 3175395 3173435 0    3    1    5
2017-06-15T09:41:00Z 120       2     cdn  SOrBDPy 3175395 3173435 0    3    1    5
2017-06-15T09:42:00Z 12        2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:43:00Z 12        2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:44:00Z 12        2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:45:00Z 108       2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:46:00Z 13        2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:47:00Z 11        2     cdn  SOrBDPy 3175395 3173435 0    3    1    1
2017-06-15T09:48:00Z 12        2     cdn  SOrBDPy 3175395 3173435 0    3    1    5
2017-06-15T09:49:00Z 12        2     cdn  SOrBDPy 3175395 3173435 0    3    1    5
2017-06-15T09:50:00Z 12        2     cdn  SOrBDPy 3175395 3173435 0    3    1    5
2017-06-15T09:51:00Z 1         2     cdn  SOrBDPy 3175395 3173435 0    3    1    5

When I wrote about stress conditions, I meant roughly 12000 inserts per minute.
> select count(*) from minly_rt.base where time >= '2017-06-15T10:13:00Z' and time < '2017-06-15T10:14:00Z'
name: base
time                 count_cntTp count_flNm count_https count_rfrr count_rspBps count_rspBs count_rspT count_wrk
----                 ----------- ---------- ----------- ---------- ------------ ----------- ---------- ---------
2017-06-15T10:13:00Z 12005       12005      12005       12005      12005        12005       12005      12005

At the moment I’m working with a 2-minute duration for the minly_rt retention policy, but as in my previous post, the problem showed up even with longer durations (30 minutes, to be precise).

As for the confusion I’ve caused: if you have more specific questions, please ask; any suggestion is more than welcome. Is it about the two chained CQs?

@sebito91
Hi, I don’t know if this is of any help, but I looked at the CPU consumption and it looks kind of off: even though InfluxDB has been given 4 CPUs, it only uses one of them, and that one is almost always at 100%. That’s true even now, when I’m running tests with far less input data, so I guess it’s something related to internal processes (retention policies, CQs, etc.).
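In case it helps with the diagnosis: beyond looking at the OS, the only things I know of inside Influx are SHOW STATS, SHOW DIAGNOSTICS and the _internal monitoring database for checking what the CQ service is up to, something along these lines (treat the measurement name as a sketch, it may differ between versions):

SHOW STATS
SHOW DIAGNOSTICS
SELECT * FROM "_internal".."cq" WHERE time > now() - 5m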

Some more info:
At the moment we are receiving 500 to 1k lines per second, using the line protocol.
We are on a single node, with one 2-minute RP (minly_rt) covering 5 measurements and three 30-minute RPs covering 3 measurements. Also, I have 2 CQs running every two seconds and 3 running every minute.

Also, even counting lines doesn’t work properly. This is the output of a query counting lines grouped by minute; as you can see, the result depends on the field (!!!):

2017-06-19T08:07:18Z 443 447 451 455 459 463 465 469
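That line came from a query of roughly this shape (a reconstruction, since I didn’t paste the exact statement):

SELECT count(*) FROM minly_rt.base WHERE time > now() - 10m GROUP BY time(1m)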

UPDATE:
I’ve dropped the two CQs that had been running every second, leaving only those running every minute, and CPU usage dropped too: now it only shows 100% spikes at the beginning of each minute, so I guess CPU usage is driven mainly by the CQs. I assume RP operations have a negligible effect in comparison; please tell me if that’s not the case.
I’m still confused about why it uses only one of the 4 CPUs it was given.
I’m starting to suspect, though, that this has nothing to do with my problem, i.e. time aggregation computing wrong results. Since the RP duration on the base measurement is so short I couldn’t find any definitive evidence, but it looks as though some line duplication took place anyway.

Yet another update:
I’m still experiencing wrong aggregations, but today I found another odd fact: point duplication within the same measurement. I mean the same tag set and the same timestamp; I thought that was impossible.
Here’s the evidence:
>select * from st_v.usrs limit 20
name: usrs
time                 brwsr clnt cnt cntry   cty dvMn dvOs dvTp n_files usr
----                 ----- ---- --- -----   --- ---- ---- ---- ------- ---
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
1499130000000000000  0     ""   ""  3175395 0   0    0    1    1       a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86
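For comparison, my understanding is that writing the same series with the same timestamp twice should leave a single point, the newer field set silently overwriting the older one. A minimal sketch from the influx CLI, using a hypothetical throwaway measurement (dup_test is not part of my schema):

> INSERT dup_test,usr=a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86 n_files=1i 1499130000000000000
> INSERT dup_test,usr=a4c63dc1-2ffd-42c6-9e52-3f7fbf5d3e86 n_files=1i 1499130000000000000
> SELECT * FROM dup_test

I’d expect that SELECT to return a single row, which is why the twenty identical rows above surprise me.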

Sorry, I’ve been dormant here for a while… if your previous posting is indeed correct then I would absolutely submit that as a bug report on GitHub. That should definitely not be the case AFAIK, but to be honest I haven’t worked at that time resolution before (6 digits), only ever gone subsecond.

No problem, I kept updating the thread only because I happened to have more free time than usual to spend on Influx. Right now I’m actually really busy; I’d love to reproduce the bug in a VM before I open the bug report. I’ll see to it as soon as I can!

Anyway thank you very much for your time! :smile:
