We have been using InfluxDB for a few years and introduced continuous queries over a year ago, with no issues until now. However, with no changes that I can think of (other than updating Grafana, which only reads the data and should be unrelated to this issue), we are seeing very strange inaccuracies in our reporting of response time metrics. The underlying data is fine (we retain it for a number of days); however, the hourly rollup is returning completely incorrect aggregate values that seem to actually resolve themselves after a few queries of the data (which makes no sense at all).
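For context, the hourly rollup is produced by a continuous query along these lines (a sketch; the database name, source measurement, and source field here are illustrative, only the target "five_years"."aggr_api_request_time" and the "mean_time" field come from the queries below):

CREATE CONTINUOUS QUERY "cq_aggr_api_request_time" ON "api_metrics"
BEGIN
  SELECT mean("response_time") AS "mean_time"
  INTO "five_years"."aggr_api_request_time"
  FROM "api_request_time"
  GROUP BY time(1h), *
END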
Example result:
> SELECT mean("mean_time") FROM "five_years"."aggr_api_request_time" WHERE (path =~ /\/mobile\/v1\/standings\/.*/) AND time >= now() - 2d and time <= now() GROUP BY time(1h) fill(null)
name: aggr_api_request_time
time mean
---- ----
1676066400000000000
1676070000000000000 254.58870967741936
1676073600000000000 198.06296296296296
1676077200000000000 203.04296875
1676080800000000000 225.34444444444443
1676084400000000000 220.07692307692307
1676088000000000000 203.62385321100916
1676091600000000000 187.17083333333332
1676095200000000000 176.3125
1676098800000000000 149.54761904761904
1676102400000000000 239.13513513513513
1676106000000000000 48.2
1676109600000000000 308.0967741935484
1676113200000000000 81.75
1676116800000000000 187.52941176470588
1676120400000000000 272.375
1676124000000000000 282.4449152542373
1676127600000000000 330.1735537190083
1676131200000000000 360.17164179104475
1676134800000000000 1333908.2593984962
1676138400000000000 1684393.0467625898
1676142000000000000 1622533.2448275862
1676145600000000000 1284304.7123287672
1676149200000000000 1564758.7659574468
1676152800000000000 1351429.3455882352
1676156400000000000 1282479.6984126985
1676160000000000000 1036306.8633093525
1676163600000000000 1147136.9664634147
1676167200000000000 1071668.812903226
1676170800000000000 858072.0947712419
1676174400000000000 798995.1812080537
1676178000000000000 1041145.193877551
1676181600000000000 630366.698630137
1676185200000000000 534394.1020408163
1676188800000000000 916300.2142857143
1676192400000000000 313064.8888888889
1676196000000000000 1321057.04
1676199600000000000 537574.8780487805
1676203200000000000 627612.6333333333
1676206800000000000 1005444.1927710844
1676210400000000000 942260.2222222222
1676214000000000000 738505.3214285715
1676217600000000000 1013028.8658536585
1676221200000000000 765660.0402684563
1676224800000000000 652543.8866279069
1676228400000000000 640709.3970588235
1676232000000000000 762668.6612903225
1676235600000000000 937971.1646706586
1676239200000000000
The same query run just a few minutes later (the bad values had obviously persisted for quite some time before correcting themselves). Note that the most recent rollup interval is again incorrect:
> SELECT mean("mean_time") FROM "five_years"."aggr_api_request_time" WHERE (path =~ /\/mobile\/v1\/standings\/.*/) AND time >= now() - 2d and time <= now() GROUP BY time(1h) fill(null)
name: aggr_api_request_time
time mean
---- ----
1676070000000000000
1676073600000000000 198.06296296296296
1676077200000000000 203.04296875
1676080800000000000 225.34444444444443
1676084400000000000 220.07692307692307
1676088000000000000 203.62385321100916
1676091600000000000 187.17083333333332
1676095200000000000 176.3125
1676098800000000000 149.54761904761904
1676102400000000000 239.13513513513513
1676106000000000000 48.2
1676109600000000000 308.0967741935484
1676113200000000000 81.75
1676116800000000000 187.52941176470588
1676120400000000000 272.375
1676124000000000000 282.4449152542373
1676127600000000000 330.1735537190083
1676131200000000000 360.17164179104475
1676134800000000000 383.7689393939394
1676138400000000000 532.068345323741
1676142000000000000 505.65862068965515
1676145600000000000 420.1472602739726
1676149200000000000 489.7464788732394
1676152800000000000 397.2977941176471
1676156400000000000 404.531746031746
1676160000000000000 328.7482014388489
1676163600000000000 394.77245508982037
1676167200000000000 330.7548387096774
1676170800000000000 263.3235294117647
1676174400000000000 212.0234899328859
1676178000000000000 214.39795918367346
1676181600000000000 123.42465753424658
1676185200000000000 108.3265306122449
1676188800000000000 194.9047619047619
1676192400000000000 63.111111111111114
1676196000000000000 327.6
1676199600000000000 115.82926829268293
1676203200000000000 139.58333333333334
1676206800000000000 205.3012048192771
1676210400000000000 182.92857142857142
1676214000000000000 150.78571428571428
1676217600000000000 230.85365853658536
1676221200000000000 202.46308724832215
1676224800000000000 145.50872093023256
1676228400000000000 169.8735294117647
1676232000000000000 200.5215053763441
1676235600000000000 237.79640718562874
1676239200000000000 548623.5808383233
1676242800000000000
Any insights here? It's not an issue with incorrect reporting from the app (we know the non-rollup data is fine), and it's not an issue with the calculation itself, as the values actually fix themselves after a period of time or in response to some kind of trigger.
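For reference, this is the kind of spot check we run to confirm the raw data is fine for an hour whose rollup was wrong (a sketch; the retention policy, raw measurement, and field names are assumptions, since only the rollup measurement appears above; the time bounds are the 1676134800000000000 bucket that read 1333908.26 in the first result and 383.77 in the second):

SELECT mean("response_time") FROM "several_days"."api_request_time" WHERE (path =~ /\/mobile\/v1\/standings\/.*/) AND time >= 1676134800000000000 AND time < 1676138400000000000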
> show diagnostics
name: build
Branch Build Time Commit Version
------ ---------- ------ -------
1.8 688e697c51fd 1.8.10
name: config
bind-address reporting-disabled
------------ ------------------
:8088 false
name: config-coordinator
log-queries-after max-concurrent-queries max-select-buckets max-select-point max-select-series query-timeout write-timeout
----------------- ---------------------- ------------------ ---------------- ----------------- ------------- -------------
0s 0 0 0 0 0s 10s
name: config-cqs
enabled query-stats-enabled run-interval
------- ------------------- ------------
true false 1s
name: config-data
cache-max-memory-size cache-snapshot-memory-size cache-snapshot-write-cold-duration compact-full-write-cold-duration dir max-concurrent-compactions max-index-log-file-size max-series-per-database max-values-per-tag series-file-max-concurrent-compactions series-id-set-cache-size strict-error-handling wal-dir wal-fsync-delay
--------------------- -------------------------- ---------------------------------- -------------------------------- --- -------------------------- ----------------------- ----------------------- ------------------ -------------------------------------- ------------------------ --------------------- ------- ---------------
1073741824 26214400 10m0s 4h0m0s /var/lib/influxdb/data 0 1048576 1000000 100000 0 100 false /var/lib/influxdb/wal 0s
name: config-httpd
access-log-path bind-address enabled https-enabled max-connection-limit max-row-limit
--------------- ------------ ------- ------------- -------------------- -------------
:8086 true false 0 0
name: config-meta
dir
---
/var/lib/influxdb/meta
name: config-monitor
store-database store-enabled store-interval
-------------- ------------- --------------
_internal true 10s
name: config-precreator
advance-period check-interval enabled
-------------- -------------- -------
30m0s 10m0s true
name: config-retention
check-interval enabled
-------------- -------
30m0s true
name: config-subscriber
enabled http-timeout write-buffer-size write-concurrency
------- ------------ ----------------- -----------------
true 30s 1000 40
name: network
hostname
--------
influxdb-b9858c6bd-x2swq
name: runtime
GOARCH GOMAXPROCS GOOS version
------ ---------- ---- -------
amd64 2 linux go1.13.8
name: system
PID currentTime started uptime
--- ----------- ------- ------
1 2023-02-12T23:04:01.089054092Z 2022-09-15T17:45:59.653198863Z 3605h18m1.435855229s