We have been using InfluxDB for a few years and introduced continuous queries over a year ago, with no issues until now. However, with no changes that I can think of (other than updating Grafana, which only reads the data and should be unrelated to this issue), we are seeing very strange inaccuracies in our reporting of response time metrics. The underlying data is fine (we retain it for a number of days); however, the hourly rollup is returning completely incorrect aggregate values that seem to actually resolve themselves after a few queries of the data (which makes no sense at all).
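For context, the hourly rollup is produced by a continuous query along these lines (a sketch; the database name, source measurement, and source field here are illustrative, only the target "five_years"."aggr_api_request_time" and the "mean_time" field come from the queries below):

CREATE CONTINUOUS QUERY "cq_aggr_api_request_time" ON "api_metrics"
BEGIN
  SELECT mean("response_time") AS "mean_time"
  INTO "five_years"."aggr_api_request_time"
  FROM "api_request_time"
  GROUP BY time(1h), *
END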
Example result:
> SELECT mean("mean_time") FROM "five_years"."aggr_api_request_time" WHERE (path =~ /\/mobile\/v1\/standings\/.*/) AND time >= now() - 2d and time <= now() GROUP BY time(1h) fill(null)
name: aggr_api_request_time
time mean
---- ----
1676066400000000000
1676070000000000000 254.58870967741936
1676073600000000000 198.06296296296296
1676077200000000000 203.04296875
1676080800000000000 225.34444444444443
1676084400000000000 220.07692307692307
1676088000000000000 203.62385321100916
1676091600000000000 187.17083333333332
1676095200000000000 176.3125
1676098800000000000 149.54761904761904
1676102400000000000 239.13513513513513
1676106000000000000 48.2
1676109600000000000 308.0967741935484
1676113200000000000 81.75
1676116800000000000 187.52941176470588
1676120400000000000 272.375
1676124000000000000 282.4449152542373
1676127600000000000 330.1735537190083
1676131200000000000 360.17164179104475
1676134800000000000 1333908.2593984962
1676138400000000000 1684393.0467625898
1676142000000000000 1622533.2448275862
1676145600000000000 1284304.7123287672
1676149200000000000 1564758.7659574468
1676152800000000000 1351429.3455882352
1676156400000000000 1282479.6984126985
1676160000000000000 1036306.8633093525
1676163600000000000 1147136.9664634147
1676167200000000000 1071668.812903226
1676170800000000000 858072.0947712419
1676174400000000000 798995.1812080537
1676178000000000000 1041145.193877551
1676181600000000000 630366.698630137
1676185200000000000 534394.1020408163
1676188800000000000 916300.2142857143
1676192400000000000 313064.8888888889
1676196000000000000 1321057.04
1676199600000000000 537574.8780487805
1676203200000000000 627612.6333333333
1676206800000000000 1005444.1927710844
1676210400000000000 942260.2222222222
1676214000000000000 738505.3214285715
1676217600000000000 1013028.8658536585
1676221200000000000 765660.0402684563
1676224800000000000 652543.8866279069
1676228400000000000 640709.3970588235
1676232000000000000 762668.6612903225
1676235600000000000 937971.1646706586
1676239200000000000
The same query run just a few minutes later (the bad values had obviously persisted for quite some time before correcting themselves). Note that the most recent rollup interval is again incorrect:
> SELECT mean("mean_time") FROM "five_years"."aggr_api_request_time" WHERE (path =~ /\/mobile\/v1\/standings\/.*/) AND time >= now() - 2d and time <= now() GROUP BY time(1h) fill(null)
name: aggr_api_request_time
time mean
---- ----
1676070000000000000
1676073600000000000 198.06296296296296
1676077200000000000 203.04296875
1676080800000000000 225.34444444444443
1676084400000000000 220.07692307692307
1676088000000000000 203.62385321100916
1676091600000000000 187.17083333333332
1676095200000000000 176.3125
1676098800000000000 149.54761904761904
1676102400000000000 239.13513513513513
1676106000000000000 48.2
1676109600000000000 308.0967741935484
1676113200000000000 81.75
1676116800000000000 187.52941176470588
1676120400000000000 272.375
1676124000000000000 282.4449152542373
1676127600000000000 330.1735537190083
1676131200000000000 360.17164179104475
1676134800000000000 383.7689393939394
1676138400000000000 532.068345323741
1676142000000000000 505.65862068965515
1676145600000000000 420.1472602739726
1676149200000000000 489.7464788732394
1676152800000000000 397.2977941176471
1676156400000000000 404.531746031746
1676160000000000000 328.7482014388489
1676163600000000000 394.77245508982037
1676167200000000000 330.7548387096774
1676170800000000000 263.3235294117647
1676174400000000000 212.0234899328859
1676178000000000000 214.39795918367346
1676181600000000000 123.42465753424658
1676185200000000000 108.3265306122449
1676188800000000000 194.9047619047619
1676192400000000000 63.111111111111114
1676196000000000000 327.6
1676199600000000000 115.82926829268293
1676203200000000000 139.58333333333334
1676206800000000000 205.3012048192771
1676210400000000000 182.92857142857142
1676214000000000000 150.78571428571428
1676217600000000000 230.85365853658536
1676221200000000000 202.46308724832215
1676224800000000000 145.50872093023256
1676228400000000000 169.8735294117647
1676232000000000000 200.5215053763441
1676235600000000000 237.79640718562874
1676239200000000000 548623.5808383233
1676242800000000000
Any insights here? It's not an issue with incorrect reporting from the app (we know the non-rollup data is fine), and it's not an issue with the calculation itself, as the values actually fix themselves after a period of time or in response to some kind of trigger.
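For reference, this is the kind of spot check we run to confirm the raw data is fine for an hour whose rollup was wrong (a sketch; the retention policy, raw measurement, and field names are assumptions, since only the rollup measurement appears above; the time bounds are the 1676134800000000000 bucket that read 1333908.26 in the first result and 383.77 in the second):

SELECT mean("response_time") FROM "several_days"."api_request_time" WHERE (path =~ /\/mobile\/v1\/standings\/.*/) AND time >= 1676134800000000000 AND time < 1676138400000000000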
> show diagnostics
name: build
Branch Build Time Commit Version
------ ---------- ------ -------
1.8 688e697c51fd 1.8.10
name: config
bind-address reporting-disabled
------------ ------------------
:8088 false
name: config-coordinator
log-queries-after max-concurrent-queries max-select-buckets max-select-point max-select-series query-timeout write-timeout
----------------- ---------------------- ------------------ ---------------- ----------------- ------------- -------------
0s 0 0 0 0 0s 10s
name: config-cqs
enabled query-stats-enabled run-interval
------- ------------------- ------------
true false 1s
name: config-data
cache-max-memory-size cache-snapshot-memory-size cache-snapshot-write-cold-duration compact-full-write-cold-duration dir max-concurrent-compactions max-index-log-file-size max-series-per-database max-values-per-tag series-file-max-concurrent-compactions series-id-set-cache-size strict-error-handling wal-dir wal-fsync-delay
--------------------- -------------------------- ---------------------------------- -------------------------------- --- -------------------------- ----------------------- ----------------------- ------------------ -------------------------------------- ------------------------ --------------------- ------- ---------------
1073741824 26214400 10m0s 4h0m0s /var/lib/influxdb/data 0 1048576 1000000 100000 0 100 false /var/lib/influxdb/wal 0s
name: config-httpd
access-log-path bind-address enabled https-enabled max-connection-limit max-row-limit
--------------- ------------ ------- ------------- -------------------- -------------
:8086 true false 0 0
name: config-meta
dir
---
/var/lib/influxdb/meta
name: config-monitor
store-database store-enabled store-interval
-------------- ------------- --------------
_internal true 10s
name: config-precreator
advance-period check-interval enabled
-------------- -------------- -------
30m0s 10m0s true
name: config-retention
check-interval enabled
-------------- -------
30m0s true
name: config-subscriber
enabled http-timeout write-buffer-size write-concurrency
------- ------------ ----------------- -----------------
true 30s 1000 40
name: network
hostname
--------
influxdb-b9858c6bd-x2swq
name: runtime
GOARCH GOMAXPROCS GOOS version
------ ---------- ---- -------
amd64 2 linux go1.13.8
name: system
PID currentTime started uptime
--- ----------- ------- ------
1 2023-02-12T23:04:01.089054092Z 2022-09-15T17:45:59.653198863Z 3605h18m1.435855229s