Out-of-memory on backup + queries

Purpose
I wanted to verify backup / restore of a live InfluxDB database.

Environment

InfluxDB 1.8.1 running in Docker on an Ubuntu 18.04 VM on a Windows 10 Pro host.
Simultaneously, some read-only queries are launched through the Python library from the Windows host.
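The client side is nothing special; below is a minimal sketch of what the query loop does, with hypothetical helper names and only the standard library, to show the request shape and the retry-on-timeout behaviour (the real loop uses the Python influxdb client with a 30 s read timeout):

```python
import urllib.parse

def build_query_url(host, db, measurement, upto_iso, port=8086):
    """Build the /query URL for a SELECT last(*) request.
    (Hypothetical helper; the real client does this internally.)"""
    q = "SELECT last(*) FROM {} WHERE time <= '{}'".format(measurement, upto_iso)
    return "http://{}:{}/query?{}".format(
        host, port, urllib.parse.urlencode({"db": db, "q": q}))

def query_with_retry(do_query, retries=3):
    """Call do_query() and retry on TimeoutError, mirroring what the
    driving script does when the client raises a read timeout."""
    last_exc = None
    for _ in range(retries):
        try:
            return do_query()
        except TimeoutError as exc:
            last_exc = exc
    raise last_exc
```

The URL-encoded query string produced this way matches the `GET /query?db=dbMeas&q=SELECT+last%28%2A%29+...` entries visible in the HTTP access log.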

Effect

  • Before start of backup:
    Queries run fine, about 2 seconds each.

  • Now start the backup (from within the Docker container):
    influxd backup -portable ./backup/

  • Queries start failing:
    Sometimes with this message:
    HTTPConnectionPool(host='192.168.123.4', port=8086): Read timed out.
    Sometimes with the timeout value included:
    HTTPConnectionPool(host='192.168.123.4', port=8086): Read timed out. (read timeout=30).
    From the client side, this is similar to the behaviour described in my previous post (which was not conclusively answered): Python query results in "Read timed out"

  • The backup just stops at a certain point.
    The backup directory then contains:
    -rw-r--r-- 1 root root 1296 Sep 16 08:56 20200916T085634Z.meta
    -rw------- 1 root root 14539886 Sep 16 08:56 20200916T085634Z.s3.tar.gz
    -rw------- 1 root root 409930032 Sep 16 08:57 20200916T085634Z.s4.tar.gz
    -rw------- 1 root root 380261790 Sep 16 08:57 20200916T085634Z.s5.tar.gz
    -rw-r--r-- 1 root root 259618302 Sep 16 09:04 dbMeas.autogen.00029.00.pending

  • Storage availability:
    Usage on / increased by about 1GB, which matches the size of the backup directory.
    Plenty of storage capacity remains.

  • Memory:
    System has 16GB.
    Normal operations: about 4GB in use
    During backup + query: 16GB in use

  • After the issue:

    • Retry the queries: they still fail.
      A single query makes memory jump from 4GB to 16GB (in 92 seconds).
      The query then fails (with the timeout shown above).
      CPU goes to 100%.
    • Kill the queries (Ctrl-C):
      Influx remains at high CPU and high memory for > 5 minutes,
      then drops to normal levels (because it crashes and automatically restarts).
    • Rebooting the Linux host and restarting the InfluxDB Docker container causes:
      Influx to restart “normally”.
      Once a query is launched, high CPU/memory again, which causes InfluxDB to restart.
    • Upgrading from 1.8.1 to image 1.8.2 does not improve the issue.
    • Performing the query inside the influx CLI (at its command prompt) results in the same issue. It therefore does not seem related to the Python library or its version.

Logging

  • Log at the moment of the backup operation
    2020-09-16T08:56:54.252852Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T10:00:00Z’”}
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:56:54 +0000] “GET /query?db=dbMeas&q=SELECT+last%28%2A%29+FROM+BBAP+WHERE+time+%3C%3D+%272020-08-20T10%3A00%3A00Z%27 HTTP/1.1” 200 7258 “-” “python-requests/2.24.0” 9175dc74-f7fa-11ea-af10-0242ac140003 2937973
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:56:58 +0000] “POST /write?db=dbMeas&precision=u HTTP/1.1” 204 0 “-” “python-requests/2.24.0” 93cb39e8-f7fa-11ea-af11-0242ac140003 15239
    2020-09-16T08:56:58.186329Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T11:00:00Z’”}
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:56:58 +0000] “GET /query?db=dbMeas&q=SELECT+last%28%2A%29+FROM+BBAP+WHERE+time+%3C%3D+%272020-08-20T11%3A00%3A00Z%27 HTTP/1.1” 200 7256 “-” “python-requests/2.24.0” 93ce0f9e-f7fa-11ea-af12-0242ac140003 2999755
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:57:02 +0000] “POST /write?db=dbMeas&precision=u HTTP/1.1” 204 0 “-” “python-requests/2.24.0” 962ab686-f7fa-11ea-af13-0242ac140003 13399
    2020-09-16T08:57:02.165499Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T12:00:00Z’”}
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:57:02 +0000] “GET /query?db=dbMeas&q=SELECT+last%28%2A%29+FROM+BBAP+WHERE+time+%3C%3D+%272020-08-20T12%3A00%3A00Z%27 HTTP/1.1” 200 7258 “-” “python-requests/2.24.0” 962d3ca0-f7fa-11ea-af14-0242ac140003 3040121
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:57:06 +0000] “POST /write?db=dbMeas&precision=u HTTP/1.1” 204 0 “-” “python-requests/2.24.0” 98960130-f7fa-11ea-af15-0242ac140003 16187
    2020-09-16T08:57:06.227117Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T13:00:00Z’”}
    2020-09-16T08:57:08.000959Z info Cache snapshot (start) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxYd0000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “start”}
    2020-09-16T08:57:08.150366Z info Snapshot for path written {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxYd0000”, “op_name”: “tsm1_cache_snapshot”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/5”, “duration”: “149.425ms”}
    2020-09-16T08:57:08.150769Z info Cache snapshot (end) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxYd0000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “end”, “op_elapsed”: “149.826ms”}
    2020-09-16T08:57:08.151126Z info Cache snapshot (start) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxZDl000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “start”}
    2020-09-16T08:57:08.151481Z info Snapshot for path written {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxZDl000”, “op_name”: “tsm1_cache_snapshot”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/5”, “duration”: “0.360ms”}
    2020-09-16T08:57:08.151796Z info Cache snapshot (end) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxZDl000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “end”, “op_elapsed”: “0.673ms”}
    2020-09-16T08:57:08.152694Z info Cache snapshot (start) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxZE0000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “start”}
    2020-09-16T08:57:08.153043Z info Snapshot for path written {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxZE0000”, “op_name”: “tsm1_cache_snapshot”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/5”, “duration”: “0.354ms”}
    2020-09-16T08:57:08.153329Z info Cache snapshot (end) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXxZE0000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “end”, “op_elapsed”: “0.638ms”}
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:57:06 +0000] “GET /query?db=dbMeas&q=SELECT+last%28%2A%29+FROM+BBAP+WHERE+time+%3C%3D+%272020-08-20T13%3A00%3A00Z%27 HTTP/1.1” 200 7527 “-” “python-requests/2.24.0” 9898fd9b-f7fa-11ea-af16-0242ac140003 1999587
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:08:57:09 +0000] “POST /write?db=dbMeas&precision=u HTTP/1.1” 204 0 “-” “python-requests/2.24.0” 9a5d355a-f7fa-11ea-af17-0242ac140003 6505
    2020-09-16T08:57:09.200308Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T14:00:00Z’”}
    2020-09-16T08:57:40.227054Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T14:00:00Z’”}
    2020-09-16T08:57:47.285878Z info Cache snapshot (start) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXzx5G000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “start”}
    2020-09-16T08:57:50.199264Z info Snapshot for path written {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXzx5G000”, “op_name”: “tsm1_cache_snapshot”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/29”, “duration”: “2913.408ms”}
    2020-09-16T08:57:50.199706Z info Cache snapshot (end) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHXzx5G000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “end”, “op_elapsed”: “2913.848ms”}
    2020-09-16T08:57:50.200622Z info Cache snapshot (start) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHX~7U0000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “start”}
    2020-09-16T08:57:50.200997Z info Snapshot for path written {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHX~7U0000”, “op_name”: “tsm1_cache_snapshot”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/29”, “duration”: “0.383ms”}
    2020-09-16T08:57:50.201319Z info Cache snapshot (end) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHX~7U0000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “end”, “op_elapsed”: “0.701ms”}
    2020-09-16T08:57:50.201668Z info Cache snapshot (start) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHX~7UG000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “start”}
    2020-09-16T08:57:50.201993Z info Snapshot for path written {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHX~7UG000”, “op_name”: “tsm1_cache_snapshot”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/29”, “duration”: “0.329ms”}
    2020-09-16T08:57:50.202309Z info Cache snapshot (end) {“log_id”: “0OpEd4OW000”, “engine”: “tsm1”, “trace_id”: “0PHX~7UG000”, “op_name”: “tsm1_cache_snapshot”, “op_event”: “end”, “op_elapsed”: “0.643ms”}
    2020-09-16T08:58:11.649056Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T14:00:00Z’”}
    2020-09-16T08:58:44.167503Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T15:00:00Z’”}
    2020-09-16T08:59:17.649781Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T15:00:00Z’”}
    2020-09-16T08:59:53.787673Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T15:00:00Z’”}
    2020-09-16T09:00:29.261036Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T16:00:00Z’”}
    2020-09-16T09:01:23.629055Z info failed to store statistics {“log_id”: “0OpEd4OW000”, “service”: “monitor”, “error”: “timeout”}
    2020-09-16T09:01:28.717779Z info Executing query {“log_id”: “0OpEd4OW000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-20T16:00:00Z’”}
    2020-09-16T09:02:05.021257Z info failed to store statistics {“log_id”: “0OpEd4OW000”, “service”: “monitor”, “error”: “timeout”}
    [httpd] 192.168.123.1 - BDBUser [16/Sep/2020:09:01:10 +0000] “GET /query?db=dbMeas&q=SELECT+last%28%2A%29+FROM+BBAP+WHERE+time+%3C%3D+%272020-08-20T16%3A00%3A00Z%27 HTTP/1.1” 200 60 “-” “python-requests/2.24.0” 2a633b2a-f7fb-11ea-af1f-0242ac140003 94571698
    2020-09-16T09:02:57.746893Z info failed to store statistics {“log_id”: “0OpEd4OW000”, “service”: “monitor”, “error”: “timeout”}
    2020-09-16T09:03:35.089833Z info failed to store statistics {“log_id”: “0OpEd4OW000”, “service”: “monitor”, “error”: “timeout”}
    2020-09-16T09:03:55.958611Z info failed to store statistics {“log_id”: “0OpEd4OW000”, “service”: “monitor”, “error”: “timeout”}
    fatal error: runtime: out of memory
    runtime stack:
    runtime.throw(0x166acb8, 0x16)
    /usr/local/go/src/runtime/panic.go:774 +0x72
    runtime.sysMap(0xc510000000, 0x4000000, 0x35aec78)
    /usr/local/go/src/runtime/mem_linux.go:169 +0xc5
    runtime.(*mheap).sysAlloc(0x3595c80, 0x756000, 0xdc000, 0x0)
    /usr/local/go/src/runtime/malloc.go:701 +0x1cd
    (… cut a bit to stay in post limit…)
    /usr/local/go/src/runtime/asm_amd64.s:370 +0x66
    runtime.mstart()
    /usr/local/go/src/runtime/proc.go:1146

    goroutine 14069512 [running]:
    runtime.systemstack_switch()
    /usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc004e74ed0 sp=0xc004e74ec8 pc=0x45ee90
    runtime.mallocgc(0x755160, 0x1570e40, 0x13e7401, 0xc00cb21a10)
    /usr/local/go/src/runtime/malloc.go:1032 +0x895 fp=0xc004e74f70 sp=0xc004e74ed0 pc=0x40e405
    runtime.makeslice(0x1570e40, 0x0, 0x75516, 0xc2cf698000)
    /usr/local/go/src/runtime/slice.go:49 +0x6c fp=0xc004e74fa0 sp=0xc004e74f70 pc=0x44789c
    github.com/influxdata/influxdb/tsdb/engine/tsm1.FloatValues.Merge(0xc00a29c000, 0x3e8, 0x3e8, 0xc2cf698000, 0x7512e, 0x7512e, 0x3e8, 0x3e8, 0x3e8)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/encoding.gen.go:438 +0x15f fp=0xc004e75030 sp=0xc004e74fa0 pc=0x120dcbf
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*KeyCursor).ReadFloatBlock(0xc00adfd320, 0xc00cf311f0, 0x0, 0x123c295, 0xc0001fea90, 0xc06afeff50, 0xc0001fea90)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/file_store.gen.go:173 +0xba0 fp=0xc004e75150 sp=0xc004e75030 pc=0x12354a0
    github.com/influxdata/influxdb/tsdb/engine/tsm1.newFloatDescendingCursor(0x162cfeb0133ac000, 0x0, 0x0, 0x0, 0xc00adfd320, 0xc0001fea90)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/iterator.gen.go:515 +0x107 fp=0xc004e751c0 sp=0xc004e75150 pc=0x12492b7
    github.com/influxdata/influxdb/tsdb/engine/tsm1.newFloatCursor(0x162cfeb0133ac000, 0x2549000, 0x0, 0x0, 0x0, 0xc00adfd320, 0x162cfeb0133ac000, 0xc00cf31000)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/iterator.gen.go:372 +0xb6 fp=0xc004e75200 sp=0xc004e751c0 pc=0x1248b36
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).buildFloatCursor(0xc00072d180, 0x25490a0, 0xc06afeff50, 0xc00e23c700, 0x4, 0xc0003b4b40, 0x14, 0xc0088c57e0, 0x15, 0x253e320, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.gen.go:20 +0x26b fp=0xc004e75490 sp=0xc004e75200 pc=0x121a7bb
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).buildCursor(0xc00072d180, 0x25490a0, 0xc06afeff50, 0xc00e23c700, 0x4, 0xc0003b4b40, 0x14, 0xc002312ed0, 0x1, 0x1, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2963 +0xa7d fp=0xc004e75690 sp=0xc004e75490 pc=0x123252d
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createVarRefSeriesIterator(0xc00072d180, 0x25490a0, 0xc06afeff50, 0xc0041f18e0, 0xc00e23c700, 0x4, 0xc0003b4b40, 0x14, 0xc00cf31130, 0x0, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2736 +0x1980 fp=0xc004e75a78 sp=0xc004e75690 pc=0x1231970
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createTagSetGroupIterators(0xc00072d180, 0x25490a0, 0xc06afeff50, 0xc0041f18e0, 0xc00e23c700, 0x4, 0xc0033fc550, 0x1, 0x1, 0xc00cf31130, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2691 +0x1af fp=0xc004e75c98 sp=0xc004e75a78 pc=0x122fbbf
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createTagSetIterators.func1(0xc00ba630a0, 0xc00072d180, 0x25490a0, 0xc06afeff50, 0xc0041f18e0, 0xc00e23c700, 0x4, 0xc02e260de0, 0x1, 0x1, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2651 +0x17f fp=0xc004e75f78 sp=0xc004e75c98 pc=0x127a13f
    runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc004e75f80 sp=0xc004e75f78 pc=0x460f61
    created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createTagSetIterators
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2649 +0x399

  • Logging after restart, just launching query, no backup running
    2020-09-16T10:58:59.935121Z info InfluxDB starting {“log_id”: “0PHdvpsG000”, “version”: “1.8.1”, “branch”: “1.8”, “commit”: “af0237819ab9c5997c1c0144862dc762b9d8fc25”}
    2020-09-16T10:58:59.935291Z info Go runtime {“log_id”: “0PHdvpsG000”, “version”: “go1.13.8”, “maxprocs”: 1}
    2020-09-16T10:59:00.035977Z info Using data dir {“log_id”: “0PHdvpsG000”, “service”: “store”, “path”: “/var/lib/influxdb/data”}
    2020-09-16T10:59:00.036008Z info Compaction settings {“log_id”: “0PHdvpsG000”, “service”: “store”, “max_concurrent_compactions”: 1, “throughput_bytes_per_second”: 50331648, “throughput_bytes_per_second_burst”: 50331648}
    2020-09-16T10:59:00.036021Z info Open store (start) {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “op_event”: “start”}
    2020-09-16T10:59:00.041432Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/22/000000004-000000002.tsm”, “id”: 0, “duration”: “0.185ms”}
    2020-09-16T10:59:00.086220Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/22”, “duration”: “49.327ms”}
    2020-09-16T10:59:00.087278Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/23/000000003-000000002.tsm”, “id”: 0, “duration”: “0.159ms”}
    2020-09-16T10:59:00.114756Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/23”, “duration”: “27.953ms”}
    2020-09-16T10:59:00.115762Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/24/000000004-000000002.tsm”, “id”: 0, “duration”: “0.135ms”}
    2020-09-16T10:59:00.135778Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/24”, “duration”: “20.472ms”}
    2020-09-16T10:59:00.136771Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/25/000000001-000000001.tsm”, “id”: 0, “duration”: “0.111ms”}
    2020-09-16T10:59:00.156026Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/25”, “duration”: “19.678ms”}
    2020-09-16T10:59:00.157023Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/26/000000002-000000002.tsm”, “id”: 0, “duration”: “0.124ms”}
    2020-09-16T10:59:00.175908Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/26”, “duration”: “19.329ms”}
    2020-09-16T10:59:00.176966Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/27/000000003-000000002.tsm”, “id”: 0, “duration”: “0.131ms”}
    2020-09-16T10:59:00.196604Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/27”, “duration”: “20.103ms”}
    2020-09-16T10:59:00.197623Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/28/000000004-000000002.tsm”, “id”: 0, “duration”: “0.139ms”}
    2020-09-16T10:59:00.216059Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/28”, “duration”: “18.895ms”}
    2020-09-16T10:59:00.217132Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/_internal/monitor/31/000000001-000000001.tsm”, “id”: 0, “duration”: “0.105ms”}
    2020-09-16T10:59:00.217700Z info Reading file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “cacheloader”, “path”: “/var/lib/influxdb/wal/_internal/monitor/31/_00005.wal”, “size”: 10500255}
    2020-09-16T10:59:00.827089Z info Reading file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “cacheloader”, “path”: “/var/lib/influxdb/wal/_internal/monitor/31/_00006.wal”, “size”: 10487530}
    2020-09-16T10:59:01.423870Z info Reading file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “cacheloader”, “path”: “/var/lib/influxdb/wal/_internal/monitor/31/_00007.wal”, “size”: 10488877}
    2020-09-16T10:59:02.145961Z info Reading file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “cacheloader”, “path”: “/var/lib/influxdb/wal/_internal/monitor/31/_00008.wal”, “size”: 1135272}
    2020-09-16T10:59:02.303970Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/_internal/monitor/31”, “duration”: “2087.289ms”}
    2020-09-16T10:59:02.305407Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/29/000000128-000000001.tsm”, “id”: 1, “duration”: “0.100ms”}
    2020-09-16T10:59:02.306038Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/29/000000127-000000002.tsm”, “id”: 0, “duration”: “0.232ms”}
    2020-09-16T10:59:02.319848Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/29”, “duration”: “15.269ms”}
    2020-09-16T10:59:02.320782Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/3/000000005-000000001.tsm”, “id”: 1, “duration”: “0.064ms”}
    2020-09-16T10:59:02.321268Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/3/000000004-000000002.tsm”, “id”: 0, “duration”: “0.092ms”}
    2020-09-16T10:59:02.339783Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/3”, “duration”: “19.397ms”}
    2020-09-16T10:59:02.340906Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/30/000000032-000000003.tsm”, “id”: 0, “duration”: “0.130ms”}
    2020-09-16T10:59:02.359422Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/30”, “duration”: “19.102ms”}
    2020-09-16T10:59:02.360845Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/4/000000128-000000001.tsm”, “id”: 1, “duration”: “0.100ms”}
    2020-09-16T10:59:02.361508Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/4/000000127-000000002.tsm”, “id”: 0, “duration”: “0.230ms”}
    2020-09-16T10:59:02.379655Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/4”, “duration”: “19.690ms”}
    2020-09-16T10:59:02.380972Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/5/000000117-000000001.tsm”, “id”: 1, “duration”: “0.094ms”}
    2020-09-16T10:59:02.381680Z info Opened file {“log_id”: “0PHdvpsG000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/5/000000116-000000002.tsm”, “id”: 0, “duration”: “0.229ms”}
    2020-09-16T10:59:02.400102Z info Opened shard {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/dbMeas/autogen/5”, “duration”: “19.946ms”}
    2020-09-16T10:59:02.400886Z info Open store (end) {“log_id”: “0PHdvpsG000”, “service”: “store”, “trace_id”: “0PHdvqH0000”, “op_name”: “tsdb_open”, “op_event”: “end”, “op_elapsed”: “2364.865ms”}
    2020-09-16T10:59:02.401649Z info Opened service {“log_id”: “0PHdvpsG000”, “service”: “subscriber”}
    2020-09-16T10:59:02.402075Z info Starting monitor service {“log_id”: “0PHdvpsG000”, “service”: “monitor”}
    2020-09-16T10:59:02.402521Z info Registered diagnostics client {“log_id”: “0PHdvpsG000”, “service”: “monitor”, “name”: “build”}
    2020-09-16T10:59:02.402904Z info Registered diagnostics client {“log_id”: “0PHdvpsG000”, “service”: “monitor”, “name”: “runtime”}
    2020-09-16T10:59:02.403321Z info Registered diagnostics client {“log_id”: “0PHdvpsG000”, “service”: “monitor”, “name”: “network”}
    2020-09-16T10:59:02.403733Z info Registered diagnostics client {“log_id”: “0PHdvpsG000”, “service”: “monitor”, “name”: “system”}
    2020-09-16T10:59:02.404124Z info Starting precreation service {“log_id”: “0PHdvpsG000”, “service”: “shard-precreation”, “check_interval”: “10m”, “advance_period”: “30m”}
    2020-09-16T10:59:02.404590Z info Starting snapshot service {“log_id”: “0PHdvpsG000”, “service”: “snapshot”}
    2020-09-16T10:59:02.404660Z info Starting continuous query service {“log_id”: “0PHdvpsG000”, “service”: “continuous_querier”}
    2020-09-16T10:59:02.404682Z info Starting HTTP service {“log_id”: “0PHdvpsG000”, “service”: “httpd”, “authentication”: true}
    2020-09-16T10:59:02.404699Z info opened HTTP access log {“log_id”: “0PHdvpsG000”, “service”: “httpd”, “path”: “stderr”}
    2020-09-16T10:59:02.404716Z info Auth is enabled but shared-secret is blank. BearerAuthentication is disabled. {“log_id”: “0PHdvpsG000”, “service”: “httpd”}
    2020-09-16T10:59:02.404924Z info Listening on HTTP {“log_id”: “0PHdvpsG000”, “service”: “httpd”, “addr”: “[::]:8086”, “https”: false}
    2020-09-16T10:59:02.405074Z info Starting retention policy enforcement service {“log_id”: “0PHdvpsG000”, “service”: “retention”, “check_interval”: “30m”}
    2020-09-16T10:59:02.405212Z info Listening for signals {“log_id”: “0PHdvpsG000”}
    2020-09-16T10:59:02.404284Z info Storing statistics {“log_id”: “0PHdvpsG000”, “service”: “monitor”, “db_instance”: “_internal”, “db_rp”: “monitor”, “interval”: “10s”}
    2020-09-16T10:59:02.406025Z info Sending usage statistics to usage.influxdata.com {“log_id”: “0PHdvpsG000”}
    [httpd] 127.0.0.1 - - [16/Sep/2020:10:59:32 +0000] “GET /ping HTTP/1.1” 204 0 “-” “InfluxDBShell/1.8.1” b3103bac-f80b-11ea-8001-0242ac140003 12463
    2020-09-16T10:59:42.046742Z info Executing query {“log_id”: “0PHdvpsG000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-18T06:00:00Z’”}
    2020-09-16T11:00:12.755068Z info Executing query {“log_id”: “0PHdvpsG000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-18T06:00:00Z’”}
    2020-09-16T11:00:44.204467Z info Executing query {“log_id”: “0PHdvpsG000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-18T06:00:00Z’”}
    2020-09-16T11:01:16.295800Z info Executing query {“log_id”: “0PHdvpsG000”, “service”: “query”, “query”: “SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time <= ‘2020-08-18T07:00:00Z’”}
    2020-09-16T11:06:51.709863Z info failed to store statistics {“log_id”: “0PHdvpsG000”, “service”: “monitor”, “error”: “timeout”}
    2020-09-16T11:07:40.786842Z info failed to store statistics {“log_id”: “0PHdvpsG000”, “service”: “monitor”, “error”: “timeout”}
    fatal error: runtime: out of memory
    runtime stack:
    runtime.throw(0x166acb8, 0x16)
    /usr/local/go/src/runtime/panic.go:774 +0x72
    runtime.sysMap(0xc5cc000000, 0x4000000, 0x35aec78)
    /usr/local/go/src/runtime/mem_linux.go:169 +0xc5
    runtime.(*mheap).sysAlloc(0x3595c80, 0x125a000, 0x3595c80, 0x7f9253320360)
    /usr/local/go/src/runtime/malloc.go:701 +0x1cd
    (… cut a bit to stay in post limit…)
    runtime.systemstack(0x0)
    /usr/local/go/src/runtime/asm_amd64.s:370 +0x66
    runtime.mstart()
    /usr/local/go/src/runtime/proc.go:1146

    goroutine 4899 [running]:
    runtime.systemstack_switch()
    /usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc04b50eed0 sp=0xc04b50eec8 pc=0x45ee90
    runtime.mallocgc(0x1259f90, 0x1570e40, 0x13e7401, 0xc00164d7a0)
    /usr/local/go/src/runtime/malloc.go:1032 +0x895 fp=0xc04b50ef70 sp=0xc04b50eed0 pc=0x40e405
    runtime.makeslice(0x1570e40, 0x0, 0x1259f9, 0xc1d00b6000)
    /usr/local/go/src/runtime/slice.go:49 +0x6c fp=0xc04b50efa0 sp=0xc04b50ef70 pc=0x44789c
    github.com/influxdata/influxdb/tsdb/engine/tsm1.FloatValues.Merge(0xc004f4e000, 0x3e8, 0x3e8, 0xc1d00b6000, 0x125611, 0x125611, 0x3e8, 0x3e8, 0x3e8)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/encoding.gen.go:438 +0x15f fp=0xc04b50f030 sp=0xc04b50efa0 pc=0x120dcbf
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*KeyCursor).ReadFloatBlock(0xc027c56bd0, 0xc02d766ac0, 0x0, 0x123c295, 0xc001545520, 0xc004f90b10, 0xc001545520)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/file_store.gen.go:173 +0xba0 fp=0xc04b50f150 sp=0xc04b50f030 pc=0x12354a0
    github.com/influxdata/influxdb/tsdb/engine/tsm1.newFloatDescendingCursor(0x162c47556ad7c000, 0x0, 0x0, 0x0, 0xc027c56bd0, 0xc001545520)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/iterator.gen.go:515 +0x107 fp=0xc04b50f1c0 sp=0xc04b50f150 pc=0x12492b7
    github.com/influxdata/influxdb/tsdb/engine/tsm1.newFloatCursor(0x162c47556ad7c000, 0x2549000, 0x0, 0x0, 0x0, 0xc027c56bd0, 0x162c47556ad7c000, 0xc02d766a00)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/iterator.gen.go:372 +0xb6 fp=0xc04b50f200 sp=0xc04b50f1c0 pc=0x1248b36
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).buildFloatCursor(0xc00017f180, 0x25490a0, 0xc004f90b10, 0xc004cefd50, 0x4, 0xc00138f3a0, 0x14, 0xc00160b7e0, 0x14, 0x253e320, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.gen.go:20 +0x26b fp=0xc04b50f490 sp=0xc04b50f200 pc=0x121a7bb
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).buildCursor(0xc00017f180, 0x25490a0, 0xc004f90b10, 0xc004cefd50, 0x4, 0xc00138f3a0, 0x14, 0xc02fe0a1e0, 0x1, 0x1, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2963 +0xa7d fp=0xc04b50f690 sp=0xc04b50f490 pc=0x123252d
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createVarRefSeriesIterator(0xc00017f180, 0x25490a0, 0xc004f90b10, 0xc0050fb3a0, 0xc004cefd50, 0x4, 0xc00138f3a0, 0x14, 0xc02d766a50, 0x0, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2736 +0x1980 fp=0xc04b50fa78 sp=0xc04b50f690 pc=0x1231970
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createTagSetGroupIterators(0xc00017f180, 0x25490a0, 0xc004f90b10, 0xc0050fb3a0, 0xc004cefd50, 0x4, 0xc001cb7e00, 0x1, 0x1, 0xc02d766a50, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2691 +0x1af fp=0xc04b50fc98 sp=0xc04b50fa78 pc=0x122fbbf
    github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createTagSetIterators.func1(0xc0013cbb20, 0xc00017f180, 0x25490a0, 0xc004f90b10, 0xc0050fb3a0, 0xc004cefd50, 0x4, 0xc00d454300, 0x1, 0x1, …)
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2651 +0x17f fp=0xc04b50ff78 sp=0xc04b50fc98 pc=0x127a13f
    runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc04b50ff80 sp=0xc04b50ff78 pc=0x460f61
    created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*Engine).createTagSetIterators
    /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/engine.go:2649 +0x399

Additional info: it seems Ubuntu killed influxd because the system ran out of memory.

Sep 16 13:11:22 ubuntu1804ip4 kernel: [ 3538.483705] Out of memory: Kill process 13910 (influxd) score 953 or sacrifice child
Sep 16 13:11:22 ubuntu1804ip4 kernel: [ 3538.484156] Killed process 13910 (influxd) total-vm:30931432kB, anon-rss:18434156kB, file-rss:0kB, shmem-rss:0kB
Sep 16 13:11:22 ubuntu1804ip4 kernel: [ 3539.178199] oom_reaper: reaped process 13910 (influxd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I’ll change some config and report back asap.

As far as the “out-of-memory” is concerned, a switch to tsi1 indexes solves the issue.
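For anyone following along, the switch is a one-line change in influxdb.conf (a sketch assuming the stock config layout); existing shards additionally need to be converted with `influx_inspect buildtsi`:

```toml
[data]
  # default is "inmem"; "tsi1" keeps the series index on disk,
  # memory-mapped on demand instead of held entirely in RAM
  index-version = "tsi1"
```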

Next, I retried the combination of backup and running queries. The queries were barely influenced by the running backup and both still completed fine. No out-of-memory, no timeouts.

Is such a dramatic difference with the default in-memory (“inmem”) index expected? My database only has two measurements, each with 1 series. The first measurement has 1300 fields, the second about 15. One month of data is about 2GB.

Conclusion: problem solved, root cause not completely understood.

… I was too fast…

It now just took quite a bit longer, but influxd still got killed by Ubuntu because of an out-of-memory panic.
The system now seemed to use additional swap before crashing, which is why it took longer.
Right before it died, influxd had about 22GB in use (of which 6GB in swap).

I’ve already seen a lot of requests about how to avoid out-of-memory in InfluxDB, but never a definitive answer from InfluxData. My desire: influxd lives within the memory I assign it and adjusts its behavior accordingly.

Using the TSI (on-disk) version will generally reduce memory pressure, as the index is then memory-mapped as needed instead of residing entirely in memory. You have seen this already.

Unfortunately, there is no maximum memory setting. You will need to allocate more resources (memory) to your vm or reduce the complexity of your queries. Generally, the OS manages memory mapped files and will attempt to provide as much virtual space as possible.

Thanks for your answer.

Isn’t this very strange for a production-db?
That means that my customer can ask any question to the db and in some cases the db will just crash.
There is no possibility for me to:

  • degrade performance: the query still succeeds, but becomes slow
  • fail gracefully: the user gets a reply that the query exceeded the available memory and was aborted, while all other queries/updates keep running fine.

Do I understand this correctly? In that case, I guess we will not be adopting Influx.
Am I overlooking something here? Maybe version 2 (currently in beta) has such an option?

Johan

@qootec - Johan -

Sorry for my delayed reply, I’ve been on holiday. It’s true that we aren’t satisfied with InfluxDB 1.x memory management options. It’s something we’re working on actively for the next major storage engine architecture. Our goals include a memory maximum that the system operates within, and graceful degradation instead of crashing. InfluxDB v2 (release candidate) has some controls for the memory that queries can use and for the number of queries that can execute at once. See this PR. The default is unlimited, so adjust the query-memory-bytes flag down to something in the hundreds of megabytes range to start with.

On v1.8, there is also a query timeout value on the server side which will kill queries that take too long, such as in your case where the system is additionally taxed with running backups. This is different from the read timeout in the client, as you likely know!
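For reference, that server-side timeout lives in the `[coordinator]` section of influxdb.conf (a sketch; the value shown is illustrative):

```toml
[coordinator]
  # abort any query running longer than this ("0s", the default, disables it)
  query-timeout = "60s"
```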

There are a few other suggestions in this post. I think series-id-set-cache-size might be the most useful to you. I don’t know how many tags you have, but if you have many tags creating many series, this cache could be large. I don’t think you’ll gain much (~1GB max) from adjusting the TSM cache size down, since it’s only about 6% of your available memory, but it could help some.
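Both knobs sit in the `[data]` section of influxdb.conf (a sketch; the values shown are the 1.8 defaults, not recommendations):

```toml
[data]
  # TSI series id set cache; large tag cardinality makes this grow
  series-id-set-cache-size = 100
  # TSM write cache ceiling (~6% of a 16 GB system at this default)
  cache-max-memory-size = "1g"
```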

My coworker did point out that 1300 fields within a single measurement is a lot. The query SELECT last(*) is materializing all of them into memory. I believe computationally this query looks at each field individually for the newest value. If you can subset this query, it will perform better. Are you running this every 4 seconds?

How many shards do you have? Information about your data schema and the queries you are interested in running would be helpful. It does seem to me that the problem isn’t in the queries but in the backup happening concurrently. There are options to segment the backup process by time range, shard, and/or retention policy. Reducing what you back up at once should help: Back up and restore data in InfluxDB v1.8 | InfluxDB OSS 1.8 Documentation
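For example (a sketch; `-db`, `-start`, and `-end` are real 1.8 backup flags, but the time range here is illustrative):

```shell
# back up one database, one week at a time, instead of everything at once
influxd backup -portable -db dbMeas \
  -start 2020-09-01T00:00:00Z -end 2020-09-08T00:00:00Z \
  ./backup/
```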

Let us know what you figure out. You can try getting a pprof heap profile too when the memory use is high.

I doubt this makes you feel better, but you are not alone here: Issues · influxdata/influxdb · GitHub

Further reading, though you likely already have a firm grasp of this: In-memory indexing and the Time-Structured Merge Tree (TSM) | InfluxDB OSS 1.8 Documentation


Hi,

I hope you enjoyed your holidays. Good to have you back.
As a reply to your questions:

  • General structure is as described in the linked thread:
    InfluxDB evaluation scenario - memory/performance issue - #3 by qootec that you closed in favour of the current one.

    • So, two measurements:
      • First: 1 tag (single value) and 1300 fields
      • Second: 1 tag (single value) and 15 fields
        The remaining properties are kept the same throughout the experiments, as described in that other thread.
    • The 1300 fields is a lot, but their data content is sparse.
      • For a single timestamp, sometimes only one field has a value, sometimes 10, sometimes 1000…
      • Some fields will have values almost every 50ms, other fields only have a daily value or less.
    • The fields all reflect properties of one given system. It does not feel natural to split them across different measurements.
  • You are (understandably) wondering about my intentions with this query “select last(*)…”

    • As described above, some fields don’t even have daily values.
      This happens because we log data values only “on change” and some values just don’t change that often.
      In several operational queries, we need their value though.
      If, for instance, I want to query some aggregate information from today and an involved field did not change today, the query will not yield any value for that field.
    • I’ve been looking at some other posts that talk about “last-value outside of time range”, but the answer seems to be that Influx does not have a way to obtain that.
    • As a compensation/workaround, I was walking over my database, inserting the last value of each hour at hour+5ms.
      That way, my data contains at least one value for every field, every hour. We could live with that.
    • You noticed that I was running this kind of query every 4 seconds; that is because the query took about that long and I just scheduled them to run sequentially.
      After each query finishes, I write one timestamp with 1300 fields at hour+5ms and start over for the next hour.
    • Final intention would be to run that process once daily (e.g. at midnight+1h) for every hour of the previous day (to compensate for late arriving data).
  • I have been playing with the number of shards.
    One database used daily shards for a year (but currently only 50 days of data is captured).
    The other uses weekly shards, also with data covering about 2 months.
    Full data size is currently about 2GB (full influx’ data directory).
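The hour+5ms backfill described above can be sketched as follows (a minimal sketch: it assumes the influxdb-python client's `SELECT last(*)` result shape, and `build_backfill_point` is my name, not existing code):

```python
from datetime import datetime, timedelta, timezone


def build_backfill_point(measurement, last_values, hour_start):
    """Collapse the result of `SELECT last(*)` over one hour into a
    single point at hour_start + 5 ms, so every field gets at least
    one value per hour.

    `last_values` maps 'last_<field>' -> value, as the InfluxDB 1.x
    API returns it; fields with no data in the window come back as
    None and are dropped.
    """
    fields = {
        key[len("last_"):]: value
        for key, value in last_values.items()
        if key.startswith("last_") and value is not None
    }
    return {
        "measurement": measurement,
        "time": (hour_start + timedelta(milliseconds=5)).isoformat(),
        "fields": fields,
    }
```

The resulting point would then be written back with something like `InfluxDBClient.write_points([point])` for each hour of the previous day.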

Best regards,
Johan

I have been studying this scenario somewhat further today and I really can’t follow what is happening.
I perform the query
SELECT last(*) FROM … WHERE time >= timestamp-1d and time <=timestamp
that determines the last values of 1300 fields during the last day before a given timestamp.
Successful executions of this query take about 30 seconds each time (connected to influx on localhost)

A few iterations are successful, and then suddenly it starts failing.
When I look at the logs, I see Influx apparently executing my query multiple times (every 30 seconds) before finally failing to respond. Or is this “intermediate status logging” from Influx, indicating that it is still busy with this query?

(log lines starting with Me come from my system, those with influxdb come from Influx.)

Me: 2020-10-02 14:01:27,494 | InfluxUpdateLast | INFO | Handling '2020-10-02T05:00:00Z'.
influxdb    | [httpd] 104.45.2.57 - DBUser [02/Oct/2020:14:00:38 +0000] "POST /write?db=dbMeas&precision=m HTTP/1.1" 204 0 "-" "python-requests/2.24.0" a6d71ceb-04b7-11eb-8d8e-0242ac180002 19295
influxdb    | 2020-10-02T14:01:27.498452Z       info    Executing query {"log_id": "0PbNLBil000", "service": "query", "query": "SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time >= '2020-10-02T05:00:00Z' - 1d AND time <= '2020-10-02T05:00:00Z'"}
influxdb    | 2020-10-02T14:01:57.583399Z       info    Executing query {"log_id": "0PbNLBil000", "service": "query", "query": "SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time >= '2020-10-02T05:00:00Z' - 1d AND time <= '2020-10-02T05:00:00Z'"}
influxdb    | 2020-10-02T14:02:28.418107Z       info    Executing query {"log_id": "0PbNLBil000", "service": "query", "query": "SELECT last(*) FROM dbMeas.autogen.BBAP WHERE time >= '2020-10-02T05:00:00Z' - 1d AND time <= '2020-10-02T05:00:00Z'"}
Me: 2020-10-02 14:02:58,445 | InfluxUpdateLast | ERROR | + 90.95=   90.95s  - Period '2020-10-02T05:00:00Z' fails with exception: HTTPConnectionPool(host='localhost', port=8086): Read timed out..

The `Executing query` info messages are logged once per statement in a query. Since you have just one statement, your log file indicates the query was submitted 3 times. Did you say you were running the query multiple times (“a few iterations”)?
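One possible explanation worth ruling out (an assumption on my side): the influxdb-python client retries a timed-out request, with 3 attempts by default (its `retries` parameter), which would produce exactly three server-side `Executing query` lines from a single client call:

```python
# Back-of-the-envelope check (an assumption, not confirmed behaviour):
# three client attempts, each hitting the 30 s read timeout, would
# match the three "Executing query" log lines and the ~90.95 s
# client-side failure reported above.
default_retries = 3
read_timeout_s = 30
total_wait_s = default_retries * read_timeout_s  # close to the 90.95 s in the log
```

Passing `retries=1` to `InfluxDBClient` disables the automatic retransmission, which would make the logs easier to interpret.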

I like this idea:

Final intention would be to run that process once daily (e.g. at midnight+1h) for every hour of the previous day (to compensate for late arriving data).

Did we talk about continuous queries? They are designed for regular downsampling or regular data-processing needs. They have the same performance impact as running the query manually, but they might make automating your data processing easier. They could not, however, implement your suggestion of a once-daily process, which I think is likely superior for your use case.
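For completeness, a hypothetical CQ along these lines (measurement and CQ names are illustrative; this mirrors the manual last(*) query on an hourly schedule, not the once-daily process):

```sql
-- Every hour, write the last value of each field from the past hour
-- into a companion measurement
CREATE CONTINUOUS QUERY "cq_last_hourly" ON "dbMeas"
BEGIN
  SELECT last(*) INTO "BBAP_last" FROM "BBAP" GROUP BY time(1h)
END
```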