InfluxDB crashes every Monday at 0 am

Hi,
we have InfluxDB 2.4, OpenSource version, running on Linux CentOS 7. We are inserting tag values from machines into InfluxDB via NodeRED, which is connected to KepwareEx (data chain: machines -> KepwareEx -> OPC UA -> NodeRED -> InfluxDB -> Grafana). There are approx. 300 inserts per second, and the bucket's data retention is set to 1 year.

Problem: every Monday at 0 am UTC (we are UTC+2) InfluxDB stops responding and returns HTTP 5xx errors. I cannot connect to the bucket via the web GUI, and we must restart the influxdb service. We had the same issue with version 2.0; we then upgraded to 2.4 and the issue still persists. We monitor InfluxDB and the Linux server with Prometheus, but there is nothing useful there.

Have you had a look at the logs? Can you share them?

I'm not completely sure, but given the 1-year retention I expect the index compaction to run once a week, and it may lead to an out-of-memory error.
This process can be limited via config options like storage-max-concurrent-compactions.
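For illustration, a minimal sketch of what that could look like (assuming the default config file at /etc/influxdb/config.toml; the value of 1 is a starting point to tune, not a verified fix):

    # /etc/influxdb/config.toml (sketch)
    # Cap the number of full/level compactions that can run in parallel.
    # 1 serializes compactions: they take longer, but the memory/CPU spike is smaller.
    storage-max-concurrent-compactions = 1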

The log is deleted: InfluxDB kept failing after the restart, so we rebooted the machine, and journaling was not set to persistent (the journalctl log was wiped by the reboot). It is now set to persistent, so I can post something next Monday … .

My InfluxDB config:

{
        "assets-path": "",
        "bolt-path": "/var/lib/influxdb/influxd.bolt",
        "e2e-testing": false,
        "engine-path": "/var/lib/influxdb/engine",
        "feature-flags": null,
        "flux-log-enabled": false,
        "hardening-enabled": false,
        "http-bind-address": ":8086",
        "http-idle-timeout": 180000000000,
        "http-read-header-timeout": 10000000000,
        "http-read-timeout": 0,
        "http-write-timeout": 0,
        "influxql-max-select-buckets": 0,
        "influxql-max-select-point": 0,
        "influxql-max-select-series": 0,
        "instance-id": "",
        "log-level": "info",
        "metrics-disabled": false,
        "nats-max-payload-bytes": 0,
        "nats-port": 0,
        "no-tasks": false,
        "pprof-disabled": false,
        "query-concurrency": 1024,
        "query-initial-memory-bytes": 0,
        "query-max-memory-bytes": 0,
        "query-memory-bytes": 0,
        "query-queue-size": 1024,
        "reporting-disabled": false,
        "secret-store": "bolt",
        "session-length": 60,
        "session-renew-disabled": false,
        "sqlite-path": "/var/lib/influxdb/influxd.sqlite",
        "storage-cache-max-memory-size": 1073741824,
        "storage-cache-snapshot-memory-size": 26214400,
        "storage-cache-snapshot-write-cold-duration": "10m0s",
        "storage-compact-full-write-cold-duration": "4h0m0s",
        "storage-compact-throughput-burst": 50331648,
        "storage-max-concurrent-compactions": 0,
        "storage-max-index-log-file-size": 1048576,
        "storage-no-validate-field-size": false,
        "storage-retention-check-interval": "30m0s",
        "storage-series-file-max-concurrent-snapshot-compactions": 0,
        "storage-series-id-set-cache-size": 0,
        "storage-shard-precreator-advance-period": "30m0s",
        "storage-shard-precreator-check-interval": "10m0s",
        "storage-tsm-use-madv-willneed": false,
        "storage-validate-keys": false,
        "storage-wal-fsync-delay": "0s",
        "storage-wal-max-concurrent-writes": 0,
        "storage-wal-max-write-delay": 600000000000,
        "storage-write-timeout": 10000000000,
        "store": "disk",
        "testing-always-allow-setup": false,
        "tls-cert": "",
        "tls-key": "",
        "tls-min-version": "1.2",
        "tls-strict-ciphers": false,
        "tracing-type": "",
        "ui-disabled": false,
        "vault-addr": "",
        "vault-cacert": "",
        "vault-capath": "",
        "vault-client-cert": "",
        "vault-client-key": "",
        "vault-client-timeout": 0,
        "vault-max-retries": 0,
        "vault-skip-verify": false,
        "vault-tls-server-name": "",
        "vault-token": ""
}

OOM is not the reason (graph 1), because memory only starts rising after InfluxDB stops responding (500 responses from node-fetch on /api/v2/write go up and 204 write responses fall), so data gets stuck in the NodeRED queue (last 2 graphs). The problem starts every Monday around 2:05 am UTC+2.

Here is log:

After "Reindexing WAL data" there are "Flux query failed" messages from Grafana (checking some alerts), and from InfluxDB I see: lvl=warn msg="internal error not returned to client"

Nothing else … any advice?

[root@svsk0101 ~]# journalctl -S "2022-09-12 00:50:00" -U "2022-09-12 02:30:00"
-- Logs begin at Mon 2022-09-05 09:50:48 CEST, end at Mon 2022-09-12 07:59:14 CEST. --
Sep 12 01:01:01 svsk0101 systemd[1]: Created slice user-0.slice.
Sep 12 01:01:01 svsk0101 systemd[1]: Starting user-0.slice.
Sep 12 01:01:01 svsk0101 CROND[30077]: (root) CMD (run-parts /etc/cron.hourly)
Sep 12 01:01:01 svsk0101 systemd[1]: Started Session 58 of user root.
Sep 12 01:01:01 svsk0101 systemd[1]: Starting Session 58 of user root.
Sep 12 01:01:01 svsk0101 run-parts(/etc/cron.hourly)[30080]: starting 0anacron
Sep 12 01:01:01 svsk0101 anacron[30086]: Anacron started on 2022-09-12
Sep 12 01:01:01 svsk0101 anacron[30086]: Normal exit (0 jobs run)
Sep 12 01:01:01 svsk0101 run-parts(/etc/cron.hourly)[30088]: finished 0anacron
Sep 12 01:01:01 svsk0101 run-parts(/etc/cron.hourly)[30090]: starting 0yum-hourly.cron
Sep 12 01:15:11 svsk0101 run-parts(/etc/cron.hourly)[16998]: finished 0yum-hourly.cron
Sep 12 01:15:11 svsk0101 systemd[1]: Removed slice user-0.slice.
Sep 12 01:15:11 svsk0101 systemd[1]: Stopping user-0.slice.
Sep 12 01:17:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-11T23:17:00.354626Z lvl=info msg="Retention policy deletion check (start)" log_id=0cpebx6W000 service=retention op_name=retention_delete_check op_event=start
Sep 12 01:17:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-11T23:17:00.356010Z lvl=info msg="Retention policy deletion check (end)" log_id=0cpebx6W000 service=retention op_name=retention_delete_check op_event=end op_elapsed=
Sep 12 01:29:56 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-11T23:29:56.352939Z lvl=info msg="Cache snapshot (start)" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
Sep 12 01:29:56 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-11T23:29:56.487131Z lvl=info msg="Snapshot for path written" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot path=/var/lib/influxdb
Sep 12 01:29:56 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-11T23:29:56.487165Z lvl=info msg="Cache snapshot (end)" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=end op_elapsed=134
Sep 12 01:47:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-11T23:47:00.354460Z lvl=info msg="Retention policy deletion check (start)" log_id=0cpebx6W000 service=retention op_name=retention_delete_check op_event=start
Sep 12 01:47:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-11T23:47:00.354668Z lvl=info msg="Retention policy deletion check (end)" log_id=0cpebx6W000 service=retention op_name=retention_delete_check op_event=end op_elapsed=
Sep 12 01:59:59 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-11T23:59:59.940299Z lvl=info msg="index opened with 8 partitions" log_id=0cpebx6W000 service=storage-engine index=tsi
Sep 12 01:59:59 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-11T23:59:59.940879Z lvl=info msg="Reindexing TSM data" log_id=0cpebx6W000 service=storage-engine engine=tsm1 db_shard_id=502
Sep 12 01:59:59 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-11T23:59:59.940892Z lvl=info msg="Reindexing WAL data" log_id=0cpebx6W000 service=storage-engine engine=tsm1 db_shard_id=502
Sep 12 02:00:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:00:00.030844Z lvl=info msg="index opened with 8 partitions" log_id=0cpebx6W000 service=storage-engine index=tsi
Sep 12 02:00:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:00:00.031491Z lvl=info msg="Reindexing TSM data" log_id=0cpebx6W000 service=storage-engine engine=tsm1 db_shard_id=499
Sep 12 02:00:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:00:00.031521Z lvl=info msg="Reindexing WAL data" log_id=0cpebx6W000 service=storage-engine engine=tsm1 db_shard_id=499
Sep 12 02:00:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:00:00.056250Z lvl=info msg="index opened with 8 partitions" log_id=0cpebx6W000 service=storage-engine index=tsi
Sep 12 02:00:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:00:00.056986Z lvl=info msg="Reindexing TSM data" log_id=0cpebx6W000 service=storage-engine engine=tsm1 db_shard_id=498
Sep 12 02:00:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:00:00.057004Z lvl=info msg="Reindexing WAL data" log_id=0cpebx6W000 service=storage-engine engine=tsm1 db_shard_id=498
Sep 12 02:00:04 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:00:04.265651Z lvl=info msg="index opened with 8 partitions" log_id=0cpebx6W000 service=storage-engine index=tsi
Sep 12 02:00:04 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:00:04.268975Z lvl=info msg="Reindexing TSM data" log_id=0cpebx6W000 service=storage-engine engine=tsm1 db_shard_id=506
Sep 12 02:00:04 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:00:04.268992Z lvl=info msg="Reindexing WAL data" log_id=0cpebx6W000 service=storage-engine engine=tsm1 db_shard_id=506
Sep 12 02:01:01 svsk0101 systemd[1]: Created slice user-0.slice.
Sep 12 02:01:01 svsk0101 systemd[1]: Starting user-0.slice.
Sep 12 02:01:01 svsk0101 systemd[1]: Started Session 59 of user root.
Sep 12 02:01:01 svsk0101 systemd[1]: Starting Session 59 of user root.
Sep 12 02:01:01 svsk0101 CROND[14185]: (root) CMD (run-parts /etc/cron.hourly)
Sep 12 02:01:01 svsk0101 run-parts(/etc/cron.hourly)[14188]: starting 0anacron
Sep 12 02:01:01 svsk0101 anacron[14194]: Anacron started on 2022-09-12
Sep 12 02:01:01 svsk0101 anacron[14194]: Normal exit (0 jobs run)
Sep 12 02:01:01 svsk0101 run-parts(/etc/cron.hourly)[14196]: finished 0anacron
Sep 12 02:01:01 svsk0101 run-parts(/etc/cron.hourly)[14198]: starting 0yum-hourly.cron
Sep 12 02:08:21 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:08:21.951015144+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:08:21 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:08:46 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:08:46.947090143+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:08:46 svsk0101 grafana-server[1291]: now()) )\r\n  |> filter(fn: (r) => r[\"tag\"] =~ /[Tt]hermometer/)\r\n  |> last()\r\n  //|> keep( columns: [ \"_time\", \"_value\", \"_measurement\"])\r\n  |> map( fn:(r) => ({ r with _m
Sep 12 02:08:46 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:08:46.948573Z lvl=warn msg="internal error not returned to client" log_id=0cpebx6W000 handler=error_logger error="context canceled"
Sep 12 02:08:52 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:08:52.254908949+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:08:52 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:09:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:09:00.111467Z lvl=info msg="Cache snapshot (start)" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
Sep 12 02:09:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:09:00.122266Z lvl=info msg="Snapshot for path written" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot path=/var/lib/influxdb
Sep 12 02:09:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:09:00.122304Z lvl=info msg="Cache snapshot (end)" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=end op_elapsed=10.
Sep 12 02:09:01 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:09:01.056883Z lvl=info msg="Cache snapshot (start)" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
Sep 12 02:09:01 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:09:01.060706Z lvl=info msg="Snapshot for path written" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot path=/var/lib/influxdb
Sep 12 02:09:01 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:09:01.060726Z lvl=info msg="Cache snapshot (end)" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=end op_elapsed=3.8
Sep 12 02:09:22 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:09:22.259337374+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:09:22 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:09:46 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:09:46.949569726+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:09:46 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:09:46.949688Z lvl=warn msg="internal error not returned to client" log_id=0cpebx6W000 handler=error_logger error="context canceled"
Sep 12 02:09:46 svsk0101 grafana-server[1291]: now()) )\r\n  |> filter(fn: (r) => r[\"tag\"] =~ /[Tt]hermometer/)\r\n  |> last()\r\n  //|> keep( columns: [ \"_time\", \"_value\", \"_measurement\"])\r\n  |> map( fn:(r) => ({ r with _m
Sep 12 02:09:52 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:09:52.26435544+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" query
Sep 12 02:09:52 svsk0101 grafana-server[1291]: r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n/
Sep 12 02:09:54 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:09:54.352552Z lvl=info msg="Cache snapshot (start)" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
Sep 12 02:09:54 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:09:54.379240Z lvl=info msg="Snapshot for path written" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot path=/var/lib/influxdb
Sep 12 02:09:54 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:09:54.379278Z lvl=info msg="Cache snapshot (end)" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=end op_elapsed=26.
Sep 12 02:10:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:10:00.353068Z lvl=info msg="Cache snapshot (start)" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
Sep 12 02:10:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:10:00.427776Z lvl=info msg="Snapshot for path written" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot path=/var/lib/influxdb
Sep 12 02:10:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:10:00.427817Z lvl=info msg="Cache snapshot (end)" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=end op_elapsed=74.
Sep 12 02:10:22 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:10:22.269086076+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:10:22 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:10:36 svsk0101 run-parts(/etc/cron.hourly)[26986]: finished 0yum-hourly.cron
Sep 12 02:10:36 svsk0101 systemd[1]: Removed slice user-0.slice.
Sep 12 02:10:36 svsk0101 systemd[1]: Stopping user-0.slice.
Sep 12 02:10:46 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:10:46.9488609+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" query=
Sep 12 02:10:46 svsk0101 grafana-server[1291]: w()) )\r\n  |> filter(fn: (r) => r[\"tag\"] =~ /[Tt]hermometer/)\r\n  |> last()\r\n  //|> keep( columns: [ \"_time\", \"_value\", \"_measurement\"])\r\n  |> map( fn:(r) => ({ r with _mea
Sep 12 02:10:46 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:10:46.949960Z lvl=warn msg="internal error not returned to client" log_id=0cpebx6W000 handler=error_logger error="context canceled"
Sep 12 02:10:52 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:10:52.273377195+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:10:52 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:11:22 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:11:22.278197374+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:11:22 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:11:41 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:11:41.947088764+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:11:41 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:11:41.947177Z lvl=warn msg="internal error not returned to client" log_id=0cpebx6W000 handler=error_logger error="context canceled"
Sep 12 02:11:41 svsk0101 grafana-server[1291]: now()) )\r\n  |> filter(fn: (r) => r[\"tag\"] =~ /[Tt]hermometer/)\r\n  |> last()\r\n  //|> keep( columns: [ \"_time\", \"_value\", \"_measurement\"])\r\n  |> map( fn:(r) => ({ r with _m
Sep 12 02:11:52 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:11:52.283299916+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:11:52 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:12:22 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:12:22.287889915+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:12:22 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:12:46 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:12:46.947596446+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:12:46 svsk0101 grafana-server[1291]: now()) )\r\n  |> filter(fn: (r) => r[\"tag\"] =~ /[Tt]hermometer/)\r\n  |> last()\r\n  //|> keep( columns: [ \"_time\", \"_value\", \"_measurement\"])\r\n  |> map( fn:(r) => ({ r with _m
Sep 12 02:12:46 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:12:46.948824Z lvl=warn msg="internal error not returned to client" log_id=0cpebx6W000 handler=error_logger error="context canceled"
Sep 12 02:12:52 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:12:52.292701191+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:12:52 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:13:22 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:13:22.297537698+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:13:22 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:13:46 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:13:46.947344502+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:13:46 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:13:46.947522Z lvl=warn msg="internal error not returned to client" log_id=0cpebx6W000 handler=error_logger error="context canceled"
Sep 12 02:13:46 svsk0101 grafana-server[1291]: now()) )\r\n  |> filter(fn: (r) => r[\"tag\"] =~ /[Tt]hermometer/)\r\n  |> last()\r\n  //|> keep( columns: [ \"_time\", \"_value\", \"_measurement\"])\r\n  |> map( fn:(r) => ({ r with _m
Sep 12 02:13:52 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:13:52.302605327+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:13:52 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:14:22 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:14:22.307811205+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:14:22 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:14:46 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:14:46.948930972+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:14:46 svsk0101 grafana-server[1291]: now()) )\r\n  |> filter(fn: (r) => r[\"tag\"] =~ /[Tt]hermometer/)\r\n  |> last()\r\n  //|> keep( columns: [ \"_time\", \"_value\", \"_measurement\"])\r\n  |> map( fn:(r) => ({ r with _m
Sep 12 02:14:46 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:14:46.954278Z lvl=warn msg="internal error not returned to client" log_id=0cpebx6W000 handler=error_logger error="context canceled"
Sep 12 02:14:52 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:14:52.314096329+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:14:52 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:15:22 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:15:22.319044469+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:15:22 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:15:46 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:15:46.947990004+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:15:46 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:15:46.948334Z lvl=warn msg="internal error not returned to client" log_id=0cpebx6W000 handler=error_logger error="context canceled"
Sep 12 02:15:46 svsk0101 grafana-server[1291]: now()) )\r\n  |> filter(fn: (r) => r[\"tag\"] =~ /[Tt]hermometer/)\r\n  |> last()\r\n  //|> keep( columns: [ \"_time\", \"_value\", \"_measurement\"])\r\n  |> map( fn:(r) => ({ r with _m
Sep 12 02:15:52 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:15:52.323842704+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:15:52 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:16:22 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:16:22.329287539+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:16:22 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:16:46 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:16:46.948462523+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:16:46 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:16:46.948570Z lvl=warn msg="internal error not returned to client" log_id=0cpebx6W000 handler=error_logger error="context canceled"
Sep 12 02:16:46 svsk0101 grafana-server[1291]: now()) )\r\n  |> filter(fn: (r) => r[\"tag\"] =~ /[Tt]hermometer/)\r\n  |> last()\r\n  //|> keep( columns: [ \"_time\", \"_value\", \"_measurement\"])\r\n  |> map( fn:(r) => ({ r with _m
Sep 12 02:16:52 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:16:52.334282669+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:16:52 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:17:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:17:00.059911Z lvl=info msg="Cache snapshot (start)" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
Sep 12 02:17:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:17:00.068507Z lvl=info msg="Snapshot for path written" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot path=/var/lib/influxdb
Sep 12 02:17:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:17:00.068559Z lvl=info msg="Cache snapshot (end)" log_id=0cpebx6W000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=end op_elapsed=8.6
Sep 12 02:17:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:17:00.354282Z lvl=info msg="Retention policy deletion check (start)" log_id=0cpebx6W000 service=retention op_name=retention_delete_check op_event=start
Sep 12 02:17:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:17:00.356300Z lvl=info msg="Deleted shard group" log_id=0cpebx6W000 service=retention op_name=retention_delete_check db_instance=8716f6e9e41fde66 db_shard_grou
Sep 12 02:17:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:17:00.358186Z lvl=info msg="Deleted shard group" log_id=0cpebx6W000 service=retention op_name=retention_delete_check db_instance=d374d7eddb702911 db_shard_grou
Sep 12 02:17:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:17:00.362215Z lvl=info msg="Deleted shard" log_id=0cpebx6W000 service=retention op_name=retention_delete_check db_instance=d374d7eddb702911 db_shard_id=476 db_
Sep 12 02:17:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:17:00.364924Z lvl=info msg="Deleted shard" log_id=0cpebx6W000 service=retention op_name=retention_delete_check db_instance=8716f6e9e41fde66 db_shard_id=490 db_
Sep 12 02:17:00 svsk0101 influxd-systemd-start.sh[1295]: ts=2022-09-12T00:17:00.366853Z lvl=info msg="Retention policy deletion check (end)" log_id=0cpebx6W000 service=retention op_name=retention_delete_check op_event=end op_elapsed=
Sep 12 02:17:22 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:17:22.338949243+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:17:22 svsk0101 grafana-server[1291]: \r\n    |> last()\r\n\r\nD = C\r\n\t|> map( fn: (r) => ({\r\n\t\tr with \"alarm\" : if r._value < r.min or r._value > r.max then 1.0 else 0.0 } ) )\r\n    \r\nGRAFANA_ALARM_VALUE = D\r\n
Sep 12 02:17:46 svsk0101 grafana-server[1291]: logger=tsdb.influx_flux t=2022-09-12T02:17:46.948846918+02:00 level=warn msg="Flux query failed" err="Post \"http://localhost:8086/api/v2/query?org=TDK\": context deadline exceeded" quer
Sep 12 02:17:46 svsk0101 grafana-server[1291]: now()) )\r\n  |> filter(fn: (r) => r[\"tag\"] =~ /[Tt]hermometer/)\r\n  |> last()\r\n  //|> keep( columns: [ \"_time\", \"_value\", \"_measurement\"])\r\n  |> map( fn:(r) => ({ r with _m
...
...
...

I came here for the same symptom / a similar issue with InfluxDB v2.4.0. For me it's Sunday night at 8 pm, which happens to be 00:00 Monday UTC.


I have 24G of RAM and 4G of swap. I added more swap, but I'm not sure how to properly size things. I do have a fair bit of data, I suppose. I have a local telegraf agent running in the VM that unloads its collected information once I restart influxdb2.
I will also try

storage-max-concurrent-compactions = 0

in /etc/influxdb/config.toml


Sep 11 23:59:02 influxdb influxd-systemd-start.sh[819]: ts=2022-09-11T23:59:02.737785Z lvl=info msg="Cache snapshot (start)" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
Sep 11 23:59:03 influxdb influxd-systemd-start.sh[819]: ts=2022-09-11T23:59:03.033635Z lvl=info msg="Snapshot for path written" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot path=/var/lib/influxdb/engine/data/fe5285a70f105db0/autogen/369 duration=295.847ms
Sep 11 23:59:03 influxdb influxd-systemd-start.sh[819]: ts=2022-09-11T23:59:03.033888Z lvl=info msg="Cache snapshot (end)" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=end op_elapsed=296.109ms
Sep 11 23:59:14 influxdb loki-linux-amd64[771]: level=info ts=2022-09-11T23:59:14.505899621Z caller=table_manager.go:169 msg="uploading tables"
Sep 11 23:59:23 influxdb influxd-systemd-start.sh[819]: ts=2022-09-11T23:59:23.738899Z lvl=info msg="Cache snapshot (start)" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
Sep 11 23:59:24 influxdb influxd-systemd-start.sh[819]: ts=2022-09-11T23:59:24.030899Z lvl=info msg="Snapshot for path written" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot path=/var/lib/influxdb/engine/data/fe5285a70f105db0/autogen/369 duration=292.009ms
Sep 11 23:59:24 influxdb influxd-systemd-start.sh[819]: ts=2022-09-11T23:59:24.031330Z lvl=info msg="Cache snapshot (end)" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=end op_elapsed=292.438ms
Sep 11 23:59:57 influxdb influxd-systemd-start.sh[819]: ts=2022-09-11T23:59:57.531593Z lvl=info msg="index opened with 8 partitions" log_id=0cn0mGUW000 service=storage-engine index=tsi
Sep 11 23:59:57 influxdb influxd-systemd-start.sh[819]: ts=2022-09-11T23:59:57.532335Z lvl=info msg="Reindexing TSM data" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 db_shard_id=387
Sep 11 23:59:57 influxdb influxd-systemd-start.sh[819]: ts=2022-09-11T23:59:57.532420Z lvl=info msg="Reindexing WAL data" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 db_shard_id=387
Sep 12 00:00:00 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:00.520397Z lvl=info msg="index opened with 8 partitions" log_id=0cn0mGUW000 service=storage-engine index=tsi
Sep 12 00:00:00 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:00.521032Z lvl=info msg="Reindexing TSM data" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 db_shard_id=385
Sep 12 00:00:00 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:00.521125Z lvl=info msg="Reindexing WAL data" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 db_shard_id=385
Sep 12 00:00:00 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:00.557149Z lvl=info msg="index opened with 8 partitions" log_id=0cn0mGUW000 service=storage-engine index=tsi
Sep 12 00:00:00 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:00.557542Z lvl=info msg="Reindexing TSM data" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 db_shard_id=386
Sep 12 00:00:00 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:00.557649Z lvl=info msg="Reindexing WAL data" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 db_shard_id=386
Sep 12 00:00:08 influxdb systemd[1]: Starting Discard unused blocks on filesystems from /etc/fstab...
Sep 12 00:00:08 influxdb systemd[1]: Starting Rotate log files...
Sep 12 00:00:08 influxdb systemd[1]: Starting Daily man-db regeneration...
Sep 12 00:00:08 influxdb logrotate[53119]: error: Ignoring influxdb because it is writable by group or others.
Sep 12 00:00:08 influxdb systemd[1]: man-db.service: Succeeded.
Sep 12 00:00:08 influxdb systemd[1]: Finished Daily man-db regeneration.
Sep 12 00:00:08 influxdb systemd[1]: logrotate.service: Succeeded.
Sep 12 00:00:08 influxdb systemd[1]: Finished Rotate log files.
Sep 12 00:00:08 influxdb fstrim[53117]: /: 73.3 GiB (78673907712 bytes) trimmed on /dev/disk/by-uuid/b3992e1c-58d6-4a55-8c84-72cad5f516b3
Sep 12 00:00:08 influxdb systemd[1]: fstrim.service: Succeeded.
Sep 12 00:00:08 influxdb systemd[1]: Finished Discard unused blocks on filesystems from /etc/fstab.
Sep 12 00:00:11 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:11.807929Z lvl=info msg="index opened with 8 partitions" log_id=0cn0mGUW000 service=storage-engine index=tsi
Sep 12 00:00:11 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:11.808357Z lvl=info msg="Reindexing TSM data" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 db_shard_id=388
Sep 12 00:00:11 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:11.808369Z lvl=info msg="Reindexing WAL data" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 db_shard_id=388
Sep 12 00:00:14 influxdb loki-linux-amd64[771]: level=info ts=2022-09-12T00:00:14.505858643Z caller=table_manager.go:169 msg="uploading tables"
Sep 12 00:00:26 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:26.037237Z lvl=error msg="Unable to write gathered points" log_id=0cn0mGUW000 service=scraper scraper-name="new target" error=timeout
Sep 12 00:00:26 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:26.295457Z lvl=info msg="index opened with 8 partitions" log_id=0cn0mGUW000 service=storage-engine index=tsi
Sep 12 00:00:26 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:26.295879Z lvl=info msg="Reindexing TSM data" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 db_shard_id=389
Sep 12 00:00:26 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:26.295889Z lvl=info msg="Reindexing WAL data" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 db_shard_id=389
Sep 12 00:00:28 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:28.301193Z lvl=info msg="Cache snapshot (start)" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
Sep 12 00:00:28 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:28.595528Z lvl=info msg="Snapshot for path written" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot path=/var/lib/influxdb/engine/data/7a038e6d102418c6/autogen/389 duration=294.325ms
Sep 12 00:00:28 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:28.595558Z lvl=info msg="Cache snapshot (end)" log_id=0cn0mGUW000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=end op_elapsed=294.382ms
Sep 12 00:00:36 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:36.029085Z lvl=error msg="Unable to write gathered points" log_id=0cn0mGUW000 service=scraper scraper-name="new target" error=timeout
Sep 12 00:00:46 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:46.036073Z lvl=error msg="Unable to write gathered points" log_id=0cn0mGUW000 service=scraper scraper-name="new target" error=timeout
Sep 12 00:00:56 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:00:56.038366Z lvl=error msg="Unable to write gathered points" log_id=0cn0mGUW000 service=scraper scraper-name="new target" error=timeout
Sep 12 00:01:06 influxdb influxd-systemd-start.sh[819]: ts=2022-09-12T00:01:06.036601Z lvl=error msg="Unable to write gathered points" log_id=0cn0mGUW000 service=scraper scraper-name="new target" error=timeout

Hello @mdtancsa
Did trying that option help?

@Jarda_K,
I’m not sure what’s going on here. I’m asking around and I’ll get back to you as soon as I hear back. Half of the company is in meetings, though, so there might be a little delay. I appreciate your patience.

Hi @Anaisdg, thanks for checking in! I have not seen the problem again yet, but it seems to happen every Monday at 00:00:01 UTC. Is there a way to force the app to do whatever cleanup it normally does at that time? Or should I just wait?

Hi there,
we are trying to restart InfluxDB automatically via NodeRED when it is not responding (this is not a solution, but we do not want to miss data). It works like this: when a read from NodeRED fails, InfluxDB is restarted via NodeRED after 1 min. The result is this:

At 2 a.m. InfluxDB fails, then it is restarted via NodeRED, and after approx. 12 min it fails again four times in a row … . I think there is some problem with shards. Any idea?
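(In case someone wants the same workaround outside NodeRED: a rough sketch of an equivalent watchdog. It assumes influxd runs as the systemd unit influxdb and listens on localhost:8086; the 10 s timeout, the script path and the cron schedule are just assumptions, not a recommendation.)

    #!/bin/sh
    # influx-watchdog.sh (sketch): restart influxdb when /health stops answering
    # run it from cron, e.g.: * * * * * /usr/local/bin/influx-watchdog.sh
    if ! curl -sf --max-time 10 http://localhost:8086/health >/dev/null; then
        logger -t influx-watchdog "influxd health check failed, restarting"
        systemctl restart influxdb
    fi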

Hi
is there any update regarding this issue?
Every Monday at 02:00 CET (00:00 UTC) we have the same problem:
after some time the server uses all its memory and an OOM happens…

Jul 24 06:10:10 lvpinfluxdb01 kernel: influxd invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Jul 24 06:10:10 lvpinfluxdb01 kernel: CPU: 17 PID: 3727954 Comm: influxd Kdump: loaded Not tainted 4.18.0-425.3.1.el8.x86_64 #1
Jul 24 06:10:10 lvpinfluxdb01 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
Jul 24 06:10:10 lvpinfluxdb01 kernel: Call Trace:
Jul 24 06:10:10 lvpinfluxdb01 kernel: dump_stack+0x41/0x60
Jul 24 06:10:10 lvpinfluxdb01 kernel: dump_header+0x4a/0x1df
Jul 24 06:10:10 lvpinfluxdb01 kernel: oom_kill_process.cold.33+0xb/0x10
Jul 24 06:10:10 lvpinfluxdb01 kernel: out_of_memory+0x1bd/0x4e0
Jul 24 06:10:10 lvpinfluxdb01 kernel: __alloc_pages_slowpath+0xc24/0xd10
Jul 24 06:10:10 lvpinfluxdb01 kernel: __alloc_pages_nodemask+0x2e2/0x320
Jul 24 06:10:10 lvpinfluxdb01 kernel: pagecache_get_page+0xce/0x310
Jul 24 06:10:10 lvpinfluxdb01 kernel: filemap_fault+0x78b/0xa10
Jul 24 06:10:10 lvpinfluxdb01 kernel: ? __mod_lruvec_page_state+0x5e/0x80
Jul 24 06:10:10 lvpinfluxdb01 kernel: ? page_add_file_rmap+0x99/0x130
Jul 24 06:10:10 lvpinfluxdb01 kernel: ? pmd_devmap_trans_unstable+0x2e/0x40
Jul 24 06:10:10 lvpinfluxdb01 kernel: ? alloc_set_pte+0x1f1/0x3f0
Jul 24 06:10:10 lvpinfluxdb01 kernel: ? filemap_map_pages+0x271/0x410
Jul 24 06:10:10 lvpinfluxdb01 kernel: ext4_filemap_fault+0x2c/0x40 [ext4]
Jul 24 06:10:10 lvpinfluxdb01 kernel: __do_fault+0x38/0xc0
Jul 24 06:10:10 lvpinfluxdb01 kernel: handle_pte_fault+0x55d/0x880
Jul 24 06:10:10 lvpinfluxdb01 kernel: __handle_mm_fault+0x453/0x6c0
Jul 24 06:10:10 lvpinfluxdb01 kernel: handle_mm_fault+0xc1/0x1e0
Jul 24 06:10:10 lvpinfluxdb01 kernel: do_user_addr_fault+0x1b9/0x450
Jul 24 06:10:10 lvpinfluxdb01 kernel: do_page_fault+0x37/0x130
Jul 24 06:10:10 lvpinfluxdb01 kernel: ? page_fault+0x8/0x30
Jul 24 06:10:10 lvpinfluxdb01 kernel: page_fault+0x1e/0x30
Jul 24 06:10:10 lvpinfluxdb01 kernel: RIP: 0033:0x7fa9a127aa28
Jul 24 06:10:10 lvpinfluxdb01 kernel: Code: 48 01 f2 48 2b 91 a0 00 00 00 48 89 d6 48 c1 ea 0c 48 8d 14 92 48 c1 e2 02 48 03 91 98 00 00 00 81 e6 ff 0f 00 00 48 c1 ee 08 <8b> 3a 48 83 fe 10 73 75 0f b6 54 32 04 01 fa eb 03 44 89 c2 48 8b
Jul 24 06:10:10 lvpinfluxdb01 kernel: RSP: 002b:000000c0040dd1a8 EFLAGS: 00010206
Jul 24 06:10:10 lvpinfluxdb01 kernel: RAX: 000000000001f66c RBX: fffffffffffffffa RCX: 00007fa9a59418a0
Jul 24 06:10:10 lvpinfluxdb01 kernel: RDX: 00007fa9a381ef2c RSI: 0000000000000006 RDI: 0000000000000001
Jul 24 06:10:10 lvpinfluxdb01 kernel: RBP: 000000c0040dd1b8 R08: 00007fa9a45643e0 R09: 0000000000000008
Jul 24 06:10:10 lvpinfluxdb01 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000c0040dd260
Jul 24 06:10:10 lvpinfluxdb01 kernel: R13: 0000000000000008 R14: 000000c01352c680 R15: 00007fa9a3469717
Jul 24 06:10:10 lvpinfluxdb01 kernel: Mem-Info:

The last message regarding InfluxDB was at 02:11:

Jul 24 02:11:51 lvpinfluxdb01 influxd-systemd-start.sh[3727810]: ts=2023-07-24T00:11:51.455326Z lvl=info msg="Retention policy deletion check (start)" log_id=0j4KJWYl000 service=retention op_name=retention_delete_check op_event=start
Jul 24 02:11:51 lvpinfluxdb01 influxd-systemd-start.sh[3727810]: ts=2023-07-24T00:11:51.466000Z lvl=info msg="Deleted shard group" log_id=0j4KJWYl000 service=retention op_name=retention_delete_check db_instance=89bfaf3146b448a3 db_shard_group=3352 db_rp=autogen
Jul 24 02:11:51 lvpinfluxdb01 influxd-systemd-start.sh[3727810]: ts=2023-07-24T00:11:51.472287Z lvl=info msg="Deleted shard group" log_id=0j4KJWYl000 service=retention op_name=retention_delete_check db_instance=c37cc16d68b2c021 db_shard_group=3238 db_rp=autogen
Jul 24 02:11:51 lvpinfluxdb01 influxd-systemd-start.sh[3727810]: ts=2023-07-24T00:11:51.476061Z lvl=info msg="Deleted shard group" log_id=0j4KJWYl000 service=retention op_name=retention_delete_check db_instance=42f6dde37d2ee914 db_shard_group=3432 db_rp=autogen
Jul 24 02:11:51 lvpinfluxdb01 influxd-systemd-start.sh[3727810]: ts=2023-07-24T00:11:51.482324Z lvl=info msg="Deleted shard group" log_id=0j4KJWYl000 service=retention op_name=retention_delete_check db_instance=1e850838ccb6a0e3 db_shard_group=3240 db_rp=autogen
Jul 24 02:11:51 lvpinfluxdb01 influxd-systemd-start.sh[3727810]: ts=2023-07-24T00:11:51.503265Z lvl=info msg="Deleted shard" log_id=0j4KJWYl000 service=retention op_name=retention_delete_check db_instance=c37cc16d68b2c021 db_shard_id=3239 db_rp=autogen
Jul 24 02:11:51 lvpinfluxdb01 influxd-systemd-start.sh[3727810]: ts=2023-07-24T00:11:51.864439Z lvl=info msg="Deleted shard" log_id=0j4KJWYl000 service=retention op_name=retention_delete_check db_instance=1e850838ccb6a0e3 db_shard_id=3241 db_rp=autogen
Jul 24 02:11:51 lvpinfluxdb01 influxd-systemd-start.sh[3727810]: ts=2023-07-24T00:11:51.889957Z lvl=info msg="Deleted shard" log_id=0j4KJWYl000 service=retention op_name=retention_delete_check db_instance=89bfaf3146b448a3 db_shard_id=3353 db_rp=autogen
Jul 24 02:11:51 lvpinfluxdb01 influxd-systemd-start.sh[3727810]: ts=2023-07-24T00:11:51.890319Z lvl=info msg="Retention policy deletion check (end)" log_id=0j4KJWYl000 service=retention op_name=retention_delete_check op_event=end op_elapsed=435.003ms

thanks
Tomislav

@tmihaldinec Not really, as the OOM is due to hardware sizing…

Did you try changing storage-max-concurrent-compactions?
I solved this (a long time ago) by setting it to 1, meaning there is no parallelism in compaction; it takes longer and consumes fewer resources.
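(If you run influxd via systemd without a config file, the same option should be settable as an environment variable; a sketch of a drop-in override, where the file name is just an example:)

    # /etc/systemd/system/influxdb.service.d/compactions.conf (sketch)
    [Service]
    Environment=INFLUXD_STORAGE_MAX_CONCURRENT_COMPACTIONS=1

(then run systemctl daemon-reload and restart the service)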

Hi
We're running the server with 140GB of RAM, and during normal workload it doesn't consume more than 24G.
The server has 16 CPUs and the FS is on very fast enterprise storage.

The problem is that the bucket is 2TB.

I will play with storage-max-concurrent-compactions, but the issue, as you know, is that I will have to wait until next Monday 02:00 CET.

I was searching and I could not find why it starts only on Mondays at 00:00 UTC… is there any setting where we can change this schedule?

Many thanks
Tomislav

The compaction frequency depends on the RP duration (with a 1-year retention the shard group duration defaults to 7 days, so a shard goes cold and gets fully compacted once a week), and there is no way to change this schedule (afaik). You can check the duration as shown below.
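(To see what your shard group duration actually is, the bucket listing should show it; a sketch assuming the influx CLI is configured with a token, and "mybucket" is a placeholder name:)

    influx bucket list --name mybucket
    # the output includes a "Shard group duration" column next to Retention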

Thanks @Giovanni_Luisotto

I changed storage-max-concurrent-compactions to 1; let's see how it works next Monday at 02:00 CET.

@Anaisdg do you have any comment please?

Thanks
Tomislav