InfluxDB 1.7.4 fails after 9 months without issues

Hi everyone,

first of all I would like to praise InfluxDB. It runs great on the RasPi 3B and has been doing so for about 9 months now. It is running on Raspbian, everything is up to date. I really appreciate that armhf packages for Debian are being built natively, this makes keeping up to date very easy.

While things were great for about the last 9 months, InfluxDB now seems to have failed on me. The whole system runs on an SSD that is only 7% full, so disk space is not the issue. Nothing at all has changed: I have not done any system updates for days, nor even logged into the system for a couple of days. I am using Grafana 6.0.2 to display data from InfluxDB.

Symptoms:

  • The influxd process is between 100% and 350% CPU usage, of course this is not normal
  • Not responding to any http queries or data posts
  • When I try to open the “influx” console I get:
    Failed to connect to http://localhost:8086: Get http://localhost:8086/ping: dial tcp [::1]:8086: connect: connection refused
    Please check your connection settings and ensure ‘influxd’ is running.
    Of course influxd is running.
  • I cannot find any logs. I have not modified the config file in this respect. The folder /var/log/influxdb exists but is empty.
  • Grafana also cannot show data. When it tries to access the DB it reports “Network Error: Bad Gateway(502)”, although Influx is running on the same machine at localhost:8086.
  • I tried to stop grafana and do a manual backup to get some output:
    sudo influxd backup -portable ~/krftwrk_backup/
    2019/03/25 19:02:28 backing up metastore to /home/alessio/krftwrk_backup/meta.00
    2019/03/25 19:02:58 Download shard 0 failed copy backup to file: err=read tcp 127.0.0.1:57698->127.0.0.1:8088: read: connection reset by peer, n=0. Waiting 2s and retrying (0)…
    2019/03/25 19:03:29 Download shard 0 failed copy backup to file: err=read tcp 127.0.0.1:57810->127.0.0.1:8088: read: connection reset by peer, n=0. Waiting 2s and retrying (1)…
    2019/03/25 19:04:01 Download shard 0 failed copy backup to file: err=read tcp 127.0.0.1:57914->127.0.0.1:8088: read: connection reset by peer, n=0. Waiting 2s and retrying (2)…
    2019/03/25 19:04:33 Download shard 0 failed copy backup to file: err=read tcp 127.0.0.1:58018->127.0.0.1:8088: read: connection reset by peer, n=0. Waiting 2s and retrying (3)…
    2019/03/25 19:05:05 Download shard 0 failed copy backup to file: err=read tcp 127.0.0.1:58122->127.0.0.1:8088: read: connection reset by peer, n=0. Waiting 2s and retrying (4)…
    2019/03/25 19:05:33 Download shard 0 failed copy backup to file: err=read tcp 127.0.0.1:58254->127.0.0.1:8088: read: connection reset by peer, n=0. Waiting 2s and retrying (5)…
    2019/03/25 19:06:05 Download shard 0 failed copy backup to file: err=read tcp 127.0.0.1:58334->127.0.0.1:8088: read: connection reset by peer, n=0. Waiting 3.01s and retrying (6)…
    2019/03/25 19:06:38 Download shard 0 failed copy backup to file: err=read tcp 127.0.0.1:58470->127.0.0.1:8088: read: connection reset by peer, n=0. Waiting 11.441s and retrying (7)…
    2019/03/25 19:07:19 Download shard 0 failed copy backup to file: err=read tcp 127.0.0.1:58602->127.0.0.1:8088: read: connection reset by peer, n=0. Waiting 43.477s and retrying (8)…
    2019/03/25 19:08:33 Download shard 0 failed copy backup to file: err=read tcp 127.0.0.1:58854->127.0.0.1:8088: read: connection reset by peer, n=0. Waiting 2m45.216s and retrying (9)…
  • When manually running influxd I get entries like:
    2019-03-25T18:20:06.737327Z info Opened file {“log_id”: “0EPX8OA0000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/machinestats/autogen/470/000000001-000000001.tsm”, “id”: 0, “duration”: “6.258ms”}
    2019-03-25T18:20:06.741604Z info Opened shard {“log_id”: “0EPX8OA0000”, “service”: “store”, “trace_id”: “0EPX8O_0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/machinestats/autogen/459”, “duration”: “18.055ms”}
    2019-03-25T18:20:06.748781Z info Opened file {“log_id”: “0EPX8OA0000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/machinestats/autogen/481/000000001-000000001.tsm”, “id”: 0, “duration”: “3.113ms”}
    2019-03-25T18:20:06.750204Z info Opened shard {“log_id”: “0EPX8OA0000”, “service”: “store”, “trace_id”: “0EPX8O_0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/machinestats/autogen/470”, “duration”: “21.187ms”}
    2019-03-25T18:20:06.754831Z info Opened file {“log_id”: “0EPX8OA0000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/machinestats/autogen/492/000000001-000000001.tsm”, “id”: 0, “duration”: “2.349ms”}
    2019-03-25T18:20:06.762807Z info Opened shard {“log_id”: “0EPX8OA0000”, “service”: “store”, “trace_id”: “0EPX8O_0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/machinestats/autogen/481”, “duration”: “20.889ms”}
    2019-03-25T18:20:06.770042Z info Opened file {“log_id”: “0EPX8OA0000”, “engine”: “tsm1”, “service”: “filestore”, “path”: “/var/lib/influxdb/data/machinestats/autogen/505/000000001-000000001.tsm”, “id”: 0, “duration”: “3.404ms”}
    2019-03-25T18:20:06.771709Z info Opened shard {“log_id”: “0EPX8OA0000”, “service”: “store”, “trace_id”: “0EPX8O_0000”, “op_name”: “tsdb_open”, “index_version”: “inmem”, “path”: “/var/lib/influxdb/data/machinestats/autogen/492”, “duration”: “21.151ms”}

So to be honest I am not quite sure where to look. The DB might be corrupted (but why? There were no unplanned reboots etc.), the power supply is fine, and I am stumped because nothing has changed on the machine. Does anyone have hints where to look?
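A note on the empty /var/log/influxdb for anyone following along: on a systemd-based install the packaged service normally sends influxd's output to the journal rather than to a file, so that is the first place to look. A minimal sketch, assuming the unit is called influxdb (the Debian package default); the function names are made up for illustration:

```shell
# Show the most recent influxd output captured by journald.
# Assumes the service unit is named "influxdb"; $1 is the number of lines.
recent_influxd_logs() {
    journalctl -u influxdb --no-pager -n "${1:-200}"
}

# Keep only the lines that usually matter when the process dies:
# Go runtime fatals, out-of-memory messages and file-descriptor exhaustion.
filter_fatal() {
    grep -E 'fatal error|out of memory|too many open files'
}

# Usage:
#   recent_influxd_logs 500 | filter_fatal
#   journalctl -u influxdb -f        # follow the log live
```

If the journal shows the process restarting over and over, the "connection refused" on port 8086 may simply mean influxd never got as far as opening its HTTP listener.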

Thanks in advance :slight_smile:

Hi, I would like to bring this up again. The following has changed:

  • I left influxd running since my last post. Most of the time the database is unreachable and the poor ARM SoC has been hammered by influxdb with about 340% use across the 4 cores for months.

  • I have updated to each next version and am currently at influxdb version 1.7.7

  • As I have found no easy way to log, I have started the influxdb service in the terminal and piped the output to a file. I also saw it crash after a few minutes. Here are some interesting lines in my opinion:

ts=2019-07-16T07:23:51.367782Z lvl=info msg=“Opened shard” log_id=0GfQT5bG000 service=store trace_id=0GfQT60G000 op_name=tsdb_open index_version=inmem path=/var/lib/influxdb/data/inverter/autogen/814 duration=14491.488ms

As you can see the duration is about 15 seconds! I see durations of 3 to 1000 milliseconds for most entries and then the duration explodes and goes up.
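To find every slow shard instead of spotting them by eye, the captured log can be filtered. This is only a sketch: it assumes the key=value log format shown above with durations always given in ms, and the function name and the 1000 ms default threshold are my own choices.

```shell
# slow_shard_opens LOGFILE [MIN_MS]
# Prints "<duration_ms> <shard path>" for every "Opened shard" line whose
# duration is at least MIN_MS (default 1000), slowest first.
slow_shard_opens() {
    awk -v min="${2:-1000}" '
        /Opened shard/ {
            path = ""; dur = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^path=/)     path = substr($i, 6)
                if ($i ~ /^duration=/) dur  = substr($i, 10)
            }
            sub(/ms$/, "", dur)            # strip the unit, keep the number
            if (path != "" && dur + 0 >= min + 0) print dur, path
        }' "$1" | sort -rn
}

# Usage: slow_shard_opens influxd.log 1000
```

The paths that come out on top are the shards worth inspecting first, for example by checking the size of their wal directory.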

This is the point where things really fall apart in the 2-3 minutes or so that influxd ran before crashing. These are the entries immediately after the 15-second duration above:

ts=2019-07-16T07:23:51.426201Z lvl=info msg=“Open store (end)” log_id=0GfQT5bG000 service=store trace_id=0GfQT60G000 op_name=tsdb_open op_event=end op_elapsed=42752.248ms
ts=2019-07-16T07:23:51.430215Z lvl=info msg=“Opened service” log_id=0GfQT5bG000 service=subscriber
ts=2019-07-16T07:23:51.430406Z lvl=info msg=“Starting monitor service” log_id=0GfQT5bG000 service=monitor
ts=2019-07-16T07:23:51.430528Z lvl=info msg=“Registered diagnostics client” log_id=0GfQT5bG000 service=monitor name=build
ts=2019-07-16T07:23:51.430628Z lvl=info msg=“Registered diagnostics client” log_id=0GfQT5bG000 service=monitor name=runtime
ts=2019-07-16T07:23:51.430699Z lvl=info msg=“Registered diagnostics client” log_id=0GfQT5bG000 service=monitor name=network
ts=2019-07-16T07:23:51.430857Z lvl=info msg=“Registered diagnostics client” log_id=0GfQT5bG000 service=monitor name=system
ts=2019-07-16T07:23:51.431033Z lvl=info msg=“Starting precreation service” log_id=0GfQT5bG000 service=shard-precreation check_interval=10m advance_period=30m
ts=2019-07-16T07:23:51.439967Z lvl=info msg=“Starting snapshot service” log_id=0GfQT5bG000 service=snapshot
ts=2019-07-16T07:23:51.444103Z lvl=info msg=“Starting continuous query service” log_id=0GfQT5bG000 service=continuous_querier
ts=2019-07-16T07:23:51.457765Z lvl=info msg=“Starting HTTP service” log_id=0GfQT5bG000 service=httpd authentication=false
ts=2019-07-16T07:23:51.457950Z lvl=info msg=“opened HTTP access log” log_id=0GfQT5bG000 service=httpd path=stderr
ts=2019-07-16T07:23:51.465478Z lvl=info msg=“Listening on HTTP” log_id=0GfQT5bG000 service=httpd addr=[::]:8086 https=false
ts=2019-07-16T07:23:51.468378Z lvl=info msg=“Starting retention policy enforcement service” log_id=0GfQT5bG000 service=retention check_interval=30m
ts=2019-07-16T07:23:51.480459Z lvl=info msg=“Listening for signals” log_id=0GfQT5bG000
ts=2019-07-16T07:23:52.397582Z lvl=info msg=“TSM compaction (start)” log_id=0GfQT5bG000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0GfQVloG000 op_name=tsm1_compact_group op_event=start
ts=2019-07-16T07:23:52.397795Z lvl=info msg=“Beginning compaction” log_id=0GfQT5bG000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0GfQVloG000 op_name=tsm1_compact_group tsm1_files_n=2
ts=2019-07-16T07:23:52.397878Z lvl=info msg=“Compacting file” log_id=0GfQT5bG000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0GfQVloG000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/var/lib/influxdb/data/_internal/monitor/852/000000032-000000003.tsm
ts=2019-07-16T07:23:52.397955Z lvl=info msg=“Compacting file” log_id=0GfQT5bG000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0GfQVloG000 op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/var/lib/influxdb/data/_internal/monitor/852/000000064-000000003.tsm
[httpd] ::1 - alessio [16/Jul/2019:09:23:52 +0200] “POST /write?db=inverter&p=%5BREDACTED%5D&u=alessio HTTP/1.1” 204 0 “-” “curl/7.52.1” a9ce69d2-a79a-11e9-8001-b827ebc6652d 27101
[httpd] ::1 - alessio [16/Jul/2019:09:23:52 +0200] “POST /write?db=ess&p=%5BREDACTED%5D&u=alessio HTTP/1.1” 204 0 “-” “curl/7.52.1” a9da6daa-a79a-11e9-8002-b827ebc6652d 3072
[httpd] ::1 - alessio [16/Jul/2019:09:23:52 +0200] “POST /write?db=ess&p=%5BREDACTED%5D&u=alessio HTTP/1.1” 204 0 “-” “curl/7.52.1” a9e42a68-a79a-11e9-8003-b827ebc6652d 2939
[httpd] ::1 - alessio [16/Jul/2019:09:23:52 +0200] “POST /write?db=ess&p=%5BREDACTED%5D&u=alessio HTTP/1.1” 204 0 “-” “curl/7.52.1” a9ecb024-a79a-11e9-8004-b827ebc6652d 4150
ts=2019-07-16T07:23:53.376484Z lvl=info msg=“Sending usage statistics to usage.influxdata.com” log_id=0GfQT5bG000
runtime: out of memory: cannot allocate 32768-byte block (715423744 in use)
fatal error: out of memory

So ultimately influxd seems to run out of memory. But it did not do so for the first 9 months, so is something overflowing? I have read that disabling the internal statistics database may help (in config: “store-enabled = false”), so I have done that but the behaviour is the same.
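The "in use" figure in the Go runtime message is in bytes, which makes it hard to compare with the 1 GB of RAM on the Pi. A small sketch that converts it (the function name is hypothetical; it only relies on the message format shown in the log above):

```shell
# Reads log text on stdin and prints the "in use" figure of every Go
# "out of memory" message in MiB.
oom_in_use_mib() {
    sed -n 's/.*out of memory: cannot allocate [0-9]*-byte block (\([0-9]*\) in use).*/\1/p' |
        awk '{ printf "%.1f MiB\n", $1 / 1048576 }'
}

# Usage: oom_in_use_mib < influxd.log
```

For the crash above this gives 682.3 MiB, so the process died well before it had the whole 1 GB to itself.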

Any comments or anything else I could deliver to help find the cause?

EDIT: I have also changed the following in the config file, hoping to reduce memory / CPU allocation but the behaviour stays the same:

cache-max-memory-size = "250m"
cache-snapshot-memory-size = "25m"
cache-snapshot-write-cold-duration = "10m"
max-concurrent-compactions = 1
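These four settings only take effect when they sit under the [data] section of influxdb.conf (store-enabled belongs under [monitor]), and a key placed under the wrong header is easy to miss. A quick sketch to print what is actually set in one section; the function name is made up and the parsing is deliberately naive (no multi-line values):

```shell
# section_settings SECTION
# Reads a TOML-style config on stdin and prints the uncommented
# key = value lines that belong to [SECTION].
section_settings() {
    awk -v want="[$1]" '
        /^[ \t]*\[/ { gsub(/[ \t]/, ""); in_section = ($0 == want); next }
        in_section && /^[ \t]*[A-Za-z]/ { sub(/^[ \t]+/, ""); print }
    '
}

# Usage:
#   section_settings data < /etc/influxdb/influxdb.conf
#   section_settings monitor < /etc/influxdb/influxdb.conf
```

To the best of my knowledge `influxd config -config /etc/influxdb/influxdb.conf` prints the merged configuration and could be piped into the same function, but treat that as something to verify on your version.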

Hi @Dwarf ,

I see you are using inmem indexes … ?

It is recommended to use tsi indexes to reduce memory consumption …
does this make sense ? :slight_smile:

https://docs.influxdata.com/influxdb/v1.7/concepts/tsi-details/


Hi MarcV,

thank you very much for responding and for the suggestion. Regrettably it makes no difference. After about the same time influxd fails, again out of memory. Something seems to be overflowing very quickly, within literally a couple of minutes:

ts=2019-07-16T11:27:42.128207Z lvl=info msg=“Open store (end)” log_id=0GfdQ770000 service=store trace_id=0GfdQ7XG000 op_name=tsdb_open op_event=end op_elapsed=42154.779ms
ts=2019-07-16T11:27:42.128606Z lvl=info msg=“Opened service” log_id=0GfdQ770000 service=subscriber
ts=2019-07-16T11:27:42.128706Z lvl=info msg=“Starting monitor service” log_id=0GfdQ770000 service=monitor
ts=2019-07-16T11:27:42.129863Z lvl=info msg=“Registered diagnostics client” log_id=0GfdQ770000 service=monitor name=build
ts=2019-07-16T11:27:42.131227Z lvl=info msg=“Registered diagnostics client” log_id=0GfdQ770000 service=monitor name=runtime
ts=2019-07-16T11:27:42.131302Z lvl=info msg=“Registered diagnostics client” log_id=0GfdQ770000 service=monitor name=network
ts=2019-07-16T11:27:42.131468Z lvl=info msg=“Registered diagnostics client” log_id=0GfdQ770000 service=monitor name=system
ts=2019-07-16T11:27:42.131634Z lvl=info msg=“Starting precreation service” log_id=0GfdQ770000 service=shard-precreation check_interval=10m advance_period=30m
ts=2019-07-16T11:27:42.131757Z lvl=info msg=“Starting snapshot service” log_id=0GfdQ770000 service=snapshot
ts=2019-07-16T11:27:42.136101Z lvl=info msg=“Starting continuous query service” log_id=0GfdQ770000 service=continuous_querier
ts=2019-07-16T11:27:42.143213Z lvl=info msg=“Starting HTTP service” log_id=0GfdQ770000 service=httpd authentication=false
ts=2019-07-16T11:27:42.143307Z lvl=info msg=“opened HTTP access log” log_id=0GfdQ770000 service=httpd path=stderr
ts=2019-07-16T11:27:42.157541Z lvl=info msg=“Listening on HTTP” log_id=0GfdQ770000 service=httpd addr=[::]:8086 https=false
ts=2019-07-16T11:27:42.161475Z lvl=info msg=“Starting retention policy enforcement service” log_id=0GfdQ770000 service=retention check_interval=30m
ts=2019-07-16T11:27:42.169244Z lvl=info msg=“Listening for signals” log_id=0GfdQ770000
[httpd] ::1 - alessio [16/Jul/2019:13:27:42 +0200] “POST /write?db=inverter&p=%5BREDACTED%5D&u=alessio HTTP/1.1” 204 0 “-” “curl/7.52.1” b9f13b87-a7bc-11e9-8001-b827ebc6652d 18061
[httpd] ::1 - alessio [16/Jul/2019:13:27:42 +0200] “POST /write?db=ess&p=%5BREDACTED%5D&u=alessio HTTP/1.1” 204 0 “-” “curl/7.52.1” b9fe9473-a7bc-11e9-8002-b827ebc6652d 5324
[httpd] ::1 - alessio [16/Jul/2019:13:27:42 +0200] “POST /write?db=ess&p=%5BREDACTED%5D&u=alessio HTTP/1.1” 204 0 “-” “curl/7.52.1” ba089fbb-a7bc-11e9-8003-b827ebc6652d 2988
[httpd] ::1 - alessio [16/Jul/2019:13:27:42 +0200] “POST /write?db=ess&p=%5BREDACTED%5D&u=alessio HTTP/1.1” 204 0 “-” “curl/7.52.1” ba101bd0-a7bc-11e9-8004-b827ebc6652d 4110
ts=2019-07-16T11:27:43.120484Z lvl=info msg=“TSM compaction (start)” log_id=0GfdQ770000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0GfdSl40000 op_name=tsm1_compact_group op_event=start
ts=2019-07-16T11:27:43.120641Z lvl=info msg=“Beginning compaction” log_id=0GfdQ770000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0GfdSl40000 op_name=tsm1_compact_group tsm1_files_n=2
ts=2019-07-16T11:27:43.120684Z lvl=info msg=“Compacting file” log_id=0GfdQ770000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0GfdSl40000 op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/var/lib/influxdb/data/_internal/monitor/852/000000032-000000003.tsm
ts=2019-07-16T11:27:43.120725Z lvl=info msg=“Compacting file” log_id=0GfdQ770000 engine=tsm1 tsm1_strategy=full tsm1_optimize=false trace_id=0GfdSl40000 op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/var/lib/influxdb/data/_internal/monitor/852/000000064-000000003.tsm
runtime: out of memory: cannot allocate 32768-byte block (719749120 in use)
fatal error: out of memory

EDIT: Very interesting, this is where the break seems to happen and the duration grows:

ts=2019-07-16T11:27:11.665934Z lvl=info msg=“Opened file” log_id=0GfdQ770000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/inverter/autogen/814/000000007-000000002.tsm id=0 duration=0.216ms
ts=2019-07-16T11:27:11.665868Z lvl=info msg=“Opened file” log_id=0GfdQ770000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/inverter/autogen/825/000000001-000000001.tsm id=0 duration=0.307ms
ts=2019-07-16T11:27:11.666535Z lvl=info msg=“Reading file” log_id=0GfdQ770000 engine=tsm1 service=cacheloader path=/var/lib/influxdb/wal/inverter/autogen/814/_00001.wal size=18332736
ts=2019-07-16T11:27:11.674350Z lvl=info msg=“Opened shard” log_id=0GfdQ770000 service=store trace_id=0GfdQ7XG000 op_name=tsdb_open index_version=inmem path=/var/lib/influxdb/data/inverter/autogen/825 duration=9.835ms
ts=2019-07-16T11:27:11.676162Z lvl=info msg=“Opened file” log_id=0GfdQ770000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/inverter/autogen/835/000000005-000000002.tsm id=0 duration=0.300ms
ts=2019-07-16T11:27:11.676722Z lvl=info msg=“Reading file” log_id=0GfdQ770000 engine=tsm1 service=cacheloader path=/var/lib/influxdb/wal/inverter/autogen/835/_00001.wal size=18065553
ts=2019-07-16T11:27:17.717191Z lvl=info msg=“Reading file” log_id=0GfdQ770000 engine=tsm1 service=cacheloader path=/var/lib/influxdb/wal/ess/autogen/847/_00002.wal size=10485814
ts=2019-07-16T11:27:19.424352Z lvl=info msg=“Reading file” log_id=0GfdQ770000 engine=tsm1 service=cacheloader path=/var/lib/influxdb/wal/_internal/monitor/853/_00121.wal size=7805680
ts=2019-07-16T11:27:23.300072Z lvl=info msg=“Reading file” log_id=0GfdQ770000 engine=tsm1 service=cacheloader path=/var/lib/influxdb/wal/ess/autogen/847/_00003.wal size=2529643
ts=2019-07-16T11:27:23.935335Z lvl=info msg=“Opened shard” log_id=0GfdQ770000 service=store trace_id=0GfdQ7XG000 op_name=tsdb_open index_version=inmem path=/var/lib/influxdb/data/inverter/autogen/814 duration=12270.892ms
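Note that the 12-second open of shard 814 comes right after the cacheloader starts reading an 18 MB WAL segment for that same shard, so the slow opens look like WAL replay rather than a problem with the TSM files. To see how much WAL is replayed at startup in total, the "Reading file" lines can be summed. A sketch under the assumption that the log uses the key=value format above; the function name is hypothetical:

```shell
# wal_replay_bytes LOGFILE
# Sums the size= field of every cacheloader "Reading file" line.
wal_replay_bytes() {
    awk '/Reading file/ && /service=cacheloader/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^size=/) total += substr($i, 6)
         }
         END { printf "%d bytes (%.1f MiB)\n", total, total / 1048576 }' "$1"
}

# Usage: wal_replay_bytes influxd.log
```

The five WAL files listed above add up to 57219426 bytes, about 54.6 MiB on disk. How much memory that data needs once it is loaded into the cache is not something the log tells us.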

Weirdly, although I changed to:

index-version = "tsi1"

In the config file it still shows in the log:

op_name=tsdb_open index_version=inmem

EDIT 2: When I am running influxdb normally, I am running it via systemd (on Raspbian), so I can see the last output and how it is launched. I double-checked that the config file is being correctly used and it is:

● influxdb.service - InfluxDB is an open-source, distributed, time series database
Loaded: loaded (/lib/systemd/system/influxdb.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2019-07-16 13:57:38 CEST; 25s ago
Docs: https://docs.influxdata.com/influxdb/
Main PID: 1596 (influxd)
Tasks: 10 (limit: 4915)
CGroup: /system.slice/influxdb.service
└─1596 /usr/bin/influxd -config /etc/influxdb/influxdb.conf

So it looks like it is correctly using:

-config /etc/influxdb/influxdb.conf

As mentioned there I have made the change to index-version = “tsi1”, but nonetheless the systemd daemon instance of influxdb is also showing:

index_version=inmem

This seemed very bizarre. I have found elsewhere that it is important to run the index migration with:

sudo -H -u influxdb bash -c 'influx_inspect buildtsi -datadir /var/lib/influxdb/data -waldir /var/lib/influxdb/wal/'

I have done that and now tsi1 is being used, so far so good:

2019-07-16T12:17:58.841448Z info Opened shard {“log_id”: “0GfgGqCG000”, “service”: “store”, “trace_id”: “0GfgGqdl000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/machinestats/autogen/845”, “duration”: “5325.990ms”}

But influxdb remains unresponsive and still consumes roughly the same 360% load on 4 cores. It also aborts in a similar way to before:

2019-07-16T12:20:03.695205Z error [500] - “[shard 898] open /var/lib/influxdb/data/machinestats/autogen/898/index/1/MANIFEST: too many open files” {“log_id”: “0GfgGqCG000”, “service”: “httpd”}
runtime: out of memory: cannot allocate 8192-byte block (828571648 in use)
fatal error: out of memory

There are now very many errors, pretty much all of them:

too many open files
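To see how close the process actually gets to its limit, the kernel exposes both numbers under /proc. A minimal sketch for Linux, to be run as root because the process belongs to the influxdb user; the function name is made up:

```shell
# fd_usage PID
# Prints the number of file descriptors the process has open and its
# soft "Max open files" limit, both read from /proc.
fd_usage() {
    pid="$1"
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits" 2>/dev/null)
    echo "open=$open soft_limit=$limit"
}

# Usage: fd_usage "$(pidof influxd)"
```

Running it a few times while influxd starts up shows whether the count keeps climbing towards the limit or levels off.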

EDIT 3
I can confirm from top that the CPU load is in the > 300% range, the memory goes from less than 10% up to 80% throughout a minute or two and then the process fails and is automatically re-launched by systemd. So exactly the same behaviour as when I launch influxd manually except that after it aborts it is automatically relaunched indefinitely.

Just to be clear, the data volume sent to this influxdb is very low. Every 10 seconds I am sending roughly 50 values and on top of that every minute another roughly 20-30 integers/floats. So I assume this comes from an issue within influx and not from too much data being thrown at it.

Hi @Dwarf ,

can you solve the too many open files issue ?
( ulimit ? )

best regards ,

I set ulimit to 65536 temporarily, but same behaviour. Now I am locked out of the system too because it is completely unresponsive.
I’ll try to set ulimit to something more sane (the original value was 1024) to see if it makes things better or worse, but I do not think that is the issue. I also do not understand why the db would eat so many resources given the relatively low volume of writes I am making (no automatic recurring queries etc.). I think it is another issue.
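One thing to keep in mind here: `ulimit -n` typed into a shell only applies to processes started from that shell. The service started by systemd takes its limit from LimitNOFILE= in the unit, which `systemctl show influxdb -p LimitNOFILE` displays. A sketch of setting it with a drop-in file, assuming the unit name influxdb; the function name, the file name limits.conf and the value are illustrative only:

```shell
# write_nofile_dropin LIMIT [DROPIN_DIR]
# Writes a systemd drop-in that overrides the open-file limit of the
# service. DROPIN_DIR defaults to the standard location for a unit
# named influxdb.service.
write_nofile_dropin() {
    dir="${2:-/etc/systemd/system/influxdb.service.d}"
    mkdir -p "$dir" &&
        printf '[Service]\nLimitNOFILE=%s\n' "$1" > "$dir/limits.conf"
}

# Usage (as root):
#   write_nofile_dropin 10240
#   systemctl daemon-reload && systemctl restart influxdb
```

This keeps the packaged unit file untouched, so a package upgrade does not silently revert the change.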

The frustrating thing is I know I can just empty the db and things will be perfect for the next 9 months, when this will happen all over again. I have not found any indication of a tutorial for limiting influxdb in terms of a certain RAM limit etc… As many people are using a Raspberry Pi surely they would run into similar issues, some probably much earlier if they are collecting more data. Either way tsi1 seems like a good plausible way to, in theory, reduce RAM consumption by influxdb.

EDIT: So as expected I have set the ulimit to 10240 and I do not see any “too many open files” messages in the first minute of running influxd. But the original OOM problem remains:

2019-07-17T09:19:44.020487Z info index opened with 8 partitions {“log_id”: “0GgoUbuW000”, “index”: “tsi”}
2019-07-17T09:19:44.025483Z info Opened shard {“log_id”: “0GgoUbuW000”, “service”: “store”, “trace_id”: “0GgoUcLW000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/machinestats/autogen/714”, “duration”: “123.496ms”}
2019-07-17T09:19:44.080762Z info index opened with 8 partitions {“log_id”: “0GgoUbuW000”, “index”: “tsi”}
2019-07-17T09:19:44.084020Z info Opened shard {“log_id”: “0GgoUbuW000”, “service”: “store”, “trace_id”: “0GgoUcLW000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/machinestats/autogen/725”, “duration”: “145.562ms”}
2019-07-17T09:19:44.136367Z info index opened with 8 partitions {“log_id”: “0GgoUbuW000”, “index”: “tsi”}
2019-07-17T09:19:44.139801Z info Opened shard {“log_id”: “0GgoUbuW000”, “service”: “store”, “trace_id”: “0GgoUcLW000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/machinestats/autogen/736”, “duration”: “145.108ms”}
2019-07-17T09:19:44.141209Z info index opened with 8 partitions {“log_id”: “0GgoUbuW000”, “index”: “tsi”}
2019-07-17T09:19:44.149556Z info Opened shard {“log_id”: “0GgoUbuW000”, “service”: “store”, “trace_id”: “0GgoUcLW000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/machinestats/autogen/74”, “duration”: “144.990ms”}
2019-07-17T09:19:44.166166Z info index opened with 8 partitions {“log_id”: “0GgoUbuW000”, “index”: “tsi”}
2019-07-17T09:19:44.169739Z info Opened shard {“log_id”: “0GgoUbuW000”, “service”: “store”, “trace_id”: “0GgoUcLW000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/machinestats/autogen/747”, “duration”: “142.525ms”}
2019-07-17T09:19:44.190840Z info index opened with 8 partitions {“log_id”: “0GgoUbuW000”, “index”: “tsi”}
2019-07-17T09:19:44.200604Z info Opened shard {“log_id”: “0GgoUbuW000”, “service”: “store”, “trace_id”: “0GgoUcLW000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/machinestats/autogen/758”, “duration”: “116.282ms”}
2019-07-17T09:19:44.269711Z info index opened with 8 partitions {“log_id”: “0GgoUbuW000”, “index”: “tsi”}
2019-07-17T09:19:44.275095Z info Opened shard {“log_id”: “0GgoUbuW000”, “service”: “store”, “trace_id”: “0GgoUcLW000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/weather/autogen/95”, “duration”: “134.995ms”}
2019-07-17T09:19:44.279168Z info index opened with 8 partitions {“log_id”: “0GgoUbuW000”, “index”: “tsi”}
2019-07-17T09:19:44.296546Z info index opened with 8 partitions {“log_id”: “0GgoUbuW000”, “index”: “tsi”}
2019-07-17T09:19:44.299261Z info Opened shard {“log_id”: “0GgoUbuW000”, “service”: “store”, “trace_id”: “0GgoUcLW000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/weather/autogen/106”, “duration”: “129.144ms”}
2019-07-17T09:19:44.310728Z info index opened with 8 partitions {“log_id”: “0GgoUbuW000”, “index”: “tsi”}
2019-07-17T09:19:44.313432Z info Opened shard {“log_id”: “0GgoUbuW000”, “service”: “store”, “trace_id”: “0GgoUcLW000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/weather/autogen/117”, “duration”: “112.499ms”}
2019-07-17T09:19:44.331582Z info Opened shard {“log_id”: “0GgoUbuW000”, “service”: “store”, “trace_id”: “0GgoUcLW000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/machinestats/autogen/769”, “duration”: “180.258ms”}
2019-07-17T09:19:44.377919Z info index opened with 8 partitions {“log_id”: “0GgoUbuW000”, “index”: “tsi”}
2019-07-17T09:19:44.381233Z info Opened shard {“log_id”: “0GgoUbuW000”, “service”: “store”, “trace_id”: “0GgoUcLW000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/machinestats/autogen/780”, “duration”: “104.575ms”}
2019-07-17T09:19:44.440785Z info index opened with 8 partitions {“log_id”: “0GgoUbuW000”, “index”: “tsi”}
2019-07-17T09:19:44.445279Z info Opened shard {“log_id”: “0GgoUbuW000”, “service”: “store”, “trace_id”: “0GgoUcLW000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/machinestats/autogen/791”, “duration”: “144.449ms”}
2019-07-17T09:19:44.440796Z info index opened with 8 partitions {“log_id”: “0GgoUbuW000”, “index”: “tsi”}
2019-07-17T09:19:44.454405Z info index opened with 8 partitions {“log_id”: “0GgoUbuW000”, “index”: “tsi”}
2019-07-17T09:19:44.462489Z info Opened shard {“log_id”: “0GgoUbuW000”, “service”: “store”, “trace_id”: “0GgoUcLW000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/weather/autogen/125”, “duration”: “148.673ms”}
2019-07-17T09:19:44.472972Z info Opened shard {“log_id”: “0GgoUbuW000”, “service”: “store”, “trace_id”: “0GgoUcLW000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/machinestats/autogen/802”, “duration”: “139.673ms”}
2019-07-17T09:19:44.495325Z info index opened with 8 partitions {“log_id”: “0GgoUbuW000”, “index”: “tsi”}
2019-07-17T09:19:44.501469Z info Opened shard {“log_id”: “0GgoUbuW000”, “service”: “store”, “trace_id”: “0GgoUcLW000”, “op_name”: “tsdb_open”, “index_version”: “tsi1”, “path”: “/var/lib/influxdb/data/weather/autogen/139”, “duration”: “119.932ms”}
runtime: out of memory: cannot allocate 65536-byte block (983629824 in use)
fatal error: out of memory
runtime: out of memory: cannot allocate 65536-byte block (983629824 in use)
fatal error: out of memory

Hi , what is your retention ?
I ask it because you say it is after 9 months , so If you could limit the retention to less than 9 months ?

I have no retention policy at the moment. I figured with a bit over 1 GB of db data I should be fine, but I may be wrong?

Here is the current total db size:

sudo du -sh /var/lib/influxdb/data
1.2G /var/lib/influxdb/data

1.2 G should be no problem , I just checked and we have one with 12G running …
Once you get your db up and running again you could consider changing the default retention policy to a few months ? This will apply only to new shards , so you will have to remove manually some old shards until all shards are from after the retention policy modification …

from the doc :

When you create a database, InfluxDB creates a retention policy called autogen with an infinite duration, a replication factor set to one, and a shard group duration set to seven days. For more information, see Retention policy management.
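With those defaults every database gets a new shard group each week and none is ever removed. A back-of-the-envelope sketch of how many that makes, taking nine months as roughly 270 days (my assumption) and using a made-up function name:

```shell
# shard_groups DAYS [SHARD_GROUP_DAYS]
# Number of shard groups needed to cover DAYS of data, rounded up.
# SHARD_GROUP_DAYS defaults to 7, the autogen default quoted above.
shard_groups() {
    echo $(( ($1 + ${2:-7} - 1) / ${2:-7} ))
}

# Usage:
#   shard_groups 270        # about nine months of weekly shard groups
#   influx -execute 'SHOW SHARDS'   # the real list, once the CLI responds
```

For 270 days this comes to 39 shard groups per database, all of which influxd opens at startup, so the count across several databases adds up even when the data volume is small.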

I will make sure to do that in the future, but I see no reason why, even without a retention policy, such a small db would eat almost 1 GB of RAM and then fail accordingly. Judging from the logs, influxdb seems to be opening files and not closing them, as otherwise it would not run into the ulimit, which was originally set to 1024. I do not comprehend how > 1000 files would be opened from such a small db. Something is seriously wrong here and I would be eager to know where I can change things to prevent this behaviour.

Again, I am sure you are right that running without a retention policy is not best practice, but in my (very limited) view it is not an explanation for the problems I am seeing.

EDIT: One more question, what type of hardware is your 12G db running on? I would be very curious to know of other low-resource systems running for months, what settings they are using and what complexity their data has, as a hint for what I could improve.

I can see from your logs the databases : inverter , ess , weather , _internal and machinestats …

maybe the sum of these explains the 1000 files …
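That guess is easy to check by counting the files each database keeps on disk. A sketch, assuming the default data directory from the logs; the function name is hypothetical, and files on disk are only an upper-bound hint, not the same thing as descriptors held open:

```shell
# files_per_db DATADIR
# Prints "<file count> <database>" for every database directory under
# DATADIR, largest first.
files_per_db() {
    for db in "$1"/*/; do
        [ -d "$db" ] || continue
        printf '%s %s\n' "$(find "$db" -type f | wc -l | tr -d ' ')" "$(basename "$db")"
    done | sort -rn
}

# Usage (as root): files_per_db /var/lib/influxdb/data
```

After a tsi1 migration each shard also carries an index directory with several partitions, so the per-shard file count goes up compared with inmem.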

I don’t know if you can take some risks on the current database ,
I would remove some folders like /var/lib/influxdb/data/_internal if that exists and /wal/_internal …
( because you have set store-enabled = false )
which subfolders do you find in /var/lib/influxdb/data ?

My db runs on a virtual linux machine , it is not a low-resource one.


I would have liked to DROP DATABASE from the influx CLI but I could not access the db from CLI because it is stalling (due to the ongoing high CPU and RAM usage by influxd). Because my data is not important I went ahead to delete the files. _internal in /var/lib/influxdb/data was 76 MB large before deletion and _internal in the /var/lib/influxdb/wal folder was 28 MB.

Again, I tried to access the influx CLI many times and finally it worked. At that point I had influxd running manually and not as a service from systemd. So I saw the output in the terminal and noticed it had calmed down. In other words influxd is running normally now!

The last I saw before it returned to normal operation was “compacting complete” or “compaction complete”. This leads me to believe that the process of compacting shards was eating up the RAM, but that is speculation. It would be great to know more about this and how to constrain influx’ RAM and CPU usage while compacting. I had already set max-concurrent-compactions = 1 days ago to try to limit this to one CPU core, but influx was still thrashing all 4 cores and either ignoring this setting or something else was causing it.

In an effort to prevent this from happening again I have also set the following to cater to low write rates on an RPi, even though I am using an SSD (it is connected via the slow USB 2.0 controller that also shares Ethernet on the RPi, potentially compounding this issue in the first place):

compact-throughput = "5m"
compact-throughput-burst = "8m"

If all this fails again, then I am ready to:

  1. Remove all dbs
  2. Re-create dbs empty
  3. Disable _internal from the start
  4. Enable tsi1 from the start
  5. Define a retention policy to start removing data >= 3 months old
  6. Hope for the best and pray that this will be enough to give me a stable db for years :confounded:

I will try to educate myself on whether I can retroactively define a retention policy to thin out data older than 3 months.
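For reference, InfluxQL can change the duration of an existing policy with ALTER RETENTION POLICY. A sketch that builds the statement for the default autogen policy; the function name and the 90d value (roughly three months) are my own choices, and the database names are the ones visible in the logs:

```shell
# retention_stmt DATABASE DURATION
# Prints an InfluxQL statement that limits the autogen retention policy
# of DATABASE to DURATION (for example 90d).
retention_stmt() {
    printf 'ALTER RETENTION POLICY "autogen" ON "%s" DURATION %s\n' "$1" "$2"
}

# Usage:
#   for db in inverter ess weather machinestats; do
#       influx -execute "$(retention_stmt "$db" 90d)"
#   done
```

My understanding, which is an assumption to verify on a database you can afford to lose, is that the retention service then drops whole shard groups that are older than the new duration. Checking SHOW SHARDS before and after shows what actually happened.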

Thank you MarcV for your help thus far. If I find valuable information regarding this I will continue to post it here so that other users of resource-constrained machines can find it.

EDIT: Though running normally now, influxd is still taking most of my 1GB RAM (65.1% here) so I may still have to clean out the dbs soon:

  PID USER      PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
 7599 influxdb  20   0 1770384 650176 27560 S   1.6 65.1   1:53.90 influxd
