InfluxDB restart problem and data loss on 32-bit ARM SBC

I’ve been running InfluxDB on my small ARM machine for a couple of months now, and it has seemingly been working well, except when the daemon has to restart.

8-core ARM, 2 GB RAM, 256 GB SSD, Ubuntu 16.04, InfluxDB 1.3.5 (and the rest of the TICK stack). Telegraf and a custom application feed the database via UDP. Chronograf shows multiple plots going back as far as when the database was last restarted.
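The UDP listener is just the stock config section in influxdb.conf, roughly like this (the database name and batch values here are illustrative, not necessarily what I’m running):

  [[udp]]
    enabled = true
    bind-address = ":8089"
    database = "telegraf"   # hypothetical target database
    batch-size = 5000       # points buffered before a write
    batch-timeout = "1s"    # flush interval if the batch doesn't fill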

A week or so in, there was a problem after the machine had an abrupt restart. I had to recreate the database to get it running again, and just chalked it up to some of my initial fumbling around while getting it set up.

Yesterday I had to restart the machine again, and when it came back up only a couple of hours of data remained. I searched around for a solution and came across a variety of config settings to adjust (cache and memory sizes, number of concurrent tasks, etc.), which I did… but now the daemon won’t start at all, failing with an error about TSM being unable to allocate memory.
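For reference, the settings in question are the memory-related ones in the [data] section of influxdb.conf, along these lines (the values are illustrative, not exactly what I used):

  [data]
    cache-max-memory-size = 536870912       # cap the in-memory cache at 512 MB
    cache-snapshot-memory-size = 26214400   # snapshot the cache to TSM files at 25 MB
    max-concurrent-compactions = 1          # only one compaction at a time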

Searching around about that turned up a few concerning posts about the 32-bit address space limiting the database size to 2–3 GB. That worries me, because the goal is to (nearly) fill the SSD. It seems like a strange limitation for a database, if true.

So can anyone offer any input on:

  1. how to avoid losing data more than a few hours old when restarting
  2. how to fix the TSM out-of-memory error
  3. whether InfluxDB is actually usable for my use case… I’d hate to have to abandon what seems like quite a nice tool stack

Thanks,
Andrew

Update: I updated to 1.3.7 (leaving my influxdb.conf file untouched) and now the service starts and some of the data (about a week) is showing up in Chronograf. I see a few instances of this error in the journal:

error compacting TSM files: cannot allocate memory engine=tsm1

There is also a HUGE volume of messages in the journal – and journalctl reports suppressing ~30,000 others. Is there a way to turn down the volume of messages generated? I don’t want to fill my boot volume with logging. The messages look like this:

Oct 30 13:08:46 hc1 influxd[17793]: #011/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/file_store.go:395 +0x2a4
Oct 30 13:08:46 hc1 influxd[17793]: created by github.com/influxdata/influxdb/tsdb/engine/tsm1.(*FileStore).Open
Oct 30 13:08:46 hc1 influxd[17793]: #011/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/file_store.go:399 +0x3f4
Oct 30 13:08:46 hc1 influxd[17793]: goroutine 61010 [chan send, 1 minutes]:
Oct 30 13:08:46 hc1 influxd[17793]: github.com/influxdata/influxdb/tsdb/engine/tsm1.(*FileStore).Open.func1(0x10c51810, 0x1d56fa80, 0x26b6, 0x…

Repeated over and over and over. Thousands per second.
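In the meantime, one way to at least keep this from filling the boot volume would be to cap journald’s disk usage and tighten its rate limiting, rather than touching InfluxDB itself; something like this in /etc/systemd/journald.conf (the limits are arbitrary examples), followed by a restart of systemd-journald:

  [Journal]
  SystemMaxUse=200M
  RateLimitInterval=30s
  RateLimitBurst=1000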

Another update: my system has crashed (as in shut down unexpectedly) twice since I got InfluxDB up and running again, and both times it was while opening a Chronograf dashboard (and therefore running queries against InfluxDB). Linux doesn’t usually go down without a fight, so this is somewhat surprising.

And now Chronograf blanks most of its graphs out to black, so I can only see 1–2 of the 5 at a time.

Andrew - were you able to resolve your data loss problem? I’m seeing similar behavior, where all history is gone after a restart of the system.

No. I tried a whole variety of things, but in the end gave up and went with a competing solution (which has been working well ever since).

What solution are you using now?

TimescaleDB – it has been running for just about 2 years now without failure or data loss. It interfaces with Grafana well, backups are simple, queries are plain SQL, and my database is up to about 30 GB (and I haven’t started using their new compression support yet).
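For anyone curious what that looks like, it’s just ordinary PostgreSQL plus the Timescale extension; a minimal sketch (the table and column names are made up for illustration, not my actual schema):

  -- enable the extension and make the table a hypertable partitioned on time
  CREATE EXTENSION IF NOT EXISTS timescaledb;
  CREATE TABLE metrics (
      time   TIMESTAMPTZ      NOT NULL,
      host   TEXT             NOT NULL,
      metric TEXT             NOT NULL,
      value  DOUBLE PRECISION
  );
  SELECT create_hypertable('metrics', 'time');

  -- the kind of query Grafana ends up issuing: 5-minute averages over the last day
  SELECT time_bucket('5 minutes', time) AS bucket, avg(value)
  FROM metrics
  WHERE metric = 'cpu_load' AND time > now() - interval '1 day'
  GROUP BY bucket
  ORDER BY bucket;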

Version 1.3 is very out of date; an upgrade to 1.7 would be a first step. On these 32-bit ARM SBCs, given the limited RAM, you need to be sure you are using the tsi index and not inmem for your database, and it’s also important to keep your shard size relatively small so you don’t go into an OOM loop when it comes time to compact shards.
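Concretely, that means something like this in the [data] section of influxdb.conf:

  [data]
    index-version = "tsi1"   # disk-backed index instead of the in-memory one

and a shorter shard duration on the retention policy, via the influx CLI (the database/RP names and the 1d value are just examples; tune the shard duration to your write volume):

  ALTER RETENTION POLICY "autogen" ON "telegraf" SHARD DURATION 1d

Existing shards written with the inmem index also have to be converted with influx_inspect buildtsi (with the daemon stopped) for the change to apply to old data.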

That’s been my experience at least, and I run InfluxDB on a bunch of 32- and 64-bit SBCs and have been doing so for 2 years.

dg