InfluxDB memory error (unexpected fault address)

Hi,

We have seen several occurrences of the error below in InfluxDB, which causes the process to fail:

unexpected fault address 0x7fa106c1e804
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x7fa106c1e804 pc=0xae9921]

goroutine 14534703 [running]:
runtime.throw(0xcb15e3, 0x5)

        /usr/local/go/src/runtime/panic.go:605 +0x95 fp=0xc4ffe98630 sp=0xc4ffe98610 pc=0x42baa5
runtime.sigpanic()
        /usr/local/go/src/runtime/signal_unix.go:374 +0x227 fp=0xc4ffe98680 sp=0xc4ffe98630 pc=0x442577
encoding/binary.binary.bigEndian.Uint32(...)
        /usr/local/go/src/encoding/binary/binary.go:112
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*indirectIndex).search.func1(0x7fa106c1e804, 0x4, 0xad808, 0xc489124a00)
        /go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/reader.go:669 +0x41 fp=0xc4ffe986c8 sp=0xc4ffe98680 pc=0xae9921
github.com/influxdata/influxdb/pkg/bytesutil.SearchBytesFixed(0x7fa106b71000, 0x15b00c, 0x15b00c, 0x4, 0xc4ffe987b0, 0x42a379)

The full error is a continuous Go stack dump of over 8,000 lines.

It happened several times over the course of 4-5 days (3 or 4 times a day, at random). Each time the process fails, our OS guardian restarts it for us, and it then runs fine again for several hours, so we don’t tend to lose any data.

But we haven’t seen it for a few days now…

I can see the odd query from Grafana prior to the issue happening, but nothing consistent across occurrences. The only other logging I see is for the compacting of the shards, which is pretty much constant in the logs anyway.

Some more details:

  • InfluxDB version 1.4.2.

  • The server has 512 GB of memory, and charts show usage was nowhere near the box limits.

  • We have been running pretty much the same setup for over a year and nothing has changed recently in the config.

  • I have run the verification tool on the data dir; it returned no issues and reported everything healthy.

    /xxxxxxx/xxxxxxx/var/lib/influxdb/data/dsl/ms_database/90day/869/000000007-000000001.tsm: healthy
    Broken Blocks: 0 / 353985902, in 319.519655774s
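(For anyone following along: the check above was presumably run with influx_inspect verify, the 1.x tool that validates TSM block checksums. Assuming a default-style data layout, the invocation looks like:

```shell
# Verify TSM block checksums under the storage root
# (adjust -dir to your actual influxdb storage path)
influx_inspect verify -dir /var/lib/influxdb
```

It reports each TSM file as healthy/unhealthy and a total broken-block count, matching the output shown.)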

We have already started the process of upgrading the DB to the latest version, 1.8, in the hope that this may prevent it happening again. But we would still like to get to the bottom of what happened — can someone advise what the root cause might be, or suggest what other investigations I can do?

Happy to provide more information if needed.

I hit a similar issue to yours. It is also an “unexpected fault address” fault caused by encoding/binary.binary.bigEndian.Uint32(…). The InfluxDB function that calls it is also located in tsm1.

Apr 24 21:42:44 localhost xxxx: unexpected fault address 0x7f74afb9a07a
Apr 24 21:42:44 localhost xxxx: fatal error: fault
Apr 24 21:42:44 localhost xxxx: [signal SIGSEGV: segmentation violation code=0x1 addr=0x7f74afb9a07a pc=0x9ccce4]
Apr 24 21:42:44 localhost xxxx:
Apr 24 21:42:44 localhost xxxx: goroutine 2488 [running]:
Apr 24 21:42:44 localhost xxxx: runtime.throw(0x23e7dac, 0x5)
Apr 24 21:42:44 localhost xxxx:         /usr/local/go/src/runtime/panic.go:774 +0x72 fp=0xc00108dee8 sp=0xc00108deb8 pc=0x433bf2
Apr 24 21:42:44 localhost xxxx: runtime.sigpanic()
Apr 24 21:42:44 localhost xxxx:         /usr/local/go/src/runtime/signal_unix.go:401 +0x3de fp=0xc00108df18 sp=0xc00108dee8 pc=0x44967e
Apr 24 21:42:44 localhost xxxx: encoding/binary.bigEndian.Uint16(...)
Apr 24 21:42:44 localhost xxxx:         /usr/local/go/src/encoding/binary/binary.go:101
Apr 24 21:42:44 localhost xxxx: github.com/influxdata/influxdb/tsdb/engine/tsm1.readKey(...)
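Same class of failure: the trace ends in tsm1.readKey decoding a 2-byte big-endian key length from the index before slicing out that many key bytes. A simplified sketch of that layout — hypothetical, not the actual InfluxDB implementation — with explicit bounds checks:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// readKey mirrors the shape of a TSM index key entry: a 2-byte
// big-endian key length followed by the key bytes. Hypothetical
// sketch for illustration only.
func readKey(b []byte) (n int, key []byte, err error) {
	if len(b) < 2 {
		return 0, nil, errors.New("short buffer: no key length")
	}
	keyLen := int(binary.BigEndian.Uint16(b))
	if len(b) < 2+keyLen {
		return 0, nil, errors.New("short buffer: truncated key")
	}
	return 2 + keyLen, b[2 : 2+keyLen], nil
}

func main() {
	// 0x0003 = key length 3, followed by the key "cpu"
	buf := []byte{0x00, 0x03, 'c', 'p', 'u'}
	n, key, err := readKey(buf)
	fmt.Println(n, string(key), err) // 5 cpu <nil>
}
```

Note that such bounds checks only guard against a malformed length within a valid slice. If the underlying mmap'd region itself is no longer valid, even reading the length faults at the hardware level, which is exactly the unrecoverable "fatal error: fault" both of our logs show.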