Hi,
We have seen several occurrences of the error below in InfluxDB, which causes the process to fail:
unexpected fault address 0x7fa106c1e804
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x7fa106c1e804 pc=0xae9921]
goroutine 14534703 [running]:
runtime.throw(0xcb15e3, 0x5)
	/usr/local/go/src/runtime/panic.go:605 +0x95 fp=0xc4ffe98630 sp=0xc4ffe98610 pc=0x42baa5
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:374 +0x227 fp=0xc4ffe98680 sp=0xc4ffe98630 pc=0x442577
encoding/binary.binary.bigEndian.Uint32(…)
	/usr/local/go/src/encoding/binary/binary.go:112
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*indirectIndex).search.func1(0x7fa106c1e804, 0x4, 0xad808, 0xc489124a00)
	/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/reader.go:669 +0x41 fp=0xc4ffe986c8 sp=0xc4ffe98680 pc=0xae9921
github.com/influxdata/influxdb/pkg/bytesutil.SearchBytesFixed(0x7fa106b71000, 0x15b00c, 0x15b00c, 0x4, 0xc4ffe987b0, 0x42a379)
The full error is a continuous Go stack dump of over 8,000 lines.
It happened several times over the course of 4-5 days (3 or 4 times a day, at random). Each time the process failed, the OS guardian we have in place restarted it and it then ran fine again for several hours, so we don't tend to lose any data.
But we haven't seen it for a few days now…
I can see the odd query from Grafana prior to the issue happening, but nothing consistent for each occurrence, and the only other logging I see is for the compaction of shards, which is pretty much constant in the logs anyway.
Some more details:
- InfluxDB version 1.4.2.
- The server has 512 GB of memory and charts show it was nowhere near the box limits.
- We have been running pretty much the same setup for over a year and nothing has changed recently in the config.
- I have run the verification tool on the data dir and it reported no issues, with everything healthy (excerpt below, followed by the command used):
…
/xxxxxxx/xxxxxxx/var/lib/influxdb/data/dsl/ms_database/90day/869/000000007-000000001.tsm: healthy
Broken Blocks: 0 / 353985902, in 319.519655774s
…
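For reference, the check was run roughly as follows (path anonymised as in the excerpt above; exact flag usage may differ slightly between versions):

influx_inspect verify -dir /xxxxxxx/xxxxxxx/var/lib/influxdb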
We have already started the process of upgrading the DB to the latest version, 1.8, in the hope that this may prevent it happening again, but we would still like to get to the bottom of what happened. If someone can advise on what the root cause might be, or suggest what other investigations I can do, that would be much appreciated…
Happy to provide more information if needed