Recover from an "invalid series segment"

I’m running InfluxDB v1.6.3

I have a situation where the host machine was rebooted while (or immediately after) a temporary database was being dropped. The resulting state would not allow Influx to launch once the system had rebooted.

It attempts to start but hangs at this point and won’t accept any connections.

2019-03-13T17:56:59.009426Z	info	InfluxDB starting	{"log_id": "0EA32b0G000", "version": "1.6.3", "branch": "1.6", "commit": "389de31c961831de0a9f4172173337d4a6193909"}
2019-03-13T17:56:59.009461Z	info	Go runtime	{"log_id": "0EA32b0G000", "version": "go1.10.3", "maxprocs": 4}
2019-03-13T17:56:59.120332Z	info	Using data dir	{"log_id": "0EA32b0G000", "service": "store", "path": "/root/influxdb/data"}
2019-03-13T17:56:59.120393Z	info	Compaction settings	{"log_id": "0EA32b0G000", "service": "store", "max_concurrent_compactions": 2, "throughput_bytes_per_second": 50331648, "throughput_burst_bytes": 50331648}
2019-03-13T17:56:59.120417Z	info	Open store (start)	{"log_id": "0EA32b0G000", "service": "store", "trace_id": "0EA32bS0000", "op_name": "tsdb_open", "op_event": "start"}
2019-03-13T17:56:59.135750Z	info	Opened file	{"log_id": "0EA32b0G000", "engine": "tsm1", "service": "filestore", "path": "/root/influxdb/data/_internal/monitor/10/000000005-000000002.tsm", "id": 0, "duration": "2.219ms"}
2019-03-13T17:56:59.136875Z	info	Opened file	{"log_id": "0EA32b0G000", "engine": "tsm1", "service": "filestore", "path": "/root/influxdb/data/_internal/monitor/105/000000012-000000002.tsm", "id": 0, "duration": "3.077ms"}
2019-03-13T17:56:59.140402Z	info	Opened file	{"log_id": "0EA32b0G000", "engine": "tsm1", "service": "filestore", "path": "/root/influxdb/data/_internal/monitor/207/000000013-000000002.tsm", "id": 0, "duration": "5.200ms"}
2019-03-13T17:56:59.154700Z	info	Opened file	{"log_id": "0EA32b0G000", "engine": "tsm1", "service": "filestore", "path": "/root/influxdb/data/_internal/monitor/157/000000013-000000002.tsm", "id": 0, "duration": "14.662ms"}

influx_inspect reveals that all of the tsm files are healthy, but the series files related to the dropped database are corrupted.

$ influx_inspect verify-seriesfile -dir ~/influxdb/data
2019-03-13T18:48:59.331294Z	error	Error opening segment	{"log_id": "0EA612gG000", "path": "/root/influxdb/data/t128_tmp/_series", "partition": "00", "segment": "0000", "error": "invalid series segment"}
2019-03-13T18:48:59.332114Z	error	Error opening segment	{"log_id": "0EA612gG000", "path": "/root/influxdb/data/t128_tmp/_series", "partition": "01", "segment": "0000", "error": "invalid series segment"}
2019-03-13T18:48:59.332078Z	error	Error opening segment	{"log_id": "0EA612gG000", "path": "/root/influxdb/data/t128_tmp/_series", "partition": "02", "segment": "0000", "error": "invalid series segment"}
2019-03-13T18:48:59.332154Z	error	Error opening segment	{"log_id": "0EA612gG000", "path": "/root/influxdb/data/t128_tmp/_series", "partition": "04", "segment": "0000", "error": "invalid series segment"}
2019-03-13T18:48:59.332222Z	error	Error opening segment	{"log_id": "0EA612gG000", "path": "/root/influxdb/data/t128_tmp/_series", "partition": "03", "segment": "0000", "error": "invalid series segment"}
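
For scripting around this check, here is a minimal sketch that reduces the output above to the distinct series-file paths reporting the error. The function name `corrupt_series_paths` is hypothetical, and the parsing assumes the log line layout shown above (a `"path"` field and an `"error": "invalid series segment"` field on the same line).

```shell
# Hypothetical helper: reads verify-seriesfile log output on stdin and prints
# each distinct "path" value from lines reporting "invalid series segment".
# Assumes the line layout shown in the output above.
corrupt_series_paths() {
  grep '"error": "invalid series segment"' \
    | sed -n 's/.*"path": "\([^"]*\)".*/\1/p' \
    | sort -u
}
```

It could be used as `influx_inspect verify-seriesfile -dir ~/influxdb/data 2>&1 | corrupt_series_paths`. The `2>&1` is there because I have not confirmed whether the tool logs to stdout or stderr. For the output above it would print the single path `/root/influxdb/data/t128_tmp/_series`.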

Is there a recommended approach to recover from this state? And are there other failure conditions I can expect to cause it?

Hi,
You can try moving the files from the dropped database to a different directory.

@MarcV,

Thanks for your response! Which files do you mean? The files under the _series directory, everything under the data/t128_tmp directory, the wal/t128_tmp directory, or all of the above?

I guess that’s straightforward enough here since the data is unwanted, but since I posted I’ve had the same failure on an un-dropped database where I do need to salvage the data. I would like a procedure that preserves most, if not all, of the data.

I mean all of the t128_tmp directories.

I guess the only way is to take daily backups and restore in case of failure?
That means you can lose a day of data.
Is that acceptable?

I guess that’s an option. So you’re saying there is no way to recover from this state? There is no means for rebuilding the segments from the TSM files?

As a test, I did try deleting the segment directories with rm -rf influxdb/data/t128_tmp/_series/*. After that, Influx could launch and I haven’t found any adverse effects. The series directories reappeared; I assume they are recreated during launch. I can even access data from that temporary database.

Is that a viable approach, or am I missing something?

Hi Greg,
I was focused on getting your database open again :)
It is strange that you can access data from that temporary database. Or is it configured somewhere in telegraf? That would mean the database was not dropped, or was recreated …?

I don’t know if they can be recreated. influx_inspect has the buildtsi command (it generates tsi1 indexes from tsm1 data), and series are indexed, so you can probably recreate the ‘series’ files with this command.

Update: I tested it:

  1. rm _series/*
  2. influx_inspect buildtsi -database dbcpu -datadir "/home/influxdb/data" -waldir "/home/influxdb/wal"

and the series are back …
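
The two steps above can be sketched as a shell function for repeated use. This is only a sketch under some assumptions: influxd is stopped first, the function name `rebuild_series` and the `INFLUX_INSPECT` override variable are hypothetical, and the corrupt `_series` directory is moved to a backup location outside the data directory instead of being deleted, so it can be inspected or restored later.

```shell
# Hypothetical helper: move a database's _series directory aside, then run
# influx_inspect buildtsi for that database (the two steps tested above).
# Usage: rebuild_series <database> <datadir> <waldir> <backupdir>
# Assumes influxd is stopped and <backupdir> exists outside <datadir>.
rebuild_series() {
  db=$1; datadir=$2; waldir=$3; backupdir=$4
  series="$datadir/$db/_series"

  if [ ! -d "$series" ]; then
    echo "no series directory at $series" >&2
    return 1
  fi

  # Step 1: move the corrupt series files out of the way (instead of rm).
  mv "$series" "$backupdir/${db}_series.$(date +%Y%m%d%H%M%S)" || return 1

  # Step 2: rebuild from the TSM data. INFLUX_INSPECT is a hypothetical
  # override for the binary path; it defaults to influx_inspect on PATH.
  "${INFLUX_INSPECT:-influx_inspect}" buildtsi \
    -database "$db" -datadir "$datadir" -waldir "$waldir"
}
```

For the test above, the call would be `rebuild_series dbcpu /home/influxdb/data /home/influxdb/wal /some/backup/dir`, where the backup directory is a placeholder. Running it as the same user that runs influxd should keep the regenerated files readable by the server.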

Excellent! Thanks, that’s what I was looking for.

I’m working in a somewhat volatile environment and it’s likely this will happen again. I’d like to have a procedure in place, ideally automated.