TSM Compaction loop

Every few weeks, InfluxDB gets stuck in a compaction loop that eats CPU and disk.

This is InfluxDB 2.0 (it has happened with multiple versions).

  • What is the fix? Is it possible to delete some files while keeping the data? So far the only solution has been to completely reload the database from the original source. When I deleted some of these possibly corrupted files, I ended up with missing data.
    ts=2021-05-03T13:57:34.631567Z lvl=info msg=“TSM compaction (start)” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group op_event=start
    ts=2021-05-03T13:57:34.631691Z lvl=info msg=“Beginning compaction” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group tsm1_files_n=4
    ts=2021-05-03T13:57:34.631695Z lvl=info msg=“Compacting file” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/root/.influxdbv2/engine/data/000000000002680-000000003.tsm
    ts=2021-05-03T13:57:34.631705Z lvl=info msg=“Compacting file” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/root/.influxdbv2/engine/data/000000000002712-000000003.tsm
    ts=2021-05-03T13:57:34.631708Z lvl=info msg=“Compacting file” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group tsm1_index=2 tsm1_file=/root/.influxdbv2/engine/data/000000000002744-000000003.tsm
    ts=2021-05-03T13:57:34.631711Z lvl=info msg=“Compacting file” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group tsm1_index=3 tsm1_file=/root/.influxdbv2/engine/data/000000000002776-000000003.tsm
    ts=2021-05-03T13:57:41.934475Z lvl=info msg=“Error compacting TSM files” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group error=EOF
    ts=2021-05-03T13:57:42.934667Z lvl=info msg=“TSM compaction (end)” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group op_event=end op_elapsed=8303.103ms
    ts=2021-05-03T13:57:43.631706Z lvl=info msg=“TSM compaction (start)” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group op_event=start
    ts=2021-05-03T13:57:43.631742Z lvl=info msg=“Beginning compaction” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group tsm1_files_n=4
    ts=2021-05-03T13:57:43.631745Z lvl=info msg=“Compacting file” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/root/.influxdbv2/engine/data/000000000002680-000000003.tsm
    ts=2021-05-03T13:57:43.631750Z lvl=info msg=“Compacting file” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/root/.influxdbv2/engine/data/000000000002712-000000003.tsm
    ts=2021-05-03T13:57:43.631754Z lvl=info msg=“Compacting file” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group tsm1_index=2 tsm1_file=/root/.influxdbv2/engine/data/000000000002744-000000003.tsm
    ts=2021-05-03T13:57:43.631758Z lvl=info msg=“Compacting file” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group tsm1_index=3 tsm1_file=/root/.influxdbv2/engine/data/000000000002776-000000003.tsm
    ts=2021-05-03T13:57:50.970879Z lvl=info msg=“Error compacting TSM files” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group error=EOF
    ts=2021-05-03T13:57:51.971082Z lvl=info msg=“TSM compaction (end)” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group op_event=end op_elapsed=8339.375ms
    ts=2021-05-03T13:57:52.631539Z lvl=info msg=“TSM compaction (start)” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group op_event=start
    ts=2021-05-03T13:57:52.631569Z lvl=info msg=“Beginning compaction” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group tsm1_files_n=4
    ts=2021-05-03T13:57:52.631573Z lvl=info msg=“Compacting file” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group tsm1_index=0 tsm1_file=/root/.influxdbv2/engine/data/000000000002680-000000003.tsm
    ts=2021-05-03T13:57:52.631576Z lvl=info msg=“Compacting file” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group tsm1_index=1 tsm1_file=/root/.influxdbv2/engine/data/000000000002712-000000003.tsm
    ts=2021-05-03T13:57:52.631579Z lvl=info msg=“Compacting file” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group tsm1_index=2 tsm1_file=/root/.influxdbv2/engine/data/000000000002744-000000003.tsm
    ts=2021-05-03T13:57:52.631594Z lvl=info msg=“Compacting file” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group tsm1_index=3 tsm1_file=/root/.influxdbv2/engine/data/000000000002776-000000003.tsm
    ts=2021-05-03T13:57:59.975213Z lvl=info msg=“Error compacting TSM files” log_id=0TbhIEYl000 service=storage-engine engine=tsm1 tsm1_level=3 tsm1_strategy=level op_name=tsm1_compact_group error=EOF
    ts=2021-05-03T1
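To see at a glance which files keep failing, the repeated pattern in a log like the one above can be summarised with a short script. This is only a minimal sketch: the function name `failing_compaction_groups` is made up for illustration, it relies solely on the `op_event=start`, `tsm1_file=` and `error=` fields visible in the log lines above, and it assumes that only one compaction group is logged at a time (interleaved compactions would need to be separated first).

```python
import re
from collections import Counter

# Field patterns taken from the log lines shown in this thread.
FILE_RE = re.compile(r"tsm1_file=(\S+)")
ERROR_RE = re.compile(r"\berror=(\S+)")


def failing_compaction_groups(lines):
    """Count how often each group of TSM files ends in a compaction error.

    Assumes compaction groups are not interleaved in the log.
    Returns a Counter mapping a tuple of file paths to a failure count.
    """
    failures = Counter()
    current = []  # files listed since the most recent "op_event=start"
    for line in lines:
        if "op_event=start" in line:
            current = []
            continue
        file_match = FILE_RE.search(line)
        if file_match:
            current.append(file_match.group(1))
            continue
        if ERROR_RE.search(line) and current:
            failures[tuple(current)] += 1
            current = []
    return failures
```

Feeding it the lines of the server log (for example `failing_compaction_groups(open(path))`, where `path` is wherever your influxd output is captured) gives one entry per distinct file group. A group with a high count is a candidate for the stuck loop, though the script cannot tell which file in the group is the damaged one.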

Any chance of an update here, or a pointer to documentation on tsm files?

@Frans_Coetzee there’s a tool called influx_inspect in the 1.8.x packages that hasn’t been fully migrated into 2.0.x yet. If you pull that tool out of the latest 1.8.x, you can run:

influx_inspect verify -dir path/to/data

where path/to/data is the path to your root data directory. This is an offline tool that recursively scans the data directory for .tsm files and checks that each one is valid. Salvaging data from a corrupt .tsm file is challenging and not always practical, but you can move the corrupt files aside to get the database back into a healthy state.
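If several files need to be moved aside, it may help to script the step so the files are kept rather than deleted and can be put back later. The following is a rough sketch, not an official tool: `quarantine_tsm_files` and both directory arguments are hypothetical names, the list of bad files is something you supply yourself from the verify results, and it assumes influxd is stopped while it runs.

```python
import shutil
from pathlib import Path


def quarantine_tsm_files(files, data_root, quarantine_root):
    """Move the given files out of data_root, keeping their relative paths.

    Assumes influxd is stopped. Nothing is deleted, so a file can be
    restored by moving it back to the same relative location.
    Returns the list of new locations.
    """
    data_root = Path(data_root).resolve()
    quarantine_root = Path(quarantine_root).resolve()
    moved = []
    for name in files:
        src = Path(name).resolve()
        # Raises ValueError if the file is not under data_root.
        rel = src.relative_to(data_root)
        dest = quarantine_root / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

Keeping the relative path under the quarantine directory preserves the information about which part of the data directory each file came from, which matters if you later want to work out what time range might be affected.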

Documentation on the .tsm file format can be found here.

Thanks.

Just to be clear, this corruption has happened twice now; it seems to happen when the server is rebooted.

From the documentation, it is not clear to me what the downside of removing a corrupted tsm file would be. The docs seem to say that the WAL files (or other source files) are only removed once a new tsm file has been successfully written and checksummed. Hence, removing a corrupted file should be safe: compaction will just restart from the (still present) source files.

This time, for example, I found the corrupted file and moved it, so it should be OK. But last time I also ended up removing a corrupted file, then found that test queries failed and data was definitely lost. In the end I had to completely rebuild the whole database from raw CSV files, a massive compute load.

Can you (a) clarify how I can be sure data is not missing, or (b) tell me how to figure out what was missing? It seems pretty bad if the DB is not ACID.