Startup of Influx is really slow

My InfluxDB runs in AWS. I restored a snapshot to an EBS volume, and now InfluxDB takes around 20 minutes to start. What’s wrong here and how can I correct it?

This is what my logs look like while the db starts:

Feb 28 20:00:05 ip-10-110-113-252 influxd: ts=2020-02-28T20:00:05.782589Z lvl=info msg="Opened file" log_id=0LFLugF0000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/telegraf/autogen/230/000032814-000000007.tsm id=0 duration=209.248ms
Feb 28 20:00:06 ip-10-110-113-252 influxd: ts=2020-02-28T20:00:06.084226Z lvl=info msg="Opened file" log_id=0LFLugF0000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/telegraf/autogen/230/000032814-000000008.tsm id=1 duration=301.567ms
Feb 28 20:00:07 ip-10-110-113-252 influxd: ts=2020-02-28T20:00:07.158444Z lvl=info msg="Opened file" log_id=0LFLugF0000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/telegraf/autogen/212/000026914-000000007.tsm id=2 duration=2860.066ms
Feb 28 20:00:07 ip-10-110-113-252 influxd: ts=2020-02-28T20:00:07.901511Z lvl=info msg="Opened file" log_id=0LFLugF0000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/telegraf/autogen/230/000032814-000000009.tsm id=2 duration=1817.226ms
Feb 28 20:00:09 ip-10-110-113-252 influxd: ts=2020-02-28T20:00:09.388456Z lvl=info msg="Opened file" log_id=0LFLugF0000 engine=tsm1 service=filestore path=/var/lib/influxdb/data/telegraf/autogen/230/000032814-000000010.tsm id=3 duration=2229.940ms

Snapshot restore is lazy. AWS reports the restore as completed while it continues to restore data in the background and on demand. As InfluxDB starts up, it reads all shards, so it needs them to be actually restored before it can start. You can either wait for the restore to actually complete, by monitoring I/O utilization until it drops to zero, or just wait for InfluxDB to start.
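If you would rather force the restore along before starting InfluxDB, one option is to read the data files once yourself so the lazily restored data gets pulled in. Below is a minimal sketch of that idea in Python. The function name `warm_directory` and the chunk size are made up for illustration, and it assumes that reading only the files under the data directory (rather than every block on the volume) is enough for InfluxDB's startup.

```python
import os


def warm_directory(root, chunk_size=1024 * 1024):
    """Read every file under `root` once and return the number of bytes read.

    Hypothetical helper: the reads themselves are the point, since they
    force lazily restored data to be fetched onto the volume.
    """
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while True:
                        chunk = f.read(chunk_size)
                        if not chunk:
                            break
                        total_bytes += len(chunk)
            except OSError:
                # Skip files that disappear or cannot be read.
                continue
    return total_bytes
```

With the data path from the logs above, this would be called as `warm_directory("/var/lib/influxdb/data")` while InfluxDB is stopped. The total time should be similar to what InfluxDB itself would spend waiting, but the database is not stuck half-started in the meantime.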

I also wanted to note that 20 minutes is not bad. Our instance takes almost an hour to start (and not after a restore), so we have to run a separate instance for a short retention policy with just two shards, and have a middle layer join its results with those from the historical instance of InfluxDB, if that one is up and running. I wish InfluxDB were as lazy as AWS and started the HTTP service right after scanning the latest shard.
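As a rough illustration of the joining step such a middle layer performs, here is a hypothetical sketch in Python. It is not the setup described above: the name `merge_results` and the `(timestamp, value)` point format are assumptions, and fetching the points from each instance is left out entirely.

```python
def merge_results(recent_points, historical_points=None):
    """Combine (timestamp, value) points from two instances.

    `historical_points` may be None when the historical instance is
    still starting. On duplicate timestamps the recent instance wins.
    """
    merged = {}
    for ts, value in (historical_points or []):
        merged[ts] = value
    for ts, value in recent_points:
        merged[ts] = value
    # Return the points ordered by timestamp.
    return sorted(merged.items())
```

Passing `None` for the historical side returns only the recent points, which matches the "if it is up and running" fallback.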