Failure when restoring backup - Influx 2

Hello,

I’m currently having problems restoring a backup from InfluxDB version 2.3.

To generate the backup file I use the following command: influx backup $DIR --skip-verify

The database has crashed, and didn’t restore itself. So, I had to install it again.

After setting up users and tokens, I perform the following command:

influx restore $DIR --full --skip-verify

The restoration process starts; however, after a while it fails with error 500.

The logs are not very helpful in this case, since the file itself exists in the folder.

After the restore, I changed the CLI token; however, the error is always the same.

I have made different backups at different times, and none of them could be restored using the --full argument.

Any thoughts on what might be the problem?

Hello @giuliano.lm,
I’ve asked the team. This is what I’ve heard so far:

This jumped out at me.
To generate the backup file I use the following command: influx backup $DIR --skip-verify
The database has crashed, and didn’t restore itself.
I read that as the process crashing while backing up; if that’s the case, I guess it wouldn’t be terribly surprising that things aren’t working. Maybe that’s not what they meant, though.

Have you tried since? Is it working now? I’ll keep commenting here if I get any more answers.

Hello, @Anaisdg

The backup process itself runs fine in my case. We can generate the backup files, manifests, etc.

The failure occurs during the restoration process. The problem remains.

Another approach I tried was omitting the --full argument: influx restore $DIR --skip-verify

In that case, the RAM usage gets quite high, and an “unexpected EOF” error appears. As a consequence, the DB crashes.

With this second approach, if I allow enough time between restoring each bucket (I would imagine for the cache to be written), the DB does not crash.

Our DB has 8GB of RAM, so I expected that not to be a problem. Anyway, one solution that I’m testing is simply increasing the available RAM for that server.

If you need a more detailed description of the problem, let me know!

Thank you for your attention.

Regards.

@giuliano.lm,
Would you be willing to share your backup? You can leave out the kv and sqlite files to avoid sharing any creds, but having a look at the shards/manifest would hopefully make it much easier to reproduce/debug.

@Anaisdg

I cannot share my entire backup with you, but I can share a specific bucket. I hope you will be able to replicate the error when using the “--full” argument.

Please, let me know the best way to share the file with you.

Regarding the second error I mentioned (unexpected EOF, followed by a DB crash), it disappeared after we increased our available RAM.

A specific bucket will work! Thank you! You can message me directly too if you prefer.

Do you have any updates on this? I have what appears to be the same problem.

I’m running InfluxDB OSS 2.4. I have tried a couple of backups, but I’m experiencing the same issue with both.

Thank you.

Hello, @MattiasB

I’ve sent them the bucket, but they couldn’t replicate the error.

In my case, I didn’t really need to restore everything from the files (--full), so as a partial solution I’m now just restoring bucket by bucket.
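
As a rough illustration of the bucket-by-bucket restore described above, here is a minimal shell sketch. The function name restore_buckets, the pause argument, and the bucket names in the example call are hypothetical. It assumes the influx CLI is already configured with a host, org, and token; add flags such as --skip-verify if your setup needs them, as in the commands earlier in this thread.

```shell
#!/bin/sh
# Restore the named buckets one at a time from a single backup directory.
# Usage: restore_buckets BACKUP_DIR PAUSE_SECONDS BUCKET...
# (hypothetical helper, not part of the influx CLI)
restore_buckets() {
    backup_dir="$1"
    pause="$2"
    shift 2
    for bucket in "$@"; do
        echo "Restoring bucket: $bucket"
        # Stop at the first failure so a broken restore is not silently skipped.
        influx restore "$backup_dir" --bucket "$bucket" || return 1
        # Give the server some time to settle before the next bucket.
        sleep "$pause"
    done
}

# Example call (bucket names are placeholders):
# restore_buckets "$DIR" 300 sensors metrics
```

The pause length is a guess that has to be tuned per server; a fixed delay only approximates "enough time between restoring each bucket".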

Hope it helps!

Hi Anais, I have a problem. First, I made a full backup of InfluxDB 2 on a Windows 10 PC (64 GB). Then I ran the setup on a Windows 10 64-bit server, using the same user, password, organization, and primary DB name. Next, I ran influx.exe restore influx --full for a full restore. When the process finished, I restarted InfluxDB. Everything looked good, but when I tried to view the data, to my surprise I couldn’t see any data in the InfluxDB Data Explorer.

We found a workaround for this issue which was working up until recently. It used one bash script to iterate over the buckets in InfluxDB 2.4 and back them up individually, then a separate bash script to restore each one individually to the backup server, which we also upgraded to 32 GB of RAM. Now, however, something appears to be causing a peak in RAM usage even during this improved procedure, and the kernel kills InfluxDB as the OS runs out of memory. This is a major problem for us, as we have no guarantee that the backups we have of all of our data could be restored in an emergency.
From the timestamps in the system log and the restore procedure I’m unsure if the OOM kill is being caused by a corrupted backup file or if the memory usage is peaking and then causing the “Unexpected EOF”.
We’d be really interested to hear from anyone who may have found a solution to this issue, perhaps by tuning certain InfluxDB configuration options.
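
For readers who want to try the per-bucket backup part of this workaround, a minimal sketch could look like the following. This is not the script referred to above; backup_buckets and the bucket names in the example call are hypothetical, and it assumes the influx CLI already has its host, org, and token configured. The bucket names can be looked up with influx bucket list.

```shell
#!/bin/sh
# Back up each named bucket into its own subdirectory of BACKUP_ROOT.
# Usage: backup_buckets BACKUP_ROOT BUCKET...
# (hypothetical helper, not part of the influx CLI)
backup_buckets() {
    backup_root="$1"
    shift
    for bucket in "$@"; do
        # One directory per bucket keeps each backup self-contained.
        mkdir -p "$backup_root/$bucket" || return 1
        influx backup "$backup_root/$bucket" --bucket "$bucket" || return 1
    done
}

# Example call (bucket names are placeholders):
# backup_buckets /var/backups/influx sensors metrics
```

Keeping each bucket in a separate directory means a single bucket can later be restored or shared without touching the rest.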

Update: I added a line to our influx restore bash script that waits until used memory drops below 70% between individual bucket restores. Memory usage can reach as high as 96% during or just after a bucket restore, depending on bucket size, and if I don’t use this throttle I get the OOM kill issue again. Based on this, I think we can conclude that the issue is with Influx’s memory management. I don’t understand exactly how Influx indexes its data, but perhaps this process takes a long time and consumes a lot of memory after each bucket restore? With the throttle in place, the restore procedure completes but takes a lot longer, as it can take up to 3 hours for memory usage to drop back down to the required level.
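
A memory throttle like the one described in this update might be sketched as follows. This is an assumption about how such a wait could be written, not the actual line from the script: it is Linux-only, it defines "used" as MemTotal minus MemAvailable from /proc/meminfo, and the function names used_mem_pct and wait_for_memory are hypothetical.

```shell
#!/bin/sh
# Print used memory as an integer percentage, read from a meminfo-style file.
# "Used" here means MemTotal minus MemAvailable (an assumption of this sketch).
used_mem_pct() {
    awk '/^MemTotal:/ {t = $2}
         /^MemAvailable:/ {a = $2}
         END {if (t > 0) printf "%d\n", (t - a) * 100 / t}' "${1:-/proc/meminfo}"
}

# Block until used memory is below THRESHOLD percent.
# Usage: wait_for_memory [THRESHOLD] [MEMINFO_FILE] [POLL_SECONDS]
wait_for_memory() {
    threshold="${1:-70}"
    meminfo="${2:-/proc/meminfo}"
    interval="${3:-30}"
    while [ "$(used_mem_pct "$meminfo")" -ge "$threshold" ]; do
        sleep "$interval"
    done
}

# Example: call between individual bucket restores.
# wait_for_memory 70
```

Polling MemAvailable rather than free memory avoids counting page cache that the kernel can reclaim, which may or may not match how the 70% figure above was measured.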