Hello, there!
I am having an intermittent problem with my nightly InfluxDB backup. Sometimes a single shard fails to back up, and it appears to be the last shard in the backup. I have observed this in versions 1.7.9 and 1.8.2.
After 10 failed attempts on that shard, the backup gives up on it and continues, but the .manifest file is never written.
I am invoking the backup with this command line:
influxd backup -host 127.0.0.1:8088 -portable /path/to/backup_folder
Here is the error message from the influxd output:
...
2020/09/28 04:57:03 backing up db=telegraf rp=autogen shard=635 to /bigdisk1/influxdb_backup/influxbackup_10.42.133.111/telegraf.autogen.00635.00 since 0001-01-01T00:00:00Z
2020/09/28 04:59:42 backing up db=telegraf rp=autogen shard=645 to /bigdisk1/influxdb_backup/influxbackup_10.42.133.111/telegraf.autogen.00645.00 since 0001-01-01T00:00:00Z
2020/09/28 04:59:42 Download shard 645 failed copy backup to file: err=<nil>, n=0. Waiting 2s and retrying (0)...
2020/09/28 04:59:44 Download shard 645 failed copy backup to file: err=<nil>, n=0. Waiting 2s and retrying (1)...
2020/09/28 04:59:46 Download shard 645 failed copy backup to file: err=<nil>, n=0. Waiting 2s and retrying (2)...
2020/09/28 04:59:48 Download shard 645 failed copy backup to file: err=<nil>, n=0. Waiting 2s and retrying (3)...
2020/09/28 04:59:50 Download shard 645 failed copy backup to file: err=<nil>, n=0. Waiting 2s and retrying (4)...
2020/09/28 04:59:52 Download shard 645 failed copy backup to file: err=<nil>, n=0. Waiting 2s and retrying (5)...
2020/09/28 04:59:54 Download shard 645 failed copy backup to file: err=<nil>, n=0. Waiting 3.01s and retrying (6)...
2020/09/28 04:59:57 Download shard 645 failed copy backup to file: err=<nil>, n=0. Waiting 11.441s and retrying (7)...
2020/09/28 05:00:08 Download shard 645 failed copy backup to file: err=<nil>, n=0. Waiting 43.477s and retrying (8)...
2020/09/28 05:00:52 Download shard 645 failed copy backup to file: err=<nil>, n=0. Waiting 2m45.216s and retrying (9)...
2020/09/28 05:03:37 error (copy backup to file: err=<nil>, n=0) when backing up db: telegraf, rp autogen, shard 645. continuing backup on remaining shards
2020/09/28 05:03:37 backup failed: copy backup to file: err=<nil>, n=0
backup: copy backup to file: err=<nil>, n=0
...
It works most days, and the failures seem random. I have seen this on more than one InfluxDB server.
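Since the missing .manifest file is the visible symptom, a small wrapper around the backup command could flag the bad nights automatically. The following is only a sketch under some assumptions: the function names `list_manifests` and `run_backup` are made up for illustration, the manifest is assumed to be a file ending in `.manifest` directly inside the backup folder, and I have not confirmed that influxd exits with a non-zero status in this failure case, so the check does not rely on the exit code alone.

```python
import subprocess
from pathlib import Path


def list_manifests(backup_dir):
    """Return the names of all *.manifest files currently in backup_dir."""
    return {p.name for p in Path(backup_dir).glob("*.manifest")}


def run_backup(backup_dir, host="127.0.0.1:8088", runner=subprocess.run):
    """Run the portable backup and report whether it looks complete.

    A run counts as complete only if the command exits with status 0 AND
    a manifest file appears that was not there before the run.
    `runner` is a hypothetical hook so the check can be tested without
    influxd; by default it is subprocess.run.
    """
    before = list_manifests(backup_dir)
    # Same command line as in the post, with the folder passed in.
    cmd = ["influxd", "backup", "-host", host, "-portable", str(backup_dir)]
    result = runner(cmd)
    new_manifests = list_manifests(backup_dir) - before
    return result.returncode == 0 and bool(new_manifests)
```

A nightly job could call `run_backup("/path/to/backup_folder")` and send an alert or schedule another attempt when it returns `False`. This only detects the problem; it does not explain why the shard download returns `n=0`.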
Has anybody else seen this?