[InfluxDB] Imprecise backup time on 1.5.2 (also question about duplicates)

Hi everybody,

I have some questions about how the new online backup/restore procedure deals with time in InfluxDB 1.5.2.

I made a backup of everything on my main InfluxDB instance (about 1 year / 30 GB of data), limiting the data to midnight of today; here is my backup command (executed from another host):

influxd backup -portable -host myPrimaryDB:8090 -end 2018-05-08T00:00:00Z /influxdb/backup

The target host is a “clone” of the first one, so the OS, the package versions, and the InfluxDB version are all the same between the two.

After the backup was done, I proceeded to restore with this:

influxd restore -portable -host 127.0.0.1:8090 /influxdb/backup

Once the restore was done, I saw that the databases were all restored, along with the retention policies and everything else. Great.
Except that, on the restored database, I found data newer than the end date given in the backup command:

> select * from swap order by time desc limit 1
name: swap
time                 free       host     in       out       total      used      used_percent
----                 ----       ----     --       ---       -----      ----      ------------
2018-05-08T01:35:13Z 4112416768 myhost   45965312 411484160 4294963200 182546432 4.250244379276637

Also, the timestamp of the last copied point isn’t identical across all measurements; on other databases I see different timestamps as the last value:

> select * from swap order by time desc limit 1
name: swap
time                 free       host          in      out      total      used    used_percent
----                 ----       ----          --      ---      -----      ----    ------------
2018-05-08T01:11:35Z 1065242624 anotherhost   1282048 12476416 1073737728 8495104 0.7911712309693564

So, I basically have two questions:

  1. How does the InfluxDB backup handle time? Why is there data newer than what I was expecting?
  2. If I ignore that behaviour and perform another backup starting from today at 00:00:01, restoring it onto the same InfluxDB instance, how will duplicate data be handled during the restore process?

Hoping someone has answers to my weird questions, thanks in advance,
Matteo

Hi Matteo,

  1. Backups do not scan the data row by row; they copy labeled batches called blocks. Each block is annotated with a start/end time range, and if a block overlaps your -end 2018-05-08T00:00:00Z boundary, the whole block is backed up. This is what produces the extra data you are seeing after a restore. It works this way because the blocks are highly compressed; extracting them to inspect each row would create both a computational and a disk-space burden on the running system.

This also explains why you are getting a different ‘final’ time in all of your databases, since block boundaries will not be regularly spaced.
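
If you want a rough idea of how far past the -end boundary the restored data runs, a simple check on the restored instance is to count the points after the cutoff. This is just a sketch reusing the swap measurement and the timestamp from your example; adjust the names to your own databases:

> select count(*) from swap where time >= '2018-05-08T00:00:00Z'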

  2. Duplicate data points will overwrite the existing data. So if they were incidentally imported during the previous restore, they will simply be written again.
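
To illustrate with line protocol (hypothetical values, not taken from your data): if the same series and timestamp are written twice, the second write just replaces the field values of the first, so re-restoring overlapping points never produces duplicate rows:

swap,host=myhost used_percent=4.25 1525743313000000000
swap,host=myhost used_percent=4.25 1525743313000000000

Both lines target the same (measurement, tag set, timestamp), so after the second write there is still exactly one point at 2018-05-08T01:35:13Z.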

Let me know if you have any more questions.

We will enhance the documentation to highlight this – somewhat unexpected – outcome. Apologies for any confusion this may have caused.

@aanthony1243

Backups do not scan the data row-by-row but in labeled batches called blocks… [cut]

Thanks for the explanation, this behaviour completely explains what is happening (and it also makes total sense from a performance perspective). Now I’m curious: how are these labeled batches generated? Do they use the shards as “blocks”, or some other logic? Is there any way to tune block boundaries (or size) to further optimize backup/restore operations?

Duplicate data points will over-write the existing data. So if they were incidentally imported on the previous restore, they’ll be written again

Great news, but now I have another question: if on the source InfluxDB I am missing (for some reason) data in a particular time range (e.g. 5 minutes) inside a bigger batch, and on the destination I do have data for that range, will a restore delete the data on the second InfluxDB, or will I find some kind of delta as a result? This isn’t a real situation, but knowing in advance how things work can be handy during daily operations.

@tim.hall

We will enhance the documentation to highlight this – somewhat unexpected – outcome. Apologies for any confusion this may have caused.

Tim, there is nothing to apologize for here. By the way, more detailed documentation on this topic would help future users who run into the same behaviour, so thanks in advance if your team puts some effort into it.

Thanks,
Matteo

Hi Matteo,

A shard is made up of many blocks. Block size is not configurable.
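
If you are curious what those blocks look like on disk, the influx_inspect tool can dump them from a shard's TSM files. The path below is only a placeholder following the default data directory layout; substitute your own database, retention policy, and shard id:

influx_inspect dumptsm -blocks /var/lib/influxdb/data/<db>/<rp>/<shard_id>/000000001-000000001.tsm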

For your second question, the general rule is that InfluxDB uses a last-write-wins tie breaker. If two writes have the same (time, retention policy, measurement, [tags]), they are considered a duplicate row and the newer write will replace the older one. Since you would be using the online restore in this hypothetical scenario, the existing points on the target database will remain in place, because there is no data in the source backup to overwrite that range.
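
If you ever want to double-check that on the target, a quick sketch (reusing the swap measurement and a made-up 5-minute window; adjust to your own data) is to count the points in the gap before and after the restore. Since a restore only writes points and never issues deletes, the two counts should match:

> select count(used_percent) from swap where time >= '2018-05-07T10:00:00Z' and time < '2018-05-07T10:05:00Z'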

Many many thanks @aanthony1243 for the clarifications.

Regards.
Matteo