Performance of copying a measurement into another retention policy

I have a 200 GB hosted InfluxDB instance, and one of its measurements needs to be retained longer, so I need to copy it to another retention policy. I have already changed my application to do two HTTP POSTs, the second one containing only the measurement destined for the longer retention policy. Now I need to copy the existing data.

After some Googling, I found this:

select * into "fouryears"."51" from "autogen"."51" group by *;

I tested this on a dev DB and it seems to work.

The live database contains about 1.5 billion (1,529,039,108) points in “51”. I’m a bit reluctant to run that query, also because I never seem to be able to abort running queries.

Is this indeed the best way to go about it?

Hi @wiebeytec,

When you say “200GB hosted InfluxDB”, do you mean on InfluxCloud, or hosted on your own infrastructure?

Did you know you can also change the retention policy, instead of copying into another? So this could be a better option for you, if you don’t want to duplicate a whole host of data.
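For reference, the retention-policy management commands in InfluxQL 1.x look like the following sketch; the database name "mydb" and the durations are assumptions, so adjust them to your setup ("fouryears" matches the policy name from the question):

```sql
-- Create a longer retention policy alongside the default one
-- (REPLICATION is required in CREATE RETENTION POLICY)
CREATE RETENTION POLICY "fouryears" ON "mydb" DURATION 208w REPLICATION 1

-- Or change the duration of an existing policy in place
ALTER RETENTION POLICY "autogen" ON "mydb" DURATION 26w
```

Note that altering a policy changes the duration for everything stored under it, which is why it doesn't help when only a single measurement needs the longer retention.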

Thanks,
David

Hi @rawkode,

I meant an InfluxCloud-hosted instance, with 200 GB of 512 GB used. Our default retention on the autogen policy is 6 months.

My issue is that I want one out of hundreds of measurements to be retained longer. As far as I know, this can’t be easily done in Influx, because a retention policy is a namespace / schema of sorts. Hence the proposed solution.

I’m just worried it will take a very long time and cause excessive load or disk usage. Because I can’t easily clone the DB elsewhere to test, I was hoping someone could put my worries to rest.

You could use Kapacitor to run your data in batches and move it to a secondary retention policy.

We tend to collect 30 days of raw data and roll this up every 15 minutes into 5 minute chunks and store the aggregated data for 90 days.

Kapacitor as a CQ Engine

You could use InfluxDB to run the CQ, but this adds extra strain to the Influx instance. Kapacitor will do the same job while putting less strain on it; it would only impact the instance slightly when the batch task runs, depending on how many data points you query with each batch. Kapacitor will also let you run more complex math functions on your data, which CQs cannot.

This way you could just define a TICK script, which can be edited as and when you need to. That is another advantage over CQs, as those cannot be modified once they are active; they need to be deleted and recreated. Plus, the syntax is a git.
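As a sketch (not a drop-in script), a batch TICKscript for the kind of rollup described above might look like this; the database name "mydb", the field "value", and the task timings are assumptions:

```tickscript
// Hedged sketch: roll raw points up into 5-minute means
// and write them to a longer retention policy.
batch
    |query('''
        SELECT mean("value") AS "value"
        FROM "mydb"."autogen"."51"
    ''')
        .period(15m)
        .every(15m)
        .groupBy(time(5m), *)
    |influxDBOut()
        .database('mydb')
        .retentionPolicy('fouryears')
        .measurement('51')
```

Because it is a separate file defined on the Kapacitor side, you can stop, edit, and re-enable the task without touching the Influx instance itself.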

I’m not sure you can abort the query once you run it; it will either complete or cause your InfluxDB to OOM and fall over.

I decided to split up the back-filling into multiple queries, each spanning one week; I simply generated them with a bash script. Each query takes about 10-15 minutes, for roughly 65 million points per week. Looking at the monitoring, it doesn’t impact the instance much.
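A minimal sketch of such a generator script, assuming GNU date; the start date and chunk count are placeholders to adjust to the actual time range of the data:

```shell
#!/usr/bin/env bash
# Hedged sketch: emit one week-sized SELECT ... INTO query per chunk,
# so the backfill runs as many small queries instead of one huge one.
gen_backfill_queries() {
  local start="$1" weeks="$2" i from to
  for ((i = 0; i < weeks; i++)); do
    # Compute the week boundaries (requires GNU date's -d relative syntax)
    from=$(date -u -d "$start + $((i * 7)) days" +%Y-%m-%d)
    to=$(date -u -d "$start + $(((i + 1) * 7)) days" +%Y-%m-%d)
    echo "SELECT * INTO \"fouryears\".\"51\" FROM \"autogen\".\"51\" WHERE time >= '$from' AND time < '$to' GROUP BY *;"
  done
}

# Print the queries; pipe them to the influx CLI to actually run them.
gen_backfill_queries "2017-01-01" 4
```

Bounding each query with `time >= ... AND time < ...` keeps the chunks non-overlapping, and printing the statements first lets you inspect them before feeding them to the CLI.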