Move data from Influx V1 to Influx V2

We have several instances running V1 that are collecting data, and we cannot upgrade them at the moment.
We set up a new machine running V2 and would like to import all data from the various V1 instances into it, so that everything is available in the new version. influxd upgrade is not an option because the instances run on different machines.
We tried exporting the databases as CSV, but on V1 this operation consumes all the RAM and swap, so it is not practical.

Would it be possible to create a db backup in V1 and import it into V2?

What is the suggested way to do that?

Thanks

Hello @mirkocomparetti,
Since we don’t have an easy button to do imports into our InfluxDB Cloud I wrote out some suggested paths below. Something to be aware of is the rate limits.

1) Dual Writing

I would suggest sending the data from your clients to both the cloud instance and your existing local instance. That way you can make sure the cloud has the buckets, data, format, schema, etc. that you want. It is easier to get this right at the beginning since we are schema-on-write.

2) Export the data in line protocol to a .csv file on the local machine. (This is usually done via a client library that queries all the data you want to move and saves the output to a file.) See step 9 here for a basic example: Upgrade from InfluxDB 2.0 beta to InfluxDB 2.0 (stable) | InfluxDB OSS 2.0 Documentation (influxdata.com)

3) Break the file export into smaller chunks

Prepare the file for upload

We will now split that csv export file into 1000 chunks. If, for example, we overrun a cardinality limit, our writes will be rejected. If that happens 20% of the way through uploading a single file, we don’t really have an easy way to know where we got to in order to try again once the limit is increased. By splitting the file into a number of smaller chunks we can resume from where we left off.

We will use the Linux split tool to do this for us. It can split a file into a number of chunks while keeping every line of LP whole. If we arbitrarily split the file on, say, a number of bytes, we could end up with truncated LP lines. Create a directory called splits to hold the split files.

split -n l/1000 -d <export file> ../splits/

This will take some time and will generate files named 000 to 999.

Once complete you could delete the larger csv file if you wanted. Probably safer not to though, not yet at least.
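
As a rough sketch of the split step, the function below creates the target directory, splits the export on line boundaries and then checks that no lines were lost. The function name split_export, its arguments and the directory layout are made up for illustration; the -n l/N option needs GNU coreutils split.

```shell
#!/usr/bin/env bash
# Sketch only: split_export and its arguments are illustrative names,
# not part of any InfluxDB tooling.

split_export() {
  local export_file="$1"      # path to the exported line protocol file
  local out_dir="$2"          # directory that will hold the chunks
  local chunks="${3:-1000}"   # number of chunks to create

  mkdir -p "$out_dir" || return 1

  # -n l/N : create N chunks without ever splitting inside a line (GNU split)
  # -d     : use numeric suffixes for the chunk file names
  split -n "l/${chunks}" -d "$export_file" "${out_dir}/" || return 1

  # Sanity check: all chunks together must hold as many lines as the source.
  local src_lines chunk_lines
  src_lines=$(wc -l < "$export_file")
  chunk_lines=$(cat "$out_dir"/* | wc -l)
  [ "$src_lines" -eq "$chunk_lines" ]
}
```

Called as split_export export.lp ./splits 1000, it returns a non-zero status if the line counts differ, which is a cheap check to run before deleting anything.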

4) Push the data into InfluxDB Cloud in small chunks.

Upload the chunks to C2

Now you have your data in 1000 chunks you can start to upload each chunk to Cloud 2. We can use the 2.0 influx CLI tool to do this for us as it allows you to pass the name of a file containing Line Protocol to be uploaded. In order to do this you will need:

  • The org_id of the destination org (here represented by $ORG_ID )
  • The bucket name of the destination bucket (here represented by $BUCKET_NAME )
  • A valid write token (here represented by $TOKEN )
  • The URL of the destination cluster (Represented by $HOST )

Run the influx command from the 2.0 OSS directory you created previously.

./influx write -b "$BUCKET_NAME" --host "$HOST" --org-id "$ORG_ID" --token "$TOKEN" -f <a chunk of the file>

This command will upload the specified chunk file. It’s worth testing with a single chunk file first. If everything works then the fastest way to upload all 1000 chunks is to use a bash script to loop through the files and parallelize the uploads.

Create a bash script upload_chunk.sh with this code:

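#!/bin/bash
# BUCKET_NAME, HOST, ORG_ID and TOKEN must be exported by the calling shell.
echo "Uploading chunk: $1"
<path_to_influx_2.0 command>/influx write -b "$BUCKET_NAME" --host "$HOST" --org-id "$ORG_ID" --token "$TOKEN" -f "$1"
status=$?
echo "Finished chunk: $1 (exit code $status)"
# Pass the exit code on so that parallel can detect a failed upload.
exit $status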

This is a very basic example; you might like to add better logging. For example, if the exit code from influx write is 0, we know that the upload completed successfully. This script uploads only a single chunk, which is passed in as the first argument $1. We will use the GNU parallel command to call this script repeatedly and in parallel. This will significantly reduce the time taken to upload all 1000 chunks.
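
One way the logging could be extended is sketched below: each upload writes a timestamped START line followed by OK or FAIL to a log file, and the exit code of influx write is returned to the caller. INFLUX_BIN and LOG_FILE are hypothetical variable names introduced for this sketch; BUCKET_NAME, HOST, ORG_ID and TOKEN are assumed to be exported already.

```shell
#!/usr/bin/env bash
# Sketch only: INFLUX_BIN and LOG_FILE are hypothetical names for this example.
# BUCKET_NAME, HOST, ORG_ID and TOKEN are assumed to be exported.

upload_chunk() {
  local chunk="$1"
  local influx="${INFLUX_BIN:-influx}"   # path to the 2.0 influx CLI
  local log="${LOG_FILE:-upload.log}"    # where results are recorded
  local status

  echo "$(date -u +%FT%TZ) START $chunk" >> "$log"

  # --org-id expects an organization ID; use --org if you have the org name.
  "$influx" write -b "$BUCKET_NAME" --host "$HOST" --org-id "$ORG_ID" \
    --token "$TOKEN" -f "$chunk"
  status=$?

  if [ "$status" -eq 0 ]; then
    echo "$(date -u +%FT%TZ) OK $chunk" >> "$log"
  else
    echo "$(date -u +%FT%TZ) FAIL($status) $chunk" >> "$log"
  fi

  # Return the real exit code so the caller can react to failures.
  return "$status"
}
```

To use it as a script, add upload_chunk "$1" as the last line of the file, so that the script exits with the status of the upload.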

find <path to chunks> -type f | parallel --halt soon,fail=1 --joblog /tmp/parallel.log --jobs 14 -I% --max-args 1 ./upload_chunk.sh %

This will find all the chunk files, excluding directories (e.g. . and .. ), and pass the file names to the parallel command. --halt soon,fail=1 tells parallel to stop calling the upload_chunk.sh script once an error is reported but allow the existing jobs to finish first. --joblog is where parallel will log. --jobs 14 will run 14 uploads in parallel; this can result in about 40k points per second hitting the cluster and does not seem to stress the clusters too much. On a busy cluster such as us-west-2 you might want to scale that back to 7. Watch TTBR on the destination cluster: if it starts to climb as you upload, you are pushing too fast. -I% tells parallel to substitute the % symbol at the end of the line with the filename that was passed in. --max-args 1 tells parallel that the upload_chunk.sh script only accepts one argument.

When you run the find and pipe it through to parallel you can tail the job log to see what’s going on. Your data will now start uploading. If it fails for any reason, e.g. you breach a cardinality limit, you could use the job log to delete the completed chunks and then rerun the upload.
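
A possible way to use the job log for resuming is sketched below. It assumes the standard tab-separated GNU parallel joblog (Exitval in column 7, Signal in column 8, Command in column 9), that the chunk path is the last word of the command, and that chunk paths contain no whitespace. The function names are made up for this example, and it moves completed chunks aside instead of deleting them, which is a more cautious variation on the suggestion above.

```shell
#!/usr/bin/env bash
# Sketch only: completed_chunks and archive_completed are illustrative names.
# Assumes the GNU parallel joblog layout:
# Seq Host Starttime JobRuntime Send Receive Exitval Signal Command

# Print the chunk path of every job that finished with exit code 0.
completed_chunks() {
  local joblog="$1"
  awk -F'\t' 'NR > 1 && $7 == 0 && $8 == 0 {
    n = split($9, words, " ")   # Command looks like: ./upload_chunk.sh <chunk>
    print words[n]              # the chunk path is the last word
  }' "$joblog"
}

# Move successfully uploaded chunks into a separate directory.
archive_completed() {
  local joblog="$1" done_dir="$2"
  mkdir -p "$done_dir" || return 1
  completed_chunks "$joblog" | while IFS= read -r chunk; do
    # Move rather than delete, so nothing is lost if the log was misread.
    if [ -f "$chunk" ]; then
      mv -- "$chunk" "$done_dir"/
    fi
  done
}
```

After archive_completed /tmp/parallel.log <some directory outside the chunk path>, rerunning the same find and parallel pipeline picks up only the chunks that are still left.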

Sorry there is not an easy button, but the basic outline is:

  1. Start Dual Writing to make sure your bucket in cloud is set up the way you want it for your data
  2. Export your historical data via a query that saves to a file.
  3. Divide the historical file into smaller chunks for ingestion
  4. Upload the csv chunks using the basic bash script as an example.

Also please note that the 2.x team is building an importer to make this process easy. So if that seems overwhelming and you have the option you could wait too. Please let me know what you decide on and if you need help!


Hello @Anaisdg,
Thanks a lot for all those options. Soon I’ll try them out.
As a side note, dual writing is not an option for us as we do not have internet in the field and also that would not export data saved in the past, but we will keep it as an option in other use cases!

We might need to do some rearrangement of data while (or before) importing them as we changed the data structure, but we will find a way using your suggestions.

And we look forward to the import tool!

We’ll keep you posted!

Thanks,
Mirko

Hello @Anaisdg,
Could you give more details about it? Would the importer allow us to import data from V1 to V2 directly? Do you have any idea of the timeline?
Thanks

Hello @Yuann_B,
I’m sorry, I don’t know. If I find anything related, I’ll make sure to circle back and include it here.

Not sure if this is reinventing the wheel, but could you stream data from your 1.x db using
“influx_inspect export” to Telegraf or a Telegraf Gateway, and then have Telegraf send that data to your 2.x db? This way you could avoid the disk IO of reading/writing csv files, and Telegraf would do all the hard work of data batching and transport.


Thanks @Pete that’s a good point.

Hi @Anaisdg ,
Any updates on the development of this importer? Could you inform us about the planning? If not, who could give us an estimation about planning for this? As more people are migrating from local 1.x OSS versions to cloud version, a seamless upgrade seems like a big deal to me. The current workaround is fine if you have limited amounts of data, but is practically unfeasible for large amounts of data.
thanks in advance for the update


I ran into a similar problem: I wanted to copy data from a V1.8 production system to a V2.1 test system.
Both self-hosted, so full access. And not a migration or an update, that comes later, if at all.

The long recipe given by @Anaisdg about a year ago doesn’t work for me.
The link cited in step 2 describes how to extract data from a V2.0 system.
It simply doesn’t apply to a V1.8 system.
I ended up extracting the data with influx_inspect like

influx_inspect export -database dca \
  -datadir /var/lib/influxdb/data \
  -waldir /var/lib/influxdb/wal \
  -compress \
  -out dca.line.gz

That file is copied from the production to the test system and split into reasonable chunks with

time zcat dca.line.gz | split -l 50000 -d -a 4 - dca.line_
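
One detail that may matter here: as far as I know, the file written by influx_inspect export starts with a few non-data lines (comments such as # DDL and # CONTEXT-DATABASE, plus a CREATE DATABASE statement) that a V2 write would not accept as points. The filter below is a sketch under that assumption; strip_export_headers is a made-up name. The 1.8 documentation also lists a -lponly option for influx_inspect export which, if your version has it, should make the filter unnecessary.

```shell
#!/usr/bin/env bash
# Sketch only: strip_export_headers is an illustrative name.
# Assumes the non-data lines of the export are comments starting with '#'
# and DDL statements starting with 'CREATE DATABASE '.

strip_export_headers() {
  # Reads the export on stdin and writes only the data lines to stdout.
  grep -v -e '^#' -e '^CREATE DATABASE ' -e '^$'
}
```

It would sit between the two commands of the pipeline, e.g. zcat dca.line.gz | strip_export_headers | split -l 50000 -d -a 4 - dca.line_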

These files are finally curl'ed into the target system via a shell script (as described earlier).
For a ~500MB database I got a ~600 MB compressed file, and ~1500 chunk files with a total of ~4 GB.
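
For reference, the curl step could look roughly like the sketch below. It posts each chunk to the V2 write endpoint /api/v2/write with a token header. The function names and the CURL_BIN variable are made up for this example, and the org and bucket names are assumed to need no URL encoding. This is not the exact script used above, only an illustration of the idea.

```shell
#!/usr/bin/env bash
# Sketch only: write_url, upload_chunks_curl and CURL_BIN are illustrative names.
# Assumes org and bucket names contain no characters that need URL encoding.

# Build the V2 write URL; influx_inspect exports nanosecond timestamps.
write_url() {
  local host="$1" org="$2" bucket="$3"
  printf '%s/api/v2/write?org=%s&bucket=%s&precision=ns' "${host%/}" "$org" "$bucket"
}

# Upload the given chunk files one by one and stop at the first failure.
upload_chunks_curl() {
  local url="$1" token="$2"
  shift 2
  local curl_bin="${CURL_BIN:-curl}" chunk
  for chunk in "$@"; do
    # --fail makes curl return non-zero on HTTP errors such as 4xx and 5xx.
    "$curl_bin" --fail --silent --show-error -X POST "$url" \
      -H "Authorization: Token ${token}" \
      -H "Content-Type: text/plain; charset=utf-8" \
      --data-binary "@${chunk}" || { echo "failed: $chunk" >&2; return 1; }
    echo "uploaded: $chunk"
  done
}
```

A call might look like upload_chunks_curl "$(write_url http://localhost:8086 myorg mybucket)" "$TOKEN" dca.line_*, where the host, org and bucket are placeholders.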