Hello @mirkocomparetti,
Since we don’t have an easy button to do imports into our InfluxDB Cloud, I wrote out some suggested paths below. Something to be aware of is the rate limits.
1) Dual Writing
I would suggest sending the data from your clients to both the cloud instance and your existing local instance. That way you can make sure the cloud has the buckets, data format, schema, etc. that you want. It is easier to get this right at the beginning since we are schema on write.
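As a rough sketch, if your clients write via the HTTP API directly, dual writing can be as simple as sending the same line protocol to both /api/v2/write endpoints. The org, bucket, token and host values below are placeholders for illustration; most client libraries let you configure a second client pointing at the cloud URL in much the same way.
# write the same point to the local OSS instance...
curl -XPOST "http://localhost:8086/api/v2/write?org=my-org&bucket=my-bucket&precision=ns" \
  --header "Authorization: Token $LOCAL_TOKEN" \
  --data-raw "cpu,host=host1 usage=0.5 1630000000000000000"
# ...and the same point to the Cloud 2 instance (substitute your region's URL)
curl -XPOST "https://us-west-2-1.aws.cloud2.influxdata.com/api/v2/write?org=my-org&bucket=my-bucket&precision=ns" \
  --header "Authorization: Token $CLOUD_TOKEN" \
  --data-raw "cpu,host=host1 usage=0.5 1630000000000000000"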
2) Export the data in line protocol to a .csv file on the local machine. (This is usually done via a client library that queries all the data you want to move and saves the output to a file.) See step 9 here for a basic example: Upgrade from InfluxDB 2.0 beta to InfluxDB 2.0 (stable) | InfluxDB OSS 2.0 Documentation (influxdata.com)
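If your local instance is OSS 2.x, one alternative to a client library is the influxd inspect export-lp command, which dumps a bucket's data as line protocol. The sketch below is only an assumption about the flags and paths, so check influxd inspect export-lp --help on your version; the bucket ID and paths are placeholders.
influxd inspect export-lp \
  --bucket-id <bucket id> \
  --engine-path ~/.influxdbv2/engine \
  --output-path ./export.lp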
3) Break the export file into smaller chunks
Prepare the file for upload
We will now split that csv export file into 1000 chunks. If, for example, we overrun a cardinality limit, our writes will be rejected. If that happens 20% of the way through uploading a single file, we don’t really have an easy way to know where we got to in order to try again once the limit is increased. By splitting the file into a number of smaller chunks we can resume from where we left off. We will use the Linux split tool to do this for us. It can split a file into a number of chunks and retain a full line of LP in each file; if we arbitrarily split the file on, say, a number of bytes we could end up with truncated LP lines. Create a directory called splits to hold the split files.
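For example (adjust the path to match where you run split from, since the command below writes the chunks to ../splits/):
mkdir -p ../splits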
split -n l/1000 -d <export file> ../splits/
This will take some time and will generate files named 000 to 999. Once complete you could delete the larger csv file if you wanted. Probably safer not to though, not yet at least.
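Before deleting anything, an optional quick sanity check is to confirm that no lines were lost during the split (assuming the chunks were written to ../splits/):
wc -l <export file>
cat ../splits/* | wc -l
The two line counts should match.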
4) Push the data into InfluxDB Cloud in small chunks.
Upload the chunks to C2
Now that you have your data in 1000 chunks you can start to upload each chunk to Cloud 2. We can use the 2.0 influx CLI tool to do this for us, as it allows you to pass the name of a file containing line protocol to be uploaded. In order to do this you will need:
- The org_id of the destination org (here represented by $ORG_ID)
- The bucket name of the destination bucket (here represented by $BUCKET_NAME)
- A valid write token (here represented by $TOKEN)
- The URL of the destination cluster (here represented by $HOST)
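For example, you could export these as environment variables so the commands below can be copied as-is (all of the values here are placeholders; substitute your own):
export ORG_ID=0000000000000000
export BUCKET_NAME=my-bucket
export TOKEN=my-write-token
export HOST=https://us-west-2-1.aws.cloud2.influxdata.com   # use your own region's URL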
Run the influx command from the 2.0 OSS directory you created previously.
./influx write -b $BUCKET_NAME --host $HOST --org $ORG_ID --token $TOKEN -f <a chunk of the file>
This command will upload the specified chunk file, and it’s worth testing with a single chunk file first. If that works, the fastest way to upload all 1000 chunks is to use a bash script to loop through the files and parallelize the uploads.
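For that first single-chunk test you might run something like this (assuming the chunks were written to ../splits/ as above and the first one is named 000):
./influx write -b $BUCKET_NAME --host $HOST --org $ORG_ID --token $TOKEN -f ../splits/000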
Create a bash script upload_chunk.sh with this code:
#!/bin/bash
echo "Uploading chunk: $1"
<path_to_influx_2.0 command>/influx write -b $BUCKET_NAME --host $HOST --org $ORG_ID --token $TOKEN -f $1
echo "Finished chunk: $1"
This is a very basic example; you might like to add better logging, e.g. if the exit code from influx write is 0 then we know that the upload completed successfully (a variant with that check is sketched below, after the parallel example). This script will only upload a single chunk, which is passed in as the first argument $1. We will use the GNU parallel command to call this script repeatedly and in parallel. This will significantly reduce the time taken to upload all 1000 chunks.
find <path to chunks> -type f | parallel --halt soon,fail=1 --joblog /tmp/parallel.log --jobs 14 -I% --max-args 1 ./upload_chunk.sh %
This will find all the chunk files, excluding directories (e.g. . and ..), and pass the file names to the parallel command. --halt soon,fail=1 tells parallel to stop calling the upload_chunk.sh script once an error is reported, but to allow the existing jobs to finish first. --joblog is where parallel will log. --jobs 14 will run 14 uploads in parallel; this can result in about 40k points per second hitting the cluster and seems not to stress the clusters too much. On a busy cluster such as us-west-2 you might want to scale that back to 7. Watch TTBR on the destination cluster; if it starts to climb as you upload, you are pushing too fast. -I% tells parallel to substitute the % symbol at the end of the line with the filename that was passed in. --max-args 1 tells parallel that the upload_chunk.sh script only accepts one argument.
When you run the find and pipe it through to parallel you can tail the job log to see what’s going on. Your data will now start uploading. If it fails for any reason, e.g. you breach a cardinality limit, you could use the job log to work out which chunks completed, delete those, and then rerun the upload.
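One thing worth noting about the basic upload_chunk.sh above: because its last command is an echo, the script always exits 0, so parallel’s --halt soon,fail=1 will never actually see a failed upload. A slightly more defensive sketch that propagates the influx write exit code (and makes failures obvious in the log) could look like this:
#!/bin/bash
echo "Uploading chunk: $1"
<path_to_influx_2.0 command>/influx write -b $BUCKET_NAME --host $HOST --org $ORG_ID --token $TOKEN -f $1
status=$?
if [ $status -eq 0 ]; then
  echo "Finished chunk: $1"
else
  echo "FAILED chunk: $1 (exit code $status)"
fi
exit $status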
Sorry there is not an easy button, but the basic outline is:
- Start dual writing to make sure your bucket in Cloud is set up the way you want it for your data.
- Export your historical data via a query that saves the output to a file.
- Divide the historical export file into smaller chunks for ingestion.
- Upload the csv chunks using the basic bash script as an example.