Hello,
We are trying to migrate a large dataset (1 year of data with ~6M records per month) from a SQL Server database into InfluxDB. First, we tried using Flux and the sql.from() function to import the data, but it didn't work out. I read in a post that sql.from() may not be suited for this kind of massive migration.
Currently, we are exporting the data into CSV files by month and loading these into InfluxDB using the write command.
Nevertheless, I was looking for a solution that would allow us to automate the process using the Python client library. Does anyone know if this is an optimal way of migrating a large amount of data into InfluxDB?
Thank you in advance!
Hello @Juan_Alonso_Pla,
Unfortunately there isn’t really an optimal way.
Are you using OSS or Cloud? This will change your considerations for optimization.
You could use Flux, but it would take a long time and you'd have to do it incrementally.
The CSV approach is the right one. You're just going to have to iterate through your data, create CSVs with timestamps, and then write the data to InfluxDB. You might also consider using the Telegraf file input plugin to write the CSVs instead of the Python client; it might be more efficient.
Here are some alternatives for people stumbling on this post who have similar questions:
You could also consider a bash script that iterates through your CSVs and uses the InfluxDB CLI to write them.
Here's an example of how you write a regular CSV with the InfluxDB CLI.
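Something like this should work for a plain CSV with a time column, one tag, and one field (a rough sketch; my-bucket, the measurement name, and the column names are just placeholders):

# Write a plain CSV (time,device,value) by prepending CSV annotations with --header.
# Assumes the file's first row holds the column names and timestamps are RFC3339.
influx write -b my-bucket \
  -f sensor_data_2021-01.csv \
  --header "#constant measurement,sensor_data" \
  --header "#datatype dateTime:RFC3339,tag,double"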
Here's an example of bulk migration with annotated CSV, using a bash script and the InfluxDB CLI:
#!/bin/bash
###
#
# This script will create daily annotated CSV files from an InfluxDB org using the Influx CLI.
# Notes:
# 1) The InfluxDB credentials must be set in the influx CLI.
# 2) The annotated CSV files are saved to the current directory where the script is run.
#
###
BUCKET=telegraf
NUM_DAYS=3
END_DATE=now # now or YYYY-MM-DDT00:00:00Z
###
cli_ver=`influx version`
if [[ $cli_ver != *"Influx CLI"* ]]; then
    echo "influx CLI not found."
    exit 1
fi
echo $cli_ver
if [ "${END_DATE}" == "now" ]; then
    end_date=`date +"%Y-%m-%dT%H:%M:00Z"`
else
    end_date=${END_DATE}
fi
echo "Processing ${NUM_DAYS} day(s) ending on ${end_date}..."
for ((n=0; n < ${NUM_DAYS}; n++)); do
    start_date=`date -d "${end_date} ${n} day ago" +"%Y-%m-%dT00:00:00Z"`
    start_date_short=`date -d "${start_date}" +"%Y-%m-%d"`
    echo ${BUCKET}_data_${start_date_short}.csv.gz
    # The most recent day ends at now(); every earlier day ends at the start of the following day.
    if [[ -v stop_date ]]; then
        stop_date=`date -d "${start_date} +1 day" +"%Y-%m-%dT00:00:00Z"`
    else
        stop_date=now\(\)
    fi
    # Export one day of data as annotated CSV and gzip it. To downsample before exporting,
    # add e.g. |> aggregateWindow(every: 1h, fn: mean, createEmpty: false) to the query.
    output=$( (influx query -r "from(bucket: \"${BUCKET}\") |> range(start: ${start_date}, stop: ${stop_date}) |> drop(columns: [\"_start\", \"_stop\"])" | gzip > ${BUCKET}_data_${start_date_short}.csv.gz) 2>&1 )
    # Abort if the query hit the InfluxDB Cloud query limit.
    if [[ $output == *"exceeded limited_query plan limit"* ]]; then
        echo "Error encountered."
        exit 1
    fi
    sleep 10
done
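Then, to load those daily exports into the target instance, you can loop over the files with influx write (again just a sketch; target-bucket is a placeholder, and the file glob matches the telegraf bucket name used above):

# Unzip and write each exported annotated CSV into the target bucket.
for f in telegraf_data_*.csv.gz; do
    gunzip -k "$f"
    influx write -b target-bucket -f "${f%.gz}" --format csv
    sleep 5   # stay under write rate limits, especially on Cloud
done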
This reply may come a bit late, but thank you for your help @Anaisdg; it was indeed quite clear.
I finally ended up following your advice and writing a bash script that queries the data from SQL Server and writes it to InfluxDB through the CLI.
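In case it helps anyone else, the approach boils down to something like this (a rough sketch rather than the exact script; the server, database, table, and column names are made up, and quoting of exported text values still needs care):

#!/bin/bash
# Export one month from SQL Server as a headerless CSV.
# CONVERT(..., 127) emits ISO 8601 timestamps so InfluxDB can parse them as RFC3339;
# SET NOCOUNT ON suppresses the "(N rows affected)" footer.
sqlcmd -S myserver -d mydb -U myuser -P "$SQL_PASSWORD" \
  -Q "SET NOCOUNT ON; SELECT CONVERT(varchar(33), ts, 127), device, value FROM readings WHERE ts >= '2021-01-01' AND ts < '2021-02-01'" \
  -s"," -W -h -1 -o readings_2021-01.csv

# Write it to InfluxDB, prepending the CSV annotations and a column-name row with --header.
influx write -b my-bucket \
  -f readings_2021-01.csv \
  --header "#constant measurement,readings" \
  --header "#datatype dateTime:RFC3339,tag,double" \
  --header "time,device,value"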
@Juan_Alonso_Pla,
So happy you were able to find a solution.