Export big data from InfluxDB

influxdb

#1

Hi!

I need to export a few days of data from InfluxDB to files (CSV and Excel).
Can you suggest the right way to proceed?

I tried these solutions:

curl

curl -G 'http://my.server.com:8086/query?db=mydb' --data-urlencode "q=select * from sensor where time > now() - 2d" > output_file

pros: only 6 seconds
cons: hard to format the output as CSV and Excel
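
(Aside: the 1.x HTTP API can also return CSV directly if you ask for it with an Accept header, per the HTTP API docs; that eases the CSV half, though not Excel:

curl -G -H "Accept: application/csv" 'http://my.server.com:8086/query?db=mydb' --data-urlencode "q=select * from sensor where time > now() - 2d" > output_file.csv)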


python InfluxDBClient

...
client = influxdb.InfluxDBClient( ... )
points = client.query(query, chunked=True, chunk_size=10000).get_points()
df = pd.DataFrame(points)  # one frame with all points
...
df.to_excel(writer, "Sheet")
...

pros: easy to format the output with the pandas lib. Not capped by the Influx max-row-limit (default 10,000)
cons: it takes 2 minutes and 24 seconds


python DataFrameClient

...
client = influxdb.DataFrameClient( ... )
dfs_dict = client.query(query, chunked=True)
ret = dfs_dict["sensor"]  # one DataFrame per measurement
...
ret.to_excel(writer, "Sheet")
...

pros: easy to format the output with the pandas lib. It takes 10 seconds
cons: results are capped by the Influx max-row-limit (default 10,000)


The cons of each solution are unacceptable for me :frowning:
Where am I going wrong?
I can't wait that long, I can't change max-row-limit in influxdb.conf, and I need to split the output and write it in both CSV and Excel format.
I hope you can help.

Thanks!


#2

You can define the output type as json|csv|column

-format 'json|csv|column'
Format specifies the format of the server responses: json, csv, or column.

So your command will become:
curl -G 'http://my.server.com:8086/query?db=mydb' --data-urlencode "q=select * from sensor where time > now() - 2d" -format csv > output_file


#3

Hi!

Thanks for the reply.
I can't find a -format option for curl.
But anyway, it couldn't be used to export in Excel format :disappointed:


#4

Yeah, you're right, it is not a curl option but an influx CLI option. You can run it directly on the server which has the data:
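
Something like this (illustrative; -database, -execute, and -format are standard influx CLI flags, and the query is the one from the first post):

influx -database 'mydb' -execute "select * from sensor where time > now() - 2d" -format csv > output_file.csv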


#5

OK, thanks, but I can't directly access the server which has the data.


#6

Hi guys,
What's the impact of this kind of query on the InfluxDB server?
We're experiencing the same thing (trying to back up each day into a CSV file), and we saw RAM saturation followed by continuous swapping to disk (the query takes a really long time).

We have a single-instance node, 4 cores, 8-10 GB RAM (the total RAM is 16 GB, but we have other processes running on it).
We are inserting 2,400,000 points per hour and we query like this: select * from telemetry where time >= '2017-09-01 00:00:00' and time <= '2017-09-01 23:59:59'


#7

We also have the same problem as resteeee, and at the moment we "cut" one query into many (see the sketch at the end of this post):

select * from sensor where time >= 0 and time < 10
select * from sensor where time >= 10 and time < 20
select * from sensor where time >= 20 and time < 30

and so on …

But how can I know in advance whether I need many queries instead of one? And how many?
And what should I do if Influx returns only [max-row-limit] rows?
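
For reference, this is the shape of our splitting loop with the python influxdb DataFrameClient (a sketch; host, port, user, pwd, db, and file_path are placeholders, and the window must be sized so that ingest-rate x window-length stays below max-row-limit):

import influxdb
import pandas as pd

client = influxdb.DataFrameClient(host, port, user, pwd, db)

fmt = "%Y-%m-%dT%H:%M:%SZ"          # RFC3339, the timestamp format InfluxQL expects
window = pd.Timedelta(minutes=1)    # keep (points/sec * window) under max-row-limit
start = pd.Timestamp("2017-09-01")
end = pd.Timestamp("2017-09-08")

first = True
t = start
while t < end:
    q = ("select * from sensor where time >= '%s' and time < '%s'"
         % (t.strftime(fmt), (t + window).strftime(fmt)))
    res = client.query(q)
    if "sensor" in res:
        # append each window to one CSV; write the header only once
        res["sensor"].to_csv(file_path, mode="a", header=first)
        first = False
    t += window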


#8

That’s a good question. Can anybody from InfluxData reply to these comments? Please?


#9

up.
Now I'm trying to extract 200,000,000 rows (7 days of data) using this minimal script:

import influxdb

client = influxdb.DataFrameClient(host, port, user, pwd, db)
dfs_dict = client.query(query, chunked=True, chunk_size=100000)
measurement = next(iter(dfs_dict))
ret = dfs_dict[measurement]
ret.to_csv(file_path, sep=",", encoding="utf-8")

Result: saturation of 8 GB RAM + 8 GB swap.
Possible solution: cut my query into 10,000 smaller queries.
InfluxData team, is this the right and fastest way to proceed?
Which client is the most suitable for this purpose?
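
One alternative I'm considering is to stream the HTTP chunks myself and write rows as they arrive, so only one chunk sits in RAM at a time. A sketch with the requests library (db, query, and paths copied from my first post; untested at this scale):

import csv
import json
import requests

params = {
    "db": "mydb",
    "q": "select * from sensor where time > now() - 7d",
    "chunked": "true",        # server streams one JSON document per chunk
    "chunk_size": "10000",
}
with requests.get("http://my.server.com:8086/query", params=params, stream=True) as r:
    with open("output_file.csv", "w", newline="") as f:
        writer = None
        for line in r.iter_lines():
            if not line:
                continue
            chunk = json.loads(line.decode("utf-8"))
            for result in chunk.get("results", []):
                for series in result.get("series", []):
                    if writer is None:
                        writer = csv.writer(f)
                        writer.writerow(series["columns"])    # header once
                    writer.writerows(series["values"])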

Thanks


#10

Very interested in a solution. We’re facing the same problem; currently, our data is safely stored in InfluxDB with no way to get it out for analysis :frowning:


#11

Any ideas on how to deal with the memory saturation? I'm having the same issue: I need to export 180,000,000 rows to a CSV file, but the memory just blows up.


#12

Also interested in a solution for challenges/problems like this. I am facing the same problem.


#13

I currently dump month-by-month and that works. It’s incredibly slow, but it works.


#14

Are you using Influx CLI for that or something else?


#15

You may have a look at InfluxDB Fetcher: https://github.com/hgomez/influxdb

Very useful for crafting a SELECT request and generating a line protocol file.


#16

The problem is not the format, but that requesting a data dump via query-and-save is unusably slow. The InfluxDB Fetcher would have the same problem, as it's the data production on the InfluxDB end that's slow, not the saving.


#17

There are new backup/restore features in the latest 1.5 release of InfluxDB that should address this problem:

https://www.influxdata.com/blog/release-announcement-influxdb-1-5-0-influxdb-enterprise-1-5-0/

The feature of interest here is the ability to set time bounds for the export, which can be done with the new -start and -end flags. Read more here:
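
For example (flag names from the 1.5 backup docs; database name and output path are placeholders):

influxd backup -portable -database mydb -start 2017-09-01T00:00:00Z -end 2017-09-08T00:00:00Z /tmp/backup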

-Adam