Export big data from InfluxDB

Hi!

I need to export a few days of data from InfluxDB to files (CSV and Excel).
Can you suggest the right way to proceed?

I tried these solutions:

curl

curl -G 'http://my.server.com:8086/query?db=mydb' --data-urlencode "q=select * from sensor where time > now() - 2d" > output_file

pros: only 6 seconds
cons: the JSON output is hard to reformat as CSV or Excel (see the sketch below)
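
To show what I mean: the endpoint returns nested JSON, so flattening it takes extra code. A minimal sketch of the conversion (file names are placeholders), assuming the whole result comes back as a single series (big results get split across several, which is exactly the pain):

import csv
import json

# load the JSON saved by the curl command above
with open("output_file") as f:
    response = json.load(f)

# the columns and values of the one series returned for "sensor"
series = response["results"][0]["series"][0]

with open("sensor.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(series["columns"])   # header row
    writer.writerows(series["values"])   # one list per data point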


python InfluxDBClient

...
client = influxdb.InfluxDBClient( ... )
# chunked=True streams the server response; get_points() yields the rows
points = client.query(query, chunked=True, chunk_size=10000).get_points()
df = pd.DataFrame(points)
...
# dfs: df split into sheet-sized pieces in the elided code above
for d in dfs:
    d.to_excel(writer, "Sheet")
...

pros: easy to format the output with pandas; not affected by the InfluxDB max-row-limit (default = 10000)
cons: it takes 2 minutes and 24 seconds


python DataFrameClient

...
client = influxdb.DataFrameClient( ... )
# query() returns a dict mapping measurement name -> DataFrame
dfs_dict = client.query(query, chunked=True)
ret = dfs_dict["sensor"]
...
# as above, ret is split into sheet-sized pieces in the elided code
for d in ret:
    d.to_excel(writer, "Sheet")
...

pros: easy to format the output with pandas; it takes only 10 seconds
cons: capped by the InfluxDB max-row-limit (default = 10000)


The cons of each solution are unacceptable for me :frowning:
Where am I wrong?
I can't wait that long, I can't change max-row-limit in influxdb.conf, and I need to split the output and write it in CSV + Excel format.
I hope in your help.

Thanks!

You can define the output type as json|csv|column

-format 'json|csv|column'
Format specifies the format of the server responses: json, csv, or column.

So your command will become:
curl -G 'http://my.server.com:8086/query?db=mydb' --data-urlencode "q=select * from sensor where time > now() - 2d" -format csv > output_file

Hi!

Thanks for the reply.
I can't find a -format option for curl.
And anyway, it couldn't be used to export to Excel format :disappointed:

Yeah, you're right, it is not a curl option but an option of the influx CLI's -execute mode. You can run it directly on the server which has the data:
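
Something like this (host and output path are placeholders):

influx -host localhost -database mydb -execute "select * from sensor where time > now() - 2d" -format csv > output_file.csv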

OK, thanks, but I can't access the server which has the data directly.

Hi guys,
What's the impact of this kind of query on the InfluxDB server?
We're experiencing the same thing (trying to back up every day into a CSV file), and we saw RAM saturation followed by continuous swapping to disk (the query takes a really long time).

We have a single-instance node: 4 cores, 8-10 GB RAM (the total RAM is 16 GB, but we have other processes running on it).
We are inserting 2,400,000 points per hour and we query like this: select * from telemetry where time >= '2017-09-01 00:00:00' and time <= '2017-09-01 23:59:59'

We also have the same problem as resteeee, and at the moment we "cut" one query into many:

select * from sensor where time >= 0 and time < 10
select * from sensor where time >= 10 and time < 20
select * from sensor where time >= 20 and time < 30

and so on ā€¦
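
In Python, our loop looks roughly like this (a sketch: host, measurement, window size, and the epoch bounds are placeholders, and the window has to be sized so each query stays under [max-row-limit]):

import csv
import influxdb

client = influxdb.InfluxDBClient("my.server.com", 8086, database="mydb")

start_ns = 1504224000 * 10**9             # 2017-09-01T00:00:00Z, in nanoseconds
end_ns = start_ns + 7 * 24 * 3600 * 10**9
step_ns = 3600 * 10**9                    # one hour of data per query

writer = None
with open("sensor.csv", "w", newline="") as f:
    for t in range(start_ns, end_ns, step_ns):
        q = ("select * from sensor where time >= {} and time < {}"
             .format(t, min(t + step_ns, end_ns)))
        for point in client.query(q).get_points():
            if writer is None:            # take the header from the first row
                writer = csv.DictWriter(f, fieldnames=list(point.keys()))
                writer.writeheader()
            writer.writerow(point)

Only one window's rows are in memory at a time, so the RAM usage stays flat.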

But how can I know in advance whether I need to use many queries instead of one, and how many?
And what can I do if Influx returns me only [max-row-limit] rows?


That's a good question. Can anybody from InfluxData reply to these comments? Please?


Bump.
Now I'm trying to extract 200,000,000 rows (7 days of data) using this minimal script:

client = influxdb.DataFrameClient(host, port, user, pwd, db)
# even with chunked=True, the whole result is materialised in RAM
dfs_dict = client.query(query, chunked=True, chunk_size=100000)
measurement = next(iter(dfs_dict))        # dict maps measurement -> DataFrame
ret = dfs_dict[measurement]
ret.to_csv(file_path, sep=",", encoding="utf-8")

Result: saturation of 8 GB RAM + 8 GB swap.
Possible solution: cut my query into 10,000 smaller queries (basically the loop sketched above).
InfluxData team, is this the right and fastest way to proceed?
Which client is the most suitable for this purpose?

Thanks


Very interested in a solution. We're facing the same problem; currently, our data is safely stored in InfluxDB with no way to get it out for analysis :frowning:

Any ideas on how to deal with the memory saturation? I'm having the same issue: I need to export 180,000,000 rows to a CSV file, but the memory just blows up.


Also interested in a solution for problems like this; I am facing the same one.

I currently dump month-by-month and that works. It's incredibly slow, but it works.

Are you using Influx CLI for that or something else?

You may have a look at InfluxDB Fetcher: GitHub - hgomez/influxdb: InfluxDB Tools

Very useful for crafting a SELECT request and generating a write-protocol (line protocol) file.

The problem is not the format, but that requesting a data dump via query-and-save is unusably slow. The InfluxDB Fetcher would have the same problem, as it's the data production on the InfluxDB end that's slow, not the saving.

There are new backup/restore features in the latest 1.5 release of InfluxDB that should address this problem:

The feature of interest here is to set time bounds for the export. This can be done with the new -start and -end flags. Read more here:

-Adam

It's ultra fast, but it backs up the data in a binary compressed format.
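
For reference, the time-bounded invocation looks something like this (database name, bounds, and target directory are placeholders):

influxd backup -portable -database mydb -start 2018-01-01T00:00:00Z -end 2018-01-08T00:00:00Z /tmp/backup

You get the data back with influxd restore -portable, so it's a backup format rather than something you can read as CSV.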

I'm using the latest InfluxDB Python client (GitHub - influxdata/influxdb-client-python: InfluxDB 2.0 python client) and would like to use the chunked query method. Unfortunately, the method described in the first post doesn't work anymore with this version. I don't understand how you can achieve the same thing with the new library. Does anyone know?
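
The closest I've found so far is the streaming query API, but I'm not sure it's the intended replacement for chunked=True (a sketch; URL, token, org, and bucket are placeholders):

from influxdb_client import InfluxDBClient

client = InfluxDBClient(url="http://my.server.com:8086", token="my-token", org="my-org")

flux = '''
from(bucket: "mydb")
  |> range(start: -2d)
  |> filter(fn: (r) => r._measurement == "sensor")
'''

# query_data_frame_stream() yields DataFrames piece by piece instead of
# materialising the whole result, so each piece can be appended to the CSV
for i, df in enumerate(client.query_api().query_data_frame_stream(flux)):
    df.to_csv("output.csv", mode="a", header=(i == 0))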


Hello naiconovalabs,

I also have to export data from InfluxDB to externally readable files, preferably CSV.

I'm personally running InfluxDB 1.8.4 for Windows and am having trouble with the CLI query: the query I pass via -execute returns an error along the lines of "couldn't find the file". I get this error even when I just print the result to the console rather than exporting it to a file. On Linux, where I also use InfluxDB, this approach works, so my conclusion is that the Windows release of InfluxDB has an issue here.

I was forced to look for alternatives on Windows and found the HTTP query approach with curl. You write that with curl you get a file that is hard to transform to CSV. I managed to overcome this by adding -H "Accept: application/csv" to the curl command, which makes the output a comma-separated CSV.

Your whole command would then look like this:
curl -H "Accept: application/csv" -G 'http://my.server.com:8086/query?db=mydb' --data-urlencode "q=select * from sensor where time > now() - 2d"

I'm currently also looking for an alternative analogous to -precision, but haven't found anything yet.

Hope this helps a bit,

logerar