Export big data from InfluxDB

Hi!

I need to export a few days of data from InfluxDB to files (csv and excel).
Can you suggest the right way to proceed?

I tried these solutions:

curl

curl -G 'http://my.server.com:8086/query?db=mydb' --data-urlencode "q=select * from sensor where time > now() - 2d" > output_file

pros: only 6 seconds
cons: hard to format the output as csv and excel
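
A minimal sketch of reshaping that JSON output with pandas, assuming the usual results[0].series[0] layout with "columns" and "values" keys and the output_file saved above:

import json
import pandas as pd

# load the JSON body that the curl command above saved
with open("output_file") as f:
    series = json.load(f)["results"][0]["series"][0]

# the response carries column names and row values separately
df = pd.DataFrame(series["values"], columns=series["columns"])
df.to_csv("output.csv", index=False)
df.to_excel("output.xlsx", "Sheet1")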


python InfluxDBClient

...
client = influxdb.InfluxDBClient( ... )
points = client.query(query, chunked=True, chunk_size=10000).get_points()
df = pd.DataFrame(points)
...
# dfs: df split into sheet-sized pieces in the code elided above
for i, d in enumerate(dfs):
    d.to_excel(writer, "Sheet%d" % i)
...

pros: easy to format the output with the pandas lib. No influx max-row-limit (default=10000)
cons: it takes 2 minutes and 24 seconds


python DataFrameClient

...
client = influxdb.DataFrameClient( ... )
dfs_dict = client.query(query, chunked=True)
ret = dfs_dict["sensor"]
...
# ret is a single DataFrame here, so write it in one call
ret.to_excel(writer, "Sheet")
...

pros: easy to format the output with the pandas lib. It takes 10 seconds
cons: influx max-row-limit (default=10000)


The cons of each solution are unacceptable for me :frowning:
Where am I wrong?
I can’t wait that long, I can’t change max-row-limit in influxdb.conf, and I need to split the output and write it in csv+excel format.
I hope you can help.

Thanks!

You can define the output type as json|csv|column

-format 'json|csv|column'
Format specifies the format of the server responses: json, csv, or column.

So your command will become
curl -G 'http://my.server.com:8086/query?db=mydb' --data-urlencode "q=select * from sensor where time > now() - 2d" -format csv > output_file

Hi!

Thanks for the reply.
I can’t find a -format option for curl.
But anyway, it couldn’t be used to export in Excel format :disappointed:

Yeah, you’re right, it is not a curl option but an influx CLI option. You can run it directly on the server which has the data:
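
For example, a sketch of that CLI invocation (mydb, the sensor measurement, and output_file are placeholders carried over from the earlier commands):

influx -database 'mydb' -format csv \
    -execute "select * from sensor where time > now() - 2d" > output_file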

OK, thanks, but I can’t directly access the server which has the data.

Hi guys,
What’s the impact of this kind of query on the InfluxDB server?
We’re experiencing the same thing (trying to back up every day into a csv file) and we saw RAM saturation followed by continuous swapping to disk (the query takes a really long time).

We have a single-instance node, 4 cores, 8-10 GB RAM (the total RAM is 16 GB, but we have other processes running on it).
We are inserting 2,400,000 points per hour and we query like this: select * from telemetry where time >= '2017-09-01 00:00:00' and time <= '2017-09-01 23:59:59'

We also have the same problem as resteeee, and at the moment we "cut" one query into many:

select * from sensor where time >= 0 and time < 10
select * from sensor where time >= 10 and time < 20
select * from sensor where time >= 20 and time < 30

and so on …

But how can I know in advance whether I need to use many queries instead of one? And how many?
And what should I do if Influx returns me only [max-row-limit] rows?
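
Not an official answer, but a rough sketch of how you might size the split up front: count the rows first, then derive how many time windows keep each query under max-row-limit (the connection details and the field name "value" are placeholders):

import influxdb

client = influxdb.InfluxDBClient(host, port, user, pwd, db)

# count the rows the big query would return; "value" must be one of the
# measurement's fields
rs = client.query("SELECT count(value) FROM sensor WHERE time > now() - 2d")
total = next(rs.get_points())["count"]

max_rows = 10000                   # server-side max-row-limit
n_queries = total // max_rows + 1  # split the time range into this many windows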


That’s a good question. Can anybody from InfluxData reply to these comments? Please?


Up.
Now I’m trying to extract 200,000,000 rows (7 days of data) using this minimal script:

import influxdb

client = influxdb.DataFrameClient(host, port, user, pwd, db)
# even with chunked=True, the client assembles the full result into
# DataFrames in memory before anything is written out
dfs_dict = client.query(query, chunked=True, chunk_size=100000)
measurement = next(iter(dfs_dict))
ret = dfs_dict[measurement]
ret.to_csv(file_path, sep=",", encoding="utf-8")

Result: saturation of 8 GB RAM + 8 GB swap.
Possible solution: cut my query into 10,000 smaller queries.
InfluxData team, is this the right and fastest way to proceed?
Which client is the most suitable for this purpose?

Thanks
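
One way to keep memory flat, sketched under assumptions (the one-hour window, the connection details, the "sensor" measurement, and file_path are placeholders): walk the time range in fixed windows and append each piece to the csv instead of materializing everything at once:

import influxdb
import pandas as pd

client = influxdb.DataFrameClient(host, port, user, pwd, db)

start = pd.Timestamp("2017-09-01")
end = pd.Timestamp("2017-09-08")
window = pd.Timedelta(hours=1)  # pick this so each window stays under max-row-limit
fmt = "%Y-%m-%dT%H:%M:%SZ"

first = True
t = start
while t < end:
    q = ("SELECT * FROM sensor WHERE time >= '%s' AND time < '%s'"
         % (t.strftime(fmt), (t + window).strftime(fmt)))
    # each window is written out and freed before the next one is fetched
    for df in client.query(q).values():
        df.to_csv(file_path, mode="w" if first else "a",
                  header=first, encoding="utf-8")
        first = False
    t += window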


Very interested in a solution. We’re facing the same problem; currently, our data is safely stored in InfluxDB with no way to get it out for analysis :frowning:

Any ideas on how to deal with the memory saturation? Having the same issue: I need to export 180,000,000 rows to a csv file, but the memory just blows up.


Also interested in a solution to challenges like this; I am facing the same problem.

I currently dump month-by-month and that works. It’s incredibly slow, but it works.

Are you using the Influx CLI for that, or something else?

You may have a look at InfluxDB Fetcher: GitHub - hgomez/influxdb: InfluxDB Tools

Very useful for crafting a select request and generating a write-protocol file.

The problem is not the format, but that requesting a data dump via query-and-save is unusably slow. The InfluxDB fetcher would have the same problem, as it’s the data production on the InfluxDB end that’s slow, not the saving.

There are new backup/restore features in the latest 1.5 release of InfluxDB that should address this problem:

The feature of interest here is to set time bounds for the export. This can be done with the new -start and -end flags. Read more here:
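
For example (a sketch; the database name, the time bounds, and the target directory are placeholders):

influxd backup -portable -database mydb \
    -start 2018-01-01T00:00:00Z -end 2018-01-08T00:00:00Z /tmp/backup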

-Adam

It’s ultra fast, but it backs up the data in a binary compressed format

I’m using the latest influx Python client (GitHub - influxdata/influxdb-client-python: InfluxDB 2.0 python client) and would like to use the chunked query method. Unfortunately, the approach described in the first post no longer works with this version, and I don’t understand how to achieve the same with the new library. Does anyone know?
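
Not authoritative, but one approach with the 2.x client is a sketch built on its query_csv() streaming helper, assuming that helper behaves as documented; url, token, org, and bucket below are placeholders:

import csv
from influxdb_client import InfluxDBClient

client = InfluxDBClient(url="http://localhost:8086",
                        token="my-token", org="my-org")

flux = '''
from(bucket: "my-bucket")
  |> range(start: -2d)
  |> filter(fn: (r) => r._measurement == "sensor")
'''

# query_csv() yields the annotated-CSV response one parsed row at a
# time, so the full result set never has to fit in memory
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in client.query_api().query_csv(flux):
        writer.writerow(row)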


Hello naiconovalabs,

I also have to export data from InfluxDB to externally readable files, preferably csv.

I’m personally running InfluxDB 1.8.4 for Windows and am having trouble using the CLI query. Here, the query I give it via -execute returns an error with text along the lines of "couldn't find the file". I get this error even if I just try to print to the console rather than exporting to a file. I also use InfluxDB on Linux, where this approach works, so my conclusion is that the InfluxDB release for Windows seems to have an issue here.

I was forced to look for alternatives on Windows, and found the HTTP query approach with curl. You write that with curl you get a file that is hard to transform to csv. I managed to overcome this by adding -H "Accept: application/csv" to the curl command. This actually causes the output file to be a comma-separated csv.

Your whole command would then look like this:
curl -H "Accept: application/csv" -G 'http://my.server.com:8086/query?db=mydb' --data-urlencode "q=select * from sensor where time > now() - 2d" > output_file

I’m currently also looking for an alternative analogous to -precision, but haven’t found anything yet.

Hope this helps a bit,

logerar