Export big data from InfluxDB

influxdb

#1

Hi!

I need to export a few days of data from InfluxDB to files (CSV and Excel).
Can you suggest the right way to proceed?

I tried these solutions:

curl

curl -G 'http://my.server.com:8086/query?db=mydb' --data-urlencode "q=select * from sensor where time > now() - 2d" > output_file

pros: only 6 seconds
cons: hard to format the output as CSV and Excel
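
(Aside: the 1.x HTTP API can also return CSV directly if you ask for it with an Accept header, per the HTTP API docs; that eases the CSV half, though not Excel:

curl -G -H "Accept: application/csv" 'http://my.server.com:8086/query?db=mydb' --data-urlencode "q=select * from sensor where time > now() - 2d" > output_file.csv)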


python InfluxDBClient

...
client = influxdb.InfluxDBClient( ... )
points = client.query(query, chunked=True, chunk_size=10000).get_points()
df = pd.DataFrame(points)  # one frame with all points
...
df.to_excel(writer, "Sheet")
...

pros: easy to format the output with the pandas lib. Not capped by the Influx max-row-limit (default 10,000)
cons: it takes 2 minutes and 24 seconds


python DataFrameClient

...
client = influxdb.DataFrameClient( ... )
dfs_dict = client.query(query, chunked=True)
ret = dfs_dict["sensor"]  # one DataFrame per measurement
...
ret.to_excel(writer, "Sheet")
...

pros: easy to format the output with the pandas lib. It takes 10 seconds
cons: results are capped by the Influx max-row-limit (default 10,000)


The cons of each solution are unacceptable for me :frowning:
Where am I going wrong?
I can't wait that long, I can't change max-row-limit in influxdb.conf, and I need to split the output and write it in both CSV and Excel format.
I hope you can help.

Thanks!


#2

You can define the output type as json|csv|column

-format 'json|csv|column'
Format specifies the format of the server responses: json, csv, or column.

So your command will become:
curl -G 'http://my.server.com:8086/query?db=mydb' --data-urlencode "q=select * from sensor where time > now() - 2d" -format csv > output_file


#3

Hi!

Thanks for the reply.
I can't find a -format option for curl.
But anyway, it couldn't be used to export in Excel format :disappointed:


#4

Yeah, you're right, it is not a curl option but an influx CLI option. You can run it directly on the server which has the data:
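
Something like this (illustrative; -database, -execute, and -format are standard influx CLI flags, and the query is the one from the first post):

influx -database 'mydb' -execute "select * from sensor where time > now() - 2d" -format csv > output_file.csv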


#5

OK, thanks, but I can't directly access the server which has the data.


#6

Hi guys,
What's the impact of this kind of query on the InfluxDB server?
We're experiencing the same thing (trying to back up each day into a CSV file), and we saw RAM saturation followed by continuous swapping to disk (the query takes a really long time).

We have a single-instance node, 4 cores, 8-10 GB RAM (the total RAM is 16 GB, but we have other processes running on it).
We are inserting 2,400,000 points per hour and we query like this: select * from telemetry where time >= '2017-09-01 00:00:00' and time <= '2017-09-01 23:59:59'


#7

We also have the same problem as resteeee, and at the moment we "cut" one query into many (see the sketch at the end of this post):

select * from sensor where time >= 0 and time < 10
select * from sensor where time >= 10 and time < 20
select * from sensor where time >= 20 and time < 30

and so on …

But how can I know in advance whether I need many queries instead of one? And how many?
And what should I do if Influx returns only [max-row-limit] rows?
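
For reference, this is the shape of our splitting loop with the python influxdb DataFrameClient (a sketch; host, port, user, pwd, db, and file_path are placeholders, and the window must be sized so that ingest-rate x window-length stays below max-row-limit):

import influxdb
import pandas as pd

client = influxdb.DataFrameClient(host, port, user, pwd, db)

fmt = "%Y-%m-%dT%H:%M:%SZ"          # RFC3339, the timestamp format InfluxQL expects
window = pd.Timedelta(minutes=1)    # keep (points/sec * window) under max-row-limit
start = pd.Timestamp("2017-09-01")
end = pd.Timestamp("2017-09-08")

first = True
t = start
while t < end:
    q = ("select * from sensor where time >= '%s' and time < '%s'"
         % (t.strftime(fmt), (t + window).strftime(fmt)))
    res = client.query(q)
    if "sensor" in res:
        # append each window to one CSV; write the header only once
        res["sensor"].to_csv(file_path, mode="a", header=first)
        first = False
    t += window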


#8

That’s a good question. Can anybody from InfluxData reply to these comments? Please?


#9

up.
Now I'm trying to extract 200,000,000 rows (7 days of data) using this minimal script:

import influxdb

client = influxdb.DataFrameClient(host, port, user, pwd, db)
dfs_dict = client.query(query, chunked=True, chunk_size=100000)
measurement = next(iter(dfs_dict))
ret = dfs_dict[measurement]
ret.to_csv(file_path, sep=",", encoding="utf-8")

Result: saturation of 8 GB RAM + 8 GB swap.
Possible solution: cut my query into 10,000 smaller queries.
InfluxData team, is this the right and fastest way to proceed?
Which client is the most suitable for this purpose?
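
One alternative I'm considering is to stream the HTTP chunks myself and write rows as they arrive, so only one chunk sits in RAM at a time. A sketch with the requests library (db, query, and paths copied from my first post; untested at this scale):

import csv
import json
import requests

params = {
    "db": "mydb",
    "q": "select * from sensor where time > now() - 7d",
    "chunked": "true",        # server streams one JSON document per chunk
    "chunk_size": "10000",
}
with requests.get("http://my.server.com:8086/query", params=params, stream=True) as r:
    with open("output_file.csv", "w", newline="") as f:
        writer = None
        for line in r.iter_lines():
            if not line:
                continue
            chunk = json.loads(line.decode("utf-8"))
            for result in chunk.get("results", []):
                for series in result.get("series", []):
                    if writer is None:
                        writer = csv.writer(f)
                        writer.writerow(series["columns"])    # header once
                    writer.writerows(series["values"])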

Thanks


#10

Very interested in a solution. We’re facing the same problem; currently, our data is safely stored in InfluxDB with no way to get it out for analysis :frowning:


#11

Any ideas on how to deal with the memory saturation? I'm having the same issue: I need to export 180,000,000 rows to a CSV file, but the memory just blows up.


#12

Also interested in a solution for challenges/problems like this. I am facing the same problem.


#13

I currently dump month-by-month and that works. It’s incredibly slow, but it works.


#14

Are you using Influx CLI for that or something else?


#15

You may have a look at InfluxDB Fetcher: https://github.com/hgomez/influxdb

Very useful for crafting a SELECT request and generating a line protocol file.


#16

The problem is not the format, but that requesting a data dump via query-and-save is unusably slow. The InfluxDB Fetcher would have the same problem, as it's the data production on the InfluxDB end that's slow, not the saving.


#17

There are new backup/restore features in the latest 1.5 release of InfluxDB that should address this problem:

https://www.influxdata.com/blog/release-announcement-influxdb-1-5-0-influxdb-enterprise-1-5-0/

The feature of interest here is the ability to set time bounds for the export, which can be done with the new -start and -end flags. Read more here:
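
For example (flag names from the 1.5 backup docs; database name and output path are placeholders):

influxd backup -portable -database mydb -start 2017-09-01T00:00:00Z -end 2017-09-08T00:00:00Z /tmp/backup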

-Adam