Export daily data to CSV

Hi guys,
We have a single-node instance that keeps all telemetry data in one measurement. Each sensor has at most 20 fields (float values), and we are inserting 2,400,000 points per hour.
We want to export each day’s data to an external CSV file; this job will run once a day.

We’ll execute a query like this:
SELECT * FROM telemetry WHERE time >= '2017-09-01 00:00:00' AND time <= '2017-09-01 23:59:59'

What’s the impact of this kind of query on the InfluxDB server?
We saw RAM saturation followed by continuous swapping to disk (the query takes a really long time).

We have a single-node instance with 4 cores and 8-10 GB of RAM available to it (the machine has 16 GB in total, but other processes run on it as well).

Hey,

I don’t have an answer, but I’m very interested, as I will have to do exactly the same kind of exports.

I was wondering, do you have an SSD or an HDD?
What is the size of the swap?

Maybe you could also split that request, for example by selecting only 6 hours at a time; that would probably avoid the swapping. But if the query takes too long (say, more than 1 hour), you may still run into RAM issues, because you will be inserting points and building the CSV at the same time…

Do you insert your points all at once each hour, or every X minutes/seconds/…?

I’ll let you know if I have the same problem or if I find a better solution.

Hello,

Did you manage to get it working?

Also, can I ask what you use to export your CSV? Do you know if Influx has some kind of native tool to export CSV, as we can do with MySQL, for example?

Influx can produce the CSV for you. Have a look at the influx --help output.

For example, under bash, assuming you have the influx executable in your path, you can do something along the lines of this:

influx -username your_user_if_any -password "secret!" -database 'db_name' -host 'localhost' -execute 'SELECT * FROM "db_name"."your_retention_policy_or_nothing"."your_measurement_name" WHERE time > '\''2017-09-05'\'' and time < '\''2017-09-05T23:59:59Z'\'' AND other_conditions_if_required' -format 'csv' > /tmp/your_measurement_name_20170905.csv

This should help you get started. (The tricky part was the single-quote escaping in (ba)sh.)
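
If a full 24-hour window is too heavy on RAM, you could wrap the same command in a loop over smaller windows and concatenate the output. A rough, untested sketch along those lines (the database name, measurement name and 6-hour window size are placeholders to adapt to your setup):

#!/usr/bin/env bash
# Export one day of data in 6-hour chunks so no single query has to
# materialize 24 hours of points at once. All names below are placeholders.
DAY="2017-09-05"
DB="db_name"
M="your_measurement_name"
OUT="/tmp/${M}_${DAY//-/}.csv"
: > "$OUT"                              # truncate/create the output file
for H in 00 06 12 18; do
  START="${DAY}T${H}:00:00Z"
  END="${DAY}T$(printf '%02d' $((10#$H + 5))):59:59.999999999Z"
  influx -database "$DB" -host 'localhost' -format 'csv' \
    -execute "SELECT * FROM \"$M\" WHERE time >= '$START' AND time <= '$END'" \
    >> "$OUT"
done

Note that each chunk repeats the CSV header row, so you may want to strip the duplicates afterwards.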


Hi NoxWorld,
Sorry for the late reply, I was on holiday :slight_smile:
We haven’t decided how to proceed yet. I think we’ll split the query into smaller intervals (less than 24 hours, maybe 12 or fewer), so we’ll schedule a procedure every 12/6 hours to dump the relevant data into a separate CSV file. You can also request CSV output when querying via the Influx HTTP API.
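For example, something like this should work against the 1.x HTTP API, which returns CSV when you send the Accept: application/csv header (just a sketch; the host, database, measurement and time range are placeholders for our setup):

# Rough sketch of the HTTP API route (all names are placeholders)
curl -G 'http://localhost:8086/query' \
  -H 'Accept: application/csv' \
  --data-urlencode 'db=db_name' \
  --data-urlencode "q=SELECT * FROM telemetry WHERE time >= '2017-09-01T00:00:00Z' AND time < '2017-09-01T06:00:00Z'" \
  > /tmp/telemetry_20170901_00-06.csv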
We insert points in batches, every 2000 points or 200 ms.
How are you proceeding?

PS: We have an SSD.

I haven’t decided how I will proceed either… I think I will have the same kind of problem as you.
We also have an SSD, but I don’t think it will be enough to make a difference.

I think I’ll ask whether we really need the same precision in the export, and I hope we won’t.
Then I would be allowed to downsample the data and export a less precise, lighter measurement.
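
Something along these lines is what I have in mind (just a rough sketch; the target measurement name and the 1-minute interval are made-up examples, nothing is decided yet):

# Downsample one day of raw data into a lighter measurement, then export that one instead.
influx -database 'db_name' -host 'localhost' -execute \
  'SELECT mean(*) INTO "telemetry_1m" FROM "telemetry" WHERE time >= '\''2017-09-05T00:00:00Z'\'' AND time < '\''2017-09-06T00:00:00Z'\'' GROUP BY time(1m), *'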

I’ll let you know if I find anything too.

Hi NoxWorld,
let’s keep in touch on this topic. Just a question: which language are you using? We’re building the exporter in Java, but I was wondering whether Python might be better (we don’t have to do any computation on the data).

Bye!

Hi,

I’m using Node.js right now; I wanted something simple that could work asynchronously, because I might have different kinds of data/databases to handle at once.

I have absolutely no idea about Python, but I guess you would be using the Python client?

For the record, I read something about the Python client having problems with very large queries.
And I don’t know if the Node.js client is very efficient either; it just feels like there are 20 libraries and plugins involved just to make a few HTTP requests…

I don’t know what to think about the clients out there…

Ah OK, we’ve implemented a data backup application in Java (our platform is mainly written in Java), just for testing. We’ve also implemented a simulator that writes batched points into Influx. We’re not developing for production yet, because the project is still in its initial phases. The query will certainly have to be broken down into small chunks (3 hours? 6 hours? 12 hours?), but it would be interesting to know whether someone from InfluxData has run into the same problem.

Let’s keep in touch, please!

Could you tell us more about the setup?
Maybe we can figure out at which point the system gets flooded and becomes laggy.

What is your hardware configuration exactly?
How many points do you have per day, and with how many tags?

Then we can try to estimate the index size in RAM and see how much it takes.
Even if it’s very approximate, it would give us an order of magnitude.

Also, what time precision are you using? I think this affects the in-RAM index too; it would make sense, as everything in a measurement is indexed by time…

Hi NoxWorld,
it was a load test that I performed on my laptop, so the machine was not entirely dedicated to Influx.
It’s a quad-core i7 with 16 GB RAM and a 512 GB SSD.
As I told you before, here is the other info:

  • We built a simulator to insert data into Influx. The simulator is written in Java and was running on the same machine as InfluxDB.
  • We have a single-node instance, keeping all telemetry data in a single measurement.
  • Each sensor has at most 20 fields (float values) and 2 tags (deviceId and plantId). We are inserting 2,400,000 points per hour across 2000 sensors.
  • We are using millisecond precision.

When I dump to CSV using the influx CLI, the process gets killed after 12 minutes.