What is the highest-performance method of getting data in/out of InfluxDB?

time-series
influxdb
#1

Is there a resource that I could review to understand the best practices to achieve the highest possible performance reading/writing to/from InfluxDB?

Is Line Protocol over UDP the fastest? Is there a significant improvement with this over HTTP?
What are optimal settings for batching/collecting/sorting/arranging time-series data before writing?

Thanks and Regards,
Jeff

#2

@jefferyanderson There is not a huge performance gain with UDP over HTTP, especially when batching properly. I’ve found batches of 5k-10k field values to be the most performant for inserts.

Inserting data in chronological order is also more performant. If you have a lot of tags in your data, sorting them into alphabetical order by key can also increase write throughput marginally.

Those are the main pointers. I would also suggest this white paper which has a couple of other tips.

Hope that helps!

#3

Awesome, thank you very much. That’s really helpful.

I saw this comment:
In contrast, single nodes of InfluxDB generally hit peak performance at just under 1 million writes per second, even on the most performant hardware.

What are the typical bottlenecks for the nodes - CPU, RAM, or disks?

#4

@jefferyanderson For write speed it’s CPU. Handling HTTP requests and persisting to disk are the major operations. Memory needs will depend primarily on series cardinality.

#5

Great - thanks for the helpful and prompt responses Jack.

#6

Hi Jack,

I am looking at moving many million measurement data points from PostgreSQL to InfluxDB. The data goes back to early 2014 and comes from roughly 1500 devices, with roughly a dozen field values per measurement.

I have worked out my series tags and fields and I would like some suggestions for how to import the data in such a way that the database will be as efficient as possible. You said, “Inserting data in chronological order is also more performant … sorting [tags] into alphabetical order can also increase write throughput …”

My original intention was to load InfluxDB series-by-series i.e. grab the PostgreSQL measurements corresponding to a particular InfluxDB series and load those into InfluxDB in timestamp order, then move on to the measurements for the next series, and so on. Would it be better to instead simply work through the PostgreSQL table in timestamp order rather than in “series” order?

Finally, would my loader (a Perl script) be better off writing batches directly to InfluxDB using HTTP, or would it be faster to generate flat files containing InfluxDB line protocol and then import them using the influx -import command?

Thanks!

#7

@JeremySTX Write directly to the database in batches of 5k-10k field values. Time-sorted points will be quickest. Start with the oldest ones first.
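
For anyone scripting this, a minimal Python sketch of one batched write over HTTP follows. It assumes an InfluxDB 1.x server, whose write endpoint is `/write` with `db` and `precision` query parameters. The base URL, the database name `mydb`, and the function names are placeholders made up for this example, and authentication is left out.

```python
import urllib.parse
import urllib.request


def build_write_request(lines, base_url="http://localhost:8086",
                        db="mydb", precision="s"):
    # base_url and db are placeholder values for illustration.
    query = urllib.parse.urlencode({"db": db, "precision": precision})
    # One batch is just the line-protocol lines joined by newlines.
    body = "\n".join(lines).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/write?{query}", data=body, method="POST"
    )


def write_batch(lines, **kwargs):
    # Sends one HTTP request for the whole batch. InfluxDB 1.x answers
    # 204 No Content when the write is accepted.
    with urllib.request.urlopen(build_write_request(lines, **kwargs)) as resp:
        return resp.status
```

Calling `write_batch(lines)` once per batch keeps the number of HTTP requests small, which matters because request handling is a large part of the write cost.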

Hope that helps! If you run into any issues drop me a line.

#8

To give some perspective, we are lucky to have some incredibly beefy boxes at our disposal and have been able to handle >1M points/sec on a single instance. All of that comes with a massive caveat which Jack ever so casually mentioned…series cardinality.

Even with the most tailored, amazing, phenomenally large machine in the world, if your series are stuck under one measurement consuming your cardinality, you are toast. We encountered this in our own setup very recently and had to reformat our data collection in order to get back to proper ingest capabilities (>1M pps).

Our particular issue was caused by >90% of series sitting under one measurement (you can check this with influx_inspect report -detailed /path/to/shard/num), which bottlenecked any data coming into or out of the machine. We refactored the collector to send measurements in a much cleaner way (splitting into multiple measurements instead of one ‘umbrella’ measurement) and brought query times back down to sub-second levels quickly.

Outside of pure hardware improvements, it really does help to see how much impact your measurement collection has on the structure of the index altogether. Even with incredibly vertical machines, cardinality is still king.

#9

Thanks for that, it confirms my understanding. Although I initially read “5000 field values per batch”, I was thinking in terms of measurement “rows” (for want of a better term). Given our upload will have around a dozen fields per measurement, I think I had better upload around 800 measurements per batch (which will be between 9k and 10k field values per upload).
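
The arithmetic can be sketched in a few lines of Python. The helper name and the 10,000 field-value budget are assumptions taken from the 5k-10k guidance above. With 12 fields per measurement the ceiling works out to 833 rows per batch, so 800 rows (9,600 field values) sits comfortably inside it.

```python
def batch_by_field_budget(points, fields_per_point, max_field_values=10_000):
    # How many rows fit in a batch without exceeding the field-value budget.
    per_batch = max(1, max_field_values // fields_per_point)
    for start in range(0, len(points), per_batch):
        yield points[start:start + per_batch]


if __name__ == "__main__":
    rows = list(range(2000))  # stand-in for 2000 measurement rows
    sizes = [len(batch) for batch in batch_by_field_budget(rows, 12)]
    print(sizes)  # [833, 833, 334]
```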

#10

I am not sure InfluxDB can do 1 million writes per second. I have a Python program that runs its processing on 25 cores, and these 25 processes try to write to InfluxDB simultaneously using curl. The curl calls failed with error code 7, which means curl could not connect to the server.

I even tried writing 18k records serially using curl, and it took 150 seconds. So what is meant by 1 million writes per second?

#11

I changed my logic: instead of inserting single records in parallel via Python, I compile all the records into a single file and use the “influx” utility to ingest it, which gives better performance.
I was able to ingest 1541494 records in 13.79 seconds.
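
For reference, a minimal Python sketch of building such a file is below. It assumes the ingest was done with the InfluxDB 1.x `influx -import` command, whose input file has a DDL section and a DML section with a `# CONTEXT-DATABASE` line. The database name `mydb`, the file name `points.txt`, and the function names are placeholders for this example.

```python
def render_import_file(db, lines):
    # Layout expected by `influx -import`: a DDL section that creates the
    # database, then a DML section naming the target database, then the
    # line-protocol points.
    header = [
        "# DDL",
        f"CREATE DATABASE {db}",
        "",
        "# DML",
        f"# CONTEXT-DATABASE: {db}",
        "",
    ]
    return "\n".join(header + list(lines)) + "\n"


def write_import_file(path, db, lines):
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_import_file(db, lines))


if __name__ == "__main__":
    write_import_file("points.txt", "mydb", ["cpu,host=a load=0.5 1388534400"])
    # Then, from a shell: influx -import -path=points.txt -precision=s
```

At the figures quoted above, 1541494 records in 13.79 seconds is roughly 112,000 records per second, compared with about 120 per second for the one-record-per-curl approach.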
