Tuning for sustained high throughput

time-series
influxdb
#1

I’ve seen posts that recommend changing tuning parameters to limit cache size for high-volume (historical) load jobs.
I’m running some tests that are having issues with memory usage and limited throughput.

When I load a large batch of 24M rows (2-3 fields/row), I can’t get above about 150K rows/second.
This is on an 8-core/16-thread Ryzen 3.8 GHz processor with 32 GB of memory. There are moments of full CPU utilization, but it seems to go in cycles; I’d say the CPU is maxed out 50% of the time or less. Storage is an NVMe SSD with >1 GB/sec throughput. Memory slowly grows to between 24 and 32 GB, depending on the number of load clients. Once the load completes on the client side (2-4 minutes for 2-8 clients), Influx continues processing (indexing?) data for another few minutes. Eventually, memory goes back down. Sometimes the box runs out of memory, or, if I run a basic query right after the load, Influx logs an OOM error and crashes.
There are ~6M unique series. I had most of these under a single measurement, but after reading that this can cause performance issues, I spread them out over 600 measurements, which didn’t help much.
Also note that I’m not specifying a timestamp on the input; it appears Influx picks a value at the start of a file load.
I’m using the curl POST method to load the files. I’ve tried 100K and 10K rows/file; it didn’t seem to make much difference. I’ve also tried both index models.
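For reference, here is a minimal Python sketch of the same load path with two changes: each row carries an explicit nanosecond timestamp, and rows are grouped into fixed-size batches before being posted. The measurement, tag, and field names, the database name `mydb`, and the localhost URL are all placeholders, not taken from the setup above. It assumes the InfluxDB 1.x `/write` endpoint and does no line-protocol escaping, so it only works for names and values without spaces, commas, or equals signs.

```python
import urllib.request
from itertools import islice
from typing import Dict, Iterable, Iterator


def to_line(measurement: str, tags: Dict[str, str],
            fields: Dict[str, float], ts_ns: int) -> str:
    """Format one point as line protocol with an explicit timestamp.

    Simplification: no escaping, and every field is written as a float.
    Tags are sorted by key, which the line protocol docs recommend.
    """
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={float(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {ts_ns}"


def batches(lines: Iterable[str], size: int) -> Iterator[str]:
    """Yield newline-joined request bodies of at most `size` lines each."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield "\n".join(chunk)


def post_batch(body: str,
               url: str = "http://localhost:8086/write?db=mydb&precision=ns") -> int:
    """POST one batch; the URL and database name are placeholders."""
    req = urllib.request.Request(url, data=body.encode("utf-8"), method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status  # InfluxDB 1.x answers 204 on a successful write


# Build 25 points one second apart and group them into batches of 10.
base_ns = 1_550_000_000 * 10**9  # hypothetical start time
points = (
    to_line("sensor", {"site": "a", "unit": str(i % 5)}, {"value": i},
            base_ns + i * 10**9)
    for i in range(25)
)
sizes = [body.count("\n") + 1 for body in batches(points, 10)]
print(sizes)  # 25 lines split as 10, 10, 5
```

Explicit timestamps matter for a historical load because points that share a measurement, tag set, and timestamp are treated as the same point, so rows that all receive one server-assigned time can overwrite each other within a series. Calling `post_batch` on each body from `batches(...)` would replace the curl step; running several such loops in parallel corresponds to the 2-8 load clients described above.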

How do I set up the measurements/series to get 1M rows/sec throughput?
Is there a document that would help identify tuning parameters for historical loads or high sustained throughput?

Daily processing needs to be able to push at minimum 1.3B rows into Influx at 60K/second sustained, or ideally 14B rows at 655K/second sustained. I’ll need to cluster in production for availability, which would cut my throughput in half or more (according to the docs).
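Taken at face value, both targets describe a load window of roughly six hours per day rather than a rate spread over 24 hours. The short Python calculation below makes that arithmetic explicit; the function names are made up for illustration, and the 150K rows/second figure is the rate observed in the tests above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def load_window_hours(rows: float, rows_per_sec: float) -> float:
    """Hours needed to push `rows` at a sustained `rows_per_sec`."""
    return rows / rows_per_sec / 3600


def all_day_rate(rows: float) -> float:
    """Rate needed if the same rows were spread evenly over 24 hours."""
    return rows / SECONDS_PER_DAY


targets = [("minimum", 1.3e9, 60_000), ("ideal", 14e9, 655_000)]
observed_rate = 150_000  # rows/second seen in the tests above

for label, rows, rate in targets:
    print(
        f"{label}: {load_window_hours(rows, rate):.2f} h at target rate, "
        f"{load_window_hours(rows, observed_rate):.2f} h at observed rate, "
        f"{all_day_rate(rows):,.0f} rows/s if spread over a full day"
    )
```

At the observed 150K rows/second, the 1.3B-row case fits in under three hours, while the 14B-row case would need about 26 hours, so it cannot complete within a day on a single node at that rate.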

Side question: when I run the stats query against the Influx internal database, the number of series appears correct, but when I run influx_inspect against the TSM files, its estimates are wildly different (much higher, and wrong?). What is influx_inspect actually reporting? Or is that an indicator that I’m doing something wrong in my loads?

#2

I have a dataset of similar size and see the same behavior.

My bottleneck was also CPU, and after I added more cores to the VM, throughput grew proportionally.
I also noticed that disk utilization (%) increased, so I suspect that adding more CPU will eventually make the disk the bottleneck.
There are periods of higher CPU and 100% disk utilization during compaction phases.

After the upload stops, InfluxDB still processes inputs from the WAL files, performs compactions, and keeps the data cache for a while.

Running out of memory crashes InfluxDB every time. Switching to TSI1 helps; memory usage then grows roughly logarithmically with the number of series in the active shard.

I am interested to hear what you ended up doing to increase the throughput. I was able to get 150k/sec on a single VM. Did you get to 1M/sec?

What are the settings for backfilling efficiently? I have dozens of terabytes, which may take months to load.
Sending historical data would benefit from compression, but there seems to be none on the input side.
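On the compression point, one thing worth trying: as far as I know, the InfluxDB 1.x `/write` endpoint accepts a gzip-compressed body when the request carries a `Content-Encoding: gzip` header (Telegraf's InfluxDB output exposes a `content_encoding` option for this). Treat that as an assumption to verify against your version. A minimal Python sketch, with a placeholder URL and database name and a made-up sample payload:

```python
import gzip
import urllib.request


def gzip_body(body: str) -> bytes:
    """Compress a line-protocol batch for sending."""
    return gzip.compress(body.encode("utf-8"))


def build_request(body: str,
                  url: str = "http://localhost:8086/write?db=mydb&precision=ns"
                  ) -> urllib.request.Request:
    """Build (but do not send) a gzip-encoded write request.

    The URL and database name are placeholders.
    """
    return urllib.request.Request(
        url,
        data=gzip_body(body),
        headers={"Content-Encoding": "gzip"},
        method="POST",
    )


# Line protocol is repetitive, so a batch usually compresses well.
sample = "\n".join(
    f"sensor,site=a,unit={i % 5} value={float(i)} {1_550_000_000_000_000_000 + i}"
    for i in range(10_000)
)
raw_size = len(sample.encode("utf-8"))
compressed_size = len(gzip_body(sample))
print(raw_size, compressed_size)
```

Sending is then `urllib.request.urlopen(build_request(batch))`. Compression reduces bytes on the wire but costs CPU on both ends, so it is more likely to help a network-bound backfill than a load that is already CPU-bound on the server.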

#3

We have the same symptoms: limited throughput and increasing CPU usage when TSM work starts. Does anybody have ideas for improving performance?

We are using the latest version of InfluxDB, 1.7.4.