What are the expected performance penalties with unsorted tag keys in line protocol?

influxdb
#1

There is a performance tip in influx line protocol docs:

Sort tags by key before sending them to the database.

Could you clarify:

  1. What are the expected performance penalties when this recommendation is not followed?

  2. Which operations does it affect, and by how much?

  3. Will it slow down data loading into InfluxDB only, or also affect all future SELECT/SHOW queries?

In my use case I need to merge data lines from a CSV file with some extra lookup info to produce InfluxDB line protocol strings. The final full tag key-value set includes fields from both sources. I suspect that the overhead of sorting this set by key in a Python script may be higher than the performance loss from unsorted tags.

If unsorted tags only affect line protocol load performance and not queries, I’d prefer to keep the loader script logic as simple as possible.
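For what it's worth, sorting a small tag set per line is cheap in Python, since `sorted()` on a handful of items is handled by CPython's C-level Timsort. A minimal sketch of building a line with sorted tags (function and measurement names here are illustrative, not from the original post):

```python
def escape(value: str) -> str:
    """Escape characters that are special in line protocol tag keys/values."""
    return value.replace(",", r"\,").replace(" ", r"\ ").replace("=", r"\=")

def make_line(measurement: str, tags: dict, fields: dict, timestamp: int) -> str:
    # sorted(tags.items()) orders tag pairs by key, as the docs recommend
    tag_str = ",".join(f"{escape(k)}={escape(v)}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_str} {field_str} {timestamp}"

line = make_line("cpu", {"region": "us-west", "host": "server01"},
                 {"usage": 42.5}, 1_600_000_000_000_000_000)
# tags come out as host=...,region=... regardless of input dict order
```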

One more question:
Are there any guidelines or benchmarks available to compare bulk data loading via the HTTP API and “influx -import”? My current input data stream is approx. 500K (will grow to 1.5-3M) line protocol lines every 5 minutes.

#2

@yuyu The overhead in Python would be much higher than in Go (InfluxDB)! Unsorted keys only affect write performance, and only at the margins.

I’ve actually found influx -import a little slower than the HTTP API for use cases like this. When writing, just make sure to break the points up into batches. Shoot for batches of around 10k field values.
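The batching above can be sketched as a simple chunking step before each POST. The batch size here counts lines as a proxy for the ~10k field values mentioned above, and the endpoint URL and database name are placeholders:

```python
def batches(lines, size):
    """Yield successive chunks of at most `size` lines."""
    for i in range(0, len(lines), size):
        yield lines[i:i + size]

# Each batch would then be newline-joined and POSTed to the /write
# endpoint, e.g. (hypothetical values, using the requests library):
#
#   for batch in batches(all_lines, 10_000):
#       requests.post("http://localhost:8086/write?db=mydb",
#                     data="\n".join(batch).encode())
```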

Hope this helps!

#3

Thanks for the prompt reply! That was exactly what I expected. I already do HTTP POSTs in batches; I didn’t notice a big difference in the 4K-10K batch size range (no exact benchmarking, though).

What I observe are some rare sporadic return code 500 {“error”:“timeout”} errors after a batch POST call; usually one more POST retry is enough to push the batch to the db. But that’s another story.

#4

@yuyu You should build your clients to expect backpressure and retry those requests. Telegraf implements this functionality natively. Occasional 500s are expected, but anything more frequent can point to issues.
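A minimal sketch of that retry-on-backpressure loop, assuming a hypothetical `post_batch` callable that returns an HTTP status code:

```python
import time

def write_with_retry(post_batch, batch, retries=3, backoff=1.0):
    """Retry a batch write on 5xx responses with exponential backoff."""
    for attempt in range(retries):
        status = post_batch(batch)
        if status < 500:
            # 2xx means success; 4xx is a client error that a retry won't fix
            return status
        # 5xx (e.g. the 500 "timeout" above) signals backpressure: wait and retry
        time.sleep(backoff * (2 ** attempt))
    return status
```

Jittered backoff and a cap on total wait time would be sensible additions in a real loader, but this shows the shape of the logic.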

You are right, not a big difference between 4k and 10k batches.