Does my message belong here or on Slack? I’m not sure, so I’ve decided this is the right place.
I’m facing performance and memory-management issues in the evaluation scenario described below. Should we add memory, change the index configuration (in-memory -> TSI), or change the sharding? I can’t work out the right answer on my own.
Community to the rescue…
- InfluxDB 1.7.10 default Docker image
- Hosted on an Ubuntu 18.04.4 LTS server
- 4-vCPU VM with 16GB of memory
- Config: default in-memory index (not TSI), retention 2w, shard duration 1d
We use the Python SDK to send data to Influx.
Each dataframe is written with a single call, always using the same measurement and the same (single) tag.
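In case it helps, here is a minimal sketch of our write path, assuming the influxdb-python `DataFrameClient` (the host, port, database and tag names below are placeholders, not our real configuration):

```python
import pandas as pd

def make_field_block(field_name, values, timestamps):
    """One single-column float64 DataFrame for a given field, e.g. 'A'."""
    idx = pd.DatetimeIndex(pd.to_datetime(timestamps), name="time")
    return pd.DataFrame({field_name: values}, index=idx).astype("float64")

def write_block(df, host="localhost", port=8086, database="mydb"):
    """Write one block; same measurement and single tag on every call."""
    from influxdb import DataFrameClient  # influxdb-python SDK
    client = DataFrameClient(host=host, port=port, database=database)
    client.write_points(df, "measurement", tags={"source": "plant1"})
```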
We are processing historic files that carry 5 minutes of data values.
We chop this information up, based on the data field’s name.
Each such dataframe contains:
- One float64 column named after the field: A, B, C, etc.
The field name is one of roughly 200 names our data values can have.
So we might write a block of 10,000 rows for field A, then a block of 3,000 rows for field B, then a block of just 2 rows for field C.
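The chopping step can be sketched like this (the input column names `time`/`field`/`value` are an assumption for illustration, not our actual file layout):

```python
import pandas as pd

def chop_by_field(samples):
    """Split one historic file's samples into per-field blocks.

    `samples` is a DataFrame with columns ['time', 'field', 'value'];
    the result maps each field name (A, B, C, ...) to a single-column
    float64 DataFrame indexed by timestamp, ready to be written.
    """
    blocks = {}
    for name, group in samples.groupby("field"):
        block = group.set_index("time")[["value"]].rename(columns={"value": name})
        blocks[name] = block.astype("float64")
    return blocks
```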
Timestamps may overlap, e.g. when a B-sample was taken in the same millisecond as an A-sample.
About 100,000 samples are written every minute.
The result is a sparse dataset.
- Some rows will contain only one field with a value.
- Some rows may contain many fields that have a value set.
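Concretely, the sparsity looks like this when overlapping per-field blocks are viewed as one table (a tiny pandas sketch, not our actual data):

```python
import pandas as pd

# Three per-field blocks whose timestamps partly overlap.
t = pd.to_datetime(["2020-03-01 00:00:00.000",
                    "2020-03-01 00:00:00.001",
                    "2020-03-01 00:00:00.002"])
block_a = pd.DataFrame({"A": [1.0, 2.0]}, index=t[:2])
block_b = pd.DataFrame({"B": [3.0]}, index=t[1:2])
block_c = pd.DataFrame({"C": [4.0]}, index=t[2:])

# An outer column-wise join shows the rows as the database sees them:
merged = pd.concat([block_a, block_b, block_c], axis=1)
# Row .000 has only A set; row .001 has A and B; row .002 has only C.
```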
As indicated, we are unsure how to strike the right balance between memory and performance.
Currently we see a lot of memory usage.
With less than 8 GB of memory assigned, InfluxDB just hangs after some time with all memory consumed.
I guess that is because of the in-memory nature of the default index.
Is TSI expected to be a solution? We only have one measurement and one tag (and about 200 fields in a sparse dataset).
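For reference, my understanding is that switching a 1.7 server to TSI means setting the documented `index-version` key (whether it actually helps with so few series is exactly my question):

```toml
# influxdb.conf
[data]
  index-version = "tsi1"  # default is "inmem"
```

Existing shards are not converted automatically; the documented `influx_inspect buildtsi` command rebuilds the index for them while the server is stopped.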
Is InfluxDB not the best database for this kind of data structure?
Should we restructure our data or method of writing?
Thanks for your suggestions,