Storing timeseries of different sampling rates

asti · April 7, 2017, 4:39pm

Not all measurements would have the same sampling rate, but they do need to be analysed together.

For example, board temperature could be sampled every 10 seconds, but S.M.A.R.T. attributes for hard disks may only be read once every hour. Is it better to store both of these values together, or should they be stored separately?

Keeping them together would leave some values unchanged for thousands of entries, which should compress well but would still carry some storage overhead. Keeping them separate would mean joining two series with dissimilar intervals outside the db. That might perform much worse than if the data were in one series, though there would be less data to read. Some kind of reconciliation would also have to be run to pinpoint the value of a low-frequency measurement at the same time as a high-frequency one.

This is referencing #3860
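
The reconciliation step described above can be sketched as an "as-of" lookup: for each high-frequency timestamp, take the most recent low-frequency reading at or before it. This is only a minimal sketch in plain Python, assuming both series are already sorted by time; the function name `asof_join` and the sample values are made up for illustration.

```python
from bisect import bisect_right


def asof_join(fast, slow):
    """Pair each (ts, value) in `fast` with the latest `slow` value at or before ts.

    Both inputs are lists of (timestamp_seconds, value) sorted by timestamp.
    Returns a list of (ts, fast_value, slow_value_or_None).
    """
    slow_ts = [ts for ts, _ in slow]
    joined = []
    for ts, fast_value in fast:
        # Number of slow samples whose timestamp is <= ts.
        i = bisect_right(slow_ts, ts)
        slow_value = slow[i - 1][1] if i > 0 else None
        joined.append((ts, fast_value, slow_value))
    return joined


# Hypothetical data: temperature every 10 s, one S.M.A.R.T. counter read hourly.
temperature = [(0, 41.0), (10, 41.5), (3600, 43.0), (3610, 42.5)]
smart = [(5, 100), (3600, 101)]

rows = asof_join(temperature, smart)
for row in rows:
    print(row)
```

A high-frequency sample that precedes the first low-frequency reading gets `None`, which makes the gap explicit instead of guessing a value.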

jackzampolin · April 7, 2017, 6:07pm

@asti My suggestion would be to store the values together. The compression for repeated values is excellent. We store a pointer to the original, so there is hardly any storage impact (< 2 bytes per value).

asti · April 7, 2017, 9:29pm

Thank you for the reply, Jack.
That low compression overhead is excellent.
Does it still apply for multiple entities within the same measurement?
That is, if I store timeseries data that mostly repeats, but they are stored together as:

server1,10,20,30
server2,40,50,60
server1,10,20,30
server2,40,50,60
server1,10,20,30
server2,40,50,60
server1,10,20,30
server2,40,50,60

Or would server1 and server2 have to be stored independently?
If it’s snappy compression, then only the overall symbols should matter, and not the deltas - can I correctly assume this is the case?

jackzampolin · April 8, 2017, 12:18am

@asti You should store server as a tag and have the measurement name be something descriptive of the values being collected. We use snappy for strings and double delta compression for floats and integers.
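
As a rough illustration of that layout, the rows from the question could be written in InfluxDB line protocol with `server` as a tag. This is a sketch under assumptions: the measurement name `system_stats` and the field names `f1`, `f2`, `f3` are hypothetical (the original rows do not name their columns), and the helper does no escaping, so it assumes names and tag values contain no spaces, commas, or equals signs.

```python
def to_line(measurement, tags, fields, timestamp_ns):
    """Format one point as a line-protocol string with integer fields.

    tags: dict of tag key -> tag value (strings).
    fields: dict of field key -> int value.
    timestamp_ns: integer timestamp in nanoseconds.
    """
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    # The trailing `i` marks a field value as an integer.
    field_part = ",".join(f"{k}={v}i" for k, v in fields.items())
    head = f"{measurement},{tag_part}" if tag_part else measurement
    return f"{head} {field_part} {timestamp_ns}"


# Hypothetical measurement and field names; values taken from the rows above.
t0 = 1491582000000000000
lines = [
    to_line("system_stats", {"server": "server1"}, {"f1": 10, "f2": 20, "f3": 30}, t0),
    to_line("system_stats", {"server": "server2"}, {"f1": 40, "f2": 50, "f3": 60}, t0),
]
print("\n".join(lines))
```

With this shape, each server's values form their own series under one measurement, instead of being interleaved as rows of a single stream.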

asti · April 8, 2017, 9:15am

There’s a schema design recommendation in the InfluxDB docs to avoid using an identifier as a tag - it states that a large number of unique tags degrades the index.
Should it be artificially partitioned into multiple tags, say region1,machine1? Or are a few thousand unique values of a tag acceptable?

jackzampolin · April 10, 2017, 5:53am

@asti A few thousand unique tags is fine! A single instance can handle between 5-10M series. That number is going to increase significantly with tsi (the new index implementation).
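
To relate a schema to that series figure, one can count distinct series for a planned layout. The sketch below assumes a series is identified by its measurement plus its full tag set; the helper `count_series` and the tag names `host`, `region`, and `machine` are hypothetical. In this toy example, splitting one identifier into two tags leaves the number of distinct series unchanged.

```python
def count_series(points):
    """Count distinct (measurement, tag set) combinations.

    points: iterable of (measurement, tags_dict).
    """
    return len({(m, tuple(sorted(tags.items()))) for m, tags in points})


# Hypothetical fleet: 3 regions with 1000 machines each.
machines = [(region, machine) for region in range(3) for machine in range(1000)]

# Layout A: one identifier tag per machine.
single_tag = [
    ("system_stats", {"host": f"region{r}-machine{m}"}) for r, m in machines
]

# Layout B: the same identity split across two tags.
split_tags = [
    ("system_stats", {"region": f"region{r}", "machine": f"machine{m}"})
    for r, m in machines
]

print(count_series(single_tag), count_series(split_tags))
```

Both layouts yield 3000 series here, well below the 5-10M range mentioned above.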
