Hi. Like many, I am evaluating InfluxDB for an IoT sensor logging product, but my use case is to run the DB and store the data locally on the device rather than in the cloud. The long-term TSM storage will likely be an SD card. (Still deciding where the WAL should go.) One of the easiest ways to keep the SD card from wearing out is to keep the stored data small relative to the card’s total capacity, which gives the SD card’s wear-leveling algorithms lots of room to work with. So I’m very interested in optimizing my schema for compression in my use case.
I have a few different types of sensors, and multiple sensors of a given type. Each sensor reports multiple data items. For example, all sensors of type A might report temperature and current draw. My plan is to make a measurement for each sensor type, with fields for each data item for that type, then tag every entry with the identifier of the sensor that made the measurement. Seems like a natural fit.
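To make the proposed schema concrete, here is a minimal sketch of what one point could look like in InfluxDB line protocol. The measurement name `sensor_type_a`, the tag key `sensor_id`, and the field keys are all hypothetical placeholders, and the helper does no escaping of special characters.

```python
def to_line_protocol(measurement, sensor_id, fields, timestamp_ns):
    """Format one point as line protocol: measurement,tags fields timestamp."""
    # Integer field values carry an "i" suffix; without it they are
    # parsed as floats.
    field_set = ",".join(f"{key}={value}i" for key, value in fields.items())
    return f"{measurement},sensor_id={sensor_id} {field_set} {timestamp_ns}"


# Hypothetical reading from one type-A sensor (raw 16-bit counts).
line = to_line_protocol(
    "sensor_type_a",
    "a01",
    {"temperature": 32767, "current": 1200},
    1700000000000000000,
)
print(line)
```

Keeping the sensor identifier as a tag and the data items as fields matches the plan described above: one measurement per sensor type, one series per sensor.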
My sensor data arrives at very regular, predictable rates for each sensor, so it’s a good fit for the delta-of-deltas timestamp compression. My question instead regards the compression of the field values themselves.
My sensor data is available as 16-bit unsigned integers. Usually, these are from analog-to-digital converters, but the source is not important. The encoding of the integers is such that 0 means some minimum value (call it MIN) and 0xFFFF means some maximum value MAX. For example, the 16-bit range could cover 10 deg MIN to 50 deg MAX, so a 16-bit value of 0x7FFF in the middle would equate to 0x7FFF/0xFFFF * (50 - 10) + 10 ≈ 30.00 degrees (29.9997 to four decimal places). Also, consecutive sensor readings only change by a small number of counts in the 16-bit range in most circumstances, so they are pretty close, but rarely are consecutive readings exactly the same.
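The MIN/MAX scaling described above can be sketched as a small helper. The function name and argument names are illustrative only; the formula is the one given in the post.

```python
def counts_to_value(counts: int, min_value: float, max_value: float) -> float:
    """Map a raw 16-bit reading onto the range [min_value, max_value]."""
    if not 0 <= counts <= 0xFFFF:
        raise ValueError("counts must fit in an unsigned 16-bit integer")
    return counts / 0xFFFF * (max_value - min_value) + min_value


# The example from the post: a 10 to 50 degree range, mid-scale reading.
mid_scale = counts_to_value(0x7FFF, 10.0, 50.0)
print(round(mid_scale, 4))
```

Storing the raw counts and applying this conversion at query time or in post-processing keeps the stored values as small integers.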
I have read about simple8b encoding for int64 and the float64 XOR encoding in the Facebook Gorilla paper. If my understanding is correct, neither of these will provide significant storage savings compared to storing uint16 values. In fact, they will very likely be worse.
- Option 1 (NOT ACCURATE, SEE EDIT2): store the raw 16-bit sensor value (for now, never mind the added complexity of needing to do the MIN/MAX scaling at query time or in post-processing). Half of the possible values can be stored in 15 bits and the other half need 16 bits. The storage type would be int64, encoded with simple8b. This would make the encoding selector either 12 or 13, meaning 4 or 3 sensor readings, respectively, fit into each 64-bit word. If I just stored uint16 directly, the same 64 bits would hold 4 readings at all times. So the integer “compression” is actually worse than uncompressed.
- Option 2: store the MIN/MAX scaled value as a float64. Consecutive values that are exactly the same will compress extremely well, down to only 1 bit, but this is a rare case. Most values will fall under Case A (control bit 0) from the Gorilla paper. Using Facebook’s data as an example, their average for this case was 26.6 bits, down from 64 bits. 26.6 bits is much higher than the 16 bits of a raw uint16, so again, this is not an improvement.
- Option 3: store the raw 16-bit sensor value as a float64. I think the analysis from option 2 still applies in this case.
Option 1 seems like the better option out of the three, but still worse than uint16 directly.
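For Options 2 and 3, one way to get a feel for the cost is to measure how many "meaningful" bits remain after XORing two consecutive float64 values, which is the core step of the Gorilla scheme. The sketch below only counts the bits between the leading and trailing zeros of the XOR; it does not model the control bits or the leading-zero and length headers, so it is a lower bound and not InfluxDB's actual encoder.

```python
import struct


def float_bits(x: float) -> int:
    """Return the IEEE 754 bit pattern of a float64 as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def meaningful_xor_bits(prev: float, curr: float) -> int:
    """Count the bits between the leading and trailing zeros of prev XOR curr."""
    xor = float_bits(prev) ^ float_bits(curr)
    if xor == 0:
        return 0  # identical values: Gorilla spends a single control bit
    leading = 64 - xor.bit_length()
    trailing = (xor & -xor).bit_length() - 1
    return 64 - leading - trailing


# Two readings a few counts apart, scaled to degrees (Option 2)
# and kept as raw counts in a float64 (Option 3).
a, b = 0x8010, 0x8013
print(meaningful_xor_bits(a / 0xFFFF * 40 + 10, b / 0xFFFF * 40 + 10))
print(meaningful_xor_bits(float(a), float(b)))
```

Running this on real sensor traces would show whether Option 3 really behaves like Option 2 for a given data set.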
Is there any way to get good compression savings when the data elements are only unsigned shorts (uint16)?
Edit: perhaps it’s possible to delta-encode the int64 data type using cumulative_sum, special tags, and custom insertion logic? For instance, tag the first reading for each sensor during a logging session as int64 raw value, and thereafter only insert the deltas from value to value. Then reconstruct the uint16s using some kind of complicated query.
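The manual delta idea can be sketched outside the database as a round trip: keep the first raw reading of a session, store only differences afterwards, and rebuild the readings with a running sum (the role `cumulative_sum` would play at query time). The function names are hypothetical, and the tagging of the first reading is left out.

```python
from itertools import accumulate


def to_first_and_deltas(readings):
    """Keep the first raw reading, then store only value-to-value deltas."""
    if not readings:
        return []
    return [readings[0]] + [b - a for a, b in zip(readings, readings[1:])]


def from_first_and_deltas(stored):
    """Rebuild the original readings with a running sum."""
    return list(accumulate(stored))


session = [32767, 32770, 32768, 32771]
stored = to_first_and_deltas(session)
print(stored)
print(from_first_and_deltas(stored) == session)
```

One drawback of this approach is that every reading in a session depends on all the earlier ones, so a lost or out-of-order point corrupts everything after it.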
Edit2: Solution. The documentation for int64 encoding is not accurate, or perhaps just misleading. Int64 values are not ZigZag encoded like the documentation says. Their deltas are ZigZag encoded, as can be seen by inspecting the source code. So Option 1 is the best option by far and should provide significant compression savings.
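A rough sketch of why ZigZag-encoded deltas help: the readings below differ by only a few counts, so each encoded delta needs only a few bits. The `SIMPLE8B_LAYOUT` table is my understanding of the standard simple8b selectors (it agrees with selectors 12 and 13 mentioned above) and omits the two run-of-ones selectors. This is an illustration under those assumptions, not InfluxDB's encoder, which also handles the first value and run-length cases separately.

```python
# (bits per value, values per 64-bit word) for simple8b selectors 2..15.
SIMPLE8B_LAYOUT = [
    (1, 60), (2, 30), (3, 20), (4, 15), (5, 12), (6, 10), (7, 8),
    (8, 7), (10, 6), (12, 5), (15, 4), (20, 3), (30, 2), (60, 1),
]


def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) ^ (n >> 63)


def zigzag_deltas(readings):
    """ZigZag-encode the difference between each pair of consecutive readings."""
    return [zigzag(b - a) for a, b in zip(readings, readings[1:])]


def values_per_word(bits_needed: int) -> int:
    """How many values fit in one 64-bit word at the narrowest usable width."""
    for bits, count in SIMPLE8B_LAYOUT:
        if bits >= bits_needed:
            return count
    raise ValueError("value too wide for simple8b")


encoded = zigzag_deltas([32767, 32770, 32768, 32771])
widest = max(v.bit_length() for v in encoded)
print(encoded, widest, values_per_word(widest))
```

With deltas this small, a word can hold many more readings than the 4 that raw uint16 storage would fit into the same 64 bits. The packing is set by the widest delta in each group, so an occasional large jump reduces the gain for that word.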