Schema and compression of scaled 16 bit integer values from IoT sensor ADCs

Hi. Like many, I am evaluating influxdb for an IoT sensor logging product, but my use case is to run the DB and store the data locally to the device and not to the cloud. The long-term TSM storage will likely be an SD card. (Still deciding where WAL should go.) One of the easiest ways to keep the SD card from wearing out is to store a small file size compared to the overall space on the card, which gives the SD card’s wear leveling algorithms lots of room to work with. So I’m very interested in optimizing my schema for compression with regard to my use case.

I have a few different types of sensors, and multiple sensors of a given type. Each sensor reports multiple data items. For example, all sensors of type A might report temperature and current draw. My plan is to make a measurement for each sensor type, with fields for each data item for that type, then tag every entry with the identifier of the sensor that made the measurement. Seems like a natural fit.

My sensor data arrives at very regular, predictable rates for each sensor, so it’s a good fit for the delta-of-deltas timestamp compression. My question instead regards the compression of the field values themselves.

My sensor data is available as 16 bit unsigned integers. Usually, these are from analog to digital converters, but the source is not important. The encoding of the integers is such that 0 means some minimum value (call it MIN) and 0xFFFF means some maximum value MAX. For example, the 16 bit range could cover 10 deg MIN to 50 deg MAX, so a 16 bit value of 0x7FFF in the middle would equate to 0x7FFF/0xFFFF * (50 - 10) + 10 = ~29.99 degrees. Also, consecutive sensor readings only change by a small number of counts in the 16 bit range in most circumstances, so they are pretty close, but rarely are consecutive readings exactly the same.
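For reference, the MIN/MAX scaling described above can be written as a small helper. This is only a sketch: the function name `scale_reading` and its parameters are made up for illustration, and the 10 to 50 degree range is just the example from the text.

```python
def scale_reading(raw: int, min_value: float, max_value: float) -> float:
    """Map a raw 16-bit count (0..0xFFFF) linearly onto [min_value, max_value]."""
    if not 0 <= raw <= 0xFFFF:
        raise ValueError("raw must fit in an unsigned 16-bit integer")
    return raw / 0xFFFF * (max_value - min_value) + min_value


# Mid-scale reading on the hypothetical 10..50 degree sensor from the example.
print(scale_reading(0x7FFF, 10.0, 50.0))
```

Keeping the raw count in the database and applying this scaling only when reading data back is what Option 1 below amounts to.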

I have read about simple8b encoding for int64 and the float64 XOR encoding in the Facebook Gorilla paper. If my understanding is correct, neither of these will provide significant storage savings compared to storing uint16 values. In fact, they will very likely be worse.

  • Option 1 NOT ACCURATE. SEE EDIT2: store the raw 16 bit sensor value (for now, never mind the added complexity of needing to do the MIN/MAX scaling at query time or post processing). Half of the possible values can be stored in 15 bits and the other half in 16 bits. The storage type would be int64 and encoded in simple8b. This would cause the encoding selector to be either 12 or 13, meaning 4 or 3 sensor readings could fit into the int64, respectively. If I just stored uint16 directly, the same 64 bits would hold 4 readings at all times. So the integer “compression” is actually worse than uncompressed.
  • Option 2: store the MIN/MAX scaled value as a float64. Consecutive values that are exactly the same will have awesome compression to only 1 bit, but this is a rare case. Most values will fall under Case A (0 control bit) from the Gorilla paper. Using Facebook’s data as an example, their average compressed size for this case was 26.6 bits, down from 64 bits. 26.6 bits is much higher than the 16 bits of a raw uint16, so again, this is not an improvement.
  • Option 3: store the raw 16 bit sensor value as a float64. I think the analysis from option 2 still applies in this case.

Option 1 seems like the better option out of the three, but still worse than uint16 directly.

Is there any way to get good compression savings when the data elements are only unsigned shorts uint16?

Edit: perhaps it’s possible to delta-encode the int64 data type using cumulative_sum, special tags, and custom insertion logic? For instance, tag the first reading for each sensor during a logging session as int64 raw value, and thereafter only insert the deltas from value to value. Then reconstruct the uint16s using some kind of complicated query.
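To make the delta idea concrete, here is a minimal client-side sketch of the round trip: keep the first raw value, store only differences after that, and rebuild the readings with a running sum. The helper names `to_deltas` and `from_deltas` are hypothetical, and this runs outside the database rather than through a query.

```python
from itertools import accumulate


def to_deltas(readings):
    """Return [first raw value, then the difference between each pair of neighbours]."""
    return readings[:1] + [b - a for a, b in zip(readings, readings[1:])]


def from_deltas(deltas):
    """Rebuild the original readings with a running (cumulative) sum."""
    return list(accumulate(deltas))


readings = [32767, 32770, 32768]
deltas = to_deltas(readings)
print(deltas)
print(from_deltas(deltas) == readings)
```

The obvious drawback is that a single lost or out-of-order point corrupts every reconstructed value after it.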

Edit2: Solution. The documentation for int64 encoding is not accurate, or perhaps just misleading. Int64 values are not ZigZag encoded like the documentation says. Instead, their deltas are ZigZag encoded, as can be seen by inspecting the source code. So Option 1 is the best option by far and should provide significant compression savings.
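A rough sketch of why delta plus ZigZag encoding helps so much here. It is a simplification, not InfluxDB's actual encoder: it picks one simple8b bit width for the whole batch and ignores the run-length selectors and the separately stored first value. The helper names are made up for illustration.

```python
# Bit widths available to simple8b's packed selectors (60 payload bits per word).
SIMPLE8B_WIDTHS = [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 15, 20, 30, 60]


def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) ^ (n >> 63)


def values_per_word(readings):
    """Estimate how many delta-encoded readings fit in one 64-bit simple8b word."""
    if len(readings) < 2:
        raise ValueError("need at least two readings to form a delta")
    deltas = [b - a for a, b in zip(readings, readings[1:])]
    widest = max(zigzag(d).bit_length() for d in deltas)
    # Smallest supported width that can hold the largest encoded delta.
    width = next(w for w in SIMPLE8B_WIDTHS if w >= max(widest, 1))
    return 60 // width


# Readings that wander by a few counts around mid-scale.
readings = [32767, 32770, 32768, 32771, 32769]
print(values_per_word(readings))
```

With deltas of only a few counts, each value needs about 3 bits, so roughly 20 readings share one 64-bit word, compared with 4 for raw uint16 storage.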

Hi,

I think you have summarised things really well. As it currently stands, InfluxDB doesn’t support 16-bit integers (unsigned or otherwise). Therefore all integer-specified values get treated as 64-bit integers, regardless of their size. We then apply our compression strategies before writing them.

For integers, we currently support RLE (run length encoding) and simple8b. In the future we might consider adding other compression types, e.g., for when all values fit in fewer than 64 bits.

If you want to be able to easily query the data then I think that inserting integers would be the best way to go about things. The only other suggestion would be to combine four readings into one 64-bit value. Of course, you wouldn’t be able to query them without some post-processing, and you would lose some time precision by combining multiple timestamps into one.
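A minimal sketch of the four-readings-per-value idea, assuming the oldest reading goes in the top 16 bits. The names `pack4` and `unpack4` are hypothetical. Because InfluxDB integers are signed 64-bit, the packed word is wrapped into the signed range before it would be written.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def pack4(readings):
    """Pack four uint16 readings into one signed 64-bit integer (first reading on top)."""
    if len(readings) != 4:
        raise ValueError("expected exactly four readings")
    word = 0
    for r in readings:
        if not 0 <= r <= 0xFFFF:
            raise ValueError("reading must fit in an unsigned 16-bit integer")
        word = (word << 16) | r
    # Wrap into the signed int64 range so the value is storable as an integer field.
    return word - (1 << 64) if word >= (1 << 63) else word


def unpack4(word):
    """Recover the four uint16 readings from a packed signed 64-bit integer."""
    word &= MASK64  # back to the unsigned bit pattern
    return [(word >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]


packed = pack4([0xFFFF, 0x0000, 0x7FFF, 0x0001])
print(packed, unpack4(packed))
```

Note that packed words from neighbouring samples differ in their high bits, so this layout probably compresses worse than separate integer fields once delta encoding is taken into account.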

Cheers,
Edd

Thanks for the confirmation, edd. I ran a test last night and checked the compaction data this morning with influx_inspect. The test set had 1 measurement with 3 fields, all integers. There were two sensors (and thus two tags), with sensor 1 running at 10 Hz and sensor 2 at 5 Hz. So, if I were to store this as raw data, the data usage per point would be (ignoring tagging metadata)

64 bit timestamp + 3x uint16 = 14 bytes (if storing raw)

Influx_inspect reports on average 1.40 bytes per point, so that’s a 90% reduction. Pretty darn good! I think I can improve this a little more if I work on preprocessing my timestamps a bit. Right now I’m inserting them with msec precision, but there is slight variation from one sample to the next. I think I could simply round to the nearest, say, 10 msec to eliminate a lot of this variation and improve the compression ratio. It would make my timestamps slightly less accurate, but wouldn’t be a big deal at the rates I’m using.
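The timestamp rounding mentioned above could look something like this. It is a sketch that assumes non-negative integer millisecond timestamps; the function name and the sample values are made up for illustration.

```python
def round_timestamp_ms(ts_ms: int, step_ms: int = 10) -> int:
    """Round a non-negative millisecond timestamp to the nearest multiple of step_ms."""
    return (ts_ms + step_ms // 2) // step_ms * step_ms


# Hypothetical 10 Hz samples with a few msec of jitter.
raw = [1000, 1101, 1199, 1302, 1400]
rounded = [round_timestamp_ms(t) for t in raw]
intervals = [b - a for a, b in zip(rounded, rounded[1:])]
print(rounded)
print(intervals)
```

Once the jitter is rounded away, every interval is identical, which is the best case for timestamp compression.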

Edit: I checked the DB and the data is being stored as float type, not integer. So the compression might even be better if I fix this. I’ll have to check why the python client library is not posting integer data.

@fluffynukeit Did you complete the test with storing as integers instead of floats? What was the result?

I don’t remember the result precisely, but I did see significant compression savings. I believe they were above the 90% that I saw with floats. If I remember correctly, my issue at the time was that the python influx client library does not recognize a Numpy integer as an integer type, so it submits Numpy integers to the DB as floats. I don’t use the python library anymore and instead generate the line protocol data myself. (To submit as an integer, I believe the data value must end with a lower case “i”.)
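For anyone hitting the same problem, here is a minimal sketch of building a line protocol string by hand with integer fields. The measurement name, the `sensor` tag key, and the field names are hypothetical. It does no escaping of spaces or commas, and it assumes the write request specifies millisecond precision.

```python
def to_line(measurement, sensor_id, fields, timestamp_ms):
    """Format one point in line protocol, forcing every field to an integer."""
    # int() also converts NumPy integer scalars to plain Python ints,
    # and the trailing "i" marks the field as an integer rather than a float.
    field_str = ",".join(f"{key}={int(value)}i" for key, value in fields.items())
    return f"{measurement},sensor={sensor_id} {field_str} {timestamp_ms}"


line = to_line("sensor_a", "s1", {"temperature": 32767, "current": 120}, 1000)
print(line)
```

Without the trailing `i`, a value such as `32767` is interpreted as a float.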