Schema Design for IoT Metrics

influxdb
time-series
iot
schema

#1

Hi guys,
We are planning to use InfluxDB as a storage for IoT data (sensors and different measurements).
We have that situation:

  • 3000 Devices
  • Each “typical” device has 8 sensors, (4 temperature sensors, 4 Power consumption sensors). The number of sensors per device can vary and in future maybe there will be other different types of devices with different sensors (not only temperature or power consumption).
  • Each sensor sends 4 messages per minute

The typical queries will be per sensor, but i don’t exclude that we’ll need some global statistics. For that i’m planning to use a single measurement, instead multiple measurements (divided by device? Or by measure type).

I’ve some questions:

  • Is good to use a single measurement to store that data? Do you see any implication? There is a difference in using single or multiple measurement in terms of memory and cpu usage?
  • What is the ideal point structure in this case? We have to choose between these two structures:
  1. TIMESTAMP | VALUE (FIELD) | SENSOR_ID (TAG) | DEVICE_ID (TAG) -> 1 row for each sensor
  2. TIMESTAMP | SENSOR_A_VALUE (FIELD) | SENSOR_B_VALUE (FIELD) | SENSOR_C_VALUE (FIELD) | … | DEVICE_ID(TAG) -> 1 row containing the data of all the sensors of the device at time T

Keep in mind that the number of sensors, for some devices, can vary so, for the solution number 2, some rows will have less fields, some rows will require to add more fields (due to more than 8 sensors present). For the solution number 1 the value field will be a float, what we can do if we need to store another type? We must add a new field?

Thank you in advance,
Matteo


Schema design: peer review request
#2

Hi Mateo,

I’ve recently done a similar thing, though not with as many sensors, and I asked much the same question. My data was also multi-sensor data, etc. Here’s the suggestion I got:

Typically you’ll have the sensor id/name and other metadata as a tag then the fields will be the various sensors depending on how much high level grouping you would like to have, the sensors could also be grouped into their own measurements so your line protocol will probably end up looking something like: thingamabob_sensor,id=1234,city=sf temp=22.2,humidity=0.80 1502747933000000000

So in my case, I have influx data_sensors as my ‘thingamabob_sensor’ and the two tags id and city, with the sensor readings following. Since my sensors all grab a reading for each value every second, I send the whole thing as one line-protocol write to the InfluxDB instance:

String::format("influxdata_sensors,id=%s,location=%s temp_c=%f,temp_f=%f,humidity=%f,pressure=%f,altitude=%f,broadband=%d,infrared=%d,lux=%f", myID.c_str(), myLoc.c_str(), temperature, fTemp, humidity, pressure, altitude, broadband, infrared, lux);

Making a single write to the database cuts down on the amount of time spent in setting up and tearing down the http connection within my sensor app – time better spent taking environmental readings and doing other important tasks.

I hope this helps!


#3

Thank you david :slight_smile:
Ok i understood your solution (and i like it :)), also the “writing optimization” part makes me think that we can go in that direction (in alternative the solution to avoid HTTP Connection setup was to use batch inserts).
But what if some (not all) of your sensors will send more than 8 measures (for example because you have built and released on market a new device version and that version sends 9 measures)?
For example, instead having that:

String::format(“influxdata_sensors,id=%s,location=%s temp_c=%f,temp_f=%f,humidity=%f,pressure=%f,altitude=%f,broadband=%d,infrared=%d,lux=%f”, myID.c_str(), myLoc.c_str(), temperature, fTemp, humidity, pressure, altitude, broadband, infrared, lux);

you have 1 more measure related to Power Consumption:

String::format(“influxdata_sensors,id=%s,location=%s temp_c=%f,temp_f=%f,humidity=%f,pressure=%f,altitude=%f,broadband=%d,infrared=%d,lux=%f,powerConsumption=%f”, myID.c_str(), myLoc.c_str(), temperature, fTemp, humidity, pressure, altitude, broadband, infrared, lux, powerConsumption);

Will that scheme support that kind of variation? Keep in mind that we do not have control on what and how devices send data. We’re just a “collector” of that data and we’ll need to plot on dashboards the fields

thank you!


#4

I’m thinking about also at the continuous queries.

Suppose i’ve that measurement:

TIMESTAMP | SENSOR_1_VALUE (FIELD) | SENSOR_2_VALUE (FIELD) | DEVICE_ID (TAG)

Suppose that on that measurement i compute the daily mean of sensors with a continuous query and i store it inside another measurement.
At some time i have newer devices with one more sensor.

TIMESTAMP | SENSOR_1_VALUE (FIELD) | SENSOR_2_VALUE (FIELD) | SENSOR_3_VALUE (FIELD) | DEVICE_ID (TAG)

What i have to do with the continuous query? I’ve to fix it? Or maybe i can create a dynamic continuous query that uses regular expressions and consider every field in the measurement?


#5

As an addition to the “continuous queries problem” see that Feature Request:


#6

Adding a new sensor reading at a later date shouldn’t be a problem. Adding a new field should (and in my experience does) add the new field to the schema on-the-fly and things just work.


#7

Hi David,
yes you’re right, it works. Also the possibility to use mean(*) in continuous query works with sparse data! That’s cool!
What we need to verify with the customer is if the single values coming from sensors have the same timestamp or not (so it is feasible to group them in a single row).
After that, i think we’ll try the solution:

  • Single measurement
  • Multiple fields per row

And then we’ll see with customer if we can split in more measurements based on devices or some other stuff (we want to know if they have to perform some statistical queries across devices / plants or not ;))
Thank you!


#8

Glad to help, and let us know how it goes!