We are currently facing the challenge of storing various sensor data in InfluxDB and are not sure how best to name the measurements, tags, and fields. Specifically, we need to store data from sensors that have a unique ID and a fixed installation location. A wide variety of quantities are measured, such as the fill level of a garbage can or the meter reading of an electricity meter. However, a sensor usually transmits not only its sensor value but also its temperature and battery level.
In our opinion, data that does not change regularly, such as the installation location, should be stored separately as metadata.
In this example, are the measured variables such as consumption, fill level, temperature, and battery level suitable measurements, or does it make more sense to use the sensor IDs as measurements? What are suitable tags and suitable field keys?
What version of InfluxDB are you using? This may change some of the schema recommendations.
How many different unique sensor IDs do you expect to have? Thousands? Millions?
Generally speaking, a measurement should contain a logical grouping of related metrics. All points in a measurement should be homogeneous, meaning all rows (timestamps) should have values for all columns. I might suggest using the type/class of sensor, or the interval at which data is reported, to determine a measurement. For example:
garbage (measurement)
  sensor_id (tag)
  sensor_temp (field)
  sensor_battery_level (field)
  fill_level (field)

electricity (measurement)
  sensor_id (tag)
  sensor_temp (field)
  sensor_battery_level (field)
  meter_reading (field)
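On the wire, each point in such a schema is one line of line protocol, e.g. `garbage,sensor_id=42 fill_level=63,... <timestamp>`. A minimal Python sketch of building such lines (the helper name, sensor ID, and values are illustrative, and real line protocol also needs escaping and integer/string type markers, which are omitted here):

```python
# Minimal sketch: serialize one point of the schema above as line protocol.
# Real line protocol additionally needs escaping and type suffixes (e.g.
# "87i" for integers, quoted strings); this sketch skips that.

def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Serialize one point as: measurement,tag=v field=v,... timestamp"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

line = to_line_protocol(
    "garbage",
    {"sensor_id": "42"},
    {"sensor_temp": 21.5, "sensor_battery_level": 87, "fill_level": 63},
    1700000000000000000,
)
# garbage,sensor_id=42 fill_level=63,sensor_battery_level=87,sensor_temp=21.5 1700000000000000000
```

Note that the measurement name and the tag set together determine the series a point belongs to; the fields just ride along on each point.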
But answers to the question above will help to choose the best path forward.
With the storage engine in 2.x (TSM), tag cardinality (the number of unique tag values) can be a performance-limiting factor, but only in the hundreds of thousands to millions, so you shouldn’t have any problems there.
I’d store the sensor IDs as a tag. After that, it’s really just deciding how you want to (or don’t want to) split the data up into measurements. While homogeneous measurements are still recommended with the TSM storage engine, they’re not as important as they are with the 3.x storage engine, so you could just store everything in a single measurement. It’s really just a matter of preference.
One thing to possibly consider is your query patterns. What types of queries will you be running on the data? For example:
Return specific fields from a specific sensor
Return the average of a field from a set of specific sensors
Query the sensor temperature and battery level as well as the sensor value
Right now, we’re actually considering just storing everything into one single measurement, as using several measurements doesn’t seem to bring us any benefit.
The most common use cases at the moment are:
showing the latest measured values of each sensor (think temperature of weather stations, meter readings, …)
showing the “production” values of today (diff between meter reading now and meter reading at midnight, so e.g. how much energy was consumed by some building “today”)
showing aggregates over a variable time period (avg. consumption between dates on a day/week/month basis for reports)
measuring the utilization of parking lots (calculating the time difference between “occupied” and “free” datapoints provided by parking cameras)
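For the “production” case, the arithmetic is just the difference between the latest meter reading and the reading at midnight; a short sketch with made-up readings:

```python
# Sketch of the "production today" calculation: latest meter reading
# minus the reading at (or just after) midnight. Readings are made up.
readings = [
    ("2024-05-01T00:00:00Z", 1500.0),  # midnight reading
    ("2024-05-01T08:00:00Z", 1512.5),
    ("2024-05-01T16:00:00Z", 1531.0),  # latest reading
]

midnight_value = readings[0][1]
latest_value = readings[-1][1]
consumed_today = latest_value - midnight_value  # 31.0
```

In a query language this is typically the first and last value within the day’s time range; the subtraction itself is trivial either way.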
Another question: we have sensors that measure various kinds of (related) things, like power_input_day, power_input_night, power_output_day, power_output_night.
Right now, we’re creating one point per sensor, like:
  measurement: xy
  tags: sensor_id: 123
  fields: power_input_day: 123, power_input_night: 456, power_output_day: 234, power_output_night: 345
But not all sensors measure output; some only measure input. Then we would have points with only two fields (power_input_day, power_input_night).
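Serialized, the two point shapes differ only in which field keys are present; a short illustration (field names from above, values and second sensor ID made up, line protocol escaping/type suffixes omitted):

```python
# Two point shapes: a sensor with input and output vs. an input-only
# sensor. InfluxDB simply stores fewer fields for the second point; no
# nulls are written, they only appear later when the data is pivoted.

def fields_to_lp(fields):
    return ",".join(f"{k}={v}" for k, v in fields.items())

full = {"power_input_day": 123, "power_input_night": 456,
        "power_output_day": 234, "power_output_night": 345}
input_only = {"power_input_day": 111, "power_input_night": 222}

line_full = f"xy,sensor_id=123 {fields_to_lp(full)}"
line_input = f"xy,sensor_id=124 {fields_to_lp(input_only)}"
```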
True, not much difference. The reason I brought up this example is that sometimes we retrieve data based on a location (like the building where the sensors are installed). Sometimes there are different types of sensors installed there, e.g. a sensor with values for input and output, and a sensor that only measures input.
If we fetch sensors or data based on this location (or just request several different ids in one query), the result would contain data that has 4 entries (day_in, day_out, night_in, night_out) and data that has only two entries (day_in, night_in). This becomes interesting when we use
|> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
to put all the entries into a single row. Now we have rows that contain more “information” than the others.
This can be quite annoying when you need to check, at the application level, which fields are present in your result object. Below is a sample implementation of such behaviour:
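A minimal Python sketch of that kind of check, modeling the pivoted rows as plain dicts (row contents are made up; the field names are the ones from the example above):

```python
# Pivoted rows modeled as dicts; rows from input-only sensors lack the
# output columns. Row contents are made up for illustration.
rows = [
    {"_time": "t1", "sensor_id": "123", "power_input_day": 123,
     "power_input_night": 456, "power_output_day": 234,
     "power_output_night": 345},
    {"_time": "t1", "sensor_id": "124", "power_input_day": 111,
     "power_input_night": 222},  # no output fields present
]

balances = []
for row in rows:
    if "power_output_day" in row:
        balances.append(row["power_input_day"] - row["power_output_day"])
    else:
        balances.append(None)  # input-only sensor: nothing to diff
# balances == [-111, None]
```

Every consumer of the result has to repeat this presence check, which is exactly the annoyance described above.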
@Chris_KIP I guess it depends on what you want your application to do when it is missing values. On the Flux side, it would be simple enough to fill all the null values in the pivoted data with a default value, but I don’t know if that really helps in your use case. Depending on your application, you may still need to check for a non-default/non-zero value.
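At the application level the trade-off looks roughly like this (a sketch; using 0.0 as the query-side fill default is an assumption):

```python
# If missing fields are filled with 0.0 on the query side, the
# application can no longer tell "sensor measured 0" from "sensor has
# no such field":
filled_row = {"power_input_day": 111, "power_output_day": 0.0}  # filled
zero_row = {"power_input_day": 50, "power_output_day": 0.0}     # measured
indistinguishable = filled_row["power_output_day"] == zero_row["power_output_day"]

# Keeping nulls (None) instead preserves the distinction:
null_row = {"power_input_day": 111, "power_output_day": None}
has_output = null_row["power_output_day"] is not None  # False
```

So a default fill only helps if the default can never be a legitimate measured value.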