Best practices for choosing measurements, tags, and fields

Hi everyone,

We are currently facing the challenge of storing various sensor data in InfluxDB and are not sure how best to name the measurements, tags, and fields. Specifically, we need to store data from sensors that each have a unique ID and a fixed installation location. A wide variety of quantities are measured, such as the fill level of a garbage can or the reading of an electricity meter. However, a sensor not only transmits its measured value, but usually also its temperature and battery level.

In our opinion, data that does not change regularly, such as the installation location, should be stored separately as metadata.

In this example, are the measured variables such as consumption, fill level, temperature, and battery level suitable measurements, or does it make more sense to use the sensor IDs as measurements? What are suitable tags and field keys?

@Chris_KIP A few questions:

  • What version of InfluxDB are you using? This may change some of the schema recommendations.
  • How many different unique sensor IDs do you expect to have? Thousands? Millions?

Generally speaking, a measurement should contain a logical grouping of related metrics. All points in a measurement should be homogeneous, meaning all rows (timestamps) should have values for all columns. I might suggest using the type/class of sensor or the interval at which data is reported to determine a measurement. For example:

  • garbage (measurement)

    • sensor_id (tag)
    • sensor_temp (field)
    • sensor_battery_level (field)
    • fill_level (field)
  • electricity (measurement)

    • sensor_id (tag)
    • sensor_temp (field)
    • sensor_battery_level (field)
    • meter_reading (field)
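
With a schema like this, queries stay simple because each measurement is self-describing. A minimal Flux sketch, assuming a bucket named "example-bucket" and a hypothetical sensor ID:

from(bucket: "example-bucket")  // placeholder bucket name
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "garbage")
    |> filter(fn: (r) => r.sensor_id == "abc123")  // hypothetical sensor ID
    |> filter(fn: (r) => r._field == "fill_level")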

But answers to the question above will help to choose the best path forward.

Hi @scott, thanks for your answer!

  • Right now, we’re using self-hosted InfluxDB, version 2.x
  • For this project, about a hundred unique sensors are to be expected

With the storage engine in 2.x (TSM), tag cardinality (the number of unique tag values) can be a performance-limiting factor, but only in the hundreds of thousands to millions, so you shouldn’t have any problems there.

I’d store the sensor IDs as a tag. After that, it’s really just deciding how you want to (or don’t want to) split data up into measurements. While homogeneous measurements are still recommended with the TSM storage engine, they’re not as important as they are with the 3.x storage engine, so you could just store everything in a single measurement. It’s really just a matter of preference.

One thing to possibly consider is your query patterns. What types of queries will you be running on the data? For example:

  • Return specific fields from a specific sensor
  • Return the average of a field from a set of specific sensors
  • Query the sensor temperature and battery level as well as the sensor value
  • etc.

Hi @scott, thanks again for your reply!

Right now, we’re actually considering just storing everything into one single measurement, as using several measurements doesn’t seem to bring us any benefit.

The most common use cases at the moment are:

  • showing the latest measured values of each sensor (think temperature of weather stations, meter readings, …)
  • showing the “production” values of today (diff between meter reading now and meter reading at midnight, so e.g. how much energy was consumed by some building “today”)
  • showing aggregates over a variable time period (avg. consumption between dates on a day/week/month basis for reports)
  • measuring the utilization of parking lots (calculating the time difference between “occupied” and “free” datapoints provided by parking cameras)

Sounds good. Based on your use cases, the single measurement doesn’t sound like it will be an issue.
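
For example, your first two use cases map to fairly compact Flux. A sketch, assuming the single measurement is named "xy", the meter field is "meter_reading", and a bucket named "example-bucket":

import "date"

// Latest value of every field from every sensor (one result table per series)
from(bucket: "example-bucket")
    |> range(start: -1d)
    |> filter(fn: (r) => r._measurement == "xy")
    |> last()
    |> yield(name: "latest")

// Today's "production": spread() (max - min) of a monotonically increasing
// meter reading since midnight (UTC unless you set a timezone option)
from(bucket: "example-bucket")
    |> range(start: date.truncate(t: now(), unit: 1d))
    |> filter(fn: (r) => r._measurement == "xy")
    |> filter(fn: (r) => r._field == "meter_reading")
    |> spread()
    |> yield(name: "today")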

Thanks @scott!

Another question: we have sensors that measure various kinds of (related) things, like power_input_day, power_input_night, power_output_day, power_output_night.

Right now, we’re creating one point, like
measurement: xy
tags: sensor_id: 123
fields: power_input_day: 123, power_input_night: 456, power_output_day: 234, power_output_night: 345

But, not all sensors measure output, some only measure input. Then we would have

measurement: xy
tags: sensor_id: 234
fields: power_input_day: 123, power_input_night: 456

This isn’t really consistent overall. Would it be better to instead create the 4 points like:

measurement: xy
tags: sensor_id: 123, type: power_input_day
fields: power: 123

measurement: xy
tags: sensor_id: 123, type: power_input_night
fields: power: 234

…, so we could have identically named fields (power instead of power_in_day, power_in_night, …)?

Are there arguments for one or the other, or does it not matter in the end?

I think for slightly simpler queries and better readability, I’d use your first proposed strategy:

So for example, let’s say you want to query both the power_input_day and power_input_night fields for a specific sensor_id:

from(bucket: "example-bucket")
    |> range(start: -30d)
    |> filter(fn: (r) => r._measurement == "xy")
    |> filter(fn: (r) => r.sensor_id == "123")
    |> filter(fn: (r) => r._field == "power_input_day" or r._field == "power_input_night")

Your results would look something like:

_time                 _measurement  sensor_id  _field           _value
2023-12-21T00:00:00Z  xy            123        power_input_day  123

_time                 _measurement  sensor_id  _field             _value
2023-12-21T00:00:00Z  xy            123        power_input_night  456

With the second schema strategy, to query the same information, you’d do this:

from(bucket: "example-bucket")
    |> range(start: -30d)
    |> filter(fn: (r) => r._measurement == "xy")
    |> filter(fn: (r) => r.sensor_id == "123")
    |> filter(fn: (r) => r.type == "power_input_day" or r.type == "power_input_night")
    |> filter(fn: (r) => r._field == "power")

And the results would look something like:

_time                 _measurement  sensor_id  type             _field  _value
2023-12-21T00:00:00Z  xy            123        power_input_day  power   123

_time                 _measurement  sensor_id  type               _field  _value
2023-12-21T00:00:00Z  xy            123        power_input_night  power   456

Not a crazy difference, but something to consider.

Hi @scott, thanks again for your valuable input! 🙂

True, not much difference. The reason I brought up this example is that sometimes we retrieve data based on a location (like the building where the sensors are installed). Sometimes there are different types of sensors installed there, e.g. a sensor with values for input and output, and a sensor that only measures input.
If we fetch sensors or data based on this location (or just request several different IDs in one query), the result would contain data that has 4 entries (day_in, day_out, night_in, night_out) and data that has only two entries (day_in, night_in). This becomes interesting when we use
|> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
to put all the entries into a single row. Now we have rows that contain more “information” than the others.
This can be quite annoying when you need to check which fields are present in your result object at application level. Below is a sample implementation of such behaviour:

// "next" callback passed to queryRows() from @influxdata/influxdb-client
next: (row: string[], tableMeta: FluxTableMetaData) => {
            const o = tableMeta.toObject(row);

            if (o.meter) {
              const [_, calculatedConsumption] = calculationMapping(
                o.device_id,
                0,
                o.meter
              );
              o.meter_consumption = calculatedConsumption;
            }
            if (o.meter_in_day) {
              const [_, calculatedConsumption] = calculationMapping(
                o.device_id,
                0,
                o.meter_in_day
              );
              o.meter_in_day_consumption = calculatedConsumption;
            }
            if (o.meter_out_day) {
              const [_, calculatedConsumption] = calculationMapping(
                o.device_id,
                0,
                o.meter_out_day
              );
              o.meter_out_day_consumption = calculatedConsumption;
            }
            if (o.meter_in_night) {
              const [_, calculatedConsumption] = calculationMapping(
                o.device_id,
                0,
                o.meter_in_night
              );
              o.meter_in_night_consumption = calculatedConsumption;
            }
            if (o.meter_out_night) {
              const [_, calculatedConsumption] = calculationMapping(
                o.device_id,
                0,
                o.meter_out_night
              );
              o.meter_out_night_consumption = calculatedConsumption;
            }

            if (o.ullage_cm) {
              const [_, calculatedConsumption] = calculationMapping(
                o.device_id,
                0,
                o.ullage_cm
              );
              o.ullage_cm_consumption = calculatedConsumption;
            }

            result[o.device_id] = { ...o };
          },

Do you have any advice for dealing with such situations?

@Chris_KIP I guess it depends on what you want your application to do when it is missing values. On the Flux side, it would be simple enough to fill all the null values in the pivoted data with a default value, but I don’t know if that really helps in your use case. Depending on your application, you may still need to check for a non-default/non-zero value.

// ...
    |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
    |> fill(column: "day_in", value: 0.0)
    |> fill(column: "night_in", value: 0.0)
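
Alternatively, if your application only needs complete rows, you could drop the incomplete ones in Flux itself using the exists operator, rather than branching at the application level (column names as in the example above):

// ...
    |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
    |> filter(fn: (r) => exists r.day_in and exists r.night_in)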

Hello,

Could you explain when you would want to store the sensor IDs as a tag versus as a field?

If I understand correctly, using them as a tag increases your series cardinality (and so lowers performance). However, using them as a field lowers performance if you run a lot of queries that filter on it.

For information, I am also using InfluxDB v2.x and I am new to this solution.

Bump!
Can anyone answer this question?

Thank you in advance!

PS / EDIT: In my case, we have more than 5000 sensors and at least 3 tags per sensor.

@Fyore It depends on a few things:

  • How many tags do you have?
  • How many unique combinations of those tags do you have?

Let’s say you have the following tags, each with a specific set of potential unique values:

  • sensor_id (5000)
  • facility (10)
  • building (4)
  • device (200)

Each unique combination of these tags is a “series.” To optimize queries, each series key is indexed in memory. The more series you have, the more memory it takes to maintain the index, and the less performant the database is overall. So with this schema, you could have up to 5000 × 10 × 4 × 200 = 40,000,000 series (known as series cardinality). That’s a lot to store in memory.

In truth, you wouldn’t have a series for each and every possible combination of these tags and the actual series count would be much lower, but it all depends on your schema.
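
If you’d rather measure your actual series cardinality than estimate it, Flux provides a cardinality function (the bucket name here is a placeholder):

import "influxdata/influxdb"

// Count the unique series in the bucket over the last 30 days
influxdb.cardinality(bucket: "example-bucket", start: -30d)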

You can store that ID as a field, but then you take the hit when trying to query by a sensor ID. Fields are not indexed, so the queries would take longer to locate the data you’re trying to query and then operate on it.

5000 doesn’t seem like an extremely high number of unique values for a tag. I would guess you’d probably be fine storing sensor ID as a tag. Where you start to see issues, especially in OSS v2, is when you have series cardinality in the 50,000+ range. But again, it all depends on your schema.

Thank you for your answer!

Yes, because “dependent tags do not increase series cardinality.”
However, if you use sensor_id as a tag, aren’t most of the other tags almost always dependent on sensor_id?

I don’t really see any tags that wouldn’t be dependent on sensor_id, except maybe for what you put in the field values (and those aren’t tags anyway). That might not be true in very specific cases, but it should be quite rare, right?
Honestly, I’m having a hard time understanding how such a case could even exist.

Also, we were initially considering storing our data in a regular MySQL database.
Since we have one value per second per sensor, we thought about creating one database per sensor per month to limit the size of each database.

Should we apply the same kind of logic with InfluxDB? For example, by splitting data using measurements or buckets with retention policies?

I imagine it depends on several factors, such as series cardinality, but do you have any idea of how much data (or how many rows) InfluxDB can typically handle?

Thank you in advance, and I wish you a wonderful day!

Correct. If every sensor_id maps to exactly one facility, building, and device, the series count collapses to roughly the number of unique sensor IDs. And with only 5000 unique sensor IDs, I don’t think you’ll have cardinality issues.

There’s no need to segment data like this. The overall size of a database or bucket doesn’t affect query performance (unless you query everything at the same time). Doing this may also make writing queries more challenging.

Measurements should be logical groupings of related fields. For example, a weather measurement could have temp, hum, wind_speed, etc. fields.
Tags in a measurement should uniquely identify the source of the data.
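
For example, with that hypothetical weather measurement, pulling two related fields is a single query (bucket name is a placeholder):

from(bucket: "example-bucket")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "weather")
    |> filter(fn: (r) => r._field == "temp" or r._field == "hum")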

Retention policies automate the lifecycle of data by deleting it after it reaches a certain age.

In your case, you can just store everything in the same bucket. You may want to split fields into different measurements, but that’s really a data organization decision. If all the fields are related, keep them in the same measurement. There’s no need to split data across measurements based on tag values.

Okay, thank you for your answer!