Best practices for choosing measurements, tags and fields

Hi everyone,

We are currently facing the challenge of storing various sensor data in InfluxDB and don't know exactly how to name the measurements, tags and fields optimally. Specifically, we need to store data from sensors that have a unique ID and a fixed installation location. A wide variety of quantities are measured, such as the fill level of a garbage can or the meter reading of an electricity meter. However, a sensor not only transmits its sensor value, but usually also its temperature and battery level.

In our opinion, data that does not change regularly, such as the installation location, should be stored separately as metadata.

In this example, are the measured variables such as consumption, fill level, temperature and battery level suitable measurements, or does it make more sense to use the sensor IDs as measurements? What are suitable tags and suitable field keys?

@Chris_KIP A few questions:

  • What version of InfluxDB are you using? This may change some of the schema recommendations.
  • How many different unique sensor IDs do you expect to have? Thousands? Millions?

Generally speaking, a measurement should contain a logical grouping of related metrics. All points in a measurement should be homogeneous, meaning every row (timestamp) should have values for all columns. I might suggest using the type/class of sensor or the interval at which data is reported to determine a measurement. For example:

  • garbage (measurement)

    • sensor_id (tag)
    • sensor_temp (field)
    • sensor_battery_level (field)
    • fill_level (field)
  • electricity (measurement)

    • sensor_id (tag)
    • sensor_temp (field)
    • sensor_battery_level (field)
    • meter_reading (field)
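Written as line protocol, hypothetical points in these two measurements might look like this (the sensor IDs and values are made up):

garbage,sensor_id=g-17 sensor_temp=21.5,sensor_battery_level=87,fill_level=0.62
electricity,sensor_id=e-04 sensor_temp=19.2,sensor_battery_level=91,meter_reading=48213.7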

But answers to the question above will help to choose the best path forward.

Hi @scott, thanks for your answer!

  • Right now, we’re using self-hosted InfluxDB, version 2.x
  • For this project, about a hundred unique sensors are to be expected

With the storage engine in 2.x (TSM), tag cardinality (number of unique tag values) can be a performance limiting factor, but only in the 100,000s to 1,000,000s, so you shouldn’t have any problems there.
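As a side note, if you ever want to verify your actual series cardinality, the influxdata/influxdb Flux package has a cardinality() function (the bucket name here is just a placeholder):

import "influxdata/influxdb"

influxdb.cardinality(bucket: "example-bucket", start: -1y)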

I’d store the sensor IDs as a tag. After that, it’s really just deciding how you want to (or don’t want to) split data up into measurements. While homogeneous measurements are still recommended with the TSM storage engine, they’re not as important as they are with the 3.x storage engine, so you could just store everything in a single measurement. It’s really just a matter of preference.

One thing to possibly consider is your query patterns. What types of queries will you be running on the data? For example:

  • Return specific fields from a specific sensor
  • Return the average of a field from a set of specific sensors (sketched below)
  • Query temperature and battery level as well as the sensor value
  • etc.
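For example, the second pattern (an average of one field across several sensors) might look something like this in Flux (the bucket, sensor IDs and field name are placeholders):

from(bucket: "example-bucket")
    |> range(start: -7d)
    |> filter(fn: (r) => r._measurement == "garbage")
    |> filter(fn: (r) => r.sensor_id == "g-17" or r.sensor_id == "g-18")
    |> filter(fn: (r) => r._field == "fill_level")
    |> group()
    |> mean()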

Hi @scott, thanks again for your reply!

Right now, we’re actually considering just storing everything in a single measurement, as using several measurements doesn’t seem to bring us any benefit.

The most common use cases at the moment are:

  • showing the latest measured values of each sensor (think temperature of weather stations, meter readings, …)
  • showing the “production” values of today (diff between the meter reading now and the meter reading at midnight, so e.g. how much energy was consumed by some building “today”; a Flux sketch follows this list)
  • showing aggregates over a variable time period (avg. consumption between dates on a day/week/month basis for reports)
  • measuring the utilization of parking lots (calculating the time difference between “occupied” and “free” datapoints provided by parking cameras)
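For the “production” case, a minimal Flux sketch could look like this (the bucket, measurement and field names are placeholders; spread() returns max minus min, which for a monotonically increasing meter reading equals the consumption since midnight):

from(bucket: "example-bucket")
    |> range(start: today())
    |> filter(fn: (r) => r._measurement == "electricity")
    |> filter(fn: (r) => r._field == "meter_reading")
    |> spread()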

Sounds good. Based on your use cases, the single measurement doesn’t sound like it will be an issue.


Thanks @scott!

Another question: we have sensors that measure various kinds of (related) things, like power_input_day, power_input_night, power_output_day, power_output_night.

Right now, we’re creating one point, like
measurement: xy
tags: sensor_id: 123
fields: power_input_day: 123, power_input_night: 456, power_output_day: 234, power_output_night: 345

But, not all sensors measure output, some only measure input. Then we would have

measurement: xy
tags: sensor_id: 234
fields: power_input_day: 123, power_input_night: 456

This isn’t really consistent overall. Would it be better to instead create four points like:

measurement: xy
tags: sensor_id: 123, type: power_input_day
fields: power: 123

measurement: xy
tags: sensor_id: 123, type: power_input_night
fields: power: 234

…, so we could have identically named fields (power instead of power_input_day, power_input_night, …)?

Are there arguments for one or the other, or does it not matter in the end?

I think for slightly simpler queries and better readability, I’d use your first proposed strategy:

So for example, let’s say you want to query both the power_input_day and power_input_night fields for a specific sensor_id:

from(bucket: "example-bucket")
    |> range(start: -30d)
    |> filter(fn: (r) => r._measurement == "xy")
    |> filter(fn: (r) => r.sensor_id == "123")
    |> filter(fn: (r) => r._field == "power_input_day" or r._field == "power_input_night")

Your results would look something like:

_time                  _measurement  sensor_id  _field             _value
2023-12-21T00:00:00Z   xy            123        power_input_day    123

_time                  _measurement  sensor_id  _field             _value
2023-12-21T00:00:00Z   xy            123        power_input_night  456

With the second schema strategy, to query the same information, you’d do this:

from(bucket: "example-bucket")
    |> range(start: -30d)
    |> filter(fn: (r) => r._measurement == "xy")
    |> filter(fn: (r) => r.sensor_id == "123")
    |> filter(fn: (r) => r.type == "power_input_day" or r.type == "power_input_night")
    |> filter(fn: (r) => r._field == "power")

And the results would look something like:

_time                  _measurement  sensor_id  type               _field  _value
2023-12-21T00:00:00Z   xy            123        power_input_day    power   123

_time                  _measurement  sensor_id  type               _field  _value
2023-12-21T00:00:00Z   xy            123        power_input_night  power   456

Not a crazy difference, but something to consider.
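One thing that softens the difference: with the second schema you could pivot on the type tag, which gives you one column per type and output close to what the first schema produces. A sketch:

// ...
    |> pivot(rowKey: ["_time"], columnKey: ["type"], valueColumn: "_value")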


Hi @scott, thanks again for your valuable input! 🙂

True, not much difference. The reason I brought up this example is that we sometimes retrieve data based on a location (like the building where the sensors are installed), and different types of sensors may be installed there, e.g. a sensor with values for input and output alongside a sensor that only measures input.
If we fetch data based on this location (or just request several different IDs in one query), the result contains series with 4 entries (day_in, day_out, night_in, night_out) next to series with only two (day_in, night_in). This becomes interesting when we use

    |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")

to put all the entries into a single row: some rows then contain more “information” than others.
This can be quite annoying when you need to check which fields are present in the result object at application level. Below is a sample implementation of such behaviour:

next: (row: string[], tableMeta: FluxTableMetaData) => {
  const o = tableMeta.toObject(row);

  // Fields that may or may not be present, depending on the sensor type.
  const consumptionFields = [
    "meter",
    "meter_in_day",
    "meter_out_day",
    "meter_in_night",
    "meter_out_night",
    "ullage_cm",
  ];

  // Derive a consumption value only for the fields this row actually has.
  // Compare against null/undefined rather than truthiness so a legitimate
  // 0 reading is not skipped.
  for (const field of consumptionFields) {
    if (o[field] != null) {
      const [, calculatedConsumption] = calculationMapping(
        o.device_id,
        0,
        o[field]
      );
      o[`${field}_consumption`] = calculatedConsumption;
    }
  }

  result[o.device_id] = { ...o };
},

Do you have any advice for dealing with such situations?

@Chris_KIP I guess it depends on what you want your application to do when it is missing values. On the Flux side, it would be simple enough to fill all the null values in the pivoted data with a default value, but I don’t know if that really helps in your use case. Depending on your application, you may still need to check for a non-default/non-zero value.

// ...
    |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
    |> fill(column: "day_in", value: 0.0)
    |> fill(column: "night_in", value: 0.0)