Best practices for choosing measurements, tags, and fields

Hi everyone,

We are currently facing the challenge of storing various sensor data in InfluxDB and are not sure how best to name the measurements, tags, and fields. Specifically, we need to store data from sensors that each have a unique ID and a fixed installation location. A wide variety of quantities are measured, such as the fill level of a garbage can or the reading of an electricity meter. However, a sensor not only transmits its measured value, but usually also its temperature and battery level.

In our opinion, data that does not change regularly, such as the installation location, should be stored separately as metadata.

In this example, are the measured variables such as consumption, fill level, temperature, and battery level suitable measurements, or does it make more sense to use the sensor IDs as measurements? What are suitable tags and field keys?

@Chris_KIP A few questions:

  • What version of InfluxDB are you using? This may change some of the schema recommendations.
  • How many different unique sensor IDs do you expect to have? Thousands? Millions?

Generally speaking, a measurement should contain a logical grouping of related metrics. All points in a measurement should be homogeneous, meaning all rows (timestamps) should have values for all columns. I might suggest using the type/class of sensor or the interval at which data is reported to determine a measurement. For example:

  • garbage (measurement)

    • sensor_id (tag)
    • sensor_temp (field)
    • sensor_battery_level (field)
    • fill_level (field)
  • electricity (measurement)

    • sensor_id (tag)
    • sensor_temp (field)
    • sensor_battery_level (field)
    • meter_reading (field)
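
With a schema like this, queries stay simple because each measurement is self-describing. A minimal Flux sketch, assuming a bucket named "example-bucket" and a hypothetical sensor ID:

from(bucket: "example-bucket")  // placeholder bucket name
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "garbage")
    |> filter(fn: (r) => r.sensor_id == "abc123")  // hypothetical sensor ID
    |> filter(fn: (r) => r._field == "fill_level")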

But answers to the question above will help to choose the best path forward.

Hi @scott, thanks for your answer!

  • Right now, we’re using self-hosted InfluxDB, version 2.x
  • For this project, about a hundred unique sensors are to be expected

With the storage engine in 2.x (TSM), tag cardinality (the number of unique tag values) can be a performance-limiting factor, but only in the hundreds of thousands to millions, so you shouldn’t have any problems there.

I’d store the sensor IDs as a tag. After that, it’s really just deciding how you want to (or don’t want to) split data up into measurements. While homogeneous measurements are still recommended with the TSM storage engine, they’re not as important as they are with the 3.x storage engine, so you could just store everything in a single measurement. It’s really just a matter of preference.

One thing to possibly consider is your query patterns. What types of queries will you be running on the data? For example:

  • Return specific fields from a specific sensor
  • Return the average of a field from a set of specific sensors
  • Query the sensor temperature and battery level as well as the sensor value
  • etc.

Hi @scott, thanks again for your reply!

Right now, we’re actually considering just storing everything into one single measurement, as using several measurements doesn’t seem to bring us any benefit.

The most common use cases at the moment are:

  • showing the latest measured values of each sensor (think temperature of weather stations, meter readings, …)
  • showing the “production” values of today (diff between meter reading now and meter reading at midnight, so e.g. how much energy was consumed by some building “today”)
  • showing aggregates over a variable time period (avg. consumption between dates on a day/week/month basis for reports)
  • measuring the utilization of parking lots (calculating the time difference between “occupied” and “free” datapoints provided by parking cameras)

Sounds good. Based on your use cases, the single measurement doesn’t sound like it will be an issue.
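
For example, your first two use cases map to fairly compact Flux. A sketch, assuming the single measurement is named "xy", the meter field is "meter_reading", and a bucket named "example-bucket":

import "date"

// Latest value of every field from every sensor (one result table per series)
from(bucket: "example-bucket")
    |> range(start: -1d)
    |> filter(fn: (r) => r._measurement == "xy")
    |> last()
    |> yield(name: "latest")

// Today's "production": spread() (max - min) of a monotonically increasing
// meter reading since midnight (UTC unless you set a timezone option)
from(bucket: "example-bucket")
    |> range(start: date.truncate(t: now(), unit: 1d))
    |> filter(fn: (r) => r._measurement == "xy")
    |> filter(fn: (r) => r._field == "meter_reading")
    |> spread()
    |> yield(name: "today")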

Thanks @scott!

Another question: we have sensors that measure various kinds of (related) things, like power_input_day, power_input_night, power_output_day, power_output_night.

Right now, we’re creating one point, like
measurement: xy
tags: sensor_id: 123
fields: power_input_day: 123, power_input_night: 456, power_output_day: 234, power_output_night: 345

But, not all sensors measure output, some only measure input. Then we would have

measurement: xy
tags: sensor_id: 234
fields: power_input_day: 123, power_input_night: 456

This isn’t really consistent overall. Would it be better to instead create the 4 points like:

measurement: xy
tags: sensor_id: 123, type: power_input_day
fields: power: 123

measurement: xy
tags: sensor_id: 123, type: power_input_night
fields: power: 234

…, so we could have identically named fields (power instead of power_in_day, power_in_night, …)?

Are there arguments for one or the other, or does it not matter in the end?

I think for slightly simpler queries and better readability, I’d use your first proposed strategy:

So for example, let’s say you want to query both the power_input_day and power_input_night fields for a specific sensor_id:

from(bucket: "example-bucket")
    |> range(start: -30d)
    |> filter(fn: (r) => r._measurement == "xy")
    |> filter(fn: (r) => r.sensor_id == "123")
    |> filter(fn: (r) => r._field == "power_input_day" or r._field == "power_input_night")

Your results would look something like:

_time                 _measurement  sensor_id  _field           _value
2023-12-21T00:00:00Z  xy            123        power_input_day  123

_time                 _measurement  sensor_id  _field             _value
2023-12-21T00:00:00Z  xy            123        power_input_night  456

With the second schema strategy, to query the same information, you’d do this:

from(bucket: "example-bucket")
    |> range(start: -30d)
    |> filter(fn: (r) => r._measurement == "xy")
    |> filter(fn: (r) => r.sensor_id == "123")
    |> filter(fn: (r) => r.type == "power_input_day" or r.type == "power_input_night")
    |> filter(fn: (r) => r._field == "power")

And the results would look something like:

_time                 _measurement  sensor_id  type             _field  _value
2023-12-21T00:00:00Z  xy            123        power_input_day  power   123

_time                 _measurement  sensor_id  type               _field  _value
2023-12-21T00:00:00Z  xy            123        power_input_night  power   456

Not a crazy difference, but something to consider.

Hi @scott, thanks again for your valuable input! 🙂

True, not much difference. The reason I brought up this example is that sometimes we retrieve data based on a location (like the building where the sensors are installed). Sometimes there are different types of sensors installed there, e.g. a sensor with values for input and output, and a sensor that only measures input.
If we fetch sensors or data based on this location (or just request several different IDs in one query), the result would contain data that has 4 entries (day_in, day_out, night_in, night_out) and data that has only two entries (day_in, night_in). This becomes interesting when we use
|> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
to put all the entries into a single row. Now we have rows that contain more “information” than the others.
This can be quite annoying when you need to check which fields are present in your result object at application level. Below is a sample implementation of such behaviour:

// "next" callback passed to queryRows() from @influxdata/influxdb-client
next: (row: string[], tableMeta: FluxTableMetaData) => {
            const o = tableMeta.toObject(row);

            if (o.meter) {
              const [_, calculatedConsumption] = calculationMapping(
                o.device_id,
                0,
                o.meter
              );
              o.meter_consumption = calculatedConsumption;
            }
            if (o.meter_in_day) {
              const [_, calculatedConsumption] = calculationMapping(
                o.device_id,
                0,
                o.meter_in_day
              );
              o.meter_in_day_consumption = calculatedConsumption;
            }
            if (o.meter_out_day) {
              const [_, calculatedConsumption] = calculationMapping(
                o.device_id,
                0,
                o.meter_out_day
              );
              o.meter_out_day_consumption = calculatedConsumption;
            }
            if (o.meter_in_night) {
              const [_, calculatedConsumption] = calculationMapping(
                o.device_id,
                0,
                o.meter_in_night
              );
              o.meter_in_night_consumption = calculatedConsumption;
            }
            if (o.meter_out_night) {
              const [_, calculatedConsumption] = calculationMapping(
                o.device_id,
                0,
                o.meter_out_night
              );
              o.meter_out_night_consumption = calculatedConsumption;
            }

            if (o.ullage_cm) {
              const [_, calculatedConsumption] = calculationMapping(
                o.device_id,
                0,
                o.ullage_cm
              );
              o.ullage_cm_consumption = calculatedConsumption;
            }

            result[o.device_id] = { ...o };
          },

Do you have any advice for dealing with such situations?

@Chris_KIP I guess it depends on what you want your application to do when it is missing values. On the Flux side, it would be simple enough to fill all the null values in the pivoted data with a default value, but I don’t know if that really helps in your use case. Depending on your application, you may still need to check for a non-default/non-zero value.

// ...
    |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
    |> fill(column: "day_in", value: 0.0)
    |> fill(column: "night_in", value: 0.0)
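
Alternatively, if your application only needs complete rows, you could drop the incomplete ones in Flux itself using the exists operator, rather than branching at the application level (column names as in the example above):

// ...
    |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
    |> filter(fn: (r) => exists r.day_in and exists r.night_in)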

Hello,

Could you explain when you would want to store the sensor IDs as a tag versus as a field?

If I understand correctly, using them as a tag increases your series cardinality (and so lowers performance). However, using them as a field lowers performance if you run a lot of queries that filter on it.

For information, I am also using InfluxDB v2.x and I am new to this solution.

Bump!
Can anyone answer this question?

Thank you in advance!

PS / EDIT: In my case, we have more than 5000 sensors and at least 3 tags per sensor.

@Fyore It depends on a few things:

  • How many tags do you have?
  • How many unique combinations of those tags do you have?

Let’s say you have the following tags, each with a specific set of potential unique values:

  • sensor_id (5000)
  • facility (10)
  • building (4)
  • device (200)

Each unique combination of these tags is a “series.” To optimize queries, each series key is indexed in memory. The more series you have, the more memory it takes to maintain the index, and the less performant the database is overall. So with this schema, you could have up to 5000 × 10 × 4 × 200 = 40,000,000 series (known as series cardinality). That’s a lot to store in memory.

In truth, you wouldn’t have a series for each and every possible combination of these tags and the actual series count would be much lower, but it all depends on your schema.
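
If you’d rather measure your actual series cardinality than estimate it, Flux provides a cardinality function (the bucket name here is a placeholder):

import "influxdata/influxdb"

// Count the unique series in the bucket over the last 30 days
influxdb.cardinality(bucket: "example-bucket", start: -30d)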

You can store that ID as a field, but then you take the hit when trying to query by a sensor ID. Fields are not indexed, so the queries would take longer to locate the data you’re trying to query and then operate on it.

5000 doesn’t seem like an extremely high number of unique values for a tag. I would guess you’d probably be fine storing sensor ID as a tag. Where you start to see issues, especially in OSS v2, is when you have series cardinality in the 50,000+ range. But again, it all depends on your schema.

Thank you for your answer!

Yes, because “dependent tags do not increase series cardinality.”
However, if you use sensor_id as a tag, aren’t most of the other tags almost always dependent on sensor_id?

I don’t really see any tags that wouldn’t be dependent on sensor_id, except maybe for what you put in the field values (and those aren’t tags anyway). That might not be true in very specific cases, but it should be quite rare, right?
Honestly, I’m having a hard time understanding how such a case could even exist.

Also, we were initially considering storing our data in a regular MySQL database.
Since we have one value per second per sensor, we thought about creating one database per sensor per month to limit the size of each database.

Should we apply the same kind of logic with InfluxDB? For example, by splitting data using measurements or buckets with retention policies?

I imagine it depends on several factors, such as series cardinality, but do you have any idea of how much data (or how many rows) InfluxDB can typically handle?

Thank you in advance, and I wish you a wonderful day!

Correct. If every sensor_id maps to exactly one facility, building, and device, the series count collapses to roughly the number of unique sensor IDs. And with only 5000 unique sensor IDs, I don’t think you’ll have cardinality issues.

There’s no need to segment data like this. The overall size of a database or bucket doesn’t affect query performance (unless you query everything at the same time). Doing this may also make writing queries more challenging.

Measurements should be logical groupings of related fields. For example, a weather measurement could have temp, hum, wind_speed, etc. fields.
Tags in a measurement should uniquely identify the source of the data.
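
For example, with that hypothetical weather measurement, pulling two related fields is a single query (bucket name is a placeholder):

from(bucket: "example-bucket")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "weather")
    |> filter(fn: (r) => r._field == "temp" or r._field == "hum")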

Retention policies automate the lifecycle of data by deleting it after it reaches a certain age.

In your case, you can just store everything in the same bucket. You may want to split fields into different measurements, but that’s really a data organization decision. If all the fields are related, keep them in the same measurement. There’s no need to split data across measurements based on tag values.

Okay, thank you for your answer!