Ask for help about schema design

Hi, I have some problems with my schema design. I want to share my practice and hope get some advices.

Data Profile:

  • 2k+ devices
  • Each device has 20~70 sensors. A sensor under a device collects a time series.
  • The time series from the same device are time synchronized.
  • The time series frequency is not fixed. The average period is around 10 seconds.
  • Each time series has four metadata: Describe, No, Type, Unit. The four metadatas only depends on SensorId. Once SensorId is determined, the four metadatas can be determined.

Query workload:

  • Given a device, a subset of the sensors, and a time range (usually not more than two hours), return the time series of these sensors of this device during this time period. Only the timestamps and values of the time series are needed. Metadatas are not queried.

Current schema design:

  • InfluxDB version: 1.4.2

  • Use a single measurement. The tag keys and field keys in this measurement are as follows:

time DeviceId SensorId No Type Describe Unit Value
tag key tag key tag key tag key field key field key field key

My concerns:

  • In my case, is it a good design to store all series in a single measurement?
  • I’m not sure how to arrange No, Type, Describe, Unit.
    Is it a bad idea to make Describe, Unit field keys?
    Is it a good idea to make No, Type tag keys?
    I heard about that there are independent tags and dependent tags. Is it a good idea to make No, Type, Describe, Unit tag keys dependent on SensorId? But I can’t find tutorials on how to set dependent tags. Or, dependent tags are just a description of the situation, not a switch?
    Or should I just remove No, Type, Describe, Unit out of InfluxDB, store them in a relational table elsewhere, and access them using SensorId as a primary key when in need?
  • By the way, regarding upgrading, is there a strong recommendation to upgrade from 1.4 to a later version?

I don’t really have great suggestions, but I’ll share what I know.

  1. Let’s start from the version
    I don’t see why not upgrading, you just get a better product, most importantly a new index type TSI to manage high cardinality it should also have an overall better performance.
    I invite you to estimate your cardinality, or just get it from the DB to see what’s your current cardinality.

  2. Design - Measurements
    I treat measurements just as logical containers, if all the values in them represent the same “thing” then it’s fine to me.

  3. Design - Tags vs Fields
    In general, my rule is, if you don’t need something as a tag, then don’t store it as a tag.
    here the choice is yours, you are the one who knows the requirements and data meaning and structure
    Just remember that you can’t use GROUP BY on fields, but you CAN use them in filters, it’s not as performing as filtering tags but for small ranges and amounts of data it works well.
    Given your workload, I don’t see the need on having No and Type as tags, as they are not even used. (but I might be wrong)

  4. Dependant Tags
    This is not something you set, it’s something that just happens when data are depending on each other so

  • a subset of some other data Fullname → Name and Surname
  • a constraint of 1 to 1 between attributes, like Username and Email, as usually, you can have only one username linked to one email.
    maybe the sample in the docs will be clearer, this story is related to series cardinality
3 Likes

Thank you for your help!

I runned show series exact cardinality and the cardinality is 70429. I see that Time Series Index (TSI) was first introduced in v1.5. Upgrading from v1.4 looks like a promising idea.

About tags vs fields, I agree that there is little need on having No and Type as tags. But I heard that tag is stored once while field is stored for every point of the series? So I am worried about the amplified storage of the four metadatas as fields. Or since a metadata of a time series remains constant, will the compression relieve the storage size problem? But the timestamps of the series seem stored anyway. Or, is it just inevitable to trade storage space for storing metadatas as fields in InfluxDB?

InfluxDB is a columnar database, all the data will be compressed anyway (the process is called compaction), but only tags will be indexed.
The more distinct data you store, the more disk space will be required, as the compression will be less efficient.
Tags will also require RAM as their index will be kept in memory

2 Likes

A bit late to the party but this popped up in my email digest so figured I’d chime in.

@March_Leslie can you clarify how the metadata Describe, No, Type, Unit will actually be used during querying? If you simply need that metadata displayed alongside the values you’re querying then you can store those as fields. But as @Giovanni_Luisotto states, if you are going to do filtering and/or grouping by any of those, then you definitely want them as tags.

Personally, based on what you’ve said those bits of metadata definitely look like tags and not fields. Apart from the flexibility that you’ll get regarding querying on tags, also consider the following:

  • In your schema those would be 100% dependent tags (@Giovanni_Luisotto explained that nicely), so having them be tags does not impact your cardinality at all
  • You are right that the actual values of tags are stored once, while field values are stored with each record. If “Decribe” is (hypothetically) a 50-character description of the sensor, then you definitely want that description stored once as a tag, rather than an additional 50 bytes of data stored with each record.

We’ve been using Influx in production for about 4 years, and I can safely say that the biggest single pain point has been realizing that we don’t have the correct metadata tags applied, and needing to add tags down the line. The ability to join tabular data on an Influx timeseries would be great, but is unfortunately still in its infancy.

2 Likes