Schema Considerations - 1 tag key instead of multiple field keys

@jackzampolin
My question is pretty similar to this: Schema for Plants, Devices and Signals (so I tried to do my homework and check whether anyone has already commented on that topic)

We have to set up a time series database and migrate data from pre-existing (vertical E-A-V) structure. It is all about multiple machines with different parameters being read from these machines. Imagine we have somewhat 1000-2000 tags/attributes being read in total, but one machine uses 20, another uses 100, another uses 25 - but no tag is common (there is no intersection in these tags)

Then we have to decide between:

  • A structure with thousands of fields within a single measurement:
    TagKey = Machine, TagValue=MachineID, FieldKey1=Tag1, FieldKey2=Tag2, …FieldKey1000=Tag1000, etc…
  • A structure with a single FieldValue and extra TagKey:
    TagKey1=Machine, TagValue1=MachineID, TagKey2=TagID, TagValue2=TagName (Temperature, Pressure, PartCounter, … - 2000 possible tag values), FieldKey1=TagValueInt, FieldKey2=TagValueFloat, FieldKey3=TagValueString with the values being the actual measured values

So speaking in RDBMS terms:
option1 has 2000 columns and fewer rows (1 column is indexed)
option 2 has 5 columns and MANY millions of rows (2 columns indexed)

@Ivo_Andreev Can you share some more information about what kind of data is being collected? Is there any way to migrate to a newer collector that writes more standardized schema?

Also is there any way to normalize the data coming from the machines? Are there some common groupings between the different attributes being written (i.e. can you combine all of the pressure values)?

The second option you mention is a possible way to handle this case. Just so I understand you, the points in that case would look as follows?

legacy_data,machine=81423819190,tagID=temperature tagValueInt=1i,tagValueFloat=1,tagValueString="1" TS
legacy_data,machine=81423873104,tagID=pressure tagValueInt=1i,tagValueFloat=1,tagValueString="1" TS
legacy_data,machine=81423819190,tagID=partCounter tagValueInt=1i,tagValueFloat=1,tagValueString="1" TS
legacy_data,machine=81423873104,tagID=foo tagValueInt=1i,tagValueFloat=1,tagValueString="1" TS

@jackzampolin

  • What is a must is to migrate the data we have currently (if you remember my other topic - migration from PI DB and even SQL server EAV to InfluxDB). PI stores data quite differently - everything is a tag and we have plenty of these.
  • You could think that if you have a reading from machine - this is a single tag. Imagine tags like:

machine1_temperature, machine1_pressure, machine2_RPM, machine2_temperature. These readings could be of different data types.

  • What do you mean by more standardized schema? We have no option to change the collector without starting to create own plugins for connection with various machines, which is more that we could handle on a single step. (At the end - we need to migrate and have operational system soon) .

  • As for your question on second option - a reading is having data type. Only one of the fields will have data populated (i.e. if the type is float, the value will be written in a FIELD (not an Influx Tag) and the FieldKey will be ValueFloat). The reason why I used FieldKey1=TagValueInt was because it is called tags in the DB from which we migrate.

legacy_data,machine=81423873104,tagID=foo fieldValueInt=NULL,fieldValueFloat=1.0,fieldValueString=NULL
(at the end, we could not have inconsistency in data types for a give field.

  • A far as we do not expect mixing data among customers, we intend to use different measures for each customer. This ends up in the same series cardinality as using Tag Values to distinguish customers. What do you think about this?
1 Like

Hey @jackzampolin, guess you have been busy these days. As far as no one else has commented that topic (and the other one I referred to). Is it a stupid question, or a too complex one? Do you think you could share a design guideline in this respect?

@Ivo_Andreev Sorry for not getting back sooner:

I think the schema you mention in your post and writes I show above should work great for you! I understand that making changes to clients is not an option. Only populating one field on each write will work perfectly. That will avoid type collisions.

That is your decision on measurement name vs tag value. I personally would store the customer ID as a tag to make for fewer measurements. I find this easier to reason about.

Thanks @jackzampolin
Once we manage to make the migration and have the solution operational I will write here to share some findings.

The drawback of the proposed solution is that we sometimes need to have correlated field values (i.e. temperature measured together with pressure, together with some user specific measurement. Storing in a point with multiple values would have made live somewhat easier, but still - it does not seem to be an alternative of the triple field schema.

1 Like