Schema Design & Metadata

I’m trying to store data from a Rockwell PLC that uses PlantPAx libraries in the PLC code that follow an S88 structure/nomenclature. This means that 95-100% of all points come from Control Modules. There are many control module types - PID controller, 2-State Device like a valve or a motor, Deadband controllers, Analog Indicators, Discrete Indicators, etc.

All control module types follow the same basic structure such that they will all have the same Tags in InfluxDB. The name of each module is called a Tag in S88 language so we may use Module instead to prevent confusion. They each have a Description, Module Type, and we want to organize them in production or plant groupings such as Site (e.g. Houston), Process Cell (e.g. Crude), Production Unit (e.g. Vacuum Tower), and Unit Subsystem (e.g. Bottoms). For flow control we may have a Module named FIC-101 which is the flow controller out of the tower bottoms. In this case we have the following tags:
Module=FIC-101
Site=Houston
Cell=Crude
Unit=Vacuum Tower
Subsystem=Bottoms
Description=Crude Vacuum Tower Bottoms Flow Controller

The tag values for one control module instance will never change. For a given control module instance, the Module, Type, Site, Cell, Unit, and Subsystem are fixed values.

There will be many fields including:
PV, SP, CV, P-Gain, I-Gain, D-Gain, Mode (possibly many more).

My questions are:

  1. Since all control modules will have the same set of tags, should they all be in the same measurement - Control Modules?
  2. Since the tags remain constant, could they be saved in a different measurement where the data is reported only once and then cross-referenced by Module (which is unique among all modules) to the measurements that store the field data? Or do we repeat all of the tag values with every new field value? The reason I ask is that these tags can actually change in reality, though very rarely, and I would not want to create a new series just because there was a typo in the description. Suppose my main data store saves Module=FIC-101 with field values PV=1537.6,SP=1540,CV=49.68; the field values will change very quickly (PV and CV, but not SP). If I write a value to the database with only the Module and the field values, then those other tags can change over time without creating a new series, but I still want to be able to cross-reference the other tag metadata by Module.
  3. Is it OK to save one field in one record and another field in another record with different timestamps? Or should I write all field values to the database every time I write any value? For example, maybe the PV changes but none of the other fields change. Can I just write the PV alone, then write the CV alone if it changes later?
  4. I know there are many ways to structure this differently - for example make a Module a measurement but that poses problems when querying the data. When a user wants to pull up data for some plant device, they may not know its Module name. If they know where the point in question exists in the physical plant model, they can filter the ControlModule measurement by Site, Cell, Unit, and Subsystem to give them a list of Modules and their Descriptions that make them much easier to find. Keep in mind there can easily be 10,000 modules or more in the database so these filters are important for a user in Grafana to find the module they want.

I hope I didn’t get too long winded. The basic question is about schema design for a DCS type of control system data.
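To make questions 2 and 3 concrete, here is a rough Python sketch (illustrative only, with hand-rolled and simplified line-protocol escaping) of what I imagine a write would look like: the full tag set repeated on every point, but only the field that changed included.

```python
# Illustrative sketch (simplified escaping): build line-protocol points
# where the full tag set identifies the series and only the changed
# field is written.

def escape(v):
    """Escape spaces per line-protocol rules (commas/equals omitted here)."""
    return str(v).replace(" ", "\\ ")

def line(measurement, tags, fields, ts_ns):
    tag_str = ",".join(f"{k}={escape(v)}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

tags = {"Module": "FIC-101", "Site": "Houston", "Cell": "Crude",
        "Unit": "Vacuum Tower", "Subsystem": "Bottoms"}

# Question 3: writing only the field that changed (PV alone).
print(line("ControlModules", tags, {"PV": 1537.6}, 1700000000000000000))
```

My understanding is that repeating the tags only re-identifies the series rather than duplicating storage per point, but that is part of what I am asking.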

Hi @jsever

Very thorough description and good questions. Not sure how much I can add, but I’ll give you my thoughts anyway…

First of all, are you ‘blazing the trail’ with this? Is there a Grafana-InfluxDB type solution already offered by Rockwell? Anything you have found that “almost” does what you want?

If you are indeed a pioneer with this application, congrats and don’t get discouraged. When I was starting out with this, I found this webinar from InfluxDB about Schema Design for IoT to be extremely helpful. It’s quite old now, but I think the speaker does a good job explaining how correctly defining fields and tags, even in v1 of InfluxDB, is important. I noticed you tagged your post as v2. Have you considered waiting for v3 (presuming you want a locally installed database and not cloud)? There are a lot of selling points with v3 that could make your life easier. And are you already familiar with Flux, or perhaps SQL? I only ask because there is a lot of chatter about the query language and how easy or difficult that can make one’s life.

OK, moving on to your questions…

  1. I’d say yes. Create one measurement called Control Modules.

  2. I think you would repeat all of the tag values with every new field value. On the subject of PV and SP, I prefer to model it using a tag called “MeasType” which can be Actual or Setpoint. When ingesting the data into InfluxDB, I always record the field (e.g. Temperature) and each value with the corresponding tag (e.g. value = 1536.6, MeasType = Actual and value = 1540, MeasType = Setpoint). And on top of that, I think you would also record the other tags you have, such as Site, Cell, Unit, Subsystem, etc.
  3. One of the reasons I came up with the MeasType tag is because so many of the queries I write are comparisons of the Actual vs Setpoint. Early on, I was recording them as separate fields (like you have mentioned, i.e. Temperature PV and Temperature SP), but the fact that they were not always ingested with the same timestamp made operations & queries on them more complicated and (I think) less performant, but I’m no Jedi when it comes to this, so…

I would have a field for the parameter that PV and SP are measuring, such as Temperature, Pressure, Current, Voltage, etc. and always tag each value with the MeasType tag that I explained above, and ideally record them with the same timestamp. It sounds like CV should also be recorded at the same timestamp as PV so that you could see the relationship of how CV moves the PV closer to the SP, correct?
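A sketch of the shape I have in mind (values and names are illustrative; escaping is omitted since these particular values need none): the Actual and Setpoint go in as two points sharing one timestamp.

```python
# Illustrative sketch: the Actual and Setpoint are written as two points
# that share one timestamp, so comparisons need no interpolation.

def lp(measurement, tags, value, ts_ns):
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    return f"{measurement},{tag_str} value={value} {ts_ns}"

ts = 1700000000000000000  # one timestamp, reused for the whole sample
points = [
    lp("Temperature", {"Module": "TIC-100-01", "MeasType": "Actual"}, 1536.6, ts),
    lp("Temperature", {"Module": "TIC-100-01", "MeasType": "Setpoint"}, 1540, ts),
]
print("\n".join(points))
```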

  4. Even if the user does not know the module name, if they know the Site, Cell, Unit and Subsystem, they could choose those from defined template variables in Grafana and see the list of Modules belonging to that Site, Cell, Unit and Subsystem, correct? Have you played with chained variables in Grafana to get a proof of concept working for some sample data?

Thank you for your response. It is very helpful. To answer your questions.

Not offered by Rockwell. They have their own products they offer that are expensive and difficult to use. I’ve been using their products for 30 years and the future looks bleak for those products.

I’d love to use V3, but I’m not aware of an on-premises option for V3. I am planning to use V2 Open Source. Some of my customer applications are very small. Is there a V3 Open Source option?

I like this idea. I was beginning to think this way myself. I think it is a great idea. In this design, the field of interest is always value. How would you handle attributes like engineering units? For example, the PV and SP may have units of gallons per minute, but the CV will be in percent open of a control valve. Would that be just another MeasType? Again, these are values that seldom or never change; they are metadata around the module. It’s not likely I would ever want to filter on engineering units, but I would want those values to use in a Grafana chart.

I have been thinking of a way to store the meta-data in a SQL Server database. Is it possible to write a single query that joins the Influx data with SQL data to make all the meta-data available for the client to decorate time series data with meta-data?
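Something like this client-side join is what I have in mind (a sketch only: sqlite3 stands in for SQL Server, the InfluxDB result is mocked as plain dicts, and the table and column names are made up):

```python
# Illustrative sketch: decorate time series rows with module metadata
# held in a relational store (sqlite3 standing in for SQL Server).
import sqlite3

meta_db = sqlite3.connect(":memory:")
meta_db.execute(
    "CREATE TABLE modules (module TEXT PRIMARY KEY, description TEXT, unit TEXT)")
meta_db.execute(
    "INSERT INTO modules VALUES "
    "('FIC-101', 'Crude Vacuum Tower Bottoms Flow Controller', 'Vacuum Tower')")

# Pretend these rows came back from an InfluxDB query.
influx_rows = [{"Module": "FIC-101", "PV": 1537.6, "SP": 1540}]

# Join on Module and attach the description to every row.
lookup = {m: d for m, d in
          meta_db.execute("SELECT module, description FROM modules")}
decorated = [dict(row, Description=lookup[row["Module"]]) for row in influx_rows]
print(decorated[0]["Description"])
```

I have also seen that Flux has an sql package (sql.from) that can pull from relational sources and join inside the query, which, if I understand correctly, might avoid the client-side step.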

Good to know that I should record all the other tag values with every new value. I think I understand: it doesn’t take up extra disk space; it just helps to uniquely identify the series to which the value belongs.

It sounds like you’ve tried both approaches and had much better results with the MeasType tag. If you have tried both methods, that is very good to know.

Your last question about Chained Variables - I have not heard of them. Are they a V3 feature? I’ll look into them.

Thanks again for your response.

Hi @jsever and thanks for the answers & feedback. Just a few more thoughts…

  • Can’t say I am surprised about Rockwell’s approach, i.e. expensive and difficult. I just read an article about the HVAC industry and how everything they use contains proprietary programs with arcane logic and programming, meaning any recent graduates in computer science cannot use more modern tools.
  • V3 is not currently available as an open source or on-prem option, but if I understand correctly from the posts from @pauldix, it will be soon. Meanwhile, it may be a good idea just to set up a small test system in Cloud V3 and play around with ingesting data, querying data, etc. just to see how you think it compares to Flux.
  • Engineering units, at the risk of sounding obvious, are never stored in InfluxDB. If I have a field called Flowrate, it will be unclear looking at the data whether I am measuring in gal/min, L/hour, mL/sec, etc. If the field is CV and is measuring the percent open of a control valve for that Flowrate, I think the MeasType tag is not applicable. I would just not use that tag for that field, and in Grafana specify the unit as %. All your other tags would probably still apply to the CV, so if there are 100 CV fields in your database, I think you’d give them unique names (e.g. CV1, CV2, etc.) and then the tags should make it clear which CV you are referring to. So the bottom line is that you would not include any engineering units in InfluxDB, and instead specify them all in Grafana (incl. any custom units, which are possible in Grafana).
  • Re: SQL storing meta-data and performing a join, yes, I believe that’s definitely possible in v2, and probably in v3.
  • Chained variables are not an InfluxDB feature, but a Grafana feature. I believe you can find more in the Grafana docs, forums, Slack channels, etc. I personally do not have much experience with them, but I believe they work as you describe, i.e. filter down consecutively by different tags until you can hone in on the module(s) that you want to view.

Thanks @grant1,
I looked over @pauldix page discussing 3.0 OSS and it appears it won’t be available until 2025. Still no alpha available and no dates announced.

I understand units in Grafana, that is what I am doing now with Grafana & PI. However, I’m a little confused why we would not use the MeasType tag for CV. My plan now is to have the following tag values for MeasType:

  1. PV
  2. SP
  3. CV
  4. P
  5. I
  6. D
  7. Mode
    PV and CV have the most dynamic data, changing every second. SP changes occasionally in some cases (where SP is entered by an operator) or changes frequently for secondary loops in a Cascade strategy. Mode changes occasionally, and P, I, & D change rarely, but we want to know whenever they change. I could potentially store changes to P, I, & D in SQL Server because those changes are so rare; still, it’s nice to have it all in one data store.

As for working with SQL Server, I am currently using InfluxQL in Grafana - it’s very easy to query the data in Grafana with a GUI that provides drop down lists for measurements, filters, and fields. The SQL data source for InfluxDB requires the Flux language. InfluxQL is the only query language in Grafana that is not in Alpha or Beta - SQL is in alpha and Flux is in Beta. I switched to Flux in Grafana to see if there is a GUI but alas, there is not. I’m sure I can learn Flux but in the long term, my customers will need the more user friendly interface to create charts.

Regarding Grafana variables, they work on an entire dashboard. They are great for creating dashboard templates. I use this for a customer that has 50 product tanks with all the same measurements. I can create a single dashboard with several panels. There is a tank variable that the user can select from a dropdown list at the top of the dashboard. All of the panel queries use the variable, so when the user selects a different tank, all of the panels update with the tank-specific data. It allows me to build one dashboard that works for 50 different tanks instead of building 50 different dashboards and managing them all. The filtering I want to do is in the query for a single panel; however, I think the basic functionality will work fine as it is.

Let’s make sure my terminology is in sync with yours. Please confirm my understanding is correct:

  1. PV = present value, i.e. the measured value of the process you’re trying to control
  2. SP = setpoint of the process
  3. CV = control variable, i.e. the output signal from the controller that manipulates the process to bring the PV closer to the SP.
  4. P = proportional component of PID control loop feedback that calculates an output that is proportional to the current difference between the desired setpoint (target value) and the actual process variable.
  5. I = integral component of PID control loop that integrates the error over time to account for any accumulated offset
  6. D = derivative component of PID control loop that looks at the rate of change of the error.
  7. Mode = refers to the specific control actions that are active at any given time, e.g. Automatic (uses the full PID algorithm), Manual (operator directly sets the control variable), or any other modes you may use.

Assuming the above are indeed correct, and you want to have each of the above be possible tag values for the tag MeasType, then how would you describe a field (e.g. temperature)? Let’s say the temperature is 1576 and let’s say that value is the present value, so the tag MeasType would be PV. If you have other ways to describe (via tag values) the value 1576, what do you envision?

It sounds like you might want Fields for CV, P, I, D and Mode, and not have these be tag values for the tag MeasType.

The only reason I use the MeasType tag is because I want to ingest the data at the same timestamp and make the queries easy and efficient. This data screenshot from Influx and graph screenshot from Grafana are not from the same time window, but hopefully you get the point. The timestamps (even though they are not shown in my screenshot) are exactly the same whenever we ingest the actual and setpoint temperature.

You are 100% correct on the terminology.
The temperature is baked into the module name. For example, the PID controller TIC-100-01 is the temperature controller for process unit 100. So TIC-100-01.PV is the current temperature. TIC-100-01.SP is the controller temperature setpoint. TIC-100-01.CV is the output to the control valve (0-100% open). The other parameters (P, I, D, & Mode) are also attributes of TIC-100-01 as you describe above. The user knows that PV is the temperature measurement.

Why do we need the same timestamp for the various MeasTypes? In our current time series database, we store data that changes - when it changes but we almost never store a value if the data does not change. Therefore, if we want to know all the values at any point in time, we interpolate for a specific timestamp. We don’t know when any of the data is going to change so the ingestion simply logs each value for each MeasType independently of the others. Here is a trend we have in Grafana. The back end data is not InfluxDB.

You’ll notice gaps in data for SP and CV. The time series database compresses by rejecting data where the change is not significant. Each point configured in the system is defined with parameters that define a significant change for that point. If a straight line can be drawn between two points an hour apart and none of the data between them lies significantly outside that straight line, they are not stored. Since PV is the most active, it stores the most data to describe those changes. Because SP seldom changes, and tends to change in steps, it does not require much data to describe its movement. On the chart, I configured P, I, and Mode to have a zero line thickness so they don’t show up as lines because they rarely change and I typically don’t need to see them as line traces over time.

I guess my main question is: Do I need to log all these values with the same timestamp and if so, why?

Hi @grant1,
I just did some work with test data. I am generating data with python writing to a text file then using the Telegraf Tail plugin to ingest the data. Initially in this test, I was only loading a single value. Now I am loading multiple values and they do not have the same timestamps.
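For reference, the generator emits lines roughly like this sketch (names and values are illustrative); capturing one timestamp per sample would be the obvious way to synchronize them:

```python
# Illustrative sketch of the test generator: capture the timestamp once
# per sample so every attribute of a module shares it.
import time

def sample_lines(module, pv, sp, cv):
    ts = time.time_ns()  # one timestamp for the whole sample
    base = f"Analog,Name={module}"
    return [
        f"{base},Attrib=PV value={pv} {ts}",
        f"{base},Attrib=SP value={sp} {ts}",
        f"{base},Attrib=CV value={cv} {ts}",
    ]

lines = sample_lines("FIC-101", 1537.6, 1540, 49.68)
# In the real generator these lines are appended to the text file
# that the Telegraf tail plugin is watching.
print("\n".join(lines))
```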

When viewing the data in InfluxDB Data Explorer, the data shows up, but the tooltip window does not include a value unless the cursor is right on top of the value.

In Grafana, it will show all values whether the cursor is above a saved value or not. This is the behavior to which I am accustomed. At the same time, it is very easy to read when all the data lines up by timestamp.

Were you able to discuss with your InfluxDB resources about using timestamps that are not all synchronized?

Hi @jsever

I use Node-RED to inject multiple values with the same timestamp. Not sure how this is done in Telegraf.

I was basically told that the amount of storage needed for storing constant values (like a setpoint value that rarely changes) is minuscule. I understand Huffman coding is used, and that storing all the values makes processing easier. I did not inquire about non-synchronized timestamps. Maybe start another thread on that subject or search the forum.

That is very interesting. I’m coming from a tradition that used different technology for compressing data. The compression is lossy in that it discards values considered an insignificant change relative to the last or next value. They have two types of compression algorithms. The first is a simple deadband where values that lie within the configured deadband are discarded. Values outside the deadband are then forwarded to the second compression algorithm called the swinging door algorithm. Again, this compression decides whether to keep or reject values. The archived data is compressed by throwing out values that are considered insignificant. There is a very good video that shows these types of compression in action with very good visuals.

The downside to this style of compression is that we have to configure (or tune) the deadband and swinging door tolerances for every point in the archive. Too tight and the archives grow too fast. Too wide and we lose too much data.
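The deadband stage is simple enough to sketch (illustrative only; the swinging door stage that follows it is omitted):

```python
# Illustrative sketch of a deadband (exception) filter: keep a value only
# if it moves more than the configured deadband away from the last value
# that was actually stored.

def deadband_filter(values, deadband):
    stored = [values[0]]  # always keep the first value
    for v in values[1:]:
        if abs(v - stored[-1]) > deadband:
            stored.append(v)
    return stored

raw = [100.0, 100.2, 100.4, 101.5, 101.6, 99.0]
print(deadband_filter(raw, 1.0))  # only the significant moves survive
```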

For very large plants with 10,000 points or more, they often create multiple servers for collecting, and archiving data. For example, I worked at a refinery where every large processing unit (crude tower, vacuum tower, cat cracker, reformer, utilities, etc) had its own historian where data was collected at a 1 second interval but not stored for very long - e.g. 6 months. That local unit data was sent to a site-wide server where the data was down-sampled. Every processing unit funneled its high frequency data to the site-server for down-sampling so it could be stored for many years because the tolerances were widened and more data was thrown out.

InfluxDB compresses very differently where no data is lost. Based on the metrics you described to me about your application - about 1,000 points or more and 4 (or was it 6) years of data with less than a GB of disk space - I suspect the InfluxDB compression methods are much more space efficient.

I think I should jettison what I know about time series data compression, or at least not worry about it, and get on board with the InfluxDB approach to handling data.

Based on what I’m learning from you in this thread, I am thinking of a new schema design. Tell me what you think…
I will have three measurements to capture everything. They are:

  • Analog
  • Discrete
  • Config
  1. Analog will capture PV, SP, and CV. The two main tags will be Name (e.g. FIC-101) and Attrib (e.g. PV, SP, CV). These all have the same fields - value, mode, eu, eu0, eu1 - where eu is engineering units, eu0 is the minimum value for a chart, and eu1 is the maximum value for a chart. Mode is a state represented by an integer that means Manual, Auto, Cascade, or Ratio mode. All are captured at 1 second intervals and sent to Influx.
  2. Discrete is for discrete devices such as two-state valves, motors, and level/flow/pressure/temperature switches. These are not necessarily binary (1:0, true:false, on:off) but can have many different named states. If we store the raw integer data, then we have to create value maps in the client tools. Alternatively, we can cross reference the integer to a state during the ingestion process, though I don't have any idea right now how to do this or which way is best. The primary tags again are Name and Attrib, where the fields are value, enum, and maybe svalue. Value is the integer value, and svalue is the string equivalent. Enum is the name of the enumeration that maps integer values to the state strings - defined elsewhere.
  3. Config is data that rarely changes, but we want to store these values, like the P, I, and D values for a PID controller. This could also be string data like a module's description. It may be more reasonable to save this data in a relational database with a timestamp because of the static nature of this data. Discrete and Analog measurements have unique Name values, where a Name in the Analog measurement is not found in the Discrete measurement (and vice versa). However, Config contains Names found in both the Analog and Discrete measurements as cross references. Again, the primary tags are Name and Attrib. Fields are value, svalue, and eu.
The last measurement doesn't seem to have a proper home in InfluxDB. It seems better suited to a relational database because it changes so infrequently. We still want to capture the data and any changes that do occur, but I don't want to continually ingest that data into InfluxDB if it won't change for 7 years. I could move eu, eu0, and eu1 to the Config data since they don't change much. The only reason I have them with the Analogs is because they can be used to dynamically set these same values in a Grafana display, greatly simplifying dashboard configuration.
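For the integer-to-state cross reference, one option I can imagine is resolving the state name during ingestion, something like this sketch (the enumeration contents are made up):

```python
# Illustrative sketch: resolve a discrete value's state string at
# ingestion time so the raw integer, the enumeration name, and the
# state string all land in the same point.

ENUMS = {
    "ValveState": {0: "Closed", 1: "Open", 2: "Travelling", 3: "Fault"},
    "MotorState": {0: "Stopped", 1: "Running", 2: "Fault"},
}

def discrete_point(name, enum, value, ts_ns):
    svalue = ENUMS[enum].get(value, "Unknown")
    return (f"Discrete,Name={name},Attrib=State "
            f'value={value}i,enum="{enum}",svalue="{svalue}" {ts_ns}')

print(discrete_point("XV-201", "ValveState", 1, 1700000000000000000))
```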

Another option that seems to make sense to me for Analog measurements is to eliminate the Attrib tag and have PV, SP, CV, and Mode as fields. This way, I can ingest a single line for all 4 values with a single timestamp.
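To illustrate the difference between the two options (values are made up):

```python
# Illustrative sketch contrasting the two Analog layouts.

ts = 1700000000000000000

# Option A: Attrib as a tag -- three points, three series, shared timestamp.
option_a = [
    f"Analog,Name=FIC-101,Attrib=PV value=1537.6 {ts}",
    f"Analog,Name=FIC-101,Attrib=SP value=1540 {ts}",
    f"Analog,Name=FIC-101,Attrib=CV value=49.68 {ts}",
]

# Option B: attributes as fields -- one point, one series, one timestamp.
option_b = f"Analog,Name=FIC-101 PV=1537.6,SP=1540,CV=49.68,Mode=2i {ts}"

print(option_b)
```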

What do you think?