What is the criteria to use multiple fields per measurement?

#1

Reading through the documentation, I found examples of a single field per measurment:

  1. Measurement: cpu, Tag(s): host/region, Field(s): value
  2. Measurement: foodships, Tag(s): park_id/planet, Field(s): #_foodships
  3. Measurement: weather_sensor, Tag(s): crop/plot/region, Field(s): temp

And two fields per measurement:

  1. Measurement: orders, Tag(s): -, Field(s): phone/website
  2. Measurement: payment, Tag(s): device/product/method, Field(s): billed/licenses
  3. Measurement: stock, Tag(s): symbol, Field(s): bid/ask
  4. Measurement: temperature, Tag(s): machine/type, Field(s): external/internal
  5. Measurement: census, Tag(s): location/scientist, Field(s): butterflies/honeybees
  6. Measurement: h2o_feet, Tag(s): location, Field(s): “level description”/water_level
  7. Measurement: weather, Tag(s): location, Field(s): temperature/bug_concentration

For all the examples with two fields, what is the benefit over just adding a tag? For instance, orders could use tag source (phone/website), temperature could use source location (external/internal), etc.

The first five two field examples in a sense have the same measurement but separated based on what they are applied to (type or location), the second to last has one field which is dependent upon the other (is this bad design?), and the last has two fields which are totally different. Are there implications of taking one approach over the other?

Several other potential examples which I thought of are as follows. Are these good uses?

  1. Measure a vector (i.e. velocity is speed and direction).
  2. Measure single phase power (volts, amps, phase).
  3. Measure three phase power (volts, amps, phase for each phase).

What should one do if a sensor type is used in a measurement with a single field and a second measurement in another field? For instance, there are many voltage measurements but a couple power measurements which share the same voltage.

What criteria makes using multiple fields per measurement a good or bad choice? For instance:

  1. All fields have values for each inserted timestamp (Ah, this is a must have! Per https://docs.influxdata.com/influxdb/v1.3/concepts/glossary/#field-value, NULL is not supported. This would require using zero or something if a value wasn’t provided which is not correct.).
  2. All fields receive data from the same device (probably not important).
  3. Common queries include an aggregate function on both fields over the same GROUP BY duration.
  4. Anything else?

EDIT. A little off topic, but how application specific should a measurement be? For instance, temperature when BBQing a rib eye steak is pretty specific and tags can be the chief, quality grade of meat, etc. But plain old temperature as a measurement can span from my steak example, the temperature on Pluto, and the temperature of the coals of my BBQ do not seem to have much in common. Maybe when BBQing something with a tag of what is being BBQ’d could work…

Thank you

#2

I’m fairly new on influxDB, so take what i will say for the ultimate truth.

However, here is my opinion :slight_smile:

Short version :

  1. Yeah, not supporting null feels wrong. Just use a convention of your own, if you’re sure you will have positive value, you can put 0 or -1.
  2. If there was different devices, you would probably just add a tag “device_type” or “device_name”.
  3. Remember you can only group by TAGS.
  4. Remember TAGS are indexed, so don’t put everything as TAGS if you don’t need to group by or use it as research criteria. Using fields instead allows you to use numeric operators on your numeric values.
  5. Basically, everything that is not a research criteria, and is not “unique” (consider TAG SET as an indexed concat Primary Key), should be a field.

Long boring explaination :slight_smile:

As far as i understood from the docs, videos etc. Fields and Tags are just not the same at all, and doesn’t serve the same purpose.

Fields VS Tags :
Tags are indexed, they can be used with regex in queries, and are supposedly search criteria in your queries.
Tags contain String type, so if you put numeric value in it, you won’t be able to use numeric operators anymore, as it’s a String now.
Fields are not indexed, and can’t use regex functions in queries.

So you should use Tags for reasearch criteria. The benefit of NOT using a tag, is to prevent the index to grow.
Basically , as far as i understand, influxDB works with 2 index: 1 time based index, and 1 based on Tag Set. So if you add tags, your index (which is in your RAM), will grow as will your tag set.

I will take one of your examples :
Measurement: temperature, Tag(s): machine/type, Field(s): external/internal

There will be a value for internal and external temperatures, so it wil probably no be a search criteria (You will more likely want values over a period, or from different devices, and will want both internal & external to compare them).
But the machine / type will. Also using internal and external as fields allow you to manipulate it as numeric values, and not string, which is very convenient.

Edit :
Another example of yours :
Measure three phase power (volts, amps, phase for each phase).
Measurement : power_monitoring Tags : device_type, sensor_id, phase_name Fields : voltage, amp

So you would have the possibility to use any tag as search criteria. And still have voltage / amp for each phase. I don’t really see the problem if you have the same voltage or amp for multiple phase, it just means everything is ok. On the other hand, if something goes wrong, you will have different values and would want to know it, as it will probably help you understand where the problem is coming from (Phase1 , 2 or 3 for example).

About your off topic question, i would say that you’d rather separate measurements than making one big bucket containing a lot of data which, in the end, will probably make no sense.

Example :
Measurement restauration_monitoring : Tags : chef, restaurant, quality_grade Fields : temperature, timestamp
Measurement planet_monitoring : Tags : planet Fields : temperature, timestamp

Why ? Because you willl probably never need these 2 temperatures in the same application. As you very well said, there is no logic in storing the temperature of your bbq just right to a planet temperature. Unless you are measuring all temperatures everywhere around the galaxy.

So, basically, 2 different applications / logic / sense = 2 different measurements.
Actually i think would even make 2 different database for this example…

#3

Regarding storing 0 or -1 in the field instead of NULL, I think will prevent using an aggregate functions on the field as one can’t/shouldn’t filter it out in the WHERE clause as it isn’t indexed. This is kind of a big deal, and hopefully the docs will be updated to provide better direct when (and when not) multiple fields could be used.

In regards to limiting the number of tags, I also think you want to limit the possible strings stored in tags. Probably not really a big deal unless it gets excessive.

I appreciate your examples, and they helped me think through them. Would would you recommend if you new nothing about the value being stored other than some unique indicator ID? It seems the closest thing to the examples in the docs is a separate measurement for each thing being stored with a single field and no tag, but maybe a single measurement with a single field and a tag for the ID would be better?

Thanks again!

#4

Definitely put a Tag as unique_id instead doing multiple measurement, as long as the data makes sense together.
For example, if your unique_id are coming from sensors, which are in the same building / installation / enterprise / measure the same kind of things, if any of these make sense, i don’t see any reason to multiply measurements.

As far as i’m concerned, i see tags as a concatenated index of FK / PK when you compare to relationnal database.
And measurements are kinda tables if you want a comparison, even though it’s a little bit different…
So i don’t see reasons to separate measurements if your datas makes sense together. Also, you will probably want to compare datas from sensor1 (id1) , and sensor2 (id2) at some point. And having 2 different measurements will be a pain in the ass at that moment.

Also, if you thought about separate measurements because you will have a lot of data over time.
For example : you have data each second from sensor but you will want a graphic over the day.
Then you should look into continuous queries & downsampling data.

Doc about continuous queries here.
Video of downsampling data with continuous queries.

Long story short, you should create a second measurement (for example “downsampled_measurement”), and create a continuous query which will execute every X time and insert datas in it. Then you will have a measurement with full precision and A LOT of data, and a downsampled measurement which will contain less data, but will be much easier to manipulate and faster to use when looking over large period of times.

#5

For all the examples with two fields, what is the benefit over just adding a tag? For instance, orders could use tag source (phone/website), temperature could use source location (external/internal), etc.

When designing your data schema you should think about the queries you want to perform. One of the biggest advantages of using multiple fields is that it allows for easy use of functions. For instance if you have data structured like this:

temperature,machine=unit42,type=assembly external=25,internal=37 1434067467000000000

It is easy to perform queries like:

select internal-external as difference from temperature

If you had structured the data with a single field:

temperature,machine=unit42,type=assembly,sensor=external value=25 1434067467000000000
temperature,machine=unit42,type=assembly,sensor=internal value=37 1434067467000000000

Then you won’t be able to perform functions or use mathematical operators easily.

All fields have values for each inserted timestamp (Ah, this is a must have! Per https://docs.influxdata.com/influxdb/v1.3/concepts/glossary/#field-value2, NULL is not supported. This would require using zero or something if a value wasn’t provided which is not correct.).

If you just don’t set the field, it can essentially be thought of as null. This is usually preferred over using a 0 value and won’t mess up your aggregations.

#6

In addition to not being easy, will it also be inefficient?

Ah, perfect. Do you know where this is documented?

Thanks!

#7

I don’t actually know a way to use functions or operators if they are structured as the same value, other than adjusting it to the multi-field schema with a continuous query which would be quite expensive.

The closest documentation I can find for unset fields is in the fill syntax docs, here we find that fill(null):

Reports null for time intervals with no data but returns a timestamp. This is the same as the default behavior.