Is there a performance penalty for having multiple measurements instead of one measurement with multiple tags?

From the documentation, the only advantage I see pointed out is querying without regex,
like here

Although Influx says it's better to use tags, the docs also say that Influx stores separate tag values in different series, so I'm wondering whether there is any performance or storage improvement from having one measurement with multiple tags instead of multiple measurements.

There is also a statement that high series cardinality in a measurement will increase memory usage.

That is a counterargument to following this practice, where we'd store all data under separate tags just to query without regex.

Any thoughts?

Hello @Nikola,
Tags are indexed (measurements are not), so you can query data faster by using them. It's also good to note that tags are often dependent, so the series cardinality is smaller than in the case where you separate the data out into different measurements with one tag. See this documentation for an example of what I mean by dependent tags. Thanks for the good question!
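
To make the dependent tags point concrete, here is a minimal sketch. The `region` and `host` tags and the sample points are hypothetical, and the helper names are mine, not anything from InfluxDB. It compares the worst case, where every tag value could pair with every other, against the combinations that actually occur.

```python
from math import prod

def worst_case_cardinality(points, tag_keys):
    """Upper bound: product of the distinct value counts of each tag key."""
    return prod(len({p[k] for p in points}) for k in tag_keys)

def actual_cardinality(points, tag_keys):
    """Number of distinct tag-value combinations that actually occur."""
    return len({tuple(p[k] for k in tag_keys) for p in points})

# Hypothetical sample: each host lives in exactly one region,
# so the host tag depends on the region tag.
points = [
    {"region": "eu", "host": "eu-1"},
    {"region": "eu", "host": "eu-2"},
    {"region": "us", "host": "us-1"},
    {"region": "us", "host": "us-2"},
]
tag_keys = ["region", "host"]

worst = worst_case_cardinality(points, tag_keys)  # 2 regions * 4 hosts
actual = actual_cardinality(points, tag_keys)     # only 4 pairs really exist
print(worst, actual)
```

With dependent tags the real count stays at the number of hosts, well below the product of the per-tag counts.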

In the concrete task I'm trying to tackle, there is plenty of cardinality with floating point measurements.

These measurements have tags:
each is one big series with plenty of float data and some tags that index a smaller number of values along that big series.

I'm now wondering: if I combine multiple measurements that have plenty of float values (I assume this is high cardinality),
does that offer any improvement over keeping them in multiple measurements? The only benefit in the docs is querying without regex.

One way I tried reasoning about this is as follows.

When multiple floating point measurements with this assumed high cardinality are combined, indexing them will blow up memory, as some blog posts say.

There might be some saving in space due to dependent tags,
but memory might blow up quite a bit due to the high cardinality.

With separate measurements, a bit of extra space is used because the tags are not dependent,
but there is no issue of a big index being built,
since only a few floating point values are indexed in the current tags.

If we combined two measurements, Influx would need to index pretty much the entire long series to figure things out.

Does this reasoning sound legit?

Regards,
Nikola

You want to store your data in different tags, not measurements. Influx is made to handle extremely high series cardinality use cases. Dependent tags decrease your cardinality. You can enable TSI for high cardinality use cases. Having separate measurements will still increase your cardinality. I would just calculate your series cardinality with data stored in tags and then use this sizing guide.
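
As a rough sketch of calculating series cardinality for both layouts, assume a series is identified by its measurement name plus its tag set, and that field values are not part of that key. The `series_key` helper, the `reading` measurement and the sensor names are made up for illustration.

```python
def series_key(measurement, tags):
    """Build a series identifier from measurement name and sorted tag set."""
    return measurement + "".join(f",{k}={v}" for k, v in sorted(tags.items()))

# Hypothetical readings: (source name, float field value).
readings = [
    ("sensor_a", 21.5),
    ("sensor_a", 21.7),
    ("sensor_b", 30.1),
    ("sensor_b", 29.8),
    ("sensor_c", 12.0),
]

# Schema 1: one measurement, with the source stored as a tag.
tagged = {series_key("reading", {"sensor": s}) for s, _ in readings}

# Schema 2: one measurement per source, no tags.
split = {series_key(s, {}) for s, _ in readings}

# The float values never enter the key, so both schemas give the same count.
print(len(tagged), len(split))
```

Under that assumption, splitting the sources into separate measurements does not reduce the number of series; it only moves the distinguishing name from a tag value to the measurement name.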

That sounds nice.
I was reading online and saw some blog posts that raise alarms about memory and tag cardinality.

So the main concern is: if you have floating point data in a span of around 30,
and you have lots of it over a long period of time,
and you want to run this on medium instances,
then if you put 100 of these, or let's say 1000 or more, in tags,
will Influx build an index in memory for each of these so it can find the data points?

The main concern is that if you have it all in tags,
then Influx builds an in-memory structure for finding where each tag's data is,
and it takes space.

And if those are separated into different measurements,
does Influx then not build an in-memory structure but instead load it from disk?

Thanks for the help, Anaisdg.

@Nikola You're welcome! The shape won't have an impact on the in-memory structure. You still use memory for each tag, whether it is stored in separate measurements or in the same one. What blog were you reading that confused you?

You could also just take a sample of your data set and try both schemas and monitor your influx instance to verify for yourself. That sounds like a fun experiment.
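
For that kind of experiment, the same sample can be written out in line protocol under both schemas and loaded into two test databases. This is only a sketch: `to_line`, the `reading` measurement, the `source` tag and the sample values are hypothetical, and the formatter does no escaping of spaces or commas.

```python
def to_line(measurement, tags, fields, ts_ns):
    """Format one point as line protocol (no escaping of spaces or commas)."""
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {ts_ns}"

# Hypothetical sample: (source name, float value, timestamp in nanoseconds).
sample = [("sensor_a", 21.5, 1000), ("sensor_b", 30.1, 1000)]

# Schema 1: single measurement, source stored as a tag.
tagged_lines = [to_line("reading", {"source": s}, {"value": v}, t) for s, v, t in sample]

# Schema 2: one measurement per source, no tags.
split_lines = [to_line(s, {}, {"value": v}, t) for s, v, t in sample]

print("\n".join(tagged_lines))
print("\n".join(split_lines))
```

Writing each list to its own database and watching memory and disk usage on both would give a direct comparison on your own data.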

https://blog.zhaw.ch/icclab/influxdb-design-guidelines-to-avoid-performance-issues/ says at some point:

This would result in the need for more than 32 GB, because InfluxDB would try to construct an inverted index in memory, which would always be growing with the cardinality.

So this inverted index is what I imagined Influx uses to figure out where the data is on disk for each tag when it is stored in a single measurement,
but I thought that with two measurements it would not need it if they are separated on disk.
I assumed it uses a bit more disk but has no need to build this structure in memory to figure things out.
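
A toy model may help picture what "growing with the cardinality" means in that quote. This is not InfluxDB's actual implementation, only a sketch under the assumption that an inverted index maps each tag key=value pair to the series that carry it; `build_index` and `index_size` are hypothetical helpers.

```python
from collections import defaultdict

def build_index(points):
    """Map each 'key=value' tag pair to the set of series keys that carry it."""
    index = defaultdict(set)
    for measurement, tags, _value in points:
        key = measurement + "".join(f",{k}={v}" for k, v in sorted(tags.items()))
        for k, v in tags.items():
            index[f"{k}={v}"].add(key)
    return index

def index_size(index):
    """Total number of (tag pair, series) entries held in the index."""
    return sum(len(series) for series in index.values())

# Two series, one point each.
points_few = [("reading", {"sensor": "a"}, 1.0), ("reading", {"sensor": "b"}, 2.0)]

# Same two series, plus 1000 more float values written to one of them.
points_many = points_few + [("reading", {"sensor": "a"}, float(i)) for i in range(1000)]

size_few = index_size(build_index(points_few))
size_many = index_size(build_index(points_many))
print(size_few, size_many)
```

In this toy model the index grows when a new series appears, not when more float values are written to an existing one.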

I could try to benchmark, but I was hoping to google a bit or ask here to avoid the benchmarking work :-D.
One of the things with benchmarking is that
this thing would have to run for a long time and have plenty of data inside, since it is going to be pretty much a workhorse database.
Since it's a bit tricky to reload Influx, I'm hoping to minimize the risk of picking an unfortunate setup that, a few months in and filled with a lot of data, starts choking and needs a reload, because data might start coming from plenty of sources and grow quite quickly.

If tags come at no extra cost but save disk space, that would be great.
But if separate measurements would offer, for the price of a bit more disk, longer-lasting performance on smaller hardware, then my thinking was that disk space is the cheapest thing and most probably won't grow that much.
So now, from your comments, I'm starting to think that this reasoning is a bit faulty, but I'm not sure everything is clear.

If you say tags come at no extra cost but save disk space,
then I might as well go for that.

I get that once it loads into memory it uses memory, but if there were multiple measurements that get loaded in and out of memory, that might be nicer than having extra memory usage just because I hoped to save some space on disk.