Schema design to decrease cardinality

I’m trying to wrap my head around how to design a schema in order to keep cardinality as low as possible. I think I understand cardinality, how it is calculated, and what its impact is, but as I see it there are multiple design choices that solve the problem below. I wonder whether any of these choices is preferred over the others?

The example below is simplified, so we don’t have to discuss whether or not it constitutes a “real” problem given the volumes shown. I still want to know which solution to apply when the problem is real (and the number of tags, or possible tag values, grows).

Consider the following:
I have a measurement (M) with three tags (“three properties”), T1-T3, and one field (“the value”), V. The three tags are required, as they’re all used when querying the data, either to filter, or to group, or both, which means I don’t want them to be fields. Assuming the tags have the following numbers of possible values: T1=100, T2=10, T3=10, the theoretical cardinality would be 100*10*10=10000 (excluding dependent tags, and assuming all combinations will exist).

Here’s why I can reduce cardinality:
I only want to be able to query the data based on any of the following conditions:

  • No tag used for either filtering or grouping (i.e. some “total” aggregation)
  • Any single tag key on its own: T1, T2 OR T3
  • The combination of T1 and T2
  • The combination of T1 and T3

The combination of T2/T3 could be of some value, but I opt to never combine T2 and T3, or all three tags, in order to be able to reduce cardinality (let’s assume 10k is too high).

This means I can reduce the cardinality by recording different values instead, in one of several ways, representing T1&T2 and T1&T3 respectively. It will double the number of points recorded, but reduce the series cardinality from 10k to 2k (100*10 + 100*10). I can choose to sample my values either as “per T1/T2/T3”, or as separate “value per T1/T2” and “value per T1/T3”, so there’s no problem in doing this separation.

I don’t worry too much about the number of points, nor about the ingestion rate, as the sampling rate will be “relatively low”. Assume retention will be high and the query rate will be quite high (multiple consumers), which is why I want to keep cardinality as low as possible (the available hardware resources will be “low” or “moderate”, and I want the best query response times I can get).

There are at least four different ways of solving this that I can think of, all of which work based on my requirements (I’ve tested) and all of which produce the same total cardinality (2000). But which one is the best, if any, and why? Is there a “best practice” for which solution to choose, or anything else I should think about? Or another way to solve it which I haven’t thought of?

  1. Record two points with different tags, towards the same measurement.
    I find it somewhat strange that this works, but it does. I would’ve assumed all points (or rather series) within a measurement would share a common set of tag keys, but it appears they don’t have to.
    M1,T1=X,T2=Y V=value TIMESTAMP
    M1,T1=X,T3=W V=value TIMESTAMP
    One downside is that if I want to query and aggregate using only T1, or no tag at all, I have to single out one of the two tag-key sets (e.g., where T2 has any value, OR where T3 has any value), since each set on its own represents all of my values. Querying the entire measurement without this separation would yield double the result I want. As an alternative to using T2 or T3 to select “half”, this could be solved with a third tag holding a constant value, used as a discriminator (which wouldn’t increase cardinality); we could call that option 1b. (See the query sketch after this list.)

  2. Separate measurements entirely by name, with different tag keys.
    M2_by_T2,T1=X,T2=Y V=value
    M2_by_T3,T1=X,T3=W V=value
    Basically a variant of #1, with the “discriminator” in the measurement name instead. Downside: well, I get “complex” measurement names. “M” is still the measurement, I just want different ways of viewing/graphing/aggregating it…

  3. “no-op” the “irrelevant” tag, with a constant “garbage” (invalid and irrelevant) value (cardinality 100*10*1 + 100*1*10)
    M3,T1=X,T2=Y,T3=0 V=value TIMESTAMP
    M3,T1=X,T2=0,T3=W V=value TIMESTAMP
    Also a variant of #1, where the “garbage” value on T2/T3 can be used as a discriminator to single out “half” the points when querying by T1 only, or by no tag.
    This would keep the set of tag keys constant for all points in the measurement; I don’t know whether that is preferable for some reason.

  4. Combine the possible values of T2/T3 in one tag (assuming the possible values don’t overlap), giving a cardinality of 100*20 instead.
    Greatly increases the complexity of querying, as I would have to filter on one of the two sets of 10 values to get the “half” of the points that represents all of my values (in cases where I only want to query using T1, or no tag). I can’t possibly see this as a good choice, but it does work.
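
To make the difference concrete, here is a minimal sketch of what aggregating “by T1 only” could look like in Flux for option 1 versus option 2 (the bucket name, time range and sum() aggregate are just assumptions for the example):

    // Option 1: one "half" of the points has to be singled out (here via
    // `exists` on T2), otherwise every value is counted twice.
    from(bucket: "mybucket")
      |> range(start: -1h)
      |> filter(fn: (r) => r._measurement == "M1" and r._field == "V")
      |> filter(fn: (r) => exists r.T2)   // keep only the T1/T2 points
      |> group(columns: ["T1"])
      |> sum()

    // Option 2: the measurement name already selects the view, no extra filter needed.
    from(bucket: "mybucket")
      |> range(start: -1h)
      |> filter(fn: (r) => r._measurement == "M2_by_T2" and r._field == "V")
      |> group(columns: ["T1"])
      |> sum()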

Why should I choose one over the other? What would you choose/recommend, and why?
I may be new to Influx, but I have read everything I’ve come across regarding schema design and cardinality. Haven’t found a good answer to this. Is the answer really just “pick one”?

If you’ve read this far, thank you! 🙂
If you have an answer, thank you even more 😃


The only best practice is to keep just the data you need, and don’t create useless tags.

As you already noticed (option 1), the series in a measurement can be completely different, but I don’t really like having a measurement with “mixed objects” (a missing property is fine… but recording different data, not so much), as you will always have to filter it properly in order to get what you want, which makes querying it very annoying. (A practical example: any EAV model, like Windows performance counters.)

Since you are recording the same thing from two different points of view, both complete and not directly comparable, I’d go for option 2, the multiple-measurement one. Querying will be easier, and the schema is clean.

The only problem I see is incoherent data: let’s say M1 lost some points while M2 didn’t… the totals will change depending on which measurement you are using… (but this applies to any solution in which you split your dataset into two different ones, be it by tag or by measurement).

It is important to design a good schema, but I wouldn’t focus too much on cardinality unless strictly necessary. Of course, if you can reduce it to 1/5 it’s going to be good… but is it worth the complexity and the lost flexibility?

Hope you find my opinion helpful


Thank you Giovanni, absolutely helpful!

In my real-world scenario, I will in fact have more tags (six to be accurate: one “common” tag, T1, and five “points of view”, T2-T6), and thus a much higher theoretical cardinality.

As the actual cardinality will in the end depend on use-cases outside of my control, I have to consider the worst-case scenario. The different “points of view”, as you accurately described them, may very well end up with fewer than 10 values each, or with dependent tags (which I “lose” by doing the separation). But as I don’t know that, I consider the worst case. They won’t “run away”: there will be at most 10-20 tag values per tag, and more (~100) for the “common” tag, but doing that math on the low end results in 100*10*10*10*10*10 = 10,000,000, high cardinality (with relatively few points per series). On the high end, 100*20*20*20*20*20 = 320,000,000, ridiculously high…

Splitting the views into five separate measurements, “by_T2” … “by_T6”, getting a total cardinality between 5,000 and 10,000 instead (5*100*10 up to 5*100*20), is definitely worth it, I would say.

I hadn’t really thought about the incoherent-data problem, good point! I think it’s acceptable though; some loss of points is to be expected. And with five views that should all have the same total, the most accurate total will always be the highest number in my case (a sum aggregate over some time frame). I can always compare/max/sum the different series, should I have the need.

I think I’ll go for option 2; I agree it is the cleanest, most straightforward option.

Just as a follow-up, I came up with the following idea to get the “best of both worlds”: low-cardinality long-term (downsampled) data, and coherent, easy-to-record data. I just wanted to know if anyone (@Giovanni_Luisotto 😉) sees a problem:

  1. I sample and record the points in the “full view”, with all tags T1-T6, so I only have one measurement and one value to record for each new point. This has the risk of very high cardinality over time, but not short term: only relatively few series (<10k) of the theoretically possible series (>1M) will actually record new points short-term (let’s say within 24h, as an example). This goes into a measurement in a bucket with 24h retention.
  2. I downsample, and in the process convert this data to the “option 2” format, separating and regrouping tags into separate measurements (beyond the usual aggregateWindow downsampling). Hourly aggregates, as an example, using Tasks scheduled hourly (sketched below). The resulting measurements go into a separate bucket with long/infinite retention.
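
A minimal sketch of what one such hourly task could look like in Flux (the bucket names “raw_24h” and “longterm”, the measurement names, and sum() as the aggregate are assumptions for the example; one task, or one branch, per “view”):

    option task = {name: "downsample_M_by_T2", every: 1h}

    from(bucket: "raw_24h")                          // 24h-retention bucket holding the full-tag measurement
      |> range(start: -task.every)
      |> filter(fn: (r) => r._measurement == "M" and r._field == "V")
      |> group(columns: ["T1", "T2", "_field"])      // keep only this view's tags; T3-T6 are dropped by the aggregate
      |> aggregateWindow(every: 1h, fn: sum)
      |> set(key: "_measurement", value: "M_by_T2")  // the "option 2" measurement name
      |> to(bucket: "longterm")                      // long/infinite-retention bucket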

As a result, I should get low-cardinality data in long-term archiving, thanks to the “option 2” format, without the risk of incoherent data: the views are all downsampled from the same recorded data, so either they all miss points, or none do. My “potentially high-cardinality” data is kept low thanks to the short retention (in combination with the irregular sampling of different series, in my specific case). But it also gives full flexibility short term (a “live” view of the last 24h, from any point of view, or any filter/group combination of tags), as well as simplified sampling and recording against “the one” measurement.

Opinions??

Makes sense to me, I don’t see any issue…
I’ve used something like this as a sort of “temporary” measurement/bucket and it works well; you just have to be careful with the queries.

It might be a little annoying if you have to present both long-term and real-time data (in the same chart), but that’s up to the requirements and still more than doable. In my case, to avoid this, I’ve got a 10m delay (which is the CQ/task frequency) on some data, and that’s it.
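
For example, stitching the downsampled history and the live 24h bucket into one result could look roughly like this (bucket and measurement names are taken from the assumptions above, and only the T1/T2 view is shown):

    // Long-term history up to 24h ago, plus the live data re-aggregated on the fly,
    // combined into one series set for charting.
    history = from(bucket: "longterm")
      |> range(start: -30d, stop: -24h)
      |> filter(fn: (r) => r._measurement == "M_by_T2" and r._field == "V")

    recent = from(bucket: "raw_24h")
      |> range(start: -24h)
      |> filter(fn: (r) => r._measurement == "M" and r._field == "V")
      |> group(columns: ["T1", "T2", "_field"])
      |> aggregateWindow(every: 1h, fn: sum)

    union(tables: [history, recent])
      |> group(columns: ["T1", "T2"])
      |> sort(columns: ["_time"])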