I’m trying to wrap my head around how to design a schema that keeps cardinality as low as possible. I think I understand cardinality, how it is calculated, and what its impact is, but as I see it there are multiple design choices that solve the problem below. I wonder whether any of these choices is preferred over the others?
The example below is simplified, so we don’t have to discuss whether or not it constitutes a “real” problem, given the volumes shown. I still want to know which solution to apply when the problem is real (and the number of tags, or of possible tag values, grows).
Consider the following:
I have a measurement (M) with three tags (“three properties”), T1-T3, and one field (“the value”), V. The three tags are required, as they’re all used when querying the data, either to filter, to group, or both, which means I don’t want them to be fields. Assuming the tags have the following numbers of possible values: T1=100, T2=10, T3=10, the calculated theoretical cardinality would be 100*10*10 = 10000 (excluding dependent tags, and assuming all combinations will exist).
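For reference, a point in this naive schema would look something like this in line protocol (X, Y, W and the value are placeholders):

M,T1=X,T2=Y,T3=W V=value TIMESTAMP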
Here’s why I can reduce cardinality:
I only want to be able to query the data based on any of the following conditions:
- No tag used for either filtering or grouping (i.e. some “total” aggregation)
- Only using any single tag key, T1, T2 OR T3
- The combination of T1 and T2
- The combination of T1 and T3
The combination of T2/T3 could be of some value, but I opt to never combine T2 and T3, or all three tags, in order to reduce cardinality (let’s assume 10k is too high).
This means I can reduce the cardinality by recording each value twice instead, in one of several different ways, representing T1&T2 and T1&T3 respectively. It will double the number of points recorded, but reduce the series cardinality from 10k to 2k (100*10 + 100*10). I can choose to sample my values either as “per T1/T2/T3”, or as separate “value per T1/T2” and “value per T1/T3”, so there’s no problem in doing this separation.
I don’t worry too much about the number of points, nor about the ingestion rate, as the sampling rate will be “relatively low”. Assume retention will be high, and query-rate will be quite high (multiple consumers), which is why I want to keep cardinality as low as possible (the hardware resources available will be “low” or “moderate”, and I want the best possible query response times I can get).
There are at least four ways of solving this that I can think of, all of which work for my requirements (I’ve tested them) and all of which produce the same total cardinality (2000). But which one is the best, if any, and why? Is there a “best practice” for which solution to choose, or anything else I should think about? Or another way to solve it which I haven’t thought of?
Option 1: Record two points, with different tag sets, into the same measurement.
I find it somewhat strange that this works, but it does. I would’ve assumed all points (or rather series) within a measurement would share a common set of tag keys, but it appears they don’t have to.
M1,T1=X,T2=Y V=value TIMESTAMP
M1,T1=X,T3=W V=value TIMESTAMP
One downside is that if I want to query and aggregate using only T1, or no tag at all, I have to single out one of the two tag key sets (e.g., where T2 has any value, OR where T3 has any value), because each set on its own already represents all my values. Querying the entire measurement without this separation would yield double the result I want. As an alternative to using T2 or T3 to select “half”, this could be solved with a third tag holding a constant value, used as a discriminator (which wouldn’t increase cardinality); we could call that option 1b.
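As a sketch of the first approach, relying on InfluxQL treating a missing tag as an empty string, a per-T1 aggregation over just the “T2 half” could look like:

SELECT sum("V") FROM "M1" WHERE "T2" != '' GROUP BY "T1"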
Option 2: Separate measurements entirely by name, with different tag keys.
M2_by_T2,T1=X,T2=Y V=value TIMESTAMP
M2_by_T3,T1=X,T3=W V=value TIMESTAMP
Basically a variant of option 1, with the “discriminator” in the measurement name instead. The downside is that I get “complex” measurement names. “M” is still the measurement; I just want different ways of viewing/graphing/aggregating it.
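With this option, the same per-T1 aggregation simply targets one of the two measurements, e.g. (a sketch):

SELECT sum("V") FROM "M2_by_T2" GROUP BY "T1"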
Option 3: “No-op” the “irrelevant” tag with a constant “garbage” (invalid and irrelevant) value (cardinality 100*10*1 + 100*1*10 = 2000).
M3,T1=X,T2=Y,T3=0 V=value TIMESTAMP
M3,T1=X,T2=0,T3=W V=value TIMESTAMP
Also a variant of option 1, where the “garbage” value on T2/T3 can be used as a discriminator to single out “half” the points when querying by T1 alone, or by no tag.
This would keep the set of tag keys constant for all points in the measurement; I don’t know if that’s preferable for some reason.
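As a sketch, using the constant value from the example above as the discriminator (tag values are always strings, hence the quoted '0'), a per-T1 aggregation could look like:

SELECT sum("V") FROM "M3" WHERE "T3" = '0' GROUP BY "T1"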
Option 4: Combine the possible values of T2 and T3 in one tag (assuming the possible values don’t overlap), giving a cardinality of 100*20 = 2000 instead.
This greatly increases the complexity of querying, as I would have to filter on one of the two sets of 10 values to get the “half” of the points that represents all of my values (in cases where I only want to query using T1, or no tag). I can’t see this as a good choice, but it does work.
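For completeness, a sketch in the same style as the other options (T23 is a hypothetical name for the combined tag; Y is drawn from T2’s value set, W from T3’s):

M4,T1=X,T23=Y V=value TIMESTAMP
M4,T1=X,T23=W V=value TIMESTAMP

A T1-only aggregation then has to enumerate one of the two value sets, roughly like this (assuming y1..y10 are T2’s ten possible values):

SELECT sum("V") FROM "M4" WHERE "T23" =~ /^(y1|y2|y3|y4|y5|y6|y7|y8|y9|y10)$/ GROUP BY "T1"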
Why should I choose one over the other? What would you choose/recommend, and why?
I may be new to Influx, but I have read everything I’ve come across regarding schema design and cardinality. Haven’t found a good answer to this. Is the answer really just “pick one”?
If you’ve read this far, thank you!
If you have an answer, thank you even more