Series cardinality calculation

I have read the Data Layout and Schema Design Best Practices for InfluxDB (Data Layout and Schema Design Best Practices for InfluxDB | InfluxData) blog post and I am now considering how I should set up my database.

According to the blog post “Series cardinality is the number of unique bucket, measurements, tag sets, and field keys combinations in an organization”.

There is this example formula:

I wonder if it makes sense to include the number of buckets in the series cardinality calculation in the way it is presented in the formula? Shouldn’t the series cardinality be calculated separately for each bucket?

For example, if I have two buckets: bucket_A, bucket_B
within bucket A I have one measurement and one field: measurement_A, field_A
within bucket B I also have one measurement and one field: measurement_B, field_B

So in total I have 2 buckets, 2 measurements and 2 fields.

The way I interpret the example formula, the series cardinality should be calculated:
SC = number_of_buckets * number_of_measurements * number_of_field_keys = 2 * 2 * 2 = 8

Based on the descriptions in the blog post, however, it sounds to me like the series cardinality should be calculated separately for each bucket, because there could, for example, never be a situation where bucket_B is combined with measurement_A and field_A:
SC = number_of_measurements * number_of_field_keys
SC_bucket_A = 1 * 1 = 1
SC_bucket_B = 1 * 1 = 1
SC_total = SC_bucket_A + SC_bucket_B = 1 + 1 = 2
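As a sanity check, the two interpretations can be sketched in Python (a hypothetical sketch, not anything from the blog post; the schema dict just encodes the two-bucket example above):

```python
# Hypothetical sketch: encode the two-bucket example and compare
# the two interpretations of the formula.
schema = {
    "bucket_A": {"measurement_A": ["field_A"]},
    "bucket_B": {"measurement_B": ["field_B"]},
}

# Interpretation 1: multiply the organization-wide counts.
n_buckets = len(schema)
n_measurements = sum(len(b) for b in schema.values())
n_field_keys = sum(len(f) for b in schema.values() for f in b.values())
sc_global = n_buckets * n_measurements * n_field_keys  # 2 * 2 * 2 = 8

# Interpretation 2: calculate per bucket, then sum.
sc_per_bucket = {
    name: len(b) * sum(len(f) for f in b.values())
    for name, b in schema.items()
}
sc_total = sum(sc_per_bucket.values())  # 1 + 1 = 2

print(sc_global, sc_total)  # → 8 2
```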

Is this correctly understood?

I’ve always calculated series cardinality per measurement; honestly, that’s the only way I’ve seen it so far. (This post about cardinality calculation says the same for InfluxDB v2.)

This is the definition of Series: A logical grouping of data defined by shared measurement, tag set, and field key.

If you have dependent/related tags, the plain/simple multiplication will give you an over-estimate of the cardinality (see here).
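A tiny illustration of that over-estimate, using made-up host/region tags where each host belongs to exactly one region:

```python
# Hypothetical example: host and region are dependent tags, since each
# host lives in exactly one region, so their values never combine freely.
series_tag_sets = {
    ("host1", "eu"),
    ("host2", "eu"),
    ("host3", "us"),
}

hosts = {host for host, _ in series_tag_sets}
regions = {region for _, region in series_tag_sets}

naive = len(hosts) * len(regions)  # 3 * 2 = 6 (over-estimate)
actual = len(series_tag_sets)      # only 3 series actually exist

print(naive, actual)  # → 6 3
```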

The doc link you shared isn’t clear enough about this, so I’ll ask @Anaisdg if she can have a look at it too. It might be worth updating it in case it’s misleading.

Hmm, that sounds like a good point. Let’s modify my example slightly:

I have two buckets: bucket_A, bucket_B
within bucket A I have two measurements (air_measurement and radiation_measurement) and three fields. Two of the fields belong to air_measurement: temperature and pressure, and one field belongs to radiation_measurement: solar_radiation_power
within bucket B I have one measurement: humidity and one field: humidity

So in total I have 2 buckets, 3 measurements and 4 fields.

The example formula would give:
SC=number_of_buckets * number_of_measurements * number_of_field_keys = 2 * 3 * 4 = 24

Based on @Giovanni_Luisotto’s answer, my impression is that it does not make sense to calculate the cardinality across buckets or across measurements, because measurements across buckets never get combined (example: humidity will never be combined with bucket A), nor do fields across measurements get combined (example: the field solar_radiation_power will never be combined with air_measurement).

Thus the cardinality should rather be calculated separately for each measurement:

air_measurement SC = number_of_field_keys = 2
radiation_measurement SC = 1
humidity SC = 1
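The per-measurement calculation above could be written out as follows (a hypothetical sketch; the measurement and field names come straight from the example):

```python
# Hypothetical sketch: field keys per measurement, from the modified example.
fields_per_measurement = {
    "air_measurement": ["temperature", "pressure"],
    "radiation_measurement": ["solar_radiation_power"],
    "humidity": ["humidity"],
}

# Cardinality per measurement is just the number of its field keys here
# (no tags in the example); sum them for the overall total.
sc_per_measurement = {m: len(f) for m, f in fields_per_measurement.items()}
sc_total = sum(sc_per_measurement.values())  # 2 + 1 + 1 = 4

print(sc_per_measurement)
print(sc_total)  # → 4
```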

Is this correctly understood?

That’s the correct approach in my opinion/experience; it worked like that in InfluxDB v1 and I honestly doubt it has changed in InfluxDB v2. Other (v2-related) posts seem to point in the same direction.

If you prefer you can wait for a more official answer (from InfluxData staff members)

@matias In InfluxDB OSS (both 1.x and 2.x), a series is defined by a common measurement and tag set (not field key). In InfluxDB Cloud, the field key is considered part of the series definition. This is because field keys are indexed in InfluxDB Cloud, but they are not indexed in InfluxDB OSS.

The reason cardinality matters is due to the efficiency of the index. If there’s too much cardinality in the database as a whole, the index can grow very large and InfluxDB will start to consume a lot of memory. So you really should consider system-wide cardinality. Right now, InfluxDB (both Cloud and OSS) reports on cardinality on the bucket level. The influxdb.cardinality() function reports the cardinality of data in a specific bucket.

So the calculation in your first post is correct if you’re using InfluxDB Cloud. If you’re using InfluxDB OSS, you can leave field keys out of the cardinality equation. But to understand system-wide cardinality, you need to calculate the cardinality of each bucket, then sum the cardinality of all buckets.
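To make the OSS vs. Cloud difference concrete, here is a hypothetical sketch (the tag-set counts are invented for illustration; only the include-field-keys switch reflects the series definitions described above):

```python
# Hypothetical sketch: potential cardinality of one bucket under the two
# series definitions. Each measurement maps to (tag set combinations,
# number of field keys); the numbers here are made up.
def bucket_cardinality(measurements, include_field_keys):
    total = 0
    for tag_combos, n_field_keys in measurements.values():
        # OSS:   series = measurement + tag set
        # Cloud: series = measurement + tag set + field key
        total += tag_combos * (n_field_keys if include_field_keys else 1)
    return total

bucket_a = {
    "air_measurement": (10, 2),        # 10 tag set combos, 2 field keys
    "radiation_measurement": (10, 1),  # 10 tag set combos, 1 field key
}

print(bucket_cardinality(bucket_a, include_field_keys=False))  # OSS → 20
print(bucket_cardinality(bucket_a, include_field_keys=True))   # Cloud → 30
```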

Now in your example:

I have two buckets: bucket_A, bucket_B
within bucket A I have two measurements (air_measurement and radiation_measurement) and three fields. Two of the fields belong to air_measurement: temperature and pressure, and one field belongs to radiation_measurement: solar_radiation_power
within bucket B I have one measurement: humidity and one field: humidity

Here you actually have dependent fields (fields that exist in one measurement but not another). This is really common and does affect the potential cardinality of your data, since each such field is limited to a specific measurement. In this case, you need to consider cardinality on a measurement level.

But again, that calculation will differ between InfluxDB OSS and InfluxDB Cloud. In OSS, field keys are not part of the series definition and do not need to be included in the cardinality equation.

So this has all been about calculating the potential cardinality of your data, but if you’re looking for the actual cardinality in your database organization as a whole (across all buckets), you can use the following Flux query (with InfluxDB 2.4+ or InfluxDB Cloud):

import "array"
import "influxdata/influxdb"

// Collect the IDs of all buckets in the organization.
bucketList =
    buckets()
        |> findColumn(fn: (key) => true, column: "id")

// Query the series cardinality of each bucket.
cardinalities =
    array.map(
        arr: bucketList,
        fn: (x) => {
            cardinality =
                (influxdb.cardinality(bucketID: x, start: time(v: 0))
                    |> findColumn(fn: (key) => true, column: "_value"))[0]

            return {bucketID: x, _value: cardinality}
        },
    )

// Sum the per-bucket cardinalities to get the organization-wide total.
array.from(rows: cardinalities)
    |> sum()

Thanks @scott, this was a very clarifying answer. Now I think I understand the idea. The clarification on whether field keys are to be considered part of the series cardinality was helpful as well since the documentation on this topic is slightly unclear in my opinion.