Series cardinality calculation

@matias In InfluxDB OSS (both 1.x and 2.x), a series is defined by a common measurement and tag set (not field key). In InfluxDB Cloud, the field key is considered part of the series definition. This is because field keys are indexed in InfluxDB Cloud, but they are not indexed in InfluxDB OSS.

Cardinality matters because it affects the efficiency of the index. If the database as a whole has too much cardinality, the index can grow very large and InfluxDB will start to consume a lot of memory. So you really should consider system-wide cardinality. Right now, InfluxDB (both Cloud and OSS) reports cardinality at the bucket level. The influxdb.cardinality() function reports the cardinality of data in a specific bucket.

So the calculation in your first post is correct if you’re using InfluxDB Cloud. If you’re using InfluxDB OSS, you can leave field keys out of the cardinality equation. But to understand system-wide cardinality, you need to calculate the cardinality of each bucket, then sum the cardinality of all buckets.

Now in your example:

I have two buckets: bucket_A, bucket_B
within bucket A I have two measurements (air_measurement and radiation_measurement) and three fields. Two of the fields are related to the air_measurement: temperature and pressure and one field is related to the radiation_measurement: solar_radiaton_power
within bucket B I have one measurement: humidity and one field: humidity

Here you actually have dependent fields (fields that exist in one measurement but not another). This is really common and does affect the potential cardinality of your data, since each of those fields is limited to a specific measurement. In this case, you need to consider cardinality on a measurement level.

But again, that calculation will differ between InfluxDB OSS and InfluxDB Cloud. In OSS, field keys are not part of the series definition and do not need to be included in the cardinality equation.
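To make the per-measurement arithmetic concrete, here is a rough Python sketch. The measurement and field names come from your example, but you didn't list any tags, so the tag names and unique-value counts below are made up purely for illustration. It also assumes every combination of tag values actually occurs, so the result is an upper bound on potential cardinality, not what your database necessarily holds.

```python
from math import prod

# Measurement and field names are taken from the example in this thread.
# Tag names and unique-value counts are HYPOTHETICAL (the example lists no tags).
schema = {
    "bucket_A": {
        "air_measurement": {
            "tags": {"station": 10, "sensor": 3},
            "fields": ["temperature", "pressure"],
        },
        "radiation_measurement": {
            "tags": {"station": 10},
            "fields": ["solar_radiaton_power"],
        },
    },
    "bucket_B": {
        "humidity": {
            "tags": {"station": 10},
            "fields": ["humidity"],
        },
    },
}


def measurement_cardinality(measurement, include_fields):
    """Upper bound for one measurement: product of unique tag value counts,
    multiplied by the number of field keys only when fields count as part
    of the series definition (the InfluxDB Cloud case described above)."""
    tag_combinations = prod(measurement["tags"].values())
    field_factor = len(measurement["fields"]) if include_fields else 1
    return tag_combinations * field_factor


def bucket_cardinality(bucket, include_fields):
    """Sum the per-measurement bounds within one bucket."""
    return sum(measurement_cardinality(m, include_fields) for m in bucket.values())


def total_cardinality(all_buckets, include_fields):
    """System-wide bound: sum the cardinality of every bucket."""
    return sum(bucket_cardinality(b, include_fields) for b in all_buckets.values())


oss_total = total_cardinality(schema, include_fields=False)
cloud_total = total_cardinality(schema, include_fields=True)
print("OSS (fields excluded):", oss_total)
print("Cloud (fields included):", cloud_total)
```

With these made-up tag counts, only air_measurement differs between the two cases, because it is the only measurement with more than one field key.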

So this has all been about calculating the potential cardinality of your data, but if you’re looking for the actual cardinality in your database organization as a whole (across all buckets), you can use the following Flux query (with InfluxDB 2.4+ or InfluxDB Cloud):

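import "array"
import "influxdata/influxdb"

// Collect the ID of every bucket in the organization into an array
bucketList =
    buckets()
        |> findColumn(fn: (key) => true, column: "id")

// For each bucket ID, query its series cardinality over all time
cardinalities =
    array.map(
        arr: bucketList,
        fn: (x) => {
            // Extract the single cardinality value returned for this bucket
            cardinality =
                (influxdb.cardinality(bucketID: x, start: time(v: 0))
                    |> findColumn(fn: (key) => true, column: "_value"))[0]

            return {bucketID: x, _value: cardinality}
        },
    )

// Build a table from the per-bucket results and sum them
array.from(rows: cardinalities)
    |> sum()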