Performance Tuning for High Cardinality

I’m looking to use InfluxDB for a use case that will have very high cardinality. I don’t think there is any way around that in terms of schema design - the scenario calls for queries that look something like this (written as SQL because it’s what I’m familiar with):

SELECT CustomerId, SUM(TotalActivities)
FROM Table
WHERE ClientId = 'Some Client ID'
    AND Time >= '2020-12-01'
    AND Time <= '2020-12-31'
GROUP BY CustomerId

The database will have hundreds of millions of records, and there will be millions of unique CustomerId values and hundreds of ClientId values.

Initial experiments even with small amounts of data have shown that performance is unacceptable on this type of query with a few tens of thousands of CustomerId values when CustomerId is configured as a tag. If I configure CustomerId as a field and group by some other tag, performance is good, but that doesn’t meet my requirements - it just shows that high cardinality is the source of my performance issues.
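
For context, the two schemas I compared look something like this in line protocol (the customer value and timestamp here are hypothetical). The first line indexes CustomerId as a tag, so it contributes to series cardinality; the second stores it as an unindexed field:

AggregatedActivities,ClientId=some_guid,CustomerId=cust_123 TotalActivities=42i 1606780800000000000
AggregatedActivities,ClientId=some_guid TotalActivities=42i,CustomerId="cust_123" 1606780800000000000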

I’ve tested the same scenario in KDB+, which performs roughly 40x better on the same query when the schema is properly optimised, but I know KDB+ quite well and am completely new to Influx, so it might not be a fair comparison.

I’m using InfluxDB 2, latest version as of yesterday. We’re about to give up and look at alternative solutions, but before we do I wanted to ask whether we might have missed anything simple that could be configured to improve performance on this kind of query. Below is the query I wrote in Flux, which took about 20 seconds to execute with around 30,000 “points” in the database that resolve to something like 20,000 unique combinations of tags.

from(bucket: "activity")
  |> range(start: 2020-12-01, stop: 2020-12-31)
  |> filter(fn: (r) =>
    r._measurement == "AggregatedActivities" and
    r.ClientId == "some_guid" and
    (r._field =~ /Total/)  // match every "Total*" field
  )
  |> sum()  // one sum per series (unique tag set + field)
  |> group(columns: ["CustomerId"])  // regroup the per-series sums by customer
  |> pivot(rowKey: ["_start"], columnKey: ["_field"], valueColumn: "_value")  // one column per field
  |> yield()

This query yields the correct results, though I can’t attest to its efficiency.
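
For what it’s worth, the same query can also be written to group before aggregating, so the engine computes one sum per customer/field combination rather than one per series. This is just a sketch of that restructuring - I haven’t verified whether it is actually any faster:

from(bucket: "activity")
  |> range(start: 2020-12-01, stop: 2020-12-31)
  |> filter(fn: (r) =>
    r._measurement == "AggregatedActivities" and
    r.ClientId == "some_guid" and
    (r._field =~ /Total/)
  )
  |> group(columns: ["CustomerId", "_field", "_start"])  // keep _start in the group key so pivot can use it
  |> sum()  // a single aggregation per customer/field group
  |> group(columns: ["CustomerId"])
  |> pivot(rowKey: ["_start"], columnKey: ["_field"], valueColumn: "_value")
  |> yield()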

Note: I realise this scenario is not really using Influx for what it’s designed to be efficient at. Most of what we want to use it for is fairly typical time-series queries, but we also need to produce headline stats from our data, and if we can’t make this query perform acceptably there is no point in us continuing.

Hello @wwarby,
One way you could increase the efficiency of that query is to use a task to assign a new tag to those 20,000 unique tag combinations.
After doing that, you could quickly query that subset of data by filtering on the new tag that identifies it, as in the sketch below.
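
A rough sketch of what that task might look like (the task name, destination bucket, tag key, and schedule here are hypothetical, and you’d adjust the filter to select whichever subset you want to pre-tag):

option task = {name: "tag-activity-subset", every: 1h}

from(bucket: "activity")
  |> range(start: -task.every)
  |> filter(fn: (r) =>
    r._measurement == "AggregatedActivities" and
    r.ClientId == "some_guid"
  )
  |> set(key: "SubsetId", value: "client_some_guid")  // stamp every row with the new tag
  |> to(bucket: "activity_tagged")  // by default, to() writes string columns such as SubsetId as tags

Queries that need the headline stats can then filter on SubsetId instead of fanning out across the high-cardinality CustomerId tag.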

@wwarby Are you using InfluxDB 2.0 OSS or InfluxDB Cloud?

@wwarby Also, cardinality will be a thing of the past… I encourage you to take a look at the future of InfluxDB.

I’m using the open source version, with a view to licensing the product in the longer term once we get to the point that we really need HA/replication etc.

Ah, I see - that makes sense, but unfortunately the example I gave is just one of many; there would actually be hundreds of permutations of group/filter tag combinations in the real world, so I don’t think that approach will work for me. I had a read of the article you posted, which sounds really positive. I guess it’ll be quite a while before it’s generally available, but InfluxDB might be a better fit for my use case at that point.