I’m looking to use InfluxDB for a use case that will have very high cardinality. I don’t think there is any way around that in terms of schema design - the scenario calls for queries that look something like this (written a SQL because it’s what I’m familiar with):
SELECT SUM(TotalActivities)
FROM Table
WHERE ClientId = "Some Client ID"
AND Time >= "2020-12-01"
AND Time <= "2020-12-31"
GROUP BY CustomerId
The database will have hundreds of millions of records, and there will be millions of unique CustomerId values and hundreds of ClientId values.
Initial experiments even with small amounts of data have shown that performance is unacceptable on this type of query with a few tens of thousands of CustomerId values when CustomerId is configured as a tag. If I configure CustomerId as a field and group by some other tag, performance is good, but that doesn’t meet my requirements - it just shows that high cardinality is the source of my performance issues.
I’ve tested the same scenario in KDB+ which performs roughly 40x better for the same query when the schema is properly optimised, but I know KDB+ quite well and an completely new to Influx so it might not be a fair comparison.
I’m using InfluxDB 2, latest version as of yesterday. We’re about to give up and look at alternative solutions, but before I do I wanted to ask if we might have missed anything simple that could be configured to improve performance on this kind of query. Below is the query I wrote in flux, which took about 20 seconds to execute with around a 30,000 “points” in the database that would resolve to something like 20,000 unique combinations of tags.
from(bucket: "activity")
|> range(start: 2020-12-01, stop: 2020-12-31)
|> filter(fn: (r) =>
r._measurement == "AggregatedActivities" and
r.ClientId == "some_guid" and
(r._field =~ /Total/)
)
|> sum()
|> group(columns: ["CustomerId"])
|> pivot(rowKey:["_start"], columnKey: ["_field"], valueColumn: "_value")
|> yield()
This query yields the correct results though I can’t attest to it’s efficiency.
Note: I realise this scenario is not really using Influx for what it’s designed to be efficient at. Most of what we want to use it for actually is fairly typical time-series queries but we do also need to produce headline stats from our data and if we can’t make this query run performantly there is no point in us continuing.