Tags with high cardinality

There are a lot of claims that InfluxDB does not support tags with high cardinality, and AFAIK there is even an ongoing effort to rewrite the inverted index engine. But I cannot understand what the problem with high-cardinality tags is compared to low-cardinality tags.
This is how InfluxDB inverted indexes work from my point of view. For each tag we have a map from tag value to the list of document ids that have this tag value, for example {red -> [1,2], blue -> [3,4]}. Consider a low-cardinality tag with two possible values and a high-cardinality tag with 1 million possible values. For 1 million documents, the low-cardinality tag gives a map with just two keys, but each key maps to a list of 500K ids. For the high-cardinality tag, the map has 1 million keys, but each key maps to only one document id. From my perspective, these map sizes are comparable (of course the second one is a bit bigger, but it's not a show-stopper).
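To make the comparison concrete, here is a minimal sketch (my own illustration, not InfluxDB's actual code; the tag names and document counts are made up) of building both kinds of index and comparing their shapes:

```python
# Sketch: inverted index for a low- vs a high-cardinality tag.
from collections import defaultdict

def build_index(docs, tag):
    """Map each tag value to the list of document ids carrying it."""
    index = defaultdict(list)
    for doc_id, tags in docs.items():
        index[tags[tag]].append(doc_id)
    return dict(index)

# 100K documents: "color" has 2 possible values, "uid" is unique per document.
docs = {i: {"color": "red" if i % 2 else "blue",
            "uid": f"user-{i}"}
        for i in range(100_000)}

low = build_index(docs, "color")   # 2 keys, ~50K ids per posting list
high = build_index(docs, "uid")    # 100K keys, 1 id per posting list
```

The total number of posting-list entries is identical in both indexes; only the number of keys differs, which is what the question is getting at.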
I also heard that when estimating memory consumption you need to multiply all tag cardinalities. But I don't think this is the way InfluxDB works: it does not store all possible tag permutations; instead, it computes matching documents by using set operations on individual tag values.
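The set-operation idea I have in mind looks roughly like this (a hypothetical sketch with made-up tag values, not InfluxDB internals):

```python
# Sketch: answer "color=red AND region=eu" by intersecting the posting
# sets of the individual tag values, without enumerating permutations.
index = {
    ("color", "red"):  {1, 2, 5},
    ("color", "blue"): {3, 4},
    ("region", "eu"):  {2, 3, 5},
    ("region", "us"):  {1, 4},
}

def match(*predicates):
    """Intersect the id sets of every (tag, value) predicate."""
    return set.intersection(*(index[p] for p in predicates))

print(match(("color", "red"), ("region", "eu")))  # {2, 5}
```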
What are your thoughts?

@andrershov There are a couple of articles on how our storage engine works that should help you understand how this is implemented under the hood:

Before asking the question, I read both documents and watched Paul Dix's talk at Percona Live about the new indexing engine. In his talk, he describes the current indexing engine implementation, and it's what I've described in the question. However, it's still not clear what particular problem high-cardinality tags cause.

@andrershov High-cardinality tags cause memory exhaustion. While the database does not store all possible tag permutations (the potential series in a schema), it does store all tag permutations (the actual number of series) that are written to the database.
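In other words, series cardinality is the count of distinct tag sets actually written, not the product of per-tag cardinalities. A small sketch of the distinction (illustrative only, with made-up tag sets):

```python
# Series cardinality: distinct tag sets actually written vs. the
# theoretical product of per-tag cardinalities.
points = [
    {"host": "a", "region": "eu"},
    {"host": "a", "region": "eu"},   # same tag set: no new series
    {"host": "b", "region": "us"},
]

series = {tuple(sorted(p.items())) for p in points}
potential = 2 * 2  # 2 hosts x 2 regions = 4 possible combinations

print(len(series))  # 2 - only the combinations actually written
```

With a truly high-cardinality tag (e.g. a unique id per point), every write creates a new series, and it is that actual series count that exhausts memory.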

Yes, I have the same concern about why high-cardinality tags cause performance issues.
I understand that tags are there to ensure there are no duplicate data points. If someone tries to insert a data point that already exists with the same timestamp, tags help to determine whether the new data point is really a new record or a duplicate.

Suppose I want to query a measurement with sum(field_value) grouped by 30-minute intervals; do high-cardinality tags influence my query performance? Logically, your query engine should not even use tags, as they are not relevant to this query at all. So why are high-cardinality tags a problem?
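One possible answer (my own sketch of the general idea, not InfluxDB's actual code): data is stored per series, one column of points per distinct tag set, so even a query that groups only by time still has to visit and merge every series that overlaps the window:

```python
# Sketch: sum() over a time window when storage is organized per series.
# Each key is a (tag, value) series identifier; with 1M series this
# loop runs 1M times even though the query never mentions a tag.
series_data = {
    ("meter", "1"): [(0, 1.0), (1800, 2.0)],
    ("meter", "2"): [(0, 3.0)],
}

total = sum(value for points in series_data.values()
            for _ts, value in points)
print(total)  # 6.0
```

Under this model, higher series cardinality means more cursors to open and merge per query, regardless of whether tags appear in the WHERE clause.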

Hi Rama,
this concern is 2 years old, and a lot has changed since then. I don't think your example will suffer from high-cardinality tags.
Are you using the latest release?

Best regards,

Hi Marc,
Thank you for your reply.
I am using InfluxDB version 1.7.1.
I use Grafana to visualize the data in InfluxDB. I am querying just 1 month of data, and it took 9 min 55 sec.
InfluxDB is running on 16 cores with 128 GB RAM.
Following are some details.

time influx -database db1 -execute 'SELECT sum("value") FROM "kwh_received" WHERE time >= 1543593600000ms and time <= 1546272000000ms GROUP BY time(30m) fill(null)' > out.txt 2>&1

real    9m55.736s
user    0m0.347s
sys     0m0.087s

I think the following gives the number of data points for the timeframe of my query, which is 99.94 million.
Is this too high?

time influx -database db1 -execute 'SELECT count(*) from "kwh_received" WHERE time >= 1543593600000ms and time <= 1546272000000ms'
name: kwh_received
time                count_value
----                -----------
1543593600000000000 99941453

real    0m7.231s
user    0m0.307s
sys     0m0.050s

Hi Rama ,
what is your shard group duration ?
In the article TL;DR InfluxDB Tech Tips - Shard Group Duration Recommendations | InfluxData you can read:

We recommend configuring the shard group duration such that:

* it is two times your longest typical query’s time range
* each shard group has at least 100,000 [points](https://docs.influxdata.com/influxdb/v1.0/concepts/glossary/#point) per shard group
* each shard group has at least 1,000 points per [series](https://docs.influxdata.com/influxdb/v1.0/concepts/glossary/#series)

is your longest typical query’s time range 1 month ?
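The three rules quoted above can be checked mechanically; here is a small helper (the function name and thresholds are mine, with the numbers taken from the quoted recommendations):

```python
# Check a shard group duration against the three quoted rules of thumb.
def shard_duration_ok(duration_h, query_range_h, points, series):
    """All durations in hours; points/series counted per shard group."""
    return (duration_h >= 2 * query_range_h      # 2x the typical query range
            and points >= 100_000                # >= 100K points per group
            and points / series >= 1_000)        # >= 1K points per series

# E.g. a 60-day shard group, 30-day queries, ~100M points, 10K series:
print(shard_duration_ok(1440, 720, 99_941_453, 10_000))  # True
```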

It would be great if we could set up indexes on specific measurements, as we do in relational databases, and leave tags just for collision handling.
