We ended up with high series cardinality and realized that two particular tags were unnecessary, so we changed our ingest process to stop adding those tags.
Ironically, this greatly increased the series count, because the new data, written without those tags, forms a new set of series alongside the old ones. So in an attempt to reduce our cardinality, we inadvertently roughly doubled it.
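To make the effect concrete, here is a toy model in Python. It assumes only that a series is identified by its measurement plus its full tag set; the measurement name, tag names, and values are hypothetical and not taken from our real schema.

```python
# Toy model: a series is identified by its measurement plus its full tag set.
def series_key(measurement, tags):
    return (measurement, tuple(sorted(tags.items())))

hosts = ["a", "b"]
builds = ["b1", "b2"]  # hypothetical "unnecessary" tag values

# Old ingest: every point carried the extra tags.
old_series = {
    series_key("cpu", {"host": h, "build": b, "rack": "r1"})
    for h in hosts
    for b in builds
}

# New ingest: same measurement, extra tags no longer written.
new_series = {series_key("cpu", {"host": h}) for h in hosts}

# The old series remain in the index until their data is expired or
# rewritten, so the database holds the union of both sets.
combined = old_series | new_series

print(len(old_series), len(new_series), len(combined))
```

In this sketch the new tag-free series are the smaller set we wanted, but until the old data goes away the database carries both sets at once.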
Is there any simple solution to this problem? The only thing I’m aware of is that we could export all the old data, manipulate it externally to remove the tags, and reimport it, which sounds quite time-consuming and perhaps tricky to get right while keeping all the data queryable. Does anyone have other suggestions?
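For reference, the external manipulation step could look something like the sketch below. It assumes the export is in line protocol, that the two tags are called `build` and `rack` (hypothetical names), and that measurement names, tag keys, and tag values contain no escaped commas or spaces, so it is not a full line protocol parser.

```python
def strip_tags(line, unwanted):
    """Return a line protocol line with the tags named in `unwanted` removed.

    Simplification: assumes no escaped commas or spaces in the measurement,
    tag keys, or tag values.
    """
    # Everything before the first space is "measurement,tag=value,...".
    key, sep, rest = line.partition(" ")
    measurement, *tags = key.split(",")
    kept = [t for t in tags if t.split("=", 1)[0] not in unwanted]
    return ",".join([measurement, *kept]) + sep + rest


# Hypothetical exported lines: one old-style point, one new-style point.
exported = [
    "cpu,host=a,build=b1,rack=r1 usage=0.5 1600000000000000000",
    "cpu,host=a usage=0.7 1600000060000000000",
]

cleaned = [strip_tags(line, {"build", "rack"}) for line in exported]
for line in cleaned:
    print(line)
```

One thing to watch for: if two old points differed only in the removed tags and share a timestamp, they end up in the same series after cleaning, and only one value per field survives for that timestamp.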
Hello @ezquat,
Let me start by asking a basic question. Are you able to expire the old data at all? Once all of it has expired, the old series go away and you will see a decrease in cardinality.
Or, if you’re using 2.x, you can use a task. You could create a second bucket, start sending data there, and use a Flux script to write data from the old bucket to the new one, dropping those tags as it goes. Then you can expire data from the old bucket as you write it to the new one.
That said, depending on how much data you have, your suggestion might be the best option.
Anais, that definitely gives me some food for thought. I don’t know anything about 2.x, but would the same basic idea work in 1.x? I could imagine writing a query that uses “select … into” to transform the data into another retention policy. Then I could discard the original RP. But I’m not sure what the memory implications are. If I duplicate the data into another RP, does that effectively increase the total series cardinality of the whole database once again? Is this something that would differ between 1.x and 2.x?
Regarding expiring old data, that’s an interesting question. We want to keep all of our data in some form, since it is occasionally useful, but we rarely look at the older data. We can’t let our single InfluxDB instance grow forever. Twice in the past we have used this strategy: copy the InfluxDB data files to another place, then set a new, shorter retention policy on the live database. Both times that was an emergency reaction. It would be nice if the database offered some feature to roll data out to a cold-storage location. So far we have had some luck “rolling” data out by simply snapshotting or copying the files underneath InfluxDB, but we’re not sure whether that is a valid, supportable option. Any suggestions for this scenario?