We’re using Prometheus in Openshift to log infrastructure and application metrics, and sending that data out to our existing, external InfluxDB cluster. All good, and we can create some awesome dashboards, but we’re struggling with cardinality. I’ve set Prometheus relabelling up to drop some of our tags with particularly high uniqueness, and I’m looking at taking a few measurements out altogether - but we’re still struggling. We turned on kube-state-metrics about 2 weeks ago, and since then, cardinality has been steadily climbing.
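For anyone following along, here is a rough sketch of how one might check which tags are actually driving series cardinality before deciding what to drop in relabelling. It treats a series as a measurement name plus its full tag set, and counts how many unique series remain when a given tag is removed. The function names and the sample data are hypothetical, made up for illustration; in practice the input would come from an export of your own series.

```python
def series_cardinality(series, dropped_labels=()):
    """Count unique series after removing the given label names.

    `series` is an iterable of (measurement_name, labels_dict) pairs.
    """
    dropped = set(dropped_labels)
    unique = {
        (name, frozenset((k, v) for k, v in labels.items() if k not in dropped))
        for name, labels in series
    }
    return len(unique)


def savings_per_label(series):
    """For each label, report how many series disappear if that label is dropped."""
    series = list(series)
    baseline = series_cardinality(series)
    label_names = {k for _, labels in series for k in labels}
    return {
        label: baseline - series_cardinality(series, [label])
        for label in sorted(label_names)
    }


# Hypothetical sample data, not taken from a real cluster.
sample = [
    ("kube_pod_info", {"namespace": "shop", "pod": "web-7d9f-abcde", "node": "n1"}),
    ("kube_pod_info", {"namespace": "shop", "pod": "web-7d9f-fghij", "node": "n1"}),
    ("kube_pod_info", {"namespace": "shop", "pod": "web-7d9f-klmno", "node": "n2"}),
    ("kube_pod_info", {"namespace": "billing", "pod": "api-5c6b-pqrst", "node": "n2"}),
]

if __name__ == "__main__":
    print("baseline series:", series_cardinality(sample))
    print("series saved by dropping each label:", savings_per_label(sample))
```

Note that dropping a single tag only helps if the remaining tags stop distinguishing the series; when several tags are each unique per pod, they may need to be dropped together to see any reduction.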
We’ve recently altered the retention policy from 30 days to 7 days - intended as a temporary measure to get InfluxDB working again - but that doesn’t seem to have changed anything, and the cardinality is still increasing.
Can anyone give me any hints on how to manage this? It’s gone up from 9 million to 12.3 million in the last seven days. We’ve increased the memory to 64GB on each of the three hosts in the cluster, but performance of InfluxDB is being seriously compromised. What can we do?
More info (and a quick bump) - after the weekend and a day off ill yesterday, I’ve come back in to find that cardinality has not dropped at all. Any help at all in getting this sorted - at least so that Influx starts being responsive again - would be gratefully received…
Sorry, missed out some information. One of the three servers in the cluster has seen a drop in cardinality (from ~12 million to ~4 million). The others continue to grow - even more confused now.
If you are an Enterprise customer, you can reach out to support to dig deeper.
After altering the retention policies, did you also trigger a truncate shards command? I seem to recall that the change to the retention policies doesn’t happen until the shard rolls over…but you can force this to occur with the truncate command.
Apologies for piggybacking onto your original question. Our org also uses Openshift and is looking to set up an integration to expose the Prometheus metrics to Influx for long-term storage. Can you advise on the approach you took to get this working?
We are currently testing on OSS but will be looking to move to an Enterprise license early next year. We would like to investigate an integration and get ahead of any cardinality issues we may face when putting a production solution in place.