High cardinality & boolean values

Hi & thanks for making InfluxDB!

I have 500K objects, each with 15 boolean properties that are measured once per day (with timestamp T00:00:00Z). As output, I want:

  1. A graph of the count of true booleans, grouped by property, over time
  2. To query a single object/property on a specific date

I tried loading the object IDs as tags, but quickly ran out of memory.
I tried using the object IDs as measurement names, but apparently I cannot GROUP BY across multiple measurements.

Can I use InfluxDB for this data/queries?
Or should I wait for the new time series index in 1.3?
What is the best schema design here?

It seems there is a big optimization possible: the booleans don’t change much, so theoretically I would only have to store them when they change. Not sure how to implement this in InfluxDB though.
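A minimal sketch of the "store only on change" idea, assuming yesterday's and today's snapshots are available as plain Python dicts. The function name `changed_lines`, the measurement name `site_props`, and the `site` tag are hypothetical; how the object is actually keyed (tag or measurement name) is exactly the open schema question here. The output is InfluxDB line protocol text.

```python
def changed_lines(previous, current, timestamp_ns, measurement="site_props"):
    """Return line protocol lines only for objects whose booleans changed.

    previous/current map an object id to {property_name: bool}.
    Assumes ids and property names contain no spaces, commas or '='
    (those would need escaping in line protocol).
    """
    lines = []
    for obj_id, props in current.items():
        old = previous.get(obj_id, {})
        # Keep only the properties that are new or differ from yesterday.
        changed = {k: v for k, v in props.items() if old.get(k) != v}
        if not changed:
            continue  # nothing changed: write no point at all
        fields = ",".join(
            f"{k}={'true' if v else 'false'}" for k, v in sorted(changed.items())
        )
        lines.append(f"{measurement},site={obj_id} {fields} {timestamp_ns}")
    return lines


yesterday = {"cnn.com": {"ssl": True, "cdn": False}}
today = {"cnn.com": {"ssl": True, "cdn": True}, "bbc.co.uk": {"ssl": False}}
for line in changed_lines(yesterday, today, 1488326400000000000):
    print(line)
```

The trade-off is on the read side: a lookup for a specific date then has to find the most recent point at or before that date, rather than a point exactly on it.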

Hey @gwillem, not quite sure what you mean about ‘500k’ objects. Can you give an example of what your data looks like?

The primary thing re: cardinality is the number of unique combinations of measurement + tag set + field key (field values do not count). This is a very good rundown of how the count works, and provided you keep your cardinality as low as possible you should be able to write/read all of your data without problems!

Also, we’ve had a few discussions on this here in the community so maybe some of this will help too?

Thanks a lot for your reply!

Objects: I’m monitoring 500K websites for certain properties, such as “SSL enabled”.

I want to store these properties as tags, because I want to run queries on them for graphs (“how many sites have SSL installed over time”). That produces a theoretical upper bound on series cardinality of 500K * 2^15 ≈ 16.4 billion. Hmm.
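For reference, the arithmetic behind that figure, as a small sketch. The function name is hypothetical; the only inputs are the numbers from this thread. It also shows the count when only the site is a tag and the booleans are stored as fields, since field values do not create new series.

```python
SITES = 500_000
BOOL_PROPERTIES = 15


def worst_case_series(sites, bool_tags):
    """Upper bound on series if every site is a tag value and each of the
    bool_tags boolean tags can independently be true or false."""
    return sites * 2 ** bool_tags


# Site id and all 15 booleans stored as tags.
print(worst_case_series(SITES, BOOL_PROPERTIES))  # 16384000000

# Site id as the only tag, booleans stored as fields instead.
print(worst_case_series(SITES, 0))  # 500000
```

The first number is a worst case; the real count depends on how many distinct boolean combinations each site actually goes through over time.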

Perhaps I’m better off storing the aggregates separately (calculated per day) and storing the bools per site in a measurement named after the site? E.g.:

per_site.cnn_com ssl=true,cdn=false
site.totals ssl=400000,cdn=300000

I think you just need to restructure your data in a way that limits the overall number of series, and of course testing is going to be essential.

The influx_inspect report -detailed /path/to/shard/num command is your friend here; it will give you good insight into the overall breakdown of your data.

I would write something as follows and see what it looks like. It’s hard to fully calculate ahead of time, so it’s worth trying it out first and iterating (premature optimization and all):

site,ssl=true,cdn=false total_ssl=400000,total_cdn=300000 <timestamp>
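The per-day totals discussed in this thread have to be computed before they are written. A minimal sketch of that step, assuming one day's data is available as a dict of site name to booleans; `daily_totals_line` and the measurement name `site_totals` are hypothetical, and the result is a single line protocol point with integer fields (the trailing `i` marks an integer, otherwise the value is stored as a float).

```python
from collections import Counter


def daily_totals_line(site_props, timestamp_ns, measurement="site_totals"):
    """Count, per property, how many sites have it set to true and return
    one line protocol point holding all counts as integer fields."""
    totals = Counter()
    for props in site_props.values():
        for name, value in props.items():
            # Adding 0 still creates the key, so all-false properties
            # show up with a count of 0.
            totals[name] += 1 if value else 0
    if not totals:
        raise ValueError("no properties to aggregate")
    fields = ",".join(f"{name}={count}i" for name, count in sorted(totals.items()))
    return f"{measurement} {fields} {timestamp_ns}"


day = {
    "cnn.com": {"ssl": True, "cdn": False},
    "bbc.co.uk": {"ssl": True, "cdn": True},
}
print(daily_totals_line(day, 1488326400000000000))
```

Because the point carries no tags, this adds only one series per property no matter how many sites are monitored.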