Hi @Danielle_Paquette-Ha the title of your question got me curious :). That does indeed sound like a lot of potential series.
To get some more clarity on the question: do you actually expect the number of unique tag combinations to be that large in practice? I.e. would each of the 332 clients be served by each of the 45k drivers, each of whom would on different occasions use any of the 28k vehicles? Or would a given client mostly be served by a handful of drivers, each of whom only drives 1-2 different vehicles?
What I’m trying to highlight is that the cardinality you have to deal with isn’t the total number of hypothetical combinations that could exist, but the actual number of combinations that do exist in your dataset.
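To make that distinction concrete, here's a quick Python sketch (all the client/driver/vehicle names here are made up):

```python
# Series cardinality is the number of tag combinations that actually
# occur in the data, not the product of the individual tag counts.
rows = [
    ("client_A", "driver_1", "vehicle_X"),
    ("client_A", "driver_1", "vehicle_Y"),
    ("client_A", "driver_2", "vehicle_X"),
    ("client_B", "driver_3", "vehicle_Z"),
    ("client_B", "driver_3", "vehicle_Z"),  # repeated point: same series
]

# Hypothetical worst case: every value of every tag combined with every other.
theoretical = (
    len({c for c, _, _ in rows})
    * len({d for _, d, _ in rows})
    * len({v for _, _, v in rows})
)

# What the database actually has to index: distinct combinations.
actual = len(set(rows))

print(theoretical)  # 2 clients * 3 drivers * 3 vehicles = 18
print(actual)       # only 4 combinations really exist
```

In your case the "theoretical" number is the scary 332 × 45k × 28k figure, but the "actual" number is what determines how InfluxDB behaves.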
Hopefully that helps steer things in the right direction. If you do genuinely have such a large cardinality based on the above, then maybe it’s worth introducing some other index, such as “order ID”, and using that as a tag, while the other metrics are fields.
On the specific question of whether splitting into buckets would help, I’m not sure but I suspect someone else here would be.
Each of the 332 clients would be served by a limited set of drivers (anywhere from a dozen up to maybe 4k for one client). And each of those drivers would normally drive 1 or 2 vehicles.
Each client has its own set of vehicles: up to 2200 as of today, but we need to support up to 7000 vehicles per client.
So I guess I didn’t understand cardinality correctly.
I guess I still get a pretty high cardinality. Do you think InfluxDB will be able to perform well if I structure my data with ClientId and Vehicle as Tag Keys?
I’ve been reading and experimenting with InfluxDB for the past week or so, so I’m pretty new to it. And this project is pretty big and has a very high impact on our product.
@Pooh Oh yeah, I forgot to mention: all this data has a timestamp. We collect data (100 channels) from the vehicles at any given timestamp.
For example, we can collect the engine speed up to 2-3 times a second from each vehicle. And we also collect other measurements from the vehicle once a second or more frequently (100 different measurements).
Each of those measurements is linked to a vehicle that a driver was driving for a client.
I have years of data of this type.
Currently we store this info in SQL Server, but it's not efficient and we cannot query it the way we would like. That's why we want to switch to a time-series database.
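To give a rough sense of the volume, here's a quick back-of-envelope in Python (assuming our current ~2200 vehicles and 100 channels at roughly 1 Hz each; some channels like engine speed run faster, so this is a lower bound):

```python
# Back-of-envelope estimate of daily field values, based on the
# collection rates described above. All rates are approximations.
vehicles = 2_200
channels = 100
samples_per_sec = 1          # engine speed is actually 2-3 Hz
seconds_per_day = 86_400

field_values_per_day = vehicles * channels * samples_per_sec * seconds_per_day
print(field_values_per_day)  # 19_008_000_000 field values per day at 1 Hz
```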
Right, thanks for clarifying @Danielle_Paquette-Ha. I certainly think that you’re approaching the question in the right way, and have the correct angle as to what should or should not be a tag.
Conceptually, I would probably go for the following:
- All "discrete" metrics such as ClientId, Driver, and Vehicle as tags
- Each of the Channels as a field, in which continuous (or similar numerical/boolean) values are stored
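As a sketch of what one written point could look like with that schema, here's InfluxDB line protocol built by hand in Python (the measurement name, channel names, and values are placeholders I made up, not your actual schema):

```python
# One point: discrete identifiers as tags, channel readings as fields.
# Format: measurement,tag=val,... field=val,... timestamp
tags = {"ClientId": "332", "Driver": "d-1042", "Vehicle": "v-7788"}
fields = {"engine_speed": 2410.0, "fuel_level": 0.62, "coolant_temp": 88.5}
timestamp_ns = 1_700_000_000_000_000_000  # nanosecond precision

tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
line = f"telemetry,{tag_str} {field_str} {timestamp_ns}"

print(line)
# telemetry,ClientId=332,Driver=d-1042,Vehicle=v-7788 coolant_temp=88.5,engine_speed=2410.0,fuel_level=0.62 1700000000000000000
```

The key property is that every distinct (ClientId, Driver, Vehicle) tag set creates one series, while adding more channels as fields costs nothing in cardinality.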
I think this should work as long as the total number of unique ClientId-Driver-Vehicle combinations that actually exist in the dataset doesn't get extremely high. I'm not sure where the real threshold is at which you'd see a meaningful performance impact, though. I do know that InfluxDB 2.x is able to handle much higher cardinality than 1.x; I'm sure there's a post about that somewhere that I can't dig up right now.
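As a quick sanity check, here's a back-of-envelope on your actual cardinality using the numbers from this thread (the per-driver figures are my assumptions, so adjust them if drivers roam more widely):

```python
# Rough upper bound on actual series cardinality.
# Assumptions (not confirmed): each driver serves at most ~2 clients
# and, as stated above, normally drives 1 or 2 vehicles.
drivers = 45_000
vehicles_per_driver = 2
clients_per_driver = 2

series_upper_bound = drivers * vehicles_per_driver * clients_per_driver
print(series_upper_bound)  # 180000 -- far below the 332 * 45k * 28k worst case
```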
So yeah, I’d be curious to hear from others on this forum who may be closer to this, as to what counts as “too high” these days - whether it’s 1M or 100M or 10B.