Optimize InfluxDB for 2.8 billion series and more

Danielle_Paquette-Ha · April 26, 2021, 5:21pm

We’re looking to migrate some data into InfluxDB. I’m working with InfluxDB 2.0 on a test server to determine the best way to stock our data.

As of today, I have about 2.7 billion series to migrate to InfluxDB but that number will only go up.

Here is the structure of the data I need to stock:

ClientId (332 values as of today, string of 7 characters)
Driver (int, 45k values as of today, will increase)
Vehicle (int, 28k values as of today, will increase)
Channel (100 values, should not increase, string of 40 characters)
value of the channel (float, 1 value per channel/vehicle/driver/client at a given timestamp)

At first, I thought of stocking my data this way:

One bucket (as all data have the same data retention)
Measurements = channels (so 100 kind of measurements are stocked)
Tag Keys = ClientId
Fields = Driver, Vehicle, Value of channel

This gave me a cardinality of 1 * 100 * 332 * 3 = 99 600 according to this article

But then I realized that InfluxDB handle duplicate based on “measurement name, tag set, and timestamp”.

So for my data, this will not work, as I need the duplicate to be based on ClientId, Channel, Vehicle at the minimum.

But if I change my data structure to be stored this way:

One bucket (as all data have the same data retention)
Measurements = channels (so 100 kind of measurements are stocked)
Tag Keys = ClientId, Vehicle
Fields = Driver, Value of channel

then I’ll get a cardinality of 2 788 800 000.

I understand that I need to keep cardinality as low as possible. (And ideally I would even need to be able to search by driver as well as by vehicle.)

My questions are:

If I split the data into different buckets (ex: 1 bucket per clientId), will it decrease my cardinality?
What would be the best way to stock data for such a large amount of series?

svet · April 26, 2021, 9:57pm

Hi @Danielle_Paquette-Ha the title of your question got me curious :). That does indeed sound like a lot of potential series.

To get some more clarity on the question, do you actually expect the number of unique combinations of different tags to actually be that large in practice? I.e. would you expect that all of the 332 clients would be served by each of the 45k drivers, each of whom will on different occasions use one of the 28k vehicles? Or would a given client be mostly served by a handful of drivers, each of whom would only drive 1-2 different vehicles?

What I’m trying to highlight is that the cardinality you have to deal with isn’t the total number of hypothetical combinations that could exist, but the actual number of combinations that do exist in your dataset.

Hopefully that helps steer things in the right direction. If you do genuinely have such a large cardinality based on the above, then maybe it’s worth introducing some other index, such as “order ID”, and using that as a tag, while the other metrics are fields.

On the specific question of whether splitting into buckets would help, I’m not sure but I suspect someone else here would be.

Danielle_Paquette-Ha · April 27, 2021, 12:28pm

Hi @svet ,
Thank you so much for your answer.

Each of the 332 clients would be served by a handful of drivers (could be anywhere between a dozen, to maybe up to 4k for one client). And each of this driver would normally drive 1 or 2 vehicles.
Each client would have a handful of vehicles, up to 2200 as of today but we need to support up to 7000 vehicles per client.

So I guess I didn’t understand cardinality correctly.

I guess I still get a pretty high cardinality. Do you think InfluxDB will be able to perform well if I structure my data with ClientId and Vehicle as Tag Keys?

I’ve been reading and experimenting with InfluxDB for the past week or so, so I’m pretty new to it. And this project is pretty big and has a very high impact on our product.

Pooh · April 27, 2021, 12:45pm

I’ve taken a step back and have a different question regarding using InfluxDB
for this type of data.

Where does the time series come in?

InfluxDB is specifically a time series database, where the primary index is a
timestamp.

Other, SQL-based, databases are relational, meaning you can use separate
tables for clients, drivers, vehicles, with relations set between the tables.

I’m just wondering why a time series database system is the right choice for
the data you’re collecting.

Regards,

Antony.

Danielle_Paquette-Ha · April 27, 2021, 1:01pm

@Pooh Oh yeah I forgot to mention, all this data has a timestamp. We collect data (100 Channel) from the vehicles at any given timestamp.

For example, we can collect the engine speed up to 2-3 times a second from each vehicle. And then we also collect other measurements from the vehicle once a second or more frequent. (100 different measurements)

Each of those measurements are linked to a vehicle, that a driver was driving for a client.

I have years of data of this type.

Currently we are stocking this info in SQL Server, but it’s not efficient and we cannot query it as we would like. That’s why we want to switch to a timeseries database.

Sorry I should have mention it.

svet · April 27, 2021, 3:10pm

Right, thanks for clarifying @Danielle_Paquette-Ha. I certainly think that you’re approaching the question in the right way, and have the correct angle as to what should or should not be a tag.

Conceptually, I would probably go for the following:

All “discrete” metrics such as ClientId, Driver, and Vehicle as tags
Each of the Channels as a field, in which continuous (or similar numerical/boolean) values are stored

I think this should work as long as the total number of unique ClientId-Driver-Vehicle combinations that exist in the dataset doesn’t go WAY high. But I’m not sure of what the actual limit is where you’ll experience a meaningful performance impact. I do know that Influx 2 is able to handle much higher cardinality than 1.x; I’m sure there’s a post about that somewhere that I can’t dig up right now.

So yeah, I’d be curious to hear from others on this forum who may be closer to this, as to what counts as “too high” these days - whether it’s 1M or 100M or 10B.

Danielle_Paquette-Ha · April 27, 2021, 3:15pm

Thank you so much for your answer @svet

That’s what I had in mind but I wanted the opinion of someone who’s more experienced at InfluxDB than me

I’m working on a test server but I’ll keep you updated when I get to the final version in production.

Topic		Replies	Views
Cardinality and Data Series Store influxdb	3	540	November 12, 2019
Series cardinality calculation	5	2864	October 21, 2022
Sensor Data - Series Cardinality Store	6	1633	November 7, 2019
Cardinality and system performance InfluxDB 2 influxdb	5	3023	September 22, 2021
InfluxDB users, what is your series cardinality?	6	7658	June 4, 2018

Optimize InfluxDB for 2.8 billion series and more

Related topics