Cardinality and system performance

We are running InfluxDB 2.0 OSS on Ubuntu Linux with 16 GB of RAM. We would like to write about ten fields of data from 100,000+ devices each minute. Each device is defined by a unique tag set, so there would be 1,000,000+ series, if we count each field as a separate series. In tests I ran about a year ago, the entire system would crash if I wrote too many series to the database. In a later release, maybe 2.0, only InfluxDB would shut down and maybe reset itself. Recently, I observed unusual patterns of CPU, disk, and memory usage. We’ve considered ways to reduce the number of series so cardinality won’t be a problem, but we still have some questions.
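
For context, each device sends one line of line protocol per minute, roughly like the sketch below (the measurement, tag, and field names are made up for illustration and are not our actual schema); with 10 fields per line, each unique device tag set contributes 10 field-level series.

influx write --bucket devices --precision s \
  'device_metrics,device_id=dev-000001 temp=21.3,humidity=40.1,voltage=3.7,current=0.12,rssi=-71i,uptime=86400i,cpu=0.42,mem=0.63,disk=0.18,status=1i 1609459200'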

  1. How can I determine a safe number of series to write to InfluxDB without crashing it or the system it runs on?
  2. Does the schema affect ingest capacity, or does capacity depend solely on the number of series? I assume that each field counts as a separate series.
  3. Do inactive series count towards the cardinality limit that InfluxDB can handle? Suppose, for example, that I have a seven-day retention policy. On the first day, 100,000 devices write data (10 fields per device, so 1,000,000 series). On the second day, 50,000 of the original devices no longer write any data, but 50,000 new devices (with tag sets distinct from the original devices) write data. Should this be considered a case with 1,500,000 series?
  4. What can I expect to happen if the cardinality is too high?

In general, it’s been difficult to observe predictable behaviour when dealing with so many series. In a recent test, a series was defined by three tags, with ten values for the first tag, ten for the second, and 1000 for the third, giving 100,000 tag sets; at 10 fields per tag set, that is 1,000,000 series. For the first couple of days, regular patterns of system metrics were observed. Then CPU usage, disk writes, and memory usage gradually increased for a couple of days. Finally, all performance metrics returned to the original patterns and remained that way for a couple of days until the test was stopped. I’m attaching a picture of the observations. In a test with double the number of series, I immediately saw a steady increase in the CPU time dedicated to I/O, which had reached about 60% by the time I stopped the test. System performance didn’t return to normal until I deleted the bucket a couple of days later. While this makes it clear to me that InfluxDB can’t handle more than 1,000,000 series, it doesn’t help me determine what a safe number of series is or a safe way to introduce new series over time.


Hello @simon38,
I’ll try to answer these questions as best as I can. I’ll also try to loop in someone from the storage team.

  1. I would look at the following hardware guidelines:
    Hardware sizing guidelines | InfluxDB OSS 1.8 Documentation
    (I know they’re for 1.x but they should apply to 2.x OSS as well)

  2. Schema shouldn’t affect ingest performance. A series is a unique combination of measurement, tag set, and field key; series cardinality is the number of such combinations. See:
    Glossary | InfluxDB OSS 2.0 Documentation
    A single field value written at a point in time is a point, and a new point isn’t necessarily a new series.

  3. Series that have expired or been deleted do not count towards total series cardinality. So on the second day, the 50,000 original devices that stopped writing still count, because their data is still within the 7-day retention period; together with the 50,000 new devices, you would temporarily be at roughly 1,500,000 series (counting fields). If those original devices never write again, their data will age out after 7 days, their series will be dropped from the index, and the cardinality will fall back to the level of the 100,000 devices that are still writing. You can check the actual count with the query sketch after this list.

  4. If cardinality is too high, you’ll likely notice that reads, and sometimes writes, get slower and slower.
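
If you want to see what InfluxDB is actually counting, you can query a bucket's series cardinality directly; here's a rough sketch (substitute your own bucket name and time range):

influx query '
import "influxdata/influxdb"

influxdb.cardinality(bucket: "devices", start: -7d)
'

It returns a single value: the number of unique series stored for that bucket over the queried range.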


Hi @simon38,

What is your InfluxDB write rate in points per second?
How many CPU cores?
What are the max IOPS and MB/s limits of the server's storage?
Any swapping?
What is the database shard duration?
During the test, do you run any queries?

A cardinality of 1 million series is not a limit (I was able to run a database with 10 million series on InfluxDB OSS 1.8.6 reliably, though on more powerful hardware).

Unfortunately, your graphs are not detailed enough, but it seems that your server's disk performance is insufficient. There were no disk_read graphs, but I assume those would show high values too.

A cardinality of 1 million should not affect write performance, but it will definitely use more memory and will make complex queries slower.
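
If you want to watch that memory growth over time, checking the influxd process itself is usually enough, e.g.:

ps -C influxd -o pid,rss,vsz,cmd     # resident (RSS) and virtual memory of the influxd process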

InfluxDB does a lot of disk I/O in the background: not only storing raw data points, but also merging WAL files, running four levels of shard compactions, and updating indexes. The more shards you have, the more compaction jobs get scheduled.
So even if you stop writing new data, InfluxDB may still be busy finishing internal compactions.
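
If you want to check whether those background compactions are still running after you stop writing, InfluxDB 2.x OSS exposes internal counters on its Prometheus /metrics endpoint; something like this should surface them (exact metric names vary between versions):

curl -s http://localhost:8086/metrics | grep -i compact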


Thank you. You answered my question and gave me the resources I was looking for. I’m still trying to make sense of the system performance metrics, because I want to recognize the signs that the system is in trouble, but the main question was about the bounds on cardinality.

Thank you, these are good questions, and some of the other people I spoke to raised them as well. I don’t really know the answers to all of them. Do you know how I can find out?

The write rate is 100,000 lines of InfluxDB line protocol per minute, written on the minute, with 10 fields per line, in batches of 10,000. So depending on how you count points, it is either 100,000 or 1,000,000 points per minute, which corresponds to a rate of at most 17,000 points per second.
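
The batching is roughly equivalent to the following sketch (file and bucket names are placeholders for illustration):

split -l 10000 minute_data.lp batch_          # 100,000 lines of line protocol -> 10 files of 10,000 lines
for f in batch_*; do
  influx write --bucket devices --file "$f"   # write each 10,000-line batch
done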

I’m working remotely, so I can’t physically inspect the machine. What I found out from the lshw command in Linux is the following:

CPU: Intel Xeon CPU E3-1240 v6 @3.70GHz
Storage: SCSI 999 GB DELL PERC H330 Adp

What I understand is that the Xeon has 4 cores, and the storage, it seems, is a hard drive. From Dell’s website (List of PowerEdge RAID Controller (PERC) types for Dell EMC systems | Dell US):
PERC H330 Adapter 12Gb/s SAS 6Gb/s SATA PCI-Express 3.0 16 RAID, 32 Non-RAID Hardware RAID

I don’t know the max IOPS or MB/s storage limits, nor what you mean by swapping beyond what I showed in the figure. Maybe you can help me find that information.

The shard duration is 1 day, because the retention policy itself is 7 days.

I sporadically run queries to make sure the data is being written to the database.

Do you have an idea of what we could change in the system specs to increase the amount of data we can write? What else should I monitor, and how should I interpret the data?

Either way, this input was useful because it at least supports what other people said to me.

If you have root access, check iostat output to see storage performance.
Look at the r/s, w/s, rMB/s, wMB/s, and %util fields: they will tell you your read/write IOPS, the bandwidth used, and utilization.

%util - if it is close to 100% (I think it is in your case) then your storage is not fast enough to handle the load.

Example:

# iostat -xm
Linux 4.4.0-1104-aws (influxdb-node1.private) 	09/20/2021 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          17.68    0.00    1.31    4.51    0.00   76.50

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
loop0             0.00     0.00    0.60    0.00     0.00     0.00     2.80     0.02   29.75   29.75    0.00   0.92   0.06
loop1             0.00     0.00    0.83    0.00     0.00     0.00     2.54     0.02   23.75   23.75    0.00   0.76   0.06
loop2             0.00     0.00    0.92    0.00     0.00     0.00     2.57     0.01   15.33   15.33    0.00   0.50   0.05
loop3             0.00     0.00    0.00    0.00     0.00     0.00     3.20     0.00    0.00    0.00    0.00   0.00   0.00
nvme3n1         940.38  1244.00  516.46  135.92     5.69     5.39    34.79     0.36    0.55    0.42    1.04   0.31  20.46
nvme7n1         940.42  1244.19  516.47  136.13     5.69     5.39    34.78     0.36    0.55    0.43    1.02   0.32  20.62
nvme2n1           6.26    15.32    6.91    3.87     0.06     0.08    26.82     0.01    0.64    0.35    1.15   0.37   0.40
nvme5n1         940.58  1244.26  517.03  136.27     5.69     5.39    34.75     0.05    0.08    1.03    0.92   0.32  20.59
nvme6n1         940.30  1243.82  516.38  135.78     5.69     5.39    34.79     0.34    0.53    0.42    0.95   0.30  19.85
nvme1n1         940.24  1243.60  516.31  135.66     5.69     5.39    34.80     0.36    0.55    0.41    1.06   0.32  20.62
nvme4n1         940.19  1243.75  516.34  135.77     5.69     5.39    34.79     0.34    0.52    0.41    0.93   0.30  19.49
nvme9n1         940.44  1244.12  516.49  135.96     5.69     5.39    34.79     0.34    0.53    0.42    0.93   0.30  19.61
nvme8n1         940.38  1243.90  516.40  135.81     5.69     5.39    34.79     0.36    0.55    0.44    0.96   0.31  20.15
nvme0n1           0.00     0.55   21.83    1.04     0.35     0.02    32.93     0.32   14.00   13.14   31.94   2.60   5.96

Here you see roughly 516 read and 135 write IOPS on each of the NVMe devices.
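
To watch this live while your writer is running (a bare iostat only reports averages since boot), sample at an interval, and check swap activity as well, for example:

iostat -xm 5      # extended device stats in MB, refreshed every 5 seconds
vmstat 5          # si/so columns show memory swapped in/out per second
free -h           # quick look at total and used swap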