We are running InfluxDB 2.0 OSS on Ubuntu Linux with 16 GB of RAM. We would like to write about ten fields of data from 100,000+ devices each minute. Each device is defined by a unique tag set, so there would be 1,000,000+ series if each field counts as a separate series. In tests I ran about a year ago, the entire system would crash if I wrote too many series to the database. In a later release, maybe 2.0, only InfluxDB would shut down and maybe reset itself. Recently, I observed unusual patterns of CPU, disk, and memory usage. We've considered ways to reduce the number of series so that cardinality won't be a problem, but we still have some questions:
- How can I determine what is a safe number of series to write to InfluxDB without it or the system it is running on crashing?
- Does the schema affect ingest capacity, or does it solely depend on the number of series, regardless of the schema? I assume that each field counts as a series.
- Do inactive series count toward the cardinality limit that InfluxDB can handle? For example, suppose I have a seven-day retention policy. On the first day, 100,000 devices write data (10 fields per device, and thus 1,000,000 series). On the second day, 50,000 of the original devices no longer write any data, but 50,000 new devices (with tag sets distinct from the original devices) write data. Should this be considered a case with 1,500,000 series?
- What can I expect to happen if the cardinality is too high?
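To make the inactive-series question concrete, here is a minimal sketch of the counting it implies. It assumes, and this is exactly the open question, that a series keeps counting toward cardinality until all of its data has aged out of the retention window, even after the device stops writing. Device IDs stand in for distinct tag sets, the start date is hypothetical, and `series_in_window` is an illustrative helper, not an InfluxDB API. The model works in whole days; in practice InfluxDB drops expired data a whole shard group at a time, so real eviction is coarser than this.

```python
from datetime import date, timedelta

FIELDS_PER_DEVICE = 10  # from the post: ten fields per device

def series_in_window(daily_devices, retention_days, today):
    """Count distinct series with at least one day of data still retained.

    ASSUMPTION: a series keeps counting toward cardinality until all of its
    data has aged out of the retention window, even if it is no longer written.
    Each device is assumed to write every field.
    """
    cutoff = today - timedelta(days=retention_days - 1)
    retained = set()
    for day, devices in daily_devices.items():
        if cutoff <= day <= today:
            retained |= devices  # union of distinct tag sets (device IDs)
    return len(retained) * FIELDS_PER_DEVICE

# Scenario from the question; device IDs stand in for distinct tag sets.
day1 = date(2021, 1, 1)  # hypothetical start date
day2 = day1 + timedelta(days=1)
daily = {
    day1: set(range(0, 100_000)),       # 100,000 original devices
    day2: set(range(50_000, 150_000)),  # 50,000 originals + 50,000 new devices
}

print(series_in_window(daily, 7, day2))                      # 1500000
print(series_in_window(daily, 7, day1 + timedelta(days=7)))  # 1000000
```

Under this assumption the scenario counts as 1,500,000 series until the first day's data expires, after which it falls back to 1,000,000.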
In general, it's been difficult to observe predictable behaviour when dealing with so many series. In a recent test, a series was defined by three tags, with ten values for the first tag, ten for the second, and 1,000 for the third, for a total of 100,000 tag sets; with 10 fields per tag set, that gives 1,000,000 series. For the first couple of days, the system metrics followed regular patterns. Then CPU usage, disk writes, and memory usage gradually increased over a couple of days. Finally, all performance metrics returned to their original patterns and stayed that way for a couple of days until the test was stopped. I'm attaching a picture of the observations.
In a test with double the number of series (2,000,000), CPU time dedicated to I/O started increasing steadily right away and had reached about 60% when I stopped the test. System performance didn't return to normal until I deleted the bucket a couple of days later. This makes it clear to me that InfluxDB can't handle more than 1,000,000 series on this system, but it doesn't tell me what a safe number of series is or how to introduce new series safely over time.
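The series count in the first test is the number of values of each tag multiplied together, times the number of fields: 10 × 10 × 1,000 × 10 = 1,000,000. The sketch below makes that arithmetic explicit by generating one readable key per series (measurement, tag set, field key), matching the question's assumption that each field counts as a separate series. The measurement, tag, and field names are hypothetical placeholders, since the post only gives how many values each tag has, and the key format is illustrative rather than InfluxDB's internal encoding.

```python
from itertools import islice, product
from math import prod

# Hypothetical names; the post only gives how many values each tag has.
MEASUREMENT = "test"
TAG_VALUES = {
    "tag_a": [f"a{i}" for i in range(10)],
    "tag_b": [f"b{i}" for i in range(10)],
    "tag_c": [f"c{i}" for i in range(1000)],
}
FIELDS = [f"field{i}" for i in range(10)]

def series_keys():
    """Yield one readable key per series: measurement, tag set, field key."""
    names = sorted(TAG_VALUES)
    for values in product(*(TAG_VALUES[name] for name in names)):
        tag_set = ",".join(f"{n}={v}" for n, v in zip(names, values))
        for field in FIELDS:
            yield f"{MEASUREMENT},{tag_set} {field}"

def expected_series(scale=1):
    """Tag sets times fields; scale=2 gives the doubled test's count."""
    tag_sets = prod(len(values) for values in TAG_VALUES.values())
    return scale * tag_sets * len(FIELDS)

print(expected_series())   # 1000000
print(expected_series(2))  # 2000000
print(list(islice(series_keys(), 2)))
```

Counting the generated keys, rather than trusting the product alone, is a cheap way to check a proposed schema before writing it: any extra tag multiplies the total by its number of distinct values.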