Path to 1 Billion Time Series: InfluxDB High Cardinality Indexing Ready for Testing

Originally published at: Path to 1 Billion Time Series: InfluxDB High Cardinality Indexing Ready for Testing | InfluxData

One of the long-standing requests we've had for InfluxDB is to support a large number of time series: that is, a very high cardinality in the number of unique time series that the database stores. While we currently have customers with tens of millions of time series, we're looking to expand to hundreds of millions and eventually billions. Today we've released the first alpha build of our new time series indexing engine for testing. With this new engine, users should be able to have millions of unique time series (our target goal when this work is done is 1 billion). The number of series should be unbounded by the amount of memory on the server hardware. Further, the number of series that exist in the database will have a negligible impact on database startup time. This work has been in the making since last August and represents the most significant technical advancement in the database since we released the Time Series Merge Tree storage engine last year. Read on for details on how to enable the new index engine and the kinds of problems this opens up InfluxDB to solve.

Before we get into the details we'll need to cover a little bit of background. InfluxDB actually looks like two databases in one: a time series data store and an inverted index for the measurement, tag, and field metadata. The TSM engine that we built in 2015 and 2016 was an effort to solve the first part of this problem: getting maximum throughput, compression, and query speed for the raw time series data. Up until now the inverted index was an in-memory data structure built during database startup from the data in TSM. This meant that for every measurement, tag key/value pair, and field name there was a lookup table in memory to map those bits of metadata to the underlying time series. Users with a high number of ephemeral series would see their memory utilization climb and climb as new time series got created. Further, startup times would increase because all of that data had to be loaded onto the heap at start time.
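To make that concrete, here is a toy sketch in Go (not InfluxDB's actual code) of what such an in-memory inverted index looks like: every measurement and tag key/value pair maps to the set of series keys that contain it, so metadata lookups never have to scan the raw TSM data. The series keys and tags below are made up for illustration.

    // invertedindex.go: toy in-memory inverted index, for illustration only.
    package main

    import "fmt"

    type invertedIndex struct {
        byMeasurement map[string]map[string]struct{} // measurement -> set of series keys
        byTag         map[string]map[string]struct{} // "tagKey=tagValue" -> set of series keys
    }

    func newIndex() *invertedIndex {
        return &invertedIndex{
            byMeasurement: map[string]map[string]struct{}{},
            byTag:         map[string]map[string]struct{}{},
        }
    }

    // addSeries registers one series key, e.g. "cpu,host=server01", under its
    // measurement and each of its tag pairs. Every entry lives on the heap,
    // which is why lots of ephemeral series used to inflate memory usage.
    func (ix *invertedIndex) addSeries(measurement string, tags map[string]string, seriesKey string) {
        if ix.byMeasurement[measurement] == nil {
            ix.byMeasurement[measurement] = map[string]struct{}{}
        }
        ix.byMeasurement[measurement][seriesKey] = struct{}{}
        for k, v := range tags {
            pair := k + "=" + v
            if ix.byTag[pair] == nil {
                ix.byTag[pair] = map[string]struct{}{}
            }
            ix.byTag[pair][seriesKey] = struct{}{}
        }
    }

    func main() {
        ix := newIndex()
        ix.addSeries("cpu", map[string]string{"host": "server01"}, "cpu,host=server01")
        ix.addSeries("cpu", map[string]string{"host": "server02"}, "cpu,host=server02")
        // The kind of lookup the index answers: which series have host=server01?
        for s := range ix.byTag["host=server01"] {
            fmt.Println(s)
        }
    }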

The new time series index, or TSI, moves the index to files on disk that we memory map. This means we let the operating system act as the LRU, paging index data in and out as needed. Much like the TSM engine for raw time series data, we have a write-ahead log with an in-memory structure that gets merged at query time with the memory-mapped index. Background routines run constantly to compact the index into larger and larger files to avoid having to do too many index merges at query time. Under the covers, we're using techniques like Robin Hood hashing to do fast index lookups and HyperLogLog++ to keep sketches of cardinality estimates. The latter will give us the ability to add things to the query language like SHOW CARDINALITY queries.
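As a rough illustration of the hashing technique mentioned above, here is a minimal Robin Hood hash map in Go. It is a sketch of the general algorithm only (fixed table size, no resizing or deletion), not InfluxDB's implementation; the series keys in main are made up.

    // robinhood.go: minimal Robin Hood hashing sketch, for illustration only.
    package main

    import "fmt"

    type entry struct {
        key  string
        val  uint64
        used bool
        dist int // probe distance from the key's ideal slot
    }

    type rhMap struct {
        slots []entry // fixed size; a real implementation resizes when it fills up
    }

    func newRHMap(size int) *rhMap { return &rhMap{slots: make([]entry, size)} }

    // hash is FNV-1a reduced to a slot index.
    func (m *rhMap) hash(key string) int {
        var h uint32 = 2166136261
        for i := 0; i < len(key); i++ {
            h = (h ^ uint32(key[i])) * 16777619
        }
        return int(h % uint32(len(m.slots)))
    }

    // insert uses Robin Hood displacement: if the incoming key has probed
    // farther than a slot's resident, the resident is evicted and keeps
    // probing, which keeps worst-case probe lengths short.
    func (m *rhMap) insert(key string, val uint64) {
        idx, dist := m.hash(key), 0
        for {
            slot := &m.slots[idx]
            if !slot.used {
                *slot = entry{key: key, val: val, used: true, dist: dist}
                return
            }
            if slot.key == key {
                slot.val = val // update existing key in place
                return
            }
            if slot.dist < dist { // resident is "richer": swap and keep going
                key, slot.key = slot.key, key
                val, slot.val = slot.val, val
                dist, slot.dist = slot.dist, dist
            }
            idx = (idx + 1) % len(m.slots)
            dist++
        }
    }

    // lookup can stop early: once we have probed farther than the resident's
    // own distance, the key cannot be in the table.
    func (m *rhMap) lookup(key string) (uint64, bool) {
        idx, dist := m.hash(key), 0
        for {
            slot := m.slots[idx]
            if !slot.used || slot.dist < dist {
                return 0, false
            }
            if slot.key == key {
                return slot.val, true
            }
            idx = (idx + 1) % len(m.slots)
            dist++
        }
    }

    func main() {
        m := newRHMap(64)
        m.insert("cpu,host=server01", 1)
        m.insert("cpu,host=server02", 2)
        fmt.Println(m.lookup("cpu,host=server01")) // 1 true
    }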

Problems TSI solves and doesn't solve

The biggest problem the current TSI work is meant to address is that of ephemeral time series. We see this most often with use cases that track per-process or per-container metrics by putting their identifiers in tags. For example, the Heapster project for Kubernetes does this. Series that are no longer hot for writes or queries won't take up space in memory.
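For illustration, the hypothetical Go snippet below writes CPU points tagged with container IDs to a local InfluxDB 1.x instance over the /write endpoint (the database name "k8s", the host tag, and the container IDs are made up). Every new container ID creates a brand-new series, which is exactly the ephemeral-series pattern that used to pin memory in the old in-memory index.

    // ephemeral.go: sketch of per-container metrics creating ephemeral series.
    // Assumes a local InfluxDB 1.x instance with a database named "k8s".
    package main

    import (
        "fmt"
        "net/http"
        "strings"
    )

    func main() {
        // Hypothetical container IDs; in a real cluster these churn constantly,
        // so each one becomes a short-lived series.
        containers := []string{"c1a2b3", "d4e5f6", "a7b8c9"}

        var b strings.Builder
        for _, id := range containers {
            // Line protocol: measurement "cpu", tags host and container_id, field usage.
            fmt.Fprintf(&b, "cpu,host=node01,container_id=%s usage=0.42\n", id)
        }

        // InfluxDB 1.x line-protocol write endpoint.
        resp, err := http.Post("http://localhost:8086/write?db=k8s",
            "text/plain", strings.NewReader(b.String()))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println("write status:", resp.Status)
    }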

The issue that this doesn’t yet address is limiting the scope of data returned by the SHOW queries. We’ll have updates to the query language in the future to limit those results by time. We also don’t solve the problem of having all these series hot for reads and writes. For that problem scale-out clustering is the solution. We’ll have to continue to optimize the query language and engine to work with large sets of series. The biggest thing to address in the near term is that queries that hit all series in the DB could potentially blow out the memory usage. We’ll need to add guard rails and limits into the language and eventually spill-to-disk query processing. That work will be ongoing in every release.

Enabling the new time series index

First, you'll have to download the alpha build of 1.3. You can find those packages in the nightly builds:
https://dl.influxdata.com/influxdb/artifacts/influxdb_1.3.0-alpha1_amd64.deb
https://dl.influxdata.com/influxdb/artifacts/influxdb-1.3.0-alpha1_linux_amd64.tar.gz
The new time series indexing engine is disabled by default. To enable it you'll need to edit your config file: under the [data] section, add a new setting:
index-version = "tsi1"
Then restart the database. If you already have data, all old shards will continue to use the in-memory index, while new shards will use the new disk-based time series index. For testing purposes it's best to start with a fresh database. To verify that you're using disk-based indexing, do a few writes and look at your data/<database>/<retention policy>/<shard id> directory. You should see a subdirectory called index.

Conclusion

This work is still early. There is more work to do to tune the compaction process so that it requires less memory, along with plenty of testing and bug fixing (watch the tsi label to keep track). Users should enable this only on testing infrastructure. It is not meant for production use, and we could update the underlying format of the TSI files in the coming few months, which would require users to blow away that data. However, we're very excited about the new possibilities this work opens up for the database. Users will be able to track ephemeral time series like per-process or per-container metrics, or data across a very large array of sensors.

I’ll be giving talks at PerconaLive on April 26th and DataEngConf on April 27th that go deep into the details of TSM and TSI.


Regarding the new TSI concept mentioned in this topic, I assume it will be either memory- or disk-based? Or is there a plan to have a mix of both, e.g. keep a cache of x entries in memory and the rest on disk (or even decide that per measurement/database)?

@sbains It will keep recently written series in memory and the rest on disk. We will add additional index controls as users request those features.

I have personally done some testing on my current (high-cardinality) setup and it has drastically reduced the memory usage. I can't wait to see this fully released; it's going to be a game changer!

Is there any ETA on this feature?


@syepes Happy to hear that! We should be cutting the first release candidate for 1.3 sometime in the next week. We are aiming for a release mid to late June.

How was your experience with it? Did you run into any issues? What is your use case where you have high cardinality?

For the moment I have just tested the same ingestion workload that I currently have on 1.2.x, and in terms of memory it's night and day.

The use case: two retention policies, raw (12h) and agg (3y). Data is ingested into the raw RP with high cardinality, since it contains UUIDs; then several CQs downsample and aggregate the data into the agg RP.

Is this already implemented/tested?

Pros and Cons

This feature is awesome, as it solves the memory problem and multiplies the number of series that can be ingested and stored. It should be enabled by default, but it might need to be disabled where high performance is needed, the number of series is very low (tens of thousands), and the amount of data is high enough that nanosecond resolution is needed.

The overhead is usually minimal, except for one case described below.
The cost is the additional I/O workload to look up series that were not cached.

However, when new series need to be recorded, performance degrades significantly and can affect ingestion of all other data if an influx of new series arrives, such as at shard-rotation time.
When a few million series (nowhere near the mentioned 1 billion) need to be added, InfluxDB starts to delay requests, which causes memory growth and eventually crashes InfluxDB.

It appears that inserts into the TSI index happen synchronously while a request is being processed.
When multiple requests carry new series that need to be inserted, it appears that only one request at a time gets the lock and does the insert.

I think a separate pipeline is needed for creating new series, one that would not cause delays for data with already-known series.
It would also be very helpful if the metadata were pre-created when a new shard is pre-created, because at shard-rotation time the same series continue to come in. Otherwise, what was the point of pre-creating the empty shard?
The index also needs to be able to handle multi-threaded inserts. This could be achieved by partitioning the index into a series-key-structured merge tree, where each partition has an independent lock and can later be merged into other partitions (or at least something similar to page-level locks, as in traditional SQL servers, where multiple concurrent inserts to different pages are allowed). A rough sketch of this idea is shown below.
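To illustrate the kind of partitioned, independently locked index suggested above, here is a rough Go sketch (not an actual InfluxDB patch; the names and partition count are made up). Series keys hash to one of a fixed number of partitions, each with its own lock, so inserts into different partitions can proceed concurrently instead of queueing behind one index-wide lock.

    // partitioned.go: sketch of a partitioned series index with per-partition locks.
    package main

    import (
        "fmt"
        "hash/fnv"
        "sync"
    )

    const numPartitions = 16 // arbitrary for this sketch

    type partition struct {
        mu     sync.RWMutex
        series map[string]uint64 // series key -> series ID
    }

    type partitionedIndex struct {
        parts  [numPartitions]partition
        idMu   sync.Mutex
        nextID uint64
    }

    func newPartitionedIndex() *partitionedIndex {
        ix := &partitionedIndex{}
        for i := range ix.parts {
            ix.parts[i].series = map[string]uint64{}
        }
        return ix
    }

    func (ix *partitionedIndex) partitionFor(key string) *partition {
        h := fnv.New32a()
        h.Write([]byte(key))
        return &ix.parts[h.Sum32()%numPartitions]
    }

    // createSeriesIfNotExists locks only one partition, so writers that touch
    // different partitions do not serialize on a single global lock.
    func (ix *partitionedIndex) createSeriesIfNotExists(key string) uint64 {
        p := ix.partitionFor(key)

        p.mu.RLock()
        if id, ok := p.series[key]; ok { // fast path: series already known
            p.mu.RUnlock()
            return id
        }
        p.mu.RUnlock()

        p.mu.Lock()
        defer p.mu.Unlock()
        if id, ok := p.series[key]; ok { // re-check after taking the write lock
            return id
        }
        ix.idMu.Lock()
        ix.nextID++
        id := ix.nextID
        ix.idMu.Unlock()
        p.series[key] = id
        return id
    }

    func main() {
        ix := newPartitionedIndex()
        var wg sync.WaitGroup
        for i := 0; i < 8; i++ {
            wg.Add(1)
            go func(n int) {
                defer wg.Done()
                ix.createSeriesIfNotExists(fmt.Sprintf("cpu,host=server%02d", n))
            }(i)
        }
        wg.Wait()
        fmt.Println("series created:", ix.nextID)
    }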

The workaround that we currently have to implement is to track the series before shard-rotation time and upload fake data near the end of the new shard, just to delete it a few hours after the rotation.
Otherwise the entire cluster is affected: nodes crash, HH queues overflow (which defeats the purpose of having Kafka), and prolonged data loss happens (for multiple hours). As a result, the entire Kafka queue for the duration of the outage needs to be reinserted, because the Anti-Entropy service just wouldn't do anything, even if kicked multiple times.
It's good when it works, but only until the shards rotate the next time; basically, a disaster waiting to happen.