What's the logical connection between buckets, measurements & retention policies in InfluxDB 2.0?

I am currently studying the documentation of InfluxDB 2.0; however, I don’t yet fully understand how buckets, measurements & retention policies relate to each other.

The documentation says that databases and retention policies were replaced by buckets. By definition, a bucket is:

“a named location where time-series data is stored in InfluxDB 2.0”

In my understanding

A bucket contains shard groups => Shard groups store the data of a certain interval in a particular folder; for example, a shard group could always save the data of a four-hour interval in a single folder.

A shard group contains shards => Shards are the single rows/points of the time-series table.

Moreover, Influx writes in the documentation that one bucket has one retention policy.

This means that “a bucket” stores only one time-series and not several ones; otherwise, a bucket could have several retention policies.

In case my understanding is correct, does this mean that you can only include measurements in the same bucket when all of them have the same retention policy? Because if there are two measurements with different retention policies in the same bucket, one retention policy could delete data from the other measurement. Please correct me if I confuse things here.

However, in case I am right, how does this influence hardware requirements?

Influx says that the number of series affects hardware requirements.

Does that mean that every bucket/retention policy raises the number of series and, with it, the hardware requirements?

For example, does it make a difference whether I store 60,000 series in one bucket, or 20,000 series in bucket A, another 20,000 series in bucket B, and the final 20,000 series in bucket C?

I am looking forward to your feedback!


A bucket has a single retention policy for all data stored in it. You can put many measurements (and many series) in a single bucket. Generally, you want to be deliberate in your schema design. Tags are indexed, which makes them good for values you regularly want to query or pivot on. Fields are not indexed and are better suited as result-set values than as query values. The measurement, tag set, and field key define a series. Since each new tag value creates new series, you generally want tags to have finite, low cardinality to keep the total number of series from exploding.
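
To make the series definition above concrete, here is a minimal sketch in Python that counts distinct series as (measurement, tag set, field key) combinations. The `Point` type, the `series_keys` helper, and the sample data are hypothetical and only illustrate the counting rule; this is not how InfluxDB implements it internally.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """Hypothetical stand-in for a written point (not an InfluxDB client type)."""
    measurement: str
    tags: dict
    fields: dict


def series_keys(points):
    """Collect the distinct (measurement, tag set, field key) combinations."""
    keys = set()
    for p in points:
        # Sort the tags so that tag order does not create a "new" series.
        tag_set = tuple(sorted(p.tags.items()))
        for field_key in p.fields:
            keys.add((p.measurement, tag_set, field_key))
    return keys


points = [
    Point("cpu", {"host": "a"}, {"usage": 0.5}),
    Point("cpu", {"host": "a"}, {"usage": 0.7}),               # same series, new value
    Point("cpu", {"host": "b"}, {"usage": 0.1}),               # new tag value: new series
    Point("cpu", {"host": "b"}, {"usage": 0.2, "idle": 0.8}),  # new field key: new series
]

print(len(series_keys(points)))  # 3
```

Note that field values never add a series here, while every new tag value does, which is why unbounded tag values (IDs, timestamps, and the like) tend to drive cardinality up.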

@AlvaroM you might find this blog post helpful: https://www.influxdata.com/blog/data-layout-and-schema-design-best-practices-for-influxdb/


I was wondering exactly the same thing! I read the post that mhall119 shared but it did not answer the question.

Does it make a difference if you store the series in 3 buckets instead of one?

I’m working on the design for storing billions of series. In order to reduce cardinality, I was thinking of storing them in different buckets, but I can’t seem to find out whether it’s going to make a difference or not.

@Danielle_Paquette-Ha -

It does not matter for InfluxDB 2.0 Cloud how you spread the series out over buckets. The cardinality impact is the same from your point of view. Buckets allow differing retention policies and are a natural boundary for queries (it’s a little more work to join query results from separate buckets than if the data for one query is all in one bucket). If you put the same measurement (and tags, etc.) into two buckets, then you’ve actually doubled your cardinality! The series key for series cardinality includes the bucket. Cheers.
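
As a rough illustration of the point above, the following Python sketch treats the bucket name as part of the series key and compares the three layouts discussed in this thread. The `total_cardinality` helper and the generated series are hypothetical; this is a toy model of the counting, not InfluxDB's actual accounting.

```python
def total_cardinality(buckets):
    """Count distinct series when the bucket name is part of the series key."""
    keys = set()
    for bucket_name, series in buckets.items():
        for s in series:
            keys.add((bucket_name, s))
    return len(keys)


# 60,000 hypothetical series: one measurement, one field, 60,000 tag values.
all_series = [("cpu", f"host={i}", "usage") for i in range(60_000)]

one_bucket = {"A": all_series}
three_buckets = {
    "A": all_series[:20_000],
    "B": all_series[20_000:40_000],
    "C": all_series[40_000:],
}
duplicated = {"A": all_series, "B": all_series}  # same series written to two buckets

print(total_cardinality(one_bucket))     # 60000
print(total_cardinality(three_buckets))  # 60000
print(total_cardinality(duplicated))     # 120000
```

Under this model, splitting the series across buckets leaves the total unchanged, while writing the same series into a second bucket doubles it.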
