Structure Considerations in migration from v1.8 to v2.1 - help/info needed

Hi, I’m evaluating a migration from InfluxDB 1.8 to 2.1 and I need some help figuring out the differences, how to properly migrate my structure, and what limitations to account for.

Current structure (v1.8):

As of now I have 42 databases; some are inactive and could be deleted, but realistically the number might grow above 42. For these considerations, let’s say I’ve got 20 DBs, with the requirement that I can scale by creating more databases.

  • Each DB identifies a customer
  • All the databases have the very same structure
  • Each DB has its own read-only user

Proposed conversion steps

The conversion steps proposed by the docs for 1.8 → 2.1 are the following:

  1. Create one organization
  2. Create buckets to replace DBs and RPs, with a naming convention such as db/rp (e.g. Customer_1/temp)
  3. Create users and assign proper authorization tokens
  4. Rewrite CQs to Tasks

Strictly speaking about the data, it should look like this: what was once a single DB is now made up of 3 buckets (one per RP).
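For example (using a hypothetical customer Customer_1 and a placeholder measurement m1), what I query today as "Customer_1"."temp"."m1" would become a Flux query against the matching bucket:

from(bucket: "Customer_1/temp")  // former DB "Customer_1", RP "temp"
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "m1")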

Here are my doubts

About Organizations

  1. what’s an organization supposed to represent exactly (is it proper to make it represent a customer?)
  2. what’s the gain of having only one org with multiple customers’ buckets vs. multiple orgs, each with a single customer’s buckets?

About Task
I know tasks reside inside an organization; right now I have 5 identical CQs in each database.

  1. in the case of a single org, can I have just one task that loops over all the buckets, or do I still need to create a task for each “customer”?

Other
Bucket limitation - according to the docs, a single InfluxDB 2.1 OSS instance supports approximately 20 buckets actively being written to or queried across all organizations, depending on the use case.
I’ve got basically 3 areas each with different write workloads:

  • temp → continuous low activity with small recurring peaks (every 5min)
  • standard → continuous activity, most of the writing is here + some CQs results (heavy calculation, but small result)
  • long → CQs result only, computed every few hours or daily, very few data points

I’m not worried about read activity as it is pretty low (mostly in case of troubleshooting)

Given this kind of workload:

  • How bad is it to go over the 20 buckets per instance? (as of now I’ll have ~20*3 = 60 buckets)
  • should I count/consider my temp and long buckets even if their activity is pretty low?

Scaling
So far I’ve only scaled vertically, meaning adding DBs to the existing instances and, if needed, adding resources to the machine.

  • given the “20 buckets maximum” recommendation, what’s the best way to scale? looks like vertical scaling is not as good as it was before…
  • it’s easy and flexible to route data into the proper DB & RP inside a single Influx instance (using Telegraf), but if you have multiple nodes/instances it becomes way less pleasant, as you have to filter the output plugins to “route” the data to the correct instance (meaning hardcoded values inside the config)

Thanks for reading this far.
I’d appreciate any suggestions and/or additional considerations.

I think you might find the following useful:

Specifically:

Single Bucket

The single bucket design has the following advantages:

  • Writing all of your data to a single bucket makes querying for data easier. You don’t have to keep track of which data exists within multiple buckets. You’ll likely need to use multiple measurements to organize your data effectively. However, you can still easily perform aggregations across multiple measurements by grouping the measurements together.
  • You can perform one downsampling task to downsample all of your data. Downsampling tasks are used to reduce the overall disk usage of your data by transforming high resolution data into lower resolution aggregates.
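For instance, a single downsampling task over one shared bucket might look roughly like this (the bucket names, measurement, and intervals below are placeholders, not something prescribed here):

option task = {name: "downsample_all", every: 1h}

from(bucket: "all_customers")                                     // hypothetical single shared bucket
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "sensor_data")         // hypothetical measurement with numeric fields
    |> aggregateWindow(every: 10m, fn: mean, createEmpty: false)  // high-resolution data down to 10m averages
    |> to(bucket: "all_customers_downsampled")                    // hypothetical lower-resolution bucket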

The single bucket design has the following disadvantages:

  • Each bucket is assigned one retention policy. With this design you can’t expire data subsets at different retention intervals.
  • If you need to query an individual user’s data within a single bucket for your application, you should generate separate read tokens for each customer. However, this design is less secure against malicious attacks than isolating different users’ data in separate buckets.

The single bucket approach is good for use cases where:

  • You’re developing a simple IoT application, like the Air sensor sample dataset.
  • You intend to treat all of your users’ data in the same way. Your data is being collected at similar intervals, and a single retention policy is an effective solution for proper time series database management.
  • Your data is relatively non-sensitive and preventing a data breach is of little or no concern.

Bucket per User

The bucket per user design has the following advantages:

  • Your customers’ data is isolated and secure.
  • You can assign different retention policies to your buckets depending on each customer’s needs.

The bucket per user design has the following disadvantages:

  • You can’t visualize multiple users’ data without joining the data across the buckets first.
  • Performing math across your users’ data is hard. For example, if you want to know which user is reporting the max value for their field, you must first join all of the data across the different user buckets.
  • You’ll likely need to write multiple downsampling tasks to downsample all of your users’ data. You can automate some of this downsampling task creation with parameterized queries, but it’ll require a little more work.

The bucket per user design is good for use cases where:

  • You’re developing a more sophisticated IoT application and the data is sensitive, like a human health application.
  • You’re writing sensitive data to InfluxDB and isolating your users’ data is a priority.
  • Your customers have different data requirements and you need to provide them with different time series data management solutions (including different retention policies and downsampling tasks).

Org per Customer

The org per customer design has the following advantages:

  • You can meet the data requirements of users with advanced high throughput and cardinality use cases.
  • Your users’ data is isolated and secure.

The org per customer design has the following disadvantage:

  • You can’t easily query for data across multiple organizations.

The org per customer design is good for use cases where:

  • You’re developing an industrial IoT solution that can be used for multiple companies.

Now to your questions:

  1. what’s an organization supposed to represent exactly (is it proper to make it represent a customer?)

You can make an org a customer. See above. Users are scoped to an org.

  2. what’s the gain of having only one org with multiple customers’ buckets vs. multiple orgs, each with a single customer’s buckets?

See above.

  1. in the case of a single org, can I have just one task that loops over all the buckets, or do I still need to create a task for each “customer”?

It depends on the task and what you’re trying to do with the data. You can loop over buckets with:

import "experimental/array"

sources = [
    {bucket: "b1", measurement: "m1", field: "f1"},
    {bucket: "b2", measurement: "m2", field: "f2"},
    {bucket: "b3", measurement: "m3", field: "f3"},
]

timeRange = {start: -1h, stop: now()}

tables = array.map(
    arr: measurements,
    fn: (x) => from(bucket: x.bucket)
        |> range(start: timeRange.start, stop: timeRange.stop)
        |> filter(fn: (r) => r._measurment == x.measurement and r._field == x.field)
)

union(tables: tables)

But you might also not have to loop in that manner: you could potentially just query each bucket explicitly and then write the data to the desired location/measurement with different to() calls. It would be more helpful to see your CQs.
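For example, a rough sketch of that per-bucket approach (the bucket and measurement names here are just placeholders) could be a single script with one explicit pipeline per customer bucket:

from(bucket: "Customer_1/temp")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "m1")
    |> to(bucket: "Customer_1/standard")  // write Customer_1's result to its own destination bucket

from(bucket: "Customer_2/temp")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "m1")
    |> to(bucket: "Customer_2/standard")  // repeat for each customer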

  • How bad is it to go over the 20 buckets per instance? (as of now I’ll have ~20*3 = 60 buckets)
  • should I count/consider my temp and long buckets even if their activity is pretty low?

Where in the docs did you get this from? I need to ask around. I think that suggestion is extremely dependent on ingest and query rate and HW.

I would rather follow these guidelines for 2.1.

  • given the “20 buckets maximum” recommendation, what’s the best way to scale? looks like vertical scaling is not as good as it was before…

I’m having trouble imagining what you mean. How is routing to a bucket harder than routing to a specific DB + RP? Can I get some more detail here? Thank you.


Thanks for the answers, they definitely help.

About the 20 bucket limit

  • you can find the warning on any bucket-related page of the docs, like this one.
  • If helpful, I can provide data ingestion rates, but let’s say that the RPs with low activity get almost no points compared to the high-activity one (that’s why I hope to leave them out of the count).

About CQs to Tasks
I’m sure it’s doable one way or another, but not having to replicate or edit the code would be a nice improvement. I’ll put a sample CQ here.

CQ Sample
CREATE CONTINUOUS QUERY cq_sqlserver_query_stats_non_cumulative ON "__DbName__" RESAMPLE EVERY 5m FOR 15m BEGIN 
SELECT 
	non_negative_difference(last("execution_count")) AS "execution_count",
	non_negative_difference(last("plan_generation_num")) AS "plan_generation_num",
	non_negative_difference(last("total_clr_time_ms")) AS "total_clr_time_ms",
	non_negative_difference(last("total_columnstore_segment_reads")) AS "total_columnstore_segment_reads",
	non_negative_difference(last("total_columnstore_segment_skips")) AS "total_columnstore_segment_skips",
	non_negative_difference(last("total_dop")) AS "total_dop",
	non_negative_difference(last("total_elapsed_time_ms")) AS "total_elapsed_time_ms",
	non_negative_difference(last("total_grant_kb")) AS "total_grant_kb",
	non_negative_difference(last("total_ideal_grant_kb")) AS "total_ideal_grant_kb",
	non_negative_difference(last("total_logical_reads")) AS "total_logical_reads",
	non_negative_difference(last("total_logical_writes")) AS "total_logical_writes",
	non_negative_difference(last("total_physical_reads")) AS "total_physical_reads",
	non_negative_difference(last("total_reserved_threads")) AS "total_reserved_threads",
	non_negative_difference(last("total_rows")) AS "total_rows",
	non_negative_difference(last("total_used_grant_kb")) AS "total_used_grant_kb",
	non_negative_difference(last("total_used_threads")) AS "total_used_threads",
	non_negative_difference(last("total_worker_time_ms")) AS "total_worker_time_ms",
	last("statement_text") AS "statement_text",
	last("statement_text_preview") AS "statement_text_preview"
INTO
	"__DbName__"."standard"."sqlserver_query_stats_nc" 
FROM
	"__DbName__"."temp"."sqlserver_query_stats" 
GROUP BY
	time(5m),
	*
END

About routing
Right now all the remote Telegraf clients point to a single Telegraf gateway (on my side), which I use to manage the routing and do additional processing. All the data needed for the routing is added by the clients themselves.
The influxdb output plugin works perfectly fine for routing to DBs and RPs, at least as long as you have just one destination instance… (this is valid for both v1.8 and 2.1)

Gateway Output Config
[[outputs.influxdb]]

  urls = ["http://127.0.0.1:8086"]

  database_tag = "company"
  exclude_database_tag = false

  ##default destination
  retention_policy = "standard"  
  ## set only for some measurements, has priority over retention_policy 
  retention_policy_tag = "DestinationRP" 
  exclude_retention_policy_tag = true

If you have multiple instances of InfluxDB you must write and maintain several [[outputs.influxdb]] sections (one per InfluxDB instance) and use filters to send the data to the proper one… it’s manageable, but it’s way less straightforward than a single-node environment, as you have to either:

  • Do continuous maintenance/editing of the gateway config, editing the filters based on the database_tag to route the data to the proper output
  • Add more info on the clients, meaning an additional static tag which will be used in the output filtering… simpler on the gateway, but you lose the ability to control everything server-side (on the gateway), as the clients need to be restarted to apply any change (in this case, pointing to a different instance) and I might not have free access to them.

Note that, in the case of a design with multiple organizations (v2.1), the routing becomes even more verbose, as there is no routing option for orgs (something like an organization_tag) in the outputs.influxdb_v2 plugin. I’ve opened an issue about it.

And again, thanks for your support @Anaisdg.

The Flux task translation for your CQ would look like:

option task = {
    name: "CQ translation",
    every: 5m,
    offset: 1m,
}

from(bucket: "__DbName__")
    |> range(start: -15m)
    |> filter(fn: (r) => r._measurement == "sqlserver_query_stats")
    |> filter(fn: (r) => r._field == "execution_count" or 
    r._field == "plan_generation_num" or
    r._field == "total_clr_time_ms" or 
    r._field == "total_columnstore_segment_reads" or .... all of your fields however this filter is redundant if you're applying the non negative difference to all of your data/fields in that measurement)
    |> difference(
    nonNegative: true,
    columns: ["_value"],
    keepFirst: false,
    initialZero: false,
) 
   |> last() 
   |> set(key: "_measurement",value: "sqlserver_query_stats_nc")
   |> to(bucket: "__DbName__")
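If you want the output to track the CQ’s GROUP BY time(5m) + last() windows more closely, one possible variation of the same query (still a sketch, with the same placeholder bucket names) is to take the last value per 5-minute window before applying the difference. Note that difference() only works on numeric values, so the string fields (statement_text, statement_text_preview) would need a separate pipeline without the difference step:

from(bucket: "__DbName__/temp")
    |> range(start: -15m)
    |> filter(fn: (r) => r._measurement == "sqlserver_query_stats")
    |> filter(fn: (r) => r._field != "statement_text" and r._field != "statement_text_preview")  // numeric fields only
    |> aggregateWindow(every: 5m, fn: last, createEmpty: false)  // last() per 5m window, like the CQ
    |> difference(nonNegative: true)                             // non_negative_difference()
    |> set(key: "_measurement", value: "sqlserver_query_stats_nc")
    |> to(bucket: "__DbName__/standard")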

@Giovanni_Luisotto as far as buckets are concerned, I think the limit is variable. Performance is affected by cardinality, ingest volume, HW, types of queries, # of reads, etc. The number of buckets is just one factor that can impact performance. If you’re not overloading your instance with too much cardinality and too many reads/writes then you should be good. I would think of it as a soft limit or suggestion for most use cases. I’ll let you know if I hear something different.
