Is there any guidance on what the optimal shard size is for influx? Is it reasonable to have a shard size of 3-5GB? Also, what is a reasonable upper limit for the number of databases per instance (and does the commercial clustering support help with large numbers of databases)?
Your optimal shard size depends on a lot of different variables, including your TSM files, shard duration, schema design, and retention policy. Have you seen these resources?:
The number of databases you can run per instance also largely depends on your schemas, shard duration, retention policy, etc. There’s a limit, but I don’t know what it is…I’ve heard of a user making a new DB for each customer they had, but I wouldn’t recommend doing that ;p
The reason I need to create a lot of databases and have a large shard size is primarily because the retention policy based deletion of data doesn’t fit my use case. I need to keep data forever but selectively delete. Based on my limited experimentation, deleting data by tag is slow while dropping an entire database is very fast. I don’t suppose there is any way to shard based on something other than time but that would fit my use case best.
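To make the contrast concrete, here is a small sketch of the two deletion paths being compared. The measurement pattern, tag name (`sim_run`), and database names are hypothetical placeholders; in practice these statements would be sent to InfluxDB via the HTTP API or the `influx` CLI.

```python
# Sketch of the two deletion paths: selective delete by tag vs.
# dropping a whole database. Tag and database names are made up.

def delete_by_tag(tag, value):
    # Selective delete: InfluxDB has to walk the index and rewrite
    # TSM blocks, which is why this path is slow on large series.
    return "DELETE FROM /.*/ WHERE {} = '{}'".format(tag, value)

def drop_simulation_db(run_id):
    # Dropping a database mostly just removes shard files from disk,
    # so it completes quickly regardless of data volume.
    return "DROP DATABASE sim_{}".format(run_id)

print(delete_by_tag("sim_run", "run_42"))
print(drop_simulation_db("run_42"))
```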
The “many databases” approach looks like it would work reasonably well, but it would be nice to know whether it scales in the enterprise product. The only downside is that the amount of data is quite variable (5K - 500M points per database). The smaller databases are suboptimal, given that there appears to be a 32MiB overhead with each db. So, I guess if I can add something to the suggestion box, it would be the ability to shard by tag as opposed to wall clock time.
Can you expand upon why a retention policy combined with continuous queries/Kapacitor downsampling cannot fit your use case? There are definitely users who keep their data around forever, albeit they usually keep a less granular snapshot of it. Do you need to retain data forever at the same high granularity?
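For reference, the retention-policy-plus-continuous-query pattern mentioned above can be sketched as follows. All database, retention policy, and measurement names here are hypothetical, and the exact durations would depend on your needs:

```python
# Sketch: raw data expires after 30 days, while a downsampled
# hourly mean is kept forever. Names ("metrics", "cpu", etc.)
# are placeholders, not from the thread.

statements = [
    # Raw points expire after 30 days under the default policy.
    'CREATE RETENTION POLICY "raw_30d" ON "metrics" '
    'DURATION 30d REPLICATION 1 DEFAULT',
    # A never-expiring policy holds the less granular snapshot.
    'CREATE RETENTION POLICY "forever" ON "metrics" '
    'DURATION INF REPLICATION 1',
    # The continuous query writes an hourly mean into "forever".
    'CREATE CONTINUOUS QUERY "cq_hourly" ON "metrics" BEGIN '
    'SELECT mean("value") AS "value" '
    'INTO "metrics"."forever"."cpu_hourly" FROM "cpu" '
    'GROUP BY time(1h) END',
]

for s in statements:
    print(s)
```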
I don’t suppose there is any way to shard based on something other than time but that would fit my use case best.
That is correct: there is no way to shard data other than by time. Since InfluxDB is a time series database, we find it makes sense to partition data this way.
If you’re interested in the Enterprise version of InfluxDB, please feel free to reach out to sales who can pair you with an engineer to ascertain if the “many databases” approach is feasible at the HA level. Alternatively, you can participate in a trial version of Enterprise to see if it works with your use case.
Best of luck!
Sure. The primary difference is that time is simulated, not actual wall clock time. The simulated time is monotonically increasing so it is a time series, just not wall clock time. Typically many simulations are run, all with the same range of time (say 2018 - 2060) but only a few are kept. If each simulation is in its own database then dropping it is very quick/efficient. The only downside is if the simulation is small (say 5K points) then the 32MiB overhead associated with the database could be a bit of a problem. For these situations, using tags is much more efficient from a data footprint perspective. For larger simulations (100M - 500M points) the overhead is not significant and it is a very good fit.
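A back-of-envelope calculation makes the tradeoff above visible. The ~3 bytes per point is an assumed compressed size for a non-string field (actual size varies widely with schema); the 32 MiB figure is the per-database overhead mentioned in this thread:

```python
# Rough estimate of per-database overhead relative to data size.
# BYTES_PER_POINT is an assumption; real TSM compression varies.

BYTES_PER_POINT = 3             # assumed compressed size per point
DB_OVERHEAD = 32 * 1024 * 1024  # ~32 MiB fixed cost per database

def overhead_ratio(points):
    """Fixed overhead as a multiple of the actual data size."""
    return DB_OVERHEAD / (points * BYTES_PER_POINT)

# Small simulation (5K points): overhead dwarfs the data,
# roughly three orders of magnitude larger.
print(overhead_ratio(5_000))

# Large simulation (500M points): overhead is negligible.
print(overhead_ratio(500_000_000))
```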
Another topic is dimensionality. InfluxDB (and all similar products) is optimized for one dimension vs time. In my case, I also have 2 and 3 dimensions vs time. I suppose this could be flattened out to a single dimension, but the number of fields could become very high (> 1000 for 2D and > 1B for 3D). I don’t imagine this is a good fit for InfluxDB (and it isn’t necessary to use the same technology for everything; polyglot is fine). However, if other users are doing this kind of thing, it would be interesting to hear about it.
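To illustrate the flattening idea and the field-count blowup, here is a minimal sketch. The field naming scheme and grid contents are hypothetical, just enough to show how quickly the count grows:

```python
# Sketch: flatten a 2D quantity into scalar fields per timestamp.
# The naming scheme ("<name>_<i>_<j>") is an arbitrary choice.

def flatten_grid(name, grid):
    """Turn an NxM grid into a flat dict of scalar field values."""
    return {
        "{}_{}_{}".format(name, i, j): v
        for i, row in enumerate(grid)
        for j, v in enumerate(row)
    }

# Even a modest 40x40 grid already exceeds 1000 fields per point;
# a 3D volume multiplies this again by the third dimension.
grid = [[float(i * 40 + j) for j in range(40)] for i in range(40)]
fields = flatten_grid("temp", grid)
print(len(fields))  # 1600
```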