Hi. After much research, we are finally planning to move the time series data we’re currently storing in MySQL to InfluxDB. We currently provide services to 5 customers, and store ~400,000 rows per day of IoT data (~80k rows per customer per day). We are starting to see growth, and expect to add 2-4 customers per month for the remainder of the year.
I have studied the planning and sizing guides for the hardware, but I’m not sure which direction to take with buckets. Is there a performance impact with this level of data in a single bucket, or would one bucket per customer keep things smaller, more manageable, and perhaps faster for queries? I still have a lot of details to work out, but I’m looking for guidance to help avoid common pitfalls as I plan the architecture. Any pointers in the right direction will be much appreciated. Thanks!
Generally, users place similar data into the same bucket. However, if the data you’re collecting is unrelated and you don’t plan on performing analysis or applying functions across buckets, then you can separate out your data. 400,000 rows per day of IoT data across 5 customers is not considered a large amount of data. If you plan on expanding to thousands of customers, then that’s a different deal. At that point, whether or not that’s a lot of data for one bucket depends on how much data you want to be querying at any given point and what your downsampling or data expiration plan is.
Can you tell me more about what you’re trying to build? It would help me help you better, and your project also just sounds cool and I want to know more. Are you developing an IoT application on top of InfluxDB? What’s the use case?
Sure. The IoT data being received is primarily from atmospheric sensors, each of which measures 20+ points on a 10-minute interval. The data never stop, and much of it is related, so keeping it in the same bucket makes sense. We are planning to move the time series data to InfluxDB, while the configuration and other data will remain in Postgres, at least for now. This brings up other questions about storing data as tags rather than using joins to a configuration DB, but that’s a question for another day.
However, as I work through more of the problems, I see that my original question should have asked about max bucket sizes and speed. The major item I see is that all data for all customers needs to remain accessible for an extended period of time… at least a few years. And that data will be pulled directly by customers in the form of a download or nice and pretty charts and graphs on a regular basis, so access and queries also need to be fast.
Thanks for pointing me to the Best Practices page. I reviewed it earlier, but recall thinking it didn’t fully address my problem since I won’t be implementing a near-term retention policy. You mentioned above that 400k data points per day isn’t a large amount of data. Could you offer suggestions for architecting a DB that can smoothly handle a couple of million new data points per day, while still offering quick data pulls and no data deletion? Boy, that sentence feels like I’m asking for the world… hopefully there are some general rules of thumb that can help me get started. Thanks again!
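For reference, here is the back-of-the-envelope arithmetic behind "a couple of million" per day, as a small Python sketch. The 80k rows per customer per day and the 2-4 new customers per month come from my first post; the 6-month horizon, the 30-day month, and the assumption that every new customer produces the same volume as the current ones are guesses for illustration only.

```python
ROWS_PER_CUSTOMER_PER_DAY = 80_000  # ~400,000 rows/day across 5 customers
CURRENT_CUSTOMERS = 5


def projected_rows_per_day(months: int, new_customers_per_month: int) -> int:
    """Daily row volume after `months` of steady customer growth."""
    customers = CURRENT_CUSTOMERS + new_customers_per_month * months
    return customers * ROWS_PER_CUSTOMER_PER_DAY


def cumulative_rows(months: int, new_customers_per_month: int, days_per_month: int = 30) -> int:
    """Total rows stored over `months`, with nothing ever deleted.

    Assumes new customers start at the beginning of the following month.
    """
    total = 0
    for month in range(months):
        customers = CURRENT_CUSTOMERS + new_customers_per_month * month
        total += customers * ROWS_PER_CUSTOMER_PER_DAY * days_per_month
    return total


if __name__ == "__main__":
    horizon_months = 6  # assumed horizon, not a real plan
    for rate in (2, 4):
        daily = projected_rows_per_day(horizon_months, rate)
        stored = cumulative_rows(horizon_months, rate)
        print(f"+{rate} customers/month: {daily:,} rows/day after "
              f"{horizon_months} months, {stored:,} rows stored by then")
```

Under those assumptions the daily volume lands somewhere between roughly 1.4 and 2.3 million rows after six months, which is the range I’m trying to plan for.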