We have organized a data collection that sinks into an InfluxDB instance.
A bucket is a single source with a predefined list of columns.
A measurement is a collection of data from the specified bucket, so a measurement holds in its name the starting and ending timestamps of the data group uploaded to the database.
Do you see any performance downside in InfluxDB with this type of organization?
My thinking about querying: since queries in our application always specify a date interval, before executing the query we can sort and filter the measurements to find exactly which ones hold the data, and so specify the minimum needed group of measurements in the query call.
Does this give a performance gain when querying InfluxDB, or is it nonsense?
Would it be better to set a unique measurement for each data source for example?
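To make the pre-filtering idea concrete, here is a minimal sketch of it in plain Python. The naming convention ("<start>__<end>" in UTC) and the function names are assumptions made up for illustration, not our exact format.

```python
from datetime import datetime, timezone

# Hypothetical naming convention: "<start>__<end>", both UTC timestamps.
NAME_FORMAT = "%Y%m%dT%H%M%SZ"


def parse_measurement_name(name):
    """Split a measurement name into its (start, end) datetimes."""
    start_text, end_text = name.split("__")
    start = datetime.strptime(start_text, NAME_FORMAT).replace(tzinfo=timezone.utc)
    end = datetime.strptime(end_text, NAME_FORMAT).replace(tzinfo=timezone.utc)
    return start, end


def measurements_overlapping(names, query_start, query_stop):
    """Return, sorted by start time, the measurements overlapping [query_start, query_stop)."""
    selected = []
    for name in names:
        start, end = parse_measurement_name(name)
        # Keep the measurement only if its interval intersects the query interval.
        if start < query_stop and end >= query_start:
            selected.append((start, name))
    return [name for _, name in sorted(selected)]


names = [
    "20240101T000000Z__20240101T115959Z",
    "20240101T120000Z__20240101T235959Z",
    "20240102T000000Z__20240102T115959Z",
]
query_start = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
query_stop = datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc)

# Only the first two measurements overlap the 10:00-14:00 window.
print(measurements_overlapping(names, query_start, query_stop))
```

The selected names would then be the only measurements passed to the query call.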
Another related question: if a row of data is contained in more than one measurement, is it stored as a duplicate, or does InfluxDB keep a single value based on the data contained, or just on the timestamp?
Meanwhile, if a row is uploaded to a measurement where its timestamp already exists, the latest data is retained and the oldest deleted, right?
Thanks, guys. Any help, suggestion, or question is appreciated.
Hello @nicfio,
Welcome! I recommend reading this documentation on schema design best practices for InfluxDB v2:
A bucket will contain one or many measurements.
Measurements don't usually have a timestamp in their name, but every line will have a timestamp.
You can filter data with the range() function; that's the best way to filter by time.
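As a rough sketch of what filtering by time with range() could look like, the helper below only builds the Flux query text in Python. The function name, the bucket "sensors", and the measurement "plant_a" are placeholders I made up; the resulting string would then be handed to whatever query API you use.

```python
from datetime import datetime, timezone


def build_flux_query(bucket, measurement, start, stop):
    """Build a Flux query that limits results with range() and one measurement filter."""
    # range() accepts RFC 3339 timestamps as time literals.
    start_text = start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    stop_text = stop.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        f'from(bucket: "{bucket}")\n'
        f"  |> range(start: {start_text}, stop: {stop_text})\n"
        f'  |> filter(fn: (r) => r._measurement == "{measurement}")'
    )


# Placeholder bucket and measurement names, for illustration only.
query = build_flux_query(
    "sensors",
    "plant_a",
    datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 14, 0, tzinfo=timezone.utc),
)
print(query)
```

With a stable measurement name, the time window lives entirely in range(), so nothing about the interval has to be encoded in the name.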
A row of data can't be in more than one measurement. Series are indexed in InfluxDB v2. Series are defined by the unique combination of measurement name, tag key-value pairs, and field key.
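To illustrate the duplicate question, here is a toy model in plain Python. It is not InfluxDB code; it only mimics the documented rule as I understand it: a point is identified by measurement, tag set, and timestamp, and rewriting the same point merges the field sets with the newest value winning. The names "plant_a", "plant_b", and "sensor" are invented for the example.

```python
def write_point(store, measurement, tags, fields, timestamp):
    """Toy write: one entry per (measurement, tag set, timestamp)."""
    key = (measurement, tuple(sorted(tags.items())), timestamp)
    # Fields of an existing point are merged; on conflict the new value wins.
    store.setdefault(key, {}).update(fields)


store = {}
write_point(store, "plant_a", {"sensor": "s1"}, {"temp": 20.0}, 1000)
# Same measurement, tags, and timestamp: the earlier value is overwritten.
write_point(store, "plant_a", {"sensor": "s1"}, {"temp": 21.5}, 1000)
# Same row in a different measurement: stored as a separate, independent point.
write_point(store, "plant_b", {"sensor": "s1"}, {"temp": 20.0}, 1000)

print(len(store))  # 2
print(store[("plant_a", (("sensor", "s1"),), 1000)])  # {'temp': 21.5}
```

So the same row written under two measurement names ends up stored twice, while a rewrite within one series replaces the earlier value.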
Your comments are very much appreciated.
Indeed, I have been trying InfluxDB v3 with the newer Python client for a few weeks and I am enjoying the enhancements, for now especially in the query operation featuring the pyarrow Flight client.
Regarding the schema employed, I am using the starting and closing dates of a group of data as a sort of hash for the measurement name, and after some time in use I am experiencing downsides.
Overall, I think it is best to rethink the role the measurement plays in the schema.
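As a rough sketch of one direction this could take, assuming one measurement per data source with the time carried only by each row's timestamp, the snippet below formats points as line protocol. The helper name, "plant_a", and the tag and field names are hypothetical, and the helper handles numeric fields only, with no escaping of special characters.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as line protocol (numeric fields only, no escaping)."""
    tag_text = "".join(f",{key}={value}" for key, value in sorted(tags.items()))
    field_text = ",".join(f"{key}={float(value)}" for key, value in sorted(fields.items()))
    return f"{measurement}{tag_text} {field_text} {timestamp_ns}"


# The measurement names the source; the interval is no longer part of the name.
line = to_line_protocol("plant_a", {"sensor": "s1"}, {"temp": 21.5}, 1704103200000000000)
print(line)  # plant_a,sensor=s1 temp=21.5 1704103200000000000
```

With this layout, a date interval is selected at query time rather than by choosing among measurement names.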