I have an InfluxDB database with 5 measurements. When I query the data without specifying the measurement filter, I still get the correct data, and the query execution time is the same as when I include the measurement filter.
Could you please explain why this happens? Is there any internal optimization in InfluxDB that makes the measurement filter redundant in this context? Or is there a specific reason why the performance remains unchanged?
@Satyam It all has to do with schema. If you’re querying fields that only exist in one measurement, the query engine can quickly identify that and return the requested data without a specific measurement filter in place. However, if fields with the same name exist in multiple measurements, the query will return all rows with the requested field across all of those measurements. Each filter is just a way to identify what data to return. If the data can be identified with just a tag or a field, there isn’t a need to filter by measurement.
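To make the schema point concrete, here is a minimal Python sketch that mimics the filtering logic on a few in-memory rows. It is not InfluxDB code, and the measurement names (`trades`, `quotes`), field names, and values are made up purely for illustration.

```python
# Toy rows shaped like query output: each has a measurement, a field key, and a value.
# All names and values are hypothetical; this only mimics the filter logic.
rows = [
    {"_measurement": "trades", "_field": "price", "_value": 101.5},
    {"_measurement": "trades", "_field": "volume", "_value": 300},
    {"_measurement": "quotes", "_field": "bid", "_value": 101.4},
    {"_measurement": "quotes", "_field": "price", "_value": 101.6},
]


def select(rows, field, measurement=None):
    """Return rows matching the field key and, optionally, the measurement."""
    return [
        r for r in rows
        if r["_field"] == field
        and (measurement is None or r["_measurement"] == measurement)
    ]


# "bid" exists only in "quotes", so adding the measurement filter changes nothing.
bid_any = select(rows, "bid")
bid_quotes = select(rows, "bid", measurement="quotes")

# "price" exists in both measurements, so omitting the filter returns rows from both.
price_any = select(rows, "price")
price_trades = select(rows, "price", measurement="trades")
```

In this toy data, `bid_any` and `bid_quotes` are the same rows, while `price_any` contains rows from both measurements and `price_trades` only from one.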
On a lower level, Flux actually pushes the computational load of certain operations down to the storage tier (closer to where the data lives), where these operations happen much faster. The basic from() |> range() |> filter() pattern (if filtering by simple measurement, tag, or field key-value pairs) is all pushed down to storage and is computed very quickly. So there may be a difference between filtering with and without a measurement, but it’s likely on the nanosecond scale and isn’t really noticeable.
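For reference, the two query shapes being compared could look roughly like the following. This is a small Python sketch that only builds the Flux query text; the helper `build_query` and the bucket, measurement, and field names (`stocks`, `trades`, `price`) are hypothetical placeholders, not anything from your setup.

```python
# Build the two Flux queries being compared, as plain strings.
# The bucket, measurement, and field names are hypothetical placeholders.
def build_query(bucket, field, measurement=None, start="-1h"):
    """Return a from() |> range() |> filter() Flux query as a string."""
    predicates = []
    if measurement is not None:
        predicates.append(f'r._measurement == "{measurement}"')
    predicates.append(f'r._field == "{field}"')
    condition = " and ".join(predicates)
    return (
        f'from(bucket: "{bucket}")\n'
        f"    |> range(start: {start})\n"
        f"    |> filter(fn: (r) => {condition})"
    )


with_measurement = build_query("stocks", "price", measurement="trades")
without_measurement = build_query("stocks", "price")

print(with_measurement)
print(without_measurement)
```

Both queries use only simple key-value comparisons in filter(), which is the kind of predicate described above. The only difference is the extra `r._measurement` comparison.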
Thanks a lot @scott
I’d appreciate some more suggestions from you, as we are planning to use InfluxDB in production.
We are about to store approximately 120 million data points for around 60 different stocks in InfluxDB OSS, and we want to optimize both write and query operations for this dataset. Could you please provide core technical suggestions to achieve this?
Specifically, we would like advice on the following:
- Data Schema Design: What is the most efficient way to structure our measurement, tags, and fields to handle high write throughput and fast query performance?
- Batching Writes: How can we effectively batch writes to improve performance?
- Sharding and Clustering: Would you recommend using sharding or clustering features, and if so, how should we implement them?
- Indexing: What indexing strategies should we use to ensure quick access to our time-series data?
- Hardware Recommendations: What hardware specifications should we consider for our InfluxDB server to handle this volume of data efficiently?
Thank you for your time and assistance!