Hi all.
I’m working with InfluxDB for my master’s thesis project. Something about the internal mechanics of query execution is not clear to me. I’ll describe my use case in as much detail as possible, hoping someone can help clarify my doubts.
Setup: InfluxDB OSS 2.6, standalone local installation on Windows 10.
I’m dealing with time series data of the kind that InfluxData articles call “events”: the fundamental point is that my data lack any periodicity and are not sampled at regular intervals.
Using a dedicated Java application built on the official client, I inserted about 1.8 million records into a single bucket with an infinite retention policy (i.e., data are never deleted). The tag set consists of the attributes that identify the series (the same ones used as filter conditions in the typical workload), and the timestamp is the pre-existing temporal label of each sample. The data cover roughly the past 5-6 years.
My exact use case, as a prototypical example: get the latest 100 values from the bucket described above, where by “latest” I mean the points with the most recent timestamps.
I have devised a Flux query like this one:
from(bucket: "<bucket name>")
|> range(start: 0)
|> filter(fn: (r) => r._measurement == "<measurement>" and <equal conditions on attributes forming the tagset>)
|> top(n:100, columns:["_time"])
It answers in about 1000 ms, whereas on other DBMSs, with equivalent data structuring and dedicated indexes on the filtered attributes, I was able to get answers in tens of milliseconds. This large difference in performance seems strange to me.
I know that range(start: 0) actually breaks a fundamental InfluxDB recommendation, namely to always specify a time filter. In this case, though, because of the irregular temporal distribution of the data, I cannot: I simply don’t know in advance which time range contains the latest 100 values.
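To make the problem concrete, this is roughly what a time-bounded version of the query would look like. It is only a sketch: the 30-day window is an arbitrary value chosen for illustration, and the tag conditions are left out as in the placeholders above.

```flux
from(bucket: "<bucket name>")
    // Arbitrary 30-day window, used here only for illustration
    |> range(start: -30d)
    // The equality conditions on the tag set would go in this filter as well
    |> filter(fn: (r) => r._measurement == "<measurement>")
    |> top(n: 100, columns: ["_time"])
```

With irregularly spaced events, a window like this may contain fewer than 100 points, or none at all, so I cannot rely on it without knowing the distribution of the data beforehand.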
Nevertheless, based on what I understand from the documentation and some related articles (for example: InfluxDB Internals 101 - Part One | InfluxData), InfluxDB data should be persisted in time order, so I expected that only a small portion of the data distributed across shard groups would have to be scanned to answer the query.
In conclusion, my questions are:
- Is the use case I’ve described one that InfluxDB is optimized for, despite relaxing some of the conditions for “comfortable usage” (aperiodic time series, no explicit time filter, etc.)?
- If yes, what design error is preventing me from getting efficient performance?
- To try to “debug” this on my own, is there some kind of “explain” mode to check which operations InfluxDB actually executes to answer a specific query?
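On the last point, the closest thing I’m aware of is the Flux profiler package. Below is a minimal sketch of how I understand it would be enabled, reusing the placeholders from my query. I’m assuming the profiler is available in the Flux version bundled with OSS 2.6, and the tag conditions are omitted here.

```flux
import "profiler"

// Ask Flux to report both whole-query and per-operator statistics
option profiler.enabledProfilers = ["query", "operator"]

from(bucket: "<bucket name>")
    |> range(start: 0)
    // The equality conditions on the tag set would go in this filter as well
    |> filter(fn: (r) => r._measurement == "<measurement>")
    |> top(n: 100, columns: ["_time"])
```

If this works as I understand it, the profiler results come back as extra tables alongside the query output, but I don’t know whether they show how much stored data was actually read, which is what I would really like to check.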
Any kind of hint is welcome. Thank you in advance.