InfluxDB Performance with Non-periodic Time Series: Retrieve Latest Points

Hi all.

I’m working with InfluxDB for my master’s thesis project. Something about its internal query-answering mechanisms is not clear to me. I will describe my use case in as much detail as possible, hoping someone can help me clarify my doubts.

Setup: InfluxDB OSS 2.6, standalone local installation on Windows 10.

I’m dealing with time series data of the kind that InfluxData articles define as “Events”. The fundamental point is that my data lack any kind of periodicity: they are not sampled at regular intervals.

Using a dedicated Java application built on the official client, I have inserted about 1.8 million records into a single bucket with an infinite retention policy (i.e., data are never deleted). As the tag set I specified the attributes that identify the series (they are also the filter conditions used to retrieve them in a typical workload). As the timestamp I used the pre-existing temporal label of the samples (the data cover roughly the past 5-6 years).

My exact use case, as a prototypical example, is: get the latest 100 values from the bucket described above, where by “latest” I mean the points with the most recent timestamps.

I have devised a Flux query like this one:

from(bucket: "<bucket name>")
  |> range(start: 0)
  |> filter(fn: (r) => r._measurement == "<measurement>" and <equal conditions on attributes forming the tagset>)
  |> top(n:100, columns:["_time"])

It answers in about 1000 ms, while on other DBMSs, with an equivalent data structure and dedicated indexes on the filtered attributes, I was able to get answers in tens of ms. This large difference in performance seems strange to me.

I know that range(start: 0) breaks a fundamental InfluxDB guideline, namely to always specify a time filter. In this case, however, I cannot: because of the irregular temporal distribution of the data, I don’t know in advance which time range contains the latest 100 values.
Nevertheless, based on what I understand from the documentation and some related articles (for example: InfluxDB Internals 101 - Part One | InfluxData), InfluxDB data should be persisted in time order, so I expected that only a small portion of the data distributed across shard groups would have to be scanned to answer the query.

In conclusion, my questions are:

  1. Is the use case I’ve described one that InfluxDB is optimized for, despite relaxing some of the conditions for “comfortable usage” (aperiodic time series, no explicit time filter, etc.)?
  2. If yes, what design error is preventing me from obtaining efficient performance?
  3. To “debug” this on my own, is there some kind of “explain” mode that shows which operations InfluxDB actually executes to answer a specific query?

Any kind of hint is welcome. Thank you in advance.

I’ve found a partial answer to question 3 on my own: Optimize Flux queries | InfluxDB OSS 2.6 Documentation
Nevertheless, I was not able to find any reference on how to interpret the so-called flux/query-plan output, and the quantitative values in the operator profiler output are not intuitive to me.

Most importantly, I have made no progress on questions 1 and 2.
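For reference, a minimal sketch of how the profilers described on that page can be enabled for the query above. The bucket, measurement and tag conditions are the same placeholders as before and must be filled in before running; the sketch only shows how to turn the profilers on, not how to interpret their output.

```flux
import "profiler"

// Enable both profilers: "query" reports overall statistics (including the
// flux/query-plan), "operator" reports per-operation timings.
option profiler.enabledProfilers = ["query", "operator"]

from(bucket: "<bucket name>")
  |> range(start: 0)
  |> filter(fn: (r) => r._measurement == "<measurement>" and <equal conditions on attributes forming the tagset>)
  |> top(n: 100, columns: ["_time"])
```

With the profilers enabled, the profiler results are returned as additional tables alongside the query result.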

Hello @antio,
Yes, the cold and hot shards and the Flux planner all contribute to this issue.
The ability to obtain a global min, max, last, etc. has been a long-standing issue.
Luckily, these issues are being solved with InfluxDB Cloud powered by IOx.

  1. The best you could do is to use a task that continuously writes the last point, with its original timestamp, to a bucket with a small retention policy, so that the last value is always retained.

  2. NA

  3. NA
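A minimal sketch of the task suggested in point 1, under stated assumptions: the task name, the 1h interval and the destination bucket placeholder are hypothetical, and the range only picks up points whose own timestamps fall within the last task interval, so it fits data that arrive with current timestamps rather than a historical backfill.

```flux
// Hypothetical task: name and schedule are illustrative assumptions.
option task = {name: "retain-latest-points", every: 1h}

from(bucket: "<bucket name>")
  // Only scan the window since the previous run (assumes recent timestamps).
  |> range(start: -task.every)
  |> filter(fn: (r) => r._measurement == "<measurement>")
  // Keep the most recent point of each series; _time is left unchanged.
  |> last()
  // Copy it, with its original timestamp, into the small-retention bucket.
  |> to(bucket: "<latest-values bucket>")
```

Queries for the latest values can then target the smaller bucket. Keep in mind that retention is evaluated against point timestamps, so a short retention period only suits points whose timestamps are recent.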

I’m sorry; your frustration is valid. Concerns and questions like these are why engineering has built a new storage engine.

Thank you for your response @Anaisdg.

This gives me the premise for another question. Storing all data points within the same bucket with an infinite retention policy (i.e., never delete data), I get a resulting shard group duration of 7d (I verified this by calling the /api/v2/buckets endpoint).

A.
This means that if my data have timestamps spanning years, only points within the same 7 days are grouped into the same shard, right?

B.
So, if I manually increase the shard group duration (with --shard-group-duration in a new bucket), my data should be distributed across fewer shards, which increases the probability of finding the latest values all in the same shard. Is that reasonable?

C.
In conclusion, does this rule of thumb hold: if the data you need to retrieve are all within the same shard, query performance is better?

D.
Under which precise conditions is a shard considered HOT or COLD? In the documentation I only find “When a shard is no longer actively written to, InfluxDB compacts shard data, resulting in a “cold” shard.”, which is too vague in my opinion. Could a shard that is subject to many reads become HOT again?