I want to store data in both InfluxDB V1 and V2 in a way that I can retrieve the data in insertion order over a series of successive calls. My timestamps are not reliably increasing, so if I query in timestamp order I will miss data from one call to the next.
I have a working system, but it is very slow and CPU-intensive and I’m hoping there is a better way to approach it.
When writing to the database, I send pointname, timestamp, value, quality and index. Pointname is a tag; value, quality and index are numeric fields. Quality is just a number, and index is a monotonically increasing number; think of it as a row identifier. I can keep track of the last read index in my program, so I can read (for example) 1000 values by filtering for index from 0 to 999, then on a subsequent read from 1000 to 1999, and so on. This works perfectly until the database gets even a little bit large: for InfluxDB V1 the query starts to take several seconds once the database holds more than about 10 million values, and for InfluxDB V2 the same happens at around 40 million values.
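For reference, a single write in line protocol looks roughly like this (the point name, field values and timestamp are made up for illustration; index is shown here as an integer field):
Simulated,pointname=Pump1.Flow value=12.7,quality=192,index=4711i 1717171717000000000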
My InfluxDB V1 query looks like this:
SELECT count("value") AS "mean_value" FROM "DataHub"."Desktop"."Simulated" where index >= 0 and index < 1000 GROUP BY "pointname"
Performance on this is awful, unsurprisingly, since I’m filtering with an inequality on a field. Is there a better way to do this in InfluxDB V1?
For InfluxDB V2, I do not know how to write this query directly. Instead I have a hack that does not handle out-of-order timestamps: I first determine the time range covered by the index range, then query for the data in that time range. I did this because I could not find a way to produce output rows containing pointname, value, quality and timestamp while also filtering on the index range.
data1 = from(bucket: "Desktop")
|> range(start: 0, stop: 2200-01-01)
|> filter(fn: (r) => r._measurement == "Simulated")
|> filter(fn: (r) => r._field == "index" and r._value < 1000 and r._value >= 0)
|> keep(columns: ["_field", "_value", "_time"])
data1
|> min(column: "_value")
|> keep(columns: ["_time"])
|> map(fn: (r) => ({ r with timestamp: uint(v: r._time) }))
|> yield(name: "data2")
data1
|> max(column: "_value")
|> keep(columns: ["_time"])
|> map(fn: (r) => ({ r with timestamp: uint(v: r._time) }))
|> yield(name: "data3")
Then, in a second query, I plug the two timestamps returned as data2 and data3 back in as the range bounds:
from(bucket: "Desktop")
|> range(start: data2, stop: data3)
|> filter(fn: (r) => r._measurement == "Simulated" and r._field =~ /index|quality|value/)
|> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
|> map(fn: (r) => ({ r with time: uint(v: r._time) }))
|> keep(columns: ["time", "index", "quality", "value", "pointname"])
The queries to find the start and end timestamps take most of the time (about 80%), and most of that time is spent in the filter on the index range. To top it off, this isn’t even correct. Ideally I want what I get with V1, where each row contains pointname, value, quality and timestamp for any record whose index is within the range (0 to 999 in this example). Because the data can arrive out of time order, the time range derived from the index range can also contain points whose indexes fall outside that range, so my query as it stands can read the same values more than once.
Is there a way to make the query a) correct in that it returns the data in insertion order, and b) more efficient?
I can freely modify how the data is inserted into InfluxDB - is there a better way to represent it?