Confused by `last()`

I need to extract the _time of the last record in a bucket.
The dataset goes from 2017-05-05... to 2020-09-24..., one row every 5 seconds.
There are missing data (no rows at all) from 2017-10-10... to 2017-11-07...

START = 0
from(bucket: BUCKET)
  |> range(start: START, stop: now())
  |> filter(fn: (r) => r["_measurement"] == MEASUREMENT and r["_field"] == FIELD)
  |> aggregateWindow(every: 15m, fn: mean, createEmpty: false)
  |> last(column: "_time")
  |> yield(name: "mean")

returns 2020-09-24..., the correct answer.

START = 0
from(bucket: BUCKET)
  |> range(start: START, stop: now())
  |> filter(fn: (r) => r["_measurement"] == MEASUREMENT and r["_field"] == FIELD)
  |> last(column: "_time")
  |> yield(name: "mean")

(without the aggregateWindow returns a single 2017-10-10... instead.
It is the last record BEFORE the missing data span, not what I want.

Changing the START time only in the second script:

  • START = 2017-10-10 => 2017-10-10T20:45:30Z (last value before the missing data span)
  • START = 2017-10-11 => empty (it looks like it doesn’t like to start in the middle of the missing data span)
  • START = 2017-10-16 => it returns the correct answer! (but we are in the middle of the missing data span yet)

Any reason why?

Hello @Nemecsek,
I’m sorry I’m having trouble understanding what you mean. Is there any chance you can export your data to CSV and send it my way so I can see what you mean on my end?
Thanks

Hallo @Anaisdg.
thank you for your reply.

Alas the database is our customer’s property and cannot be shared.
It has been imported from MySQL through Python/Pandas into InfluxDB2.
I wrote a flux script to import it directly, but the heavy load (some 11+ million lines read from a small headless server) were enough to kill MySQL service multiple times. In the end I loaded the records in chunks, and saved each individual chunk to InfluxDB2. There is a gap in the data, where a logger didn’t work: they are missing in the original MySQL database.
I can make visualizations, calculate aggregates and any other calculation, no problems.
The problem is last().

I made a schematics about the cases I described in the original post.
According to my understanding, requesting the last() record should always return the same 2020-09-24 record, but it doesn’t…

What could be the reason last() is wrong?

Is it really so hard to understand his question? It seems obvious to me and my colleagues. We had the same problem and were driven away from influx for lack of a proper response. Best wishes!

To concretely answer this question, we have to know the schema of the data. In your queries above, you’re filtering on measurement and field. Are there tags in the data? If there are, then you have multiple series. Flux returns a stream of tables. By default, each table respresents a series – groups of points that share a common measurement, tag set, and field key. last() returns the record for each table in the stream of tables, so if you have multiple series, the query you have above will return the last() record from each series.

Does the data before the gap belong to a different series (have a different tag set) than the data after the gap?

Try ungrouping the data (to just group everything in one huge table) before last():

START = 0
from(bucket: BUCKET)
  |> range(start: START, stop: now())
  |> filter(fn: (r) => r["_measurement"] == MEASUREMENT and r["_field"] == FIELD)
  |> group(columns: [])
  |> last(column: "_time")
  |> yield(name: "last")

@scott,
thank you for your answer.

I am a beginner in flux and think that there must be something I am missing.
Initially I thought I was receiving more last rows, one for each block of data, but there is only one.

There are no tags, just _time, _measurement (string), _field (string), and value (double).
Data have been imported using python/pandas (I checked if each chunk had perhaps a tag added by pandas, but there aren’t any).
I tried your group(columns: []) and nothing changes.

In the meantime I added data for another couple of month, but the filter results looks quite inconsistent.
Here are the result screenshots according to various START datetimes (I hope the image quality is good enough).

START=2017-01-01T00:00:00Z, before the beginning of data:

START=2017-10-10T00:00:00Z, among the last rows of the first data block:

START=2017-10-11T00:00:00Z, in the middle of the missing data block:

START=2017-10-15T00:00:00Z, again in the missing data block:

START=2017-10-16T00:00:0Z, again in the middle of the missing data, but now it returns the correct last()!

START=2020-11-31, after the last row:

START=2020-12-16, after the last row, but now no error…

START=2020-12-31, and here another error:

@Nemecsek Your understanding of last() is correct. Thank you for the screenshots. They do help. Your query is returning only a single series so last() should return the last row returned from the query. It is odd behavior. It appears that results may be getting truncated before being processed by last(), but that shouldn’t happen.

I’d be curious to know what the data looks like before last(), but understand this is client data. Try running the query without last() and check to see what the actual last returned value is. What version of InfluxDB are you using?

About the errors you encountered:


November 31, 2020 isn’t a real date. November only has 30 days.


This is the expected behavior. There aren’t any points after 2020-11-31.


In this case, the start value of range() is after the stop value since stop defaults to now().

@scott,
Thank you again for your answer.

Influxd version is
InfluxDB 2.0.1 (git: 37cc047133) build_date: 2020-11-11T03:53:31Z

About the 2 errors: these COVID-months look eternal. I was sure they had added some extra days to November… Sorry, while copy/pasting I overlooked the proper dates.
It would be nice to intercept those against user-sillyness.

I will give a look ASAP at the data without last()
I really miss a proper InfluxDB viewer to quickly check the DB content.