InfluxDB3 Enterprise: fastest way to get oldest and youngest datetime rows


I need to extract from an InfluxDB3 Enterprise table with

  • 17 rows

  • 40 columns

  • four of which are tags

the datetime of the youngest and oldest rows. My expectation was that this would be extremely fast, since InfluxDB is optimized for time series. On the contrary, I’m running into real trouble: on my PC it takes more than four minutes and a lot of memory. One of the queries I am trying in a Jupyter Notebook is the following one

from influxdb_client_3 import InfluxDBClient3
import time

# Config
influx_token = '[YOUR_TOKEN]'
influx_url = "http://127.0.0.1:8181"
influx_bucket = "[your_bucket]"

client = InfluxDBClient3(
    host=influx_url,
    token=influx_token,
    database=influx_bucket
)

measurement = "XOM-option-5m"

query = f'SELECT FIRST(bid), LAST(bid) FROM "{measurement}"'

print(f"[QUERY] {query}")


t0 = time.time()
result = client.query(query, language="influxql")  # <-- CRITICAL: language="influxql"
elapsed = time.time() - t0

df = result.to_pandas() if hasattr(result, "to_pandas") else result

print(f"\n[RESULT] Elapsed: {elapsed:.2f}s")
print(f"[RESULT] Shape: {df.shape}")
print(f"[RESULT] Columns: {list(df.columns)}")
print(f"\n[RESULT] Data:")
print(df)

Hope I am doing something wrong, otherwise it means I was wrong to choose InfluxDB at all!!!

Thanks.

Your data size is not an issue at all. You can try to modify the query a bit by adding a WHERE clause such as:

query = f'''
SELECT FIRST(bid), LAST(bid)
FROM "{measurement}"
WHERE time >= now() - 30d
'''

If your main use case is “give me the latest row(s) very fast,” also consider configuring a Last Values Cache on that table, which will optimize the query further.

Thank you Suyash,

this may fix the last datetime, provided (but not guaranteed) that the last datetime in the DB is close to now(); but what about the first datetime of the entire table? We have no clue how to limit the search range, I guess.
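In the meantime, one sketch I am considering for the oldest row (untested on my full table; it assumes plain SQL rather than InfluxQL, and it is only fast if the engine can prune parquet files using their min/max time statistics):

```python
def oldest_ts_sql(measurement: str) -> str:
    # SQL (not InfluxQL): ascending sort on time, keep a single row.
    # Double quotes because the table name contains hyphens.
    return f'SELECT time FROM "{measurement}" ORDER BY time ASC LIMIT 1'

sql = oldest_ts_sql("XOM-option-5m")
print(sql)
# Send it through the same endpoint as the other snippets, e.g.:
# requests.get(f"{influx_url}/api/v3/query_sql", params={"db": db, "q": sql})
```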

Claude/Anthropic and ChatGPT gave me a suggestion that seems to work, but I have not found enough mentions of it in any documentation, and this made me a bit uncomfortable; I copy the workaround below

import requests
from datetime import datetime, timezone

influx_url = "http://127.0.0.1:8181"
db = "ThetaData"
measurement = "AAL-option-1d" #"XOM-option-5m"

sql = f"""
SELECT MIN(min_time) AS start_ts, MAX(max_time) AS end_ts
FROM system.parquet_files
WHERE table_name = '{measurement}'
"""
params = {"db": db, "q": sql, "format": "json", "rpc_deduplicate": "false"}

r = requests.get(f"{influx_url}/api/v3/query_sql", params=params, timeout=30)
r.raise_for_status()
row = r.json()[0]
start_ns = row["start_ts"]
end_ns = row["end_ts"]

start_iso = datetime.fromtimestamp(start_ns / 1e9, tz=timezone.utc).isoformat()
end_iso = datetime.fromtimestamp(end_ns / 1e9, tz=timezone.utc).isoformat()

print(f"start_ts (ns): {start_ns}")
print(f"start_ts (UTC): {start_iso}")
print(f"end_ts   (ns): {end_ns}")
print(f"end_ts   (UTC): {end_iso}")

This returns correct results immediately; but I would like to understand it, not just get code from AI. Any reference anywhere?

Thanks.

Your query looks good: system.parquet_files is a valid, metadata‑based way to get approximate global start/end times very quickly. However, there can be recent data in the in-memory/WAL buffer which has not been written to a parquet file yet, so bear that in mind. Yes, using AI is great, but also ask it to explain the answer in more detail, especially the SQL query, so you know what is going on. I would also advise reading our documentation for further clarity, as that is the most up-to-date resource.

Thank you @suyash, do you think there is a more robust and still fast way to get the first datetime when we really don’t know how to reduce the search range? Secondly, is there a way to tell whether there is pending recent data not yet persisted, and in that case a way to force InfluxDB to write it and update system.parquet_files?

Thank you.

PS: in my original post it was 17 million rows, not just 17!

I have a problem with that strategy of interrogating system.parquet_files; it seems the parquet metadata is not updated at every new write, so, for instance, if I start writing to a new table and interrogate system.parquet_files 10 minutes later, the system doesn’t find the files. Is there a way to force the update of system.parquet_files? Alternatively, how can I set up and use the Last Values Cache to quickly retrieve the min and max datetime in the table?

I’m not sure if you can force the update. For the Last Value Cache, it’s a simple two-step process:

  1. Create an LVC and give it a name and other parameters (columns etc.) for which you want to cache the data
  2. Run a query against that LVC to see the cached result

Detailed steps are here: Manage the Last Value Cache in InfluxDB 3 Enterprise | InfluxDB 3 Enterprise Documentation
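For reference, the two steps look roughly like this on the command line (a sketch assuming the `influxdb3` CLI from the Enterprise docs; the cache name `bid_cache` and the column choice are hypothetical, so verify the exact flags against the linked page and your version):

```shell
# 1. Create a Last Value Cache on the table, caching the bid column.
influxdb3 create last_cache \
  --database ThetaData \
  --table XOM-option-5m \
  --value-columns bid \
  bid_cache

# 2. Query it with the last_cache() SQL table function.
influxdb3 query --database ThetaData \
  "SELECT * FROM last_cache('XOM-option-5m', 'bid_cache')"
```

Note that the LVC only accelerates the “latest values” side of the question; it does not help with the oldest datetime in the table.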