Long query time (30sec+) on LANCache Dashboard

Good Morning everyone,

A couple of months ago I wanted to save nginx logs to InfluxDB to build a Grafana dashboard. It works at the moment, but the query to InfluxDB takes 30+ seconds every time Grafana generates the dashboard, on a 16-core VM. htop shows 1600% CPU time, and I'm sure I didn't choose the best way of saving the data into InfluxDB.

Following is a sample line of the logs:

{"timestamp":"1709839342.987","time_local":"07/Mar/2024:20:22:22 +0100","cache_identifier":"steam","remote_addr":"10.0.5.12","forwarded_for":"","remote_user":"","status":"200","bytes_sent":336,"referer":"","user_agent":"Valve/Steam HTTP Client 1.0","upstream_cache_status":"MISS","host":"cache14-fra2.steamcontent.com","http_range":"","method":"GET","path":"/depot/578081/chunk/05be81078eac27dff6aadfe494f391817f4088b9","proto":"HTTP/1.1","scheme":"http"}

Which I'm putting into InfluxDB via Python:

[...]
    # Extract the JSON object from the log line
    log_entry = json.loads(line.strip())

    # Convert the Unix timestamp into a datetime object
    timestamp = float(log_entry.get("timestamp", 0))
    dt_obj = datetime.utcfromtimestamp(timestamp)

    # Create the data point
    data_point = Point("web_access")\
        .tag("host", log_entry.get("host", "unknown"))\
        .tag("cache_identifier", log_entry.get("cache_identifier", "unknown"))\
        .tag("status", log_entry.get("status", "unknown"))\
        .tag("bytes_sent", int(log_entry.get("bytes_sent", 0)))\
        .tag("path", log_entry.get("path", ""))\
        .tag("user_agent", log_entry.get("user_agent", ""))\
        .tag("remote_addr", log_entry.get("remote_addr", ""))\
        .tag("time_local", log_entry.get("time_local", ""))\
        .tag("method", log_entry.get("method", ""))\
        .tag("upstream_cache_status", log_entry.get("upstream_cache_status", ""))\
        .tag("http_range", log_entry.get("http_range", ""))\
        .tag("proto", log_entry.get("proto", ""))\
        .field("scheme", log_entry.get("scheme", ""))\
        .time(dt_obj, WritePrecision.MS)
[...]
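The surrounding script is basically just a loop over the log file that writes each point via the influxdb-client package. Roughly like this (a sketch only; the URL, token, org, and file name are placeholders, not the real values):

import json
from datetime import datetime

from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

# Connection details are placeholders, not the real ones
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

with open("access.json.log") as log_file:
    for line in log_file:
        # Extract the JSON object from the log line
        log_entry = json.loads(line.strip())

        # Convert the Unix timestamp into a datetime object
        timestamp = float(log_entry.get("timestamp", 0))
        dt_obj = datetime.utcfromtimestamp(timestamp)

        # Build the point (tags/fields as shown above) and write it
        data_point = Point("web_access") \
            .tag("cache_identifier", log_entry.get("cache_identifier", "unknown")) \
            .field("scheme", log_entry.get("scheme", "")) \
            .time(dt_obj, WritePrecision.MS)

        write_api.write(bucket="lancache", record=data_point)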

In the end, my Grafana dashboard looks like this:

One of my queries looks like this:

from(bucket: "lancache")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_field"] == "scheme")
  |> drop(columns: ["_start", "_stop"])
  |> group(columns: ["cache_identifier"])
  |> count()
  |> rename(columns: {_value: ""})
  |> yield(name: "count")

Now I'm asking myself why the query takes so long.

I know that information like cache_identifier should probably be saved as a field instead of a tag, but I wasn't able to create a query that works in Grafana that way.
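What I mean is roughly this (just a sketch, reusing log_entry and dt_obj from the script above); with cache_identifier stored as a field I couldn't get the grouped Grafana queries to work:

# Sketch: writing cache_identifier (and other high-cardinality values) as fields instead of tags
data_point = Point("web_access")\
    .tag("host", log_entry.get("host", "unknown"))\
    .field("cache_identifier", log_entry.get("cache_identifier", "unknown"))\
    .field("scheme", log_entry.get("scheme", ""))\
    .field("path", log_entry.get("path", ""))\
    .field("bytes_sent", int(log_entry.get("bytes_sent", 0)))\
    .time(dt_obj, WritePrecision.MS)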

Is there somebody here who could tell me what I could do to improve the query time?

Thanks!

@Gamienator The best way to make Flux queries fast is to leverage pushdown optimizations as much as possible. These “push down” computation to the storage layer so that less data has to be loaded into memory and operated on there.

Right now, drop() after filter() is breaking the pushdown chain, so all the data returned from filter() has to be loaded into memory and the rest of the query operates on it there. I would suggest moving drop() to after count(). You should see a noticeable improvement in query performance:

from(bucket: "lancache")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_field"] == "scheme")
  |> group(columns: ["cache_identifier"])
  |> count()
  |> drop(columns: ["_start", "_stop"])
  |> rename(columns: {_value: ""})
  |> yield(name: "count")

Thank you very much, Scott! My next LAN party is soon, so I'll update and check all the servers. Then I'll try your query, but this already gave me great insight :hugs:

@scott I wasn't able to resist testing your suggestion. On three of four queries it reduced the time from 28 seconds to 7 seconds :exploding_head:

One query is still taking 21 seconds:

from(bucket: "lancache")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_field"] == "scheme")
  |> group(columns: ["scheme"])
  |> aggregateWindow(every: 1m, fn: count, createEmpty: false)
  |> drop(columns: ["_start", "_stop"])

Do you see a possible way to optimize that query as well or do I have to live with that long query time?

Sorry for the slow reply @Gamienator, I was out on vacation. With the query that’s taking 21 seconds, how large of a time range are you querying? How “dense” is the data (how often are points recorded)?

There isn’t anything I can see in the query that would make it slow, so my first thought is how much data is this query actually processing?

One thing you could actually do is run a profiler on the query to see query performance metrics. If you get a chance, it would look something like this:

import "profiler"

option profiler.enabledProfilers = ["query", "operator"]

from(bucket: "lancache")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_field"] == "scheme")
  |> group(columns: ["scheme"])
  |> aggregateWindow(every: 1m, fn: count, createEmpty: false)
  |> drop(columns: ["_start", "_stop"])

This will add two tables to your query results: one with information about the overall query performance, and the other with information about each operation performed in the query execution plan.


No worries Scott, vacation is important :slight_smile: And the next LAN is in October.

Thanks for the hints. The poor query performance could be because of the test PC that InfluxDB is currently running on (i5-7500T), and it's quite a big amount of data. I remember that this query was faster on the production server. The last option would be going into the cloud, but that would be too expensive :smiley:

Thank you very much for your help!