Long query time (30sec+) on LANCache Dashboard

Good Morning everyone,

A couple of months ago I wanted to save nginx logs to InfluxDB to build a Grafana dashboard. It works at the moment, but the query to InfluxDB takes 30+ seconds every time Grafana generates the dashboard, on a 16-core VM. htop shows 1600% CPU time, and I'm sure I didn't choose the best way of saving the data into InfluxDB.

Following is a sample line of the logs:

{"timestamp":"1709839342.987","time_local":"07/Mar/2024:20:22:22 +0100","cache_identifier":"steam","remote_addr":"10.0.5.12","forwarded_for":"","remote_user":"","status":"200","bytes_sent":336,"referer":"","user_agent":"Valve/Steam HTTP Client 1.0","upstream_cache_status":"MISS","host":"cache14-fra2.steamcontent.com","http_range":"","method":"GET","path":"/depot/578081/chunk/05be81078eac27dff6aadfe494f391817f4088b9","proto":"HTTP/1.1","scheme":"http"}

Which I'm putting into InfluxDB via Python:

[...]
    # Extract the JSON object from the log line
    log_entry = json.loads(line.strip())

    # Convert the Unix timestamp into a datetime object
    timestamp = float(log_entry.get("timestamp", 0))
    dt_obj = datetime.utcfromtimestamp(timestamp)

    # Create the data point
    data_point = Point("web_access")\
        .tag("host", log_entry.get("host", "unknown"))\
        .tag("cache_identifier", log_entry.get("cache_identifier", "unknown"))\
        .tag("status", log_entry.get("status", "unknown"))\
        .tag("bytes_sent", int(log_entry.get("bytes_sent", 0)))\
        .tag("path", log_entry.get("path", ""))\
        .tag("user_agent", log_entry.get("user_agent", ""))\
        .tag("remote_addr", log_entry.get("remote_addr", ""))\
        .tag("time_local", log_entry.get("time_local", ""))\
        .tag("method", log_entry.get("method", ""))\
        .tag("upstream_cache_status", log_entry.get("upstream_cache_status", ""))\
        .tag("http_range", log_entry.get("http_range", ""))\
        .tag("proto", log_entry.get("proto", ""))\
        .field("scheme", log_entry.get("scheme", ""))\
        .time(dt_obj, WritePrecision.MS)
[...]
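The surrounding script is basically just a loop over the log file that writes each point via the influxdb-client package. Roughly like this (a sketch only; the URL, token, org, and file name are placeholders, not the real values):

import json
from datetime import datetime

from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

# Connection details are placeholders, not the real ones
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

with open("access.json.log") as log_file:
    for line in log_file:
        # Extract the JSON object from the log line
        log_entry = json.loads(line.strip())

        # Convert the Unix timestamp into a datetime object
        timestamp = float(log_entry.get("timestamp", 0))
        dt_obj = datetime.utcfromtimestamp(timestamp)

        # Build the point (tags/fields as shown above) and write it
        data_point = Point("web_access") \
            .tag("cache_identifier", log_entry.get("cache_identifier", "unknown")) \
            .field("scheme", log_entry.get("scheme", "")) \
            .time(dt_obj, WritePrecision.MS)

        write_api.write(bucket="lancache", record=data_point)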

In the end, my Grafana dashboard looks like this:

One of my queries looks like this:

from(bucket: "lancache")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_field"] == "scheme")
  |> drop(columns: ["_start", "_stop"])
  |> group(columns: ["cache_identifier"])
  |> count()
  |> rename(columns: {_value: ""})
  |> yield(name: "count")

Now I'm asking myself why the query takes so long.

I know that information like cache_identifier should probably be saved as a field instead of a tag, but I wasn't able to create a query that works in Grafana that way.
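What I mean is roughly this (just a sketch, reusing log_entry and dt_obj from the script above); with cache_identifier stored as a field I couldn't get the grouped Grafana queries to work:

# Sketch: writing cache_identifier (and other high-cardinality values) as fields instead of tags
data_point = Point("web_access")\
    .tag("host", log_entry.get("host", "unknown"))\
    .field("cache_identifier", log_entry.get("cache_identifier", "unknown"))\
    .field("scheme", log_entry.get("scheme", ""))\
    .field("path", log_entry.get("path", ""))\
    .field("bytes_sent", int(log_entry.get("bytes_sent", 0)))\
    .time(dt_obj, WritePrecision.MS)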

Is there somebody here who could tell me what I could do to improve the query time?

Thanks!

@Gamienator The best way to make Flux queries fast is to leverage pushdown optimizations as much as possible. These “push down” computation to the storage layer so that less data has to be loaded into memory and operated on there.

Right now, drop() after filter() is breaking the pushdown chain, so all the data returned from filter() has to be loaded into memory and the rest of the query operates on it there. I would suggest moving drop() to after count(). You should see a noticeable improvement in query performance:

from(bucket: "lancache")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_field"] == "scheme")
  |> group(columns: ["cache_identifier"])
  |> count()
  |> drop(columns: ["_start", "_stop"])
  |> rename(columns: {_value: ""})
  |> yield(name: "count")

Thank you very much, Scott! My next LAN party is soon, so I'll update and check all the servers. Then I'll try your query, but this already gave me great insight :hugs:

@scott I wasn't able to resist testing your suggestion. On three of four queries it reduced the time from 28 seconds to 7 seconds :exploding_head:

One query is still taking 21 seconds:

from(bucket: "lancache")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_field"] == "scheme")
  |> group(columns: ["scheme"])
  |> aggregateWindow(every: 1m, fn: count, createEmpty: false)
  |> drop(columns: ["_start", "_stop"])

Do you see a possible way to optimize that query as well or do I have to live with that long query time?

Sorry for the slow reply @Gamienator, I was out on vacation. With the query that’s taking 21 seconds, how large of a time range are you querying? How “dense” is the data (how often are points recorded)?

There isn’t anything I can see in the query that would make it slow, so my first thought is how much data is this query actually processing?

One thing you could actually do is run a profiler on the query to see query performance metrics. If you get a chance, it would look something like this:

import "profiler"

option profiler.enabledProfilers = ["query", "operator"]

from(bucket: "lancache")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_field"] == "scheme")
  |> group(columns: ["scheme"])
  |> aggregateWindow(every: 1m, fn: count, createEmpty: false)
  |> drop(columns: ["_start", "_stop"])

This will add two tables to your query results: one with information about the overall query performance, and the other with information about each operation performed in the query execution plan.


No worries Scott, vacation is important :slight_smile: And the next LAN is in October.

Thanks for the hints. The poor query performance could be because of the test PC that InfluxDB is currently running on (i5-7500T), and it's quite a big amount of data. I remember that this query was faster on the production server. The last option would be going into the cloud, but that would be too expensive :smiley:

Thank you very much for your help!