Hi, as suggested I have enabled the influxdb plugin in Telegraf.
It looks like normal utilization, but we are still facing high memory utilization on the server, and eventually the server goes into a hang state.
Please suggest; let me know if any other details are required.
You should try to understand what’s happening when the memory spikes to 100% (and I can’t help with that). Is a continuous query running?
(btw, 13 min of average query duration looks like a lot to me)
13 min for the average query duration is very high. Are these write queries or read queries? You can visualize it too. You should figure out which query/queries take that long and why; those long queries are probably causing the high memory usage.
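If the stock “_internal” monitoring database is enabled (it is by default on InfluxDB 1.x), you can chart that number yourself. A rough sketch, assuming the usual “queryExecutor” field names (queryDurationNs and queriesExecuted are cumulative counters, so the ratio of their rates approximates nanoseconds per query):

SELECT non_negative_derivative(mean("queryDurationNs"), 1s) / non_negative_derivative(mean("queriesExecuted"), 1s) FROM "_internal"."monitor"."queryExecutor" WHERE time > now() - 6h GROUP BY time(1m)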
I am also trying to find long-running queries on InfluxDB. Whenever I run these 2 queries, they return no records:
“SHOW QUERIES” and “SHOW CONTINUOUS QUERIES”
Is there any way to find the query history or all executed queries on InfluxDB?
Have a look at the data behind the “Average Query Duration” panel you see in Grafana. I think the measurement contains more than that, though I’m not sure at what level of detail; if you are lucky it might contain the query text.
If you have no continuous query, then somewhere in a report you have a huge query that might need some work. Query performance is also influenced by the retention policy settings (shard duration), but first you need to find what is causing the issue.
Can you provide the structure of the measurement about queries?
I meant which tags and fields are available in the measurement about queries (it should be the “queryExecutor” measurement), but you will see which one is used in the Grafana query…
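For reference, a minimal way to inspect that measurement from the influx CLI (the “_internal” database and its “monitor” retention policy are the defaults; the names may differ if monitoring was reconfigured):

USE _internal
SHOW FIELD KEYS FROM "queryExecutor"
SHOW TAG KEYS FROM "queryExecutor"

If SHOW TAG KEYS comes back empty, that is exactly the limitation described below.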
Sadly, since no tags are available to further narrow down the search for the guilty query, you will probably need to open some reports (or run whatever usually runs against InfluxDB), starting with the most used ones (since the problem occurs often), see which one takes around 13 minutes to load, and analyze its queries from there.
Maybe someone is querying a huge time range, or there might be some heavy calculation…
Run SHOW QUERIES while the CPU is at 100%; it will return all the queries currently executing.
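A minimal sketch of that workflow (the id below is a made-up example; KILL QUERY takes the qid column returned by SHOW QUERIES):

-- while the spike is happening, list the in-flight queries
SHOW QUERIES
-- if one of them is clearly the culprit, you can stop it by its id
KILL QUERY 36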
Check the HTTP request log: it tracks the requests, and you can get the “query history” from its data. If it is not already configured, have a look at the docs or at the section “2 - Define Logging Settings” of this blog post (the blog itself).
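If it is not obvious where to turn it on, this is roughly what the relevant settings look like in influxdb.conf for 1.x (a sketch; the log path is just an example, and access-log-path requires 1.5 or later):

[http]
  # log every HTTP request handled by the node
  log-enabled = true
  # optionally send the request log to its own file instead of the main log
  access-log-path = "/var/log/influxdb/access.log"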
Manually do what users or the system do daily, and check the response time and memory consumption.
I have observed the InfluxDB logs for a few days and below is the analysis:
As 900+ Telegraf agents are reporting to InfluxDB, there are many POST requests.
I can see GET requests from the Grafana server and a few scheduled queries; these are all SELECT queries, so is it possible that they impact InfluxDB performance?
• 38669 out of 58058 GET requests from A server on 14 Jan 2020.
• 26004 out of 40489 GET requests from A server on 13 Jan 2020.
• 52185 out of 92094 GET requests from A server on 15 Jan 2020.
• 63644 out of 96290 GET requests from A server on 16 Jan 2020.
Those 3 queries are simple, and I doubt they are the cause of the problem.
I can imagine those queries are made by a Grafana chart; all of them have the following pattern:
SELECT mean("used_percent") FROM "telegraf"."autogen"."mem" WHERE ("time" >= '2020-01-10T05:14:00.000000000Z' AND "time" <= '2020-01-10T05:18:59.599999999Z' AND "IP" = '10.56.4.52') GROUP BY "IP"
You can run it yourself and check the response time, but I doubt this is the problem.
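For example, one way to time it from the shell against the standard /query endpoint (a sketch: adjust host, port and authentication to your setup; I used a relative 5-minute window instead of the absolute timestamps):

# run the same kind of query over the HTTP API and measure the wall-clock time
time curl -sG 'http://localhost:8086/query' \
  --data-urlencode "db=telegraf" \
  --data-urlencode "q=SELECT mean(\"used_percent\") FROM \"autogen\".\"mem\" WHERE time >= now() - 5m AND \"IP\" = '10.56.4.52' GROUP BY \"IP\""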