InfluxDB memory usage high

Hi Team,

I have influxdb-1.7.8-1.x86_64 running on "Red Hat Enterprise Linux Server release 7.5 (Maipo)".

Over the last few days we have noticed that memory usage is suddenly high. These are the hardware parameters:

              total        used        free      shared  buff/cache   available
Mem:            94G         93G        439M        120M        661M        384M
Swap:           19G        3.6G         16G

Total CPUs allocated: 8

Please suggest how we can control this issue.

What did you change in the last few days? Check whether a continuous query is running.

I have been facing this issue for a long time; I even raised a request on GitHub.

Currently 900+ Telegraf agents are reporting to InfluxDB.

I have checked "SHOW QUERIES" but it returns no results.

Is there any other option to check for long-running queries?

You can try the Telegraf InfluxDB input plugin; it may give you more information about what's happening in the DB instance.
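For reference, a minimal sketch of what enabling that plugin in telegraf.conf might look like (the URL assumes InfluxDB is listening on the default port on the same host):

[[inputs.influxdb]]
  # poll the InfluxDB /debug/vars endpoint for internal runtime metrics
  urls = ["http://localhost:8086/debug/vars"]
  # time limit for each HTTP request
  timeout = "5s"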

Hi, as suggested I have enabled the InfluxDB plugin in Telegraf.
It looks like normal utilization, but we are still facing high memory utilization on the server, and eventually the server goes into a hang state.

Please suggest; let me know if any other details are required.

You should try to understand what's happening when the memory spikes to 100% (and I can't help with that from here). Is a continuous query running?
(By the way, 13 min of average query duration looks like a lot to me.)

Try in influx console:

SHOW CONTINUOUS QUERIES

13 min for an average query is very high. Are these write queries or reads? You can visualize it too. You should figure out which query/queries take long and why. Those long queries are probably causing the high memory usage.

Thanks for the reply!!!

I am also trying to find long-running queries on InfluxDB. Whenever I run these two queries, they return no records:
"SHOW QUERIES" and "SHOW CONTINUOUS QUERIES"

Is there any way to find the query history, or all queries executed on InfluxDB?

Have a look at the data behind the "Average Query Duration" panel that you see in Grafana. I think the measurement contains more than that, though I'm not sure at what level of detail; if you are lucky it might contain the query text.

If you have no continuous query, then somewhere in a report you have a huge query that might need some work. Query performance is also influenced by the retention policy settings (shard duration), but first you need to find what is causing the issue.
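As an example, the "Average Query Duration" panel is often computed from the queryExecutor measurement in the _internal database with something like the query below (the queryDurationNs and queriesExecuted fields are the ones exposed by the 1.x monitor; adjust the time range and grouping as needed):

SELECT non_negative_derivative(mean("queryDurationNs"), 1s) / non_negative_derivative(mean("queriesExecuted"), 1s) AS "avg_query_duration_ns"
FROM "_internal".."queryExecutor"
WHERE time > now() - 24h
GROUP BY time(5m)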

Can you provide the structure of the measurement about queries?

SHOW MEASUREMENTS ON telegraf
name: measurements
name
----
cpu
cpu_util
disk
diskio
kernel
mem
mem_util
msr_atl_c360_agg_info_log
msr_error_code_info
net
net_response
oracle_session_longrunning
procCheck
processes
redis
redis_keyspace
swap
swap_util
system
win_cpu
win_disk
win_diskio
win_mem
win_net
win_services
win_swap
win_system

SHOW MEASUREMENTS ON chronograf
name: measurements
name
----
alerts

SHOW MEASUREMENTS ON _internal
name: measurements
name
----
cq
database
httpd
queryExecutor
runtime
shard
subscriber
tsm1_cache
tsm1_engine
tsm1_filestore
tsm1_wal
write

I meant which tags and fields are available in the measurement about queries (it should be the “queryExecutor” measurement), but you will see which one is used in the Grafana query…
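To check, you could inspect that measurement directly in the influx console; a minimal sketch, assuming the _internal monitoring database is enabled:

USE _internal
SHOW FIELD KEYS FROM "queryExecutor"
SHOW TAG KEYS FROM "queryExecutor"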

This is what I have in the _internal db

Sadly, since no tags are available to narrow down the search for the guilty query, you will probably need to open some reports (or run whatever usually runs against InfluxDB), starting with the most used ones (since the problem occurs often), see which one takes something like 13 minutes to load, and from there analyze its queries.
Maybe someone is querying a huge time range, or there might be some heavy calculation…

Let us know if you find something.

Thanks for the reply!!!

That 13 min was the average over the last 24 hrs.

Can you please help me find out all queries executed in InfluxDB?

I mean, is there any way to get the executed query history?

3 options:

  1. Run SHOW QUERIES while the CPU is at 100%; it will return all queries currently executing.
  2. Check the HTTP request log: it tracks the requests, and you can get the "query history" from its data. If it's not already configured, have a look at the docs or at the section "2- Define Logging Settings" of this blog post (the blog itself); see the config sketch after this list.
  3. Manually do what users or the system do daily, and check the response time and memory consumption.

The best one is probably option 2.
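If you go with option 2, a minimal sketch of the relevant [http] section in influxdb.conf (the log path is just an example; access-log-path is the 1.7.x setting that sends request logging to its own file):

[http]
  # keep the HTTP API enabled and turn on request logging
  enabled = true
  log-enabled = true
  # write HTTP access/request logs to a dedicated file instead of the main log
  access-log-path = "/var/log/influxdb/access.log"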

Another useful setting in this case is "log-queries-after".
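That one lives in the [coordinator] section of influxdb.conf; a hedged example (the 10s threshold is only illustrative):

[coordinator]
  # queries running longer than this threshold are logged as slow queries
  log-queries-after = "10s"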

Thanks for the reply,

I have observed the InfluxDB logs for a few days and below is my analysis:

  • As 900+ Telegraf agents are reporting to InfluxDB, there are many POST requests.
  • I can see GET requests from the Grafana server and a few scheduled queries; these are all SELECT queries, so is it possible they impact InfluxDB performance?

  • 38669 out of 58058 GET requests from server A on 14 Jan 2020.
  • 26004 out of 40489 GET requests from server A on 13 Jan 2020.
  • 52185 out of 92094 GET requests from server A on 15 Jan 2020.
  • 63644 out of 96290 GET requests from server A on 16 Jan 2020.

Queries like:
Jan 17 10:50:08 N2VL-PD-FLU01 influxd: [httpd] 10.5.98.200 - - [17/Jan/2020:10:50:08 +0530] “GET /query?db=telegraf&epoch=s&q=SELECT+mean%28%22used_percent%22%29+from+%22telegraf%22.%22autogen%22.%22mem%22+WHERE+%28%22time%22+%3E%3D+%272020-01-10T05%3A14%3A00.000000000Z%27+and+%22time%22+%3C%3D+%272020-01-10T05%3A18%3A59.599999999Z%27+AND+%22IP%22+%3D+%2710.56.4.52%27%29+GROUP+BY+%22IP%22 HTTP/1.1” 200 151 “-” “python-requests/2.21.0” 06ce1767-38e9-11ea-a512-005056b67104 148463
Jan 17 10:50:08 N2VL-PD-FLU01 influxd: [httpd] 10.5.98.200 - - [17/Jan/2020:10:50:08 +0530] “GET /query?db=telegraf&epoch=s&q=SELECT+%28100-mean%28%22usage_idle%22%29%29+from+%22telegraf%22.%22autogen%22.%22cpu%22+WHERE+%28%22time%22+%3E%3D+%272020-01-11T05%3A14%3A00.000000000Z%27+and+%22time%22+%3C%3D+%272020-01-11T05%3A18%3A59.599999999Z%27+AND+%22IP%22+%3D+%2710.135.0.235%27%29+GROUP+BY+%22IP%22 HTTP/1.1” 200 153 “-” “python-requests/2.21.0” 06cd8de1-38e9-11ea-a510-005056b67104 152424
Jan 17 10:50:08 N2VL-PD-FLU01 influxd: [httpd] 10.5.98.200 - - [17/Jan/2020:10:50:08 +0530] “GET /query?db=telegraf&epoch=s&q=SELECT+%28100-mean%28%22usage_idle%22%29%29+from+%22telegraf%22.%22autogen%22.%22cpu%22+WHERE+%28%22time%22+%3E%3D+%272020-01-11T05%3A14%3A00.000000000Z%27+and+%22time%22+%3C%3D+%272020-01-11T05%3A18%3A59.599999999Z%27+AND+%22IP%22+%3D+%2710.92.204.57%27%29+GROUP+BY+%22IP%22 HTTP/1.1” 200 151 “-” “python-requests/2.21.0” 06e28664-38e9-11ea-a521-005056b67104 25637

So can you please confirm whether GET queries impact the server performance or increase the load on InfluxDB?

Those 3 queries are simple, and I doubt they are the cause of the problem.
I imagine those queries are made by a Grafana chart; all of them have the following pattern:

SELECT mean("used_percent") FROM "telegraf"."autogen"."mem" WHERE ("time" >= '2020-01-10T05:14:00.000000000Z' AND "time" <= '2020-01-10T05:18:59.599999999Z' AND "IP" = '10.56.4.52') GROUP BY "IP"

You can run it yourself and check the response time, but I doubt this is the problem.