Bug report: out of memory when creating "Checks" in "Monitoring & Alerting"

fhrr · February 16, 2020, 8:59am

Description:

Certain actions in Monitoring & Alerting while creating a new Threshold check lead to a fast memory leak and high CPU usage. The container then restarts.

fhrr · February 16, 2020, 9:00am

Container restarted

Presented below:

setup
steps to reproduce

fhrr · February 16, 2020, 9:02am

Details of the setup:

InfluxDB 2.0 version - latest beta

Version 2.0.0 (dd4a6fc)

Telegraf config

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at most
  ## metric_batch_size metrics.
  ## This controls the size of writes that Telegraf sends to output plugins.
  metric_batch_size = 1000

  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  ## This buffer only fills when writes fail to output plugin(s).
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. Maximum flush_interval will be
  ## flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default or when set to "0s", precision will be set to the same
  ## timestamp order as the collection interval, with the maximum being 1s.
  ##   ie, when interval = "10s", precision will be "1s"
  ##       when interval = "250ms", precision will be "1ms"
  ## Precision will NOT be used for service inputs. It is up to each individual
  ## service input to set the timestamp at the appropriate precision.
  ## Valid time units are "ns", "us" (or "µs"), "ms", "s".
  precision = ""

  ## Logging configuration:
  ## Run telegraf with debug log messages.
  debug = false
  ## Run telegraf in quiet mode (error log messages only).
  quiet = false
  ## Specify the log file name. The empty string means to log to stderr.
  logfile = ""

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false
[[outputs.influxdb_v2]]
  ## The URLs of the InfluxDB cluster nodes.
  ##
  ## Multiple URLs can be specified for a single cluster, only ONE of the
  ## urls will be written to each interval.
  ## urls exp: http://127.0.0.1:9999
  urls = ["http://some-influxdb-server.xxx:9999"]

  ## Token for authentication.
  token = "$INFLUX_TOKEN"

  ## Organization is the name of the organization you wish to write to; must exist.
  organization = "some-org"

  ## Destination bucket to write into.
  bucket = "Riga-Telegraf-user-PCs-1d"
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics.
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states.
  report_active = false
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Setting mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]
  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "overlay", "aufs", "squashfs"]
[[inputs.diskio]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

# [[inputs.win_perf_counters]]
#   [[inputs.win_perf_counters.object]]
#    # Processor usage, alternative to native, reports on a per core.
#    ObjectName = "Processor"
#    Instances = ["*"]
#    Counters = ["% Idle Time", "% Interrupt Time", "% Privileged Time", "% User Time", "% Processor Time"]
#    Measurement = "win_cpu"
#    #IncludeTotal=false #Set to true to include _Total instance when querying for all (*).
#    IncludeTotal=true
#

fhrr · February 16, 2020, 9:07am

Details of the setup:

Number of hosts under Telegraf monitoring: 3
All Telegraf agents are installed on Windows 10 Pro.

The container is run with the following command:

docker run -dit \
  --name=influxdb \
  --restart=unless-stopped \
  --network=host \
  -v /influxdb:/root/.influxdbv2 \
  Quay --reporting-disabled

fhrr · February 16, 2020, 9:08am

How to reproduce

1. Go to “Monitoring & Alerting”.
2. Create a new check.
3. Press “Submit” to run the query.

fhrr · February 16, 2020, 9:09am

4. Go from “Define Query” to “Configure Check”.

fhrr · February 16, 2020, 9:10am

5. Run htop in a console to monitor RAM.

fhrr · February 16, 2020, 9:10am

6. Now change “Schedule Every”.
6.1 Set it to 1h instead of 1m.

fhrr · February 16, 2020, 9:11am

RAM consumption and CPU usage grow, then drop: the container has restarted.

fhrr · February 16, 2020, 9:13am

6.2 Refresh the page (do not close it).
Change “Schedule Every” from 1h (which is still shown on the page after the container restart) to 6h.
CPU usage will jump but then return to normal.

6.3 Switch “Schedule Every” back to 1m.
Wait a minute and then switch to 6h.
Memory will be exhausted.
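
Instead of watching htop by hand during the steps above, the container's memory could be logged with a small script. This is only a sketch under assumptions: the container is named influxdb (as in the docker run command earlier), the docker CLI is on PATH, and `docker stats` prints usage in the form "512.3MiB / 7.775GiB". The function names are made up for illustration.

```python
import re
import subprocess
import time

# Binary units as printed by `docker stats` (assumed; extend if yours differ).
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def parse_mem_usage(text):
    """Return the used bytes from a string such as '512.3MiB / 7.775GiB'."""
    used = text.split("/")[0].strip()
    match = re.fullmatch(r"([0-9.]+)\s*([A-Za-z]+)", used)
    if match is None or match.group(2) not in UNITS:
        raise ValueError(f"unrecognised memory value: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def read_container_memory(container="influxdb"):
    """Take one memory reading for the container via `docker stats`."""
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", container],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_mem_usage(result.stdout)


def watch(container="influxdb", interval=5.0, samples=60):
    """Print elapsed seconds and memory in MiB, one line per sample."""
    start = time.monotonic()
    for _ in range(samples):
        elapsed = time.monotonic() - start
        try:
            used = read_container_memory(container)
            print(f"{elapsed:8.1f}s  {used / 1024**2:10.1f} MiB")
        except (subprocess.CalledProcessError, ValueError) as err:
            # The container may be restarting; note it and keep sampling.
            print(f"{elapsed:8.1f}s  no reading ({err})")
        time.sleep(interval)
```

Calling `watch("influxdb")` in a second terminal before changing “Schedule Every” should give a timestamped record of the growth and of the drop at restart, which is easier to attach to a report than a screenshot.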

fhrr · February 16, 2020, 9:14am

Docker dies as well

bamne123 · February 16, 2020, 12:45pm

Wow, I guess the beta version has issues. Does the same happen with the TICK stack 1.x version?

fhrr · February 20, 2020, 11:52am

Tested with latest Beta-4

Version 2.0.0 (3e054aa)

The bug is still there.

John9570 · February 27, 2020, 4:23am

Is anyone going to address these OOM issues? If not, I’ll be moving on to Elasticsearch.

fhrr · March 11, 2020, 6:01pm

Tested Beta 5; it looks like the described bug no longer appears.

However, memory usage slowly rises over time.

after container restart
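
To put a number on "slowly rises", one option is a least-squares slope over logged memory readings. The sketch below assumes the readings are available as (seconds, bytes) pairs; the function name and the sample values in the usage note are hypothetical, not measurements from this setup.

```python
def growth_rate(samples):
    """Least-squares slope of (seconds, bytes) samples, in bytes per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    # Covariance of time and memory, and variance of time (both unnormalised).
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same timestamp")
    return num / den
```

For example, `growth_rate([(0, 100), (10, 200), (20, 300)])` gives 10.0 bytes per second. A slope that stays positive over hours of readings would point to steady growth rather than normal fluctuation.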
