Telegraf AMQP consumer closing after a few minutes (on docker)

#1

Previously I was running telegraf 1.4.x on Windows without problems. It only ran some Windows metric inputs and an AMQP consumer, and wrote the results to an InfluxDB output plugin. I then installed Docker (Windows 10) and the telegraf container. I’ve tried versions 1.4.5 and latest (1.5.2), and both have the same issue. The AMQP broker runs on another host (RabbitMQ 3.6.6, Erlang 17.3).

Issue: telegraf runs fine for a few minutes, connecting to the AMQP broker, collecting metrics and storing them. After about 4 minutes, however, it closes the connection on an exception (501, Reason “EOF”), reconnects automatically, runs for another couple of minutes, and then closes the connection again (this time without an exception) and never reconnects to the queue. From that point on, the AMQP consumer plugin collects no metrics.
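One reading consistent with this behaviour: whatever logs "connection closed ... trying to reconnect" is attached to the first connection only, so when the replacement connection drops, only "queue closed" is logged and nothing reconnects. The stdlib-only Go simulation below contrasts a close watcher registered once with one re-registered on every new connection. `fakeConn`, `dial` and `supervise` are hypothetical names for illustration; they are not the streadway/amqp or Telegraf API, and this is a sketch of the pattern, not Telegraf's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeConn is a hypothetical stand-in for an AMQP connection (NOT the
// streadway/amqp API). A drop is signalled by one error on closed.
type fakeConn struct {
	id     int
	closed chan error
}

// dial is a hypothetical connect function that hands out numbered connections.
func dial(id int) *fakeConn {
	return &fakeConn{id: id, closed: make(chan error, 1)}
}

// supervise simulates `drops` broker-side disconnects and returns how many
// of them triggered a reconnect. With rewatch=false the close watcher stays
// on the first connection only; with rewatch=true it moves to each new one.
func supervise(drops int, rewatch bool) int {
	conn := dial(1)
	watched := conn // the connection the close watcher is listening on
	reconnects := 0
	for i := 0; i < drops; i++ {
		conn.closed <- errors.New("EOF") // the broker drops the live connection
		select {
		case err := <-watched.closed:
			fmt.Printf("connection %d closed: %v; reconnecting\n", watched.id, err)
			reconnects++
			conn = dial(conn.id + 1)
			if rewatch {
				watched = conn // re-register the watcher on the new connection
			}
		default:
			// Nobody is listening on this connection, so the drop goes unnoticed.
			fmt.Printf("connection %d closed, but nobody is watching it\n", conn.id)
			return reconnects
		}
	}
	return reconnects
}

func main() {
	fmt.Println("watch once, reconnects:", supervise(3, false))
	fmt.Println("watch every connection, reconnects:", supervise(3, true))
}
```

With the one-shot watcher only the first of three drops is handled (1 reconnect), matching the log; re-registering per connection handles all three.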

(1) Any idea why the AMQP consumer plugin throws an exception and closes the connection? Should I file a bug?

(2) Second, why doesn’t the AMQP consumer reconnect after the first attempt? A quick look at the AMQP consumer source code suggests that only one reconnection attempt is made, though I may be misreading it. If I’m reading it correctly, it seems like it should keep trying to reconnect.

telegraf log output:

PS C:\Users\user1> d exec -it docker_telegraf_1 tail -f /var/log/telegraf/telegraf.log
2018-03-10T16:01:40Z D! Output [influxdb] wrote batch of 5 metrics in 71.9857ms
2018-03-10T16:01:50Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:02:04Z D! Attempting connection to output: influxdb
2018-03-10T16:02:04Z D! Successfully connected to output: influxdb
2018-03-10T16:02:04Z I! Starting Telegraf v1.5.2
2018-03-10T16:02:04Z I! Loaded outputs: influxdb
2018-03-10T16:02:04Z I! Loaded inputs: inputs.internal inputs.amqp_consumer
2018-03-10T16:02:04Z I! Tags enabled: host=564e94b2746f host_type=container
2018-03-10T16:02:04Z I! Agent Config: Interval:30s, Quiet:false, Hostname:"564e94b2746f", Flush Interval:10s
2018-03-10T16:02:05Z I! Started AMQP consumer
2018-03-10T16:02:40Z D! Output [influxdb] buffer fullness: 287 / 1000 metrics.
2018-03-10T16:02:40Z D! Output [influxdb] wrote batch of 287 metrics in 83.5954ms
2018-03-10T16:02:50Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:03:00Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:03:10Z D! Output [influxdb] buffer fullness: 5 / 1000 metrics.
2018-03-10T16:03:10Z D! Output [influxdb] wrote batch of 5 metrics in 56.6494ms
2018-03-10T16:03:20Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:03:30Z D! Output [influxdb] buffer fullness: 5 / 1000 metrics.
2018-03-10T16:03:30Z D! Output [influxdb] wrote batch of 5 metrics in 54.2286ms
2018-03-10T16:03:40Z D! Output [influxdb] buffer fullness: 85 / 1000 metrics.
2018-03-10T16:03:40Z D! Output [influxdb] wrote batch of 85 metrics in 72.3162ms
2018-03-10T16:03:50Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:04:00Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:04:10Z D! Output [influxdb] buffer fullness: 5 / 1000 metrics.
2018-03-10T16:04:10Z D! Output [influxdb] wrote batch of 5 metrics in 56.3312ms
2018-03-10T16:04:20Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:04:30Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:04:40Z D! Output [influxdb] buffer fullness: 69 / 1000 metrics.
2018-03-10T16:04:40Z D! Output [influxdb] wrote batch of 69 metrics in 64.7451ms
2018-03-10T16:04:50Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:05:00Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:05:10Z D! Output [influxdb] buffer fullness: 5 / 1000 metrics.
2018-03-10T16:05:10Z D! Output [influxdb] wrote batch of 5 metrics in 55.5146ms
2018-03-10T16:05:20Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:05:30Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:05:40Z D! Output [influxdb] buffer fullness: 132 / 1000 metrics.
2018-03-10T16:05:40Z D! Output [influxdb] wrote batch of 132 metrics in 81.3303ms
2018-03-10T16:05:50Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:06:00Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:06:10Z D! Output [influxdb] buffer fullness: 5 / 1000 metrics.
2018-03-10T16:06:10Z D! Output [influxdb] wrote batch of 5 metrics in 54.9764ms
2018-03-10T16:06:20Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:06:25Z I! AMQP consumer queue closed
2018-03-10T16:06:25Z I! AMQP consumer connection closed: Exception (501) Reason: "EOF"; trying to reconnect
2018-03-10T16:06:25Z I! Started AMQP consumer
2018-03-10T16:06:30Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:06:40Z D! Output [influxdb] buffer fullness: 59 / 1000 metrics.
2018-03-10T16:06:40Z D! Output [influxdb] wrote batch of 59 metrics in 62.9762ms
2018-03-10T16:06:50Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:07:00Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:07:10Z D! Output [influxdb] buffer fullness: 5 / 1000 metrics.
2018-03-10T16:07:10Z D! Output [influxdb] wrote batch of 5 metrics in 53.9521ms
2018-03-10T16:07:20Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:07:30Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:07:40Z D! Output [influxdb] buffer fullness: 87 / 1000 metrics.
2018-03-10T16:07:40Z D! Output [influxdb] wrote batch of 87 metrics in 61.4431ms
2018-03-10T16:07:50Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:08:00Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:08:10Z D! Output [influxdb] buffer fullness: 5 / 1000 metrics.
2018-03-10T16:08:10Z D! Output [influxdb] wrote batch of 5 metrics in 67.682ms
2018-03-10T16:08:20Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:08:25Z I! AMQP consumer queue closed
2018-03-10T16:08:30Z D! Output [influxdb] buffer fullness: 0 / 1000 metrics.
2018-03-10T16:08:40Z D! Output [influxdb] buffer fullness: 5 / 1000 metrics.

telegraf config:

# Telegraf configuration

# Telegraf is entirely plugin driven. All metrics are gathered from the
# declared inputs, and sent to the declared outputs.

# Plugins must be declared in here to be active.
# To deactivate a plugin, comment out the name and any variables.

# Use 'telegraf -config telegraf.conf -test' to see what metrics a config
# file would generate.

# Global tags can be specified here in key="value" format.
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"
  host_type = "container"

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "30s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will cache metric_buffer_limit metrics for each output, and will
  ## flush this buffer on a successful write.
  metric_buffer_limit = 1000
  ## Flush the buffer whenever full, regardless of flush_interval.
  flush_buffer_when_full = true

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "5s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## Logging configuration:
  ## Run telegraf in debug mode
  debug = true
  ## Run telegraf in quiet mode
  quiet = false
  ## Specify the log file name. The empty string means to log to stdout.
  logfile = "/var/log/telegraf/telegraf.log"

  ## Override default hostname, if empty use os.Hostname()
  hostname = ""


###############################################################################
#                                  OUTPUTS                                    #
###############################################################################

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  # The full HTTP or UDP endpoint URL for your InfluxDB instance.
  urls = ["http://ws:8086"]
  # The target database for metrics (telegraf will create it if not exists)
  database = "telegraf"
  # Precision of writes, valid values are "ns", "us" (or "µs"), "ms", "s", "m", "h".
  # note: using second precision greatly helps InfluxDB compression
  precision = "s"

  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "30s"
  # username = "telegraf"
  # password = "metricsmetricsmetricsmetrics"
  # Set the user agent for HTTP POSTs (can be useful for log differentiation)
  user_agent = "telegraf"

###############################################################################
#                                  INPUTS                                     #
###############################################################################

[[inputs.internal]]
  ## If true, collect telegraf memory stats.
  collect_memstats = true


[[inputs.amqp_consumer]]
  ## AMQP url
  url = "amqp://user:password@pi:5672/metrics"
  ## AMQP exchange
  exchange = "influx"
  ## AMQP queue name
  queue = "telegraf"
  ## Binding Key
  binding_key = "#"		# the hash (#) matches all messages regardless of routing key

  ## Controls how many messages the server will try to keep on the network
  ## for consumers before receiving delivery acks.
  #prefetch_count = 50

  ## Auth method. PLAIN and EXTERNAL are supported.
  ## Using EXTERNAL requires enabling the rabbitmq_auth_mechanism_ssl plugin as
  ## described here: https://www.rabbitmq.com/plugins.html
  auth_method = "PLAIN"
  ## Optional SSL Config
  # ssl_ca = "/etc/telegraf/ca.pem"
  # ssl_cert = "/etc/telegraf/cert.pem"
  # ssl_key = "/etc/telegraf/key.pem"
  ## Use SSL but skip chain & host verification
  # insecure_skip_verify = false

  ## Data format to consume.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
  data_format = "influx"



#2

This looks like a bug; can you open an issue on GitHub about it? The EOF probably just indicates that the server closed the connection.


#3

Thanks. I’m having trouble reproducing it, but I’ll file a bug once I find the time to track it down.