Telegraf Periodically Stops Sending Metrics

Hi,

My Telegraf container stops sending all metrics to the InfluxDB database after a period of time that seems to depend on the resolution and volume of the data.

It consistently breaks after the same amount of time has passed since the Telegraf container was restarted via a cron job at 8:00 am each day; how long it keeps working depends on the amount of data passing through.

My server setup runs the following relevant Docker containers:

  • MQTT broker (mosquitto)
  • Telegraf
  • InfluxDB

My telegraf.conf:
# Telegraf Configuration

# Global tags can be specified here in key="value" format.
[global_tags]

# Configuration for telegraf agent
[agent]
  interval = "1s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  #collection_interval = "10s"
  collection_jitter = "1s"
  flush_interval = "3s"
  flush_jitter = "1s"
  precision = ""
  hostname = "telegraf-MQTT"
  omit_hostname = false

# test1
[[outputs.influxdb]]
  namepass = ["device1", "device2"]
  database = "test1"
  urls = ["http://influxdb:8086"]
  username = "user-tel"
  password = "pwd2"
  timeout = "5s"

# test3_test
[[outputs.influxdb]]
  namepass = ["device1_test3"]
  database = "test3_test"
  urls = ["http://influxdb:8086"]
  username = "user-tel"
  password = "pwd2"
  timeout = "5s"


# === test1_device1 ===
[[inputs.mqtt_consumer]]
  name_override = "device1"
  servers = ["tcp://123.123.123.123:8883"]
  qos = 0
  persistent_session = false
  ## If unset, a random client ID will be generated.
  client_id = "1002"

  ## Topics that will be subscribed to.
  topics = [
    "test1/processed/device1"
    ]

  username = "user-name"
  password = "pass-word"

  data_format = "json"
  json_name_key = "site"
  tag_keys = ["device"]
  json_time_key = "unix_time"
  json_time_format = "unix"

# === test1_device2 ===
[[inputs.mqtt_consumer]]
  name_override = "device2"
  servers = ["tcp://123.123.123.123:8883"]
  qos = 1
  persistent_session = false
  ## If unset, a random client ID will be generated.
  client_id = "1003"

  ## Topics that will be subscribed to.
  topics = [
    "test1/processed/device2"
    ]

  username = "user-name"
  password = "pass-word"

  data_format = "json"
  json_name_key = "site"
  tag_keys = ["device"]
  json_time_key = "unix_time"
  json_time_format = "unix"

# === test2 ===
[[inputs.mqtt_consumer]]
  name_override = "test2_tel"
  servers = ["tcp://123.123.123.123:8883"]

  qos = 0
  persistent_session = false

  ## If unset, a random client ID will be generated.
  client_id = "1005"

  ## Topics that will be subscribed to.
  topics = [
    "test2/processed/#"
    ]

  username = "user-name"
  password = "pass-word"

  data_format = "json"
  json_name_key = "site"
  tag_keys = ["device"]
  json_time_key = "unix_time"
  json_time_format = "unix"


# === test3_test ===
[[inputs.mqtt_consumer]]
  name_override = "device1_test3"
  servers = ["tcp://123.123.123.123:8883"]
  qos = 1
  persistent_session = true
  ## If unset, a random client ID will be generated.
  client_id = "1030"

  ## Topics that will be subscribed to.
  topics = [
    "test3/raw/device1"
    ]

  username = "user-name"
  password = "pass-word"

  data_format = "csv"
  csv_header_row_count = 0
  # column names renamed to something shorter for this post
  csv_column_names = ["TimeStamp","temp1","temp2","temp3","temp4","temp5","temp6","temp7","temp8","temp9","temp11","temp12","temp13","temp14","temp15","temp16","temp17","temp18","temp19","temp21","temp22","temp23","temp24","temp25","temp26","temp27","temp28","temp29","temp31","temp32","temp33","temp34","temp35","temp36","temp37","temp38","temp39","temp41","temp42","temp43","temp44","temp45","temp46","temp47","temp48","temp49","temp11","temp12","temp13","temp14","temp15","temp16","temp17","temp18","temp19","temp111","temp112","temp113","temp114","temp115","temp116","temp117","temp118","temp119","temp121","temp122","temp123","temp124","temp125","temp126","temp127","temp128","temp129","temp131","temp132","temp133","temp134","temp135","temp136","temp137","temp138","temp139","temp141","temp142","temp143","temp144","temp145","temp146","temp147","temp148","temp149","temp21","temp22","temp23","temp24","temp25","temp26","temp27","temp28","temp29","temp211","temp212","temp213","temp214","temp215","temp216","temp217","temp218","temp219","temp221","temp222","temp223","temp224","temp225","temp226","temp227","temp228","temp229","temp231","temp232","temp233","temp234","temp235","temp236","temp237","temp238","temp239","temp241","temp242","temp243","temp244","temp245","temp246","temp247","temp248","temp249","temp31","temp32","temp33","temp34","temp35","temp36","temp37","temp38","temp39","temp311","temp312","temp313","temp314","temp315","temp316","temp317","temp318","temp319","temp321","temp322","temp323","temp324","temp325","temp326","temp327","temp328","temp329","temp331","temp332","temp333","temp334","temp335","temp336","temp337","temp338","temp339","temp341","temp342","temp343","temp344","temp345","temp346","temp347","temp348","temp349","temp41","temp42","temp43","temp44","temp45","temp46","temp47","temp48","temp49","temp411","temp412","temp413","temp414","temp415","temp416","temp417","temp418","temp419","temp421","temp422","temp423","temp424","temp425","temp426","temp427","temp428","temp429","temp431","temp432","temp433","temp434","temp435","temp436","temp437","temp438","temp439","temp441","temp442","temp443","temp444","temp445","temp446","temp447","temp448","temp449"]
  csv_skip_rows = 0
  csv_skip_columns = 0
  csv_delimiter = ","
  csv_comment = ""
  csv_trim_space = false
  csv_tag_columns = []
  csv_measurement_column = ""

The only (temporary) solution so far has been regular cron jobs that restart the Telegraf container more often than it fails.
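A minimal sketch of that workaround, assuming the container is simply named "telegraf" (the schedule here is just an example):

# crontab entry: restart the Telegraf container every morning at 08:00
0 8 * * * docker restart telegraf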

Any help resolving this is much appreciated. Thank you.


Have you had a look at the Telegraf logs?
In the [agent] section there is a whole set of options related to logging. By default the log goes to stderr; I suggest you write it to a file so you can check what's wrong.
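Something like this in the [agent] section should do it (the path and rotation values are just an example):

  ## send the log to a file instead of stderr; debug = true adds more detail
  debug = true
  logtarget = "file"
  logfile = "/var/log/telegraf/telegraf.log"
  logfile_rotation_interval = "1d"
  logfile_rotation_max_archives = 5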

Hello,
do you have any follow-up?
I have exactly the same issue with Telegraf 1.21.1 outputting to MQTT: it stops sending data, keeps running, and shows no errors in the logs. A Node-RED flow running in parallel still sends data to the MQTT broker.
(This is the Ansible template for my config.)



# Global tags can be specified here in key="value" format.
[global_tags]
  site = "{{ telegraf_collector_site }}"
  area = "{{ telegraf_collector_area }}"

[agent]
  interval = "{{telegraf_collector_interval}}"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  logtarget = "file"
  logfile = "/var/log/telegraf/telegraf.log"
  logfile_rotation_interval = "1d"
  logfile_rotation_max_archives = 5
  hostname = ""
  omit_hostname = false

###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################

[[outputs.mqtt]]
 servers = ["{{telegraf_collector_mqtt_server}}:{{telegraf_collector_mqtt_port}}"] # required.
 topic_prefix = "{{ telegraf_collector_topic_prefix }}"

  ## username and password to connect MQTT server.
 username = "{{telegraf_collector_mqtt_user}}"
 password = "{{telegraf_collector_mqtt_password}}"

 data_format = "json"
 json_timestamp_units = "1ns"



###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################

{% if telegraf_collector_monitor_influxdb %}

[[inputs.prometheus]]
  urls = ["http://{{telegraf_collector_monitor_influxdb_host}}:{{telegraf_collector_monitor_influxdb_port}}/metrics"]

{% endif %}

# Read metrics about cpu usage
[[inputs.cpu]]
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics
  collect_cpu_time = false
  ## If true, compute and report the sum of all non-idle CPU states
  report_active = false


# Read metrics about disk usage by mount point
[[inputs.disk]]
  ## By default stats will be gathered for all mount points.
  ## Set mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]

  ## Ignore mount points by filesystem type.
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]


# Read metrics about disk IO by device
[[inputs.diskio]]
  ## By default, telegraf will gather stats for all devices including
  ## disk partitions.
  ## Setting devices will restrict the stats to the specified devices.
  # devices = ["sda", "sdb", "vd*"]
  ## Uncomment the following line if you need disk serial numbers.
  # skip_serial_number = false
  #
  ## On systems which support it, device metadata can be added in the form of
  ## tags.
  ## Currently only Linux is supported via udev properties. You can view
  ## available properties for a device by running:
  ## 'udevadm info -q property -n /dev/sda'
  ## Note: Most, but not all, udev properties can be accessed this way. Properties
  ## that are currently inaccessible include DEVTYPE, DEVNAME, and DEVPATH.
  # device_tags = ["ID_FS_TYPE", "ID_FS_USAGE"]
  #
  ## Using the same metadata source as device_tags, you can also customize the
  ## name of the device via templates.
  ## The 'name_templates' parameter is a list of templates to try and apply to
  ## the device. The template may contain variables in the form of '$PROPERTY' or
  ## '${PROPERTY}'. The first template which does not contain any variables not
  ## present for the device is used as the device name tag.
  ## The typical use case is for LVM volumes, to get the VG/LV name instead of
  ## the near-meaningless DM-0 name.
  # name_templates = ["$ID_FS_LABEL","$DM_VG_NAME/$DM_LV_NAME"]


# Get kernel statistics from /proc/stat
[[inputs.kernel]]
  # no configuration


# Read metrics about memory usage
[[inputs.mem]]
  # no configuration


# Get the number of processes and group them by status
[[inputs.processes]]
  # no configuration


# Read metrics about swap memory usage
[[inputs.swap]]
  # no configuration


# Read metrics about system load & uptime
[[inputs.system]]
  ## Uncomment to remove deprecated metrics.
  # fielddrop = ["uptime_format"]


{% if telegraf_collector_monitor_docker %}
# # Read metrics about docker containers
[[inputs.docker]]
#   ## Docker Endpoint
#   ##   To use TCP, set endpoint = "tcp://[ip]:[port]"
#   ##   To use environment variables (ie, docker-machine), set endpoint = "ENV"
  endpoint = "unix:///var/run/docker.sock"

  ## Set to true to collect Swarm metrics(desired_replicas, running_replicas)
  gather_services = false

  ## Only collect metrics for these containers, collect all if empty
  container_names = []

  ## Set the source tag for the metrics to the container ID hostname, eg first 12 chars
  source_tag = false

  ## Containers to include and exclude. Globs accepted.
  ## Note that an empty array for both will include all containers
  container_name_include = []
  container_name_exclude = []

  ## Container states to include and exclude. Globs accepted.
  ## When empty only containers in the "running" state will be captured.
  ## example: container_state_include = ["created", "restarting", "running", "removing", "paused", "exited", "dead"]
  ## example: container_state_exclude = ["created", "restarting", "running", "removing", "paused", "exited", "dead"]
  # container_state_include = []
  # container_state_exclude = []

  ## Timeout for docker list, info, and stats commands
  timeout = "5s"

  ## Whether to report for each container per-device blkio (8:0, 8:1...),
  ## network (eth0, eth1, ...) and cpu (cpu0, cpu1, ...) stats or not.
  ## Usage of this setting is discouraged since it will be deprecated in favor of 'perdevice_include'.
  ## Default value is 'true' for backwards compatibility, please set it to 'false' so that 'perdevice_include' setting
  ## is honored.
  perdevice = true

  ## Specifies for which classes a per-device metric should be issued
  ## Possible values are 'cpu' (cpu0, cpu1, ...), 'blkio' (8:0, 8:1, ...) and 'network' (eth0, eth1, ...)
  ## Please note that this setting has no effect if 'perdevice' is set to 'true'
  # perdevice_include = ["cpu"]

  ## Whether to report for each container total blkio and network stats or not.
  ## Usage of this setting is discouraged since it will be deprecated in favor of 'total_include'.
  ## Default value is 'false' for backwards compatibility, please set it to 'true' so that 'total_include' setting
  ## is honored.
  total = false

  ## Specifies for which classes a total metric should be issued. Total is an aggregated of the 'perdevice' values.
  ## Possible values are 'cpu', 'blkio' and 'network'
  ## Total 'cpu' is reported directly by Docker daemon, and 'network' and 'blkio' totals are aggregated by this plugin.
  ## Please note that this setting has no effect if 'total' is set to 'false'
  # total_include = ["cpu", "blkio", "network"]

  ## Which environment variables should we use as a tag
  ##tag_env = ["JAVA_HOME", "HEAP_SIZE"]

  ## docker labels to include and exclude as tags.  Globs accepted.
  ## Note that an empty array for both will include all labels as tags
  docker_label_include = []
  docker_label_exclude = []

  ## Optional TLS Config
  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  ## Use TLS but skip chain & host verification
  # insecure_skip_verify = false
{% endif %}

Unfortunately not; I tried investigating it but couldn't figure out the issue.
As a temporary fix I had cron jobs periodically restart the Telegraf container.
I now just use Python scripts, run as services, to import the data into InfluxDB.

Very similar situation here, with the [[inputs.modbus]] plugin reading a TCP/502 source (Hoymiles DTU-Pro) and the [[outputs.influxdb]] plugin writing to a quite bored Raspberry Pi 3B, which has nothing else to do apart from collecting InfluxDB writes from two senders (one every 140 s, this one every 180 s) and serving a Grafana dashboard every 60 s.

So far I have no logfile. Apparently I'm logging to stderr, running Telegraf as a systemd service.
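Since it runs under systemd, the stderr output should at least end up in the journal; assuming the unit is called telegraf.service, something like this should show it:

# follow recent Telegraf output captured by journald
journalctl -u telegraf.service --since today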

I run:

Telegraf 1.24.2 (git: HEAD@9550e7a5)

via

/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d

on a powerful

Linux FileServer 5.15.0-50-generic #56-Ubuntu SMP Tue Sep 20 13:23:26 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

which is essentially idle between 08:00 and 17:00, apart from the Telegraf job.

My questions:

  • What's the best configuration to send logging to another device's syslog port (rsyslog or Kiwi Syslog viewer)?
  • What's the best configuration to write all logging (including crashes) to /var/log/telegraf.log (see the config below)? I'm surprised to get no logging even though it is configured.
  • Is there a way to watchdog-monitor Telegraf and then restart the service? (See the sketch right after this list.)
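For the last question, the simplest fallback I can think of is letting systemd restart the service whenever it exits abnormally; note this only helps if the process actually crashes rather than hangs. A hypothetical drop-in override:

# /etc/systemd/system/telegraf.service.d/override.conf  (hypothetical)
[Service]
Restart=on-failure
RestartSec=30s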

My setup from which I expected to have logging:

daisy@FileServer:/etc/telegraf$ grep -Eve '(^[ \t]*#)|(^$)' telegraf.conf
[global_tags]
[agent]
  interval = "180s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "5s"
  precision = "0s"
debug = true
logtarget = "file"
logfile = "/var/log/telegraf.log"
logfile_rotation_interval = "24h"
logfile_rotation_max_size = "10MB"
  hostname = ""
  omit_hostname = false

daisy@FileServer:/etc/telegraf$ cat /var/log/telegraf.log
cat: /var/log/telegraf.log: No such file or directory
$

In parallel with [[outputs.influxdb]], I also write metrics to a file:

daisy@FileServer:/etc/telegraf$ grep -Eve '(^[ \t]*#)|(^$)' telegraf.d/telegraf.output.file_in_tmp.conf
[[outputs.file]]
  files = [ "/tmp/metrics.out" ]
  use_batch_format = true
  rotation_interval = "24h"
  rotation_max_size = "10MB"
  data_format = "influx"

daisy@FileServer:/etc/telegraf$ cat /tmp/metrics.out | wc -l
1865

I could check whether the file sink also stopped getting data, but can someone help me bash-convert timestamps that look like "1665723063000000000" into a human-readable format? Something with sed + time @…, or such?
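One approach that seems to work with GNU date is to cut the timestamp down from nanoseconds to seconds first, for example:

# 1665723063000000000 ns since the epoch -> seconds -> human readable
date -d @"$((1665723063000000000 / 1000000000))"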

Just for completeness, here is the (surprisingly simple) input definition:

[[inputs.modbus]]
    name = "DTUpro"
    slave_id = 1
    controller = "tcp://192.168.168.42:502"
    configuration_type = "register"
    holding_registers = [
      { name = "pan1_dVolt",   byte_order = "AB",   data_type = "UINT16",   scale=1.0,    address = [4104]},
      ... and so on ...
    ]

Any hints, help, or next steps appreciated!
Daisy