Incorrect host used by Telegraf

Hi

In a nutshell: Collecting syslog data in docker swarm → the host tag is wrong for one host.

System layout:
I have multiple hosts. Docker swarm is started from Server1 and runs on all servers. My config is pretty much the same as in this thread (@glass_willis thanks for solving this). The full compose file is:

version: "3.7"
services:
  telegraf:
    image: telegraf:1.30
    user: "telegraf:999"
    hostname: "{{.Node.Hostname}}"
    volumes:
      - /:/hostfs:ro
      - /var/run/docker.sock:/var/run/docker.sock
      - /data/stack/telegraf_configurations_test_syslog:/etc/telegraf/telegraf_configurations:ro
      - /data/stack/telegrafOutput:/tmp:rw
    command:
      - '--config-directory'
      - '/etc/telegraf/telegraf_configurations'
      - '--watch-config'
      - 'notify'
    environment:
      - HOST_ETC=/hostfs/etc
      - HOST_PROC=/hostfs/proc
      - HOST_SYS=/hostfs/sys
      - HOST_VAR=/hostfs/var
      - HOST_RUN=/hostfs/run
      - HOST_MOUNT_PREFIX=/hostfs
      - HOSTNAME={{.Node.Hostname}}
    networks:
      - proxy-net
    deploy:
      mode: global
      labels:
        - "traefik.enable=true"
        - "traefik.docker.network=proxy-net"
        - "traefik.tcp.services.telegrafSyslog.loadbalancer.server.port=6514"
        - "traefik.tcp.routers.telegrafSyslog.entrypoints=telegrafSyslog"
        - "traefik.tcp.routers.telegrafSyslog.rule=HostSNI(`*`)"
        - "traefik.tcp.routers.telegrafSyslog.service=telegrafSyslog"
        
networks:
  proxy-net:
    external: true

To be able to filter across the many different services collecting data, the “host” tag needs to be correct. In InfluxDB, the “host” tag I see is always “Server2”, while the “hostname” tag is reported correctly (Server1 or Server2). Server1 is where I execute the docker commands.

I tried the override processor plugin, to no avail. This is what my current telegraf.conf looks like:

[global_tags]

[agent]
# The agent table configures Telegraf and the defaults used across all plugins.
  interval = "2s"
  round_interval = true
  metric_batch_size = 10000
  metric_buffer_limit = 100000
  collection_jitter = "1s"
  flush_interval = "2s"
  flush_jitter = "1s"
  precision = "1ms"
  # debug: Run Telegraf in debug mode.
  debug = true
  # quiet: Run Telegraf in quiet mode (error messages only).
  quiet = false
  # logfile: Specify the log file name. The empty string means to log to stderr. The directory has to exist in advance, else no logfile gets written.
  logfile = "/var/log/telegraf/Telegraf.log"
  # logtarget: Control the destination for logs. Can be one of "file", "stderr" or, on Windows, "eventlog". When set to "file", the output file is determined by the "logfile" setting.
  logtarget = "file"
  # logfile_rotation_interval: Rotates logfile after the time interval specified. When set to 0 no time based rotation is performed.
  logfile_rotation_interval = 0
  # logfile_rotation_max_size: Rotates logfile when it becomes larger than the specified size. When set to 0 no size based rotation is performed.
  logfile_rotation_max_size = "100KB"
  # logfile_rotation_max_archives: Maximum number of rotated archives to keep, any older logs are deleted. If set to -1, no archives are removed.
  logfile_rotation_max_archives = 50
  # log_with_timezone: Set a timezone to use when logging or type "local" for local time. Example: "America/Chicago". See this page for options/formats.
  # hostname: Override default hostname, if empty use os.Hostname().
  hostname = "${HOSTNAME}"
  # omit_hostname: If true, do not set the host tag in the Telegraf agent.
  omit_hostname = false


[[inputs.syslog]]
  alias = "Log_System"
  name_override = "Log_System"
  interval = "1s" # value is ignored by the syslog plugin as it is event driven
  
  ## Protocol, address and port to host the syslog receiver.
  server = "tcp4://localhost:6514"

  ## Framing technique used for messages transport
  ## Available settings are:
  ##   octet-counting  -- see RFC5425#section-4.3.1 and RFC6587#section-3.4.1
  ##   non-transparent -- see RFC6587#section-3.4.2
  framing = "octet-counting"

  # In order to avoid dis- and reconnects, which can create many warnings in syslog, read_timeout and keep_alive_period should be set as follows

  ## Zero means unlimited.
  read_timeout = "0s"
  ## Zero disables keep alive probes. Defaults to the OS configuration.
  keep_alive_period = "20s"

  # best_effort tries to handle even malformed syslog entries.
  best_effort = true
  [inputs.syslog.tags]
    _in = "LogSystemTest"


[[processors.override]]
  [processors.override.tags]
    host = "${HOSTNAME}"
	
[[outputs.influxdb]]
  alias = "InfluxDB_PCM_Log_System_Test"  
  tagexclude = ["_in"] 
  urls = ["https://192.168.102.109:8087"]
  insecure_skip_verify = true
  database = "InfluxDB_PCM_Log_System_Test"
  username = "telegraf_writer"
  password = "Write@InfluxDB"
  [outputs.influxdb.tagpass]
    _in = ["LogSystemTest"]

Any help on this is highly appreciated! Thanks to all those folks out there helping out!

host is used to report the name of the host that’s running Telegraf and is set by default; you can opt out with the omit_hostname option.

I’m not sure exactly when it gets set, but since you are having this issue, I think it is added after all other processing and overrides your own host tag.

Thanks for the answer. I have spent several hours now trying to get this working. Unfortunately it does not seem to work under the constraints in place: it needs to be easily scalable, hence no host-specific hard coding; only env vars can be used.

By now I have found that whichever Telegraf service comes up first receives all the messages and writes its own hostname to the host tag, no matter what the previous value was.

Since many other services in the setup rely on the agent setting omit_hostname = false, I cannot disable it and write the hostname to the tag myself, as only one agent shall be started.

I have tried host-specific tags with tagpass, which did not work. Another way around the problem would be something like HOST_IP={{.Node.HostIP}} to plug into the server=.... I did not find anything like that, probably because a host has several IP addresses, including loopback and maybe a second physical or virtual network card with its own IPs.

I’ll be honest, I’m not sure I completely get what you are trying to do…
What’s sure is that the tag key host is to be considered a system-reserved tag when using omit_hostname = false.

You either set it to true, so you can use that tag key freely, or you change your own tag key to a different one (via the config itself, since it’s a static tag, or using the processors.rename plugin: telegraf/plugins/processors/rename/README.md at master · influxdata/telegraf · GitHub).
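
As a rough sketch of the second option, assuming you keep omit_hostname = false: the tag keys node and source_host below are made-up names for illustration, not anything Telegraf defines. ${HOSTNAME} is the variable from your compose file, and hostname is the tag the syslog input already reports in your data. I have not tested this against your setup.

```toml
# Variant A: store the node name under your own static tag key
# instead of "host". "node" is a hypothetical key name.
# Add this table to your existing [[inputs.syslog]] section.
[[inputs.syslog]]
  # ... your existing syslog settings stay as they are ...
  [inputs.syslog.tags]
    node = "${HOSTNAME}"

# Variant B: rename a tag that already exists on the metric.
# Here the "hostname" tag from the syslog message is renamed to
# the hypothetical key "source_host".
[[processors.rename]]
  [[processors.rename.replace]]
    tag = "hostname"
    dest = "source_host"
```

Either way the reserved host tag is left to the agent, and you filter on your own key in InfluxDB.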