Incorrect host used by Telegraf

Hi

In a nutshell: when collecting syslog data in Docker Swarm, the host tag is wrong for messages from one of the hosts.

System layout:
I have multiple hosts. The Docker Swarm stack is started from Server1 on all servers. My config is pretty much the same as in this thread (@glass_willis thanks for solving this). The full compose file is:

version: "3.7"
services:
  telegraf:
    image: telegraf:1.30
    user: "telegraf:999"
    hostname: "{{.Node.Hostname}}"
    volumes:
      - /:/hostfs:ro
      - /var/run/docker.sock:/var/run/docker.sock
      - /data/stack/telegraf_configurations_test_syslog:/etc/telegraf/telegraf_configurations:ro
      - /data/stack/telegrafOutput:/tmp:rw
    command:
      - '--config-directory'
      - '/etc/telegraf/telegraf_configurations'
      - '--watch-config'
      - 'notify'
    environment:
      - HOST_ETC=/hostfs/etc
      - HOST_PROC=/hostfs/proc
      - HOST_SYS=/hostfs/sys
      - HOST_VAR=/hostfs/var
      - HOST_RUN=/hostfs/run
      - HOST_MOUNT_PREFIX=/hostfs
      - HOSTNAME={{.Node.Hostname}}
    networks:
      - proxy-net
    deploy:
      mode: global
      labels:
        - "traefik.enable=true"
        - "traefik.docker.network=proxy-net"
        - "traefik.tcp.services.telegrafSyslog.loadbalancer.server.port=6514"
        - "traefik.tcp.routers.telegrafSyslog.entrypoints=telegrafSyslog"
        - "traefik.tcp.routers.telegrafSyslog.rule=HostSNI(`*`)"
        - "traefik.tcp.routers.telegrafSyslog.service=telegrafSyslog"
        
networks:
  proxy-net:
    external: true

To be able to filter across the many different services collecting data, the “host” tag needs to be correct. In InfluxDB, the “host” tag I see is always “Server2”, while the “hostname” is reported correctly (Server1 or Server2). Server1 is where I execute the docker commands.

I tried a processor plugin (processors.override) to no avail. My current telegraf.conf looks like this:

[global_tags]

[agent]
# The agent table configures Telegraf and the defaults used across all plugins.
  interval = "2s"
  round_interval = true
  metric_batch_size = 10000
  metric_buffer_limit = 100000
  collection_jitter = "1s"
  flush_interval = "2s"
  flush_jitter = "1s"
  precision = "1ms"
  # debug: Run Telegraf in debug mode.
  debug = true
  # quiet: Run Telegraf in quiet mode (error messages only).
  quiet = false
  # logfile: Specify the log file name. The empty string means to log to stderr. The directory has to exist in advance, else no logfile gets written.
  logfile = "/var/log/telegraf/Telegraf.log"
  # logtarget: Control the destination for logs. Can be one of "file", "stderr" or, on Windows, "eventlog". When set to "file", the output file is determined by the "logfile" setting.
  logtarget = "file"
  # logfile_rotation_interval: Rotates logfile after the time interval specified. When set to 0 no time based rotation is performed.
  logfile_rotation_interval = 0
  # logfile_rotation_max_size: Rotates logfile when it becomes larger than the specified size. When set to 0 no size based rotation is performed.
  logfile_rotation_max_size = "100KB"
  # logfile_rotation_max_archives: Maximum number of rotated archives to keep, any older logs are deleted. If set to -1, no archives are removed.
  logfile_rotation_max_archives = 50
  # log_with_timezone: Set a timezone to use when logging or type "local" for local time. Example: "America/Chicago". See this page for options/formats.
  # hostname: Override default hostname, if empty use os.Hostname().
  hostname = "${HOSTNAME}"
  # omit_hostname: If true, do not set the host tag in the Telegraf agent.
  omit_hostname = false


[[inputs.syslog]]
  alias = "Log_System"
  name_override = "Log_System"
  interval = "1s" # value is ignored by the syslog plugin as it is event driven
  
  ## Protocol, address and port to host the syslog receiver.
  server = "tcp4://localhost:6514"

  ## Framing technique used for messages transport
  ## Available settings are:
  ##   octet-counting  -- see RFC5425#section-4.3.1 and RFC6587#section-3.4.1
  ##   non-transparent -- see RFC6587#section-3.4.2
  framing = "octet-counting"

  # In order to avoid dis- and reconnects, which can create many warnings in syslog, read_timeout and keep_alive_period should be set as follows:

  ## Zero means unlimited.
  read_timeout = "0s"
  ## Zero disables keep alive probes. Defaults to the OS configuration.
  keep_alive_period = "20s"

  # best_effort tries to handle even malformed syslog entries.
  best_effort = true
  [inputs.syslog.tags]
    _in = "LogSystemTest"


[[processors.override]]
  [processors.override.tags]
    host = "${HOSTNAME}"
	
[[outputs.influxdb]]
  alias = "InfluxDB_PCM_Log_System_Test"  
  tagexclude = ["_in"] 
  urls = ["https://192.168.102.109:8087"]
  insecure_skip_verify = true
  database = "InfluxDB_PCM_Log_System_Test"
  username = "telegraf_writer"
  password = "Write@InfluxDB"
  [outputs.influxdb.tagpass]
    _in = ["LogSystemTest"]

Any help on this is highly appreciated! Thanks to all the folks out there helping out!

host is used to report the name of the host that’s running Telegraf and is active by default; you can opt out with the option omit_hostname.

I’m not sure exactly when it gets created, but since you are having this issue, I think it’s appended after all other processing, overriding your own host tag.
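
For illustration, a minimal sketch of the opt-out, assuming the host tag is then set as a static tag on the input instead. The ${HOSTNAME} variable is the one from the compose file above; whether an agent-wide opt-out fits your setup is an assumption on my side.

```toml
[agent]
  ## Stop the agent from adding its own "host" tag to every metric.
  omit_hostname = true

[[inputs.syslog]]
  server = "tcp4://localhost:6514"
  ## With omit_hostname = true, the "host" key can be used as a static tag.
  [inputs.syslog.tags]
    host = "${HOSTNAME}"
```

Note that ${HOSTNAME} still resolves to the node running the Telegraf instance that receives the message, not to the sender of the syslog message.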

Thanks for the answer. I have spent several hours now trying to get this working. Unfortunately, it does not seem to work with the boundary conditions in place: it needs to be easily scalable. Hence no host-specific hard coding; only env vars can be used.

By now I have found that the first of the telegraf services to come up receives all the messages and writes its own hostname to the host tag, no matter what the previous value was.

Since many other services in the setup rely on the agent setting omit_hostname=false, I cannot disable it and write the hostname to the tag myself, as only one agent shall be started.

I have tried host-specific tags in tagpass, which did not work. Another way around the problem would be something like HOST_IP={{.Node.HostIP}} to plug into the server=.... I did not find anything like that, probably because a host has several IP addresses, including loopback and maybe a second physical or virtual network card with its own IPs.

I’ll be honest, I’m not sure I completely get what you are trying to do…
What is certain is that the tag key host should be considered a system-reserved tag when using omit_hostname = false.

You either set it to true, so you can use that tag key freely, or you change your own tag key to a different one (via the config itself, since it’s a static tag, or using [[processors.rename]]: telegraf/plugins/processors/rename/README.md at master · influxdata/telegraf · GitHub).
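
A rough sketch of the second route, in case it helps. The tag keys node_host and syslog_host are hypothetical names picked for this example, and I am assuming that the “hostname” you see in InfluxDB is a tag written by the syslog input:

```toml
## Option A: a static tag under a key other than "host".
## "node_host" is a hypothetical key chosen for this example.
[[inputs.syslog]]
  server = "tcp4://localhost:6514"
  [inputs.syslog.tags]
    node_host = "${HOSTNAME}"

## Option B: rename an existing tag with the rename processor.
## "syslog_host" is a hypothetical destination key.
[[processors.rename]]
  [[processors.rename.replace]]
    tag = "hostname"
    dest = "syslog_host"
```

Either way, the agent keeps writing its own name to host, and your queries filter on the other key instead.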