Missing metrics when proxying them to outputs.prometheus_client

Hello,

We have a DMZ monitoring host that relays all Telegraf (1.18) metrics from all nodes to our InfluxDB, which works without issues. Now we have also added the Prometheus output config on that DMZ monitoring Telegraf host:

[agent]
  hostname = "fc-r02-srv-monproxy"
  omit_hostname = false
  interval = "600s"
  round_interval = false
  metric_batch_size = 20000
  metric_buffer_limit = 200000
  collection_jitter = "0s"
  flush_interval = "600s"
  flush_jitter = "0s"
  precision = ""
  logfile = ""
  debug = true
  quiet = false
...
[[outputs.prometheus_client]]
collectors_exclude = ["gocollector", "process"]
expiration_interval = "300s"
export_timestamp = false
ip_range = ["192.168.43.0/24", "127.0.0.1/8"]
listen = ":9273"
metric_version = 2
path = "/metrics"
string_as_label = false
tls_cert = "/etc/ssl/private/cert_chain.crt"
tls_key = "/etc/ssl/private/cert.com.key"

If I now check the metrics in Grafana …

All graphs of nodes that go through that DMZ relay have missing metrics. If I scrape the Telegraf instances directly (for nodes that are in the same network and reachable by my Prometheus), it works.

I have no clue where to start looking for the problem …

Any suggestions?

Hi @linuxmail,

Could you share your full config?

I see that you’ve set debug to true; are you seeing anything in the logs?

When you say there are missing metrics, are the metrics sent to InfluxDB different from those in the Prometheus output?

Have you tried setting export_timestamp to true, or shortening your collection interval (agent interval) / flush_interval, to see if that makes a difference?
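
For example, a tightened agent section plus timestamps on the Prometheus output could look roughly like this (the values are only illustrative, not a recommendation for your setup):

[agent]
  interval = "60s"          # collect more often than every 600s
  flush_interval = "60s"    # flush to the outputs just as often

[[outputs.prometheus_client]]
listen = ":9273"
export_timestamp = true     # expose the collection timestamp with each metric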

hi @helenosheaa,

That is the config … from the “relay” node:

# Telegraf Configuration
#
# THIS FILE IS MANAGED BY PUPPET
#
[global_tags]
  dc = "fc"
  domain = "example.com"
  rack = "r02"
  role = "srv"

[agent]
  hostname = "fc-r02-srv-monproxy"
  omit_hostname = false
  interval = "600s"
  round_interval = false
  metric_batch_size = 20000
  metric_buffer_limit = 200000
  collection_jitter = "0s"
  flush_interval = "600s"
  flush_jitter = "0s"
  precision = ""
  logfile = ""
  debug = true
  quiet = false

#
# OUTPUTS:
#
[[outputs.influxdb]]
database = "telegraf"
metric_buffer_limit = "25000"
password = "spinat"
retention_policy = ""
timeout = "180s"
urls = ["https://graph-01.example.com:8086"]
username = "telegraf"
write_consistency = "any"

[[outputs.prometheus_client]]
collectors_exclude = ["gocollector", "process"]
expiration_interval = "300s"
export_timestamp = false
ip_range = ["192.168.43.0/24", "127.0.0.1/8"]
listen = ":9273"
metric_version = 2
path = "/metrics"
string_as_label = false
tls_cert = "/etc/ssl/private/example_chain.crt"
tls_key = "/etc/ssl/private/example.com.key"

[[inputs.http_listener]]
max_body_size = "0"
max_line_size = "0"
read_timeout = "10s"
service_address = ":8086"
write_timeout = "10s"
[[inputs.snmp]]
agents = ["172.21.1.1:161"]
auth_password = "spargel"
auth_protocol = "SHA"
interval = "1m"
priv_password = "karotte"
priv_protocol = "AES"
sec_level = "authPriv"
sec_name = "monitoring"
version = 3
[[inputs.snmp.field]]
is_tag = true
name = "switchname"
oid = "RFC1213-MIB::sysName.0"
[[inputs.snmp.table]]
inherit_tags = ["switchname"]
name = "network_interface"
oid = "IF-MIB::ifTable"
[[inputs.snmp.table.field]]
is_tag = true
name = "ifName"
oid = "IF-MIB::ifName"
[[inputs.snmp.table]]
inherit_tags = ["switchname"]
name = "network_interface_x"
oid = "IF-MIB::ifXTable"
[[inputs.snmp.table.field]]
is_tag = true
name = "ifName"
oid = "IF-MIB::ifName"
[[inputs.snmp.table]]
inherit_tags = ["switchname"]
name = "network_interface_stats"
oid = "EtherLike-MIB::dot3StatsTable"
[[inputs.snmp.table.field]]
is_tag = true
name = "ifName"
oid = "IF-MIB::ifName"
[[inputs.cpu]]
percpu = false
totalcpu = true
[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs", "devfs", "udev"]
[[inputs.diskio]]
[[inputs.io]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
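
(Side note for completeness: the nodes behind the DMZ presumably point their InfluxDB output at this relay’s inputs.http_listener on :8086 instead of at InfluxDB directly; that part is not shown here, so the snippet below is only a sketch, with a hypothetical URL built from the hostname and domain above:)

[[outputs.influxdb]]
database = "telegraf"
urls = ["http://fc-r02-srv-monproxy.example.com:8086"]   # hypothetical URL: the relay's http_listener, not InfluxDB itself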

The config on a normal node is the same, except for the InfluxDB output, the buffer limits, etc., which differ on the relay (fc-r02-srv-monproxy) host.

I have tried a lot … and changed the buffer / timeout etc. values … The InfluxDB values on the Grafana dashboard look valid, while the Prometheus (Thanos) side shows the gaps described above …

What I find strange: the identical config on a normal node works perfectly if I scrape the metrics directly:

# Telegraf Configuration
#
# THIS FILE IS MANAGED BY PUPPET
#
[global_tags]
  dc = "default"
  domain = "example.com"
  rack = "default"
  role = "git"

[agent]
  hostname = "git"
  omit_hostname = false
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  logfile = ""
  debug = false
  quiet = false

#
# OUTPUTS:
#
[[outputs.influxdb]]
database = "telegraf"
password = "kartoffel"
skip_database_creation = true
timeout = "180s"
urls = ["https://graph-01.exampe.com:8086"]
username = "telegraf"
[[outputs.prometheus_client]]
collectors_exclude = ["gocollector", "process"]
expiration_interval = "60s"
export_timestamp = false
ip_range = ["192.168.43.0/24", "127.0.0.1/8"]
listen = ":9273"
metric_version = 2
path = "/metrics"
string_as_label = false
tls_cert = "/etc/ssl/private/example_local_chain.crt"
tls_key = "/etc/ssl/private/example.local.key"

#
# INPUTS:
#
[[inputs.cpu]]
percpu = false
totalcpu = true
[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs", "devfs", "udev"]
[[inputs.diskio]]
[[inputs.io]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

This works perfectly, as Thanos (Prometheus) can reach this node directly.

Very strange. I also have no idea where else to check … The logs look fine too:

...
May  3 13:26:10 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:26:10Z D! [outputs.prometheus_client] Wrote batch of 20000 metrics in 230.060484ms
May  3 13:26:10 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:26:10Z D! [outputs.prometheus_client] Buffer fullness: 325 / 200000 metrics
May  3 13:26:11 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:26:11Z D! [outputs.influxdb] Wrote batch of 20000 metrics in 922.251916ms
May  3 13:26:11 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:26:11Z D! [outputs.influxdb] Buffer fullness: 614 / 200000 metrics
...
May  3 13:28:19 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:19Z D! [outputs.prometheus_client] Wrote batch of 20000 metrics in 536.046541ms
May  3 13:28:19 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:19Z D! [outputs.prometheus_client] Buffer fullness: 286 / 200000 metrics
May  3 13:28:20 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:20Z D! [outputs.influxdb] Wrote batch of 20000 metrics in 917.353407ms
May  3 13:28:20 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:20Z D! [outputs.influxdb] Buffer fullness: 2457 / 200000 metrics
May  3 13:28:50 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:50Z D! [outputs.prometheus_client] Wrote batch of 20000 metrics in 205.839428ms
May  3 13:28:50 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:50Z D! [outputs.prometheus_client] Buffer fullness: 1349 / 200000 metrics
May  3 13:28:51 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:51Z D! [outputs.influxdb] Wrote batch of 20000 metrics in 861.643249ms
May  3 13:28:51 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:51Z D! [outputs.influxdb] Buffer fullness: 2065 / 200000 metrics

From the Git example node …

May  3 13:29:50 git telegraf[26924]: 2021-05-03T11:29:50Z D! [outputs.prometheus_client] Wrote batch of 40 metrics in 8.747049ms
May  3 13:29:50 git telegraf[26924]: 2021-05-03T11:29:50Z D! [outputs.prometheus_client] Buffer fullness: 14 / 10000 metrics
May  3 13:29:50 git telegraf[26924]: 2021-05-03T11:29:50Z D! [outputs.influxdb] Wrote batch of 54 metrics in 80.20002ms
May  3 13:29:50 git telegraf[26924]: 2021-05-03T11:29:50Z D! [outputs.influxdb] Buffer fullness: 27 / 10000 metrics
May  3 13:30:00 git telegraf[26924]: 2021-05-03T11:30:00Z D! [outputs.prometheus_client] Wrote batch of 44 metrics in 1.771804ms
May  3 13:30:00 git telegraf[26924]: 2021-05-03T11:30:00Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
May  3 13:30:00 git telegraf[26924]: 2021-05-03T11:30:00Z D! [outputs.influxdb] Wrote batch of 30 metrics in 20.38124ms
May  3 13:30:00 git telegraf[26924]: 2021-05-03T11:30:00Z D! [outputs.influxdb] Buffer fullness: 37 / 10000 metrics
....
May  3 13:30:10 git telegraf[26924]: 2021-05-03T11:30:10Z D! [outputs.prometheus_client] Wrote batch of 41 metrics in 4.241971ms
May  3 13:30:10 git telegraf[26924]: 2021-05-03T11:30:10Z D! [outputs.prometheus_client] Buffer fullness: 36 / 10000 metrics
May  3 13:30:10 git telegraf[26924]: 2021-05-03T11:30:10Z D! [outputs.influxdb] Wrote batch of 41 metrics in 23.595417ms
May  3 13:30:10 git telegraf[26924]: 2021-05-03T11:30:10Z D! [outputs.influxdb] Buffer fullness: 37 / 10000 metrics

I’ve attached an example “node” output, which gets relayed: pmox-01_prom_output.txt (132.7 KB)

I have changed export_timestamp to true … to see if it helps. Changing the Prometheus scrape interval did not help either. The collection interval / agent interval / flush_interval I have not touched yet.
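
For reference, the relevant part of the relay’s output section now reads like this (only export_timestamp changed; everything else is as in the config above):

[[outputs.prometheus_client]]
listen = ":9273"
metric_version = 2
expiration_interval = "300s"   # worth keeping >= the 600s agent/flush interval if drops reappear
export_timestamp = true        # changed from false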

It would be great … if someone has an idea. :slight_smile:

cu denny

hi,

after adding export_timestamp = true it seems to be OK now … I have no more drops … so I will try my second DC without this option and see if this really was the problem.