Missing metrics when proxying them to outputs.prometheus_client

Hello,

We have a DMZ monitoring host that relays all Telegraf (1.18) metrics from all nodes to our InfluxDB, which works without issues. Now we have also added the Prometheus output config on that DMZ monitoring Telegraf host:

[agent]
  hostname = "fc-r02-srv-monproxy"
  omit_hostname = false
  interval = "600s"
  round_interval = false
  metric_batch_size = 20000
  metric_buffer_limit = 200000
  collection_jitter = "0s"
  flush_interval = "600s"
  flush_jitter = "0s"
  precision = ""
  logfile = ""
  debug = true
  quiet = false
...
[[outputs.prometheus_client]]
collectors_exclude = ["gocollector", "process"]
expiration_interval = "300s"
export_timestamp = false
ip_range = ["192.168.43.0/24", "127.0.0.1/8"]
listen = ":9273"
metric_version = 2
path = "/metrics"
string_as_label = false
tls_cert = "/etc/ssl/private/cert_chain.crt"
tls_key = "/etc/ssl/private/cert.com.key"

If I now check the metrics in Grafana …

All graphs for nodes that go through that DMZ node have missing metrics. If I scrape Telegraf directly (for nodes that are in the same network and that my Prometheus can reach), it works.

I have no clue where to search for the problem …

Any suggestions?

Hi @linuxmail,

Could you share your full config?

I see that you’ve set debug to true; are you seeing anything in the logs?

When you say there are missing metrics, are the metrics sent to InfluxDB different from those in the Prometheus output?

Have you tried setting export_timestamp to true, or shortening your collection interval / agent interval / flush_interval to see if that makes a difference?
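
For example, on the relay that would look roughly like this (the values are only illustrative):

[agent]
  # e.g. shorter than the current 600s
  interval = "60s"
  flush_interval = "60s"

[[outputs.prometheus_client]]
# attach the original metric timestamps to the exposed samples
export_timestamp = true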

hi @helenosheaa,

that is the config from the “relay” node:

# Telegraf Configuration
#
# THIS FILE IS MANAGED BY PUPPET
#
[global_tags]
  dc = "fc"
  domain = "example.com"
  rack = "r02"
  role = "srv"

[agent]
  hostname = "fc-r02-srv-monproxy"
  omit_hostname = false
  interval = "600s"
  round_interval = false
  metric_batch_size = 20000
  metric_buffer_limit = 200000
  collection_jitter = "0s"
  flush_interval = "600s"
  flush_jitter = "0s"
  precision = ""
  logfile = ""
  debug = true
  quiet = false

#
# OUTPUTS:
#
[[outputs.influxdb]]
database = "telegraf"
metric_buffer_limit = "25000"
password = "spinat"
retention_policy = ""
timeout = "180s"
urls = ["https://graph-01.example.com:8086"]
username = "telegraf"
write_consistency = "any"

[[outputs.prometheus_client]]
collectors_exclude = ["gocollector", "process"]
expiration_interval = "300s"
export_timestamp = false
ip_range = ["192.168.43.0/24", "127.0.0.1/8"]
listen = ":9273"
metric_version = 2
path = "/metrics"
string_as_label = false
tls_cert = "/etc/ssl/private/example_chain.crt"
tls_key = "/etc/ssl/private/example.com.key"

[[inputs.http_listener]]
max_body_size = "0"
max_line_size = "0"
read_timeout = "10s"
service_address = ":8086"
write_timeout = "10s"
[[inputs.snmp]]
agents = ["172.21.1.1:161"]
auth_password = "spargel"
auth_protocol = "SHA"
interval = "1m"
priv_password = "karotte"
priv_protocol = "AES"
sec_level = "authPriv"
sec_name = "monitoring"
version = 3
[[inputs.snmp.field]]
is_tag = true
name = "switchname"
oid = "RFC1213-MIB::sysName.0"
[[inputs.snmp.table]]
inherit_tags = ["switchname"]
name = "network_interface"
oid = "IF-MIB::ifTable"
[[inputs.snmp.table.field]]
is_tag = true
name = "ifName"
oid = "IF-MIB::ifName"
[[inputs.snmp.table]]
inherit_tags = ["switchname"]
name = "network_interface_x"
oid = "IF-MIB::ifXTable"
[[inputs.snmp.table.field]]
is_tag = true
name = "ifName"
oid = "IF-MIB::ifName"
[[inputs.snmp.table]]
inherit_tags = ["switchname"]
name = "network_interface_stats"
oid = "EtherLike-MIB::dot3StatsTable"
[[inputs.snmp.table.field]]
is_tag = true
name = "ifName"
oid = "IF-MIB::ifName"
[[inputs.cpu]]
percpu = false
totalcpu = true
[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs", "devfs", "udev"]
[[inputs.diskio]]
[[inputs.io]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

The config of a normal node is the same, except for the InfluxDB output, the buffer limits, and so on, which differ on the relay (fc-r02-srv-monproxy) host.
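
As a rough sketch, a relayed DMZ node’s InfluxDB output points at the relay’s http_listener rather than at InfluxDB directly (hostname and scheme here are only placeholders):

[[outputs.influxdb]]
# writes to the relay, which forwards the metrics to the real InfluxDB
database = "telegraf"
timeout = "180s"
urls = ["http://fc-r02-srv-monproxy.example.com:8086"]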

I have tried a lot and changed the buffer / timeout values etc. The InfluxDB values on the Grafana dashboard look valid, while the Prometheus (Thanos) side is missing data …

What I find strange: the identical config on a normal node works perfectly if I scrape the metrics directly:

# Telegraf Configuration
#
# THIS FILE IS MANAGED BY PUPPET
#
[global_tags]
  dc = "default"
  domain = "example.com"
  rack = "default"
  role = "git"

[agent]
  hostname = "git"
  omit_hostname = false
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  logfile = ""
  debug = false
  quiet = false

#
# OUTPUTS:
#
[[outputs.influxdb]]
database = "telegraf"
password = "kartoffel"
skip_database_creation = true
timeout = "180s"
urls = ["https://graph-01.example.com:8086"]
username = "telegraf"
[[outputs.prometheus_client]]
collectors_exclude = ["gocollector", "process"]
expiration_interval = "60s"
export_timestamp = false
ip_range = ["192.168.43.0/24", "127.0.0.1/8"]
listen = ":9273"
metric_version = 2
path = "/metrics"
string_as_label = false
tls_cert = "/etc/ssl/private/example_local_chain.crt"
tls_key = "/etc/ssl/private/example.local.key"

#
# INPUTS:
#
[[inputs.cpu]]
percpu = false
totalcpu = true
[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs", "devfs", "udev"]
[[inputs.diskio]]
[[inputs.io]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]

This works perfectly, as Thanos (Prometheus) can reach this node directly.

Very strange. I also have no idea where else to check. The logs look fine as well:

...
May  3 13:26:10 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:26:10Z D! [outputs.prometheus_client] Wrote batch of 20000 metrics in 230.060484ms
May  3 13:26:10 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:26:10Z D! [outputs.prometheus_client] Buffer fullness: 325 / 200000 metrics
May  3 13:26:11 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:26:11Z D! [outputs.influxdb] Wrote batch of 20000 metrics in 922.251916ms
May  3 13:26:11 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:26:11Z D! [outputs.influxdb] Buffer fullness: 614 / 200000 metrics
...
May  3 13:28:19 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:19Z D! [outputs.prometheus_client] Wrote batch of 20000 metrics in 536.046541ms
May  3 13:28:19 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:19Z D! [outputs.prometheus_client] Buffer fullness: 286 / 200000 metrics
May  3 13:28:20 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:20Z D! [outputs.influxdb] Wrote batch of 20000 metrics in 917.353407ms
May  3 13:28:20 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:20Z D! [outputs.influxdb] Buffer fullness: 2457 / 200000 metrics
May  3 13:28:50 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:50Z D! [outputs.prometheus_client] Wrote batch of 20000 metrics in 205.839428ms
May  3 13:28:50 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:50Z D! [outputs.prometheus_client] Buffer fullness: 1349 / 200000 metrics
May  3 13:28:51 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:51Z D! [outputs.influxdb] Wrote batch of 20000 metrics in 861.643249ms
May  3 13:28:51 fc-r02-srv-monproxy telegraf[20824]: 2021-05-03T11:28:51Z D! [outputs.influxdb] Buffer fullness: 2065 / 200000 metrics

From the Git example node …

May  3 13:29:50 git telegraf[26924]: 2021-05-03T11:29:50Z D! [outputs.prometheus_client] Wrote batch of 40 metrics in 8.747049ms
May  3 13:29:50 git telegraf[26924]: 2021-05-03T11:29:50Z D! [outputs.prometheus_client] Buffer fullness: 14 / 10000 metrics
May  3 13:29:50 git telegraf[26924]: 2021-05-03T11:29:50Z D! [outputs.influxdb] Wrote batch of 54 metrics in 80.20002ms
May  3 13:29:50 git telegraf[26924]: 2021-05-03T11:29:50Z D! [outputs.influxdb] Buffer fullness: 27 / 10000 metrics
May  3 13:30:00 git telegraf[26924]: 2021-05-03T11:30:00Z D! [outputs.prometheus_client] Wrote batch of 44 metrics in 1.771804ms
May  3 13:30:00 git telegraf[26924]: 2021-05-03T11:30:00Z D! [outputs.prometheus_client] Buffer fullness: 0 / 10000 metrics
May  3 13:30:00 git telegraf[26924]: 2021-05-03T11:30:00Z D! [outputs.influxdb] Wrote batch of 30 metrics in 20.38124ms
May  3 13:30:00 git telegraf[26924]: 2021-05-03T11:30:00Z D! [outputs.influxdb] Buffer fullness: 37 / 10000 metrics
....
May  3 13:30:10 git telegraf[26924]: 2021-05-03T11:30:10Z D! [outputs.prometheus_client] Wrote batch of 41 metrics in 4.241971ms
May  3 13:30:10 git telegraf[26924]: 2021-05-03T11:30:10Z D! [outputs.prometheus_client] Buffer fullness: 36 / 10000 metrics
May  3 13:30:10 git telegraf[26924]: 2021-05-03T11:30:10Z D! [outputs.influxdb] Wrote batch of 41 metrics in 23.595417ms
May  3 13:30:10 git telegraf[26924]: 2021-05-03T11:30:10Z D! [outputs.influxdb] Buffer fullness: 37 / 10000 metrics

I’ve attached an example “node” output, which gets relayed: pmox-01_prom_output.txt (132.7 KB)

I’ve changed export_timestamp = true to see if it helps. Changing the scrape interval also does not help. The collection interval / agent interval / flush_interval were left untouched.
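
For reference, the change on the relay’s Telegraf is just this one flag in the Prometheus client output:

[[outputs.prometheus_client]]
# everything else stays as above; only this flag is flipped from false to true
export_timestamp = true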

It would be great if someone has an idea. :slight_smile:

cu denny

hi,

after adding export_timestamp = true it seems to be OK now. I have no drops anymore, so I will try it in my second DC without this option and see if that was really the problem.

Hi,

Thanks for the update; hopefully the timestamp resolves the problem. Let me know if this is the case!

hi @helenosheaa,

it looks like that was the solution. I did not change anything except enabling this option. On Prometheus I can see:

...
msg="Error on ingesting out-of-order samples" num_dropped=12516
...
msg="Error on ingesting samples with different value but same timestamp"  num_dropped=4

as an example.

cu denny