Random missing metric datapoints - Prometheus output plugin

Hi team,
I’m using Telegraf to collect metrics at each endpoint, and these Telegraf agents are configured to send to a central Telegraf server. The central Telegraf server acts as a proxy, consolidating all the endpoint metrics and exposing them via the Prometheus output plugin. I notice gaps in the datapoints, and these gaps sometimes go up to 5 minutes, which leads to alerts.

My endpoint config:
interval: 10s
flush interval: 10s

Central telegraf server (using influxdb_listener):
flush interval: 5s
buffer limit: 200000
batch size: 100000
Prometheus output plugin:
expiration interval: 60s
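
Roughly, the endpoint side looks like this in TOML (the listener URL and port below are placeholders, not the exact values):

[agent]
interval = "10s"
flush_interval = "10s"

# Endpoint output pointing at the central Telegraf's influxdb_listener
[[outputs.influxdb]]
urls = ["http://central-telegraf:8186"]  # placeholder address and port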

Prometheus is configured to scrape the telegraf prometheus endpoint every 5s.

Correction: Prometheus is configured to scrape the telegraf prometheus endpoint every 10s with a 5s timeout.

Hello @Sathiyaprakash,
Can you share your telegraf config?
Also can you set debug=true and share the logs at the times of these gaps?
It’s unlikely that this is a telegraf error.
@jpowers do you have any recommendation for how to debug this?

When I enabled an additional telegraf output to a local influxdb, I do not see any missing datapoints in influxdb. I suspect it’s something related to the prometheus output plugin.

@Sathiyaprakash,

When you say you are missing metrics, can you provide an example of the metric?

It sounds like your setup on the collection nodes is:

Unknown input -> influxdb (?)

What inputs are you using, and can you confirm your output?

And on the central telegraf is:

influxdb_listener -> prometheus_client

Also what version of telegraf are you using?

Issue: The metric is not completely missed, but I see gaps in the datapoints. The gaps sometimes reach 3-5 min.

A sample metric that has missing data points is disk_used_percent. Also, it’s not the same metric all the time; it’s just random.

Endpoints:

Endpoint inputs: Metrics collected include cpu, memory, systemd, disk, etc.
Endpoint output: Endpoints are configured to send to the central telegraf influxdb_listener.

Central Telegraf Server and version:

Central Telegraf has 2 output configured:

  1. Prometheus output(primary)
  2. Influxdb ( created for troubleshooting this issue)

Telegraf Version in central telegraf server: Telegraf 1.28.5
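
Roughly, the central server receives these via something like the following influxdb_listener input (the service_address below is a placeholder, not the exact port):

[[inputs.influxdb_listener]]
service_address = ":8186"  # placeholder port; whatever the endpoints are configured to send to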

Please try v1.30 or newer. There was a fix that went in around data not expiring: fix(outputs.prometheus_client): Ensure v1 collector data expires promptly by powersj · Pull Request #14232 · influxdata/telegraf · GitHub

If upgrading does not help with the issue, then my next suggestion would be to add [[outputs.file]] to your config. This will print the metrics to stdout by default. It would be good to then run and reproduce the issue; when you see something missing, go look at the metrics in stdout and see when the last time was that the metric was received.
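
For example, something like this alongside your existing outputs (stdout is the default target for this plugin):

[[outputs.file]]
files = ["stdout"]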

Including those logs would be extremely helpful in further triage.

@jpowers, I will plan for the telegraf upgrade. Meanwhile the issue re-occurred: there was no data for disk_used_percent for 18-20 min. There were no errors in the telegraf logs, which are in debug mode.
Also, I wanted to understand the effect of expiration_interval and export_timestamp.

[[outputs.prometheus_client]]
# required
listen = ":9990"
path = "/metrics"
metric_version = 2
expiration_interval = "60s"
#export_timestamp = true

I could see datapoints in influxdb during the period for which they were missing in prometheus. Do expiration_interval and export_timestamp play any role in the missing metrics?

export_timestamp

Adds the timestamp to the Prometheus metric or not.

expiration_interval

Determines the longest a metric will exist before it expires from the output.
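
In config terms, something along these lines (the values are only illustrative):

[[outputs.prometheus_client]]
# Drop metrics from /metrics if they have not been updated within this window
expiration_interval = "60s"
# Include each metric's collection timestamp in the exposed output
export_timestamp = true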

Without actual logs and a config demonstrating what you are doing I’m not sure how to help further.

Central Telegraf Configs:

Telegraf.conf

[agent]
interval = "10s"
round_interval = true
metric_batch_size = 100000
metric_buffer_limit = 200000
collection_jitter = "0s"
flush_interval = "5s"
flush_jitter = "0s"
precision = "0s"
debug = true
quiet = false
hostname = ""
omit_hostname = false

/etc/telegraf/telegraf.d/outputs_prometheus_client.conf

[[outputs.prometheus_client]]
listen = ":9990"
path = "/metrics"
metric_version = 2
expiration_interval = "60s"
#export_timestamp = true

Hi @jpowers
I have shared the config in this thread and have sent the logs to you. Please check and assist.


The logs you provided do not make it clear to me what is going on. You sent me the influxdb_evidence file, which looks like a query result from InfluxDB itself? You previously stated that you saw no issues with InfluxDB. Is that not the case?

Based on what you have shared, it is not clear to me with the Prometheus output what metric is actually missing and when.

  1. I had attached the influxdb report to show that the influxdb output has no issues during the time the metrics were missing in prometheus.
  2. The missing metric is “cpu_usage_idle”
  3. Missing Time : approx 13:00 to 13:23 UTC
  4. Telegraf logs don’t show any errors during this period either.

@jpowers, I even tried setting the flush interval to 30 sec and the prometheus scrape interval to 15 sec so we do not miss any metrics, but we still see gaps in the metrics.

@Sathiyaprakash,

I have run with your config and watched the metrics endpoint for data with zero issues.

Nothing you have shared shows an actual issue with Telegraf, especially if you see the data sent to InfluxDB successfully. My suggestion is that if you see the issue or are able to reproduce it, you somehow capture what the /metrics endpoint looks like at that time.

I am most suspicious of whatever you have scraping the endpoint, and whether it is not capturing all the data or is dropping values that may not have changed from the previous scrape.

@jpowers, I’m using prometheus to scrape the metrics from the telegraf prometheus output plugin. I have also set it (prometheus) to scrape every 15 sec (10 sec timeout) while setting the central telegraf flush interval to 30 sec.