Telegraf Prometheus output loses precision

danielmotaleite · May 3, 2021, 10:08pm

I have a Telegraf instance that runs an external script to detect a potential AWS EC2 instance termination:

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  logfile = "/var/log//telegraf/telegraf.log"
  quiet = false
  hostname = "backuppc-a01"
  omit_hostname = false

[[outputs.prometheus_client]]
  listen = ":9009"

[[inputs.exec]]
  # get termination time
  commands = [ "/usr/local/sbin/aws-termination.sh" ]
  data_format = "influx"
  timeout = "15s"


$ /usr/local/sbin/aws-termination.sh 
aws_instance_termination,action=instance-stop,host=backuppc-a01 seconds=1192060i

Yet a curl to the Telegraf endpoint outputs this:

aws_instance_termination_seconds{action="instance-stop",host="backuppc-a01",id="backuppc-a01",region="eu-central-1",type="t3.micro",zone="eu-central-1a"} 1.19206e+06

Notice the integer-to-float conversion: a value with one-second precision appears to be rounded, so it stays the same for several minutes.

In Prometheus this shows up as a step in the graph, which breaks alerting.

Notice that the graph always looks like this: if I wait 10 minutes, I still get the same shape, and the part that was flat turns back into the same angled line.

I tried changing the script output to integer, unsigned integer, and float, and I'm unable to fix this.

Any hints on how to fix or work around this? Right now the only way I can see is to increase the Prometheus alert range beyond the step, so that the value always changes within it.

Thanks for the help.

danielmotaleite · May 6, 2021, 9:03pm

OK, after looking at this again, I changed my mind. I don't see a loss of precision in the integer-to-float conversion; I assumed the wrong problem based on a wrong number of samples.
Doing periodic curls to Telegraf, I do not see any problem in the output. So my problem is certainly not in Telegraf.
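
For anyone who wants to repeat the integer-to-float check, here is a minimal Python sketch that parses the exposition line from the first post and compares it with the integer printed by the script. The helper name `sample_value` is hypothetical, and it assumes the value is the last whitespace-separated field of the line (true for the sample, which carries no timestamp).

```python
# Sample line as returned by curl from the prometheus_client listener
# (copied from the first post).
EXPOSITION_LINE = (
    'aws_instance_termination_seconds{action="instance-stop",host="backuppc-a01",'
    'id="backuppc-a01",region="eu-central-1",type="t3.micro",zone="eu-central-1a"} '
    '1.19206e+06'
)

# Integer printed by the exec script (seconds=1192060i).
SCRIPT_SECONDS = 1192060


def sample_value(line: str) -> float:
    """Return the numeric value of one exposition line.

    Hypothetical helper: assumes the value is the last
    whitespace-separated field, i.e. the line has no timestamp.
    """
    return float(line.rsplit(None, 1)[1])


exported = sample_value(EXPOSITION_LINE)

# The scientific notation is only a rendering; compare the numbers themselves.
print(exported == SCRIPT_SECONDS, int(exported))
```

The same helper can be applied to the output of each periodic curl to confirm that the exported value changes between scrapes.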

If I check ALERTS{alertname="Instance_Termination_Long"}, which shows this alert, I do not see anything in the Prometheus alert that would justify the flapping.
So while that first 5-minute step in Prometheus is very weird, I now think the problem I'm having is really an Alertmanager problem.

Sorry about the noise.
