Using telegraf to proxy metrics to prometheus is losing metrics?

I am using telegraf to collect metrics from a local mysqld_exporter, add some labels to them, and expose them via the telegraf prometheus_client output. However, every now and then the metrics exposed by telegraf are incomplete (some are missing), whereas the ones from mysqld_exporter are always complete.

My config is this:

[global_tags]
  tablet_type = "replica"
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "0s"
  hostname = ""
  omit_hostname = false
[[inputs.prometheus]]
  urls = ["http://localhost:10104/metrics"]
  interval = "2s"
[[outputs.prometheus_client]]
  listen = ":9104"
  expiration_interval = "10s"
  metric_version = 1

Continuously retrieving metrics from the prometheus output port and grepping for a few of them shows that they are not always present:

user@host:~$ while true; do date; curl -s localhost:9104/metrics |  egrep --color '^(mysql_version_info|mysql_slave_status_seconds_behind_master|mysql_slave_status_slave_io_running|mysql_slave_status_slave_sql_running|mysql_slave_status_master_server_id)' | wc -l; echo; sleep 1; done
Wed Nov  8 04:33:50 PST 2023
5

Wed Nov  8 04:33:52 PST 2023
5

Wed Nov  8 04:33:53 PST 2023
5

Wed Nov  8 04:33:54 PST 2023
5

Wed Nov  8 04:33:55 PST 2023
3

Wed Nov  8 04:33:57 PST 2023
3

Wed Nov  8 04:33:58 PST 2023
3

Wed Nov  8 04:33:59 PST 2023
3

Wed Nov  8 04:34:00 PST 2023
3

Wed Nov  8 04:34:02 PST 2023
3

Wed Nov  8 04:34:03 PST 2023
4

Wed Nov  8 04:34:04 PST 2023
4

Wed Nov  8 04:34:05 PST 2023
4

Wed Nov  8 04:34:07 PST 2023
5

whereas doing the same directly against the mysqld_exporter port always returns all 5 metrics:

user@host:~% while true; do date; curl -s localhost:10104/metrics |  egrep --color '^(mysql_version_info|mysql_slave_status_seconds_behind_master|mysql_slave_status_slave_io_running|mysql_slave_status_slave_sql_running|mysql_slave_status_master_server_id)' | wc -l; echo; sleep 1; done
Wed Nov  8 04:33:50 PST 2023
5

Wed Nov  8 04:33:52 PST 2023
5

Wed Nov  8 04:33:53 PST 2023
5

Wed Nov  8 04:33:55 PST 2023
5

Wed Nov  8 04:33:56 PST 2023
5

Wed Nov  8 04:33:58 PST 2023
5

Wed Nov  8 04:33:59 PST 2023
5

Wed Nov  8 04:34:01 PST 2023
5

Wed Nov  8 04:34:02 PST 2023
5

Wed Nov  8 04:34:04 PST 2023
5

Wed Nov  8 04:34:06 PST 2023
5

Wed Nov  8 04:34:07 PST 2023
5

What might be the cause of this? Thanks!


Edit:

  • Missing time range in one of the bash outputs.
  • Formatting.

Just to close this thread out, since we chatted on Slack:

After some experimentation, it seems the metrics were missing because the metric buffer filled up and metrics were dropped. Increasing the buffer size has helped.

Indeed. To be more specific, this is what I changed:

-  metric_batch_size = 1000
-  metric_buffer_limit = 10000
+  metric_batch_size = 30000
+  metric_buffer_limit = 100000

I picked the numbers after scraping mysqld_exporter directly, counting the metrics there (~27000), and running telegraf with --debug to see how full the buffer was and how often it was being written.
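
For anyone wanting to repeat the counting step, something along these lines should work. It is only a rough sketch: the helper name count_samples is made up for illustration, and it assumes the exporter serves the Prometheus text exposition format, where lines starting with # are HELP/TYPE comments and every other non-empty line is one sample. The number of metrics telegraf ends up buffering may differ somewhat (histogram and summary samples can be grouped), so treat the result as an estimate when sizing metric_buffer_limit.

```shell
#!/bin/sh
# count_samples: hypothetical helper (not part of telegraf or mysqld_exporter).
# Reads Prometheus text-format output on stdin and prints the number of
# sample lines, skipping '#' comment lines and blank lines.
count_samples() {
  # grep -c exits non-zero when the count is 0; '|| true' keeps the
  # function usable under 'set -e'.
  grep -c -v -e '^#' -e '^[[:space:]]*$' || true
}

# Usage (port taken from the config in the question):
#   curl -s localhost:10104/metrics | count_samples
```

If the number printed is larger than metric_buffer_limit, a single scrape already cannot fit in the buffer, which matches the dropped metrics seen above.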

I also played a bit with interval and expiration_interval to have fresh metrics while at the same time not having telegraf consume too much CPU.

E.
