Telegraf scrape Kubernetes Services (inputs.prometheus) Auto-Discovery

Hi,
So I'm using the inputs.prometheus plugin with RBAC configured in K8s. I'm also using the prometheus annotations to discover and scrape the pods in Kubernetes.

Now I would like to scrape some services in Kubernetes. I see in the documentation that this is an option, but only via Consul Catalog (see the attached screenshot):

Hi,

You said you tried the pod annotation route and did not get any metrics. Did you add the correct annotations, as called out in the documentation, to your pods?
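
For reference, the plugin looks for the standard prometheus.io annotations on each pod. A quick way to check and set them from the shell (the pod name, namespace, path, and port here are placeholders to adapt):

	# show the annotations currently on the pod
	kubectl get pod my-pod -n my-namespace -o jsonpath='{.metadata.annotations}'
	# set the annotations the plugin looks for (path/port must match the app)
	kubectl annotate pod my-pod -n my-namespace \
	  prometheus.io/scrape=true \
	  prometheus.io/path=/actuator/prometheus \
	  prometheus.io/port=8080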

Thanks

Hi,

Yes, I added the correct annotations.

Please look at the attachment again; for some reason I cannot add more than one link per post, as the forum complains that I'm a new user.

So scraping the Kubernetes service endpoints works 100% correctly. I would have thought that adding the pod annotations to the corresponding pods (the pods behind that service) would return the same metrics, but it does not.
Below is the inputs.prometheus plugin config, and of course RBAC is working, as it returns a path that it is scraping, as seen in the attachment above.

	[[inputs.prometheus]]
	  metric_version = 2
	  monitor_kubernetes_pods = true
	  pod_scrape_scope = "cluster"
	  pod_scrape_interval = 60
	  response_timeout = "40s"
	  insecure_skip_verify = true
	  monitor_kubernetes_pods_namespace = "namespace"
	  namepass = ['metrics1', 'metrics2', 'metrics3']

OK, so I just verified that my configs are correct. Usually I scrape the service endpoints from a static list, and that works 100%. What I'm doing now is scraping the pods (with the annotations).
I checked the service details via kubectl and verified them against the URLs returned by inputs.prometheus, and they match exactly, but still no metrics…
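
For anyone following along, the comparison can be done with commands along these lines (the service and namespace names are placeholders):

	# list the pod IPs behind the service
	kubectl get endpoints my-service -n my-namespace
	# list the pod IPs directly, to compare with the scrape URLs in the Telegraf log
	kubectl get pods -n my-namespace -o wide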

OK, so it sounds like the plugin is doing the right thing. The question is why it is not collecting anything.

Can you share any further log messages? Preferably from when Telegraf actually attempts a collection at an interval.

It would also be great if you could copy and paste the logs rather than use screenshots.

Every time I try to post anything with more than two links, the forum complains that, as a new user, I'm limited to only two links. That's why I'm uploading screenshots.

Thanks for the logs:

2022-09-13T16:11:40Z E! [inputs.prometheus] Error in plugin: http://10.63.77.202:8080/actuator/prometheus returned HTTP status 404 Not Found
2022-09-13T16:11:40Z E! [inputs.prometheus] Error in plugin: http://10.63.77.25:8080/actuator/prometheus returned HTTP status 404 Not Found
2022-09-13T16:12:42Z E! [inputs.prometheus] Error in plugin: http://10.63.77.202:8080/actuator/prometheus returned HTTP status 404 Not Found
2022-09-13T16:12:42Z E! [inputs.prometheus] Error in plugin: http://10.63.77.25:8080/actuator/prometheus returned HTTP status 404 Not Found
2022-09-13T16:13:27Z D! [outputs.file] Buffer fullness: 0 / 20000 metrics
2022-09-13T16:13:32Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 20000 metrics
2022-09-13T16:13:33Z D! [outputs.prometheus_client] Buffer fullness: 0 / 20000 metrics

The prometheus plugin will go through all URLs, create a goroutine for each URL, and collect data. So while there are a couple that return 404, the others should have been captured.

Are you certain that these endpoints are reporting valid prometheus metrics? I would have expected an error in that case as well, but the fact that no metrics are returned makes me wonder if the metric endpoints are empty.

Hi,

As mentioned above, if I take those endpoints and put them in a static list, they return metrics (a sketch of that list approach follows the config below).
These endpoints all return metrics via Prometheus, as that is the current monitoring app. I'm testing out Telegraf and it works great; it's just this auto-discovery mode of the inputs.prometheus plugin that is giving me an issue:
	[[inputs.prometheus]]
	  metric_version = 2
	  monitor_kubernetes_pods = true
	  pod_scrape_scope = "cluster"
	  pod_scrape_interval = 60
	  response_timeout = "40s"
	  insecure_skip_verify = true
	  monitor_kubernetes_pods_namespace = "namespace"
	  namepass = ['metrics1', 'metrics2', 'metrics3']
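
For comparison, the static-list variant that does work for me looks roughly like this (the URL is a placeholder for the real service endpoints):

	[[inputs.prometheus]]
	  metric_version = 2
	  # static list of endpoints instead of pod auto-discovery (placeholder URL)
	  urls = ["http://my-service.my-namespace.svc:8080/actuator/prometheus"]
	  response_timeout = "40s"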

I ran that telegraf debug command in the container shell and it returns nothing. It's strange.

I'll do a screenshot today with auto-discovery plus the shell command, and then the static list plus the shell command.

Here is my Telegraf Agent Config:

	[[outputs.influxdb_v2]]
	  urls = ["$INFLUXDB_URL"]
	  token = "$INFLUX_TOKEN"
	  organization = "$INFLUX_ORG"
	  bucket = "$INFLUX_BUCKET"

	[agent]
	  interval = "60s"
	  round_interval = false
	  metric_batch_size = 3000
	  metric_buffer_limit = 20000
	  collection_jitter = "10s"
	  flush_interval = "125s"
	  flush_jitter = "20s"
	  precision = "1ns"
	  hostname = ""
	  omit_hostname = false
	  debug = true
	  quiet = false
	  logtarget = "file"
	  logfile = "/etc/telegraf/log"
	  logfile_rotation_max_size = "150MB"
	  logfile_rotation_max_archives = 5

	# Read metrics from one or many apps
	[[inputs.prometheus]]
	  metric_version = 2
	  monitor_kubernetes_pods = true
	  pod_scrape_scope = "cluster"
	  pod_scrape_interval = 60
	  response_timeout = "40s"
	  insecure_skip_verify = true
	  monitor_kubernetes_pods_namespace = "namespace"
	  namepass = ['metrics list']

	[[outputs.file]]
	  files = ["/tmp/metrics.out"]
	  use_batch_format = true
	  rotation_max_size = "150MB"
	  rotation_max_archives = 5
	  data_format = "json"
	  json_timestamp_units = "1s"

	[[outputs.prometheus_client]]
	  expiration_interval = "180s"
	  listen = ":9273"
	  path = "/metrics"
	  string_as_label = false

Hi Josh,

Maybe there is something in the agent config above that is causing an issue.

I did find this in my other logs:

2022-09-13T07:49:33Z D! [inputs.prometheus] registered a delete request for "my-app-bc55d4954-hld5n" in namespace "my-namespace"
2022-09-13T07:49:33Z D! [inputs.prometheus] will stop scraping for "http://10.63.73.142:8080/actuator/prometheus"

I'm not sure if there was an issue with this app at the time, but that is about the only error I can see related to a particular app. I'm scraping quite a few apps, as you can see.

Is there any other debug command that I can use in the shell to verify this plugin?

I'm using this:

telegraf --config /etc/telegraf/telegraf.conf --input-filter prometheus --test --debug

Is there any other debug command that I can use in the shell to verify this plugin?
telegraf --config /etc/telegraf/telegraf.conf --input-filter prometheus --test --debug

Hmm, --test does have some edge cases when using a service input, which the prometheus input with Kubernetes discovery is. I would suggest running with --test-wait 120 to ensure you fulfill at least one collection interval.
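
That is, something along these lines:

	telegraf --config /etc/telegraf/telegraf.conf --input-filter prometheus --test --test-wait 120 --debug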

Unfortunately, I do think we are reaching the end of my knowledge of this plugin. At this point I would file a bug and we can see about building a debug version with more log output to understand why data is not getting parsed.

Hi,

OK, I tried that and got nothing.

Question: is there any other way to scrape K8s service endpoints apart from using Consul Catalog?

Thanks for your help, BTW!

Question: is there any other way to scrape K8s service endpoints apart from using Consul Catalog?

Hmm, I do not know of one.

So I eventually figured out what the problem is: the namepass filtering breaks the inputs.prometheus plugin. Any ideas on what this could be? Or is namepass not compatible with this auto-discovery?
	[[inputs.prometheus]]
	  metric_version = 2
	  monitor_kubernetes_pods = true
	  pod_scrape_scope = "cluster"
	  pod_scrape_interval = 60
	  response_timeout = "40s"
	  insecure_skip_verify = true
	  monitor_kubernetes_pods_namespace = "namespace"
	  namepass = ['metrics1', 'metrics2', 'metrics3']

When I remove the namepass metric filter, the metrics start flowing in.

Sorry for the delay, I've been taking some time off.

Ah! So namepass determines which metric names are emitted. I had wrongly assumed you changed those names for anonymity, but if you don't have any metrics actually called "metrics1", "metrics2", or "metrics3", then nothing will be emitted.

This is helpful in cases where you want to slim down the metrics, or only want certain metrics to go to a specific output, for example.
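
For example, since the /actuator/prometheus path suggests Spring Boot apps, a namepass with patterns matching real metric names might look like the sketch below. The patterns are illustrative; namepass accepts glob-style matching:

	[[inputs.prometheus]]
	  metric_version = 2
	  monitor_kubernetes_pods = true
	  pod_scrape_scope = "cluster"
	  monitor_kubernetes_pods_namespace = "namespace"
	  # keep only metrics whose names match these glob patterns (illustrative)
	  namepass = ["jvm_*", "process_*", "http_server_requests*"]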