Telegraf not using endpoint_url for cloudwatch

Issue:

We have a requirement to override the default endpoint “https://monitoring.us-east-2.amazonaws.com/” with a custom VPC endpoint, “https://vpcendpoint.monitoring.us-east-2.vpce.amazonaws.com”, to collect CloudWatch metrics. After we set “endpoint_url”, Telegraf ignores it and throws errors.

The same request works when we test it with the AWS CLI:

aws cloudwatch --endpoint-url "https://vpcendpoint.monitoring.us-east-2.vpce.amazonaws.com" list-metrics --namespace AWS/EBS --output text

Per Telegraf’s cloudwatch input documentation (telegraf/README.md at master · influxdata/telegraf · GitHub), we are using the following configuration to collect CloudWatch metrics:

Telegraf Configuration

[global_tags]

[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
debug = false
quiet = false
logtarget = "file"
logfile = "/var/log/telegraf/telegraf.log"
logfile_rotation_interval = "24h"
logfile_rotation_max_archives = -1
hostname = ""
omit_hostname = false

[[outputs.influxdb]]
urls = ["http://localhost:8086"]
database = "telegraf_PRD"
retention_policy = ""
write_consistency = "any"

[[inputs.cloudwatch]]
region = "us-east-2"
access_key = "xxxxx"
secret_key = "xxxx"
period = "5m"
delay = "5m"
interval = "5m"
namespaces = ["AWS/ElastiCache"]
ratelimit = 25
endpoint_url = "https://vpcendpoint.monitoring.us-east-2.vpce.amazonaws.com"

[[inputs.cloudwatch.metrics]]
names = ["IsMaster", "CPUUtilization", "EngineCPUUtilization", "SwapUsage", "BytesUsedForCache", "FreeableMemory", "NetworkBytesIn", "NetworkBytesOut", "ReplicationBytes", "ReplicationLag", "CurrConnections", "NewConnections", "CurrItems", "Reclaimed", "CacheHits", "CacheMisses", "Evictions", "GetTypeCmds", "SetTypeCmds"]

[[inputs.cloudwatch.metrics.dimensions]]
name = "CacheClusterId"
value = "*"

Telegraf shows the following in the log:

2023-02-07T14:54:16Z E! [inputs.cloudwatch] failed to list metrics with namespace AWS/EFS: operation error CloudWatch: ListMetrics, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://monitoring.us-east-2.amazonaws.com/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2023-02-07T14:54:16Z E! [inputs.cloudwatch] failed to list metrics with namespace AWS/EBS: operation error CloudWatch: ListMetrics, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://monitoring.us-east-2.amazonaws.com/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2023-02-07T14:55:18Z E! [inputs.cloudwatch] failed to list metrics with namespace AWS/ElastiCache: operation error CloudWatch: ListMetrics, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://monitoring.us-east-2.amazonaws.com/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Could this be a bug?

What version of telegraf are you using? This sounds vaguely similar to an older issue that was fixed with https://github.com/influxdata/telegraf/pull/10841

The version we are running is: Telegraf 1.25.0 (git: HEAD@4d17ec79)

Given the comment on the endpoint_url parameter, what happens if you unset that value? What URL is used then? The comment seems to indicate that the URL should be determined automatically.

Post "https://monitoring.us-east-2.amazonaws.com/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Note that a “context deadline exceeded” error can also be caused by a number of other networking issues, for example DNS going down, a network blip, or a proxy.
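
To tell these causes apart, it can help to probe each endpoint directly from the Telegraf host. Below is a minimal Python sketch, not part of Telegraf; the `probe` helper and the labels it returns are hypothetical. It checks DNS first, then sends a plain unauthenticated GET with proxies disabled. Any HTTP status, even an error status, would suggest the endpoint is reachable, while a timeout points more towards routing or firewall rules.

```python
import socket
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def probe(url: str, timeout: float = 5.0) -> str:
    """Return a coarse label describing how far a request to `url` gets."""
    host = urlsplit(url).hostname
    try:
        # Step 1: does the hostname resolve at all?
        socket.getaddrinfo(host, None)
    except socket.gaierror:
        return "dns-failure"

    # Step 2: send a GET while ignoring any HTTP(S)_PROXY environment settings.
    # An empty ProxyHandler mapping disables proxy auto-detection.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return f"http-{resp.status}"
    except urllib.error.HTTPError as err:
        # The server answered, just not with a success status.
        return f"http-{err.code}"
    except urllib.error.URLError as err:
        reason = err.reason
    except OSError as err:
        reason = err

    if isinstance(reason, socket.timeout):
        return "timeout"
    return "connect-error"
```

Calling `probe("https://monitoring.us-east-2.amazonaws.com/")` and `probe("https://vpcendpoint.monitoring.us-east-2.vpce.amazonaws.com/")` on the same host gives a quick comparison of the two paths without involving Telegraf or credentials.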

If endpoint_url is unset, the URL is https://monitoring.us-east-2.amazonaws.com.

If the proxy is enabled, Telegraf still uses https://monitoring.us-east-2.amazonaws.com; however, the Telegraf log doesn’t show this. We found this out by looking at our proxy logs.

Our servers are configured to talk to the VPC endpoint without any proxy.

The following AWS CLI command FAILS without the proxy:

       aws cloudwatch list-metrics --namespace AWS/EBS --output text

Output of above command:

      [root@serv1 telegraf]# aws cloudwatch list-metrics --namespace AWS/EBS --output text

      Connect timeout on endpoint URL: "https://monitoring.us-east-2.amazonaws.com/"

The following AWS CLI command PASSES without the proxy:

       aws cloudwatch --endpoint-url "https://vpcendpoint.monitoring.us-east-2.vpce.amazonaws.com" list-metrics --namespace AWS/EBS --output text

Output of above command:

       [root@serv1 telegraf]# aws cloudwatch --endpoint-url https://vpcendpoint.monitoring.us-east-2.vpce.amazonaws.com list-metrics --namespace AWS/EBS --output text
       None
       METRICS VolumeQueueLength       AWS/EBS
       DIMENSIONS      VolumeId        vol-010620cd373ba1e33
       METRICS VolumeQueueLength       AWS/EBS
       DIMENSIONS      VolumeId        vol-03908a4d51cd23e5b
       METRICS VolumeIdleTime  AWS/EBS
       DIMENSIONS      VolumeId        vol-062c4a96b8ebeb259
       METRICS VolumeQueueLength       AWS/EBS
       DIMENSIONS      VolumeId        vol-0dc035668e6cb12ee

If the proxy is enabled, Telegraf still uses https://monitoring.us-east-2.amazonaws.com; however, the Telegraf log doesn’t show this.

What error or message do you get from telegraf in this case?

If proxy is enabled, all I see in the telegraf log is:

2023-02-07T15:27:59Z I! Starting Telegraf 1.25.0
2023-02-07T15:27:59Z I! Available plugins: 228 inputs, 9 aggregators, 26 processors, 21 parsers, 57 outputs, 2 secret-stores
2023-02-07T15:27:59Z I! Loaded inputs: apache cloudwatch (3x) cpu disk diskio influxdb internal jolokia2_agent mem net swap system
2023-02-07T15:27:59Z I! Loaded aggregators:
2023-02-07T15:27:59Z I! Loaded processors:
2023-02-07T15:27:59Z I! Loaded secretstores:
2023-02-07T15:27:59Z I! Loaded outputs: influxdb
2023-02-07T15:27:59Z I! Tags enabled: host=serv1

Can you enable debug mode please?

proxy=enabled
debug=enabled

2023-02-07T17:21:09Z I! Starting Telegraf 1.25.0
2023-02-07T17:21:09Z I! Available plugins: 228 inputs, 9 aggregators, 26 processors, 21 parsers, 57 outputs, 2 secret-stores
2023-02-07T17:21:09Z I! Loaded inputs: apache cloudwatch (3x) cpu disk diskio influxdb internal jolokia2_agent mem net swap system
2023-02-07T17:21:09Z I! Loaded aggregators:
2023-02-07T17:21:09Z I! Loaded processors:
2023-02-07T17:21:09Z I! Loaded secretstores:
2023-02-07T17:21:09Z I! Loaded outputs: influxdb
2023-02-07T17:21:09Z I! Tags enabled: host=serv1
2023-02-07T17:21:09Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"aws-tig-300-pv.ipa.snapbs.com", Flush Interval:10s
2023-02-07T17:21:09Z D! [agent] Initializing plugins
2023-02-07T17:21:09Z D! [agent] Connecting outputs
2023-02-07T17:21:09Z D! [agent] Attempting connection to [outputs.influxdb]
2023-02-07T17:21:09Z D! [agent] Successfully connected to outputs.influxdb
2023-02-07T17:21:09Z D! [agent] Starting service inputs
2023-02-07T17:21:19Z D! [outputs.influxdb] Wrote batch of 301 metrics in 17.405934ms
2023-02-07T17:21:19Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:21:29Z D! [outputs.influxdb] Wrote batch of 306 metrics in 17.415334ms
2023-02-07T17:21:29Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:21:39Z D! [outputs.influxdb] Wrote batch of 306 metrics in 17.766097ms
2023-02-07T17:21:39Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:21:49Z D! [outputs.influxdb] Wrote batch of 306 metrics in 21.137483ms
2023-02-07T17:21:49Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:21:59Z D! [outputs.influxdb] Wrote batch of 306 metrics in 17.295794ms
2023-02-07T17:21:59Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:22:00Z D! [inputs.cloudwatch] no metrics found to collect
2023-02-07T17:22:09Z D! [outputs.influxdb] Wrote batch of 398 metrics in 19.436063ms
2023-02-07T17:22:09Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:22:19Z D! [outputs.influxdb] Wrote batch of 306 metrics in 17.188467ms
2023-02-07T17:22:19Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:22:29Z D! [outputs.influxdb] Wrote batch of 306 metrics in 21.379194ms
2023-02-07T17:22:29Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:22:39Z D! [outputs.influxdb] Wrote batch of 306 metrics in 17.27799ms
2023-02-07T17:22:39Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:22:49Z D! [outputs.influxdb] Wrote batch of 306 metrics in 17.11858ms
2023-02-07T17:22:49Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:22:59Z D! [outputs.influxdb] Wrote batch of 306 metrics in 16.998235ms
2023-02-07T17:22:59Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:23:09Z D! [outputs.influxdb] Wrote batch of 306 metrics in 17.380779ms
2023-02-07T17:23:09Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:23:19Z D! [outputs.influxdb] Wrote batch of 306 metrics in 20.272756ms
2023-02-07T17:23:19Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:23:29Z D! [outputs.influxdb] Wrote batch of 306 metrics in 17.138346ms
2023-02-07T17:23:29Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:23:39Z D! [outputs.influxdb] Wrote batch of 306 metrics in 17.346709ms
2023-02-07T17:23:39Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:23:49Z D! [outputs.influxdb] Wrote batch of 306 metrics in 17.035755ms
2023-02-07T17:23:49Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:23:59Z D! [outputs.influxdb] Wrote batch of 306 metrics in 17.412485ms
2023-02-07T17:23:59Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:24:00Z D! [inputs.cloudwatch] no metrics found to collect

proxy=disabled
debug=enabled

2023-02-07T17:31:44Z I! Starting Telegraf 1.25.0
2023-02-07T17:31:44Z I! Available plugins: 228 inputs, 9 aggregators, 26 processors, 21 parsers, 57 outputs, 2 secret-stores
2023-02-07T17:31:44Z I! Loaded inputs: apache cloudwatch (3x) cpu disk diskio influxdb internal jolokia2_agent mem net swap system
2023-02-07T17:31:44Z I! Loaded aggregators:
2023-02-07T17:31:44Z I! Loaded processors:
2023-02-07T17:31:44Z I! Loaded secretstores:
2023-02-07T17:31:44Z I! Loaded outputs: influxdb
2023-02-07T17:31:44Z I! Tags enabled: host=aws-tig-300-pv.ipa.snapbs.com
2023-02-07T17:31:44Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"aws-tig-300-pv.ipa.snapbs.com", Flush Interval:10s
2023-02-07T17:31:44Z D! [agent] Initializing plugins
2023-02-07T17:31:44Z D! [agent] Connecting outputs
2023-02-07T17:31:44Z D! [agent] Attempting connection to [outputs.influxdb]
2023-02-07T17:31:44Z D! [agent] Successfully connected to outputs.influxdb
2023-02-07T17:31:44Z D! [agent] Starting service inputs
2023-02-07T17:31:54Z D! [outputs.influxdb] Wrote batch of 301 metrics in 17.479323ms
2023-02-07T17:31:54Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:32:04Z D! [outputs.influxdb] Wrote batch of 306 metrics in 17.397799ms
2023-02-07T17:32:04Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:32:14Z D! [outputs.influxdb] Wrote batch of 306 metrics in 20.306989ms
2023-02-07T17:32:14Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:32:17Z E! [inputs.cloudwatch] failed to list metrics with namespace AWS/EFS: operation error CloudWatch: ListMetrics, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://monitoring.us-east-2.amazonaws.com/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2023-02-07T17:32:17Z D! [inputs.cloudwatch] no metrics found to collect
2023-02-07T17:32:18Z E! [inputs.cloudwatch] failed to list metrics with namespace AWS/EBS: operation error CloudWatch: ListMetrics, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://monitoring.us-east-2.amazonaws.com/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
2023-02-07T17:32:18Z D! [inputs.cloudwatch] no metrics found to collect
2023-02-07T17:32:24Z D! [outputs.influxdb] Wrote batch of 306 metrics in 17.446229ms
2023-02-07T17:32:24Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:32:34Z D! [outputs.influxdb] Wrote batch of 306 metrics in 16.808481ms
2023-02-07T17:32:34Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:32:44Z D! [outputs.influxdb] Wrote batch of 306 metrics in 16.727573ms
2023-02-07T17:32:44Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2023-02-07T17:32:54Z D! [outputs.influxdb] Wrote batch of 306 metrics in 16.794908ms
2023-02-07T17:32:54Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
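
When comparing runs like these, it can be handy to pull out just the URLs that the failed requests were sent to. The following Python sketch is one way to do that; the function name and regular expression are illustrative assumptions, and the sample lines are shortened copies of the errors in the proxy-disabled run above.

```python
import re
from collections import Counter

# Matches the `Post "<url>"` fragment of the SDK error message.
# Both straight and curly quotes are accepted, since pasted logs may contain either.
POST_URL = re.compile(r'Post ["“]([^"”]+)["”]')


def requested_endpoints(log_lines):
    """Count the endpoint URLs that failed cloudwatch requests were sent to."""
    counts = Counter()
    for line in log_lines:
        if "[inputs.cloudwatch]" not in line:
            continue  # ignore agent and output plugin messages
        match = POST_URL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Shortened sample lines; the middle of each error message is omitted here.
sample = [
    '2023-02-07T17:32:17Z E! [inputs.cloudwatch] failed to list metrics with namespace AWS/EFS: '
    'request send failed, Post "https://monitoring.us-east-2.amazonaws.com/": context deadline exceeded',
    '2023-02-07T17:32:17Z D! [inputs.cloudwatch] no metrics found to collect',
    '2023-02-07T17:32:18Z E! [inputs.cloudwatch] failed to list metrics with namespace AWS/EBS: '
    'request send failed, Post "https://monitoring.us-east-2.amazonaws.com/": context deadline exceeded',
]

print(dict(requested_endpoints(sample)))
# {'https://monitoring.us-east-2.amazonaws.com/': 2}
```

If endpoint_url were being honoured, the VPC endpoint hostname would be expected to appear here instead of the regional default.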

I am not entirely clear on how and when your system is using a proxy versus when it should be. In Telegraf, before we get to the list-metrics call, all we have done is create a client, so there hasn’t been much setup or configuration.

I have put up a PR (fix: debug statements for cloudwatch by powersj · Pull Request #12646 · influxdata/telegraf · GitHub) which has some additional debug print statements. In 20-30 minutes, can you:

  1. go grab the artifacts from that build
  2. run your config with only the cloudwatch input that you are having issues with, no other inputs please
  3. enable debug mode
  4. get the complete log messages from runs of that build with and without a proxy, and with and without the endpoint URL

Thanks!

I am working on it right now and will share the results shortly.

cloudwatch_testing.tar.gz (6.1 KB)

Attached are the results of the test you requested. Please review.

Test 1 looks to work as expected?

2023-02-08T13:40:06Z D! [outputs.influxdb] Wrote batch of 8 metrics in 18.448265ms

Correct, it works. However, per our proxy logs, it is not using the endpoint_url; instead it is hitting https://monitoring.us-east-2.amazonaws.com, which is not expected.

We don’t want to use a proxy. Requests to our VPC endpoint URL work through the AWS CLI without a proxy, so they should work with Telegraf as well, agree?

I’m wondering if something similar to the cloudwatch_logs output (here) is missing from the cloudwatch input (here).
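
For reference, the behaviour this thread expects from endpoint_url can be written as a small rule: use the override when it is set, otherwise fall back to the regional default. The Python sketch below only illustrates that expectation; it is not Telegraf’s or the AWS SDK’s actual code. The function name is hypothetical, and the default URL pattern is an assumption generalised from the us-east-2 URL seen in the logs.

```python
# Assumed pattern for the regional default, based on the us-east-2 URL in the logs.
DEFAULT_TEMPLATE = "https://monitoring.{region}.amazonaws.com"


def resolve_endpoint(region: str, endpoint_url: str = "") -> str:
    """Return the URL a CloudWatch client would be expected to call."""
    override = endpoint_url.strip()
    if override:
        # A configured endpoint_url should take precedence over the default.
        return override.rstrip("/")
    return DEFAULT_TEMPLATE.format(region=region)


print(resolve_endpoint("us-east-2"))
# https://monitoring.us-east-2.amazonaws.com
print(resolve_endpoint("us-east-2", "https://vpcendpoint.monitoring.us-east-2.vpce.amazonaws.com"))
# https://vpcendpoint.monitoring.us-east-2.vpce.amazonaws.com
```

The AWS CLI tests earlier in the thread match the second case, while the Telegraf logs show the first URL even with endpoint_url configured.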

Can you file an issue please?

Should I file the issue on GitHub?

Yes, please do.

Issue submitted: Telegraf not using endpoint_url for cloudwatch input · Issue #12653 · influxdata/telegraf · GitHub