AWS Cloudwatch Telegraf

My setup:

  • Telegraf 1.10.2
  • InfluxDB shell version: 1.7.4
  • Grafana Version 6.1.0

My Telegraf config for cloudwatch:

[global_tags]

[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
debug = false
quiet = false
logfile = "xxxxxx/logs/telegraf.log"
hostname = ""
omit_hostname = false

[[outputs.influxdb]]
urls = ["http://localhost:8086"]
database = "telegraf"
retention_policy = ""
write_consistency = "any"
timeout = "5s"

[[inputs.cloudwatch]]

region = "us-xxx-xx"
access_key = "xxxxx"
secret_key = "xxxx"
period = "30s"
delay = "5m"
interval = "5m"
namespace = "AWS/ElastiCache"
ratelimit = 25

#####statistic_include = [ "average", "maximum", "p90" ]
#####statistic_exclude = [ "sum", "minimum", "sample_count" ]

[[inputs.cloudwatch.metrics]]
names = ["IsMaster", "CPUUtilization", "EngineCPUUtilization", "SwapUsage", "BytesUsedForCache", "FreeableMemory", "NetworkBytesIn", "NetworkBytesOut", "ReplicationBytes", "ReplicationLag", "CurrConnections", "NewConnections", "CurrItems", "Reclaimed", "CacheHits", "CacheMisses", "Evictions", "GetTypeCmds", "SetTypeCmds"]

[[inputs.cloudwatch.metrics.dimensions]]
name = "CacheClusterId"
value = "*"

Issue1:

My setup above is collecting sum, minimum, maximum, average, and sample_count, but the values of min, max, and average all look the same and are not changing. Are we doing something wrong?

Issue2:

We would like to limit the statistics we collect to min, max, and average. I tried doing that by setting statistic_include and statistic_exclude, but they are not working, so I commented them out, as you can see in my configuration above.

I'd appreciate any help with the above two issues.

I haven't heard of anyone experiencing unchanging values before; very odd. If you can, compare with the AWS CLI, and if the results differ, open an issue on the issue tracker.

On issue 2, statistic filtering is new development for 1.11 and isn't in 1.10. You can try the nightly builds if you would like to test it out; when browsing the documentation on GitHub, make sure the branch is set to release-1.10 so it matches the version you are running. You can also expect much improved performance and fewer requests to CloudWatch.
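
For reference, once you are on 1.11 or a nightly, the filtering is expected to look roughly like the sketch below; this is only an illustration based on your existing config, so verify the exact option names against the plugin README for that branch before relying on it:

[[inputs.cloudwatch]]
  region = "us-xxx-xx"
  period = "30s"
  delay = "5m"
  interval = "5m"
  namespace = "AWS/ElastiCache"

  # Statistic filtering (1.11+/nightly only; ignored by 1.10).
  # Keep only the statistics you want...
  statistic_include = [ "average", "maximum", "minimum" ]
  # ...or, alternatively, drop the ones you do not want:
  # statistic_exclude = [ "sum", "sample_count" ]

  [[inputs.cloudwatch.metrics]]
    names = ["EngineCPUUtilization", "CPUUtilization"]

    [[inputs.cloudwatch.metrics.dimensions]]
      name = "CacheClusterId"
      value = "*"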

@daniel

What roles does Telegraf need to be given? It's not clear in the documentation for the cloudwatch input. Currently, the IAM policy created for me has the following set:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricData",
        "cloudwatch:ListMetrics",
        "cloudwatch:GetMetricStatistics"
      ],
      "Resource": "*"
    }
  ]
}

I tried the example listed in the documentation:
aws cloudwatch get-metric-statistics --namespace AWS/ElastiCache --region us-east-1 --period 300 --start-time 2019-04-29T15:00:00Z --end-time 2019-04-29T15:15:00Z --statistics Average --metric-name EngineCPUUtilization --dimensions Name=CacheClusterId,Value=xxxxx

I got the following error:

An error occurred (AccessDenied) when calling the GetMetricStatistics operation: User: arn:aws:iam::xxxx:user/xxxx is not authorized to perform: cloudwatch:GetMetricStatistics

Testing the Telegraf configuration produced the following data (notice the values are the same for avg, max, min, etc.):

telegraf --input-filter cloudwatch --config /etc/telegraf/telegraf.conf --test

cloudwatch_aws_elasti_cache,cache_cluster_id=abc-xxx-003,host=xxx.com,region=us-east-1,unit=percent engine_cpu_utilization_average=0.03333333333333333,engine_cpu_utilization_maximum=0.03333333333333333,engine_cpu_utilization_minimum=0.03333333333333333,engine_cpu_utilization_sample_count=1,engine_cpu_utilization_sum=0.03333333333333333 1556575140000000000
cloudwatch_aws_elasti_cache,cache_cluster_id=abc-xxx-002,host=xxx.com,region=us-east-1,unit=percent engine_cpu_utilization_average=0.03333888981496916,engine_cpu_utilization_maximum=0.03333888981496916,engine_cpu_utilization_minimum=0.03333888981496916,engine_cpu_utilization_sample_count=1,engine_cpu_utilization_sum=0.03333888981496916 1556575140000000000
cloudwatch_aws_elasti_cache,cache_cluster_id=abc-xxx-001,host=xxx.com,region=us-east-1,unit=percent engine_cpu_utilization_average=0.03333333333333333,engine_cpu_utilization_maximum=0.03333333333333333,engine_cpu_utilization_minimum=0.03333333333333333,engine_cpu_utilization_sample_count=1,engine_cpu_utilization_sum=0.03333333333333333 1556575140000000000
cloudwatch_aws_elasti_cache,cache_cluster_id=abc-xxx-003,host=xxx.com,region=us-east-1,unit=percent engine_cpu_utilization_average=0.050008334722453744,engine_cpu_utilization_maximum=0.050008334722453744,engine_cpu_utilization_minimum=0.050008334722453744,engine_cpu_utilization_sample_count=1,engine_cpu_utilization_sum=0.050008334722453744 1556575140000000000
cloudwatch_aws_elasti_cache,cache_cluster_id=abc-xxx-004,host=xxx.com,region=us-east-1,unit=percent cpu_utilization_average=0,cpu_utilization_maximum=0,cpu_utilization_minimum=0,cpu_utilization_sample_count=1,cpu_utilization_sum=0 1556575140000000000
cloudwatch_aws_elasti_cache,cache_cluster_id=abc-xxx-002,host=xxx.com,region=us-east-1,unit=percent engine_cpu_utilization_average=0.03333333333333333,engine_cpu_utilization_maximum=0.03333333333333333,engine_cpu_utilization_minimum=0.03333333333333333,engine_cpu_utilization_sample_count=1,engine_cpu_utilization_sum=0.03333333333333333 1556575140000000000
cloudwatch_aws_elasti_cache,cache_cluster_id=abc-xxx-002,host=xxx.com,region=us-east-1,unit=percent cpu_utilization_average=0,cpu_utilization_maximum=0,cpu_utilization_minimum=0,cpu_utilization_sample_count=1,cpu_utilization_sum=0 1556575140000000000

It depends a bit on the version of Telegraf you are using; with 1.10 you will need:

cloudwatch:ListMetrics
cloudwatch:GetMetricStatistics

With 1.11/nightly builds I expect you need:

cloudwatch:ListMetrics
cloudwatch:GetMetricData
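
A minimal IAM policy covering both versions might look like the sketch below (untested; the Sid is just an arbitrary label, and you can trim the actions down to the ones your Telegraf version actually uses):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TelegrafCloudWatchReadOnly",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:ListMetrics",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:GetMetricData"
      ],
      "Resource": "*"
    }
  ]
}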

It's very weird output for sure, but perhaps it is normal? I notice the utilization is very low, so perhaps the issue clears up once the system is under load.

@daniel

Here is our EngineCPUUtilization data for ElastiCache collected over the last few hours; please note the activity is the same for avg, max, and min below (in that order):

Can you please provide the command I need to run from the AWS CLI so that I can see if the results are any different?

Could this be a bug in Telegraf?

The command should look similar to this, but you will need to edit the cluster id:

aws cloudwatch get-metric-data --region us-east-1  --start-time 2019-04-30T00:00:00Z   --end-time 2019-04-30T00:15:00Z   --metric-data-queries '[
  {
    "Id": "engine_cpu_utililization",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ElasticCache",
        "MetricName": "EngineCPUUtilization",
        "Dimensions": [
          {
            "Name": "CacheClusterId",
            "Value": "i-deadbeef"
          }
        ]
      },
      "Period": 300,
      "Stat": "Average"
    },
    "Label": "engine_cpu_utililization"
  }
]'

@daniel

The values I got for max, min, avg, and p90 were all different when I ran it through the AWS CLI. See below:

aws cloudwatch get-metric-data --region us-east-1 --start-time 2019-04-30T00:00:00Z --end-time 2019-04-30T00:15:00Z --metric-data-queries file://./metrics.json
{
    "Messages": [],
    "MetricDataResults": [
        {
            "Timestamps": [
                "2019-04-30T00:10:00Z",
                "2019-04-30T00:05:00Z",
                "2019-04-30T00:00:00Z"
            ],
            "StatusCode": "Complete",
            "Values": [
                0.030002222592654327,
                0.03333666722231483,
                0.033334445000030875
            ],
            "Id": "engine_cpu_utililization_avg",
            "Label": "engine_cpu_utililization_avg"
        },
        {
            "Timestamps": [
                "2019-04-30T00:10:00Z",
                "2019-04-30T00:05:00Z",
                "2019-04-30T00:00:00Z"
            ],
            "StatusCode": "Complete",
            "Values": [
                0.016666666666666666,
                0.03333333333333333,
                0.03332777870354941
            ],
            "Id": "engine_cpu_utililization_min",
            "Label": "engine_cpu_utililization_min"
        },
        {
            "Timestamps": [
                "2019-04-30T00:10:00Z",
                "2019-04-30T00:05:00Z",
                "2019-04-30T00:00:00Z"
            ],
            "StatusCode": "Complete",
            "Values": [
                0.03333888981496916,
                0.03333888981496916,
                0.03333888981496916
            ],
            "Id": "engine_cpu_utililization_max",
            "Label": "engine_cpu_utililization_max"
        },
        {
            "Timestamps": [
                "2019-04-30T00:10:00Z",
                "2019-04-30T00:05:00Z",
                "2019-04-30T00:00:00Z"
            ],
            "StatusCode": "Complete",
            "Values": [
                0.03321353941174161,
                0.03333833412512757,
                0.03333777853715309
            ],
            "Id": "engine_cpu_utililization_p90",
            "Label": "engine_cpu_utililization_p90"
        }
    ]
}

However, this is not the case when I use Telegraf.

Definitely odd. Can you also try fetching the same period with get-metric-statistics? Maybe the issue is specific to that method:

aws cloudwatch get-metric-statistics --namespace AWS/ElastiCache --region us-east-1 --period 300 --start-time 2019-04-30T00:00:00Z --end-time 2019-04-30T00:15:00Z --statistics Average --statistics Maximum --statistics Minimum --statistics SampleCount --metric-name EngineCPUUtilization --dimensions Name=CacheClusterId,Value=xxxxx

@daniel

Here is the output of the command you requested; the values of max, min, and avg all look different through the AWS CLI:

aws cloudwatch get-metric-statistics --namespace AWS/ElastiCache --region us-east-1 --period 300 --start-time 2019-04-30T00:00:00Z --end-time 2019-04-30T00:15:00Z --statistics Average --statistics Maximum --statistics Minimum --statistics SampleCount --metric-name EngineCPUUtilization --dimensions Name=CacheClusterId,Value=xxxxx
{
    "Datapoints": [
        {
            "SampleCount": 5.0,
            "Timestamp": "2019-04-30T00:05:00Z",
            "Unit": "Percent"
        },
        {
            "SampleCount": 5.0,
            "Timestamp": "2019-04-30T00:00:00Z",
            "Unit": "Percent"
        },
        {
            "SampleCount": 5.0,
            "Timestamp": "2019-04-30T00:10:00Z",
            "Unit": "Percent"
        }
    ],
    "Label": "EngineCPUUtilization"
}

aws cloudwatch get-metric-statistics --namespace AWS/ElastiCache --region us-east-1 --period 300 --start-time 2019-04-30T00:00:00Z --end-time 2019-04-30T00:15:00Z --statistics Average --statistics Maximum --statistics Minimum --metric-name EngineCPUUtilization --dimensions Name=CacheClusterId,Value=xxxxx
{
    "Datapoints": [
        {
            "Timestamp": "2019-04-30T00:05:00Z",
            "Minimum": 0.03333333333333333,
            "Unit": "Percent"
        },
        {
            "Timestamp": "2019-04-30T00:00:00Z",
            "Minimum": 0.03332777870354941,
            "Unit": "Percent"
        },
        {
            "Timestamp": "2019-04-30T00:10:00Z",
            "Minimum": 0.016666666666666666,
            "Unit": "Percent"
        }
    ],
    "Label": "EngineCPUUtilization"
}
aws cloudwatch get-metric-statistics --namespace AWS/ElastiCache --region us-east-1 --period 300 --start-time 2019-04-30T00:00:00Z --end-time 2019-04-30T00:15:00Z --statistics Average --statistics Maximum --metric-name EngineCPUUtilization --dimensions Name=CacheClusterId,Value=xxxxx
{
    "Datapoints": [
        {
            "Timestamp": "2019-04-30T00:05:00Z",
            "Maximum": 0.03333888981496916,
            "Unit": "Percent"
        },
        {
            "Timestamp": "2019-04-30T00:00:00Z",
            "Maximum": 0.03333888981496916,
            "Unit": "Percent"
        },
        {
            "Timestamp": "2019-04-30T00:10:00Z",
            "Maximum": 0.03333888981496916,
            "Unit": "Percent"
        }
    ],
    "Label": "EngineCPUUtilization"
}
aws cloudwatch get-metric-statistics --namespace AWS/ElastiCache --region us-east-1 --period 300 --start-time 2019-04-30T00:00:00Z --end-time 2019-04-30T00:15:00Z --statistics Average --metric-name EngineCPUUtilization --dimensions Name=CacheClusterId,Value=xxxxx
{
    "Datapoints": [
        {
            "Timestamp": "2019-04-30T00:05:00Z",
            "Average": 0.03333666722231483,
            "Unit": "Percent"
        },
        {
            "Timestamp": "2019-04-30T00:00:00Z",
            "Average": 0.033334445000030875,
            "Unit": "Percent"
        },
        {
            "Timestamp": "2019-04-30T00:10:00Z",
            "Average": 0.03000222259265433,
            "Unit": "Percent"
        }
    ],
    "Label": "EngineCPUUtilization"
}

It seems I told you the wrong way to specify multiple statistics; try:

aws cloudwatch get-metric-statistics --namespace AWS/ElastiCache --region us-east-1 --period 300 --start-time 2019-04-30T00:00:00Z --end-time 2019-04-30T00:15:00Z --statistics Average Maximum Minimum SampleCount --metric-name EngineCPUUtilization --dimensions Name=CacheClusterId,Value=xxxxx

You should only need to run it once and get all the statistics in one response.

@daniel

Below is the output of the command you requested:

aws cloudwatch get-metric-statistics --namespace AWS/ElastiCache --region us-east-1 --period 300 --start-time 2019-04-30T00:00:00Z --end-time 2019-04-30T00:15:00Z --statistics Average Maximum Minimum SampleCount --metric-name EngineCPUUtilization --dimensions Name=CacheClusterId,Value=xxxxx
{
    "Datapoints": [
        {
            "SampleCount": 5.0,
            "Timestamp": "2019-04-30T00:05:00Z",
            "Average": 0.03333666722231483,
            "Maximum": 0.03333888981496916,
            "Minimum": 0.03333333333333333,
            "Unit": "Percent"
        },
        {
            "SampleCount": 5.0,
            "Timestamp": "2019-04-30T00:00:00Z",
            "Average": 0.033334445000030875,
            "Maximum": 0.03333888981496916,
            "Minimum": 0.03332777870354941,
            "Unit": "Percent"
        },
        {
            "SampleCount": 5.0,
            "Timestamp": "2019-04-30T00:10:00Z",
            "Average": 0.03000222259265433,
            "Maximum": 0.03333888981496916,
            "Minimum": 0.016666666666666666,
            "Unit": "Percent"
        }
    ],
    "Label": "EngineCPUUtilization"
}

I think Telegraf reports the same value for every statistic because the sample count is 1, while in the AWS CLI it is 5. In your Telegraf config, try setting period = "5m" and see if that helps.
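
Concretely, that means changing only the period line in the [[inputs.cloudwatch]] section of your original config, roughly like this (a sketch; the other settings stay as you had them):

[[inputs.cloudwatch]]
  region = "us-xxx-xx"
  # Aggregate over a 5 minute window so each CloudWatch datapoint
  # contains several samples instead of just one.
  period = "5m"
  delay = "5m"
  interval = "5m"
  namespace = "AWS/ElastiCache"
  ratelimit = 25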

@daniel

That change worked. Now I see different values for each of the stats (max, min, avg, etc.). Thank you for working with me. After the changes, our graph looks like the one below: