I am getting very high cardinality from the docker input measurements on AWS, where the lifecycle of containers is short: some cron-type jobs, plus crash/restart scenarios. Running on AWS using the Container service (ECS) makes a few of the tag values random each time a container starts.
Top measurements:
20471 docker_container_blkio
15889 docker_container_cpu
5103 docker_container_mem
5012 docker_container_net
I can see that a couple of tags added based on Docker labels will increase the cardinality:
com.amazonaws.ecs.task-definition-version
com.amazonaws.ecs.task-arn
I can exclude these via the configuration, which will reduce cardinality somewhat. The container name is another issue, as it contains a generated identifier, for example: container_name=ecs-ion-integration-49-receptionist-c6f18886a79fc4ba2f00
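For reference, this is roughly what I have in mind, assuming a Telegraf version whose docker input supports the `docker_label_exclude` option (a sketch, not tested against every release):

```toml
# Sketch only: drop the high-cardinality ECS labels at the input,
# before they ever become tags.
[[inputs.docker]]
  endpoint = "unix:///var/run/docker.sock"
  docker_label_exclude = [
    "com.amazonaws.ecs.task-arn",
    "com.amazonaws.ecs.task-definition-version"
  ]
```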
Looking at the code, it appears the input simply adds the container name as a tag.
Does anyone have a solution to this?
I'm thinking the only way to solve this is a generic tag exclusion, not just one on labels: something like "tag_exclude" in the configuration that would remove any listed tags before the measurement is pushed?
I would at the very least make sure you exclude the provider-specific Docker tags you show above. I've run into that issue with our Kubernetes integration.
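Note that Telegraf's generic metric filtering already provides what you describe, spelled `tagexclude`, and it can be set on any input. A minimal sketch, assuming the tags are named exactly as the labels above:

```toml
# Hedged sketch: Telegraf's generic tagexclude filter applied to the
# docker input, dropping the random ECS tags before metrics are written.
[[inputs.docker]]
  endpoint = "unix:///var/run/docker.sock"
  tagexclude = [
    "com.amazonaws.ecs.task-arn",
    "com.amazonaws.ecs.task-definition-version"
  ]
```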
This post is about a year old; hopefully it’ll get some visibility. I’m running into a similar issue with Docker Swarm. I believe the cardinality issues I’m running up against have to do with the way Docker Swarm names containers when they are deployed as part of a Service within a Stack.
For example, when a container for a service named app within a stack named api is deployed, the container is named something like: api_app.1.77h3ypzz1jtnm6uxmu7qdto93. For services with more than one replica, this expands out to api_app.2.[...], api_app.3.[...], api_app.4.[...], etc.
When this is happening across dozens of stacks, each made up of a handful of services, this alone adds up. Throw in a stack that is redeployed multiple times a day, and cardinality starts taking a hit quickly. This blew out our max cardinality on InfluxDB last week and all stats came to a grinding halt. So, what to do?
I have two questions:
What’s the recommended way of handling this? Is using tagexclude still the best way to go about this?
and
In my case I'm less concerned about the container's unique id. In the example above, getting stats for api_app.1 instead of the id-suffixed container name is plenty fine for my metrics needs. Is there a way for the inputs.docker plugin to take care of this manipulation? While tagexclude will remove tags altogether, I still want the container name; it's just one part of the name that is causing the cardinality to blow up.
Below is a screenshot showing a small handful of the service ids currently in my database:
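One possibility, assuming a Telegraf version that ships the `regex` processor, is to rewrite the tag instead of dropping it. A sketch that strips the trailing task id from Swarm-style names (the 25-character lowercase-alphanumeric pattern is an assumption about the id format, so verify it against your own container names):

```toml
# Sketch: rewrite container_name from "api_app.1.77h3ypzz1jtnm6uxmu7qdto93"
# to "api_app.1" by stripping the trailing dot-separated task id.
[[processors.regex]]
  [[processors.regex.tags]]
    key = "container_name"
    # Assumes the id is a 25-char lowercase alphanumeric suffix.
    pattern = "^(.*)\\.[0-9a-z]{25}$"
    replacement = "${1}"
```

With this in place the per-replica name (api_app.1, api_app.2, ...) survives as a tag value while the churn from redeploys collapses into a bounded set of series.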