Updating telegraf to version 1.29.5 crashes kubernetes pod

We are using telegraf to log metrics from zookeeper and rabbitmq pods in our k8s cluster.
Currently we are running version 1.28.5 of telegraf, but due to the many vulnerabilities reported in this version we need to bump to version 1.29.*, as most of them are fixed there.

The container base image that we are using is photon3.

However, when we upgraded our telegraf version, the pods crash soon after a successful initial startup with the following error:

2024-02-28T11:53:17Z I! Loading config: /etc/telegraf/telegraf.conf
panic: could not acquire lock on 0x7f7a8890f000, limit reached? [Err: cannot allocate memory]

Do you have the same issues with the official images?

This error is related to the fact that telegraf could not allocate locked memory pages for secret store values. This memory is where secrets are stored securely. There are two options:

  1. Increase the locked-memory ulimit on the system. The ulimit -l command both shows and sets this value. For docker, I believe there is a --ulimit flag that could be used, like --ulimit memlock=8192:8192
  2. Add the --unprotected option to the command arguments so that telegraf does not use the locked memory and instead stores secrets in regular memory. This is less secure, as secrets could find their way into paged-out memory, which is why it is opt-in. For docker you would need to update the CMD used to include this.
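
The two options above could be sketched roughly as follows. This is only a sketch under assumptions: the image tag telegraf:1.29.5 and the config path are placeholders (the thread uses a custom photon3-based image), and the docker commands are wrapped in functions so that pasting the snippet does not start anything.

```shell
# Show the locked-memory limits of the current shell.
# bash reports the value in KiB, or "unlimited".
memlock_soft=$(ulimit -S -l)
memlock_hard=$(ulimit -H -l)
echo "memlock soft=${memlock_soft} hard=${memlock_hard}"

# Assumption: the official image tag; replace it with your own image.
IMAGE="telegraf:1.29.5"

# Option 1: raise the locked-memory limit when the container is started.
run_with_raised_memlock() {
  docker run --ulimit memlock=8192:8192 "$IMAGE"
}

# Option 2: keep the default limit and pass --unprotected to telegraf,
# so secrets are held in ordinary (pageable) memory.
run_unprotected() {
  docker run "$IMAGE" telegraf --config /etc/telegraf/telegraf.conf --unprotected
}
```

Calling run_with_raised_memlock or run_unprotected starts the container with the respective option. Running ulimit -l inside the container is a quick way to confirm which limit telegraf actually sees.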

@jpowers
Just to clarify, I tried setting the following in our docker image to increase the ulimit to the value you proposed:

Set ulimit to 8192

RUN ulimit -n 8192

However this did not work as expected.

Can you clarify the --ulimit memlock=8192:8192 option: should it be passed to the docker build command when building the image?

In our case we will need to fix the ulimit inside the docker image.

You would add this when running the container, not building the image.
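
One way to see why the Dockerfile RUN line had no effect: a limit set with ulimit applies only to the shell that sets it and to that shell's children, and each RUN step is a separate shell that exits when the step finishes. Note also that -n is the open-files limit, whereas the error above concerns locked memory (-l). The sketch below uses a subshell as a stand-in for a RUN step; the value 64 is arbitrary.

```shell
# Limit seen by the current shell before anything is changed.
before=$(ulimit -S -n)

# A subshell stands in for a Dockerfile RUN step: the soft open-files
# limit is lowered there, and the new value is visible only inside it.
inside=$( ulimit -S -n 64; ulimit -S -n )

# Back in the parent shell, the limit is whatever it was before.
after=$(ulimit -S -n)

echo "before=${before} inside=${inside} after=${after}"
```

The change made in the subshell never reaches the parent shell, in the same way that a limit set during a build step never reaches the process started when the container runs.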

Currently the pod is run as a non-root user due to compliance requirements, so passing this as a command under the container spec won’t work, as setting the ulimit requires root user access.

So if I cannot pass it in the docker container spec when building the image, what other options are left here?

I am also curious how this worked with the older telegraf 1.28.* when the new version hits memory issues. I went through the release notes for version 1.29, and so many things have changed that it is almost impossible for me to find out what is causing this now.

> so passing this as a command under the container spec won’t work as setting the ulimit requires

As I’ve mentioned, you can pass it as an option to the container. Please see the resource constraints docs. I’ve also mentioned the --unprotected flag option above.

> I am also curious how this worked with the older telegraf 1.28.* when the new version hits memory issues.

As users increase their usage of secret stores, the need to get lockable memory increases.

Hi again and sorry for the late reply on this one.

So I did some digging and it seems that because of k8s limitations there is no way for us to increase the ulimits.

The container also cannot be run as unprotected.

I found out that our Telegraf is not using secret stores. Is it possible to disable this feature when running the container, and can this be passed as a flag in the custom Telegraf config we are using?

The only other option I see is to migrate to non k8s native Telegraf because of these breaking changes.

Thanks in advance!

@jpowers Could you please advise on my comments from last week?
Thanks!

What else are you looking for? I’m not sure I see a question.

> The container also cannot be run as unprotected.

I did not suggest running the container as unprotected. I suggested updating the container's CLI command with a single flag.
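
For a Kubernetes deployment, updating the CLI command could look roughly like the sketch below, which writes a patch file that adds the flag to the container's arguments. Everything named here is an assumption for illustration: the container name telegraf, the deployment name in the commented kubectl line, the patch file name, the config path, and that the image's entrypoint passes these arguments on to the telegraf binary.

```shell
# Hypothetical patch file name.
PATCH_FILE="telegraf-unprotected-patch.yaml"

# Write a strategic-merge patch that sets the container arguments.
cat > "$PATCH_FILE" <<'EOF'
spec:
  template:
    spec:
      containers:
        - name: telegraf   # assumed container name
          args:
            - "--config"
            - "/etc/telegraf/telegraf.conf"
            - "--unprotected"
EOF

echo "wrote ${PATCH_FILE}"

# To apply it (assumed deployment name "telegraf"):
#   kubectl patch deployment telegraf --patch-file "$PATCH_FILE"
```

This changes only the arguments passed to telegraf. It needs neither root access nor a change to the pod's ulimits, which is why it fits the non-root constraint described earlier in the thread.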

> because of k8s limitations there is no way for us to increase the ulimits.

I would love to know what these limitations are.

This is what I found regarding the k8s limitations on the ulimit increase:

I am trying to understand whether there is a way to disable the secret store usage in the Telegraf config if we are not using secrets, so that we do not get pod restarts due to the memory demand.

The only way out I currently see is not running Telegraf in k8s container.

As I have said in my original response, you can pass the --unprotected flag.

Thanks for the link. I was unaware that, without the use of a privileged container, this has to be set on the host.

Hi @jpowers ,

Could you please help with the following questions if possible? Thanks.

  1. Why is there no issue in the older versions of telegraf?
    Is it because the older versions use the --unprotected flag by default? (If so, we could use --unprotected explicitly with the new versions, since we would not be making things worse :slight_smile: .)

  2. Why is that much reservable memory needed?
    We use telegraf only for scraping rabbitmq, so there is only one password (for rabbitmq) that needs to be stored. I am not sure why 64K, which is the default value in our k8s cluster setup, is not sufficient. Perhaps this is something that can be improved.

  3. --insecure vs. --unprotected
    In RabbitMQ plugin 1.29.1 panic: [Err: cannot allocate memory] · Issue #14497 · influxdata/telegraf · GitHub, you mentioned --insecure, but here you mentioned --unprotected. Are they the same, or do you think --unprotected is the proper one?

No, there was no flag before. The change came with the addition of the secret store functionality.

> Why is that much reservable memory needed?

There are some requirements imposed by the library used to store secrets safely. We have a PR up to reduce these limits in the library, but are waiting for it to land.

> --insecure vs. --unprotected

Typo on my part; --unprotected is the correct flag.

Hi @jpowers ,

Do you have any estimate of when the PR for reducing the limits will land? And how much would it reduce the limit?

The current limit for our pods in k8s is 64 (ulimit -l => 64). Can we expect the required locked memory to drop below 64 with the PR you mentioned? Thanks.

I have no ETA. It depends on the upstream maintainer to review, land, and release the change, and then on telegraf to consume it. I am not sure of the exact usage, as this varies with how many plugins you are using and how many fields those plugins use, but it should greatly reduce the usage.