We are using telegraf to log metrics from zookeeper and rabbitmq pods in our k8s cluster.
We are currently running Telegraf 1.28.5, but because of the many vulnerabilities reported against this version we need to bump to 1.29.*, where most of them are fixed.
The container base image that we are using is photon3.
However, after upgrading Telegraf, the pods crash shortly after a successful initial startup with the following error:
2024-02-28T11:53:17Z I! Loading config: /etc/telegraf/telegraf.conf
panic: could not acquire lock on 0x7f7a8890f000, limit reached? [Err: cannot allocate memory]
This error is related to the fact that telegraf could not allocate locked memory pages for secret store values. This memory is where secrets are stored securely. There are two options:
Increase the ulimit on the system. This is done with the ulimit -l command, which both shows and sets the value. For Docker, I believe there is a --ulimit flag that could be used, like --ulimit memlock=8192:8192
Add the --unprotected option to the command arguments so that Telegraf does not use the reservable (locked) memory and instead stores secrets in regular memory. This is less secure, as secrets could find their way into paged-out memory, which is why it is opt-in. For Docker you would need to update the CMD used to include this.
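To see which locked-memory limit a process actually gets inside the container, a minimal Python sketch along these lines can help. It is not part of Telegraf; it assumes a Linux container with Python available, and the function names are purely illustrative.

```python
import resource


def memlock_limits():
    """Return (soft, hard) RLIMIT_MEMLOCK in bytes; None means unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def norm(value):
        # RLIM_INFINITY is the sentinel the kernel uses for "no limit".
        return None if value == resource.RLIM_INFINITY else value

    return norm(soft), norm(hard)


def describe(value):
    """Format a limit in bytes and KiB, or 'unlimited' when it is None."""
    if value is None:
        return "unlimited"
    return f"{value} bytes ({value // 1024} KiB)"


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("soft memlock limit:", describe(soft))
    print("hard memlock limit:", describe(hard))
```

Running this as the same non-root user as Telegraf shows the limit that user is subject to. The values come from getrlimit and are in bytes, whereas ulimit -l in bash prints 1024-byte units.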
Currently the pod is run as a non-root user due to compliance requirements, so passing this as a command under the container spec won’t work, as setting the ulimit requires root access.
So if I cannot pass it in the Docker container spec when building the image, what other options are left?
I am also curious how this worked with the older Telegraf 1.28.* when the new version hits memory issues. I went through the release notes for version 1.29, and so many things have changed that it is almost impossible for me to find out what is causing this.
so passing this as a command under the container spec won’t work as setting the ulimit requires
As I’ve mentioned, you can pass it as an option to the container. Please see the resource constraints docs. I’ve also mentioned the --unprotected flag option above.
And what I am also curious is how was this working with the older version of telegraf 1.28.* and in the new one we are hitting memory issues.
As users increase their usage of secret stores, the need to get lockable memory increases.
Hi again and sorry for the late reply on this one.
So I did some digging and it seems that because of k8s limitations there is no way for us to increase the ulimits.
The container also cannot be run with --unprotected.
I found out that our Telegraf is not using secret stores, so is it possible to disable this feature when running the container, and can this be passed as a flag in the custom Telegraf config we provide?
The only other option I see is to migrate to a non-k8s-native Telegraf because of these breaking changes.
This is what I found regarding the k8s limitations on the ulimit increase:
I am trying to understand whether there is a way to disable secret store usage in the Telegraf config when we are not using secrets, so that we do not get pod restarts due to the memory demand.
The only way out I currently see is not running Telegraf in a k8s container.
Could you please help with the following questions if possible? Thanks.
Why is there no issue in the older versions of Telegraf?
Is it because the older versions use the --unprotected flag by default? (If so, we could use --unprotected explicitly with the new versions, since we would not be making things worse.)
Why is such a large amount of reservable memory needed?
We use Telegraf only for scraping RabbitMQ, so there is only one password (for RabbitMQ) that needs to be stored. I am not sure why 64K, which is the default value in our k8s cluster setup, is not sufficient. Perhaps this is something that can be improved.
No, there was no flag before. It was the addition of the secret store functionality.
Why is such a large amount of reservable memory needed?
There are some requirements by the library used to store secrets safely. We have a PR up to reduce these limits in the library, but are waiting on that to be landed.
Do you have any estimate of when the PR for reducing the limits will land, and by how much it would reduce them?
The current limit for our pods in k8s is 64 (ulimit -l => 64). Can we expect the requirement to drop below 64 with the PR you mentioned? Thanks.
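A note on units, since both "64" and "64K" appear in this thread: in bash, ulimit -l reports the limit in 1024-byte units, so a value of 64 corresponds to 64 KiB. The small sketch below just does that arithmetic. The 4 KiB page size is an assumption (typical on x86-64, not confirmed for this cluster), and the helper names are illustrative only.

```python
def kib_to_bytes(kib):
    """Convert a 'ulimit -l' value (1024-byte units in bash) to bytes."""
    return kib * 1024


def lockable_pages(limit_bytes, page_size=4096):
    """Number of whole pages that fit under the limit.

    page_size=4096 is an assumed default, not a value taken from the cluster.
    """
    return limit_bytes // page_size


if __name__ == "__main__":
    limit = kib_to_bytes(64)
    print(limit, "bytes")
    print(lockable_pages(limit), "pages of 4 KiB")
```

Under that assumption, a limit of 64 allows only a handful of locked pages for the whole process, which gives a rough sense of how little headroom there is.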
I have no ETA; it depends on the upstream maintainer to review, land, and release the change, and then on Telegraf to consume it. I am not sure of the exact usage, as it varies with how many plugins you are using and how many fields those plugins use, but it should greatly reduce the usage.