How to set permissions to be able to monitor InfluxDB /data/ and /wal/ directory sizes from inside Telegraf container?

I could really use some help, if anyone can assist.

I am having a problem understanding how to properly set permissions for user IDs when running multiple containers for InfluxDB and Telegraf. My goal is to have the following:

  1. A named volume on the host system for storing InfluxDB /data/ and /wal/ directories
  2. A container for running the InfluxDB service
  3. A container for running a Telegraf service, which is able to obtain the directory sizes of the /data/ and /wal/ directories on the host system for monitoring purposes.

Here is my situation so far:

I am using podman (although this question probably applies to Docker as well), and am setting everything up as a non-root user called tom (UID=1005). I create the volume using

$ podman volume create influxdb_volume

and then create the two containers using

$ podman run -d --rm --name influxdb_container \
--mount type=volume,source=influxdb_volume,destination=/var/lib/influxdb \
--mount type=bind,src=/home/tom/config_files/influxdb.conf,dst=/etc/influxdb/influxdb.conf \
--publish 8086:8086 \
 influxdb:1.8

and

$ podman run -d --rm --name telegraf_container \
--mount type=bind,src=/home/tom/config_files/telegraf.conf,dst=/etc/telegraf/telegraf.conf \
--mount type=bind,src=/,dst=/hostfs \
-e HOST_MOUNT_PREFIX=/hostfs \
-e HOST_PROC=/hostfs/proc \
telegraf

You can see above that the entire host filesystem is mounted into the telegraf container to /hostfs (as recommended in the docs).

I can see that the influxdb container runs its process as root user inside the container:

$ podman run --rm influxdb:1.8 id
uid=0(root) gid=0(root) groups=0(root)

and that the telegraf container runs its process as a user called telegraf (UID=999) inside the container:

$ podman run --rm telegraf id
uid=999(telegraf) gid=0(root) groups=0(root),999(telegraf)

The influxdb named volume is located by default under ~/.local/share/containers/storage/volumes/ with the following permissions:

$ tree -pugd -L 3 /home/tom/.local/share/containers/storage/volumes/
/home/tom/.local/share/containers/storage/volumes/
└── [drwx------ tom      tom     ]  influxdb_volume
    └── [drwxr-xr-x 166534   166534  ]  _data
        ├── [drwxr-xr-x tom      tom     ]  data
        ├── [drwxr-xr-x tom      tom     ]  meta
        └── [drwx------ tom      tom     ]  wal
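For reference, the numeric owner 166534 on _data is consistent with the way rootless podman maps container UIDs onto the host: container UID 0 maps to the invoking user, and container UIDs from 1 upward map into that user's subordinate UID range. A minimal sketch of that arithmetic follows, assuming a hypothetical /etc/subuid entry of tom:165536:65536 (the real range on this system is not shown here) and a hypothetical helper name container_to_host_uid:

```shell
#!/bin/bash
# Sketch of the default rootless podman UID mapping.
# ASSUMPTIONS (not taken from the system described above):
#   host user UID 1005, subordinate range starting at 165536 with 65536 IDs.
host_uid=1005
subuid_start=165536
subuid_count=65536

# Hypothetical helper: print the host UID that a container UID maps to.
container_to_host_uid() {
  local cuid=$1
  if [ "$cuid" -eq 0 ]; then
    # Container root is the invoking (unprivileged) host user.
    echo "$host_uid"
  elif [ "$cuid" -le "$subuid_count" ]; then
    # Container UIDs 1..count map onto the subordinate range.
    echo $(( subuid_start + cuid - 1 ))
  else
    echo "unmapped"
    return 1
  fi
}

container_to_host_uid 0
container_to_host_uid 999
```

With those assumed values, container UID 0 maps to 1005 (tom) and container UID 999 maps to 166534, which would explain why files written by root inside the influxdb container show up as owned by tom on the host.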

To monitor the size of the /data/ and /wal/ directories, I have the following bash script get_disk_usage.sh which just uses the du command to print the directory size in bytes:

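#!/bin/bash
# Print the size in bytes of each directory given as an argument, as a JSON array.

echo "["
# du -s -B1 prints "<bytes> <path>" per argument; awk turns each line into one JSON object.
du -s -B1 "$@" | awk '{if (NR!=1) {printf ",\n"};printf "  { \"dir_size_bytes\": "$1", \"path\": \""$2"\" }";}'
echo
echo "]"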
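One limitation of the script above is that awk splits on whitespace, so a path containing spaces would be cut short. A possible variant is sketched below; it calls du once per path and keeps the path intact. The function name disk_usage_json is hypothetical, GNU du is assumed (for -B1), and quotes or backslashes inside paths are not JSON-escaped:

```shell
#!/bin/bash
# Variant of get_disk_usage.sh that calls du once per path, so paths
# containing spaces are preserved. Assumes GNU du (-B1).
# NOTE: quotes or backslashes in a path are NOT escaped for JSON.
disk_usage_json() {
  local first=1 path size
  echo "["
  for path in "$@"; do
    # du prints "<bytes><TAB><path>"; keep only the byte count.
    size=$(du -s -B1 "$path" 2>/dev/null | cut -f1)
    # Skip paths that du could not read at all.
    [ -n "$size" ] || continue
    [ "$first" -eq 1 ] || printf ',\n'
    first=0
    printf '  { "dir_size_bytes": %s, "path": "%s" }' "$size" "$path"
  done
  echo
  echo "]"
}
```

It would be called the same way as the original, for example disk_usage_json /home/influxdb_data /home/influxdb_wal, and the output keeps the same dir_size_bytes and path keys that the exec plugin configuration expects.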

The telegraf.conf file is then used to declare the script to run using the exec plugin:

[agent]
  hostname = "qsd-23"
  flush_interval = "5s"
  interval = "5s"

[[inputs.mem]]
    fieldpass = [ "available", "used" ]

[[inputs.exec]]
    commands = [ "/hostfs/home/tom/get_disk_usage.sh /hostfs/home/qsd/.local/share/containers/storage/volumes/influxdb_volume/_data/data /hostfs/home/qsd/.local/share/containers/storage/volumes/influxdb_volume/_data/wal" ]
    timeout = "1m"
    name_override = "du"
    name_suffix = ""
    data_format = "json"
    tag_keys = [ "path" ]

[[outputs.influxdb]]
  database = "telegraf"
  urls = [ "MY_IP:8086" ]

The problem is that the script doesn’t run due to a few permissions issues.

  1. The get_disk_usage.sh bash script is located on the host at /home/tom/, which is owned by the user tom. For the telegraf container to run this script (as the telegraf user), this directory needs to allow all other users to execute scripts in it.
  2. It can be seen above that the influxdb_volume directory and its subdirectories have some unusual permissions as well, which need to be changed to allow the bash script to read their sizes inside the container (and I do not have sudo access on the system).
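To narrow down which directory is actually blocking access, one option is to walk the path from the root and report the first component the current user cannot enter. The following is only a rough sketch with a hypothetical function name, first_blocked_component; it assumes an absolute path and checks only the execute bit on each directory plus the read bit on the final one:

```shell
#!/bin/bash
# Hypothetical helper: print the first directory along an absolute path that
# the current user cannot enter (no execute bit), or the final directory if it
# cannot be listed (no read bit). Prints nothing when the path is accessible.
first_blocked_component() {
  local current="" part parts
  # Split the path on "/" without triggering glob expansion.
  IFS=/ read -r -a parts <<< "$1"
  for part in "${parts[@]}"; do
    [ -z "$part" ] && continue
    current="$current/$part"
    if [ -d "$current" ] && [ ! -x "$current" ]; then
      echo "$current"
      return 1
    fi
  done
  # du also needs read permission on the target directory itself.
  if [ -d "$current" ] && [ ! -r "$current" ]; then
    echo "$current"
    return 1
  fi
  return 0
}
```

Run inside the telegraf container as the telegraf user, for example with /hostfs/home/tom/.local/share/containers/storage/volumes/influxdb_volume/_data/wal as the argument, this should point at the first directory whose mode bits stop the du call.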

What is the best way to configure the users to achieve what I am looking for? I am really confused about how to deal with the user tom on the host system (whose home directory holds the bash scripts), the user telegraf who is running the process inside the telegraf container, and the root user inside the influxdb container (mapped with subuid onto the host filesystem) who owns the named volume data directory.

Thank you!

Hi,

Have you considered using the influxdb input plugin, which produces an influxdb_shard metric with diskBytes as the size of the data + WAL directories? See: https://github.com/influxdata/telegraf/tree/master/plugins/inputs/influxdb
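As a rough illustration of that suggestion, the snippet below prints a config fragment that could be appended to telegraf.conf. The helper name print_influxdb_input_conf is hypothetical, and the URL is an assumption built from the MY_IP placeholder used earlier; the plugin reads InfluxDB's /debug/vars endpoint, and namepass and fieldpass are the generic Telegraf filters:

```shell
#!/bin/bash
# Hypothetical helper: print a Telegraf config fragment for the influxdb input
# plugin. The URL is an ASSUMPTION; replace it with the real InfluxDB host.
print_influxdb_input_conf() {
  local url=${1:-http://localhost:8086/debug/vars}
  cat <<EOF
[[inputs.influxdb]]
  urls = [ "$url" ]
  # Keep only the shard measurement and its on-disk size field.
  namepass = [ "influxdb_shard" ]
  fieldpass = [ "diskBytes" ]
EOF
}

print_influxdb_input_conf "http://MY_IP:8086/debug/vars"
```

Because this goes over HTTP rather than the filesystem, it avoids the volume permission question altogether, at the cost of reporting what InfluxDB itself believes is on disk rather than what du sees.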

Hi, yes I can spend some more time looking into that, thanks.

I would still like to know how to avoid the problem I describe in general, though (it’s not the first time I’ve faced it with Influx / docker).

At the moment, I mount the /data/ and /wal/ directories directly from the host to the container with a bind mount as follows:

podman run -d --rm \
--mount type=bind,src=/home/tom/telegraf.conf,dst=/etc/telegraf/telegraf.conf \
--mount type=bind,src=/home/tom/.local/share/containers/storage/volumes/influxdb_volume/_data/data,dst=/home/influxdb_data \
--mount type=bind,src=/home/tom/.local/share/containers/storage/volumes/influxdb_volume/_data/wal,dst=/home/influxdb_wal \
--mount type=bind,src=/home/tom/get_disk_usage.sh,dst=/home/get_disk_usage.sh \
telegraf

But the problem is that the data and wal directories are mounted into the container as root user. Since the container (and therefore the exec plugin) is running as telegraf user, it is not able to read the sizes of these mounted directories.

I do not want to chmod 755 the named volume on the host (and indeed can’t because I don’t have sudo).

It seems like there must be a better way to sort out these permissions, or maybe I’m just doing something wrong. Should I be using a single GID somehow?

Any help would be fantastic to resolve this.

This boils down to Unix permissions of the mounts vs. the user telegraf is running as. I don’t think you are doing anything wrong, but you do need to either change the permissions or the user such that the telegraf process has access to those folders.

Thanks, @jpowers. Yes, I need telegraf to have access to those folders, like you say.

I cannot change the permissions of the directories on the host for two reasons:

  1. It requires sudo privileges, which I don’t have. Also, my understanding of the whole docker/podman setup was that containers can be set up to run everything separately from the root user.

  2. Even if I did chmod the data and wal folders with sudo on the host, when the InfluxDB container adds new files to the volume (for example because a new database was created) it would change the permissions back again. Some users have had success with periodically modifying the permissions, but it seems like an ugly workaround to me, and there must be a better way.

I am not experienced enough to know how to proceed. Thanks so much for your patience with this.

Hi @jpowers

I have fixed my permissions issue, and am able to monitor the disk usage by running du on the /data/ and /wal/ directories. As you suggested, I am also trying to use the influxdb input plugin for telegraf, and to pick out the diskBytes field of the influxdb_shard measurement.

The yellow trace is the output of my du -s /wal/ command, while the green trace is the diskBytes from the influxdb input plugin:

I have two questions about this, if you can help?

  1. Why does the shard measurement drop down (towards the right side of the graph), while the du /wal/ result stays high?
  2. Why is the du /wal/ result quite “choppy”, and why does it seem to refresh only every 5 minutes?

Is it something to do with how the data is stored in memory / cache?

I would ask over in the influxdb section as you might get a better idea of what is going on from them.