Problem monitoring size of mounted directory from inside Telegraf docker container

I am trying to monitor the size of my InfluxDB wal and data directories using Telegraf and InfluxDB. I am running Telegraf and InfluxDB in separate Docker containers. The data and wal directories are stored on an external drive that is mounted on the host system and passed through to the container at /mnt/SSD_240GB.

My method for obtaining the directory sizes is a simple shell script that runs disk usage (du) on each of the directories. Telegraf runs this script automatically via the exec plugin (this seems to be a pretty common way to do it). The problem is that the directory sizes arriving in InfluxDB aren’t correct.

This is the shell script:

#!/bin/bash

echo "["
du -s -B1 "$@" | awk '{if (NR!=1) {printf ",\n"};printf "  { \"dir_size_bytes\": "$1", \"path\": \""$2"\" }";}'
echo
echo "]"

which you can see gives the correct directory sizes if I run it directly in the Telegraf container (without using the exec plugin):

But the data that arrives in InfluxDB is always 20480 bytes and 4096 bytes for the two directories:

Does anyone know what’s going on here?


Here is the relevant part of my telegraf.conf file:

[[inputs.exec]]
  commands = [ "/etc/telegraf/scripts/get_disk_usage.sh /mnt/SSD_240GB/docker_data/InfluxDB/wal /mnt/SSD_240GB/docker_data/InfluxDB/data" ]
  timeout = "1m"
  name_override = "du"
  name_suffix = ""
  data_format = "json"
  tag_keys = [ "path" ]

and here is the docker-compose.yaml file:

telegraf:
    image: telegraf
    container_name: telegraf_container
    restart: always
    ports:
      - 8125:8125
    networks:
      - docker_monitoring_network
    volumes:
      - /mnt/SSD_240GB/docker_config_files/Telegraf/telegraf.conf:/etc/telegraf/telegraf.conf
      - /mnt/SSD_240GB/docker_config_files/Telegraf/scripts:/etc/telegraf/scripts
      - /mnt/SSD_240GB:/mnt/SSD_240GB:ro

Hi @teeeeee,

So are you setting any retention policies for your InfluxDB buckets? If you are using InfluxDB 2.x, you can also access these metrics via http://localhost:8086/metrics. 2.x also has a built-in scraper to scrape and store these metrics: Create an InfluxDB scraper | InfluxDB OSS 2.6 Documentation

Hi Jay,

No, I am using InfluxDB 1.8 and am not setting any retention policies. The issue is that the output of the shell script when I run it directly does not match the data collected when the script is run via the exec plugin (which is then sent to InfluxDB).

Hi @teeeeee,
Hmm, interesting. I guess we have to locate where in the chain the issue is. Could you add the printer processor to your Telegraf config: telegraf/plugins/processors/printer at master · influxdata/telegraf · GitHub

This will let us see the raw line protocol being produced. My two bets are:

  1. Either there is a strange parsing issue going on in the JSON parser
  2. Or, since Telegraf runs as the telegraf user while you are running the script as root within the container, there is a strange permissions issue that is leading to inconsistent results
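
One way to test the second possibility is to check, as the user Telegraf runs as, whether each directory can be listed and traversed. When du cannot read a directory, it reports the error on stderr and still prints the size of the directory entry itself (typically one filesystem block), so a script that only captures stdout passes that small number along unnoticed. The following is a minimal sketch; check_dir_access is a hypothetical helper name and not part of Telegraf.

```shell
#!/bin/bash
# Sketch: report whether the current user can list (r) and traverse (x)
# each directory given as an argument.
# check_dir_access is a hypothetical helper name, not part of Telegraf.
check_dir_access() {
  local dir status=0
  for dir in "$@"; do
    if [ -r "$dir" ] && [ -x "$dir" ]; then
      echo "ok: $dir"
    else
      # Problems go to stderr so they are not mixed into any JSON on stdout.
      echo "no access: $dir (user $(id -un))" >&2
      status=1
    fi
  done
  return "$status"
}

check_dir_access "$@"
```

Assuming the container name from the compose file above, running the original script with `docker exec -u telegraf telegraf_container /etc/telegraf/scripts/get_disk_usage.sh <dirs>` instead of as root should reproduce what the exec plugin sees, and `telegraf --config /etc/telegraf/telegraf.conf --input-filter exec --test` run the same way prints the metrics that would be sent.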

Hi Jay,

I have added the printer plugin as you suggested, and it shows that the directory sizes are the wrong ones (4096 bytes and 20480 bytes):

Thanks

Hi Jay,

Will it be a problem that Telegraf runs as the “telegraf” user even though I have changed the permissions of my shell script to allow all users to read/write/execute?

When I look at the file’s permissions from inside the container, I can see that it is owned by “root”, but that all users can read/write/execute:

If the permissions of a script are rwx for everyone, it makes no difference who
the owner is, but it does suggest there’s a deeper problem (or at least a
better solution), since doing “chmod 777 xyz” is very very rarely the Right
Thing To Do.

I regard it as the equivalent of mislaying your house keys, so you leave the
front door open all the time as a solution to the problem.

Antony.

Yes I know - thanks for the comment. I did chmod 777 only in order to try to guarantee that it was not a permissions issue with the file.

It seems more like the problem is that when the Telegraf service runs the shell script it is not able to see the actual size of the mounted directory - only some kind of link to the directory I guess (which it regards as having a size of 4096 bytes).

The only similar issue I could find was this comment, which says that “this counts directories as having 4096 bytes”:

https://github.com/influxdata/telegraf/issues/3945#issuecomment-377034171

(Of course I am using a different approach, but I wondered if it could be related somehow.)

Hi Jay,

I think you are correct, this is a permissions issue. Not with the shell script, but instead with the directories that I am trying to run du on within this script (namely /wal and /data). Here you can see the permissions were restricted on those folders:

SHOT4

If I run chmod 777 on these directories, then the correct data starts flowing from Telegraf into InfluxDB:

So my question now is really: how should I set the user/permissions without simply doing chmod 777?

The owner of the mounted filesystem /mnt/SSD_240GB is “root”, but you say that Telegraf runs as the “telegraf” user. I cannot change the owner of /mnt/SSD_240GB to telegraf, because other Docker containers (unrelated to Telegraf) need to use it. Do I need to create a “telegraf” user on the host system (i.e. outside of the Docker container) and do something with that?
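
One possible alternative to chmod 777, offered here only as a sketch and not confirmed anywhere in this thread, is to leave the owner alone and give a single group read-only access to the two directories. The helper name grant_group_read and the variable TELEGRAF_GID are hypothetical; the GID would have to be the numeric group ID that the telegraf user has inside the container, which could be looked up with `docker exec telegraf_container id telegraf`.

```shell
#!/bin/bash
# Sketch only: give one group read-only access to a directory tree
# without changing its owner or opening it up to every user.
# grant_group_read is a hypothetical helper name.
grant_group_read() {
  local gid="$1"
  shift
  local dir
  for dir in "$@"; do
    # Change only the group of the tree; the owner stays as it is.
    chgrp -R "$gid" "$dir" || return 1
    # g+r lets the group read; g+X lets it traverse directories while
    # leaving regular files non-executable unless they already were.
    chmod -R g+rX "$dir" || return 1
  done
}

# Example, run on the host as root or as the owner of the directories.
# TELEGRAF_GID is an assumed variable holding the GID found in the container:
# grant_group_read "$TELEGRAF_GID" /mnt/SSD_240GB/docker_data/InfluxDB/wal /mnt/SSD_240GB/docker_data/InfluxDB/data
```

Two caveats apply. Changing the group could affect another container that relies on the current group ownership, and chmod only touches what exists now, so subdirectories that InfluxDB creates later may come up with restricted permissions again.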

I saw this blog on the InfluxData website about passing in a user, but I am not sure if it is what’s needed here.

Thanks for your patience.

@Jay_Clifford
Are you able to assist with this? Thanks.

Hi @Jay_Clifford and @Pooh

I am still having trouble with this. I have opened a new issue here.

Can you suggest the recommended way to configure the user/permissions in such a case? (I believe it is a common setup.)

The container is run as a non-root user (named tom), and the telegraf process inside the container runs as user telegraf (this is the Docker image default).

Thanks