Strange behaviour of telegraf and/or diskio plugin

telegraf
#1

Hi there,

I’m experiencing something that I cannot explain at all. Can someone please help me?

I have a bunch of different CentOS servers with the same problem: they do not report diskio information. Everything else works like a charm except this single plugin. Here is the story told in shell commands:

root@riff:~# pgrep -a telegraf
242390 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
root@riff:~# grep -B2 -A4 diskio /etc/telegraf/telegraf.conf
# Read metrics about disk IO by device
[[inputs.diskio]]

# Get kernel statistics from /proc/stat
[[inputs.kernel]]
root@riff:~# telegraf --test | grep diskio
2018/02/27 14:54:51 I! Using config file: /etc/telegraf/telegraf.conf
* Plugin: inputs.diskio, Collection 1
> diskio,name=xvda2,host=riff read_bytes=31558038528i,write_bytes=90248761344i,write_time=119070621i,io_time=53623809i,weighted_io_time=137432660i,writes=6841359i,read_time=18371392i,iops_in_progress=0i,reads=1969480i 1519703692000000000
> diskio,name=xvda1,host=riff reads=12103559i,read_time=149513526i,write_time=808237275i,io_time=79621553i,weighted_io_time=957765779i,iops_in_progress=0i,writes=12715230i,read_bytes=194178266112i,write_bytes=170966654976i 1519703692000000000
> diskio,name=xvda3,host=riff io_time=17287130i,weighted_io_time=117838385i,iops_in_progress=0i,read_time=40106127i,write_time=77729931i,reads=2938334i,writes=1392616i,read_bytes=36339414016i,write_bytes=22348427264i 1519703692000000000
> diskio,name=xvda4,host=riff io_time=7248751i,weighted_io_time=47135701i,writes=532510i,read_bytes=29800793088i,read_time=16095740i,write_time=31041866i,reads=1525965i,write_bytes=6535241728i,iops_in_progress=0i 1519703692000000000

Everything looks OK at this stage, but now comes the most interesting part of the story:

root@riff:~# journalctl -u telegraf | tail -n 3
Feb 27 15:01:00 riff telegraf[242390]: 2018-02-27T04:01:00Z E! Error in plugin [inputs.diskio]: error getting disk io info: open /proc/diskstats: no such file or directory
Feb 27 15:01:10 riff telegraf[242390]: 2018-02-27T04:01:10Z E! Error in plugin [inputs.diskio]: error getting disk io info: open /proc/diskstats: no such file or directory
Feb 27 15:01:20 riff telegraf[242390]: 2018-02-27T04:01:20Z E! Error in plugin [inputs.diskio]: error getting disk io info: open /proc/diskstats: no such file or directory

root@riff:~# ls -l /proc/diskstats
-r--r--r-- 1 root root 0 Feb 27 14:57 /proc/diskstats

root@riff:~# cat /proc/diskstats
 202       1 xvda1 12103646 70438 379256202 149515123 12716894 15917712 333952984 808283699 0 79629455 957813799
 202       3 xvda3 2938424 13569 70976786 40106881 1392972 996026 43659296 77738198 5 17289808 117847410
 202       4 xvda4 1526006 9816 58205210 16096219 532755 332494 12768016 31056931 0 7249266 47151245
 202       2 xvda2 1969565 7769 61637722 18372582 6842316 2515634 176295800 119088329 0 53630438 137451558
   7       0 loop0 0 0 0 0 0 0 0 0 0 0 0

And just in case

root@riff:~# telegraf --version
Telegraf v1.5.2 (git: release-1.5 67440c95)
root@riff:~# uname -mrs
Linux 3.10.0-427.18.2.lve1.4.38.el7.x86_64 x86_64

#2

I wasn’t able to reproduce these issues with CentOS 7 running in VirtualBox, can you provide some more information about where you are running your servers?

Are you running any kind of mandatory access control (MAC), like SELinux? Telegraf usually runs under the telegraf user, and sometimes people have run into issues where Telegraf doesn’t have appropriate permissions to access a file, but the permissions on /proc/diskstats seem fine…

Does every collection attempt fail?
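
One way to test the permissions theory is to look at the file from each account in turn. Below is a minimal sketch; the script name probe.sh and the function probe_file are made up for illustration, and it only tells "lookup fails" apart from "exists but unreadable", nothing more:

```shell
#!/bin/sh
# probe.sh (hypothetical name): classify how a path looks to the user
# running this script. The default path is the one from the error message.
probe_file() {
    path="$1"
    if [ ! -e "$path" ]; then
        # The lookup itself fails; open() would typically return ENOENT,
        # i.e. "no such file or directory".
        echo "not-visible"
    elif [ ! -r "$path" ]; then
        # The entry exists but this user may not read it (EACCES on open).
        echo "permission-denied"
    else
        echo "readable"
    fi
}

probe_file "${1:-/proc/diskstats}"
```

Run it once as root and once as the service account, for example with "sudo -u telegraf sh probe.sh" (assuming sudo is installed and the account is named telegraf). If root prints readable and telegraf prints not-visible, the file is being hidden from that user rather than blocked by its mode bits. The same reasoning applies to telegraf itself: "telegraf --test" started from a root shell runs as root, so "sudo -u telegraf telegraf --config /etc/telegraf/telegraf.conf --input-filter diskio --test" is closer to what the service sees.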

#3

Hi and thanks for getting into this.

We don’t use any access control tools like SELinux or AppArmor. All the servers run in a Xen environment, so maybe that’s the root of the issue.
However, you are absolutely right about the permissions:

root@riff [~]# su telegraf -p -c "cat /proc/diskstats"
cat: /proc/diskstats: No such file or directory

root@riff [~]# su telegraf -p -c "ls -l /proc/"
total 0
dr-xr-xr-x 7 telegraf telegraf  0 Feb 28 11:15 193536
dr-xr-xr-x 7 telegraf telegraf  0 Feb 28 04:07 3668
-r--r--r-- 1 root     root      0 Feb 28 11:15 cmdline
-r--r--r-- 1 root     root      0 Feb 28 11:15 cpuinfo
-r--r--r-- 1 root     root      0 Feb 28 11:15 filesystems
-r--r--r-- 1 root     root      0 Feb 28 11:15 loadavg
-r--r--r-- 1 root     root      0 Feb 28 11:15 meminfo
lrwxrwxrwx 1 root     root     11 Feb 28 11:15 mounts -> self/mounts
lrwxrwxrwx 1 root     root      8 Feb 28 11:15 net -> self/net
lrwxrwxrwx 1 root     root     64 Feb 28 04:05 self -> 193536
-r--r--r-- 1 root     root      0 Feb 28 11:15 stat
-r--r--r-- 1 root     root      0 Feb 28 11:15 uptime
-r--r--r-- 1 root     root      0 Feb 28 11:15 version

Which is really strange, because everything looks all right:

root@riff [~]# getfacl /proc/
getfacl: Removing leading '/' from absolute path names
# file: proc/
# owner: root
# group: root
user::r-x
group::r-x
other::r-x

root@riff [~]# getfacl /proc/diskstats 
getfacl: Removing leading '/' from absolute path names
# file: proc/diskstats
# owner: root
# group: root
user::r--
group::r--
other::r--

root@riff:~# ls -l /proc/diskstats
-r--r--r-- 1 root root 0 Feb 27 14:57 /proc/diskstats

But anyway, the problem is not on the telegraf side and this conversation can be abandoned. Just in case, though: do you have any better idea than simply running telegraf under the root account?
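
If running as root does end up being the workaround, a systemd drop-in is one way to do it without editing the packaged unit file. This is only a sketch, assuming the service is managed by systemd as telegraf.service (the journalctl output above suggests so) and that the packaged unit sets User=telegraf; write_override is a hypothetical helper name:

```shell
#!/bin/sh
# Write a systemd drop-in that overrides the account a service runs as.
# $1 = drop-in directory, $2 = user to run as (defaults to root).
write_override() {
    dir="$1"
    user="${2:-root}"
    mkdir -p "$dir" &&
        printf '[Service]\nUser=%s\n' "$user" > "$dir/override.conf"
}

# On the real host, as root (assumed paths, adjust to your install):
#   write_override /etc/systemd/system/telegraf.service.d root
#   systemctl daemon-reload
#   systemctl restart telegraf
```

Keep in mind that this gives every enabled telegraf plugin root privileges, so it is a workaround and not a fix for whatever is hiding /proc/diskstats from the telegraf user.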

#4

Unfortunately I don’t have any better ideas.

Xen might be the issue, but I’d kind of expect to see the same behavior across users in that case. The only other mechanism that I can think of that would have user-based access restrictions would be something like SELinux. It might be worth double checking. It looks like it is enabled by default on the CentOS VM I brought up.
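
For the double check, something like the following reports the current SELinux mode. It is a small sketch: getenforce ships with the SELinux userland tools, and selinux_mode is just an illustrative wrapper name that also covers hosts where those tools are not installed:

```shell
#!/bin/sh
# Print the SELinux mode, or "not-installed" when getenforce is absent.
selinux_mode() {
    if command -v getenforce >/dev/null 2>&1; then
        getenforce
    else
        echo "not-installed"
    fi
}

selinux_mode
```

Enforcing means the policy is actively applied, Permissive means violations are only logged, and Disabled means SELinux is off. Running sestatus gives a longer report if the short answer is not enough.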

#5

It is always worth double checking! However, I have big doubts that Xen is the problem. But you never know… Come back with updates if you do find what’s wrong in the end. Sorry I can’t actually help.