Worked fine initially, but stopped for one node after a few days, for another after two weeks.
All other telegraf measurements, and there are quite a few, work fine on these nodes.
For smart_device I get exit_status=1 from these two nodes. When I test with
That is strange indeed. And thanks for the tip; I was not aware that debug is not much help when plug-ins misbehave.
Maybe check that there isn’t a bunch of processes or threads accumulating over time under the telegraf user: ps -edf | grep telegraf
Another possibility is that the smartctl process is hanging/blocking and never returns to telegraf. You could try longer or shorter timeout values, e.g. timeout = "30s"
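To check the hanging-smartctl idea against what ps reports, something like the sketch below could help. It filters a process listing for commands that match a name and have been running longer than a given number of seconds. The function name long_running is made up for illustration, and the column layout (pid, ppid, elapsed seconds, state, command line) is an assumption that matches the ps invocation in the note underneath.

```shell
#!/bin/sh
# Hypothetical helper (not part of telegraf or smartmontools).
# Reads a process listing on stdin with the assumed columns
#   PID PPID ELAPSED_SECONDS STAT COMMAND...
# and prints the rows that mention NAME and are older than LIMIT seconds.
long_running() {
  name=$1   # substring to look for, e.g. smartctl
  limit=$2  # age threshold in seconds, e.g. the plugin timeout
  awk -v name="$name" -v limit="$limit" '
    NR > 1 && ($3 + 0) > (limit + 0) && index($0, name) { print }
  '
}
```

With procps-ng, where ps offers the etimes column (elapsed time in seconds), it could be fed like this: ps -eo pid,ppid,etimes,stat,args | long_running smartctl 30. Any smartctl line that shows up is older than the 30s timeout and is a candidate for a hung call.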
It’s not an accumulation problem; I checked with ps. Since telegraf is restarted to activate config changes, accumulation would lead to a transient failure, not the quasi-static one I see (fine up to a point in time, always failing after that).
Timeout also isn’t the culprit. Testing with
time sudo -u telegraf telegraf --config in.smart_sda.conf --test
first of all works on all nodes, and gives execution times of a few tens of seconds. The timeout is set to 30s.
I really don’t know how to debug this any further; any help/hints very much appreciated.
I’m stumped as well. Maybe set up a monitor on the telegraf process with pidstat -t -p PidOfTelegrafProcess 1 > /tmp/pidstatlog.txt
Depending on how the plugin was written, you might see a thread for smart that runs forever, or threads that run and then terminate. Either way, the pidstat log might show some indication of what’s happening underneath. My money is still on something hanging/blocking/dying that somehow doesn’t take down the whole telegraf process.
Normally, if the plugin were stuck, you would get an error message. It’s possible that for some reason the data retrieved from smartctl isn’t producing any metrics. Perhaps you could write a wrapper script for smartctl that saves the command output to a file using tee?
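A minimal sketch of such a wrapper, under some assumptions: the real binary lives at /usr/sbin/smartctl, the log goes to /tmp/smartctl-wrapper.log, and the function name run_logged is made up for illustration. It records the arguments telegraf passes, copies stdout to the log with tee, and preserves smartctl's exit status so the exit_status field stays meaningful. Note that in this sketch stderr goes only to the log and is not passed back to the caller.

```shell
#!/bin/sh
# Hypothetical smartctl wrapper for debugging; both paths are assumptions.
REAL_SMARTCTL="${REAL_SMARTCTL:-/usr/sbin/smartctl}"
LOG_FILE="${LOG_FILE:-/tmp/smartctl-wrapper.log}"

run_logged() {
  # Record when the call happened and with which arguments.
  printf '%s args: %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$*" >> "$LOG_FILE"

  # Run the real binary; stderr is appended to the log only.
  # Capturing stdout first keeps the real exit status in $?.
  out=$("$REAL_SMARTCTL" "$@" 2>> "$LOG_FILE")
  status=$?

  # Pass stdout on to the caller and append a copy to the log.
  if [ -n "$out" ]; then
    printf '%s\n' "$out" | tee -a "$LOG_FILE"
  fi

  printf 'exit status: %s\n' "$status" >> "$LOG_FILE"
  return "$status"
}

# When saved as a standalone wrapper script, finish with:
#   run_logged "$@"
```

To use it, the smart input would have to call the wrapper instead of the real binary, presumably through the plugin's smartctl path setting (its name may differ between Telegraf versions, so check the docs for yours). If the plugin runs smartctl through sudo, the sudoers entry would likely need to cover the wrapper as well, and the telegraf user needs write access to the log file.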