Telegraf smart plugin returns exit_status = 1 after some time, but works with telegraf --test

I have set up S.M.A.R.T. data collection with telegraf via

[[inputs.smart]]
  interval = "6h"
  use_sudo = true
  attributes = true
  devices = [ "/dev/sda" ]
  tagexclude = ["capacity", "model", "serial_no", "wwn"]

This worked fine initially, but stopped on one node after a few days and on another after two weeks.
All other telegraf measurements, and there are quite a few, work fine on these nodes.
For smart_device I get exit_status=1 from these two nodes. When I test with

sudo -u telegraf sudo smartctl --info --attributes --health  \
                               --format=brief /dev/sda

or

sudo -u telegraf telegraf --config in.smart_sda.conf --test

no errors are returned on either node and the expected output is printed.
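
For orientation: as far as I can tell, the exit_status field of smart_device is the exit status of the smartctl call, which the smartctl man page documents as a bit mask. The sketch below decodes such a value; the function name decode_smartctl_status is made up for illustration, and the bit descriptions are abbreviated from the man page. Note that with use_sudo = true the value may instead come from sudo itself, which exits with 1 when it cannot run the command (for example on a permission or authentication problem), so a plain 1 is ambiguous.

```shell
#!/bin/sh
# Hypothetical helper: print the meaning of each bit set in a smartctl
# exit status (bit meanings abbreviated from the smartctl man page).
decode_smartctl_status() {
    status=$1
    bit=0
    for meaning in \
        "command line did not parse" \
        "device open failed or device did not identify itself" \
        "a SMART or other ATA command failed, or checksum error" \
        "SMART status check returned DISK FAILING" \
        "prefail attributes at or below threshold" \
        "attributes were at or below threshold in the past" \
        "device error log contains errors" \
        "self-test log contains errors"
    do
        # Test bit number $bit of the status value.
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            echo "bit $bit: $meaning"
        fi
        bit=$((bit + 1))
    done
}
```

Calling decode_smartctl_status 1 reports bit 0, i.e. smartctl considered its command line invalid, assuming the status really came from smartctl and not from sudo.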

Any help/hint is highly welcome.

Are you using a service manager (usually systemd), or running telegraf manually?

Are you on the latest version of telegraf?

In telegraf config, try adding debug = true

Edit: whoops, changed false to true :upside_down_face:

Here are some environment details:

  • telegraf installed from https://repos.influxdata.com, using latest version 1.14.4-1
  • telegraf is run as systemd service
  • I’ve many plugins configured, all work fine, except smart
  • as said before, the smart plugin works under sudo -u telegraf telegraf ... --test

You probably mean debug = true. I tried this too.
The smart plugin still fails as seen from

SELECT enabled,exit_status FROM smart_device WHERE host='serv01'
    2020-06-28T07:25:00Z         1
    2020-06-28T07:30:00Z         1

and a journalctl --utc -u telegraf on this node just shows (prefix text removed)

2020-06-28T07:30:00Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2020-06-28T07:30:10Z D! [outputs.influxdb] Wrote batch of 38 metrics in 10.523401ms
2020-06-28T07:30:10Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics

so there is no telling output around 2020-06-28T07:30:00Z.

This seems to be a general issue with debug = true, see https://github.com/influxdata/telegraf/issues/6584. So it is useful for [agent]-level issues, but not for plugin issues.

Again, any help/hint on possible causes or ways to debug this is highly welcome.

That is strange indeed. And thanks for the tip, I was not aware that debug is not much help when plugins misbehave.

Maybe check that there aren’t a bunch of processes or threads accumulating over time under the telegraf user: ps -edf | grep telegraf

Another possibility is that the smartctl process is hanging/blocking and never returns to telegraf. You could try longer or shorter timeout values, e.g. timeout = "30s"

Hi @FixTestRepeat,

it’s not an accumulation problem; I checked with ps. Since telegraf is restarted to activate config changes, accumulation would lead to a transient failure, not the quasi-static one I see (fine up to a point in time, always failing after that).

Timeout also isn’t the culprit. Testing with

time sudo -u telegraf telegraf --config in.smart_sda.conf --test

first of all works on all nodes, and gives execution times of a few tenths of a second. The timeout is set to 30s.

I really don’t know how to debug this any further; any help/hints very much appreciated.

I’m stumped as well. Maybe set up a monitor over the telegraf process with pidstat -t -p PidOfTelegrafProcess 1 > /tmp/pidstatlog.txt

Depending on how the plugin was written, you might see a thread for smart that runs forever, or threads that run and then terminate. Either way, the pidstat log might give some indication of what’s happening underneath. My money is still on something hanging/blocking/dying that somehow doesn’t take down the whole telegraf process.

Normally, if the plugin were stuck you would get an error message. It’s possible that for some reason the data retrieved from smartctl isn’t producing any metrics. Perhaps you could write a wrapper script for smartctl that saves the command output to a file using tee?
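
A minimal sketch of such a wrapper follows. It assumes the real binary lives at /usr/sbin/smartctl and that the plugin can be pointed at the wrapper through its path option (to my knowledge called path in Telegraf 1.14 and path_smartctl in later releases). The function name run_logged, the log location, and the two override variables are hypothetical. The output is captured first and only then piped through tee, because piping smartctl directly into tee would mask its exit status.

```shell
#!/bin/sh
# Hypothetical smartctl wrapper: logs arguments, output and exit status
# of every call, then passes output and status on to the caller.
# Both defaults are assumptions; adjust them to the local system.
REAL_SMARTCTL="${REAL_SMARTCTL:-/usr/sbin/smartctl}"
LOG_FILE="${SMARTCTL_LOG:-/tmp/smartctl-wrapper.log}"

run_logged() {
    printf '=== %s args: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >>"$LOG_FILE"
    # Capture stdout so the pipe to tee cannot hide the exit status.
    # stderr is diverted to the log only (a simplification).
    output=$("$REAL_SMARTCTL" "$@" 2>>"$LOG_FILE")
    status=$?
    # Hand the output back to the caller and append a copy to the log.
    printf '%s\n' "$output" | tee -a "$LOG_FILE"
    printf '=== exit status: %s\n' "$status" >>"$LOG_FILE"
    return "$status"
}

# Telegraf always passes arguments; without any there is nothing to run.
if [ "$#" -gt 0 ]; then
    run_logged "$@"
    exit $?
fi
```

With use_sudo = true, the sudoers rule for the telegraf user would presumably have to allow the wrapper path instead of (or in addition to) smartctl, and the log file must be writable by whichever user ends up running the wrapper. Comparing the log entries written while telegraf runs as a service with those from a manual --test run should show whether the arguments, the output, or the exit status differ.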

How did you go with this? Still the same issues? Any new findings?