Telegraf smart plugin returns after some time exit_status = 1, but works with telegraf --test

wfjm · June 27, 2020, 5:56pm

I have setup S.M.A.R.T data collection with telegraf via

[[inputs.smart]]
  interval = "6h"
  use_sudo = true
  attributes = true
  devices = [ "/dev/sda" ]
  tagexclude = ["capacity", "model", "serial_no", "wwn"]

Worked fine initially, but stopped for one node after a few days, for another after two weeks.
All other telegraf measurements, and there are quite a few, work fine on these nodes.
I get for smart_device an exit_status=1 from these two nodes. When I test with

sudo -u telegraf sudo smartctl --info --attributes --health  \
                               --format=brief /dev/sda

or

sudo -u telegraf telegraf --config in.smart_sda.conf --test

no errors are returned on both nodes and the expected output is printed.
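
For manual runs like the ones above, it can help to look at the numeric exit code rather than only the printed output. The sketch below decodes a smartctl exit status into the bits documented in the smartctl man page. The function name decode_smartctl_status is made up for illustration, and it assumes that the exit_status field telegraf reports is the exit code of the command it ran.

```shell
# Hypothetical helper: print which smartctl exit-status bits are set.
# Bit meanings are taken from the "EXIT STATUS" section of smartctl(8).
decode_smartctl_status() {
  status=$1
  if [ "$status" -eq 0 ]; then
    echo "ok"
    return 0
  fi
  bit=0
  for msg in \
    "command line did not parse" \
    "device open failed" \
    "SMART or ATA command failed, or checksum error" \
    "SMART status reports DISK FAILING" \
    "prefail attributes at or below threshold" \
    "attributes were at or below threshold in the past" \
    "error log contains records of errors" \
    "self-test log contains errors"
  do
    # Test bit number $bit of the status value.
    if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
      echo "bit $bit: $msg"
    fi
    bit=$((bit + 1))
  done
}
```

It could be called right after the manual command, e.g. decode_smartctl_status $? on the next line. Note that when the command is run through sudo, sudo itself also exits with 1 on a configuration or permission problem, so a value of 1 does not necessarily come from smartctl.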

Any help/hint is highly welcome.

FixTestRepeat · June 27, 2020, 8:45pm

Are you using a service manager (usually systemd), or running telegraf manually?

Are you on the latest version of telegraf?

In telegraf config, try adding debug = true

Edit: whoops, changed false to true

wfjm · June 28, 2020, 7:52am

Here are some environment details:

telegraf installed from https://repos.influxdata.com, using latest version 1.14.4-1
telegraf is run as systemd service
I’ve many plugins configured, all work fine, except smart
as said before, the smart plugin works under sudo -u telegraf telegraf ... --test

You probably mean debug = true. I tried this too.
The smart plugin still fails as seen from

SELECT enabled,exit_status FROM smart_device WHERE host='serv01'
    2020-06-28T07:25:00Z         1
    2020-06-28T07:30:00Z         1

and a journalctl --utc -u telegraf on this node just shows (prefix text removed)

2020-06-28T07:30:00Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
2020-06-28T07:30:10Z D! [outputs.influxdb] Wrote batch of 38 metrics in 10.523401ms
2020-06-28T07:30:10Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics

so no telling output around 2020-06-28T07:30:00Z.

This seems to be a general issue with debug = true, see https://github.com/influxdata/telegraf/issues/6584. So it helps with [agent]-level issues, but not with plugin issues.

Again, any help/hint on possible causes or ways to debug this are highly welcome.

FixTestRepeat · June 28, 2020, 12:01pm

That is strange indeed. And thanks for the tip, I was not aware that debug is not much help when plugins misbehave.

Maybe check there aren’t a bunch of processes or threads accumulating over time under telegraf user: ps -edf | grep telegraf

Another possibility is that the smartctl process is hanging/blocking and never returns to telegraf. You could try longer or shorter timeout values, e.g. timeout = "30s"

wfjm · July 6, 2020, 7:01am

Hi @FixTestRepeat,

it’s not an accumulation problem, I checked with ps. Since telegraf is restarted to activate config changes, accumulation would lead to a transient failure, not the quasi-static one I see (fine up to a point in time, always failing after that time).

Timeout also isn’t the culprit. Testing with

time sudo -u telegraf telegraf --config in.smart_sda.conf --test

first of all works on all nodes, and gives execution times of a few tens of seconds. The timeout is set to 30s.

I really don’t know how to debug this any further. Any help/hints very much appreciated.

FixTestRepeat · July 6, 2020, 9:14am

I’m stumped as well. Maybe set up a monitor over the telegraf process with pidstat -t -p PidOfTelegrafProcess 1 > /tmp/pidstatlog.txt

Depending on how the plugin was written, you might see a thread for smart that runs forever, or threads that run and then terminate. Either way, the pidstat log might show some indication of what’s happening underneath. My money is still on something hanging/blocking/dying that somehow doesn’t take down the whole telegraf process.

daniel · July 8, 2020, 9:54pm

Normally if the plugin were stuck you would get an error message. It’s possible that for some reason the data retrieved from smartctl isn’t producing any metrics. Perhaps you could write a wrapper script for smartctl that saves the command output to a file using tee?
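
A minimal sketch of such a wrapper, under some assumptions: the wrapper location /tmp/smartctl-wrapper.sh, the log file /tmp/smartctl-wrapper.log, and the real binary path /usr/sbin/smartctl are all placeholders (check the last one with command -v smartctl). It passes stdout and stderr through unchanged, appends copies to the log via tee, and returns the original exit status so the plugin should see the same result as before.

```shell
# Write a hypothetical logging wrapper (location is an assumption).
WRAPPER="${WRAPPER:-/tmp/smartctl-wrapper.sh}"

cat > "$WRAPPER" <<'EOF'
#!/usr/bin/env bash
# Real binary and log file; both defaults are assumptions, adjust for your system.
real="${SMARTCTL_REAL:-/usr/sbin/smartctl}"
log="${SMARTCTL_WRAP_LOG:-/tmp/smartctl-wrapper.log}"

# Record when and with which arguments the wrapper was called.
printf '=== %s args: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$log"

# Pass stdout and stderr through while appending copies of both to the log.
"$real" "$@" 2> >(tee -a "$log" >&2) | tee -a "$log"
status=${PIPESTATUS[0]}

# Record the real exit status and hand it back to the caller.
printf '=== exit_status=%s\n' "$status" >> "$log"
exit "$status"
EOF

chmod +x "$WRAPPER"
```

To use it, the smart plugin would have to call the wrapper instead of smartctl. As far as I know the 1.14 plugin has a path option for the smartctl binary, but please verify the option name against the README of your version. With use_sudo = true, the sudoers entry for the telegraf user would presumably also need to allow the wrapper path.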

FixTestRepeat · July 29, 2020, 11:57am

How did you go with this? Still the same issues? Any new findings?
