TrueNAS + Telegraf ceased communication, troubleshooting futile

Hello all,

I had TrueNAS running a custom Telegraf application, and all was well. I can’t pinpoint exactly when the communication stopped; I thought at first it was simply an incomplete DNS change, but then I remembered … I actually got it back up and running last night for a short period of time until it stopped again, and I want to say nothing out of the ordinary caused the cessation last night …

Here’s what we’ve done so far …

  • Confirmed that DNS records for everything resolve properly via MXToolKit, dig, and ping. All is well.

  • Confirmed the API token has sufficient permissions (also granted all permissions for troubleshooting purposes).

  • Confirmed telegraf.conf is accurate (truncated here to the most relevant part):

[[outputs.influxdb_v2]]
  urls = ["URL:PORT"]
  token = "Token"
  organization = "Org"
  bucket = "Bucket"

  • Checked the container log inside TrueNAS. The container is set to run as UID:GID "0:0":

2023-06-27 12:26:53.609426+00:00 Unpacking uuid-runtime (2.36.1-8+deb11u1) ...

2023-06-27 12:26:53.782821+00:00 Selecting previously unselected package nvme-cli.

2023-06-27 12:26:53.786700+00:00 Preparing to unpack .../nvme-cli_1.12-5_amd64.deb ...

2023-06-27 12:26:53.808256+00:00 Unpacking nvme-cli (1.12-5) ...

2023-06-27 12:26:54.039700+00:00 Selecting previously unselected package smartmontools.

2023-06-27 12:26:54.043715+00:00 Preparing to unpack .../smartmontools_7.2-1_amd64.deb ...

2023-06-27 12:26:54.099373+00:00 Unpacking smartmontools (7.2-1) ...

2023-06-27 12:26:54.354569+00:00 Selecting previously unselected package sudo.

2023-06-27 12:26:54.358480+00:00 Preparing to unpack .../sudo_1.9.5p2-3+deb11u1_amd64.deb ...

2023-06-27 12:26:54.383193+00:00 Unpacking sudo (1.9.5p2-3+deb11u1) ...

2023-06-27 12:26:54.903915+00:00 Setting up smartmontools (7.2-1) ...

2023-06-27 12:26:55.145527+00:00 Setting up sudo (1.9.5p2-3+deb11u1) ...

2023-06-27 12:26:55.320281+00:00 invoke-rc.d: could not determine current runlevel

2023-06-27 12:26:55.360839+00:00 Setting up uuid-runtime (2.36.1-8+deb11u1) ...

2023-06-27 12:26:55.472789+00:00 Adding group `uuidd' (GID 101) ...

2023-06-27 12:26:56.400694+00:00 Done.

2023-06-27 12:26:56.462507+00:00 Warning: The home dir /run/uuidd you specified can't be accessed: No such file or directory

2023-06-27 12:26:56.462797+00:00 Adding system user `uuidd' (UID 101) ...

2023-06-27 12:26:56.462812+00:00 Adding new user `uuidd' (UID 101) with group `uuidd' ...

2023-06-27 12:26:57.857894+00:00 Not creating home directory `/run/uuidd'.

2023-06-27 12:26:57.895416+00:00 invoke-rc.d: could not determine current runlevel

2023-06-27 12:26:57.912887+00:00 Starting uuid generator: uuidd.

2023-06-27 12:26:58.110117+00:00 Setting up nvme-cli (1.12-5) ...

2023-06-27 12:26:59.242023+00:00 2023-06-27T12:26:59Z I! Loading config: /etc/telegraf/telegraf.conf

2023-06-27 12:26:59.243326+00:00 2023-06-27T12:26:59Z I! Starting Telegraf 1.27.1

2023-06-27 12:26:59.243346+00:00 2023-06-27T12:26:59Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores

2023-06-27 12:26:59.243352+00:00 2023-06-27T12:26:59Z I! Loaded inputs: cpu diskio execd kernel mem net sensors smart swap system zfs

2023-06-27 12:26:59.243357+00:00 2023-06-27T12:26:59Z I! Loaded aggregators:

2023-06-27 12:26:59.243365+00:00 2023-06-27T12:26:59Z I! Loaded processors:

2023-06-27 12:26:59.243370+00:00 2023-06-27T12:26:59Z I! Loaded secretstores:

2023-06-27 12:26:59.243375+00:00 2023-06-27T12:26:59Z I! Loaded outputs: influxdb_v2

2023-06-27 12:26:59.243380+00:00 2023-06-27T12:26:59Z I! Tags enabled: host=HOST

2023-06-27 12:26:59.243389+00:00 2023-06-27T12:26:59Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"HOST", Flush Interval:10s

2023-06-27 12:26:59.243664+00:00 2023-06-27T12:26:59Z I! [inputs.execd] Starting process: /mnt/zfs_libs/zpool_influxdb [--execd]

2023-06-27 12:26:59.255053+00:00 2023-06-27T12:26:59Z E! [inputs.execd] stderr: "crypto/fips/fips_post.c:235: FIPS self test failure"
  • Checked /var/log (auth, syslog, containers, and pretty much every other log I thought could be relevant) on all 3 servers. Nothing exceptional.

  • Verified connectivity between the TrueNAS SCALE host and the InfluxDB server. (Verified IPv6 connectivity, but not IPv4? See the note after the netstat output below.)

root@TRUENAS:~# netstat -a

Active Internet connections (servers and established)

Proto Recv-Q Send-Q Local Address Foreign Address State

tcp 0 0 127.0.0.53:domain 0.0.0.0:* LISTEN

tcp 0 0 127.0.0.54:domain 0.0.0.0:* LISTEN

tcp 0 0 localhost:smtp 0.0.0.0:* LISTEN

tcp 0 0 TRUENAS.db.hq.domain:33940 TRUENAS.db.hq.domain:8086 ESTABLISHED

tcp6 0 0 localhost:smtp [::]:* LISTEN

tcp6 0 0 [::]:8086 [::]:* LISTEN

tcp6 0 0 [::]:ssh [::]:* LISTEN

tcp6 0 0 TRUENAS.db.hq.domain:8086 Workstation.hq.domain:60569 ESTABLISHED

tcp6 0 0 TRUENAS.db.hq.domain:8086 Firewall.hq.domain:8963 TIME_WAIT

tcp6 0 0 TRUENAS.db.hq.domain:ssh VirtualMachine.hq.domain:41202 ESTABLISHED

tcp6 0 0 TRUENAS.db.hq.domain:8086 Firewall.hq.domain:8886 TIME_WAIT

tcp6 0 0 TRUENAS.db.hq.domain:8086 Workstation.hq.domain:60713 ESTABLISHED

tcp6 0 0 TRUENAS.db.hq.domain:8086 Firewall.hq.domain:8765 TIME_WAIT

tcp6 0 0 TRUENAS.db.hq.domain:8086 TRUENAS.db.hq.domain:33940 ESTABLISHED

tcp6 0 0 TRUENAS.db.hq.domain:8086 Firewall.hq.domain:9132 TIME_WAIT

tcp6 0 0 TRUENAS.db.hq.domain:8086 Workstation.hq.domain:60717 ESTABLISHED

tcp6 0 0 TRUENAS.db.hq.domain:8086 Firewall.hq.domain:8873 TIME_WAIT

tcp6 0 0 TRUENAS.db.hq.domain:8086 Workstation.hq.domain:60716 ESTABLISHED

tcp6 0 0 TRUENAS.db.hq.domain:8086 Firewall.hq.domain:8900 TIME_WAIT

tcp6 0 0 TRUENAS.db.hq.domain:8086 Firewall.hq.domain:9053 TIME_WAIT

tcp6 0 0 TRUENAS.db.hq.domain:8086 Workstation.hq.domain:60714 ESTABLISHED

tcp6 0 0 TRUENAS.db.hq.domain:8086 Firewall.hq.domain:8880 TIME_WAIT

tcp6 0 0 TRUENAS.db.hq.domain:8086 Workstation.hq.domain:60715 ESTABLISHED

tcp6 0 0 TRUENAS.db.hq.domain:8086 TRUENAS.hq.domain:30393 ESTABLISHED

udp 0 0 127.0.0.54:domain 0.0.0.0:*

udp 0 0 127.0.0.53:domain 0.0.0.0:*
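
If it comes to it, I could take DNS and IPv6 out of the equation entirely by pointing the output at a literal IPv4 address instead of the hostname. A quick sketch (the address here is made up, not the real server):

[[outputs.influxdb_v2]]
  # a literal IPv4 address forces the connection over IPv4,
  # bypassing whatever the resolver prefers (hypothetical address)
  urls = ["http://10.0.0.50:8086"]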

Blehhhh … Plenty of other troubleshooting has been done, but I am too exhausted from troubleshooting to remember everything I’ve done well enough to post it here.

I am rebooting all 3 systems involved to attempt to bring the database back up, then manually re-adding the container to try once more …

Anyone have any ideas, questions or suggestions?

What about the telegraf logs?

I will say that I know nothing about how the TrueNAS telegraf setup differs, so you may need to go over to their forums or community and ask if anything is special.

For help here, my best guess is to first look at the telegraf logs, enable debug, and see if metrics are even being sent.

@jpowers Thank you for the prompt response, my friend.

I was tied up all day after an embarrassingly long argument between myself, Telegraf, and TrueNAS, and then had to deal with an incident on our work net/testnet all morning.

I’m recreating the environment now and rebuilding the container in a moment, and will see if there’s anything pertinent. – Any specific locations/logs you’re looking for?

Also, as someone on Reddit suggested … is the FIPS error relevant? I see no outward indication that the container has failed to build or actively run, based on … well, it running, and having connectivity via IPv6, at least.

Is a FIPS failure a fatal error, or would it just lead to errors for FIPS-specific functions?

TrueNAS doesn’t actually … have a Telegraf implementation, so a custom workaround had to be made, which just pulls the latest telegraf image from the repo and sets up the environment as follows:

  # Container Environment Variables
HOST_ETC=/hostfs/etc
HOST_PROC=/hostfs/proc
HOST_SYS=/hostfs/sys
HOST_VAR=/hostfs/var
HOST_RUN=/hostfs/run
HOST_MOUNT_PREFIX=/hostfs
LD_LIBRARY_PATH=/mnt/zfs_libs

  # Port Forwarding
Container Port 8094 -> Node Port 9094 (TCP)

  # Storage (Host Path -> Mount Path)
/mnt/telegraf/telegraf.conf -> /etc/telegraf/telegraf.conf
/mnt/telegraf/etc -> /hostfs/etc
/mnt/telegraf/proc -> /hostfs/proc
/mnt/telegraf/sys -> /hostfs/sys
/mnt/telegraf/run -> /hostfs/run
/mnt/telegraf/entrypoint.sh -> /entrypoint.sh
/mnt/telegraf/zfs_libs -> /mnt/zfs_libs

There is obviously a custom entrypoint and so forth I could show if it would be helpful … but let me build this env and I’ll be back with you.

Ha, I am going to be honest: I didn’t scroll down far enough to see the telegraf output. I saw the other setup messages and wondered what this was for.

The FIPS error might mean you do not get metrics for that one input, but you should still see metrics from the other inputs.

Can you enable debug in your telegraf config? It will print a lot more detail about sending metrics. It would also be good to get a few minutes of logs, to verify that it is able to send metrics and to get an idea of how many.

[agent]
  debug = true

Haha, will do right now.
I’ve spent most of the time between posts pruning and cleaning up all references and leftover pods/containers (any mention of telegraf) to ensure we have as clean a slate as possible to work with.

I did find a few files I had not seen before; there were so many containers from trying over and over that it was hard to sort through.

I have yet to go through them all, but this one seemed important at a glance. Let me get this new instance spun up for you and I’ll brb.

{"log":"Setting up smartmontools (7.2-1) ...\r\n","stream":"stdout","time":"2023-06-22T02:54:54.556020609Z"}
{"log":"Setting up sudo (1.9.5p2-3+deb11u1) ...\r\n","stream":"stdout","time":"2023-06-22T02:54:54.999107089Z"}
{"log":"invoke-rc.d: could not determine current runlevel\r\n","stream":"stdout","time":"2023-06-22T02:54:55.203250476Z"}
{"log":"invoke-rc.d: policy-rc.d denied execution of start.\r\n","stream":"stdout","time":"2023-06-22T02:54:55.212348152Z"}
{"log":"Setting up uuid-runtime (2.36.1-8+deb11u1) ...\r\n","stream":"stdout","time":"2023-06-22T02:54:55.249879945Z"}
{"log":"Adding group `uuidd' (GID 101) ...\r\n","stream":"stdout","time":"2023-06-22T02:54:55.397762871Z"}
{"log":"Done.\r\n","stream":"stdout","time":"2023-06-22T02:54:56.52410506Z"}
{"log":"Warning: The home dir /run/uuidd you specified can't be accessed: No such file or directory\r\n","stream":"stdout","time":"2023-06-22T02:54:56.583900149Z"}
{"log":"Adding system user `uuidd' (UID 101) ...\r\n","stream":"stdout","time":"2023-06-22T02:54:56.584245619Z"}
{"log":"Adding new user `uuidd' (UID 101) with group `uuidd' ...\r\n","stream":"stdout","time":"2023-06-22T02:54:56.584264086Z"}
{"log":"Not creating home directory `/run/uuidd'.\r\n","stream":"stdout","time":"2023-06-22T02:54:57.947458151Z"}
{"log":"invoke-rc.d: could not determine current runlevel\r\n","stream":"stdout","time":"2023-06-22T02:54:57.982839286Z"}
{"log":"invoke-rc.d: policy-rc.d denied execution of start.\r\n","stream":"stdout","time":"2023-06-22T02:54:57.989220683Z"}
{"log":"Setting up nvme-cli (1.12-5) ...\r\n","stream":"stdout","time":"2023-06-22T02:54:58.18527136Z"}
{"log":"2023-06-22T02:54:59Z I! Loading config: /etc/telegraf/telegraf.conf\n","stream":"stderr","time":"2023-06-22T02:54:59.204355549Z"}
{"log":"2023-06-22T02:54:59Z I! Starting Telegraf 1.27.1\n","stream":"stderr","time":"2023-06-22T02:54:59.205949179Z"}
{"log":"2023-06-22T02:54:59Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores\n","stream":"stderr","time":"2023-06-22T02:54:59.205973356Z"}
{"log":"2023-06-22T02:54:59Z I! Loaded inputs: cpu diskio execd kernel mem net sensors smart swap system zfs\n","stream":"stderr","time":"2023-06-22T02:54:59.205985809Z"}
{"log":"2023-06-22T02:54:59Z I! Loaded aggregators: \n","stream":"stderr","time":"2023-06-22T02:54:59.205994937Z"}
{"log":"2023-06-22T02:54:59Z I! Loaded processors: \n","stream":"stderr","time":"2023-06-22T02:54:59.206008159Z"}
{"log":"2023-06-22T02:54:59Z I! Loaded secretstores: \n","stream":"stderr","time":"2023-06-22T02:54:59.206016151Z"}
{"log":"2023-06-22T02:54:59Z I! Loaded outputs: influxdb_v2\n","stream":"stderr","time":"2023-06-22T02:54:59.206024392Z"}
{"log":"2023-06-22T02:54:59Z I! Tags enabled: host=agrippa\n","stream":"stderr","time":"2023-06-22T02:54:59.206031864Z"}
{"log":"2023-06-22T02:54:59Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:\"agrippa\", Flush Interval:10s\n","stream":"stderr","time":"2023-06-22T02:54:59.206039574Z"}
{"log":"2023-06-22T02:54:59Z I! [inputs.execd] Starting process: /mnt/zfs_libs/zpool_influxdb [--execd]\n","stream":"stderr","time":"2023-06-22T02:54:59.212980465Z"}
{"log":"2023-06-22T02:54:59Z E! [inputs.execd] stderr: \"crypto/fips/fips_post.c:235: FIPS self test failure\"\n","stream":"stderr","time":"2023-06-22T02:54:59.227174811Z"}
{"log":"2023-06-22T13:41:30Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:41:30.0009276Z"}
{"log":"2023-06-22T13:41:40Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:41:40.001259822Z"}
{"log":"2023-06-22T13:42:00Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:42:00.001491406Z"}
{"log":"2023-06-22T13:42:20Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:42:20.001465616Z"}
{"log":"2023-06-22T13:42:40Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:42:40.002184056Z"}
{"log":"2023-06-22T13:43:00Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:43:00.002538033Z"}
{"log":"2023-06-22T13:43:20Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:43:20.00145262Z"}
{"log":"2023-06-22T13:43:30Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:43:30.002048779Z"}
{"log":"2023-06-22T13:44:00Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:44:00.000762417Z"}
{"log":"2023-06-22T13:44:20Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:44:20.002226842Z"}
{"log":"2023-06-22T13:44:30Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:44:30.00270687Z"}
{"log":"2023-06-22T13:44:50Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:44:50.002196945Z"}
{"log":"2023-06-22T13:45:00Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:45:00.002357333Z"}
{"log":"2023-06-22T13:45:20Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:45:20.001726678Z"}
{"log":"2023-06-22T13:45:30Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:45:30.001788342Z"}
{"log":"2023-06-22T13:45:40Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:45:40.003145886Z"}
{"log":"2023-06-22T13:45:45Z E! [inputs.smart] Error in plugin: failed to run command '/usr/sbin/smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sde': command timed out - smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107+truenas] (local build)\n","stream":"stderr","time":"2023-06-22T13:45:45.407629997Z"}
{"log":"Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org\n","stream":"stderr","time":"2023-06-22T13:45:45.407793733Z"}
{"log":"\n","stream":"stderr","time":"2023-06-22T13:45:45.407845784Z"}
{"log":"=== START OF INFORMATION SECTION ===\n","stream":"stderr","time":"2023-06-22T13:45:45.407884522Z"}
{"log":"Model Family:     Hitachi Ultrastar A7K1000\n","stream":"stderr","time":"2023-06-22T13:45:45.407923719Z"}
{"log":"Device Model:     HUA721010KLA330\n","stream":"stderr","time":"2023-06-22T13:45:45.40798974Z"}
{"log":"Serial Number:    PBHN30MF\n","stream":"stderr","time":"2023-06-22T13:45:45.408029977Z"}
{"log":"Firmware Version: GKAOA94A\n","stream":"stderr","time":"2023-06-22T13:45:45.408066409Z"}
{"log":"User Capacity:    1,000,204,886,016 bytes [1.00 TB]\n","stream":"stderr","time":"2023-06-22T13:45:45.408103005Z"}
{"log":"Sector Size:      512 bytes logical/physical\n","stream":"stderr","time":"2023-06-22T13:45:45.40816398Z"}
{"log":"Device is:        In smartctl database [for details use: -P show]\n","stream":"stderr","time":"2023-06-22T13:45:45.408202396Z"}
{"log":"ATA Version is:   ATA/ATAPI-7 T13/1532D revision 1\n","stream":"stderr","time":"2023-06-22T13:45:45.408239955Z"}
{"log":"Local Time is:    Thu Jun 22 13:45:30 2023 UTC\n","stream":"stderr","time":"2023-06-22T13:45:45.408306169Z"}
{"log":"SMART support is: Available - device has SMART capability.\n","stream":"stderr","time":"2023-06-22T13:45:45.408348679Z"}
{"log":"SMART support is: Enabled\n","stream":"stderr","time":"2023-06-22T13:45:45.408410261Z"}
{"log":"Power mode is:    ACTIVE or IDLE\n","stream":"stderr","time":"2023-06-22T13:45:45.40844837Z"}
{"log":"\n","stream":"stderr","time":"2023-06-22T13:45:45.408485159Z"}
{"log":"=== START OF READ SMART DATA SECTION ===\n","stream":"stderr","time":"2023-06-22T13:45:45.408521072Z"}
{"log":"SMART overall-health self-assessment test result: PASSED\n","stream":"stderr","time":"2023-06-22T13:45:45.40855751Z"}
{"log":"\n","stream":"stderr","time":"2023-06-22T13:45:45.40861843Z"}
{"log":"SMART Attributes Data Structure revision number: 16\n","stream":"stderr","time":"2023-06-22T13:45:45.408655693Z"}
{"log":"Vendor Specific SMART Attributes with Thresholds:\n","stream":"stderr","time":"2023-06-22T13:45:45.408692487Z"}
{"log":"ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE\n","stream":"stderr","time":"2023-06-22T13:45:45.408728236Z"}
{"log":"  1 Raw_Read_Error_Rate     PO-R--   096   096   016    -    458753\n","stream":"stderr","time":"2023-06-22T13:45:45.408796944Z"}
{"log":"  2 Throughput_Performance  P-S---   100   100   054    -    0\n","stream":"stderr","time":"2023-06-22T13:45:45.408845777Z"}
{"log":"  3 Spin_Up_Time            POS---   106   106   024    -    682 (Average 631)\n","stream":"stderr","time":"2023-06-22T13:45:45.408889655Z"}
{"log":"  4 Start_Stop_Count        -O--C-   099   099   000    -    5446\n","stream":"stderr","time":"2023-06-22T13:45:45.408927984Z"}
{"log":"  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0\n","stream":"stderr","time":"2023-06-22T13:45:45.408974847Z"}
{"log":"  7 Seek_Error_Rate         PO-R--   100   100   067    -    0\n","stream":"stderr","time":"2023-06-22T13:45:45.4090026Z"}
{"log":"  8 Seek_Time_Performance   P-S---   128   128   020    -    35\n","stream":"stderr","time":"2023-06-22T13:45:45.409028283Z"}
{"log":"  9 Power_On_Hours          -O--C-   089   089   000    -    83048\n","stream":"stderr","time":"2023-06-22T13:45:45.409053463Z"}
{"log":" 10 Spin_Retry_Count        PO--C-   100   100   060    -    0\n","stream":"stderr","time":"2023-06-22T13:45:45.409097619Z"}
{"log":" 12 Power_Cycle_Count       -O--CK   100   100   000    -    371\n","stream":"stderr","time":"2023-06-22T13:45:45.409124766Z"}
{"log":"192 Power-Off_Retract_Count -O--CK   096   096   000    -    5921\n","stream":"stderr","time":"2023-06-22T13:45:45.409151087Z"}
{"log":"193 Load_Cycle_Count        -O--C-   096   096   000    -    5921\n","stream":"stderr","time":"2023-06-22T13:45:45.409194117Z"}
{"log":"194 Temperature_Celsius     -O----   125   125   000    -    48 (Min/Max 17/56)\n","stream":"stderr","time":"2023-06-22T13:45:45.409223747Z"}
{"log":"196 Reallocated_Event_Count -O--CK   100   100   000    -    0\n","stream":"stderr","time":"2023-06-22T13:45:45.40924959Z"}
{"log":"197 Current_Pending_Sector  -O---K   100   100   000    -    0\n","stream":"stderr","time":"2023-06-22T13:45:45.409275092Z"}
{"log":"198 Offline_Uncorrectable   ---R--   100   100   000    -    0\n","stream":"stderr","time":"2023-06-22T13:45:45.409316068Z"}
{"log":"199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0\n","stream":"stderr","time":"2023-06-22T13:45:45.409342986Z"}
{"log":"                            ||||||_ K auto-keep\n","stream":"stderr","time":"2023-06-22T13:45:45.409368043Z"}
{"log":"                            |||||__ C event count\n","stream":"stderr","time":"2023-06-22T13:45:45.409393313Z"}
{"log":"                            ||||___ R error rate\n","stream":"stderr","time":"2023-06-22T13:45:45.409433937Z"}
{"log":"                            |||____ S speed/performance\n","stream":"stderr","time":"2023-06-22T13:45:45.409460798Z"}
{"log":"                            ||_____ O updated online\n","stream":"stderr","time":"2023-06-22T13:45:45.40948593Z"}
{"log":"                            |______ P prefailure warning\n","stream":"stderr","time":"2023-06-22T13:45:45.409511212Z"}
{"log":"\n","stream":"stderr","time":"2023-06-22T13:45:45.40955212Z"}
{"log":"2023-06-22T13:46:00Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:46:00.001905643Z"}
{"log":"2023-06-22T13:46:30Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:46:30.001187426Z"}
{"log":"2023-06-22T13:46:40Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:46:40.001358571Z"}
{"log":"2023-06-22T13:46:50Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:46:50.001348316Z"}
{"log":"2023-06-22T13:47:10Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:47:10.00239994Z"}
{"log":"2023-06-22T13:47:20Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:47:20.001610232Z"}
{"log":"2023-06-22T13:47:30Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:47:30.001640065Z"}
{"log":"2023-06-22T13:47:40Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:47:40.002226715Z"}
{"log":"2023-06-22T13:47:44Z E! [inputs.smart] Error in plugin: failed to run command '/usr/sbin/smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sde': command timed out - smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107+truenas] (local build)\n","stream":"stderr","time":"2023-06-22T13:47:44.552659265Z"}
{"log":"Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org\n","stream":"stderr","time":"2023-06-22T13:47:44.552760701Z"}
{"log":"\n","stream":"stderr","time":"2023-06-22T13:47:44.552795995Z"}

{"log":"2023-06-22T13:45:45Z E! [inputs.smart] Error in plugin: failed to run command '/usr/sbin/smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sde': command timed out - smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107+truenas] (local build)\n","stream":"stderr","time":"2023-06-22T13:45:45.407629997Z"}

So an input erroring like that will usually keep it from reporting, and here the smart input is repeatedly overrunning its collection interval. This is probably why you are not seeing metrics. It looks like the smart command took quite a while before it completed successfully.

I’d try running the command locally and/or in the container and see how long it takes. It could be a sign that the drive is failing and smartctl doesn’t return quickly. There is also a timeout config option in inputs.smart that you could use to increase the timeout value.
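
For reference, a minimal sketch of what that might look like (the 60s value is only an illustration; size it to however long smartctl actually takes on your drives):

[[inputs.smart]]
  # give slow drives more time before the smartctl call is treated as timed out
  # (the plugin default is 30s; 60s here is just an example value)
  timeout = "60s"
  use_sudo = true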

Okay, something seems maybe strange. I’m monitoring now; however, either it’s fixed and the timezone is off (it’s displaying no metrics within the last hour), or something else is incorrect. In any case, this is the debug log since launching the new container.

As per your last response, I will look into that if this stops again. The drives “shouldn’t” be failing, as they’re still under the return policy, I believe, haha. But stranger things have happened.

It was also perhaps just a “gunking up” of files from having attempted to spin this up 50 times, with it apparently not doing a proper job of pruning itself.

2023-06-28 14:58:50.895261+00:00 invoke-rc.d: could not determine current runlevel

2023-06-28 14:58:50.900369+00:00 invoke-rc.d: policy-rc.d denied execution of start.

2023-06-28 14:58:51.090283+00:00 Setting up nvme-cli (1.12-5) ...

2023-06-28 14:58:52.088863+00:00 2023-06-28T14:58:52Z I! Loading config: /etc/telegraf/telegraf.conf

2023-06-28 14:58:52.090544+00:00 2023-06-28T14:58:52Z I! Starting Telegraf 1.27.1

2023-06-28 14:58:52.090573+00:00 2023-06-28T14:58:52Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores

2023-06-28 14:58:52.090585+00:00 2023-06-28T14:58:52Z I! Loaded inputs: cpu diskio execd kernel mem net sensors smart swap system zfs

2023-06-28 14:58:52.090608+00:00 2023-06-28T14:58:52Z I! Loaded aggregators:

2023-06-28 14:58:52.090616+00:00 2023-06-28T14:58:52Z I! Loaded processors:

2023-06-28 14:58:52.090622+00:00 2023-06-28T14:58:52Z I! Loaded secretstores:

2023-06-28 14:58:52.090627+00:00 2023-06-28T14:58:52Z I! Loaded outputs: influxdb_v2

2023-06-28 14:58:52.090637+00:00 2023-06-28T14:58:52Z I! Tags enabled: host=Agrippa

2023-06-28 14:58:52.090644+00:00 2023-06-28T14:58:52Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"Agrippa", Flush Interval:10s

2023-06-28 14:58:52.090649+00:00 2023-06-28T14:58:52Z D! [agent] Initializing plugins

2023-06-28 14:58:52.090864+00:00 2023-06-28T14:58:52Z D! [agent] Connecting outputs

2023-06-28 14:58:52.090892+00:00 2023-06-28T14:58:52Z D! [agent] Attempting connection to [outputs.influxdb_v2]

2023-06-28 14:58:52.091033+00:00 2023-06-28T14:58:52Z D! [agent] Successfully connected to outputs.influxdb_v2

2023-06-28 14:58:52.091050+00:00 2023-06-28T14:58:52Z D! [agent] Starting service inputs

2023-06-28 14:58:52.091117+00:00 2023-06-28T14:58:52Z I! [inputs.execd] Starting process: /mnt/zfs_libs/zpool_influxdb [--execd]

2023-06-28 14:58:52.107842+00:00 2023-06-28T14:58:52Z E! [inputs.execd] stderr: "crypto/fips/fips_post.c:235: FIPS self test failure"

2023-06-28 14:59:00.021804+00:00 2023-06-28T14:59:00Z D! [inputs.system] Reading users: open /hostfs/var/run/utmp: no such file or directory

2023-06-28 14:59:02.184536+00:00 2023-06-28T14:59:02Z D! [outputs.influxdb_v2] Wrote batch of 522 metrics in 92.237818ms

2023-06-28 14:59:02.184641+00:00 2023-06-28T14:59:02Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics

2023-06-28 14:59:10.025744+00:00 2023-06-28T14:59:10Z D! [inputs.system] Reading users: open /hostfs/var/run/utmp: no such file or directory

2023-06-28 14:59:12.194332+00:00 2023-06-28T14:59:12Z D! [outputs.influxdb_v2] Wrote batch of 531 metrics in 101.08907ms

2023-06-28 14:59:12.194482+00:00 2023-06-28T14:59:12Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics

2023-06-28 14:59:20.017720+00:00 2023-06-28T14:59:20Z D! [inputs.system] Reading users: open /hostfs/var/run/utmp: no such file or directory

2023-06-28 14:59:22.439526+00:00 2023-06-28T14:59:22Z D! [outputs.influxdb_v2] Wrote batch of 531 metrics in 345.294282ms

2023-06-28 14:59:22.439832+00:00 2023-06-28T14:59:22Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics

2023-06-28 14:59:30.017740+00:00 2023-06-28T14:59:30Z D! [inputs.system] Reading users: open /hostfs/var/run/utmp: no such file or directory

2023-06-28 14:59:32.200716+00:00 2023-06-28T14:59:32Z D! [outputs.influxdb_v2] Wrote batch of 531 metrics in 105.465327ms

2023-06-28 14:59:32.200841+00:00 2023-06-28T14:59:32Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics

2023-06-28 14:59:40.022570+00:00 2023-06-28T14:59:40Z D! [inputs.system] Reading users: open /hostfs/var/run/utmp: no such file or directory

So this means that metrics are clearly being written successfully. If you want to check the timestamps, you could add the [[outputs.file]] output. It will print the metrics to stdout as well, but it will be a lot of output since you have ~500 at a time.
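
A sketch of what that looks like; writing to stdout in influx line protocol means the nanosecond timestamp at the end of every line is visible in the container log (remove it again once you’ve checked):

[[outputs.file]]
  # mirror every metric to stdout in line protocol;
  # the trailing integer on each line is the metric timestamp
  files = ["stdout"]
  data_format = "influx"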

500 seemed like a lot to me as well, as my config seems pretty light.

[agent]
    interval = "10s"
    round_interval = true
    metric_batch_size = 1000
    metric_buffer_limit = 10000
    collection_jitter = "0s"
    flush_interval = "10s"
    flush_jitter = "0s"
    precision = ""
    hostname = "xxx"
    omit_hostname = false
    debug = true
...
[[inputs.cpu]]
    percpu = true
    totalcpu = true
    collect_cpu_time = false
    report_active = false
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.net]]
[[inputs.sensors]]
[[inputs.execd]]
    command = ["/mnt/zfs_libs/zpool_influxdb", "--execd"]
    environment = ["LD_LIBRARY_PATH=/mnt/zfs_libs"]
    signal = "STDIN"
    restart_delay = "10s"
    data_format = "influx"
[[inputs.zfs]]
    kstatPath = "/hostfs/proc/spl/kstat/zfs"
    poolMetrics = true
    datasetMetrics = true
[[inputs.smart]]
    timeout = "30s"
    attributes = true
    use_sudo = true

Also, curiously, although it doesn’t seem to be causing many issues, I wonder why the users read is causing the /var/run/utmp error.

I am going to gracefully restart the affected systems/VMs, as it seems there was/is an issue with TrueNAS not connecting to NTP that was causing the offset in the InfluxDB timestamps.

The file itself is used to track logins/logouts, and telegraf attempts to read it to report the count of current users. It may not exist on your host, or it may simply not be mounted into the container at /hostfs/var (your storage list maps etc, proc, sys, and run, but not var), hence the error message. You could verify by checking whether the file even exists on the host. Either way, the system input will still produce metrics, just without the user info.
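
If the message bothers you and you don’t need the user count, one option, as a sketch, is to filter those fields out of the system input. Note this only drops the fields from the emitted metrics; the plugin still attempts the utmp read, so the debug line itself may persist:

[[inputs.system]]
  # drop the user-count fields that depend on /var/run/utmp
  # (filtering happens after collection, so the log line may remain)
  fieldexclude = ["n_users", "n_unique_users"]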

500 seemed like a lot to me as well, as my config seems pretty light.

It can add up quickly :)

smart will print a different metric for each SMART attribute per device, so that can climb quickly depending on how many drives you have and how many attributes they expose.

The other one that can expose quite a few is sensors.

zfs depends on the number of pools, cpu emits one metric per CPU plus one for the total, and net can have a metric for every docker bridge device you have.
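
If the per-interval volume ever matters, a couple of illustrative knobs (the interface name "eth0" is a placeholder for whatever your NIC is actually called):

[[inputs.cpu]]
  # report only the aggregate; per-core metrics multiply fast on big machines
  percpu = false
  totalcpu = true

[[inputs.net]]
  # collect only the physical NIC instead of every bridge/veth device
  interfaces = ["eth0"]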

@jpowers Just to follow up in case anyone else runs into this issue: I apologize for leaving this hanging for a few days. It’s been a rough start to the week, haha.

You are not able to actually create Docker Compose files or Dockerfiles and run them from the shell (IIRC), so you have to go through a dumb manual process in the web GUI.

I do not know if this was the original issue, as I just ran a system prune and started over. But I have noticed that any time there is a major change, it breaks again: for instance, I just moved my hard drives from an i7-7700K system I had lying around into an actual HPE server, and now I’m having failures again, with nothing obvious in the log files prima facie.

When I click on the application, though, I get the following:

2023-07-06 10:07:09 Successfully assigned ix-telegraf/telegraf-ix-chart-6ccfd96db5-plldg to ix-truenas

2023-07-06 10:07:09 Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.0.0.1 8.8.8.8 8.8.4.4

2023-07-06 10:07:08 Error updating Endpoint Slices for Service ix-telegraf/telegraf-ix-chart: failed to create EndpointSlice for Service ix-telegraf/telegraf-ix-chart: Unauthorized

2023-07-06 10:07:08 Failed to create new replica set "telegraf-ix-chart-6ccfd96db5": Unauthorized

2023-07-06 2:42:09 Scaled down replica set telegraf-ix-chart-65d59c8f98 to 0 from 1

2023-07-06 2:42:09 Deleted pod: telegraf-ix-chart-65d59c8f98-nfhmv

2023-07-06 2:42:09 Stopping container ix-chart

2023-07-06 2:36:45 Started container ix-chart

2023-07-06 2:36:44 Created container ix-chart

2023-07-06 2:36:39 Add eth0 [172.16.0.69/16] from ix-net

2023-07-06 2:36:39 Container image "telegraf:latest" already present on machine

2023-07-06 2:36:37 Scaled up replica set telegraf-ix-chart-65d59c8f98 to 1 from 0

2023-07-06 2:36:37 Created pod: telegraf-ix-chart-65d59c8f98-nfhmv

2023-07-06 2:36:37 Successfully assigned ix-telegraf/telegraf-ix-chart-65d59c8f98-nfhmv to ix-truenas

2023-07-05 23:14:29 MountVolume.SetUp failed for volume "kube-api-access-5kffp" : object "ix-telegraf"/"kube-root-ca.crt" not registered

2023-07-05 23:14:27 MountVolume.SetUp failed for volume "kube-api-access-jz2tq" : object "ix-telegraf"/"kube-root-ca.crt" not registered

So my next step to resolve the issue will be to delete the application, run a docker system prune, and add the application again; it should spin right up.

Thank you for your assistance earlier, by the way.

Thanks for following up! It’s unfortunate that you had to do the cleanup, but I am glad to hear it works and that it didn’t appear to be an issue with telegraf.