I had TrueNAS running a custom Telegraf application, and all was well. I can't pinpoint exactly when the communication stopped; at first I thought it was simply an incomplete DNS change, but then I remembered I actually got it back up and running last night for a short period before it stopped again, and as far as I can tell nothing unusual caused last night's cessation.
Here’s what we’ve done so far …
Confirmed DNS records for everything resolve properly via MXToolKit, dig and ping. All is well.
Confirmed API Token has sufficient permissions (Also granted all permissions for troubleshooting purposes.)
Confirmed telegraf.conf is accurate (truncated here to the most relevant part):

```toml
[[outputs.influxdb_v2]]
  urls = ["URL:PORT"]
  token = "Token"
  organization = "Org"
  bucket = "Bucket"
```
Checked container log inside TrueNAS. Container set to run as UID:GID “0:0”
2023-06-27 12:26:53.609426+00:00 Unpacking uuid-runtime (2.36.1-8+deb11u1) ...
2023-06-27 12:26:53.782821+00:00 Selecting previously unselected package nvme-cli.
2023-06-27 12:26:53.786700+00:00 Preparing to unpack .../nvme-cli_1.12-5_amd64.deb ...
2023-06-27 12:26:53.808256+00:00 Unpacking nvme-cli (1.12-5) ...
2023-06-27 12:26:54.039700+00:00 Selecting previously unselected package smartmontools.
2023-06-27 12:26:54.043715+00:00 Preparing to unpack .../smartmontools_7.2-1_amd64.deb ...
2023-06-27 12:26:54.099373+00:00 Unpacking smartmontools (7.2-1) ...
2023-06-27 12:26:54.354569+00:00 Selecting previously unselected package sudo.
2023-06-27 12:26:54.358480+00:00 Preparing to unpack .../sudo_1.9.5p2-3+deb11u1_amd64.deb ...
2023-06-27 12:26:54.383193+00:00 Unpacking sudo (1.9.5p2-3+deb11u1) ...
2023-06-27 12:26:54.903915+00:00 Setting up smartmontools (7.2-1) ...
2023-06-27 12:26:55.145527+00:00 Setting up sudo (1.9.5p2-3+deb11u1) ...
2023-06-27 12:26:55.320281+00:00 invoke-rc.d: could not determine current runlevel
2023-06-27 12:26:55.360839+00:00 Setting up uuid-runtime (2.36.1-8+deb11u1) ...
2023-06-27 12:26:55.472789+00:00 Adding group `uuidd' (GID 101) ...
2023-06-27 12:26:56.400694+00:00 Done.
2023-06-27 12:26:56.462507+00:00 Warning: The home dir /run/uuidd you specified can't be accessed: No such file or directory
2023-06-27 12:26:56.462797+00:00 Adding system user `uuidd' (UID 101) ...
2023-06-27 12:26:56.462812+00:00 Adding new user `uuidd' (UID 101) with group `uuidd' ...
2023-06-27 12:26:57.857894+00:00 Not creating home directory `/run/uuidd'.
2023-06-27 12:26:57.895416+00:00 invoke-rc.d: could not determine current runlevel
2023-06-27 12:26:57.912887+00:00 Starting uuid generator: uuidd.
2023-06-27 12:26:58.110117+00:00 Setting up nvme-cli (1.12-5) ...
2023-06-27 12:26:59.242023+00:00 2023-06-27T12:26:59Z I! Loading config: /etc/telegraf/telegraf.conf
2023-06-27 12:26:59.243326+00:00 2023-06-27T12:26:59Z I! Starting Telegraf 1.27.1
2023-06-27 12:26:59.243346+00:00 2023-06-27T12:26:59Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores
2023-06-27 12:26:59.243352+00:00 2023-06-27T12:26:59Z I! Loaded inputs: cpu diskio execd kernel mem net sensors smart swap system zfs
2023-06-27 12:26:59.243357+00:00 2023-06-27T12:26:59Z I! Loaded aggregators:
2023-06-27 12:26:59.243365+00:00 2023-06-27T12:26:59Z I! Loaded processors:
2023-06-27 12:26:59.243370+00:00 2023-06-27T12:26:59Z I! Loaded secretstores:
2023-06-27 12:26:59.243375+00:00 2023-06-27T12:26:59Z I! Loaded outputs: influxdb_v2
2023-06-27 12:26:59.243380+00:00 2023-06-27T12:26:59Z I! Tags enabled: host=HOST
2023-06-27 12:26:59.243389+00:00 2023-06-27T12:26:59Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"HOST", Flush Interval:10s
2023-06-27 12:26:59.243664+00:00 2023-06-27T12:26:59Z I! [inputs.execd] Starting process: /mnt/zfs_libs/zpool_influxdb [--execd]
2023-06-27 12:26:59.255053+00:00 2023-06-27T12:26:59Z E! [inputs.execd] stderr: "crypto/fips/fips_post.c:235: FIPS self test failure"
Checked /var/log (auth, syslog, containers, and pretty much every other log that could be relevant) on all 3 servers. Nothing exceptional.
Verified connectivity between the TrueNAS SCALE box and the InfluxDB server. (Verified IPv6 connectivity, but not IPv4?)
I will say that I know nothing about how the TrueNAS telegraf is different, so you may need to go over to their forums or community and ask if anything is special.
For help here, my best guess is to first look at the telegraf logs, enable debug, and see whether metrics are even being sent.
@jpowers Thank you for the prompt response, my friend.
I was tied up all day: an embarrassingly long argument between myself, Telegraf, and TrueNAS, plus an incident on our work net/testnet that took all morning.
I'm recreating the environment and rebuilding the container now, and will see if there's anything pertinent. Any specific locations/logs you're looking for?
Also, as someone on Reddit suggested: is the FIPS error relevant? I see no outward indication that the container has failed to build or run; it is running and has connectivity via IPv6, at least.
Is a FIPS failure fatal, or would it just cause errors for FIPS-specific functions?
TrueNAS doesn't actually have a Telegraf implementation, so a custom workaround had to be made, which just pulls the latest telegraf image from the repo and sets up the environment as follows:
# Container Environment Variables
HOST_ETC=/hostfs/etc
HOST_PROC=/hostfs/proc
HOST_SYS=/hostfs/sys
HOST_VAR=/hostfs/var
HOST_RUN=/hostfs/run
HOST_MOUNT_PREFIX=/hostfs
LD_LIBRARY_PATH=/mnt/zfs_libs

# Port Forwarding
Container Port 8094 -> Node Port 9094 (TCP)

# Storage (Host Path -> Mount Path)
/mnt/telegraf/telegraf.conf -> /etc/telegraf/telegraf.conf
/mnt/telegraf/etc -> /hostfs/etc
/mnt/telegraf/proc -> /hostfs/proc
/mnt/telegraf/sys -> /hostfs/sys
/mnt/telegraf/run -> /hostfs/run
/mnt/telegraf/entrypoint.sh -> /entrypoint.sh
/mnt/telegraf/zfs_libs -> /mnt/zfs_libs
There is obviously a custom entrypoint and so forth that I could show if it would be helpful, but let me build this env and I'll be back with you.
Ha, I'll be honest: I didn't scroll down far enough to see the telegraf output at first. I saw the other setup messages and wondered what this was for.
The FIPS error might mean you do not get metrics for that one input. But you should still see metrics from the other inputs.
Can you enable debug in your telegraf config? It will print a lot more detail about sending metrics. It would also be good to get a few minutes of logs to verify that it is able to send metrics, and to get an idea of how many.
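For anyone following along, debug logging is switched on in the `[agent]` section of telegraf.conf. A minimal sketch (the interval values here just mirror what the logs above report):

```toml
[agent]
  interval = "10s"
  flush_interval = "10s"
  ## Turn on debug-level logging so each batch write is reported
  debug = true
  quiet = false
```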
Haha, will do right now.
I've spent most of the time in between posts pruning and cleaning up all references, leftover pods/containers, and any mention of telegraf, to ensure we have as clean a slate as possible to work with.
I did find a few files I had not seen before, as there were so many containers from trying over and over, it was hard to sort through.
I have yet to go through them all, but this one seemed important at a glance. Let me get this new instance spun up for you and I'll brb.
{"log":"Setting up smartmontools (7.2-1) ...\r\n","stream":"stdout","time":"2023-06-22T02:54:54.556020609Z"}
{"log":"Setting up sudo (1.9.5p2-3+deb11u1) ...\r\n","stream":"stdout","time":"2023-06-22T02:54:54.999107089Z"}
{"log":"invoke-rc.d: could not determine current runlevel\r\n","stream":"stdout","time":"2023-06-22T02:54:55.203250476Z"}
{"log":"invoke-rc.d: policy-rc.d denied execution of start.\r\n","stream":"stdout","time":"2023-06-22T02:54:55.212348152Z"}
{"log":"Setting up uuid-runtime (2.36.1-8+deb11u1) ...\r\n","stream":"stdout","time":"2023-06-22T02:54:55.249879945Z"}
{"log":"Adding group `uuidd' (GID 101) ...\r\n","stream":"stdout","time":"2023-06-22T02:54:55.397762871Z"}
{"log":"Done.\r\n","stream":"stdout","time":"2023-06-22T02:54:56.52410506Z"}
{"log":"Warning: The home dir /run/uuidd you specified can't be accessed: No such file or directory\r\n","stream":"stdout","time":"2023-06-22T02:54:56.583900149Z"}
{"log":"Adding system user `uuidd' (UID 101) ...\r\n","stream":"stdout","time":"2023-06-22T02:54:56.584245619Z"}
{"log":"Adding new user `uuidd' (UID 101) with group `uuidd' ...\r\n","stream":"stdout","time":"2023-06-22T02:54:56.584264086Z"}
{"log":"Not creating home directory `/run/uuidd'.\r\n","stream":"stdout","time":"2023-06-22T02:54:57.947458151Z"}
{"log":"invoke-rc.d: could not determine current runlevel\r\n","stream":"stdout","time":"2023-06-22T02:54:57.982839286Z"}
{"log":"invoke-rc.d: policy-rc.d denied execution of start.\r\n","stream":"stdout","time":"2023-06-22T02:54:57.989220683Z"}
{"log":"Setting up nvme-cli (1.12-5) ...\r\n","stream":"stdout","time":"2023-06-22T02:54:58.18527136Z"}
{"log":"2023-06-22T02:54:59Z I! Loading config: /etc/telegraf/telegraf.conf\n","stream":"stderr","time":"2023-06-22T02:54:59.204355549Z"}
{"log":"2023-06-22T02:54:59Z I! Starting Telegraf 1.27.1\n","stream":"stderr","time":"2023-06-22T02:54:59.205949179Z"}
{"log":"2023-06-22T02:54:59Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores\n","stream":"stderr","time":"2023-06-22T02:54:59.205973356Z"}
{"log":"2023-06-22T02:54:59Z I! Loaded inputs: cpu diskio execd kernel mem net sensors smart swap system zfs\n","stream":"stderr","time":"2023-06-22T02:54:59.205985809Z"}
{"log":"2023-06-22T02:54:59Z I! Loaded aggregators: \n","stream":"stderr","time":"2023-06-22T02:54:59.205994937Z"}
{"log":"2023-06-22T02:54:59Z I! Loaded processors: \n","stream":"stderr","time":"2023-06-22T02:54:59.206008159Z"}
{"log":"2023-06-22T02:54:59Z I! Loaded secretstores: \n","stream":"stderr","time":"2023-06-22T02:54:59.206016151Z"}
{"log":"2023-06-22T02:54:59Z I! Loaded outputs: influxdb_v2\n","stream":"stderr","time":"2023-06-22T02:54:59.206024392Z"}
{"log":"2023-06-22T02:54:59Z I! Tags enabled: host=agrippa\n","stream":"stderr","time":"2023-06-22T02:54:59.206031864Z"}
{"log":"2023-06-22T02:54:59Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:\"agrippa\", Flush Interval:10s\n","stream":"stderr","time":"2023-06-22T02:54:59.206039574Z"}
{"log":"2023-06-22T02:54:59Z I! [inputs.execd] Starting process: /mnt/zfs_libs/zpool_influxdb [--execd]\n","stream":"stderr","time":"2023-06-22T02:54:59.212980465Z"}
{"log":"2023-06-22T02:54:59Z E! [inputs.execd] stderr: \"crypto/fips/fips_post.c:235: FIPS self test failure\"\n","stream":"stderr","time":"2023-06-22T02:54:59.227174811Z"}
{"log":"2023-06-22T13:41:30Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:41:30.0009276Z"}
{"log":"2023-06-22T13:41:40Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:41:40.001259822Z"}
{"log":"2023-06-22T13:42:00Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:42:00.001491406Z"}
{"log":"2023-06-22T13:42:20Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:42:20.001465616Z"}
{"log":"2023-06-22T13:42:40Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:42:40.002184056Z"}
{"log":"2023-06-22T13:43:00Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:43:00.002538033Z"}
{"log":"2023-06-22T13:43:20Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:43:20.00145262Z"}
{"log":"2023-06-22T13:43:30Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:43:30.002048779Z"}
{"log":"2023-06-22T13:44:00Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:44:00.000762417Z"}
{"log":"2023-06-22T13:44:20Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:44:20.002226842Z"}
{"log":"2023-06-22T13:44:30Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:44:30.00270687Z"}
{"log":"2023-06-22T13:44:50Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:44:50.002196945Z"}
{"log":"2023-06-22T13:45:00Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:45:00.002357333Z"}
{"log":"2023-06-22T13:45:20Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:45:20.001726678Z"}
{"log":"2023-06-22T13:45:30Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:45:30.001788342Z"}
{"log":"2023-06-22T13:45:40Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:45:40.003145886Z"}
{"log":"2023-06-22T13:45:45Z E! [inputs.smart] Error in plugin: failed to run command '/usr/sbin/smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sde': command timed out - smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107+truenas] (local build)\n","stream":"stderr","time":"2023-06-22T13:45:45.407629997Z"}
{"log":"Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org\n","stream":"stderr","time":"2023-06-22T13:45:45.407793733Z"}
{"log":"\n","stream":"stderr","time":"2023-06-22T13:45:45.407845784Z"}
{"log":"=== START OF INFORMATION SECTION ===\n","stream":"stderr","time":"2023-06-22T13:45:45.407884522Z"}
{"log":"Model Family: Hitachi Ultrastar A7K1000\n","stream":"stderr","time":"2023-06-22T13:45:45.407923719Z"}
{"log":"Device Model: HUA721010KLA330\n","stream":"stderr","time":"2023-06-22T13:45:45.40798974Z"}
{"log":"Serial Number: PBHN30MF\n","stream":"stderr","time":"2023-06-22T13:45:45.408029977Z"}
{"log":"Firmware Version: GKAOA94A\n","stream":"stderr","time":"2023-06-22T13:45:45.408066409Z"}
{"log":"User Capacity: 1,000,204,886,016 bytes [1.00 TB]\n","stream":"stderr","time":"2023-06-22T13:45:45.408103005Z"}
{"log":"Sector Size: 512 bytes logical/physical\n","stream":"stderr","time":"2023-06-22T13:45:45.40816398Z"}
{"log":"Device is: In smartctl database [for details use: -P show]\n","stream":"stderr","time":"2023-06-22T13:45:45.408202396Z"}
{"log":"ATA Version is: ATA/ATAPI-7 T13/1532D revision 1\n","stream":"stderr","time":"2023-06-22T13:45:45.408239955Z"}
{"log":"Local Time is: Thu Jun 22 13:45:30 2023 UTC\n","stream":"stderr","time":"2023-06-22T13:45:45.408306169Z"}
{"log":"SMART support is: Available - device has SMART capability.\n","stream":"stderr","time":"2023-06-22T13:45:45.408348679Z"}
{"log":"SMART support is: Enabled\n","stream":"stderr","time":"2023-06-22T13:45:45.408410261Z"}
{"log":"Power mode is: ACTIVE or IDLE\n","stream":"stderr","time":"2023-06-22T13:45:45.40844837Z"}
{"log":"\n","stream":"stderr","time":"2023-06-22T13:45:45.408485159Z"}
{"log":"=== START OF READ SMART DATA SECTION ===\n","stream":"stderr","time":"2023-06-22T13:45:45.408521072Z"}
{"log":"SMART overall-health self-assessment test result: PASSED\n","stream":"stderr","time":"2023-06-22T13:45:45.40855751Z"}
{"log":"\n","stream":"stderr","time":"2023-06-22T13:45:45.40861843Z"}
{"log":"SMART Attributes Data Structure revision number: 16\n","stream":"stderr","time":"2023-06-22T13:45:45.408655693Z"}
{"log":"Vendor Specific SMART Attributes with Thresholds:\n","stream":"stderr","time":"2023-06-22T13:45:45.408692487Z"}
{"log":"ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE\n","stream":"stderr","time":"2023-06-22T13:45:45.408728236Z"}
{"log":" 1 Raw_Read_Error_Rate PO-R-- 096 096 016 - 458753\n","stream":"stderr","time":"2023-06-22T13:45:45.408796944Z"}
{"log":" 2 Throughput_Performance P-S--- 100 100 054 - 0\n","stream":"stderr","time":"2023-06-22T13:45:45.408845777Z"}
{"log":" 3 Spin_Up_Time POS--- 106 106 024 - 682 (Average 631)\n","stream":"stderr","time":"2023-06-22T13:45:45.408889655Z"}
{"log":" 4 Start_Stop_Count -O--C- 099 099 000 - 5446\n","stream":"stderr","time":"2023-06-22T13:45:45.408927984Z"}
{"log":" 5 Reallocated_Sector_Ct PO--CK 100 100 005 - 0\n","stream":"stderr","time":"2023-06-22T13:45:45.408974847Z"}
{"log":" 7 Seek_Error_Rate PO-R-- 100 100 067 - 0\n","stream":"stderr","time":"2023-06-22T13:45:45.4090026Z"}
{"log":" 8 Seek_Time_Performance P-S--- 128 128 020 - 35\n","stream":"stderr","time":"2023-06-22T13:45:45.409028283Z"}
{"log":" 9 Power_On_Hours -O--C- 089 089 000 - 83048\n","stream":"stderr","time":"2023-06-22T13:45:45.409053463Z"}
{"log":" 10 Spin_Retry_Count PO--C- 100 100 060 - 0\n","stream":"stderr","time":"2023-06-22T13:45:45.409097619Z"}
{"log":" 12 Power_Cycle_Count -O--CK 100 100 000 - 371\n","stream":"stderr","time":"2023-06-22T13:45:45.409124766Z"}
{"log":"192 Power-Off_Retract_Count -O--CK 096 096 000 - 5921\n","stream":"stderr","time":"2023-06-22T13:45:45.409151087Z"}
{"log":"193 Load_Cycle_Count -O--C- 096 096 000 - 5921\n","stream":"stderr","time":"2023-06-22T13:45:45.409194117Z"}
{"log":"194 Temperature_Celsius -O---- 125 125 000 - 48 (Min/Max 17/56)\n","stream":"stderr","time":"2023-06-22T13:45:45.409223747Z"}
{"log":"196 Reallocated_Event_Count -O--CK 100 100 000 - 0\n","stream":"stderr","time":"2023-06-22T13:45:45.40924959Z"}
{"log":"197 Current_Pending_Sector -O---K 100 100 000 - 0\n","stream":"stderr","time":"2023-06-22T13:45:45.409275092Z"}
{"log":"198 Offline_Uncorrectable ---R-- 100 100 000 - 0\n","stream":"stderr","time":"2023-06-22T13:45:45.409316068Z"}
{"log":"199 UDMA_CRC_Error_Count -O-R-- 200 200 000 - 0\n","stream":"stderr","time":"2023-06-22T13:45:45.409342986Z"}
{"log":" ||||||_ K auto-keep\n","stream":"stderr","time":"2023-06-22T13:45:45.409368043Z"}
{"log":" |||||__ C event count\n","stream":"stderr","time":"2023-06-22T13:45:45.409393313Z"}
{"log":" ||||___ R error rate\n","stream":"stderr","time":"2023-06-22T13:45:45.409433937Z"}
{"log":" |||____ S speed/performance\n","stream":"stderr","time":"2023-06-22T13:45:45.409460798Z"}
{"log":" ||_____ O updated online\n","stream":"stderr","time":"2023-06-22T13:45:45.40948593Z"}
{"log":" |______ P prefailure warning\n","stream":"stderr","time":"2023-06-22T13:45:45.409511212Z"}
{"log":"\n","stream":"stderr","time":"2023-06-22T13:45:45.40955212Z"}
{"log":"2023-06-22T13:46:00Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:46:00.001905643Z"}
{"log":"2023-06-22T13:46:30Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:46:30.001187426Z"}
{"log":"2023-06-22T13:46:40Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:46:40.001358571Z"}
{"log":"2023-06-22T13:46:50Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:46:50.001348316Z"}
{"log":"2023-06-22T13:47:10Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:47:10.00239994Z"}
{"log":"2023-06-22T13:47:20Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:47:20.001610232Z"}
{"log":"2023-06-22T13:47:30Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:47:30.001640065Z"}
{"log":"2023-06-22T13:47:40Z W! [inputs.smart] Collection took longer than expected; not complete after interval of 10s\n","stream":"stderr","time":"2023-06-22T13:47:40.002226715Z"}
{"log":"2023-06-22T13:47:44Z E! [inputs.smart] Error in plugin: failed to run command '/usr/sbin/smartctl --info --health --attributes --tolerance=verypermissive -n standby --format=brief /dev/sde': command timed out - smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.107+truenas] (local build)\n","stream":"stderr","time":"2023-06-22T13:47:44.552659265Z"}
{"log":"Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org\n","stream":"stderr","time":"2023-06-22T13:47:44.552760701Z"}
{"log":"\n","stream":"stderr","time":"2023-06-22T13:47:44.552795995Z"}
So an error on an input will usually kill Telegraf; this is probably why you are not seeing metrics. It looks like the smart command took quite a while to complete successfully.
I'd try running the command locally and/or in the container and see how long it takes. It could be a sign that the drive is failing and smartctl doesn't return quickly. There is also a timeout config option in inputs.smart that you could use to increase the timeout value.
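A minimal sketch of raising that timeout ("60s" is just an example value, not a recommendation; pick something longer than the slow drive's response time):

```toml
[[inputs.smart]]
  ## Allow slow drives more time before the collection is abandoned
  timeout = "60s"
```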
Okay, something seems strange. I'm monitoring now; either it's fixed and the timezone is off (it's displaying no metrics within the last hour), or something else is incorrect. In any case, this is the debug log since launching the new container.
As per your last response, I will look into that if this stops again. The drives “shouldn’t” be failing, as they’re even still under return policy, I believe, haha. But stranger things have happened.
It may also just have been a "gunking up" of files from attempting to spin this up 50 times without it doing a proper job of pruning itself.
2023-06-28 14:58:50.895261+00:00 invoke-rc.d: could not determine current runlevel
2023-06-28 14:58:50.900369+00:00 invoke-rc.d: policy-rc.d denied execution of start.
2023-06-28 14:58:51.090283+00:00 Setting up nvme-cli (1.12-5) ...
2023-06-28 14:58:52.088863+00:00 2023-06-28T14:58:52Z I! Loading config: /etc/telegraf/telegraf.conf
2023-06-28 14:58:52.090544+00:00 2023-06-28T14:58:52Z I! Starting Telegraf 1.27.1
2023-06-28 14:58:52.090573+00:00 2023-06-28T14:58:52Z I! Available plugins: 237 inputs, 9 aggregators, 28 processors, 23 parsers, 59 outputs, 4 secret-stores
2023-06-28 14:58:52.090585+00:00 2023-06-28T14:58:52Z I! Loaded inputs: cpu diskio execd kernel mem net sensors smart swap system zfs
2023-06-28 14:58:52.090608+00:00 2023-06-28T14:58:52Z I! Loaded aggregators:
2023-06-28 14:58:52.090616+00:00 2023-06-28T14:58:52Z I! Loaded processors:
2023-06-28 14:58:52.090622+00:00 2023-06-28T14:58:52Z I! Loaded secretstores:
2023-06-28 14:58:52.090627+00:00 2023-06-28T14:58:52Z I! Loaded outputs: influxdb_v2
2023-06-28 14:58:52.090637+00:00 2023-06-28T14:58:52Z I! Tags enabled: host=Agrippa
2023-06-28 14:58:52.090644+00:00 2023-06-28T14:58:52Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"Agrippa", Flush Interval:10s
2023-06-28 14:58:52.090649+00:00 2023-06-28T14:58:52Z D! [agent] Initializing plugins
2023-06-28 14:58:52.090864+00:00 2023-06-28T14:58:52Z D! [agent] Connecting outputs
2023-06-28 14:58:52.090892+00:00 2023-06-28T14:58:52Z D! [agent] Attempting connection to [outputs.influxdb_v2]
2023-06-28 14:58:52.091033+00:00 2023-06-28T14:58:52Z D! [agent] Successfully connected to outputs.influxdb_v2
2023-06-28 14:58:52.091050+00:00 2023-06-28T14:58:52Z D! [agent] Starting service inputs
2023-06-28 14:58:52.091117+00:00 2023-06-28T14:58:52Z I! [inputs.execd] Starting process: /mnt/zfs_libs/zpool_influxdb [--execd]
2023-06-28 14:58:52.107842+00:00 2023-06-28T14:58:52Z E! [inputs.execd] stderr: "crypto/fips/fips_post.c:235: FIPS self test failure"
2023-06-28 14:59:00.021804+00:00 2023-06-28T14:59:00Z D! [inputs.system] Reading users: open /hostfs/var/run/utmp: no such file or directory
2023-06-28 14:59:02.184536+00:00 2023-06-28T14:59:02Z D! [outputs.influxdb_v2] Wrote batch of 522 metrics in 92.237818ms
2023-06-28 14:59:02.184641+00:00 2023-06-28T14:59:02Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2023-06-28 14:59:10.025744+00:00 2023-06-28T14:59:10Z D! [inputs.system] Reading users: open /hostfs/var/run/utmp: no such file or directory
2023-06-28 14:59:12.194332+00:00 2023-06-28T14:59:12Z D! [outputs.influxdb_v2] Wrote batch of 531 metrics in 101.08907ms
2023-06-28 14:59:12.194482+00:00 2023-06-28T14:59:12Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2023-06-28 14:59:20.017720+00:00 2023-06-28T14:59:20Z D! [inputs.system] Reading users: open /hostfs/var/run/utmp: no such file or directory
2023-06-28 14:59:22.439526+00:00 2023-06-28T14:59:22Z D! [outputs.influxdb_v2] Wrote batch of 531 metrics in 345.294282ms
2023-06-28 14:59:22.439832+00:00 2023-06-28T14:59:22Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2023-06-28 14:59:30.017740+00:00 2023-06-28T14:59:30Z D! [inputs.system] Reading users: open /hostfs/var/run/utmp: no such file or directory
2023-06-28 14:59:32.200716+00:00 2023-06-28T14:59:32Z D! [outputs.influxdb_v2] Wrote batch of 531 metrics in 105.465327ms
2023-06-28 14:59:32.200841+00:00 2023-06-28T14:59:32Z D! [outputs.influxdb_v2] Buffer fullness: 0 / 10000 metrics
2023-06-28 14:59:40.022570+00:00 2023-06-28T14:59:40Z D! [inputs.system] Reading users: open /hostfs/var/run/utmp: no such file or directory
So this means metrics are clearly being written successfully. If you want to check the timestamps, you could add the [[outputs.file]] output; it will print the metrics to stdout as well, though it will be a lot of output since you have ~500 at a time.
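A minimal sketch of that output, which can sit alongside the existing influxdb_v2 one:

```toml
[[outputs.file]]
  ## Echo every metric, timestamp included, to the container's stdout
  files = ["stdout"]
  data_format = "influx"
```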
Also, curiously, although it doesn't seem to be causing major issues, I wonder why the users reading is producing the /var/run/utmp error.
I am going to gracefully restart the affected systems/VMs, as it seems there was/is an issue with TrueNAS not connecting to NTP that was causing the offset in InfluxDB timings.
The file itself is used to track logins/logouts, and telegraf attempts to read it to show the count of current users. I don't think that file exists on FreeBSD, which FreeNAS uses, hence the error message. You could verify by seeing whether the file even exists on the host. The system input will still produce metrics, just without the user info.
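A quick way to check, as a sketch: /hostfs/var/run/utmp is the path from the error above (telegraf looks there because of the HOST_VAR/HOST_MOUNT_PREFIX settings), and /var/run/utmp is where it would live on the host itself.

```shell
#!/bin/sh
# Report whether utmp is visible at each path inputs.system might use.
for f in /hostfs/var/run/utmp /var/run/utmp; do
  if [ -e "$f" ]; then
    echo "$f exists"
  else
    echo "$f missing"
  fi
done
```

If the host-side file exists but the container-side one is missing, the mount mapping is the problem rather than the OS.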
500 seemed like a lot to me as well, as my config seems pretty light.
It can add up quickly.
smart will print out a different metric for each smart attribute per device. So that could quickly go up depending on how many drives you have and how many attributes they expose.
The other one that could expose quite a few is sensors.
Zfs is dependent on the number of pools, cpu is number of CPUs + 1, and net could have a metric for every docker bridge device you have.
@jpowers Just to follow up in case anyone else runs into this issue. I apologize for leaving this hanging for a few days; it's been a rough start to the week, haha.
You are not able to create docker-compose files or Dockerfiles and run them in the shell (IIRC), so you have to go through a clumsy manual process in the web GUI.
I do not know if this was the original issue, since I just system-pruned and started over, but I have noticed the failures recur any time there is a major change. For instance, I just moved my hard drives from an i7-7700K system I had lying around into an actual HPE server, and now I'm having failures again, with nothing obvious in the log files prima facie.
When I click on the application, though, I get the following:
2023-07-06 10:07:09 Successfully assigned ix-telegraf/telegraf-ix-chart-6ccfd96db5-plldg to ix-truenas
2023-07-06 10:07:09 Nameserver limits were exceeded, some nameservers have been omitted, the applied nameserver line is: 10.0.0.1 8.8.8.8 8.8.4.4
2023-07-06 10:07:08 Error updating Endpoint Slices for Service ix-telegraf/telegraf-ix-chart: failed to create EndpointSlice for Service ix-telegraf/telegraf-ix-chart: Unauthorized
2023-07-06 10:07:08 Failed to create new replica set "telegraf-ix-chart-6ccfd96db5": Unauthorized
2023-07-06 2:42:09 Scaled down replica set telegraf-ix-chart-65d59c8f98 to 0 from 1
2023-07-06 2:42:09 Deleted pod: telegraf-ix-chart-65d59c8f98-nfhmv
2023-07-06 2:42:09 Stopping container ix-chart
2023-07-06 2:36:45 Started container ix-chart
2023-07-06 2:36:44 Created container ix-chart
2023-07-06 2:36:39 Add eth0 [172.16.0.69/16] from ix-net
2023-07-06 2:36:39 Container image "telegraf:latest" already present on machine
2023-07-06 2:36:37 Scaled up replica set telegraf-ix-chart-65d59c8f98 to 1 from 0
2023-07-06 2:36:37 Created pod: telegraf-ix-chart-65d59c8f98-nfhmv
2023-07-06 2:36:37 Successfully assigned ix-telegraf/telegraf-ix-chart-65d59c8f98-nfhmv to ix-truenas
2023-07-05 23:14:29 MountVolume.SetUp failed for volume "kube-api-access-5kffp" : object "ix-telegraf"/"kube-root-ca.crt" not registered
2023-07-05 23:14:27 MountVolume.SetUp failed for volume "kube-api-access-jz2tq" : object "ix-telegraf"/"kube-root-ca.crt" not registered