I have been collecting metrics using the plugin for a few months, but I have not found any of the metrics to be useful for identifying when my nodes are failing (in various ways). The #1 thing I want to know is “is puppet applying successfully on this node”, and I can’t seem to figure out how to discern that from these metrics.
I could add an exec bash script to manually grep the log file for “Error”, but if I can’t see this in the data, what the heck do people use puppetagent for? I think the information must be there and I’m just not seeing it.
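Roughly, the workaround I have in mind is something like this (just a sketch; the log path and the measurement name are placeholders for my setup, and the script would hang off an inputs.exec entry emitting influx line protocol):
#!/usr/bin/env bash
# Count "Error" lines in the puppet agent log and emit them as a metric.
# /var/log/puppet/puppet.log is a placeholder; the agent may log to syslog instead.
LOG=/var/log/puppet/puppet.log
errors=$(grep -c 'Error' "$LOG" 2>/dev/null)
echo "puppet_log_errors value=${errors:-0}i"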
I’m not a puppet user, so I’m not sure this will be helpful, but what if you monitor the various failure-related measurements, such as puppetagent_resources_failed or puppetagent_resources_failedtorestart, for non-zero values?
This was my original intention, but unfortunately these metrics do not work as well as one would expect. There are several ways a node can fail.
I have done a number of tests trying to find a good way to see which nodes are failing, but I am no less confused about how to make sense of these different metrics.
Regarding those specific metrics:
Over the last 3 months, only one node (out of ~20) has reported puppetagent_resources_failedtorestart > 0, and I believe those failures came from some puppet exec scripts hitting a timeout.
puppetagent_resources_failed seems pretty nice, but it is weirdly inconsistent. It does catch some types of errors, but other times I will ssh into a node with this metric > 0, run puppet agent -t, find no errors, go back to my monitoring dash and find that the node no longer reports an error (very frustrating).
I have at least two nodes failing right now (due to 500 errors, not resource failures), and I think a few others have not applied my latest changes to my puppet config (so they are probably also failing), but I can’t see any of this in the data.
You could check the /var/lib/puppet/state/last_run_summary.yaml file, which is the source of this data. Maybe it is not being updated properly; perhaps check its modification time?
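For example (using the path above; on Puppet 4 it may live under /opt/puppetlabs/puppet/cache/state/ instead):
# print the file's last modification time
stat -c '%y' /var/lib/puppet/state/last_run_summary.yaml
# or print the path only if it has not been modified in the last hour
find /var/lib/puppet/state/last_run_summary.yaml -mmin +60 -print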
other times I will ssh into a node with this metric > 0, run puppet agent -t, find no errors, go back to my monitoring dash and find that the node no longer reports an error
I wonder if this could be caused by using the wrong aggregation function. What query are you using to display the errors?
Since I am still figuring this out, my basic approach is to look at all the plots of *.-opt-puppetlabs-puppet-cache-state-last_run_summary_yaml.puppetagent.* (graphite backend).
Currently I use two ways of finding failing nodes (see the sketch after this list):
resources_total == 0
resources_failed > 0
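For reference, here is roughly the same pair of checks expressed as a local script against the summary file instead of against the graphite series (only a sketch for my Puppet 4 layout; the awk parsing is deliberately crude):
#!/usr/bin/env bash
# Apply the two checks above to the node's own last_run_summary.yaml:
#   1) resources.total == 0   (the run never really happened)
#   2) resources.failed > 0   (resources failed)
SUMMARY=/opt/puppetlabs/puppet/cache/state/last_run_summary.yaml
# pull "failed" and "total" from the resources: section only
failed=$(awk '/^resources:/{r=1; next} /^[^ ]/{r=0} r && $1=="failed:"{print $2; exit}' "$SUMMARY")
total=$(awk '/^resources:/{r=1; next} /^[^ ]/{r=0} r && $1=="total:"{print $2; exit}' "$SUMMARY")
# when the catalog never compiles, the resources section is missing entirely,
# so the empty-value defaults below are also treated as a failure
if [ "${total:-0}" -eq 0 ] || [ "${failed:-0}" -gt 0 ]; then
  echo "FAILING (total=${total:-0}, failed=${failed:-0})"
  exit 1
fi
echo "OK (total=$total, failed=$failed)"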
A few months ago I intentionally caused failures (a 500 error) in a node’s manifest. Below are the results:
After looking through all the variables (except for the various time metrics) I found only three unique signals which might be useful (the others were identical to these; see below).
other times I will ssh into a node with this metric > 0, run puppet agent -t, find no errors, go back to my monitoring dash and find that the node no longer reports an error
I wonder if this could be caused by using the wrong aggregation function. What query are you using to display the errors?
I just ran into something like this (but the opposite of my original description: the node reports OK when it is actually failing after puppet agent -t). Here is an example of what I am talking about:
I wonder if running the puppet command changes the file. Maybe try to cat /var/lib/puppet/state/last_run_summary.yaml before and after you start debugging to see if it is modified.
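Something along these lines, for example (same path caveat as before; adjust for your install):
# snapshot the summary, run the agent by hand, then see what changed
sudo cp /var/lib/puppet/state/last_run_summary.yaml /tmp/last_run_summary.before.yaml
sudo puppet agent -t
sudo diff -u /tmp/last_run_summary.before.yaml /var/lib/puppet/state/last_run_summary.yaml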
Good suggestion. I saved a version before running, but didn’t think to get one within that window. I will next time! I suspect the file does change, but maybe there is something to be learned from looking at how the file differs.
I think some of these troubles may be because puppet saves an incomplete summary when a failure is encountered during a manual run with puppet agent -t:
user@server:~$ sudo /opt/puppetlabs/bin/puppet agent -t
Info: Using configured environment 'test'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Lookup of key 'role' failed: Unable to parse (/etc/puppetlabs/code/environments/test/data/nodes/server.domain.edu.yaml): expected <block end>, but found Scalar while parsing a block mapping at line 14 column 13 at /etc/puppetlabs/code/environments/test/manifests/site.pp:1:9 on node server.domain.edu
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
user@server:~$ sudo cat /opt/puppetlabs/puppet/cache/state/last_run_summary.yaml
---
version:
  config:
  puppet: 4.10.1
time:
  last_run: 1508862664
Here’s last_run_report if you’re interested as well:
user@server:~$ sudo cat /opt/puppetlabs/puppet/cache/state/last_run_report.yaml
--- !ruby/object:Puppet::Transaction::Report
metrics: {}
logs:
- !ruby/object:Puppet::Util::Log
  level: :info
  message: Using configured environment 'test'
  source: Puppet
  tags:
  - info
  time: '2017-10-24T12:30:39.514599387-04:00'
  file:
  line:
- !ruby/object:Puppet::Util::Log
  level: :info
  message: Retrieving pluginfacts
  source: Puppet
  tags:
  - info
  time: '2017-10-24T12:30:39.515819917-04:00'
  file:
  line:
- !ruby/object:Puppet::Util::Log
  level: :info
  message: Retrieving plugin
  source: Puppet
  tags:
  - info
  time: '2017-10-24T12:30:39.666972319-04:00'
  file:
  line:
- !ruby/object:Puppet::Util::Log
  level: :info
  message: Loading facts
  source: Puppet
  tags:
  - info
  time: '2017-10-24T12:30:41.396969874-04:00'
  file:
  line:
- !ruby/object:Puppet::Util::Log
  level: :err
  message: 'Could not retrieve catalog from remote server: Error 500 on SERVER: Server
    Error: Evaluation Error: Error while evaluating a Function Call, Lookup of key
    ''role'' failed: Unable to parse (/etc/puppetlabs/code/environments/test/data/nodes/server.domain.edu.yaml):
    expected <block end>, but found Scalar while parsing a block mapping at line 14
    column 13 at /etc/puppetlabs/code/environments/test/manifests/site.pp:1:9 on node
    server.domain.edu'
  source: Puppet
  tags:
  - err
  time: '2017-10-24T12:31:04.812310783-04:00'
  file:
  line:
- !ruby/object:Puppet::Util::Log
  level: :warning
  message: Not using cache on failed catalog
  source: Puppet
  tags:
  - warning
  time: '2017-10-24T12:31:04.813330686-04:00'
  file:
  line:
- !ruby/object:Puppet::Util::Log
  level: :err
  message: Could not retrieve catalog; skipping run
  source: Puppet
  tags:
  - err
  time: '2017-10-24T12:31:04.813772737-04:00'
  file:
  line:
resource_statuses: {}
host: server.domain.edu
time: 2017-10-24 12:30:39.020147395 -04:00
kind: apply
report_format: 6
puppet_version: 4.10.1
configuration_version:
transaction_uuid: eb1cab67-e1f4-4ba6-9413-603af8457949
code_id:
catalog_uuid:
cached_catalog_status: not_used
master_used:
environment: test
status: failed
noop: false
noop_pending: false
corrective_change: false
When this happens, I guess telegraf keeps reporting the last value? I would expect NAs or 0s, but that’s not what I am seeing:
Note also that I have not fixed the issue (or changed anything on the node or puppetmaster). The issue reveals itself only when I use puppet agent -t, and even then I cannot see it in the data until after the next automated puppet run kicks in.
Looking at the logs below, I can see that a lot of this weirdness is not telegraf’s fault. Still, there must be some workable solution…
user@server:~$ sudo grep -A 15 "level: :err" /opt/puppetlabs/puppet/cache/state/last_run_report.yaml
  level: :err
  message: 'Could not retrieve catalog from remote server: Error 500 on SERVER: Server
    Error: Evaluation Error: Error while evaluating a Function Call, Lookup of key
    ''role'' failed: Unable to parse (/etc/puppetlabs/code/environments/test/data/nodes/server.domain.edu.yaml):
    expected <block end>, but found Scalar while parsing a block mapping at line 14
    column 13 at /etc/puppetlabs/code/environments/test/manifests/site.pp:1:9 on node
    server.domain.edu'
  source: Puppet
  tags:
  - err
  time: '2017-10-24T12:57:53.130125589-04:00'
  file:
  line:
The earlier yaml file you sent, the one that starts with !ruby/object:Puppet::Transaction::Report, is not at all what we are expecting. Are there any errors in the telegraf logfile? If the yaml file cannot be read or parsed then no metrics are created and an error is logged.
I believe that Grafana shows a flat line until another data point comes in, but I can’t explain why a new point appeared at 16:33. All points are created at the current time, not using data from the yaml file, so a point at this timestamp shouldn’t appear at a later time unless there was some sort of communication error.
user@server:~$ ls /var/log/telegraf/ /opt/telegraf/ /var/opt/telegraf/ /etc/telegraf/
ls: cannot access '/opt/telegraf/': No such file or directory
ls: cannot access '/var/opt/telegraf/': No such file or directory
/etc/telegraf/:
telegraf.conf telegraf.d
user@server:~$ sudo journalctl -u telegraf
-- Logs begin at Tue 2017-10-24 10:57:08 EDT, end at Tue 2017-10-24 14:51:56 EDT. --
Oct 24 10:57:18 server systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Oct 24 10:57:24 server telegraf[1218]: 2017-10-24T14:57:24Z I! Starting Telegraf (version 1.3.0)
Oct 24 10:57:24 server telegraf[1218]: 2017-10-24T14:57:24Z I! Loaded outputs: graphite
Oct 24 10:57:24 server telegraf[1218]: 2017-10-24T14:57:24Z I! Loaded inputs: inputs.diskio inputs.mem inputs.ping inputs.swap inputs
Oct 24 10:57:24 server telegraf[1218]: 2017-10-24T14:57:24Z I! Tags enabled: host=server
Oct 24 10:57:24 server telegraf[1218]: 2017-10-24T14:57:24Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"server", Flush Inte
Oct 24 11:38:37 server telegraf[1218]: 2017-10-24T15:38:37Z I! Reloading Telegraf config
Oct 24 11:38:37 server telegraf[1218]: 2017-10-24T15:38:37Z I! Hang on, flushing any cached metrics before shutdown
Oct 24 11:38:37 server telegraf[1218]: 2017-10-24T15:38:37Z I! Starting Telegraf (version 1.3.0)
Oct 24 11:38:37 server telegraf[1218]: 2017-10-24T15:38:37Z I! Loaded outputs: graphite
Oct 24 11:38:37 server telegraf[1218]: 2017-10-24T15:38:37Z I! Loaded inputs: inputs.exec inputs.processes inputs.puppetagent inputs.
Oct 24 11:38:37 server telegraf[1218]: 2017-10-24T15:38:37Z I! Tags enabled: host=server
Oct 24 11:38:37 server telegraf[1218]: 2017-10-24T15:38:37Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"server", Flush Inte
Oct 24 11:57:58 server telegraf[1218]: 2017-10-24T15:57:58Z I! Reloading Telegraf config
Oct 24 11:57:58 server telegraf[1218]: 2017-10-24T15:57:58Z I! Hang on, flushing any cached metrics before shutdown
Oct 24 11:57:58 server telegraf[1218]: 2017-10-24T15:57:58Z I! Starting Telegraf (version 1.3.0)
Oct 24 11:57:58 server telegraf[1218]: 2017-10-24T15:57:58Z I! Loaded outputs: graphite
Oct 24 11:57:58 server telegraf[1218]: 2017-10-24T15:57:58Z I! Loaded inputs: inputs.mem inputs.net inputs.ping inputs.processes inpu
Oct 24 11:57:58 server telegraf[1218]: 2017-10-24T15:57:58Z I! Tags enabled: host=server
Oct 24 11:57:58 server telegraf[1218]: 2017-10-24T15:57:58Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"server", Flush Inte
I don’t see anything relevant, but maybe I’m not looking in the right place…
I tried this again on another node and found no errors in the telegraf log. The telegraf plugin could probably handle this better, but running puppet agent with the --test option changes the logging, so in some ways it makes sense that the plugin might behave oddly. I think I got off on a tangent there; the real issue I have is being able to consistently detect node failures from these metrics.
I will report back here next time I have some specific data showing how my current setup failed in this regard.