Puppetagent telegraf input plugin usefulness?

Is anyone having success monitoring with the puppetagent input plugin for telegraf?

I have been collecting metrics using the plugin for a few months, but I have not found any of the metrics to be useful for identifying when my nodes are failing (in various ways). The #1 thing I want to know is “is puppet applying successfully on this node”, and I can’t seem to figure out how to discern that from these metrics.

I could add an exec bash script that manually greps the log file for “Error”, but if I can’t see failures in the data, what the heck do people use puppetagent for? I think the information must be there and I’m just not seeing it.
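
For reference, the kind of stopgap I have in mind is a small script run from Telegraf’s exec input (data_format = "influx"); the syslog path, grep pattern, and measurement name below are only placeholders for wherever your agent actually logs:

#!/bin/bash
# Stopgap sketch: count puppet-agent lines containing "Error" in the syslog
# and emit the count in InfluxDB line protocol for Telegraf's exec input.
# NOTE: /var/log/syslog and the "puppet_log" measurement name are placeholders.
LOG=/var/log/syslog
errors=$(grep -c 'puppet-agent.*Error' "$LOG" 2>/dev/null)
echo "puppet_log errors=${errors:-0}i"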

I’m not a puppet user, so I’m not sure this will be helpful, but what if you monitor the various failure-related measurements, such as puppetagent_resources_failed or puppetagent_resources_failedtorestart, for non-zero values?

This was my original intention, but unfortunately these metrics do not work as well as it seems they should. There are several ways a node can fail, and these counters only catch some of them.

I have done a number of tests trying to find a reliable way to see which nodes are failing, but I am no less confused about how to make sense of these different metrics.

Regarding those specific metrics:

Over the last 3 months only one node (out of ~20) has reported puppetagent_resources_failedtorestart > 0, and I believe those were from some puppet exec scripts hitting a timeout.

puppetagent_resources_failed seems pretty nice, but it is weirdly inconsistent. It does catch some types of errors, but other times I will ssh into a node with this metric > 0, run puppet agent -t, find no errors, go back to my monitoring dashboard, and find that the node no longer reports an error (very frustrating).

I have at least two nodes failing right now (due to 500 errors, not resource failures) and, I think, a few others which have not applied my latest changes to my puppet config (probably also failing), but I can’t see any of this in the data.

You could check the /var/lib/puppet/state/last_run_summary.yaml file, which is the source of this data. Maybe it is not being updated properly; perhaps check its modification time?
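
For example (the path may differ depending on your Puppet version, and the 35-minute threshold assumes the default run interval of roughly 30 minutes):

# When was the summary last written?
stat -c '%y' /var/lib/puppet/state/last_run_summary.yaml

# Print the filename only if it has not been modified in the last 35 minutes,
# i.e. the agent appears to have stopped checking in:
find /var/lib/puppet/state/last_run_summary.yaml -mmin +35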

Yes, this is a good suggestion, and I have checked to verify that puppet is indeed running and updating the mtime on last_run_summary.yaml every ~30 minutes.

The metrics I am getting are correct (they match the files), but I am having trouble interpreting them in a meaningful/useful way.

other times I will ssh into a node with this metric > 0, run puppet agent -t, find no errors, go back to my monitoring dash and find that the node no longer reports an error

I wonder if this could be caused by using the wrong aggregation function; what query are you using to display the errors?

Since I am still figuring this out, my basic approach is to look at all the plots of *.-opt-puppetlabs-puppet-cache-state-last_run_summary_yaml.puppetagent.* (Graphite backend).

Currently I use two ways of finding failing nodes:

  1. resources_total == 0
  2. resources_failed > 0

A few months ago I intentionally caused failures (a 500 error) in a node’s manifest. Here is what I found:


After looking through all the variables (except all the times) I found only three unique signals which might be useful; the others were identical to one of these (see below).

It looks like any one of them can show when there is a 500 error, because they all drop to 0.

Groups of series that appear identical in this test:

  1. changes_total, events_success, events_total, resources_changed, resources_outofsync
  2. resources_total
  3. time_total, time_exec

There is a line in last_run_report.yaml that reads status: failed (I am only testing on puppet 4)… That’s the #1 thing I want out of this plugin. :sweat_smile:
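
In the meantime I may just pull that line out myself with a small script run from Telegraf’s exec input (data_format = "influx"). A rough sketch, with a made-up measurement name and assuming telegraf is allowed to read the root-owned report file:

#!/bin/bash
# Sketch: emit 1 if the last puppet run failed, else 0, based on the
# "status:" line of last_run_report.yaml (Puppet 4 cache path on my nodes).
# Meant to be run by Telegraf's exec input with data_format = "influx".
REPORT=/opt/puppetlabs/puppet/cache/state/last_run_report.yaml

if grep -q '^status: failed' "$REPORT" 2>/dev/null; then
  failed=1
else
  failed=0
fi

# "puppet_last_run" is a made-up measurement name.
echo "puppet_last_run status_failed=${failed}i"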

other times I will ssh into a node with this metric > 0, run puppet agent -t, find no errors, go back to my monitoring dash and find that the node no longer reports an error

I wonder if this could be caused by using the wrong aggregation function; what query are you using to display the errors?

I just ran into something like this (but the opposite of my original description: the node reports OK when it is actually failing after puppet agent -t). Here is an example of what I am talking about:

I wonder if running the puppet command changes the file. Maybe try to cat /var/lib/puppet/state/last_run_summary.yaml before and after you start debugging to see if it is modified.
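
Something along these lines, for example (adjust the path if your nodes use the Puppet 4 cache location shown elsewhere in this thread):

# Save a copy of the summary before the manual run, then compare afterwards:
sudo cp /var/lib/puppet/state/last_run_summary.yaml /tmp/last_run_summary.before
sudo puppet agent -t
sudo diff /tmp/last_run_summary.before /var/lib/puppet/state/last_run_summary.yaml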

Good suggestion. I saved a version before running, but didn’t think to get one within that window. I will next time! I suspect the file does change, but maybe there is something to be learned from looking at how the file differs.

I think some of these troubles may be because puppet saves an incomplete summary when a failure is encountered during a manual puppet agent -t run:

user@server:~$ sudo /opt/puppetlabs/bin/puppet agent -t
Info: Using configured environment 'test'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Function Call, Lookup of key 'role' failed: Unable to parse (/etc/puppetlabs/code/environments/test/data/nodes/server.domain.edu.yaml): expected <block end>, but found Scalar while parsing a block mapping at line 14 column 13 at /etc/puppetlabs/code/environments/test/manifests/site.pp:1:9 on node server.domain.edu
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run


user@server:~$ sudo cat /opt/puppetlabs/puppet/cache/state/last_run_summary.yaml 
---
version:
  config: 
  puppet: 4.10.1
time:
  last_run: 1508862664

Here’s last_run_report if you’re interested as well:

user@server:~$ sudo cat /opt/puppetlabs/puppet/cache/state/last_run_report.yaml 
--- !ruby/object:Puppet::Transaction::Report
metrics: {}
logs:
- !ruby/object:Puppet::Util::Log
  level: :info
  message: Using configured environment 'test'
  source: Puppet
  tags:
  - info
  time: '2017-10-24T12:30:39.514599387-04:00'
  file: 
  line: 
- !ruby/object:Puppet::Util::Log
  level: :info
  message: Retrieving pluginfacts
  source: Puppet
  tags:
  - info
  time: '2017-10-24T12:30:39.515819917-04:00'
  file: 
  line: 
- !ruby/object:Puppet::Util::Log
  level: :info
  message: Retrieving plugin
  source: Puppet
  tags:
  - info
  time: '2017-10-24T12:30:39.666972319-04:00'
  file: 
  line: 
- !ruby/object:Puppet::Util::Log
  level: :info
  message: Loading facts
  source: Puppet
  tags:
  - info
  time: '2017-10-24T12:30:41.396969874-04:00'
  file: 
  line: 
- !ruby/object:Puppet::Util::Log
  level: :err
  message: 'Could not retrieve catalog from remote server: Error 500 on SERVER: Server
    Error: Evaluation Error: Error while evaluating a Function Call, Lookup of key
    ''role'' failed: Unable to parse (/etc/puppetlabs/code/environments/test/data/nodes/server.domain.edu.yaml):
    expected <block end>, but found Scalar while parsing a block mapping at line 14
    column 13 at /etc/puppetlabs/code/environments/test/manifests/site.pp:1:9 on node
    server.domain.edu'
  source: Puppet
  tags:
  - err
  time: '2017-10-24T12:31:04.812310783-04:00'
  file: 
  line: 
- !ruby/object:Puppet::Util::Log
  level: :warning
  message: Not using cache on failed catalog
  source: Puppet
  tags:
  - warning
  time: '2017-10-24T12:31:04.813330686-04:00'
  file: 
  line: 
- !ruby/object:Puppet::Util::Log
  level: :err
  message: Could not retrieve catalog; skipping run
  source: Puppet
  tags:
  - err
  time: '2017-10-24T12:31:04.813772737-04:00'
  file: 
  line: 
resource_statuses: {}
host: server.domain.edu
time: 2017-10-24 12:30:39.020147395 -04:00
kind: apply
report_format: 6
puppet_version: 4.10.1
configuration_version: 
transaction_uuid: eb1cab67-e1f4-4ba6-9413-603af8457949
code_id: 
catalog_uuid: 
cached_catalog_status: not_used
master_used: 
environment: test
status: failed
noop: false
noop_pending: false
corrective_change: false

When this happens, I guess telegraf keeps reporting the last value? I would expect NAs or 0s, but that’s not what I am seeing:

NOTE: I think resources_restarted == 1 because I fixed something in the previous run; I don’t think it is useful for detecting a failure.

Aha! I’m not losing my mind: the historical values from that plot above just changed once the automated puppet run kicked in:

Note also that I have not fixed the issue (or changed anything on the node or puppetmaster). The issue reveals itself only when I use puppet agent -t, and even then I cannot see it in the data until after the next automated puppet run kicks in.

Looking at the logs below, I can see that a lot of this weirdness is not telegraf’s fault. Still, there must be some workable solution…

user@server:~$ sudo cat /opt/puppetlabs/puppet/cache/state/last_run_summary.yaml 
---
version:
  config: 1508860865
  puppet: 4.10.1
resources:
  changed: 2
  corrective_change: 0
  failed: 0
  failed_to_restart: 0
  out_of_sync: 2
  restarted: 0
  scheduled: 0
  skipped: 0
  total: 121
time:
  alternative_entry: 0.034723084
  anchor: 0.0008198349999999999
  apt_key: 0.000552145
  augeas: 0.022542928
  concat_file: 0.000274396
  concat_fragment: 0.000931884
  config_retrieval: 1.651657324
  cron: 0.000620604
  exec: 0.07546974199999999
  file: 0.17293390200000003
  file_line: 0.000388122
  filebucket: 0.000131051
  firewall: 0.001397966
  host: 0.00665172
  mount: 0.003049649
  notify: 0.007604843
  package: 0.126386366
  schedule: 0.000618926
  service: 0.250942501
  total: 6.210052758000001
  vcsrepo: 3.85235577
  last_run: 1508864281
changes:
  total: 2
events:
  failure: 0
  success: 2
  total: 2
user@server:~$ sudo tail -20 /opt/puppetlabs/puppet/cache/state/last_run_report.yaml 
    change_count: 0
    out_of_sync_count: 0
    events: []
    corrective_change: false
host: server.domain.edu
time: 2017-10-24 12:57:27.487512493 -04:00
kind: apply
report_format: 6
puppet_version: 4.10.1
configuration_version: 1508860865
transaction_uuid: 8b6ea478-ad48-4d82-be14-30de06783edf
code_id: 
catalog_uuid: 84a3414e-ebe4-44fa-b934-0c08675550da
cached_catalog_status: on_failure
master_used: 
environment: test
status: changed
noop: false
noop_pending: false
corrective_change: false
user@server:~$ sudo grep -A 15 "level: :err" /opt/puppetlabs/puppet/cache/state/last_run_report.yaml
  level: :err
  message: 'Could not retrieve catalog from remote server: Error 500 on SERVER: Server
    Error: Evaluation Error: Error while evaluating a Function Call, Lookup of key
    ''role'' failed: Unable to parse (/etc/puppetlabs/code/environments/test/data/nodes/server.domain.edu.yaml):
    expected <block end>, but found Scalar while parsing a block mapping at line 14
    column 13 at /etc/puppetlabs/code/environments/test/manifests/site.pp:1:9 on node
    server.domain.edu'
  source: Puppet
  tags:
  - err
  time: '2017-10-24T12:57:53.130125589-04:00'
  file: 
  line:

The earlier yaml file you sent that starts with !ruby/object:Puppet::Transaction::Report is not at all what we are expecting; are there any errors in the telegraf logfile? If the yaml file cannot be read or parsed, then no metrics are created and an error is logged.

I believe that Grafana shows a flat line until another data point comes in, but I can’t explain why a new point appeared at 16:33. All points are created at the current time, not using data from the yaml file, so a point at this timestamp shouldn’t appear at a later time unless there was some sort of communication error.

Good question… uhhh…

user@server:~$ ls /var/log/telegraf/ /opt/telegraf/ /var/opt/telegraf/ /etc/telegraf/
ls: cannot access '/opt/telegraf/': No such file or directory
ls: cannot access '/var/opt/telegraf/': No such file or directory
/etc/telegraf/:
telegraf.conf  telegraf.d



user@server:~$ sudo journalctl -u telegraf
-- Logs begin at Tue 2017-10-24 10:57:08 EDT, end at Tue 2017-10-24 14:51:56 EDT. --
Oct 24 10:57:18 server systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Oct 24 10:57:24 server telegraf[1218]: 2017-10-24T14:57:24Z I! Starting Telegraf (version 1.3.0)
Oct 24 10:57:24 server telegraf[1218]: 2017-10-24T14:57:24Z I! Loaded outputs: graphite
Oct 24 10:57:24 server telegraf[1218]: 2017-10-24T14:57:24Z I! Loaded inputs: inputs.diskio inputs.mem inputs.ping inputs.swap inputs
Oct 24 10:57:24 server telegraf[1218]: 2017-10-24T14:57:24Z I! Tags enabled: host=server
Oct 24 10:57:24 server telegraf[1218]: 2017-10-24T14:57:24Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"server", Flush Inte
Oct 24 11:38:37 server telegraf[1218]: 2017-10-24T15:38:37Z I! Reloading Telegraf config
Oct 24 11:38:37 server telegraf[1218]: 2017-10-24T15:38:37Z I! Hang on, flushing any cached metrics before shutdown
Oct 24 11:38:37 server telegraf[1218]: 2017-10-24T15:38:37Z I! Starting Telegraf (version 1.3.0)
Oct 24 11:38:37 server telegraf[1218]: 2017-10-24T15:38:37Z I! Loaded outputs: graphite
Oct 24 11:38:37 server telegraf[1218]: 2017-10-24T15:38:37Z I! Loaded inputs: inputs.exec inputs.processes inputs.puppetagent inputs.
Oct 24 11:38:37 server telegraf[1218]: 2017-10-24T15:38:37Z I! Tags enabled: host=server
Oct 24 11:38:37 server telegraf[1218]: 2017-10-24T15:38:37Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"server", Flush Inte
Oct 24 11:57:58 server telegraf[1218]: 2017-10-24T15:57:58Z I! Reloading Telegraf config
Oct 24 11:57:58 server telegraf[1218]: 2017-10-24T15:57:58Z I! Hang on, flushing any cached metrics before shutdown
Oct 24 11:57:58 server telegraf[1218]: 2017-10-24T15:57:58Z I! Starting Telegraf (version 1.3.0)
Oct 24 11:57:58 server telegraf[1218]: 2017-10-24T15:57:58Z I! Loaded outputs: graphite
Oct 24 11:57:58 server telegraf[1218]: 2017-10-24T15:57:58Z I! Loaded inputs: inputs.mem inputs.net inputs.ping inputs.processes inpu
Oct 24 11:57:58 server telegraf[1218]: 2017-10-24T15:57:58Z I! Tags enabled: host=server
Oct 24 11:57:58 server telegraf[1218]: 2017-10-24T15:57:58Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"server", Flush Inte

I don’t see anything relevant, but maybe I’m not looking in the right place…

It should be there in the output of journalctl -u telegraf; just make sure you are looking over the time range when the error occurred.
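
For example, to narrow it down to the window around one of the failed runs (the timestamps here are only placeholders):

sudo journalctl -u telegraf --since "2017-10-24 12:25" --until "2017-10-24 13:05"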

I tried this again on another node and found no errors in the telegraf log. The telegraf plugin could probably handle this better, but running puppet agent with the --test option changes the logging, so in some ways it makes sense that the plugin might behave weirdly. I think I got off on a tangent there; the real issue is being able to consistently detect node failures from these metrics.

I will report back here next time I have some specific data showing how my current setup failed in this regard.