Monitoring with Push vs. Pull: InfluxDB Adds Pull Support with Kapacitor

#1

Originally published at: https://www.influxdata.com/monitoring-with-push-vs-pull-influxdb-adds-pull-support-with-kapacitor/

The monitoring community has been debating push vs. pull monitoring for a while now. A year ago I was firmly in the push camp, but recently I’ve come around to appreciating what pull has to offer, and I now believe you need both push and pull. That’s why I’m excited that today we’re releasing a beta of Kapacitor 1.3 that adds support for service discovery and pulling (scraping) of Prometheus-style targets. We’re looking to bring the best of both worlds to monitoring and time series. Read on for some definitions of push and pull, my thoughts on the pros and cons of each, and examples of how to get pull monitoring and collection going with InfluxDB and Kapacitor.

Clarifying Push & Pull

In the pull-based method, the monitoring agent polls the targets being monitored periodically and alerts based on that data. In the push method, telemetry and metrics are pushed to the monitoring agent (or more frequently a time series database) and monitoring is done either through the agent or other processes querying the database. When instrumenting your own application code, there is a choice to be made between push and pull. Either you send metrics out to another service via a client library, or you make them available to others through some network addressable target (like an HTTP API, for example).
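To make that choice concrete, here's a minimal Go sketch of both approaches; the metric name, ports, and database name are made up for illustration. The pull side registers a counter and exposes it over HTTP with the Prometheus client library, and the push side writes a single point to InfluxDB's HTTP write API in line protocol.

package main

import (
	"bytes"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Pull: expose a counter that any scraper can poll via GET /metrics.
var requests = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "myapp_requests_total", // hypothetical metric name
	Help: "Total requests handled.",
})

func main() {
	prometheus.MustRegister(requests)
	http.Handle("/metrics", promhttp.Handler())

	// Push: send one point in InfluxDB line protocol to a local InfluxDB.
	body := bytes.NewBufferString("requests,host=server01 value=1")
	http.Post("http://localhost:8086/write?db=mydb", "text/plain", body)

	http.ListenAndServe(":8080", nil)
}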

Contrary to popular belief, I contend that for services and systems you don’t control or write the code for, all monitoring and metrics collection starts with a pull. For example, we built Telegraf as an agent for collecting metrics and pushing them to InfluxDB. Looks like a push method, right? Sort of. It starts with a pull. Telegraf pulls from the target (e.g. Redis, MySQL), formats the metrics into InfluxDB Line Protocol and sends them off to InfluxDB. Within Telegraf this happens on a regular polling interval set through the configuration.
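As a rough sketch of that flow, a minimal Telegraf configuration pairs an input plugin (the pull) with the InfluxDB output (the push); the Redis address and database name here are just examples:

[agent]
  # Regular polling interval for all inputs.
  interval = "10s"

# Pull: poll a local Redis instance on each interval.
[[inputs.redis]]
  servers = ["tcp://localhost:6379"]

# Push: write the collected metrics to InfluxDB in line protocol.
[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"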

The real strength of formalizing the pull method is that it gives services and applications of all kinds a standard way to expose targets, in a standard format, to pull metrics data from. This is the thing that I think Prometheus really got right. Having a standard format for exposing metrics data significantly reduces the effort of writing a collector. If you look at the code in Telegraf, every single plugin (which is a metric puller) has to define custom code for pulling the data and transforming it into the proper format.
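For reference, a Prometheus-style scrape target just serves plain text in the exposition format over HTTP; the metric name and labels below are illustrative:

# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027

Any collector that speaks this format can scrape any target that exposes it, which is exactly what removes the per-plugin translation code.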

Prometheus also added an embedded database, making it more than a simple monitoring agent like Nagios. I think the database is where people get tied up on what is pull and what is push: from a database’s perspective, writes always look like a push.

Pros & Cons

There are pros and cons to both methods. The primary disadvantage of pull-based methods is that they don't work well for event-driven time series (like individual requests to an API, or events in your infrastructure). The strength of pull-based methods is that any outside system can collect telemetry data from your applications without those applications having to send data to a new service.

While service discovery is often listed as a strength of pull, it’s really just a difference in who interacts with the service discovery mechanism. If you were using a push method, you could have the client library hit service discovery on startup to find out where it should ship metrics and events. That push destination is unlikely to change, while you’ll almost certainly be spinning up new services and applications, which means a metrics puller has to poll service discovery periodically to find new targets.

Kapacitor & Pull

With this release, we've integrated Prometheus' service discovery and scraping code into Kapacitor. That means that any service discovery target that works with Prometheus will work with Kapacitor. Combined with a TICK script, you can use Kapacitor to monitor Prometheus scrape targets, write data into InfluxDB, and perform filtering, transformation and other tasks. With Kapacitor's user-defined functions (UDFs), it becomes trivial to pull in advanced anomaly detection and custom logic.

Example

Here's a quick getting-started example that makes Kapacitor write all pulled data into InfluxDB. First, you'll need InfluxDB running. Then install the Kapacitor 1.3 beta from one of these builds:
https://dl.influxdata.com/kapacitor/releases/kapacitor_1.3.0~beta2_amd64.deb
https://dl.influxdata.com/kapacitor/releases/kapacitor-1.3.0~beta2.x86_64.rpm
https://dl.influxdata.com/kapacitor/releases/kapacitor-1.3.0~beta2_windows_amd64.zip
https://dl.influxdata.com/kapacitor/releases/kapacitor-1.3.0~beta2_linux_amd64.tar.gz
https://dl.influxdata.com/kapacitor/releases/kapacitor-1.3.0~beta2_linux_i386.tar.gz
https://dl.influxdata.com/kapacitor/releases/kapacitor-1.3.0~beta2_linux_armhf.tar.gz
You'll want to update your Kapacitor config. There are a few parts to this. The first is the configuration to define discovery targets. For this example we'll discover scrape targets from a directory in the file system. Update your kapacitor.conf with this section:
[[file-discovery]]
 enabled = true
 id = "discover_file"
 refresh-interval = "10s"
 ##### This will look for prometheus json files
 ##### File format is here https://prometheus.io/docs/operating/configuration/#%3Cfile_sd_config%3E
 files = ["/Users/pauldix/prom/*.json"]
You'll want to update the files path to a directory of your own. In that directory, Kapacitor will look for JSON files that match the Prometheus file_sd format. For example, if we run the Prometheus Node Exporter on its default port (9100) on our host, we can create a JSON file prom.json:
[{
 "targets": ["localhost:9100"]
}]
With the discovery part in place, we need to hook up a scraper that will scrape all of the targets picked up by the discoverer. You can do that by adding this section to your kapacitor.conf:
[[scraper]]
 enabled = true
 name = "node_exporter"
 discoverer-id = "discover_file"
 discoverer-service = "file-discovery"
 db = "prometheus"
 rp = "autogen"
 type = "prometheus"
 scheme = "http"
 metrics-path = "/metrics"
 scrape-interval = "10s"
 scrape-timeout = "10s"
 username = ""
 password = ""
 bearer-token = ""
 ssl-ca = ""
 ssl-cert = ""
 ssl-key = ""
 ssl-server-name = ""
 insecure-skip-verify = false
Everything that gets scraped will be sent within Kapacitor to the prometheus database and autogen retention policy. That means any TICK scripts enabled in Kapacitor that stream data from that database and retention policy (RP) will receive data from the scraper at the configured scrape interval.

As a last step, let’s create a TICK script that takes the output of the scraper and writes it to our local InfluxDB. We’ll use the same database and retention policy names, so first we’ll update the influxdb section of kapacitor.conf to exclude that data from the subscription Kapacitor has to all InfluxDB data (otherwise the points we write would be streamed right back into Kapacitor):

[influxdb.excluded-subscriptions]
  _kapacitor = ["autogen"]
  prometheus = ["autogen"]
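Then the TICK script itself can be a minimal stream task that reads from that database and retention policy and writes the points back out to InfluxDB. This is only a sketch, and the task and file names are arbitrary:

stream
    // Read everything the scraper writes into prometheus.autogen.
    |from()
        .database('prometheus')
        .retentionPolicy('autogen')
    // Write the same points out to the local InfluxDB.
    |influxDBOut()
        .create()
        .database('prometheus')
        .retentionPolicy('autogen')

Save it as something like prom_to_influx.tick, then define and enable the task:

kapacitor define prom_to_influx -type stream -tick prom_to_influx.tick -dbrp prometheus.autogen
kapacitor enable prom_to_influx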

With that in place, we should see all of the scraped data in InfluxDB. We could also add TICK scripts to alert on the scraped data or transform it before we write it into the database; a simple alerting sketch follows below. Take a look at our Prometheus TICK script normalizer repo on GitHub.
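For instance, here is a rough sketch of an alert task. It assumes the Node Exporter's node_load1 metric shows up as a measurement with a value field in the prometheus database; the threshold and log path are arbitrary:

stream
    |from()
        .database('prometheus')
        .retentionPolicy('autogen')
        .measurement('node_load1')
    // Fire a critical alert when the 1-minute load average gets too high.
    |alert()
        .crit(lambda: "value" > 10)
        .log('/tmp/high_load_alert.log')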

Coming Soon

We'll be updating this and adding to it. We'll have integration with Kubernetes service discovery, a UI in Chronograf to make working with scrapers and discovery targets easier, and some pre-built Chronograf dashboards to go along with the more popular Prometheus exporters.
#2

I’d also suggest that push allows faster collection, better batching/bursting (up to a metric per second or faster), and, importantly, synchronized collection times for related metrics such as CPU user, idle, and iowait, or RAM free, cache, and buffers. These are metrics where pull systems tend to confuse users with slightly different collection times (so the components don’t add up to 100%).

Also, pull systems over the Internet run into trouble with firewalls and need NAT holes, while push systems are 100% outbound traffic, which makes them much friendlier to SaaS monitoring platforms and distributed systems.

Agreed that the on-server / service-interaction component is somewhat separate and has its own push/pull challenges, such as how to collect and store pushed data (logs, or metrics pushed from code) versus storing credentials and modifying configurations to access services such as databases, JMX, etc.

Steve Mushero, CEO
OpsStack.io
