How to slow Telegraf data collection

Background:

I’ve put together a website that collects performance information from thousands of websites worldwide. It uses the http_response plugin, running on ~9 Telegraf instances around the world, and sends the metrics to InfluxDB.

Problem:

Telegraf is so fast in sending out requests that I get rate-limited by some of the services that I’m checking. For instance, I check about 4500 API endpoints for AWS. Each of my agents has a collection interval of ~60m. And so, every hour, Telegraf makes requests to all 4500 AWS endpoints in about 5 minutes, or about 15 a second. That’s enough to set off alarms at AWS and impose rate limits.
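As a rough check of those numbers, here is a small Python sketch of the arithmetic. It uses only the figures above (4500 endpoints, a burst of about 5 minutes, a 60-minute interval); the variable names are illustrative, and the "evenly spread" case is an idealized target, not something Telegraf does by default.

```python
# Figures taken from the post above; variable names are illustrative.
endpoints = 4500
burst_seconds = 5 * 60        # observed: all requests finish in about 5 minutes
interval_seconds = 60 * 60    # configured collection interval of about 60 minutes

# Request rate during the burst.
burst_rate = endpoints / burst_seconds

# Spacing and rate if the same checks were spread evenly
# across the whole interval instead.
even_spacing = interval_seconds / endpoints
even_rate = endpoints / interval_seconds

print(f"burst: {burst_rate:.0f} requests/s")
print(f"evenly spread: one request every {even_spacing:.1f} s ({even_rate:.2f} requests/s)")
```

Spreading the same checks over the full hour would cut the request rate by a factor of 12, which is the kind of pacing I’m after.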

Example:

You can see the result on this page for AWS. It shows several thousand failures, typically slow connections or timeouts. We know that AWS isn’t actually falling down. What we’re seeing is AWS rate-limiting my Telegraf agents.

Question:

How do I put some kind of delay on each Telegraf request? Or (better) reduce the amount of concurrency in Telegraf requests?

I’m open to solutions that do this through configuration, as well as those that involve forking/modifying the Telegraf source code. (Happy to submit a pull request if desired, though I’m very much a Go beginner.)

Attempted solution 1:

I have one hack that I’m currently trying: use the Starlark processor plugin to run a simple hash function many times to slow things down. The more for-loop iterations (line 10), the more hashes, and hopefully the longer the delay between requests. It’s a hack because it burns a ton of CPU on my dev laptop, and my hosting provider probably won’t be happy with me if I run this in production.

You can see the configuration below. (I have many different Telegraf configs, since each one sends over a tag with the company name.)

% more telegraf.httpresponse.zynga.com.conf
[[inputs.http_response]]
  response_timeout = "20s"
  follow_redirects = true
  urls = [ "https://zynga.com" ]
  [inputs.http_response.tags]
    companydisplay = "Zynga"
[[processors.starlark]]
  source = """
def apply(metric):
    for x in range(80):
        hash(str(x))
    return metric
"""

Attempted solution 2:

The other idea is to build a webpage that only delivers a result after one second (e.g., with a line such as await sleep(1000); in Node.js), and then put the URL of that webpage in my Telegraf URL list. See line 4 below.

% more telegraf.httpresponse.zynga.com.conf
[[inputs.http_response]]
  response_timeout = "20s"
  follow_redirects = true
  urls = [ "https://downhound.com/onesecondpage", "https://zynga.com" ]
  [inputs.http_response.tags]
    companydisplay = "Zynga"

Question:

Does the Telegraf http_response plugin process the URLs in its list in sequence, or with some amount of concurrency? (And again, can this concurrency be reduced?)

Thanks!

Hi @Al_Sargent,
You could separate your HTTP input plugins and then apply different collection intervals to each. For example:

# HTTP/HTTPS request given an address a method and a timeout
[[inputs.http_response]]
  interval = "30s"
  ## address is Deprecated in 1.12, use 'urls'

  ## List of urls to query.
  urls = ["http://localhost"]

[[inputs.http_response]]
  interval = "120s"
  ## address is Deprecated in 1.12, use 'urls'

  ## List of urls to query.
  urls = ["http://cloud"]

I’d try the collection_jitter option. It’s random, so it’s not guaranteed to solve the problem.

Jitter is applied per input, so if you gather everything in a single instance of a plugin it won’t make any difference. You must split the gathering into multiple instances for the jitter to have an effect.
The jitter is set at the agent level, but you can override it in an individual input instance.

This of course adds some delay to your data gathering, so take that into account in your queries. (For example, if you get points every 15 minutes with a 5-minute jitter, a point might arrive up to 20 minutes after the previous one, so query accordingly.)

As a sample:

##alias allows you to name the instance of a plugin, useful as it's printed in the logs
[agent]
  interval = "10m"
  collection_jitter = "2m"
  {...}

[[inputs.http_response]]
  {...}
  collection_jitter = "30s" ##this overrides the 2m agent default
  alias = "HTTP 1"

[[inputs.http_response]]
  {...}
  alias = "HTTP 2"

[[inputs.http_response]]
  {...}
  alias = "HTTP 3"
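With thousands of URLs, writing those instances by hand gets tedious. A minimal sketch of generating them, assuming the URLs are available as a Python list: the helper names (chunk, render_inputs), the batch size, and the example.com URLs are hypothetical, and only alias, collection_jitter, and urls from the sample above are emitted.

```python
def chunk(urls, size):
    """Split urls into consecutive batches of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def render_inputs(urls, batch_size, jitter="2m"):
    """Render one [[inputs.http_response]] block per batch of URLs."""
    blocks = []
    for n, batch in enumerate(chunk(urls, batch_size), start=1):
        url_list = ", ".join(f'"{u}"' for u in batch)
        blocks.append(
            "[[inputs.http_response]]\n"
            f'  alias = "HTTP {n}"\n'
            f'  collection_jitter = "{jitter}"\n'
            f"  urls = [{url_list}]\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    example_urls = [f"https://example.com/endpoint/{i}" for i in range(5)]
    print(render_inputs(example_urls, batch_size=2))
```

Smaller batches mean more instances, each with its own random offset, so the requests are spread more evenly across the jitter window.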

Thanks @Giovanni_Luisotto and @Jay_Clifford. Both of those might spread out Telegraf collection after the first collection (I haven’t tested). The problem that remains, though, is that the first collection still resembles a thundering herd.

Telegraf is fast enough to ping ~10,000 URLs in about 1m15s, which is enough to set off rate limiting at AWS (with 4500 URLs) and at Cloudflare (which handles caching for many of the other URLs).

So, even if subsequent runs are spread out using your ideas above, the rate limit is still triggered by the first set of 10k HTTP requests.

I don’t see why the jitter shouldn’t work on the first run, as it adds a random offset to the gathering…
The only other way I see is to use the exec or execd input, meaning you will need to write your own code outside Telegraf, where you can apply whatever logic you want.
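As an illustration of that last option, here is a minimal Python sketch of such an external script: it checks URLs one at a time with a fixed pause between requests and prints one line of InfluxDB line protocol per URL on stdout. The measurement name paced_http_check, the field names, and the function names are assumptions made for this example; it is not a drop-in replacement for http_response.

```python
import http.client
import sys
import time
import urllib.error
import urllib.request


def escape_tag(value):
    """Escape commas, equals signs, and spaces in a line protocol tag value."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def check(url, timeout=20.0):
    """Request url once; return (status_code, elapsed_seconds).

    status_code is 0 when no HTTP response was received.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # server answered with an error status
    except (OSError, http.client.HTTPException):
        status = 0  # DNS failure, refused connection, timeout, ...
    return status, time.monotonic() - start


def to_line(url, status, elapsed):
    """Format one result as a line of InfluxDB line protocol."""
    return (
        f"paced_http_check,url={escape_tag(url)} "
        f"http_response_code={status}i,response_time={elapsed:.3f}"
    )


def run(urls, delay_seconds, check_fn=check, sleep_fn=time.sleep, out=sys.stdout):
    """Check urls sequentially, pausing delay_seconds between requests."""
    for i, url in enumerate(urls):
        if i > 0:
            sleep_fn(delay_seconds)
        status, elapsed = check_fn(url)
        print(to_line(url, status, elapsed), file=out, flush=True)


if __name__ == "__main__":
    # URLs are passed as command-line arguments; one request per second.
    run(sys.argv[1:], delay_seconds=1.0)
```

Telegraf would read this output with data_format = "influx". Because a paced run over thousands of URLs takes a long time, a long-running process under execd is probably a better fit than exec, which expects the command to finish within its timeout.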

Indeed, jitter will also work at the first collection interval.

Also, very recent Telegraf versions add an option to define an offset to the collection interval (collection_offset).