I’ve put together a website that collects performance information from thousands of websites worldwide. It uses the http_response plugin, running on ~9 Telegraf instances around the world, which send metrics to InfluxDB.
Telegraf sends requests so quickly that I get rate-limited by some of the services I’m checking. For instance, I check about 4500 API endpoints for AWS. Each of my agents has a collection interval of ~60 minutes. So every hour, Telegraf hits all 4500 AWS endpoints in about 5 minutes, or about 15 requests a second. That’s enough to set off alarms at AWS and trigger rate limits.
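That back-of-the-envelope rate, as a quick check:

```javascript
// Request rate during one hourly sweep (numbers from the post above)
const endpoints = 4500;      // AWS API endpoints checked per pass
const burstSeconds = 5 * 60; // Telegraf finishes the sweep in ~5 minutes
const perSecond = endpoints / burstSeconds;
console.log(perSecond);      // requests per second hitting AWS
```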
You can see the result on this page for AWS. It shows several thousand failures, typically slow connections or timeouts. We know that AWS isn’t actually falling down; what we’re seeing is AWS rate-limiting my Telegraf agents.
How do I put some kind of delay on each Telegraf request? Or (better) reduce the amount of concurrency in Telegraf requests?
I’m open to solutions that do this through configuration, as well as those that involve forking/modifying the Telegraf source code. (Happy to submit a pull request if desired, though I’m very much a Go beginner.)
Attempted solution 1:
One hack I’m currently trying: use the Starlark processor plugin to run a simple hash function many times to slow things down. The more iterations the for loop runs (`range(80)` in the config below), the more hashes are computed, and hopefully the more delay between requests. It’s a hack because it burns a ton of CPU on my dev laptop, and my hosting provider probably won’t be happy with me if I run it in production.
You can see the configuration below. (I have many different Telegraf configs, since each one sends over a tag with the company name.)
```
% more telegraf.httpresponse.zynga.com.conf
[[inputs.http_response]]
  response_timeout = "20s"
  follow_redirects = true
  urls = [
    "https://zynga.com"
  ]
  [inputs.http_response.tags]
    companydisplay = "Zynga"

[[processors.starlark]]
  source = """
def apply(metric):
    for x in range(80):
        hash(str(x))
    return metric
"""
```
Attempted solution 2:
The other idea is to build a webpage that only delivers a result after one second (e.g., a NodeJS handler that does `await sleep(1000);`). Then I’d put the URL of that webpage in my Telegraf URL list; see the first URL in the config below.
```
% more telegraf.httpresponse.zynga.com.conf
[[inputs.http_response]]
  response_timeout = "20s"
  follow_redirects = true
  urls = [
    "https://downhound.com/onesecondpage",
    "https://zynga.com"
  ]
  [inputs.http_response.tags]
    companydisplay = "Zynga"
```
Does the Telegraf http_response plugin process the URLs in its list sequentially, or with some degree of concurrency? (And again, can that concurrency be reduced?)