Telegraf with multiple outputs: If one is down, no one gets the data

mephisto · October 18, 2019, 11:32am

Hi there,

I have a few things that don’t play well.

I send data with Telegraf to Multiple Outputs: (production Instance, Test Instance, 2.0 Beta Instance).

Now I have two effects I don’t understand:

If one Output is down for a reason, the other outputs don’t seem to get the Data as well.
There is no Backfill: My understanding was, that as long as metric_buffer_limit isn’t saturated, write requests get buffered until the Output is back again.

That is not the Case. If I shutdown one InfluxDB instance from my output list, all other Instances don’t get data, and a Backfill doesn’t too.

This is how the Telegraf Config looks like:

[[outputs.influxdb]]
urls = [“http://127.0.0.1:8086”]
namedrop = [“_test"]
[outputs.influxdb.tagdrop]
influxdb_database = ["”]
database = “telegraf”

[[outputs.influxdb]]
urls = [“http://192.168.x.x:8086”]
namedrop = [“_test"]
metric_buffer_limit = 100000
[outputs.influxdb.tagdrop]
influxdb_database = ["”]
database = “telegraf”

What do I miss here?

mephisto · November 3, 2019, 12:12pm

Just an update of this:

It happens if one of the Outputs runs into a undefined timeout. If one target output is just shutdown and the server replies “connection refused” the metrics buffer works as intended.

I worked around this by putting an instance of influxdb-relay in between (on the machine where telegraf is running). Timeouts of the desination output are not blocking other outputs anymore.

daniel · November 5, 2019, 6:22pm

This should definitely not happen, what version of Telegraf are you using?

mephisto · November 6, 2019, 10:39pm

1.12.4, happened with 1.12.1 too.

If I gracefully shutdown one output it works as intended, if an output goes away because of an unexpected Network failure (VPN link down), it hangs, and all other outputs hang, too. 100% reproduceable.

daniel · November 7, 2019, 3:50am

Can you run Telegraf with --debug and then collect the logs when you shutdown one of the outputs?

mephisto · November 11, 2019, 9:28pm

I will do when I get back, will take a few days.

Hipska · November 15, 2019, 11:07am

I have the same problem. When I configure a second output which is down or blocked by firewall, after reloading Telegraf, nothing happens as it times out on that output. After some time Telegraf even stops without even started gathering metrics or trying to send to other outputs.

daniel · November 15, 2019, 6:51pm

@Hipska Can you show your logs running Telegraf with --debug? You can alternately set this in the [agent] section of the configuration file.

mephisto · November 18, 2019, 10:14am

Hi,

i have just reproduced it. The Output config looks like this:

[[outputs.influxdb]]
urls = [“http://127.0.0.1:8086”]
namedrop = [“_test"]
[outputs.influxdb.tagdrop]
influxdb_database = ["”]
database = “telegraf”

[[outputs.influxdb]]
urls = [“http://192.168.x.y:8086/”]
namedrop = [“_test"]
metric_buffer_limit = 100000
[outputs.influxdb.tagdrop]
influxdb_database = ["”]
database = “telegraf”

[[outputs.influxdb]]
urls = [“http://127.0.0.1:8086”]
database = “pmi”
tagexclude = [“influxdb_database”]
[outputs.influxdb.tagpass]
influxdb_database = [“pmi”]

[[outputs.influxdb]]
urls = [“http://192.168.x.y:8086/”]
database = “pmi”
metric_buffer_limit = 100000
tagexclude = [“influxdb_database”]
[outputs.influxdb.tagpass]
influxdb_database = [“pmi”]

[[outputs.influxdb]]
urls = [“http://127.0.0.1:8086”]
database = “telegraf”
tagexclude = [“influxdb_database”]
[outputs.influxdb.tagpass]
influxdb_database = [“idc”]

[[outputs.influxdb]]
urls = [“http://192.168.x.y:8086/”]
database = “telegraf”
tagexclude = [“influxdb_database”]
[outputs.influxdb.tagpass]
influxdb_database = [“idc”]

When I tell the Firewall that leads to 192.168.x.y to drop the packets silently the following happens:

Nov 18 11:01:00 telegraf[11714]: 2019-11-18T10:01:00Z D! [outputs.influxdb] Wrote batch of 51 metrics in 61.231974ms
Nov 18 11:01:00 telegraf[11714]: 2019-11-18T10:01:00Z D! [outputs.influxdb] Buffer fullness: 1 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [outputs.influxdb] when writing to [http://192.168.x.y:8086/]: Post http://192.168.x.y:8086/write?db=telegraf: net/http: request canceled while waiting for connection
(Client.Timeout exceeded while awaiting headers)
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z D! [outputs.influxdb] Buffer fullness: 787 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [outputs.influxdb] when writing to [http://192.168.x.y:8086/]: Post http://192.168.x.y:8086/write?db=telegraf: net/http: request canceled while waiting for connection
(Client.Timeout exceeded while awaiting headers)
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z D! [outputs.influxdb] Buffer fullness: 946 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [outputs.influxdb] when writing to [http://192.168.x.y:8086/]: Post http://192.168.x.y:8086/write?db=pmi: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z D! [outputs.influxdb] Buffer fullness: 1000 / 100000 metrics
Nov 18 11:01:05 telegraf[11714]: 2019-11-18T10:01:05Z E! [agent] Error writing to outputs.influxdb: could not write any address
Nov 18 11:01:10 telegraf[11714]: 2019-11-18T10:01:10Z D! [outputs.influxdb] Buffer fullness: 0 / 100000 metrics
Nov 18 11:01:10 telegraf[11714]: 2019-11-18T10:01:10Z D! [outputs.influxdb] Wrote batch of 23 metrics in 96.375817ms
Nov 18 11:01:10 telegraf[11714]: 2019-11-18T10:01:10Z D! [outputs.influxdb] Buffer fullness: 17 / 100000 metrics

This keeps happening and data is not written to any of the output, I see a gap in the Graphs.

When the missing output comes back online, I see this:

Nov 18 11:03:40 telegraf[11714]: panic: channel is full
Nov 18 11:03:40 telegraf[11714]: goroutine 10357 [running]:
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/agent.(*trackingAccumulator).onDelivery(0xc000292780, 0x2c11e80, 0xc002bff860)
Nov 18 11:03:40 telegraf[11714]: #011/go/src/github.com/influxdata/telegraf/agent/accumulator.go:167 +0x7a
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/metric.(*trackingData).notify(...)
Nov 18 11:03:40 telegraf[11714]: #011/go/src/github.com/influxdata/telegraf/metric/tracking.go:73
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/metric.(*trackingMetric).decr(0xc00174ea60)
Nov 18 11:03:40 telegraf[11714]: #011/go/src/github.com/influxdata/telegraf/metric/tracking.go:163 +0x9e
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/metric.(*trackingMetric).Accept(0xc00174ea60)
Nov 18 11:03:40 telegraf[11714]: #011/go/src/github.com/influxdata/telegraf/metric/tracking.go:144 +0x3a
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/internal/models.(*Buffer).metricWritten(0xc0001c2fa0, 0x2c72240, 0xc00174ea60)
Nov 18 11:03:40 telegraf[11714]: #011/go/src/github.com/influxdata/telegraf/internal/models/buffer.go:93 +0x72
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/internal/models.(*Buffer).Accept(0xc0001c2fa0, 0xc002092000, 0x30, 0x30)
Nov 18 11:03:40 telegraf[11714]: #011/go/src/github.com/influxdata/telegraf/internal/models/buffer.go:179 +0xa6
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/internal/models.(*RunningOutput).Write(0xc0001aa280, 0x0, 0xc000560660)
Nov 18 11:03:40 telegraf[11714]: #011/go/src/github.com/influxdata/telegraf/internal/models/running_output.go:190 +0xf7
Nov 18 11:03:40 telegraf[11714]: github.com/influxdata/telegraf/agent.(*Agent).flushOnce.func1(0xc001755b00, 0xc0016d7bc0)
Nov 18 11:03:40 telegraf[11714]: #011/go/src/github.com/influxdata/telegraf/agent/agent.go:597 +0x27
Nov 18 11:03:40 telegraf[11714]: created by github.com/influxdata/telegraf/agent.(*Agent).flushOnce
Nov 18 11:03:40 telegraf[11714]: #011/go/src/github.com/influxdata/telegraf/agent/agent.go:596 +0xc8
Nov 18 11:03:40 systemd[1]: telegraf.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Nov 18 11:03:40 systemd[1]: telegraf.service: Unit entered failed state.
Nov 18 11:03:40 systemd[1]: telegraf.service: Failed with result ‘exit-code’.
Nov 18 11:03:40 systemd[1]: telegraf.service: Service hold-off time over, scheduling restart.

Telegraf panics, and systemd restarts it. After this, data is pouring in again.

thx for looking into this…

daniel · November 19, 2019, 1:56am

Okay, there is definitely a bug with the panic, we should open an issue on GitHub and include the entire configuration file.

It looks like you are using a plugin that creates “tracked metrics”, so the reason output stops when one output is down is because Telegraf will not take new metrics until the ones it has are delivered. This prevents Telegraf from draining messages off of a queue input when it can’t send them. Messages are only considered sent when all outputs have successfully processed it or when they are filtered.

mephisto · November 20, 2019, 10:16pm

Uuhm, not that I would know. In this case I get MQTT data and push it to multiple outputs.

I opened an issue on gh… #6694

Topic		Replies	Views
Limit Telegrafs Output-Writes/s Telegraf telegraf	1	463	January 13, 2022
None of the outputs work if one of the outputs has error Telegraf telegraf , prometheus , client-libraries , outputs	3	1209	September 20, 2022
Data Loss happens at telegraf side influxdb , telegraf	4	1443	August 8, 2019
Outputs.influxdb did not complete within its flush interval - means points get lost Telegraf telegraf	1	108	October 31, 2024
Telegraf Tail from multiple files Telegraf	0	1524	March 30, 2017

Telegraf with multiple outputs: If one is down, no one gets the data

Related topics