Shutting execd process down quicker when reloading the Telegraf configuration

Hi all,

I have a Telegraf configuration file with execd input plugins (calling Python). I let Telegraf watch the configuration file with --watch-config.

I noticed that when the configuration file changes, the configuration is reloaded, but it takes 5 seconds per Python process to shut down:

2023-11-09T15:28:11Z I! Reloading Telegraf config
2023-11-09T15:28:16Z I! [inputs.execd] Process python/python.exe shut down
2023-11-09T15:28:21Z I! [inputs.execd] Process python/python.exe shut down

This makes restarting very slow. If I terminate Telegraf with Control-C, the Python processes do shut down quickly:

2023-11-09T15:31:05Z I! [inputs.execd] Process python/python.exe shut down
2023-11-09T15:31:05Z I! [inputs.execd] Process python/python.exe shut down
2023-11-09T15:31:05Z I! [agent] Hang on, flushing any cached metrics before shutdown
2023-11-09T15:31:05Z I! [agent] Stopping running outputs

My question: Is there something that can be done to make the Python processes stop quicker when reloading?

Important information: I’m using Windows.

Telegraf version: 1.28.3

To try this yourself, you can use the following configuration. It doesn’t display anything, but you can observe that reloading takes indeed 5 seconds, while terminating with Control-C is nearly instantaneous.

# Comment, change to reload
[[inputs.execd]]
  command = ["python", "-c", "import time; [time.sleep(1) for i in range(1000000)]"]
  data_format = "influx"
  restart_delay = "10s"

[[outputs.file]]
  files = ["stdout"]
  data_format = "influx"

In general, I think the kill and sighup follow the same path

Do you get any error messages as well about sending SIGTERM or killing the process?

I wonder if during the SIGHUP we try to gracefully clean up are a hitting a 5 second timeout?

Thank you for your reply!

Sometimes there are error messages, e.g.

2023/11/10 09:59:48 Error in Watcher Error channel: CancelIo: The handle is invalid.
2023/11/10 09:59:48 Error in Watcher Error channel: CloseHandle: The handle is invalid.

I don’t know whether this is related to the issue.

I did some additional testing and it seems that on Windows, shutdown is done gracefully, while reload is not.

During Telegraf shutdown, Python detects a KeyboardInterrupt, and optionally, Python can also detect the shutdown with win32api.SetConsoleCtrlHandler. I can clean up and exit, all is good. If the cleanup takes more than 5 seconds, Python gets killed - also good.

During Telegraf reload, Python detects nothing, so it seems Telegraf doesn’t send any signals. After 5 (lost) seconds, Python gets killed and then Telegraf restarts.

My issue is that I don’t want the 5 lost seconds, but it does seem like it is a part of a bigger bug, namely that Telegraf reload (on Windows) does not follow the same clean shutdown procedure for execd, as Telegraf shutdown does.

Full log, Python and Telegraf output together:

2023-11-10T08:59:37Z I! Loading config: telegraf.conf
2023-11-10T08:59:37Z I! Starting Telegraf 1.28.3 brought to you by InfluxData the makers of InfluxDB
2023-11-10T08:59:37Z I! Available plugins: 240 inputs, 9 aggregators, 29 processors, 24 parsers, 59 outputs, 5 secret-stores
2023-11-10T08:59:37Z I! Config watcher started
2023-11-10T08:59:37Z I! Loaded inputs: execd
2023-11-10T08:59:37Z I! Loaded aggregators:
2023-11-10T08:59:37Z I! Loaded processors:
2023-11-10T08:59:37Z I! Loaded secretstores:
2023-11-10T08:59:37Z I! Loaded outputs: file
2023-11-10T08:59:37Z I! [agent] Config: Interval:10s, Quiet:false, Flush Interval:10s
2023-11-10T08:59:37Z I! [inputs.execd] Starting process: python.exe [hangup.py]

Python:
2023-11-10T08:59:37.313398 Starting indefinite loop

Manual action: 
Change config file comment

2023-11-10T08:59:48Z I! Config file modified
2023-11-10T08:59:48Z I! Reloading Telegraf config
2023/11/10 09:59:48 Error in Watcher Error channel: CancelIo: The handle is invalid.
2023/11/10 09:59:48 Error in Watcher Error channel: CloseHandle: The handle is invalid.
2023-11-10T08:59:53Z I! [inputs.execd] Process python.exe shut down
2023-11-10T08:59:53Z I! [agent] Hang on, flushing any cached metrics before shutdown
2023-11-10T08:59:53Z I! [agent] Stopping running outputs
2023-11-10T08:59:53Z I! Loading config: telegraf.conf
2023-11-10T08:59:53Z I! Config watcher started
2023-11-10T08:59:53Z I! Starting Telegraf 1.28.3 brought to you by InfluxData the makers of InfluxDB
2023-11-10T08:59:53Z I! Available plugins: 240 inputs, 9 aggregators, 29 processors, 24 parsers, 59 outputs, 5 secret-stores
2023-11-10T08:59:53Z I! Loaded inputs: execd
2023-11-10T08:59:53Z I! Loaded aggregators:
2023-11-10T08:59:53Z I! Loaded processors:
2023-11-10T08:59:53Z I! Loaded secretstores:
2023-11-10T08:59:53Z I! Loaded outputs: file
2023-11-10T08:59:53Z I! [agent] Config: Interval:10s, Quiet:false, Flush Interval:10s
2023-11-10T08:59:53Z I! [inputs.execd] Starting process: python.exe [hangup.py]


Python:
2023-11-10T08:59:53.159398 Starting indefinite loop

Manual action: 
Terminate Telegraf with Control-C

Python:
2023-11-10T09:00:08.390532 win32api.SetConsoleCtrlHandler triggered, signal: 0. Sleeping for 2 seconds.
2023-11-10T09:00:10.391763 Now really exiting
2023-11-10T09:00:10.396079 Keyboardinterrupt detected, exiting

2023-11-10T09:00:10Z I! [inputs.execd] Process python.exe shut down
2023-11-10T09:00:10Z I! [agent] Hang on, flushing any cached metrics before shutdown
2023-11-10T09:00:10Z I! [agent] Stopping running outputs

I’ll file a bug report for this next week, unless you can spot something I’m missing.

I am not sure I agree that this is a bug. The decision to do a graceful shutdown during a SIGHUP was a design decision and makes some sense as you can prevent breaking during the middle of an interval. Whereas during a shutdown you are saying I’m done and don’t care about gracefully cleaning things up.

That said I would consider a feature request to possibly expose the shutdown timeout.

I’m not sure whether we understand each other. With SIGHUP, do you mean Telegraf reload or Telegraf shutdown?

My observation is that during a Telegraf shutdown, things are gracefully cleaned up.

During a Telegraf reload however, which can happen at a random moment during operation, Python just gets killed. Since this is during operation, you would expect cleanup to prevent data loss as much as possible.

I’m not sure whether we understand each other

I agree, your original post asks the question how to clean up faster during a SIGHUP. Re-reading it sounds like you see the 5 second timeout get hit and kills your process.

What is your expected behavior when doing a SIGHUP? Block until the command completes? Or don’t try to clean up, just kill and move on?

I’m not familiar with SIGHUP. From what I can read about it, I’m guessing it means the signal sent by Telegraf to Python, when Telegraf wants Python to shut down. Is that what you mean with SIGHUP?

My original question is how to make the Python process stop quicker when reloading the Telegraf configuration. (No cleanup is involved in my case, although in general it is possible to be needed.)

Just to be clear: as soon as my Python script is notified that it needs to exit, it can exit within milliseconds. I just added 2 seconds of artifical cleanup for demonstration purposes. This can be observed when Telegraf is shut down (KeyboardInterrupt is detected in Python). The problem is that during Telegraf reloading, the Python script never gets notified, as far as I can tell (see logs and the example Telegraf configuration). Without notification, Python keeps running and gets killed after 5 seconds.

In summary: I experience that Telegraf terminates execd processes differently during reloading, as compared to during shutdown. In my view, the current shutdown procedure is the correct one for both.

I hope this clears things up!

SIGHUP is a system call that telegraf will react to. Namely, when the telegraf process receives a SIGHUP from either the user, the operating system, or from the watch config file logic, then we will stop the plugins and reload the configuration. This alleviates the need to have to stop telegraf and start it back up again.

fwiw sending Control-C sends the SIGINT system call.

The problem is that during Telegraf reloading , the Python script never gets notified,

Thanks, this helps put things together. This is because the sighup is not going to be sent or propagated to your script.

Thank you for the clarification.

I still think Telegraf behaves differently than one would expect in this particular case, but it might be related to OS limitations, and not easy to adjust.

I will try to find a different way to achieve what I’m aiming for.

In any case, this thread might help people with similar observations to make sense of it. Thank you for your help!

1 Like