Telegraf service doesn't start after Windows updates: Error 7009 A timeout was reached

Hey all,

I am experiencing an issue on a few Windows VMs, ranging from Windows Server 2012 R2 to Windows Server 2016, where the Telegraf service times out and fails to start after Windows updates are deployed.

Currently I have the service set to Automatic (Delayed Start) to try to combat this, which has worked in many cases but not all. We still have some servers that refuse to start the Telegraf service after Windows updates, which is very inconvenient.

Here is the error we receive after Windows updates are deployed:

[screenshot: telegraf error1]

And here are the settings for the service, showing how it is configured:



Has anyone had this issue in the past and found a good way to solve it? Thanks for any help you might provide!

Are you saying Telegraf was running, then updates were applied, then telegraf stops?

Are you logging anything to a file or the event log and if so do you see anything?
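In case it helps with checking the event log, here is a minimal sketch that pulls the most recent Service Control Manager timeout events from the System log with `wevtutil`. The event ID 7009 comes from the error in the thread title; the function names and the default count of 5 are just illustrative, and the query itself only works on Windows.

```python
import subprocess


def build_event_query(event_id=7009, count=5):
    """Return a wevtutil command that lists the newest matching System log events."""
    xpath = f"*[System[(EventID={event_id})]]"
    return [
        "wevtutil", "qe", "System",
        f"/q:{xpath}",    # XPath filter on the event ID
        f"/c:{count}",    # maximum number of events to return
        "/rd:true",       # read newest events first
        "/f:text",        # human-readable text output
    ]


def recent_events(event_id=7009, count=5):
    """Run the query (Windows only) and return the raw text output."""
    result = subprocess.run(
        build_event_query(event_id, count),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `print(recent_events())` on an affected server right after a failed start should show whether the timeout event is the only thing logged for the service.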

Hi jpowers! Thanks for reaching out.

Sorry, I think I wasn’t clear in the original post. Telegraf is installed as a Windows service that is set to start automatically, with a delayed start, when Windows Server boots. On a normal reboot, the Telegraf agent starts automatically as one would expect. However, after applying monthly Windows updates and going through the routine restart that Windows requires, the Telegraf service frequently fails on the next startup. The Telegraf service failure log is posted as the first picture. How would you recommend getting more detailed logs from Telegraf in this case?

Hi,

More questions:

  • Does telegraf eventually start up or never work?
  • Does this happen after just a restart? I ask because I also wonder if networking is not up yet.
  • What plugins are you using?
  • If you start telegraf by hand does it work?
  • Does telegraf eventually start up or never work?

Unfortunately, it never starts on its own in these cases and requires manual intervention, even with all of the Windows service fail-safes shown above for when a service fails.

  • Does this happen after just a restart? I ask because I also wonder if networking is not up yet.

Since they are production VMs, and oftentimes databases, I can’t test this quite as easily, but it is possible that it is happening on every restart on a few select servers. Most of the ~100 VMs we have Telegraf deployed on can restart and the service will start with no issue, so if it is happening on every restart, it could possibly be a resource issue on a few specific boxes. It really seems linked to Windows updates and the additional load / processing going on after installing a monthly one.

I agree it seems like a possible solution could be adding a dependency to the service to only start after a specific networking component is available, but I’m not certain which would be the best to choose.
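For reference, a service dependency can be set with `sc.exe config <service> depend= <list>`, where the list is separated by forward slashes. A minimal sketch of that is below. The service name `telegraf` and the dependency names `Tcpip` and `Dnscache` are assumptions for illustration, not a recommendation of which components to depend on. Note that `depend=` replaces any existing dependency list, and the command needs an elevated prompt.

```python
import subprocess


def build_depend_command(service, dependencies):
    """Return an sc.exe command that sets the service's dependency list.

    sc.exe takes "depend=" followed by a "/"-separated list of service names.
    This replaces the existing list rather than appending to it.
    """
    if not dependencies:
        raise ValueError("at least one dependency is required")
    return ["sc.exe", "config", service, "depend=", "/".join(dependencies)]


def apply_dependencies(service, dependencies):
    """Run the command (Windows only, elevated); raises if sc.exe fails."""
    subprocess.run(build_depend_command(service, dependencies), check=True)


# Hypothetical usage; the names here are assumptions:
# apply_dependencies("telegraf", ["Tcpip", "Dnscache"])
```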

  • What plugins are you using?

We are using the Windows performance and MS SQL server performance plugins.

  • If you start telegraf by hand does it work?

Yes, once our Grafana alerts go off that a Telegraf agent is no longer reporting, we can manually start the service fine without a server reboot or anything.
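Since a manual start works, one possible stopgap is to automate it, for example from a scheduled task that runs some time after boot. The following is only a sketch under assumptions: the service name `telegraf`, the retry count, and the wait time are placeholders, and the parser assumes the English `sc.exe query` output with a `STATE` line.

```python
import re
import subprocess
import time

RUNNING = 4  # numeric code sc.exe reports for a running service


def parse_state(sc_output):
    """Extract the numeric state code from `sc.exe query` output, or None."""
    match = re.search(r"STATE\s*:\s*(\d+)", sc_output)
    return int(match.group(1)) if match else None


def is_running(service):
    """Query the service (Windows only) and report whether it is running."""
    out = subprocess.run(
        ["sc.exe", "query", service], capture_output=True, text=True
    ).stdout
    return parse_state(out) == RUNNING


def ensure_running(service="telegraf", attempts=5, wait_seconds=60):
    """Try to start the service until it runs or attempts are exhausted."""
    for _ in range(attempts):
        if is_running(service):
            return True
        subprocess.run(["sc.exe", "start", service])
        time.sleep(wait_seconds)  # give the service time to come up
    return is_running(service)
```

This does not fix the underlying timeout; it only replaces the manual step described above.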

Yes, once our Grafana alerts go off that a Telegraf agent is no longer reporting, we can manually start the service fine without a server reboot or anything.

I suspect that telegraf is then failing to start because a service is not up, networking is not up, or something else is preventing telegraf from connecting to an output and as a result it fails to start.

I see you already have it set to Delayed Start, which is what I was going to suggest. I do not know enough about Windows services to say whether there is a better option, other than maybe increasing the restart delay.
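One more thing that may be worth looking at: error 7009 is logged by the Service Control Manager when a service does not respond within the start timeout, which defaults to 30 seconds. That timeout is system-wide and can be raised through the `ServicesPipeTimeout` registry value (a DWORD in milliseconds under `HKLM\SYSTEM\CurrentControlSet\Control`). Below is a rough sketch that builds the matching `reg add` command. The 60000 ms value is an arbitrary example, the function names are made up, and whether this helps on the affected servers is untested. The change needs admin rights and a reboot to take effect.

```python
import subprocess

CONTROL_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control"


def build_pipe_timeout_command(timeout_ms=60000):
    """Return a reg.exe command that sets ServicesPipeTimeout (milliseconds)."""
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    return [
        "reg", "add", CONTROL_KEY,
        "/v", "ServicesPipeTimeout",
        "/t", "REG_DWORD",
        "/d", str(timeout_ms),
        "/f",  # overwrite an existing value without prompting
    ]


def set_pipe_timeout(timeout_ms=60000):
    """Apply the change (Windows only, elevated); a reboot is still required."""
    subprocess.run(build_pipe_timeout_command(timeout_ms), check=True)
```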