Hello
Quick description of the setup:
- 1 InfluxDB 2 server managing the Telegraf config (name: ConfigServer1, influxd 2.7.6)
- 1 InfluxDB server storing the data sent by the Telegraf agents (name: DataServer1)
- N Windows clients with the Telegraf agent installed, loading their config from ConfigServer1 (Telegraf v1.30.2)
- The idea for now is to move from local config files to centralized configs.
My question:
How does Telegraf behave if ConfigServer1 is down when the Telegraf agent tries to start?
I did some tests (under Windows):
- The Telegraf agent service starts (remember, the server with the config is down) and stays up and running, but nothing is sent and no logs are generated (logfile = ‘${ProgramData}\telegraf\telegraf.log’).
- In the Event Viewer, Telegraf is shown as started, with an event saying it is loading the config (which is not possible, since the server is down).
- Querying the Telegraf service status reports the service as started.
- If I stop the service and restart it, the state is the same.
- If I run telegraf.exe in console mode, it tries to start, fails, and stops:
C:\Program Files\Telegraf>telegraf.exe --config http://<IP redacted>:8086/api/v2/telegrafs/<telegraf conf id redacted>
2024-05-14T06:39:42Z I! Loading config: http://<IP redacted>:8086/api/v2/telegrafs/<telegraf conf id redacted>
2024-05-14T06:39:44Z E! error loading config file http://<IP redacted>:8086/api/v2/telegrafs/<telegraf conf id redacted>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<IP redacted>:8086/api/v2/telegrafs/<telegraf conf id redacted>": dial tcp <IP redacted>:8086: connectex: No connection could be made because the target machine actively refused it.
- I also tried this sequence:
- ConfigServer1 down
- start the Telegraf service (no logs, no data sent, but the service is up)
- wait 5 min
- start ConfigServer1
- wait 5 min
- check the status of Telegraf
- nothing changes → the Telegraf service is still up and running, but not sending data or logging
The “expected” behavior should (or could) be one of:
- The service goes down because it is unable to load the config,
- then the --service-auto-restart setting does the job (if ConfigServer1 is back online 10 min later, the Telegraf service will finally load the config).
- Or the config is cached locally and Telegraf loads the cached copy, with a notification in the event log (warning: the cached config was used, check what is going on).
Thanks for your help.
Telegraf makes 3 quick attempts to download the config, if all those attempts fail, then Telegraf will fail to start. There currently are no config options to change this scenario.
We currently have a spec that we are reviewing to add additional retries to this behavior: docs: Add URL config behavior spec by powersj · Pull Request #15321 · influxdata/telegraf · GitHub
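The retry pattern visible in the logs in this thread can be modelled roughly as below. This is only a sketch inferred from the log lines (an immediate failure on a refused connection, and three 10-second waits on a non-200 status); it is not Telegraf's source code, and `load_remote_config`, `fetch`, and `ConfigLoadError` are hypothetical names used for illustration.

```python
import time
from typing import Callable, Tuple


class ConfigLoadError(Exception):
    """Raised when the remote config cannot be loaded."""


def load_remote_config(
    fetch: Callable[[], Tuple[int, str]],
    max_retries: int = 3,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Model of the retry pattern seen in the logs (not Telegraf's code).

    fetch() returns (http_status, body) or raises ConnectionError.
    """
    for attempt in range(max_retries + 1):
        try:
            status, body = fetch()
        except ConnectionError as exc:
            # As in the logs: a refused connection aborts at once,
            # which is why the message always says "retry 0 of 3".
            raise ConfigLoadError(
                f"retry {attempt} of {max_retries} failed connecting "
                f"to HTTP config server: {exc}"
            ) from exc
        if status == 200:
            return body
        if attempt == max_retries:
            raise ConfigLoadError(
                f"retry {attempt} of {max_retries} failed to retrieve "
                f"remote config: {status}"
            )
        print(
            f"Error getting HTTP config. Retry {attempt} of {max_retries} "
            f"in {delay_s:g}s. Status={status}"
        )
        sleep(delay_s)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the model testable without real waiting.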
the config is cached locally
This is not something we are interested in adding to Telegraf.
Hello Josh
All the tests I have done so far show the service not failing, even multiple hours after the service was started.
I tested via the command line and yes, Telegraf fails with an error.
But when installed as a service it does not fail or stop.
I tested with 1.28.5/1.29.2/1.30.2 → same behavior: once running as a service, it never fails or stops if the ConfigServer service is down.
Stupid question: since the logfile parameter is in the config loaded over HTTP, no log is generated. Is there a way to create the service while specifying the log file path?
Something like the following, or a way to redirect stderr to a log file from a running service:
"C:\Program Files\Telegraf\telegraf.exe" --config http://<redactedIP>:8086/api/v2/telegrafs/<redactedId> --log c:\temp\telegraf.log --service install
About caching the config file:
This is not something we are interested in adding to Telegraf.
Noted, I had to ask.
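Since caching will not be added to Telegraf itself, one possible workaround outside Telegraf is a small pre-start script that downloads the config to a local file and falls back to the last good copy when the server is unreachable. The sketch below is an assumption-laden illustration, not a supported feature: `refresh_cached_config` and `http_fetch` are hypothetical helpers, and the `Authorization: Token ...` header is assumed to be what the InfluxDB 2 API expects in this setup.

```python
import logging
import os
import tempfile
import urllib.request
from pathlib import Path
from typing import Callable, Optional

log = logging.getLogger("telegraf-prestart")


def http_fetch(url: str, token: Optional[str] = None, timeout: float = 5.0) -> str:
    """Download the config text; raises OSError (URLError) on failure."""
    req = urllib.request.Request(url)
    if token:
        # Assumption: the config server expects an InfluxDB 2 API token.
        req.add_header("Authorization", f"Token {token}")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def refresh_cached_config(fetch: Callable[[], str], cache_path: Path) -> Path:
    """Update the local cache from fetch(); fall back to the cache on failure."""
    try:
        text = fetch()
    except OSError as exc:  # URLError and HTTPError are subclasses of OSError
        if cache_path.exists():
            log.warning(
                "config server unreachable (%s); using cached config %s",
                exc, cache_path,
            )
            return cache_path
        raise  # no cached copy yet, so there is nothing to fall back to
    # Write to a temp file first so a crash never leaves a partial config.
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=cache_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(text)
    os.replace(tmp, cache_path)
    return cache_path
```

Telegraf would then be started with `--config` pointing at the returned local path instead of the URL, and the warning could be forwarded to whatever alerting is in place.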
I would look at the Windows Event viewer then and see what shows up there. I think by default, this is where I would expect messages should go until the config file is loaded.
I would look at the Windows Event viewer then and see what shows up there. I think by default, this is where I would expect messages should go until the config file is loaded.
I already checked that and described it in the first message (although maybe not clearly enough).
The only entry in the Event Viewer is:
Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted conf ID>
There is no error event, which is why I’m asking for another way to generate logs or redirect them to a file (if possible).
I’m asking for an alternative way to configure logging to help debug why the service is not failing as it is supposed to.
To summarize (I did some more tests):
In a normal situation everything works.
If ConfigServer1 is down:
- command-line execution fails → expected
C:\Program Files\Telegraf>"C:\Program Files\Telegraf\telegraf.exe" --config http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T06:36:11Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T06:36:13Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
- service mode does not fail → the service stays up; it should fail and stop
If the config ID is wrong (I changed it to a fake ID for testing):
- command-line execution fails → expected
C:\Program Files\Telegraf>"C:\Program Files\Telegraf\telegraf.exe" --config http://<redacted IP>:8086/api/v2/telegrafs/<redacted fake config ID>
2024-05-15T06:35:23Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted fake config ID>
2024-05-15T06:35:23Z I! Error getting HTTP config. Retry 0 of 3 in 10s. Status=400
2024-05-15T06:35:33Z I! Error getting HTTP config. Retry 1 of 3 in 10s. Status=400
2024-05-15T06:35:43Z I! Error getting HTTP config. Retry 2 of 3 in 10s. Status=400
2024-05-15T06:35:53Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted fake config ID>: retry 3 of 3 failed to retrieve remote config: 400 Bad Request
- service mode does not fail → the service stays up; it should fail and stop
I will file a bug report.
One last thing:
I found that I can use NSSM to install the service, and NSSM can redirect stderr to a file,
so now I have the logs.
The service keeps trying to load the config and loops on “retry 0 of 3 failed”:
2024-05-15T07:22:55Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:22:57Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:22:58Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:00Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:00Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:02Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:03Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:05Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:05Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:07Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:08Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:10Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:10Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:12Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:13Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:15Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:16Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:18Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:18Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:20Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:21Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:23Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:23Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:25Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:26Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:28Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:28Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:30Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:31Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:33Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:33Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:35Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:36Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:38Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:38Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:40Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:41Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
2024-05-15T07:23:43Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.
2024-05-15T07:23:43Z I! Loading config: http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>
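A loop like the one in the log above could be spotted automatically by counting how often "Loading config" appears within a short time window. The following is a minimal sketch that assumes the log line layout shown in this thread (`<UTC timestamp>Z <level>! <message>`); the function names and the default threshold are arbitrary choices for illustration.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List

# Matches lines such as: 2024-05-15T07:22:55Z I! Loading config: http://...
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z ([IWED])! (.*)$")


def max_loads_in_window(
    lines: Iterable[str], window: timedelta = timedelta(minutes=1)
) -> int:
    """Return the largest number of 'Loading config' lines in any window."""
    stamps: List[datetime] = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group(3).startswith("Loading config:"):
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S"))
    stamps.sort()
    best = 0
    start = 0
    for end, ts in enumerate(stamps):
        # Shrink the window from the left until it spans at most `window`.
        while ts - stamps[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


def looks_like_restart_loop(
    lines: Iterable[str],
    threshold: int = 5,
    window: timedelta = timedelta(minutes=1),
) -> bool:
    """True if the config was (re)loaded at least `threshold` times in a window."""
    return max_loads_in_window(lines, window) >= threshold
```

A healthy start produces a single "Loading config" line, so anything well above one per minute suggests the service is being restarted repeatedly.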
That is interesting that nssm has additional info and options. Thanks for giving it a try.
It seems that restarting the service on failure is the default behavior of both the way Telegraf creates a service and of NSSM.
I don’t necessarily consider that behavior a bug, rather something the user needs to configure or change in either NSSM or the service itself after setting up. We have plenty of users who prefer things get restarted rather than die.
Debugging services on Windows is not the best experience, but it is not clear to me how Telegraf could improve this situation. One thing that we do suggest is not running Telegraf as a service, but by hand as it would have shown you the same error as well.
One thing that we do suggest is not running Telegraf as a service, but by hand as it would have shown you the same error as well.
That is not the case:
- The command-line logs show that it tries and fails with a clear message:
- “retry 0 of 3 failed”
- “2024-05-15T06:36:13Z E! error loading config file http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>: retry 0 of 3 failed connecting to HTTP config server: Get "http://<redacted IP>:8086/api/v2/telegrafs/<redacted config ID>": dial tcp <redacted IP>:8086: connectex: No connection could be made because the target machine actively refused it.”
- The service tries again and again, always stays at “retry 0 of 3 failed”, and loops:
- “retry 0 of 3 failed”
- “retry 0 of 3 failed”
- “retry 0 of 3 failed”
- “retry 0 of 3 failed”
- etc.
- Here nothing, no error is logged at all (in the event log, for example).
- Telegraf has an option that configures the service to restart after 5m (configurable):
- --service-auto-restart / --service-restart-delay
- So that should be sufficient for people who like it to retry forever.
NSSM has options to redirect stdin, stdout, and stderr to a file. Telegraf logs to stderr by default; I remembered that, and I already use NSSM on the InfluxDB server.
I’m not sure debugging a Windows service is that complex (well, it depends on what we need to debug, but not for getting the logs).
A few suggestions:
- When running as a service, stderr could be sent to a log file in the %temp% folder, activated by the --debug option for example.
- Telegraf is able to write to the event log, but it does not seem to log the retry failure each time it occurs, which would help when debugging the service.
- If it wrote “failed to load config” to the Event Viewer every 10s, at least we would know we have that issue.
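The first suggestion (capturing stderr in a file) can be approximated today with a small wrapper process, which is essentially what NSSM's stderr redirection provides. The sketch below only illustrates the idea; it is not a Windows service wrapper, and `run_with_stderr_log` is a hypothetical helper name.

```python
import subprocess
import sys
from pathlib import Path
from typing import Sequence


def run_with_stderr_log(cmd: Sequence[str], log_path: Path) -> int:
    """Run cmd, append everything it writes to stderr to log_path,
    and return the child's exit code."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "ab") as log_file:
        return subprocess.run(cmd, stderr=log_file, check=False).returncode


# Hypothetical usage with the paths mentioned in this thread:
# run_with_stderr_log(
#     [r"C:\Program Files\Telegraf\telegraf.exe", "--config", "<config URL>"],
#     Path(r"C:\temp\telegraf.log"),
# )
```

Because the exit code is returned unchanged, whatever supervises the wrapper can still decide whether to restart or stop.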
- here nothing, no error logged at all.
Which is exactly the point I’m making. Run by the CLI and you get everything.
- here nothing, no error logged at all. (in the event log for example)
This will be fixed by: fix(windows): Make sure to log the final error message by srebhan · Pull Request #15346 · influxdata/telegraf · GitHub
- so that should be sufficient for the people that like it try for ever
We are not changing the existing behavior.