AWS Timestream output plugin problem

AlexL · April 24, 2023, 12:30pm

Hello,

Every so often I start getting an error that the output plugin did not complete within the flush interval, and the agent stays stuck in that state until I restart it.

Also, it does not shut down cleanly, so I have to kill the process to restart it. After a restart it works fine until the next occurrence.

The platform is Windows; the input is tail on JSON.

Version is 1.25.0 … What could the problem be? Any ideas?

Thank you in advance.

jpowers · April 24, 2023, 7:26pm

At a high level, this means that the output plugin took longer than the flush_interval setting to finish sending metrics. If an output is still in the middle of a write, we skip the next interval so that multiple write attempts are not made at the same time and so that metrics show up in some order.
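To make the skip behaviour concrete, here is a minimal Python sketch of the general "skip the tick if the previous write is still running" pattern. It is not Telegraf's actual implementation; the class name `SkippingFlusher`, the `tick` method, and the `skipped` counter are all hypothetical names chosen for illustration.

```python
import threading


class SkippingFlusher:
    """Starts write() on each tick unless the previous write is still running.

    Hypothetical illustration only; not taken from Telegraf's source.
    """

    def __init__(self, write):
        self._write = write
        self._busy = threading.Lock()
        self.skipped = 0  # number of intervals skipped so far

    def tick(self):
        # Non-blocking acquire: if a write is still in flight, skip this interval
        # instead of starting a second, overlapping write.
        if not self._busy.acquire(blocking=False):
            self.skipped += 1
            return None
        worker = threading.Thread(target=self._run)
        worker.start()
        return worker

    def _run(self):
        try:
            self._write()
        finally:
            # Always free the slot, even if the write raised an exception.
            self._busy.release()
```

If `write` never returns (for example, a call that hangs with no timeout), every later `tick()` is skipped, which would look like the stuck state described in the first post.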

The plugin is probably hung, waiting on something to come back, or timing out while trying to get data back.

Looking at the sample timestream config: if you are using multiple goroutines in the config, I would try without those and see how it goes.

AlexL · April 25, 2023, 12:00pm

Thank you for your reply. I set it to 1 and started it; I’ll see how it goes.

AlexL · April 28, 2023, 7:01am

Hello,
Looks like it’s working now, so the number of goroutines was probably the issue…

I guess its main purpose is performance?

jpowers · April 28, 2023, 2:05pm

Right, that setting determines the number of concurrent jobs used to write to Timestream. Lowering it reduced the load on your local system and the number of outbound connections to AWS. One of those areas may have been getting overloaded. You might try increasing it again if you see writes taking too long again, or leave it where it is if it works.
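As a rough illustration of what a "number of concurrent jobs" cap does, the following Python sketch sends batches through a bounded worker pool. The function name `write_batches` and the `send` callback are hypothetical; this only shows the general pattern, not the plugin's code.

```python
from concurrent.futures import ThreadPoolExecutor


def write_batches(batches, send, max_workers=1):
    """Call send(batch) for every batch, with at most max_workers calls in flight.

    Hypothetical helper for illustration; `send` stands in for a network write.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in input order; list() waits for all of them
        # and re-raises the first exception raised by send(), if any.
        return list(pool.map(send, batches))
```

With `max_workers=1` the batches go out one at a time; a higher value trades more simultaneous connections and local load for throughput.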

AlexL · May 12, 2023, 1:26pm

Hello again,

It continued to work with the value “1”, but then I noticed that it stopped picking up values from a log file that was read with the “tail” plugin.

I increased the value to “2”. For some time that seemed to solve the problem, but at some point it went back to the original “output plugin did not complete within flush interval” error.

Interestingly enough, only one plugin fails; the other continues to work.
The difference between the two files is that one is updated very frequently and is also rotated at night, while the other one is much “quieter”.

So now I’m kind of stuck … If it’s on “1”, the output works but the “tail” plugin stops working at some point, and if it’s greater than “1”, the output can’t flush after some time.

Maybe the “tail” problem is not related to that parameter? If not, what could the problem be?

Thank you.

jpowers · May 17, 2023, 1:23pm

Maybe the “tail” problem is not related to that parameter? If not, what could the problem be?

I would agree that if you are having issues with tail, it is unrelated to the output. Do you have logs that show what the tail plugin was doing? Did it fail during a rotation? What is your config for tail?

AlexL · May 18, 2023, 5:31am

I don’t know if it fails during rotation, because the log file is not busy and it does work for several days (there is rotation every day at 6am) before it stops.

As a test, I may disable the rotation and see how it goes.

Btw, I tried both “poll” and “inotify”.

(sorry, can’t copy paste from that server)

jpowers · May 18, 2023, 1:41pm

On Windows I would only expect watch_method poll to work. Otherwise, nothing in there stands out.
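For readers unfamiliar with the difference, a polling watcher simply re-checks the file on a timer rather than relying on OS change notifications. Here is a minimal Python sketch of that idea. It is an assumption-laden illustration, not how Telegraf's tail input is implemented; the class name `PollingTail` and its methods are hypothetical.

```python
import os


class PollingTail:
    """Returns lines appended to a file, detected by comparing file sizes.

    Hypothetical illustration of polling; not Telegraf's tail implementation.
    """

    def __init__(self, path, from_beginning=False):
        self.path = path
        # Start at the current end of the file unless asked to read everything.
        self.offset = 0 if from_beginning else os.path.getsize(path)

    def poll(self):
        size = os.path.getsize(self.path)
        if size < self.offset:
            # The file shrank: assume it was truncated or rotated and start over.
            self.offset = 0
        if size == self.offset:
            return []
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            data = f.read()
        # Consume only complete lines; a partial last line waits for the next poll.
        end = data.rfind(b"\n") + 1
        self.offset += end
        return data[:end].decode("utf-8").splitlines()
```

Calling `poll()` on a fixed interval picks up new lines with no dependence on notification support in the OS, at the cost of some latency and repeated size checks.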

Do you have any logs you can share as well?
Any error messages?
You are also certain that the file was still getting data?
Did the message change and as a result the JSON parsing started failing?
Can you verify the file’s contents while it fails?

AlexL · May 19, 2023, 5:49am

Hi,
Sorry, my bad, this file is not rotated …
There are no error messages in the log. I also ran it in DEBUG mode, and nothing is printed there when it happens. The problem with DEBUG is that the JSON has many optional fields, so the JSON parser floods the DEBUG log with messages saying an optional field was not found, which makes it hard to read.
It works exactly the same with “poll” and “inotify”. The file is getting data; there are around 20-30 messages a day.
The messages do not change; they are generated by a product and always follow the same schema.

I looked at the contents, and nothing really looks weird … I will monitor it for a couple more days to see what the messages before the failure have in common.

AlexL · May 19, 2023, 5:57am

Also, when I restart Telegraf, it reads the whole JSON file from the beginning, no errors are reported, and all lines are picked up.

AlexL · May 21, 2023, 11:24am

Looks like I found the cause of the problem …

I found a similar error discussion, which was solved by using “poll” instead of “inotify” … I had tried “inotify” because I thought it would solve that issue.

I switched back to “poll” and will continue to monitor.

AlexL · May 28, 2023, 7:17am

Hello, I will lock this thread and mark it as solved, so as not to mix the two issues.

Thank you so much for the help.
