AWS Timestream output plugin problem

Hello,

Every so often I start getting an error saying the output plugin did not complete within the flush interval, and the agent stays stuck in that state until I restart it.

Also, it does not shut down cleanly, so I have to kill the process to restart it. After a restart it works fine until the next time.

The platform is Windows; the input is tail on a JSON file.

The version is 1.25.0 … What could be the problem? Any ideas?

Thank you in advance.

At a high level, this means that the output plugin took longer than the flush_interval setting to finish sending metrics. If an output is still in the middle of a write when the next flush is due, we skip that interval so that multiple write attempts are not made at the same time and so that metrics show up in a consistent order.

The plugin is probably hung, waiting on a response, or timing out while trying to get data back.

Looking at the sample timestream config, if you are using multiple go routines in the config, I would try without those and see how it goes.
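
For reference, the interval in question lives in the agent section of the Telegraf config. A minimal sketch, with values that are only illustrative and not a recommendation:

```toml
[agent]
  ## How often inputs are collected (illustrative value).
  interval = "10s"

  ## How often each output is flushed (illustrative value).
  ## If a write is still running when the next flush is due, the agent
  ## logs the "did not complete within flush interval" warning and does
  ## not start a second write for that output.
  flush_interval = "10s"
```

Raising flush_interval only gives a slow write more time; it will not help if the write never returns.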

Thank you for your reply. I set it to 1 and started it; I’ll see how it goes.

Hello,
Looks like it’s working now. So the number of routines was probably the issue…

I guess its main purpose is performance?

Right, that setting determines the number of concurrent jobs that write to timestream. Lowering it reduced the load on your local system and the number of outbound connections to AWS, so one of those may have been getting overloaded. You might try increasing it again if you see that writes are taking too long, or leave it where it is if it works 🙂
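
As a rough sketch of what that looks like in the config, assuming the option is named max_write_go_routines as in the sample config (the region and database name are placeholders, and other settings your setup needs are omitted):

```toml
[[outputs.timestream]]
  ## Placeholder values; replace with your own.
  region = "us-east-1"
  database_name = "yourDatabaseNameHere"
  mapping_mode = "multi-table"

  ## Number of parallel go routines used to write to Timestream.
  ## Start at 1 and raise it only if writes fall behind.
  max_write_go_routines = 1
```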

Hello again,

So it continued to work with the value “1”, but then I noticed that it stopped picking up values from one of the log files read by the “tail” plugin.

I increased the value to “2”, and for some time that seemed to solve the problem, but at some point it went back to the original “output plugin did not complete within flush interval”.

Interestingly enough, only one tail input fails; the other continues to work.
The difference between the two files is that one is updated very frequently and also rotates at night, while the other is much more “quiet”.

So now I’m kind of stuck … If it’s set to “1”, the output works but the “tail” plugin stops working at some point, and if it’s greater than “1”, the output can’t flush after some time.

Maybe the “tail” problem is not related to that parameter? If not, what could be the problem?

Thank you.

Maybe the “tail” problem is not related to that parameter? If not, what could be the problem?

I would agree that if you are having issues with tail, it is unrelated to the output. Do you have logs that show what the tail plugin was doing? Did it fail during a rotation? What is your config for tail?

I don’t know if it fails during rotation, because the log file is not busy and it does work for several days (there is rotation every day at 6am) before it stops.

As a test, I may disable the rotation and see how it goes.

Btw, I tried both “poll” and “inotify”.

[Image: screenshot of the tail plugin configuration]

(sorry, can’t copy paste from that server)

On Windows I would only expect watch_method poll to work. Otherwise, nothing in there stands out.
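
A minimal sketch of a tail input that uses polling; the file path is hypothetical, and the rest of your settings can stay as they are:

```toml
[[inputs.tail]]
  ## Hypothetical path; replace with the real log file.
  files = ["C:/logs/app.json"]

  ## Check the file for changes by polling instead of relying on
  ## file system notifications ("inotify" is the default).
  watch_method = "poll"

  data_format = "json"
```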

Do you have any logs you can share as well?
Are there any error messages?
Are you certain that the file was still getting data?
Did the message format change, so that the JSON parsing started failing?
Can you verify the file’s contents while it fails?

Hi,
Sorry, my bad, this file is not rotated …
There are no error messages in the log. I also ran it in DEBUG mode, and nothing is printed there when it happens. The problem with DEBUG is that there are many optional fields in the JSON, so the JSON parser floods the DEBUG log with debug messages about optional fields not being found, which makes it hard to read.
It works exactly the same with “poll” and “inotify”. The file is getting data; there are around 20-30 messages a day.
The messages do not change; they are generated by a product and always use the same schema.

I looked at the contents, and nothing really looks weird … I will monitor it for a couple more days to see what the messages before the failure have in common.

Also, when I restart Telegraf, it reads the whole JSON file from the beginning, no errors are reported, and all lines are picked up.

Looks like I found the cause of the problem …

I found a discussion of a similar error that was solved by using “poll” instead of “inotify” … I had tried “inotify” because I thought it would solve that issue.

I switched back to “poll” and will continue to monitor.

Hello, I will lock this thread and mark it as solved, so as not to mix the two issues.

Thank you so much for the help.