Telegraf file plugin with utf-16 le bom and grok not working

hi,
i configured telegraf to parse some files with grok. it works fine, but when the files are encoded in utf16-le-bom i am not able to parse them:

2021-10-04T16:10:06Z I! Starting Telegraf 1.19.2
2021-10-04T16:10:06Z D! [agent] Initializing plugins
2021-10-04T16:10:06Z D! [agent] Starting service inputs
2021-10-04T16:10:06Z D! Grok no match found for: "T\x00B\x00a\x00t\x00c\x00h\x00 \x00B\x00a\x00t\x00c\x00h\x00\r\x00"
2021-10-04T16:10:06Z D! Grok no match found for: "\x00{\x00\r\x00"
2021-10-04T16:10:06Z D! Grok no match found for: "\x00 \x00 \x00C\x00r\x00e\x00a\x00t\x00i\x00o\x00n\x00T\x00i\x00m\x00e\x00 \x00=\x00 \x00$\x000\x001\x00D\x007\x00B\x004\x00B\x00E\x00A\x00D\x007\x002\x00E\x00B\x00C\x00F\x00\r\x00"
2021-10-04T16:10:06Z D! Grok no match found for: "\x00 \x00 \x00G\x00U\x00I\x00D\x00 \x00=\x00 \x00\\\x007\x00B\x004\x002\x00f\x005\x00b\x000\x002\x004\x00-\x002\x00a\x00d\x00c\x00-\x004\x00b\x007\x007\x00-\x009\x006\x004\x00a\x00-\x003\x00f\x001\x00a\x003\x009\x00d\x004\x008\x001\x005\x006\x00\\\x007\x00D\x00\r\x00"
[...]

when i convert the file to utf8 it can be parsed without any issues.
any idea how to solve it? the files are parsed on windows.

kind regards,
andre

Hi astrakid, could you please upload your config file and your logs? This will make it easier to see what you are doing. Thank you!

this is the config:

[[inputs.file]]
files = ["C:/temp/btch.txt"]
data_format = "grok"

grok_patterns = [
"\\sPageCount\\s=\\s%{NUMBER:pageCount}",
"\\sDocumentCount\\s=\\s%{NUMBER:documentCount}",
"\\sDeclinedPageCount\\s=\\s%{NUMBER:declinedPageCount}",
"\\sDeclinedDocumentCount\\s=\\s%{NUMBER:declinedDocumentCount}",
"\\sDisplayName\\s=\\s%{GREEDYDATA:displayName}",
"\\sBatchClass\\s=\\s%{GREEDYDATA:batchClass}",
"\\sPosition\\s=\\s%{GREEDYDATA:position}",
"\\sState\\s=\\s%{NUMBER:state}",
"\\sStamp_Created\\s=\\s%{DATE:date_stampCreated}",
"\\sLogging\\s=\\s%{GREEDYDATA:logging}",
]

this config works when i change the encoding of the file that is parsed, but i need to parse the file in its original encoding:
[image]

with that encoding the debug-output looks like this:

with changed encoding it works and looks like this:

According to this thread, grok treats UTF-16 as UTF-8, which is known to cause some issues. I think this is a grok issue rather than a Telegraf issue, so it is not something we can fix on our end.

according to the mentioned thread, logstash handles this by converting the files to utf8. is something similar available for telegraf as well?

edit: or is there any way to read the file into memory and then parse the lines with grok?

Yes! You should be able to use the inputs.file plugin. It has a character_encoding config option. Here is a link to the documentation.
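
For reference, here is a minimal sketch of how that option could be added to the config posted above. The value "utf-16le" is an assumption based on the encoding described in this thread; please check the inputs.file documentation for the values your Telegraf version accepts and for how a leading BOM is handled.

```toml
[[inputs.file]]
  files = ["C:/temp/btch.txt"]

  ## Decode the file contents to text before they reach the parser.
  ## Assumption: "utf-16le" matches the UTF-16 LE file from this thread.
  character_encoding = "utf-16le"

  data_format = "grok"

  ## Only the first pattern from the original config is repeated here;
  ## the remaining patterns stay unchanged.
  grok_patterns = [
    "\\sPageCount\\s=\\s%{NUMBER:pageCount}",
  ]
```

With the contents decoded first, grok should see plain text instead of the "\x00"-interleaved bytes shown in the debug log, so the existing patterns should not need to change.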


yes, thanks, found it already! solved the issue! thx a lot!
