Telegraf log parser ---> Influxdb duplicates values

alvianno · January 18, 2019, 5:09am

I use the Telegraf logparser plugin to push metrics to InfluxDB,
but in InfluxDB I see duplicate metrics/values. Please give me a solution for my problem. Thanks

rawkode · January 18, 2019, 10:01am

Hey @alvianno,

I’d love to help, but I’m going to need a little bit more information please.

Can you tell us what metrics you are pushing to InfluxDB and how they’re getting there?

Are you using Telegraf with a plugin or the client libraries within code? Can you share examples?

The more information you provide, the more we can help.

Thanks

alvianno · January 18, 2019, 10:11am

Config Telegraf plugin log parser :

example Log .csv :
MOBILE,Payment,response_time,1547805960,4624
MOBILE,Payment,count,1547805960,181
MOBILE,Payment,error,1547805960,14
WEB,Payment,response_time,1547805960,67
WEB,Payment,count,1547805960,1
WEB,Payment,error,1547805960,0
WEB,Login,response_time,1547805960,1295
WEB,Login,count,1547805960,82
WEB,Login,error,1547805960,1
WEB,Emoney,response_time,1547805960,0

Telegraf Config :

rawkode · January 18, 2019, 10:22am

Hey,

In the sample CSV provided and in the screenshot from your original post, there are no duplicate points:

MOBILE,Payment,response_time,1547805960,4624
MOBILE,Payment,count,1547805960,181
MOBILE,Payment,error,1547805960,14
WEB,Payment,response_time,1547805960,67
WEB,Payment,count,1547805960,1
WEB,Payment,error,1547805960,0
WEB,Login,response_time,1547805960,1295
WEB,Login,count,1547805960,82
WEB,Login,error,1547805960,1
WEB,Emoney,response_time,1547805960,0

A point is only a duplicate if its measurement, timestamp, and tag values are all the same.

In your screenshot, each of the timestamps is unique.

In the sample CSV, while there are duplicate timestamps, the tags (MOBILE/WEB, Payment/Login) are unique across the timestamps.
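As a rough illustration of that rule, the Python sketch below treats the first three CSV columns as tags and counts how many rows share the same tags and timestamp. The measurement name default and the helper `point_key` are assumptions made for this example only; this is not how InfluxDB itself is implemented.

```python
import csv
from collections import Counter
from io import StringIO

# The sample CSV from this thread.
SAMPLE = """\
MOBILE,Payment,response_time,1547805960,4624
MOBILE,Payment,count,1547805960,181
MOBILE,Payment,error,1547805960,14
WEB,Payment,response_time,1547805960,67
WEB,Payment,count,1547805960,1
WEB,Payment,error,1547805960,0
WEB,Login,response_time,1547805960,1295
WEB,Login,count,1547805960,82
WEB,Login,error,1547805960,1
WEB,Emoney,response_time,1547805960,0
"""


def point_key(row, measurement="RAW_APPD_TRX"):
    """Hypothetical helper: the identity of a point is its measurement, tags, and timestamp."""
    category, trx_name, metric_type, timestamp, _value = row
    return (measurement, category, trx_name, metric_type, int(timestamp))


rows = list(csv.reader(StringIO(SAMPLE)))
counts = Counter(point_key(row) for row in rows)
duplicates = [key for key, n in counts.items() if n > 1]

print(len(rows), "rows,", len(counts), "distinct points, duplicates:", duplicates)
```

Every row shares the same timestamp, yet each one has a different tag combination, so no key occurs more than once.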

Richard_Anthony · January 18, 2019, 10:43am

Hey rawkode, thanks for helping us.

Yeah there are no duplicate points in the source (csv), but when we check the influxdb using this query for example:

SELECT "metric_value" FROM "mandiri"."RAW_OneDay"."RAW_APPD_TRX" WHERE time > now() - 1h AND "category"='MOBILE' AND "metric_type"='response_time' AND "trx_name"='Payment'

will return these results:

There is duplicate data with a seemingly random millisecond offset that differs from the source data. The source timestamps are in seconds, so why do some points have a millisecond offset?

I first noticed the problem when using the sum() aggregate function on the ‘count’ or ‘error’ tag: the result is 4-5 times the expected result.

rawkode · January 18, 2019, 12:39pm

Hi @Richard_Anthony,

Sadly, I am unable to replicate this behaviour.

Config:

[[inputs.logparser]]
  files = ["/etc/telegraf/example.csv"]
  from_beginning = false
  watch_method = "poll"

  [inputs.logparser.grok]
    patterns =["%{GREEDYDATA:category:tag},%{GREEDYDATA:trx_name:tag},%{GREEDYDATA:metric_type:tag},%{NUMBER:timestamp:ts-epoch},%{GREEDYDATA:metric_value:float}"]
    measurement = "RAW_APPD_TRX"
    custom_patterns = '''
    '''
    timezone = "Local"

[[outputs.influxdb]]
  urls = ["http://influxdb:8086"]
  database = "telegraf"
  username = ""
  password = ""
  retention_policy = ""
  write_consistency = "any"
  timeout = "5s"

With the following CSV:

MOBILE,Payment,response_time,1547805960,4624
MOBILE,Payment,count,1547805960,181
MOBILE,Payment,error,1547805960,14
WEB,Payment,response_time,1547805960,67
WEB,Payment,count,1547805960,1
WEB,Payment,error,1547805960,0
WEB,Login,response_time,1547805960,1295
WEB,Login,count,1547805960,82
WEB,Login,error,1547805960,1
WEB,Emoney,response_time,1547805960,0

and adding lines, one by one:

WEB,Emoney,response_time,1547805961,1
WEB,Emoney,response_time,1547805962,2
WEB,Emoney,response_time,1547805963,3
WEB,Emoney,response_time,1547805964,4
WEB,Emoney,response_time,1547805963,3
WEB,Emoney,response_time,1547805965,5

Results:

> SELECT "metric_value" FROM "telegraf"."autogen"."RAW_APPD_TRX" WHERE "category"='WEB' AND "metric_type"='response_time' AND "trx_name"='Emoney'

name: RAW_APPD_TRX
time                metric_value
----                ------------
1547805962000000000 2

> SELECT "metric_value" FROM "telegraf"."autogen"."RAW_APPD_TRX" WHERE "category"='WEB' AND "metric_type"='response_time' AND "trx_name"='Emoney'

name: RAW_APPD_TRX
time                metric_value
----                ------------
1547805962000000000 2
1547805963000000000 3
1547805964000000000 4
1547805965000000000 5

I’ll raise this with one of my colleagues, in case they’re aware of something I am not.

Richard_Anthony · January 18, 2019, 12:58pm

Thanks for confirming. Just curious: is there any difference between the poll and inotify watch methods?

rawkode · January 18, 2019, 1:23pm

I tried this with inotify too and didn’t experience the duplicates.

These tests were done with low-volume data, so I increased the number of rows being appended to the file, and I have been able to replicate this (I got one row with a timestamp variance, as seen in your data).

alvianno · January 24, 2019, 11:51am

Please try this log:
WEB,Emoney,response_time,1547805965,1
WEB,Emoney,response_time,1547805965,2
WEB,Emoney,response_time,1547805965,3
WEB,Emoney,response_time_test,1547805965,4
WEB,Emoney,response_time_test,1547805965,3
WEB,Emoney,response_time_test,1547805965,5

ashleshmandke · February 13, 2019, 8:56am

I am also facing the same issue. It is described in the link given below, but I am not sure whether to start with Telegraf or InfluxDB.
Please find below the link to the issue I have opened about it:

github.com/influxdata/telegraf

Data Duplication in InfluxDB

opened 12:24PM - 08 Feb 19 UTC

closed 01:35AM - 27 Feb 19 UTC

ashleshmandke

area/tail

Hi All, we need help with a scenario that we are facing. We are loading data into …influxdb parsing via telegraf. We have one record in the source file. Please find below a snapshot of the source record: ![image](https://user-images.githubusercontent.com/46951204/52478184-971a9280-2bca-11e9-97a4-929500856c4c.png) After loading the above source file into InfluxDB, we get two records when we query it. Please find below a snapshot of the query result from InfluxDB: ![image](https://user-images.githubusercontent.com/46951204/52477733-01323800-2bc9-11e9-95c3-a9ffd37db2d9.png) It seems that InfluxDB itself creates a duplicate record with another timestamp, i.e. ![image](https://user-images.githubusercontent.com/46951204/52477778-28890500-2bc9-11e9-9b95-04b067020daf.png) The above scenario is observed sometimes, but not for every source file loaded. Please help us. Warm regards //Ashlesh

Warm Regards,
//Ashlesh

glinton · February 13, 2019, 6:37pm

@Richard_Anthony, what versions of influx and telegraf were you using when you experienced this?

daniel · February 13, 2019, 9:34pm

The grok parser, and logparser by extension, have a behavior where if two consecutive lines have the same timestamp, the timestamps are adjusted so that they are strictly increasing. So if you input lines that all have the same timestamp, like in @alvianno’s example:

WEB,Emoney,response_time,1547805965,1
WEB,Emoney,response_time,1547805965,2
WEB,Emoney,response_time,1547805965,3

The points will be created with timestamps:

1547805965000000000,1
1547805965001000000,2
1547805965002000000,3
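A simplified Python sketch of this adjustment, for readers who want to see the arithmetic. It is not Telegraf's actual code: the function name `make_strictly_increasing` is hypothetical, and the 1 ms step is taken from the example timestamps above, which come from second-resolution input.

```python
NS_PER_SECOND = 1_000_000_000
STEP_NS = 1_000_000  # 1 ms, assumed from the example timestamps above


def make_strictly_increasing(timestamps_ns, step_ns=STEP_NS):
    """Offset each consecutive repeat of a timestamp by one more step."""
    adjusted = []
    last_seen = None
    repeats = 0
    for ts in timestamps_ns:
        if ts == last_seen:
            repeats += 1  # same timestamp as the previous line
        else:
            last_seen = ts
            repeats = 0  # new timestamp, no offset needed
        adjusted.append(ts + repeats * step_ns)
    return adjusted


base = 1547805965 * NS_PER_SECOND
result = make_strictly_increasing([base, base, base])
for ts in result:
    print(ts)
```

With three lines at 1547805965 this prints the three timestamps listed above.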

Richard_Anthony · February 14, 2019, 8:52am

@glinton we are using telegraf 1.7.2 and influxdb 1.6.0, both installed on Windows.

@daniel but if I’m not mistaken, isn’t the default behaviour, when there are multiple data points with the same measurement, timestamp, and tags, to keep only the last value? source: InfluxDB frequently asked questions | InfluxDB OSS 1.7 Documentation

daniel · February 14, 2019, 7:50pm

It is true that InfluxDB will only keep one value per series for a given timestamp, but in this case Telegraf is adjusting the timestamp of the input data before it is sent to InfluxDB, preventing the timestamp conflict. The idea behind this was that logfiles often have many lines with the same timestamp, since they are often written at one-second resolution, and we wanted to preserve the ordering of the lines and have all lines stored in the database without overwrites.

I think we ought to allow this behavior to be disabled, since it isn’t something that will always be helpful. As a workaround, you might want to try using the tail plugin with data_format = "csv" instead.
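To make the overwrite-versus-adjust distinction in this reply concrete, here is a small Python sketch. It models storage as a dictionary keyed by series and timestamp, which is only an assumption for illustration (the helper `write_points` is hypothetical and is not InfluxDB's implementation).

```python
def write_points(points):
    """Toy store: a later write to the same (series, timestamp) replaces the earlier one."""
    store = {}
    for series, timestamp_ns, value in points:
        store[(series, timestamp_ns)] = value
    return store


series = "RAW_APPD_TRX,category=WEB,metric_type=response_time,trx_name=Emoney"
base = 1547805965 * 1_000_000_000
values = [1, 2, 3]

# Same timestamp on every line: only the last value survives.
unadjusted = write_points([(series, base, v) for v in values])

# Timestamps offset by 1 ms per repeated line: every line is stored.
adjusted = write_points(
    [(series, base + i * 1_000_000, v) for i, v in enumerate(values)]
)

print(len(unadjusted), sum(unadjusted.values()))
print(len(adjusted), sum(adjusted.values()))
```

In this toy model the unadjusted writes leave one point, while the adjusted writes leave three, which is why an aggregate such as sum() over the adjusted data comes out larger than a reader expecting overwrites would predict.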

daniel · February 27, 2019, 2:12am

We are adding an option to 1.10 that will disable this behavior. With the logparser plugin it is called unique_timestamp:

[[inputs.logparser]]
  [inputs.logparser.grok]
    unique_timestamp = "disabled"

When using the grok parser as a data_format, there is an option with the same behavior called grok_unique_timestamp. The 1.10 release candidate will be out very soon; I hope everyone will be able to give it a try.
