Telegraf: reading, handling, and writing big log files

Hello everyone. As I am new here, I apologize in advance if I miss something, post a duplicate, or don’t fill this in as needed.
I’m using a CSV file with more than 10K lines, which is generated on the server.
I did my best to find out how to handle the following problems or situations:

  1. If this big CSV file is updated once per day, is it possible to run Telegraf so that it stops after reading the whole file? I don’t want to run it at an interval of 10s or so…

  2. If the file is updated after the whole file has been read, how can I get only the new metrics that came with the update and write only those to InfluxDB?

In my case, Telegraf reads the whole file every 10s and writes it to InfluxDB as unique lines. I’ve tried inputs.file, inputs.logparser (using a log file and grok patterns), and inputs.tail.

Hello @Zarko,

Thank you for your question! The tricky part of your problem is that your data is in CSV format. The first solution that comes to mind would be to create a script (or use something like this csv-to-influx script) that converts your CSV rows to line protocol and writes them to a txt file. Then I would extend the script so that it appends updated values to the end of the txt file. If you use inputs.tail with from_beginning = false, Telegraf will collect only the newly appended points, and each of them only once. I’ll let you know if I think of a more elegant solution.

These days Telegraf has pretty good support for csv, so I think it won’t be a problem and it sounds like you already have that going. You can run this with the tail plugin as suggested by @Anaisdg to get the read only once behavior. However, if the file is being added on a daily basis, I suggest leaving from_beginning = true so that the entire file is read. It will still only be read one time start to end.

Thanks @daniel.

@Zarko, here’s a blog for an example of how to use the csv telegraf plugin in case you need it:

@daniel Thanks for your answer. I’ve tried the tail plugin, but in this case it registers that the file is updated but doesn’t write any lines to InfluxDB.

So, I would say that the tail plugin can read but cannot write to the database.

Maybe you can show your tail config and a few lines from the file?

Yes, of course.

From csv log file:

5cbe5580-6540-4376-9638-40055f8e4ee4,1,1559122530,207.46.13.92,retailer_view,4d02377c-7120-4107-83d3-3dead5a054c0,520b72d8-9d2c-4a23-846a-626d566e4bcb

5cbe5580-7990-44c8-886a-40055f8e4ee4,0,1559122530,207.46.13.92,retailer_logo,560cdc2c-126c-4515-b44b-0ed35f8e4e0e,5804c8d1-f6b8-402c-a9a3-774d5f8e4ee4

5cbe5580-83b8-4183-8d01-40055f8e4ee4,1,1559122530,207.46.13.92,gallery_image,568b9cf2-9420-4059-97a5-5bdb5f8e4ee4,56a74b43-da2c-4ff1-aab1-78b45f8e4ee4

5cbe5580-92f4-42ce-ad23-40055f8e4ee4,0,1559122530,207.46.13.92,gallery_image,568b9cf2-9420-4059-97a5-5bdb5f8e4ee4,56a749b1-6a48-4528-92ff-7b695f8e4ee4

Config file:

[global_tags]

[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
debug = true
quiet = false
logfile = ""
hostname = ""
omit_hostname = false

[[outputs.influxdb]]
urls = ["http://influxdb:8086"]
database = "Telegraf"
retention_policy = "autogen"

[[inputs.tail]]
files = ["/var/log/mylog.csv"]
from_beginning = true
pipe = false
watch_method = "inotify"
data_format = "csv"
csv_column_names = ["user_id", "free_flag", "timestamp", "user_ip", "user_action_type", "company_id", "reference_id"]
csv_header_row_count = ""
csv_skip_rows = 0
csv_skip_columns = 0
csv_comment = "#"
csv_measurement_column = "measurement_name"
csv_tag_columns = ["tag_key"]
csv_timestamp_column = "timestamp"
csv_timestamp_format = "unix"
fieldpass = ["user_id", "free_flag", "timestamp", "user_ip", "user_action_type", "company_id", "reference_id"]

Console screenshot:

@Anaisdg Sorry, I forgot to mention you, and thanks for your answer.
I already used that link. It helped me in the beginning, but afterwards I ran into new issues.

It looks like it was able to write successfully. The measurement name would be tail, because the column named in csv_measurement_column = "measurement_name" could not be found in the data, so look for your points under the tail measurement.
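
For illustration, here is a rough sketch of how the input section might be adjusted so the points land under a predictable measurement name. It is only a sketch under assumptions: the measurement name user_actions and the choice of user_action_type as a tag are made up for this example and are not from the thread. The csv_measurement_column setting is left out because the file has no such column.

```toml
# Sketch only; values marked "assumption" do not come from the original config.
[[inputs.tail]]
  files = ["/var/log/mylog.csv"]
  from_beginning = true
  watch_method = "inotify"

  # Assumption: a fixed measurement name instead of the default "tail".
  name_override = "user_actions"

  data_format = "csv"
  # The sample rows have no header line, so column names are listed explicitly.
  csv_header_row_count = 0
  csv_column_names = ["user_id", "free_flag", "timestamp", "user_ip", "user_action_type", "company_id", "reference_id"]

  # Assumption: keep the action type as a tag; the other columns become fields.
  csv_tag_columns = ["user_action_type"]

  csv_timestamp_column = "timestamp"
  csv_timestamp_format = "unix"
```

One thing to watch for: the sample rows all carry the same second-resolution timestamp. InfluxDB treats points with the same measurement, tag set, and timestamp as a single point, so rows that match on all three may overwrite each other unless something else (another tag or a finer timestamp) tells them apart.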