Telegraf bulk import for large amount of data with inputs.tail

Hi there, I need to insert billions of rows (about 500.000.000) into influxdb (1.7.9).

I have 5 kinds of measurements taken over 10 years, sampled every second or every ten seconds, so I need to bulk import the raw files that I prepared in line protocol format.
I tried the Telegraf csv data format first, which seemed slow, so I tried the influx data format (line protocol), which seems better but is still too slow. (In fact, my laptop keeps freezing because of almost 100% CPU and memory usage, and I constantly need to reboot it to be able to continue working.)
I am now trying to ingest the data with inputs.tail, which still freezes my laptop; however, I think the data are imported faster.

Can anybody give me some advice on how to improve the import and make it stable so that it stops freezing? I would highly appreciate any advice on better practices for importing these billions of metrics spread across hundreds of files. Thanks a lot in advance!

Here is my telegraf-input-env.conf:

# Telegraf Configuration
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"
  ## Environment variables can be used as tags, and throughout the config file
  # user = "$USER"

# Configuration for telegraf agent
[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 5000
  metric_buffer_limit = 10000
  collection_jitter = "0s"

  flush_interval = "30s"
  flush_jitter = "60s"

  precision = ""
  debug = true
  quiet = false
  logfile = ""

  hostname = ""
  omit_hostname = true


###############################################################################
#                            OUTPUT PLUGINS                                   #
###############################################################################
# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
  database = "db_name"

  # retention_policy = ""
  # write_consistency = "any"
  # timeout = "5s"
  # username = "telegraf"
  # password = "metricsmetricsmetricsmetrics"
  # user_agent = "telegraf"
  # udp_payload = "512B"

  # tls_ca = "/etc/telegraf/ca.pem"
  # tls_cert = "/etc/telegraf/cert.pem"
  # tls_key = "/etc/telegraf/key.pem"
  # insecure_skip_verify = false

  # http_proxy = "http://corporate.proxy:3128"
  # http_headers = {"X-Special-Header" = "Special-Value"}
  # content_encoding = "identity"
  # influx_uint_support = false

###############################################################################
#                            INPUT PLUGINS                                    #
###############################################################################
# Stream a log file, like the tail -f command
[[inputs.tail]]
  ## files to tail.
  ## These accept standard unix glob matching rules, but with the addition of
  ## ** as a "super asterisk". ie:
  ##   "/var/log/**.log"  -> recursively find all .log files in /var/log
  ##   "/var/log/*/*.log" -> find all .log files with a parent dir in /var/log
  ##   "/var/log/apache.log" -> just tail the apache log file
  ##
  ## See https://github.com/gobwas/glob for more examples
  ##
  files = ["../../data/prepared_data/*.dat"]
  ## Read file from beginning. Default false
  from_beginning = true 
  ## Whether file is a named pipe
  pipe = false

  ## Method used to watch for file updates.  Can be either "inotify" or "poll".
  watch_method = "poll"

  ## Data format to consume.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
  data_format = "influx"

and here is a sample of what my data look like:

environmental temp=-2.9,air=912.81,prec=459.12,datetime="2011-11-10 00:00:10" 1389347210000000000
environmental prec=0.0,datetime="2011-11-10 00:00:10" 1229347210000000000
environmental temp=-1.29,air=929.8,prec=0.0,datetime="2011-11-10 00:00:20" 1189347220000000000
environmental temp=-0.23,air=219.8,prec=0.0,datetime="2011-11-10 00:00:30" 1489347230000000000

Hi there, I need to insert billions of rows into influxdb (1.7.9).

Whoo, that’s a lot. Not unreasonable, but a lot.

In fact, my laptop keeps freezing because of almost 100% CPU and memory usage.

Er, laptop??

Can anybody give me some advice on how to improve the import

Use a machine with lots of CPU and RAM for a task like this.

If you don’t have one, set up a virtual server at a cloud provider for a few
hours. Then your biggest challenge will just be getting the data uploaded to
it.

Antony.

Oh, sorry, I just noticed that I confused the terms… I have “just” half a billion records, so about 500.000.000.
Thanks, Antony. I wish I were able to do that… This is for my master’s thesis and I am not allowed to upload the provided data to any cloud. It is okay if it takes a while, but it must not freeze completely. Any other suggestions?

You mention this is already prepared in line protocol format, correct?

It would probably be easier to write some bash script to chunk the file and send it over the HTTP API.
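
For example, something like this (a rough sketch, untested; the chunk size, file name, and database name are placeholders):

# Split the big line protocol file into 5000-line chunks,
# then POST each chunk to the /write endpoint.
split -l 5000 prepared_data.dat chunk_
for c in chunk_*; do
  curl -s -XPOST 'http://localhost:8086/write?db=mydatabase' --data-binary @"$c"
done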

Telegraf probably isn’t the best tool for one-off large imports, due to the way it consumes/reads the file (I’ll need to double-check the code).

Thank you so much @rawkode! It works!

Correct, I have prepared my data in line protocol. (Initially I had raw sensor data files in different formats, so I read them all into a dask dataframe with Python, then prepared and exported them in line protocol. The dask export (dd.to_csv) created many chunk files due to the partitioning. With the bash script I then loop over these files to import them into influxdb.)

For the HTTP API command, I had to set max-body-size = 0 in /etc/influxdb/influxdb.conf.
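
In case it helps anyone else: that setting lives in the [http] section of influxdb.conf (0 disables the limit, which otherwise defaults to 25000000 bytes):

[http]
  # Disable the maximum request body size check for bulk writes
  max-body-size = 0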

Bash script for writing the points from all of my prepared line protocol files:

#!/bin/bash
# Loop over all prepared files in line protocol format and import them into InfluxDB

for f in ../../path/to/files/inlineprotocol/*.dat; do
  echo "$f"
  curl -i -XPOST 'http://localhost:8086/write?db=mydatabase' --data-binary @"$f"
done

echo "All done"
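
If you want the loop to abort when a write fails, one variant (just a sketch) is to check the HTTP status code instead of printing the full response; the /write endpoint returns 204 on success:

# Capture only the status code; stop on the first failed file
status=$(curl -s -o /dev/null -w '%{http_code}' -XPOST 'http://localhost:8086/write?db=mydatabase' --data-binary @"$f")
[ "$status" = "204" ] || { echo "write failed for $f"; exit 1; }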

That’s awesome. Thanks for sharing the bash. I’m sure it’ll come in handy for others :+1: