Error trying to get docker logs using Telegraf and send to Loki

rafinha · December 13, 2022, 10:16am

Hi guys,

I have a ubuntu server running a docker service, a loki service and a telegraf service.

I would like to get the logs from docker and send it to loki using telegraf.

For telegraf I set up the docker logs input:

# # Read logging output from the Docker engine
[[inputs.docker_log]]
  endpoint = "unix:///var/run/docker.sock"
  source_tag = false

And loki output:

[[outputs.loki]]
  ## The domain of Loki
  domain = "http://loki.local"
  ## Endpoint to write api
  endpoint = "/loki/api/v1/push"

Looks like telegraf is receiving the input from docker, but when it tries to send, I got an error.

2022-12-13T00:04:32Z D! [outputs.loki] Buffer fullness: 170 / 10000 metrics
2022-12-13T00:04:32Z E! [agent] Error writing to outputs.loki: when writing to [http://loki.local/loki/api/v1/push] received status code: 400

I verified if loki is working and using a python script I was able to push information:

When I run:
sudo telegraf --test

I got some errors:

2022-12-13T09:56:51Z D! [agent] Initializing plugins
2022-12-13T09:56:51Z D! [agent] Starting service inputs
2022-12-13T09:56:51Z D! [agent] Stopping service inputs
2022-12-13T09:56:51Z E! [inputs.docker_log] Error in plugin: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/b0ef3c0c714c8f4121f8c8cd7ad7b2f8499f4d5d99d817b58204b5cd033bd25a/json": context canceled
2022-12-13T09:56:51Z E! [inputs.docker_log] Error in plugin: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/0494e2107851ae6df8e91efb738e4f893aa6ff8d84f70e4268c5800d11b9dfc0/json": context canceled
2022-12-13T09:56:51Z E! [inputs.docker_log] Error in plugin: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/c23dcda8eb38f2b06a6ac70ca7791bbbfa6f55a82aacb7806fbc9cdd77737dd5/json": context canceled
2022-12-13T09:56:51Z E! [inputs.docker_log] Error in plugin: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/b4ab33db17abe98863ab775ce131444d2689db14ca35537fa102220d119d65bc/json": context canceled
2022-12-13T09:56:51Z E! [inputs.docker_log] Error in plugin: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/47c056d80616965b441d8209a6a2eb08d02ac99551a2b31a1315a5dfd6a90bee/json": context canceled
2022-12-13T09:56:51Z E! [inputs.docker_log] Error in plugin: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/c4e31cfa8e01a0cf439df8e660ef5c333f749cd74e996a1b6c328bfb0095b39d/json": context canceled
2022-12-13T09:56:51Z D! [agent] Input channel closed
2022-12-13T09:56:51Z D! [agent] Stopped Successfully
2022-12-13T09:56:51Z E! [telegraf] Error running agent: input plugins recorded 6 errors

Does anybody have any idea what is happening?

jpowers · December 14, 2022, 2:30pm

As the error implies it means the context was canceled. The docker plugin does not play well with the --test option since the test option will quickly run the inputs and not wait if any of the inputs start additional go routines to collect data; as the docker plugin does.

2022-12-13T00:04:32Z E! [agent] Error writing to outputs.loki: when writing to [http://loki.local/loki/api/v1/push] received status code: 400

400 generally means a bad request and could be a number of things that could cause this. Having the response body would help. I have put up a PR fix(outputs.loki): return response body on error by powersj · Pull Request #12399 · influxdata/telegraf · GitHub that adds this to the error message. Would you be willing to download an artifact from that PR, which should be there in a 20-30mins after this message, try that, and see what you get?

Thanks!

rafinha · December 14, 2022, 11:39pm

Hi @jpowers,

I downloaded the package from your build and installed it using

sudo dpkg -i telegraf_1.26.0~53588286-0_amd64.deb

but I see no difference in logs.
Do I have to use any parameters or start the service enough?

 sudo systemctl start telegraf

 tail -f /var/log/telegraf/telegraf.log

2022-12-14T23:27:41Z I! Loaded processors:
2022-12-14T23:27:41Z I! Loaded secretstores:
2022-12-14T23:27:41Z I! Loaded outputs: loki
2022-12-14T23:27:41Z I! Tags enabled: host=apps-docker
2022-12-14T23:27:41Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"apps-docker", Flush Interval:10s
2022-12-14T23:27:41Z D! [agent] Initializing plugins
2022-12-14T23:27:41Z D! [agent] Connecting outputs
2022-12-14T23:27:41Z D! [agent] Attempting connection to [outputs.loki]
2022-12-14T23:27:41Z D! [agent] Successfully connected to outputs.loki
2022-12-14T23:27:41Z D! [agent] Starting service inputs
2022-12-14T23:32:21Z D! [outputs.loki] Buffer fullness: 0 / 10000 metrics
2022-12-14T23:32:31Z D! [outputs.loki] Buffer fullness: 184 / 10000 metrics
2022-12-14T23:32:31Z E! [agent] Error writing to outputs.loki: when writing to [http://loki.local/loki/api/v1/push] received status code, 400:

jpowers · December 15, 2022, 1:31pm

ha and empty body, so that is of no help. Are there any loki logs you can review? Sounds like your next step are trying to determine why you are getting a 400 from loki.

rafinha · December 15, 2022, 2:58pm

Hey, thanks,

I changed my loki log level to debug, but there is not too much showing.

level=debug ts=2022-12-15T14:35:45.047844905Z caller=logging.go:76 traceID=77354d54b657c32e orgID=fake msg="POST /loki/api/v1/push (400) 336.94µs"

What is weird, is that, if I try to push it using a python script that I created for testing, it works:

import logging
import logging_loki

logging_loki.emitter.LokiEmitter.level_tag = "level"
handler = logging_loki.LokiHandler(
    url="http://loki.local/loki/api/v1/push",
    version="1",
)
# create a new logger instance, name it whatever you want
logger = logging.getLogger("my-logger")

logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

logger.error("errorrr", extra={
             "tags": {"service": "my-service"}})
logger.warning("warninggggggg",
               extra={"tags": {"service": "my-service"}})

jpowers · December 15, 2022, 3:19pm

Can you enable the [[outputs.file]] output in telegraf and see what metrics are getting generated?

Dan_Foxley · January 23, 2023, 9:00pm

@jpowers
Same/similar issue, can I jump on this thread?

output.file

XRLog,details1=Username:\ admin,host=myhost,xrclogfilepath=042 event="Login" 1673711468095000000
XRLog,details1=Site\ 57\ has\ Disconnected,details2=Reason:\ TCP\ Reset,host=myhost,xrlogfilepath=042 event="Site Connection Status" 1673711477537000000
XRLog,details1=Site\ 57\ has\ Connected,host=myhost,xrlogfilepath=042 event="Site Connection Status" 1673711477658000000

Telegraf

.\telegraf.exe --config .\telegraf.global.tc.conf --config-directory .\telegraf.d --debug --input-filter tail
2023-01-23T20:58:46Z D! [outputs.loki] Buffer fullness: 3 / 10000 metrics
2023-01-23T20:58:46Z E! [agent] Error writing to outputs.loki: when writing to [http://mylokihost:3100/loki/api/v1/push] received status code: 400

jpowers · January 23, 2023, 9:41pm

@Dan_Foxley

what version of Loki are you using? Is there any easy way for me to set up a server to reproduce?
was this ever working on Telegraf and broke between versions? If we can narrow it down to two versions it can help us find a root cause

thanks!

Dan_Foxley · January 23, 2023, 9:59pm

(Update)

tcpdump on Loki server says:

entry for stream ‘{__name=“XRLog”, details1=“Username: admin”, host=“myhost”, xrlogfilepath=“042”}’ has timestamp too old: 2023-01-14T15:51:08Z, oldest acceptable timestamp is: 2023-01-16T22:16:13Z

@jpowers

This is my first atempt to connect telegraf to Loki, so no previous experience or working setup.

With Loki 2.7.1 on CentOS x64, I did this to get LOKI running. (assuming you have a Grafana instance to add LOKI source).

Install and run LOKI

sudo curl -O -L "https://github.com/grafana/loki/releases/download/v2.7.1/loki-linux-amd64.zip"
sudo yum -y install unzip
sudo unzip "loki-linux-amd64.zip"
sudo mv loki-linux-amd64 /usr/local/bin/loki-linux-amd64
cd /usr/local/bin
sudo chmod a+x "loki-linux-amd64"
sudo mkdir /etc/loki
cd /etc/loki
sudo wget https://raw.githubusercontent.com/grafana/loki/main/cmd/loki/loki-local-config.yaml
./loki-linux-amd64 -config.file=loki-local-config.yaml

Input Conf

[[inputs.tail]]
  alias = "sitechecker"
  name_override = "XRLog"
  files = ["C:/Program Files/InfluxData/telegraf/telegraf.d/csv/*.csv"]
  data_format = "csv"
  csv_header_row_count = 1
  csv_column_names = ["timestamp", "event", "details1", "details2", "details3"]
  csv_column_types = []
  csv_skip_rows = 0
  csv_skip_columns = 0
  csv_delimiter = ","
  csv_comment = ""
  csv_trim_space = true
  csv_tag_columns = ["details1", "details2", "details3"]
  csv_measurement_column = ""
  csv_timestamp_column = "timestamp"
  csv_timestamp_format = "1/02/2006 15:04:05.000"
  csv_timezone = "Etc/UTC"
  csv_skip_values = [""]
  path_tag = "xrlogfilepath"
[[processors.regex]]
  order = 1
  [[processors.regex.tags]]
    key = "xrlogfilepath"
    pattern = "^.*_([0-9]+).csv$"
    replacement = "${1}"

Output.conf

[[outputs.loki]]
 #commented as only thing done to trbobleshoot Telegraf to Loki.
  #alias = "xrlogtoloki"
  ## The domain of Loki
  #domain = "https://loki.domain.tld:3100"
  domain = "http://myloki:3100"

  ## Endpoint to write api
  endpoint = "/loki/api/v1/push"
  
  namepass = ["XRLog"]

CSV
Filename: EVENTLOG-EXPORTED01-14-2023-10.52.45_041.csv

CSV Contents:

TIMESTAMP,EVENTTYPE,EVENTDETAILS
1/14/2023 15:51:08.095,Login,Username: admin
1/14/2023 15:51:17.537,Site Connection Status,Site 57 has Disconnected, Reason: TCP Reset
1/14/2023 15:51:17.658,Site Connection Status,Site 57 has Connected

Previous post has output.file results when using this input.tail and CSV.

Dan_Foxley · January 23, 2023, 10:23pm

@jpowers

After writing post on reproducing this. On Loki server, tcpdump showed:

entry for stream ‘{__name=“XRLog”, details1=“Username: admin”, host=“myhost”, xrlogfilepath=“042”}’ has timestamp too old: 2023-01-14T15:51:08Z, oldest acceptable timestamp is: 2023-01-16T22:16:13Z

jpowers · January 23, 2023, 10:59pm

Ah ha! I have run into this before with Loki.

Apparently the reject_old_samples and reject_old_samples_max_age Loki settings are used to determine the oldest setting. These default to true and 7 days ago since v2.4, hence the oldest acceptable timestamp that is printed out in your error message is from a week ago.

I do wish we got a message back, rather than an empty 400 error.

Dan_Foxley · January 24, 2023, 5:02pm

@jpowers
What am I missing, I captured the 400 reply in tcpdump, why wouldn’t telegraf been able to see it. It got the 400 packet which includes the reason.

jpowers · January 24, 2023, 5:16pm

I was going off the response the original poster made where the response body appeared empty after we added it to the error message. I am happy to put that PR back up and we can try again. It is possible that the issues were different.

jpowers · January 24, 2023, 5:46pm

I went back to look at my PR and forgot we merged it. So the nightly builds and our next build, v1.25.1 will contain a loki that that includes the message body on errors.

jasuncion · January 22, 2024, 1:13pm

sorry for resurrecting this old thread, i encountered the same issue (no error message) also and it seems related to the fact that the body was closed before accessing it, code snippet below:

  resp, err := l.client.Do(req)
        if err != nil {
                return err
        }
        _ = resp.Body.Close()

        if resp.StatusCode < 200 || resp.StatusCode >= 300 {
                body, _ := io.ReadAll(resp.Body)
                return fmt.Errorf("when writing to [%s] received status code, %d: %s", l.url, resp.StatusCode, body)
        }

I have updated this to use defer instead and the errored response content from loki started printing in the logs, hope this helps.

jpowers · January 22, 2024, 1:30pm

You should file an issue! That is clearly a bug that we can and should fix.

jasuncion · January 23, 2024, 7:08am

Sure I will do, this is my first time working on open source so I will need to familiarize myself with the process first.

jpowers · January 23, 2024, 1:23pm

I went ahead and pushed a PR yesterday: fix(outputs.loki): Closed body before reading it by powersj · Pull Request #14604 · influxdata/telegraf · GitHub

Does this line up with what you did?

jasuncion · January 23, 2024, 3:10pm

Thank you, this indeed fixes the problem.

Topic		Replies	Views
Telegraf agent not starting inputs.tail expect when starting from inside of docker container Telegraf influxdb , telegraf , docker , tail	1	43	July 23, 2024
Cannot export docker_log metrics to prometheus Telegraf telegraf , prometheus , docker	1	632	September 18, 2022
Telegraf doesn't start/send in docker InfluxDB 2	0	164	August 15, 2024
Tailing docker stdout logs with Telegraf from the host Telegraf telegraf	1	542	August 11, 2021
Telegraf doesn't seem to send data to InfluxDB in dockerised images Telegraf influxdb , telegraf	5	1272	March 26, 2022

Error trying to get docker logs using Telegraf and send to Loki

Related topics