Error trying to get docker logs using Telegraf and send to Loki

Hi guys,

I have a ubuntu server running a docker service, a loki service and a telegraf service.

I would like to get the logs from docker and send it to loki using telegraf.

For telegraf I set up the docker logs input:

# # Read logging output from the Docker engine
[[inputs.docker_log]]
  endpoint = "unix:///var/run/docker.sock"
  source_tag = false

And loki output:

[[outputs.loki]]
  ## The domain of Loki
  domain = "http://loki.local"
  ## Endpoint to write api
  endpoint = "/loki/api/v1/push"

Looks like telegraf is receiving the input from docker, but when it tries to send, I got an error.

2022-12-13T00:04:32Z D! [outputs.loki] Buffer fullness: 170 / 10000 metrics
2022-12-13T00:04:32Z E! [agent] Error writing to outputs.loki: when writing to [http://loki.local/loki/api/v1/push] received status code: 400

I verified if loki is working and using a python script I was able to push information:


When I run:
sudo telegraf --test

I got some errors:

2022-12-13T09:56:51Z D! [agent] Initializing plugins
2022-12-13T09:56:51Z D! [agent] Starting service inputs
2022-12-13T09:56:51Z D! [agent] Stopping service inputs
2022-12-13T09:56:51Z E! [inputs.docker_log] Error in plugin: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/b0ef3c0c714c8f4121f8c8cd7ad7b2f8499f4d5d99d817b58204b5cd033bd25a/json": context canceled
2022-12-13T09:56:51Z E! [inputs.docker_log] Error in plugin: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/0494e2107851ae6df8e91efb738e4f893aa6ff8d84f70e4268c5800d11b9dfc0/json": context canceled
2022-12-13T09:56:51Z E! [inputs.docker_log] Error in plugin: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/c23dcda8eb38f2b06a6ac70ca7791bbbfa6f55a82aacb7806fbc9cdd77737dd5/json": context canceled
2022-12-13T09:56:51Z E! [inputs.docker_log] Error in plugin: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/b4ab33db17abe98863ab775ce131444d2689db14ca35537fa102220d119d65bc/json": context canceled
2022-12-13T09:56:51Z E! [inputs.docker_log] Error in plugin: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/47c056d80616965b441d8209a6a2eb08d02ac99551a2b31a1315a5dfd6a90bee/json": context canceled
2022-12-13T09:56:51Z E! [inputs.docker_log] Error in plugin: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/containers/c4e31cfa8e01a0cf439df8e660ef5c333f749cd74e996a1b6c328bfb0095b39d/json": context canceled
2022-12-13T09:56:51Z D! [agent] Input channel closed
2022-12-13T09:56:51Z D! [agent] Stopped Successfully
2022-12-13T09:56:51Z E! [telegraf] Error running agent: input plugins recorded 6 errors

Does anybody have any idea what is happening?

As the error implies it means the context was canceled. The docker plugin does not play well with the --test option since the test option will quickly run the inputs and not wait if any of the inputs start additional go routines to collect data; as the docker plugin does.

2022-12-13T00:04:32Z E! [agent] Error writing to outputs.loki: when writing to [http://loki.local/loki/api/v1/push] received status code: 400

400 generally means a bad request and could be a number of things that could cause this. Having the response body would help. I have put up a PR fix(outputs.loki): return response body on error by powersj · Pull Request #12399 · influxdata/telegraf · GitHub that adds this to the error message. Would you be willing to download an artifact from that PR, which should be there in a 20-30mins after this message, try that, and see what you get?

Thanks!

Hi @jpowers,

I downloaded the package from your build and installed it using

sudo dpkg -i telegraf_1.26.0~53588286-0_amd64.deb

but I see no difference in logs.
Do I have to use any parameters or start the service enough?

 sudo systemctl start telegraf
 tail -f /var/log/telegraf/telegraf.log

2022-12-14T23:27:41Z I! Loaded processors:
2022-12-14T23:27:41Z I! Loaded secretstores:
2022-12-14T23:27:41Z I! Loaded outputs: loki
2022-12-14T23:27:41Z I! Tags enabled: host=apps-docker
2022-12-14T23:27:41Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"apps-docker", Flush Interval:10s
2022-12-14T23:27:41Z D! [agent] Initializing plugins
2022-12-14T23:27:41Z D! [agent] Connecting outputs
2022-12-14T23:27:41Z D! [agent] Attempting connection to [outputs.loki]
2022-12-14T23:27:41Z D! [agent] Successfully connected to outputs.loki
2022-12-14T23:27:41Z D! [agent] Starting service inputs
2022-12-14T23:32:21Z D! [outputs.loki] Buffer fullness: 0 / 10000 metrics
2022-12-14T23:32:31Z D! [outputs.loki] Buffer fullness: 184 / 10000 metrics
2022-12-14T23:32:31Z E! [agent] Error writing to outputs.loki: when writing to [http://loki.local/loki/api/v1/push] received status code, 400:

ha and empty body, so that is of no help. Are there any loki logs you can review? Sounds like your next step are trying to determine why you are getting a 400 from loki.

Hey, thanks,

I changed my loki log level to debug, but there is not too much showing.

level=debug ts=2022-12-15T14:35:45.047844905Z caller=logging.go:76 traceID=77354d54b657c32e orgID=fake msg="POST /loki/api/v1/push (400) 336.94µs"

What is weird, is that, if I try to push it using a python script that I created for testing, it works:

import logging
import logging_loki

logging_loki.emitter.LokiEmitter.level_tag = "level"
handler = logging_loki.LokiHandler(
    url="http://loki.local/loki/api/v1/push",
    version="1",
)
# create a new logger instance, name it whatever you want
logger = logging.getLogger("my-logger")

logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

logger.error("errorrr", extra={
             "tags": {"service": "my-service"}})
logger.warning("warninggggggg",
               extra={"tags": {"service": "my-service"}})

Can you enable the [[outputs.file]] output in telegraf and see what metrics are getting generated?

@jpowers
Same/similar issue, can I jump on this thread?

output.file

XRLog,details1=Username:\ admin,host=myhost,xrclogfilepath=042 event="Login" 1673711468095000000
XRLog,details1=Site\ 57\ has\ Disconnected,details2=Reason:\ TCP\ Reset,host=myhost,xrlogfilepath=042 event="Site Connection Status" 1673711477537000000
XRLog,details1=Site\ 57\ has\ Connected,host=myhost,xrlogfilepath=042 event="Site Connection Status" 1673711477658000000

Telegraf

.\telegraf.exe --config .\telegraf.global.tc.conf --config-directory .\telegraf.d --debug --input-filter tail
2023-01-23T20:58:46Z D! [outputs.loki] Buffer fullness: 3 / 10000 metrics
2023-01-23T20:58:46Z E! [agent] Error writing to outputs.loki: when writing to [http://mylokihost:3100/loki/api/v1/push] received status code: 400

@Dan_Foxley

  1. what version of Loki are you using? Is there any easy way for me to set up a server to reproduce?
  2. was this ever working on Telegraf and broke between versions? If we can narrow it down to two versions it can help us find a root cause

thanks!

(Update)

tcpdump on Loki server says:

entry for stream ‘{__name=“XRLog”, details1=“Username: admin”, host=“myhost”, xrlogfilepath=“042”}’ has timestamp too old: 2023-01-14T15:51:08Z, oldest acceptable timestamp is: 2023-01-16T22:16:13Z


@jpowers

This is my first atempt to connect telegraf to Loki, so no previous experience or working setup.

With Loki 2.7.1 on CentOS x64, I did this to get LOKI running. (assuming you have a Grafana instance to add LOKI source).

Install and run LOKI

sudo curl -O -L "https://github.com/grafana/loki/releases/download/v2.7.1/loki-linux-amd64.zip"
sudo yum -y install unzip
sudo unzip "loki-linux-amd64.zip"
sudo mv loki-linux-amd64 /usr/local/bin/loki-linux-amd64
cd /usr/local/bin
sudo chmod a+x "loki-linux-amd64"
sudo mkdir /etc/loki
cd /etc/loki
sudo wget https://raw.githubusercontent.com/grafana/loki/main/cmd/loki/loki-local-config.yaml
./loki-linux-amd64 -config.file=loki-local-config.yaml

Input Conf

[[inputs.tail]]
  alias = "sitechecker"
  name_override = "XRLog"
  files = ["C:/Program Files/InfluxData/telegraf/telegraf.d/csv/*.csv"]
  data_format = "csv"
  csv_header_row_count = 1
  csv_column_names = ["timestamp", "event", "details1", "details2", "details3"]
  csv_column_types = []
  csv_skip_rows = 0
  csv_skip_columns = 0
  csv_delimiter = ","
  csv_comment = ""
  csv_trim_space = true
  csv_tag_columns = ["details1", "details2", "details3"]
  csv_measurement_column = ""
  csv_timestamp_column = "timestamp"
  csv_timestamp_format = "1/02/2006 15:04:05.000"
  csv_timezone = "Etc/UTC"
  csv_skip_values = [""]
  path_tag = "xrlogfilepath"
[[processors.regex]]
  order = 1
  [[processors.regex.tags]]
    key = "xrlogfilepath"
    pattern = "^.*_([0-9]+).csv$"
    replacement = "${1}"

Output.conf

[[outputs.loki]]
 #commented as only thing done to trbobleshoot Telegraf to Loki.
  #alias = "xrlogtoloki"
  ## The domain of Loki
  #domain = "https://loki.domain.tld:3100"
  domain = "http://myloki:3100"

  ## Endpoint to write api
  endpoint = "/loki/api/v1/push"
  
  namepass = ["XRLog"]

CSV
Filename: EVENTLOG-EXPORTED01-14-2023-10.52.45_041.csv

CSV Contents:

TIMESTAMP,EVENTTYPE,EVENTDETAILS
1/14/2023 15:51:08.095,Login,Username: admin
1/14/2023 15:51:17.537,Site Connection Status,Site 57 has Disconnected, Reason: TCP Reset
1/14/2023 15:51:17.658,Site Connection Status,Site 57 has Connected

Previous post has output.file results when using this input.tail and CSV.

@jpowers

After writing post on reproducing this. On Loki server, tcpdump showed:

entry for stream ‘{__name=“XRLog”, details1=“Username: admin”, host=“myhost”, xrlogfilepath=“042”}’ has timestamp too old: 2023-01-14T15:51:08Z, oldest acceptable timestamp is: 2023-01-16T22:16:13Z

Ah ha! I have run into this before with Loki.

Apparently the reject_old_samples and reject_old_samples_max_age Loki settings are used to determine the oldest setting. These default to true and 7 days ago since v2.4, hence the oldest acceptable timestamp that is printed out in your error message is from a week ago.

I do wish we got a message back, rather than an empty 400 error.

@jpowers
What am I missing, I captured the 400 reply in tcpdump, why wouldn’t telegraf been able to see it. It got the 400 packet which includes the reason.

I was going off the response the original poster made where the response body appeared empty after we added it to the error message. I am happy to put that PR back up and we can try again. It is possible that the issues were different.

I went back to look at my PR and forgot we merged it. So the nightly builds and our next build, v1.25.1 will contain a loki that that includes the message body on errors.

1 Like