Telegraf shuts down after write to InfluxDB

Hello, I am having some trouble with the Telegraf -> InfluxDB -> Chronograf pipeline.

I am trying to set up an integration with Particle, so I added the webhooks.particle input.

When I POST a JSON request to the webhook, it routes through Telegraf successfully, the point is stored in InfluxDB, and I can see the data in a Chronograf dashboard.

However, every time I POST a value to Telegraf, it crashes (causing my orchestrator to restart it).

Here is the POST request I am sending:

import json

import requests

# Redacted host; the port matches the webhooks service_address in telegraf.conf
url = 'http://*******:1619/particle'
payload = {
    "event": "sensor",
    "measurement": "test_temperature",
    "published_at": "2020-01-22",
    "ttl": 60,
    "data": {
        "tags": {"id": "123"},
        "values": {"temp": 27.28}
    }
}

r = requests.post(url, data=json.dumps(payload))

Here is my telegraf.conf:

[agent]
  collection_jitter = "0s"
  debug = true
  flush_interval = "10s"
  flush_jitter = "0s"
  hostname = "$HOSTNAME"
  interval = "10s"
  logfile = ""
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = false
  precision = ""
  quiet = false
  round_interval = true

[[processors.enum]]
  [[processors.enum.mapping]]
    dest = "status_code"
    field = "status"
    [processors.enum.mapping.value_mappings]
      critical = 3
      healthy = 1
      problem = 2

[[outputs.health]]
  service_address = "http://:8888"
  [[outputs.health.compares]]
    field = "buffer_size"
    lt = 5000.0
  [[outputs.health.contains]]
    field = "buffer_size"

[[outputs.influxdb]]
  database = "telegraf"
  urls = [
    "http://influxdb.monitoring:8086"
  ]

[[inputs.webhooks]]
  service_address = ":1619"
  [inputs.webhooks.particle]
    path = "/particle"

Here are the logs (notice that everything seems OK until it writes some data, and then a safe shutdown is triggered for some reason…):

E 2020-01-23T01:42:25.645234Z 2020-01-23T01:42:25Z I! Starting Telegraf 1.12.6
E 2020-01-23T01:42:25.645436076Z 2020-01-23T01:42:25Z I! Using config file:       /etc/telegraf/telegraf.conf
E 2020-01-23T01:42:25.646189164Z 2020-01-23T01:42:25Z I! Loaded inputs: webhooks
E 2020-01-23T01:42:25.646269817Z 2020-01-23T01:42:25Z I! Loaded aggregators: 
E 2020-01-23T01:42:25.646315783Z 2020-01-23T01:42:25Z I! Loaded processors: enum
E 2020-01-23T01:42:25.646395180Z 2020-01-23T01:42:25Z I! Loaded outputs: health influxdb
E 2020-01-23T01:42:25.646437133Z 2020-01-23T01:42:25Z I! Tags enabled: host=telegraf-polling-service
E 2020-01-23T01:42:25.646512432Z 2020-01-23T01:42:25Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"telegraf-polling-service", Flush Interval:10s
E 2020-01-23T01:42:25.646588761Z 2020-01-23T01:42:25Z D! [agent] Initializing plugins
E 2020-01-23T01:42:25.646635004Z 2020-01-23T01:42:25Z D! [agent] Connecting outputs
E 2020-01-23T01:42:25.646694073Z 2020-01-23T01:42:25Z D! [agent] Attempting connection to [outputs.health]
E 2020-01-23T01:42:25.646933890Z 2020-01-23T01:42:25Z I! [outputs.health] Listening on http://[::]:8888
E 2020-01-23T01:42:25.646996796Z 2020-01-23T01:42:25Z D! [agent] Successfully connected to outputs.health
E 2020-01-23T01:42:25.647106379Z 2020-01-23T01:42:25Z D! [agent] Attempting connection to [outputs.influxdb]
E 2020-01-23T01:42:25.651996642Z 2020-01-23T01:42:25Z D! [agent] Successfully connected to outputs.influxdb
E 2020-01-23T01:42:25.652092027Z 2020-01-23T01:42:25Z D! [agent] Starting service inputs
E 2020-01-23T01:42:25.652257564Z 2020-01-23T01:42:25Z I! Started the webhooks service on :1619
E 2020-01-23T01:42:40.000932879Z 2020-01-23T01:42:40Z D! [outputs.health] Buffer fullness: 0 / 10000 metrics
E 2020-01-23T01:42:40.000966653Z 2020-01-23T01:42:40Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
E 2020-01-23T01:42:50.000807641Z 2020-01-23T01:42:50Z D! [outputs.health] Buffer fullness: 0 / 10000 metrics
E 2020-01-23T01:42:50.000856874Z 2020-01-23T01:42:50Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
E 2020-01-23T01:43:00.001179324Z 2020-01-23T01:43:00Z D! [outputs.health] Wrote batch of 1 metrics in 2.679µs
E 2020-01-23T01:43:00.001215065Z 2020-01-23T01:43:00Z D! [outputs.health] Buffer fullness: 0 / 10000 metrics
E 2020-01-23T01:43:00.005477536Z 2020-01-23T01:43:00Z D! [outputs.influxdb] Wrote batch of 1 metrics in 4.584854ms
E 2020-01-23T01:43:00.005517516Z 2020-01-23T01:43:00Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
E 2020-01-23T01:43:10.000855316Z 2020-01-23T01:43:10Z D! [outputs.health] Buffer fullness: 0 / 10000 metrics
E 2020-01-23T01:43:10.000892713Z 2020-01-23T01:43:10Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
E 2020-01-23T01:43:20.000837346Z 2020-01-23T01:43:20Z D! [outputs.health] Buffer fullness: 0 / 10000 metrics
E 2020-01-23T01:43:20.000911941Z 2020-01-23T01:43:20Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
E 2020-01-23T01:43:25.290604800Z 2020-01-23T01:43:25Z D! [agent] Stopping service inputs
E 2020-01-23T01:43:25.290662495Z 2020-01-23T01:43:25Z I! Stopping the Webhooks service
E 2020-01-23T01:43:25.290667538Z 2020-01-23T01:43:25Z D! [agent] Input channel closed
E 2020-01-23T01:43:25.290671068Z 2020-01-23T01:43:25Z D! [agent] Processor channel closed
E 2020-01-23T01:43:25.290674608Z 2020-01-23T01:43:25Z I! [agent] Hang on, flushing any cached metrics before shutdown
E 2020-01-23T01:43:25.290678216Z 2020-01-23T01:43:25Z D! [outputs.health] Buffer fullness: 0 / 10000 metrics
E 2020-01-23T01:43:25.290681597Z 2020-01-23T01:43:25Z D! [outputs.influxdb] Buffer fullness: 0 / 10000 metrics
E 2020-01-23T01:43:25.290685030Z 2020-01-23T01:43:25Z D! [agent] Closing outputs
E 2020-01-23T01:43:25.290688531Z 2020-01-23T01:43:25Z D! [agent] Stopped Successfully

I should add that the same thing happens whether I POST from the particle.io webhook or from Python. Either some kind of error is occurring that makes Telegraf shut itself down safely, or perhaps something in my request is triggering it?

How do I make the logs more verbose to see the root cause?

This should only be possible if Telegraf received a SIGINT or SIGQUIT signal. Try running the following while reproducing the issue; does it show any signals?

sudo strace -p $(pgrep telegraf) -f 2>&1 | grep signo

Hmm, unfortunately when I try to run this in the container shell, it returns immediately, before I can reproduce the issue. I would have expected the trace command to tail the output or something like that.

It’s interesting that a SIGINT or SIGQUIT would be causing this… I am hosting this on a Kubernetes cluster, so perhaps it is issuing the signal. Why would it happen when a data point is submitted, though? Maybe it detects some sort of error and triggers a restart? Kubernetes config below:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2020-01-23T20:37:56Z"
  generation: 1
  labels:
    app.kubernetes.io/instance: telegraf
    app.kubernetes.io/name: telegraf
    helm.sh/chart: telegraf-1.5.0
  name: telegraf
  namespace: monitoring
  resourceVersion: "360178"
  selfLink: /apis/apps/v1/namespaces/monitoring/deployments/telegraf
  uid: 3c6e292e-3e20-11ea-a171-42010aa80099
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: telegraf
      app.kubernetes.io/name: telegraf
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        checksum/config: 4cf01cd290ea4f18669c68977385e39186d6da8e247cdd70150cee3e482864d5
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: telegraf
        app.kubernetes.io/name: telegraf
    spec:
      containers:
      - env:
        - name: HOSTNAME
          value: telegraf-polling-service
        image: telegraf:1.12-alpine
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 8888
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: telegraf
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 8888
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/telegraf
          name: config
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: telegraf
      serviceAccountName: telegraf
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: telegraf
        name: config
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2020-01-23T20:37:56Z"
    lastUpdateTime: "2020-01-23T20:38:04Z"
    message: ReplicaSet "telegraf-557ff4d7f" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-01-23T21:32:31Z"
    lastUpdateTime: "2020-01-23T21:32:31Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
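
To check whether Kubernetes itself is killing the container over the probes (rather than Telegraf exiting on its own), I will try inspecting the pod events. Something like the following (just a sketch; the namespace and labels are taken from the deployment above):

kubectl -n monitoring describe pod -l app.kubernetes.io/name=telegraf
kubectl -n monitoring get events --sort-by=.lastTimestamp | grep -iE 'liveness|killing'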

I think I figured it out. The liveness and readiness probes were set too aggressively on my k8s pod. After a request or a write, the liveness probe would fail and trigger a restart. Raising the probe thresholds and timeouts resolved the issue, and now I can push data without restarts. Thanks!
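
For reference, this is roughly the shape of the change in the container spec (the values are illustrative, not a recommendation; the readinessProbe got the same treatment):

livenessProbe:
  httpGet:
    path: /
    port: 8888
    scheme: HTTP
  initialDelaySeconds: 15   # give Telegraf time to start the health listener
  periodSeconds: 10
  timeoutSeconds: 5         # was 1
  failureThreshold: 6       # was 3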

I think the health output is reporting an error and that is triggering k8s to send the signal. With your current checks:

[[outputs.health]]
  service_address = "http://:8888"
  [[outputs.health.compares]]
    field = "buffer_size"
    lt = 5000.0
  [[outputs.health.contains]]
    field = "buffer_size"

Both of these conditions must be true for the health endpoint to report healthy, and the metrics coming from the Particle webhook don't carry a buffer_size field, which is presumably why the endpoint goes unhealthy (and the probe fails) right after you post a point. Normally you would want to limit the checks to a particular series with metric filtering:

  namepass = ["internal_write"]
  tagpass = { output = ["influxdb"] }

This example also requires the internal input to be enabled.
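
For completeness, a sketch of how the relevant sections could look once the filtering is in place (namepass/tagpass sit directly under [[outputs.health]]; internal_write is the per-output measurement emitted by the internal plugin, tagged with the output name; collect_memstats is optional):

[[inputs.internal]]
  collect_memstats = false

[[outputs.health]]
  service_address = "http://:8888"
  namepass = ["internal_write"]
  tagpass = { output = ["influxdb"] }
  [[outputs.health.compares]]
    field = "buffer_size"
    lt = 5000.0
  [[outputs.health.contains]]
    field = "buffer_size"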

This solved it - thanks so much!