Everything worked fine with my InfluxDB 2.2 setup, Telegraf inputs, and a bunch of buckets (everything on the same server) until the /var partition became full.
I added disk space and restarted the services, but Telegraf can't write to the InfluxDB server for some buckets. Here is the log message:
2022-05-16T13:17:46Z E! [outputs.influxdb_v2::UR1_Forti] When writing to [http://127.0.0.1:8086]: 500 Internal Server Error: internal error: unexpected error writing points to database: [shard 450] unexpected end of JSON input
2022-05-16T13:17:46Z E! [agent] Error writing to outputs.influxdb_v2::UR1_Forti: failed to send metrics to any configured server(s)
I suspect corrupt bucket files, but I don't know how to check this.
Do you have any clue?
Roderick
Hey @rod,
This one is a little beyond my expertise, so I am going to push it back to our edge team. Could you let us know what version of Linux you are running?
Could you also provide the logs you are seeing from InfluxDB?
# cat /etc/redhat-release
Rocky Linux release 8.6 (Green Obsidian)
Here is the InfluxDB error message:
May 17 06:21:57 vmmonit-ttig1 telegraf[851361]: 2022-05-17T04:21:57Z E! [outputs.influxdb_v2::UR1_Coeur] When writing to [http://127.0.0.1:8086]: 500 Internal Server Error: internal error: unexpected error writing points to database: [shard 449] unexpected end of JSON input
Since yesterday at 6:23 AM, writes to the buckets have been OK (I didn't do anything).
Here is the log from just before 6:23 AM:
May 17 06:22:47 vmmonit-ttig1 telegraf[851361]: 2022-05-17T04:22:47Z D! [outputs.influxdb_v2::UR1_Bucket] Wrote batch of 5000 metrics in 673.959929ms
May 17 06:22:47 vmmonit-ttig1 influxd-systemd-start.sh[3299924]: ts=2022-05-17T04:22:47.965549Z lvl=info msg="Snapshot for path written" log_id=0aVYAFF0000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot path=/var/lib/influxdb/engine/data/4cac7379720f0d3e/autogen/447 duration=3805.054ms
May 17 06:22:47 vmmonit-ttig1 influxd-systemd-start.sh[3299924]: ts=2022-05-17T04:22:47.965617Z lvl=info msg="Cache snapshot (end)" log_id=0aVYAFF0000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=end op_elapsed=3805.131ms
May 17 06:22:47 vmmonit-ttig1 influxd-systemd-start.sh[3299924]: ts=2022-05-17T04:22:47.965639Z lvl=info msg="Cache snapshot (start)" log_id=0aVYAFF0000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
May 17 06:22:48 vmmonit-ttig1 telegraf[851361]: 2022-05-17T04:22:48Z D! [outputs.influxdb_v2::UR1_Bucket] Wrote batch of 4631 metrics in 373.115019ms
May 17 06:22:48 vmmonit-ttig1 telegraf[851361]: 2022-05-17T04:22:48Z D! [outputs.influxdb_v2::UR1_Bucket] Buffer fullness: 0 / 400000 metrics
May 17 06:22:48 vmmonit-ttig1 telegraf[851361]: 2022-05-17T04:22:48Z I! [agent] Stopping running outputs
May 17 06:22:48 vmmonit-ttig1 telegraf[851361]: 2022-05-17T04:22:48Z D! [agent] Stopped Successfully
May 17 06:22:48 vmmonit-ttig1 systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
May 17 06:22:48 vmmonit-ttig1 systemd[1]: Starting The plugin-driven server agent for reporting metrics into InfluxDB...
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z I! Starting Telegraf 1.22.4
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z I! Loaded inputs: cisco_telemetry_mdt internal ping snmp
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z I! Loaded aggregators:
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z I! Loaded processors:
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z I! Loaded outputs: influxdb_v2 (5x)
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z I! Tags enabled: host=vmmonit-ttig1.univ-rennes1.fr
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"vmmonit-ttig1.univ-rennes1.fr", Flush Interval:10s
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Initializing plugins
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifName"
May 17 06:22:48 vmmonit-ttig1 systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifHCOutOctets"
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifHCInOctets"
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "SNMPv2-MIB::sysName.0"
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "SNMPv2-MIB::sysLocation.0"
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Connecting outputs
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Attempting connection to [outputs.influxdb_v2::UR1_Internal]
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Successfully connected to outputs.influxdb_v2::UR1_Internal
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Attempting connection to [outputs.influxdb_v2::UR1_Ping]
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Successfully connected to outputs.influxdb_v2::UR1_Ping
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Attempting connection to [outputs.influxdb_v2::UR1_Coeur]
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Successfully connected to outputs.influxdb_v2::UR1_Coeur
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Attempting connection to [outputs.influxdb_v2::UR1_Forti]
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Successfully connected to outputs.influxdb_v2::UR1_Forti
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Attempting connection to [outputs.influxdb_v2::UR1_Bucket]
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Successfully connected to outputs.influxdb_v2::UR1_Bucket
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Starting service inputs
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [inputs.cisco_telemetry_mdt] Accepted Cisco MDT GRPC dialout connection from 10.60.0.218:63220
After that, no write errors.
Does this ring a bell for you?
Roderick
Hmm, so the team thinks it might have been down to a corrupted shard. We are fixing the "unexpected end of JSON input" error you see; it is a red herring for the actual issue you were having. My guess is that the corrupted shard was deleted based on your retention policy, after which metrics were written as expected.
That’s a tough one; keeping duplicate copies of shards is one of the reasons we have an Enterprise version, so if one shard copy gets corrupted there is an alternative.
You can try to export the line protocol and re-ingest it; with careful use of time ranges, you may be able to avoid the corrupt part of the shard. But recovering a corrupt shard after running out of disk space is by no means a guaranteed or reliable process.
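If you want to try that route, here is a rough sketch using `influxd inspect export-lp` from InfluxDB 2.x. Treat the bucket ID, time range, org, and token below as placeholders: the bucket ID is taken from the shard path in your logs, and the time window is just an example of excluding the period around the corruption.

```shell
# Stop influxd first so TSM files are not being written to while exporting.
systemctl stop influxd

# Export one bucket's data as line protocol, restricting the time range
# to avoid the suspect window (IDs, paths, and times are examples).
influxd inspect export-lp \
  --bucket-id 4cac7379720f0d3e \
  --engine-path /var/lib/influxdb/engine \
  --output-path /tmp/export.lp \
  --start 2022-01-01T00:00:00Z \
  --end 2022-05-15T00:00:00Z

systemctl start influxd

# Re-ingest into a fresh bucket (org, bucket name, and token are placeholders).
influx write \
  --bucket UR1_Forti_recovered \
  --org my-org \
  --token "$INFLUX_TOKEN" \
  --file /tmp/export.lp
```

Exporting to a partition with plenty of free space is worth checking first, since the line protocol dump can be larger than the compressed TSM data.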
I would recommend using retention policies on your data, and you can monitor bucket sizes through the scrapers. If you need to expand disk space, I would take a backup first, just in case.
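For the backup, the built-in `influx backup` command can run against a live server; a minimal sketch, where the destination path, host, and token are placeholders:

```shell
# Full backup of all buckets to a directory on a partition with free space.
influx backup /backups/influxdb-$(date +%F) \
  --host http://127.0.0.1:8086 \
  --token "$INFLUX_TOKEN"

# Later, if needed, a single bucket can be restored from that backup:
# influx restore /backups/influxdb-2022-05-17 --bucket UR1_Forti
```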