Everything worked fine with my InfluxDB 2.2 setup, Telegraf inputs, and a bunch of buckets (everything on the same server) until the /var partition became full.
I added disk space and restarted the services, but Telegraf can't write to the InfluxDB server for some buckets. Here is the log message:
2022-05-16T13:17:46Z E! [outputs.influxdb_v2::UR1_Forti] When writing to [http://127.0.0.1:8086]: 500 Internal Server Error: internal error: unexpected error writing points to database: [shard 450] unexpected end of JSON input
2022-05-16T13:17:46Z E! [agent] Error writing to outputs.influxdb_v2::UR1_Forti: failed to send metrics to any configured server(s)
I suspect corrupt bucket files, but I don't know how to check this.
Do you have any clue?
Roderick
Hey @rod,
This one is a little beyond my expertise, so I am going to push it back to our edge team. Could you let us know what version of Linux you are running?
Could you also provide the logs you are seeing from InfluxDB?
# cat /etc/redhat-release
Rocky Linux release 8.6 (Green Obsidian)
Here is the InfluxDB error message:
May 17 06:21:57 vmmonit-ttig1 telegraf[851361]: 2022-05-17T04:21:57Z E! [outputs.influxdb_v2::UR1_Coeur] When writing to [http://127.0.0.1:8086]: 500 Internal Server Error: internal error: unexpected error writing points to database: [shard 449] unexpected end of JSON input
Since yesterday at 6:23 AM, writes to the buckets have been OK (I didn't do anything).
Here is the log from just before 6:23 AM:
May 17 06:22:47 vmmonit-ttig1 telegraf[851361]: 2022-05-17T04:22:47Z D! [outputs.influxdb_v2::UR1_Bucket] Wrote batch of 5000 metrics in 673.959929ms
May 17 06:22:47 vmmonit-ttig1 influxd-systemd-start.sh[3299924]: ts=2022-05-17T04:22:47.965549Z lvl=info msg="Snapshot for path written" log_id=0aVYAFF0000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot path=/var/lib/influxdb/engine/data/4cac7379720f0d3e/autogen/447 duration=3805.054ms
May 17 06:22:47 vmmonit-ttig1 influxd-systemd-start.sh[3299924]: ts=2022-05-17T04:22:47.965617Z lvl=info msg="Cache snapshot (end)" log_id=0aVYAFF0000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=end op_elapsed=3805.131ms
May 17 06:22:47 vmmonit-ttig1 influxd-systemd-start.sh[3299924]: ts=2022-05-17T04:22:47.965639Z lvl=info msg="Cache snapshot (start)" log_id=0aVYAFF0000 service=storage-engine engine=tsm1 op_name=tsm1_cache_snapshot op_event=start
May 17 06:22:48 vmmonit-ttig1 telegraf[851361]: 2022-05-17T04:22:48Z D! [outputs.influxdb_v2::UR1_Bucket] Wrote batch of 4631 metrics in 373.115019ms
May 17 06:22:48 vmmonit-ttig1 telegraf[851361]: 2022-05-17T04:22:48Z D! [outputs.influxdb_v2::UR1_Bucket] Buffer fullness: 0 / 400000 metrics
May 17 06:22:48 vmmonit-ttig1 telegraf[851361]: 2022-05-17T04:22:48Z I! [agent] Stopping running outputs
May 17 06:22:48 vmmonit-ttig1 telegraf[851361]: 2022-05-17T04:22:48Z D! [agent] Stopped Successfully
May 17 06:22:48 vmmonit-ttig1 systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
May 17 06:22:48 vmmonit-ttig1 systemd[1]: Starting The plugin-driven server agent for reporting metrics into InfluxDB...
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z I! Starting Telegraf 1.22.4
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z I! Loaded inputs: cisco_telemetry_mdt internal ping snmp
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z I! Loaded aggregators:
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z I! Loaded processors:
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z I! Loaded outputs: influxdb_v2 (5x)
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z I! Tags enabled: host=vmmonit-ttig1.univ-rennes1.fr
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"vmmonit-ttig1.univ-rennes1.fr", Flush Interval:10s
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Initializing plugins
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifName"
May 17 06:22:48 vmmonit-ttig1 systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifHCOutOctets"
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifHCInOctets"
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "SNMPv2-MIB::sysName.0"
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "SNMPv2-MIB::sysLocation.0"
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Connecting outputs
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Attempting connection to [outputs.influxdb_v2::UR1_Internal]
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Successfully connected to outputs.influxdb_v2::UR1_Internal
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Attempting connection to [outputs.influxdb_v2::UR1_Ping]
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Successfully connected to outputs.influxdb_v2::UR1_Ping
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Attempting connection to [outputs.influxdb_v2::UR1_Coeur]
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Successfully connected to outputs.influxdb_v2::UR1_Coeur
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Attempting connection to [outputs.influxdb_v2::UR1_Forti]
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Successfully connected to outputs.influxdb_v2::UR1_Forti
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Attempting connection to [outputs.influxdb_v2::UR1_Bucket]
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Successfully connected to outputs.influxdb_v2::UR1_Bucket
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [agent] Starting service inputs
May 17 06:22:48 vmmonit-ttig1 telegraf[3183125]: 2022-05-17T04:22:48Z D! [inputs.cisco_telemetry_mdt] Accepted Cisco MDT GRPC dialout connection from 10.60.0.218:63220
After that, no write errors.
Does this ring a bell for you?
Roderick
Hmm, so the team thinks it might have been down to a corrupted shard. We are fixing the "unexpected end of JSON input" error you see; it is a red herring for the actual issue you were having. My guess is that the corrupted shard was deleted based on your retention policy, after which metrics were written as expected.
That’s a tough one; keeping duplicate copies of shards is one of the reasons we have an Enterprise version, so if one shard copy gets corrupted there is an alternative.
You can try to export the line protocol and re-ingest it; with careful use of time ranges, you may be able to avoid the corrupt part of the shard. But recovering a corrupt shard after running out of disk space is by no means a guaranteed or reliable process.
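If you want to try that route, here is a rough sketch using `influxd inspect export-lp` from InfluxDB 2.x. Treat the bucket ID, time range, org, and token below as placeholders: the bucket ID is taken from the shard path in your logs, and the time window is just an example of excluding the period around the corruption.

```shell
# Stop influxd first so TSM files are not being written to while exporting.
systemctl stop influxd

# Export one bucket's data as line protocol, restricting the time range
# to avoid the suspect window (IDs, paths, and times are examples).
influxd inspect export-lp \
  --bucket-id 4cac7379720f0d3e \
  --engine-path /var/lib/influxdb/engine \
  --output-path /tmp/export.lp \
  --start 2022-01-01T00:00:00Z \
  --end 2022-05-15T00:00:00Z

systemctl start influxd

# Re-ingest into a fresh bucket (org, bucket name, and token are placeholders).
influx write \
  --bucket UR1_Forti_recovered \
  --org my-org \
  --token "$INFLUX_TOKEN" \
  --file /tmp/export.lp
```

Exporting to a partition with plenty of free space is worth checking first, since the line protocol dump can be larger than the compressed TSM data.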
I would recommend using retention policies on your data, and you can monitor bucket sizes through the scrapers. If you need to expand disk space, I would take a backup first, just in case.
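For the backup, the built-in `influx backup` command can run against a live server; a minimal sketch, where the destination path, host, and token are placeholders:

```shell
# Full backup of all buckets to a directory on a partition with free space.
influx backup /backups/influxdb-$(date +%F) \
  --host http://127.0.0.1:8086 \
  --token "$INFLUX_TOKEN"

# Later, if needed, a single bucket can be restored from that backup:
# influx restore /backups/influxdb-2022-05-17 --bucket UR1_Forti
```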