Anti-entropy error repairing shard: "Queueing repair: shard 1043 appears to be in an infinite repair cycle"

Hi,
I’m running a 1.7.10 cluster with anti-entropy enabled and multiple databases.

I’ve got many (over 40) shards reported as ‘diff’ that need to be repaired.

# influxd-ctl entropy show
Entropy
==========
ID    Database                  Retention Policy  Start                          End                            Expires                        Status
1043  fusion_stats              autogen           2019-10-21 00:00:00 +0000 UTC  2019-10-28 00:00:00 +0000 UTC  1970-01-01 00:00:00 +0000 UTC  diff
1453  actionlogger              autogen           2019-12-02 00:00:00 +0000 UTC  2019-12-09 00:00:00 +0000 UTC  2022-12-05 00:00:00 +0000 UTC  diff
1795  actionlogger              autogen           2020-01-27 00:00:00 +0000 UTC  2020-02-03 00:00:00 +0000 UTC  2023-01-30 00:00:00 +0000 UTC  diff
1839  actionlogger              autogen           2020-02-03 00:00:00 +0000 UTC  2020-02-10 00:00:00 +0000 UTC  2023-02-06 00:00:00 +0000 UTC  diff
1842  kepler_test               autogen           2020-02-03 00:00:00 +0000 UTC  2020-02-10 00:00:00 +0000 UTC  2023-02-09 00:00:00 +0000 UTC  diff
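
I’ve been queueing the repairs one shard at a time with the standard entropy repair subcommand, e.g. for shard 1043:

# influxd-ctl entropy repair 1043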

Some shards can’t be repaired because of this error:

May 15 13:16:14  influxd[18916]: ts=2020-05-15T13:16:14.793877Z lvl=info msg="Queue shard repair request recieved" log_id=0Mn4Z6Xl000 service=cluster node=5 shard_id=1043 visited=
May 15 13:16:14  influxd[18916]: ts=2020-05-15T13:16:14.794429Z lvl=info msg="Queued shard repair" log_id=0Mn4Z6Xl000 service=ae node=5 db_shard_id=1043
May 15 13:16:14  influxd[18916]: ts=2020-05-15T13:16:14.794527Z lvl=info msg="Repair shard (start)" log_id=0Mn4Z6Xl000 service=ae trace_id=0Mn7VCIW000 op_name=ae_repair op_event=start
May 15 13:16:14  influxd[18916]: ts=2020-05-15T13:16:14.850204Z lvl=info msg="No shard entropy to repair between nodes" log_id=0Mn4Z6Xl000 service=ae trace_id=0Mn7VCIW000 op_name=ae_repair local_node=5 remote_node=4 db_shard_id=1043 error="no values written"
May 15 13:16:14  influxd[18916]: ts=2020-05-15T13:16:14.851815Z lvl=info msg="Successfully queued repair on next node" log_id=0Mn4Z6Xl000 service=ae trace_id=0Mn7VCIW000 op_name=ae_repair local_node=5 remote_node=4 db_shard_id=1043 error="no values written"
May 15 13:16:14  influxd[18916]: ts=2020-05-15T13:16:14.851848Z lvl=info msg="Repair shard (end)" log_id=0Mn4Z6Xl000 service=ae trace_id=0Mn7VCIW000 op_name=ae_repair op_event=end op_elapsed=57.323ms
May 15 13:16:14  influxd[18916]: ts=2020-05-15T13:16:14.948906Z lvl=info msg="Queue shard repair request recieved" log_id=0Mn4Z6Xl000 service=cluster node=5 shard_id=1043 visited=5,4
May 15 13:16:14  influxd[18916]: ts=2020-05-15T13:16:14.949416Z lvl=info msg="Queued shard repair" log_id=0Mn4Z6Xl000 service=ae node=5 db_shard_id=1043
May 15 13:16:14  influxd[18916]: ts=2020-05-15T13:16:14.949485Z lvl=info msg="Repair shard (start)" log_id=0Mn4Z6Xl000 service=ae trace_id=0Mn7VCuG000 op_name=ae_repair op_event=start
May 15 13:16:15  influxd[18916]: ts=2020-05-15T13:16:15.082275Z lvl=info msg="No shard entropy to repair between nodes" log_id=0Mn4Z6Xl000 service=ae trace_id=0Mn7VCuG000 op_name=ae_repair local_node=5 remote_node=4 db_shard_id=1043 error="no values written"
May 15 13:16:15  influxd[18916]: ts=2020-05-15T13:16:15.098997Z lvl=info msg="Successfully queued repair on next node" log_id=0Mn4Z6Xl000 service=ae trace_id=0Mn7VCuG000 op_name=ae_repair local_node=5 remote_node=4 db_shard_id=1043 error="no values written"
May 15 13:16:15  influxd[18916]: ts=2020-05-15T13:16:15.099025Z lvl=info msg="Repair shard (end)" log_id=0Mn4Z6Xl000 service=ae trace_id=0Mn7VCuG000 op_name=ae_repair op_event=end op_elapsed=149.542ms
May 15 13:16:15  influxd[18916]: ts=2020-05-15T13:16:15.148714Z lvl=info msg="Queue shard repair request recieved" log_id=0Mn4Z6Xl000 service=cluster node=5 shard_id=1043 visited=5,4,5,4
May 15 13:16:15  influxd[18916]: ts=2020-05-15T13:16:15.149187Z lvl=error msg="Queue repair selecting next owner" log_id=0Mn4Z6Xl000 service=ae error="shard 1043 appears to be in an infinite repair cycle"
May 15 13:16:15  influxd[18916]: ts=2020-05-15T13:16:15.149248Z lvl=info msg="Process queue shard repair error" log_id=0Mn4Z6Xl000 service=cluster error="Queueing repair: shard 1043 appears to be in an infinite repair cycle"

Some other shards can’t be repaired because of another error:

May 15 12:59:57  influxd[18916]: ts=2020-05-15T12:59:57.636190Z lvl=info msg="Queue shard repair request recieved" log_id=0Mn4Z6Xl000 service=cluster node=5 shard_id=1842 visited=5,4
May 15 12:59:57  influxd[18916]: ts=2020-05-15T12:59:57.636487Z lvl=info msg="Queued shard repair" log_id=0Mn4Z6Xl000 service=ae node=5 db_shard_id=1842
May 15 12:59:57  influxd[18916]: ts=2020-05-15T12:59:57.636521Z lvl=info msg="Repair shard (start)" log_id=0Mn4Z6Xl000 service=ae trace_id=0Mn6ZZH0000 op_name=ae_repair op_event=start
May 15 12:59:57  influxd[18916]: ts=2020-05-15T12:59:57.636766Z lvl=info msg="Error creating diff reader for shard" log_id=0Mn4Z6Xl000 service=ae trace_id=0Mn6ZZH0000 op_name=ae_repair local_node=5 remote_node=4 db_shard_id=1842 error="local digest: shard not idle"
May 15 12:59:57  influxd[18916]: ts=2020-05-15T12:59:57.636773Z lvl=info msg="Repair shard (end)" log_id=0Mn4Z6Xl000 service=ae trace_id=0Mn6ZZH0000 op_name=ae_repair op_event=end op_elapsed=0.254ms
May 15 12:59:57  influxd[18916]: ts=2020-05-15T12:59:57.636780Z lvl=error msg="Error repairing shard" log_id=0Mn4Z6Xl000 service=ae error="local digest: shard not idle"

Can you explain what these errors mean and what I can do to repair the shards?
I restarted the cluster a couple of times, but it does not help.

The last log snippet includes this line:

May 15 12:59:57  influxd[18916]: ts=2020-05-15T12:59:57.636766Z lvl=info msg="Error creating diff reader for shard" log_id=0Mn4Z6Xl000 service=ae trace_id=0Mn6ZZH0000 op_name=ae_repair local_node=5 remote_node=4 db_shard_id=1842 error="local digest: shard not idle"

The error “shard not idle” usually means the shard is being written to. Are you backfilling old data?
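
If you want to double-check, and assuming the _internal monitoring database is still enabled, you could watch the per-shard write counters from the influx shell; something along these lines for shard 1842:

> SELECT non_negative_derivative(last("writePointsOk"), 1m) FROM "_internal".."shard" WHERE "id" = '1842' AND time > now() - 30m GROUP BY time(1m)

If that stays at zero, nothing is writing to the shard, and something else (e.g. pending compactions or an unflushed WAL/cache) is keeping it from going idle.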

Some of the shards are months old, so I expected them to already be fully compacted and no longer receiving writes.
I’m sure we are not backfilling any data.

It is worth mentioning that the cluster data nodes hit 100% disk usage recently, and I wonder if that might have corrupted the meta DB.

Also, I noticed that when I start an entropy repair on a shard, it fails and then an empty WAL file is created:

-rw-r--r-- 1 influxdb influxdb 0 May 25 08:54 /opt/influxdb/wal/kepler_test/autogen/1842/_00001.wal

(it may be OK though)
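
If it helps with diagnosis, I can also list the shard’s TSM files to check its compaction state; assuming the data directory mirrors the WAL layout, that would be something like:

# ls -la /opt/influxdb/data/kepler_test/autogen/1842/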