Leader stopped sending appendEntries to the follower, but heartbeat was working

Hi, Thanks for the great library! I really appreciate the efforts you guys have taken to build this project.

Issue Description:
We recently encountered an issue, leader stopped sending appendEntries to the follower but could send the heartbeat.
The screenshot below, taken from the metrics,  shows the situation. Only one follower node was affected, and all other nodes were fine.

<img width="825" alt="AppendEntried_heartbeat" src="https://github.com/user-attachments/assets/32101c11-f739-490f-9365-048a5c023b9d">

Let me describe the issue in more detail, and share my findings so far on the issue.
```
At 09:44:36.106, Follower tries to storeLogs from Index: 151153297 to 151153341.
At 09:44:36.115, Follower received the error, "failed to append to logs error=unable to store logs in logStore, 
                          err=Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction"
                          // note that our logStore is backed by MySQL. 
At 09:44:36.126, On Follower, raft: failed to get previous log previous-index=151153341 last-index=151153296 error=log not found
At 09:44:36.136, On Follower, raft: failed to get previous log previous-index=151153342 last-index=151153296 error=log not found
```

The above timeline represents the events that occur at the time of the issue. At the same time, I didn't get any warn/error from the leader.

Following is the Stats on the Follower node at 9.45:29.236
```
{
	"applied_index":"151153288",
	"commit_index":"151153288",
	"fsm_pending":"0",
	"last_contact":"40.69452ms",
	"last_log_index":"151153296",
	"last_log_term":"34105",
	"last_snapshot_index":"151134079",
	"last_snapshot_term":"34105",
	"latest_configuration":"<Sorry, I can't share this. But I've double checked the follower node is available in the list>",
	"latest_configuration_index":"0",
	"num_peers":"8",
	"protocol_version":"3",
	"protocol_version_max":"3",
	"protocol_version_min":"0",
	"server_id":"FollowerNode",
	"snapshot_version_max":"1",
	"snapshot_version_min":"0",
	"state":"Follower",
	"term":"34105"
}
```

I'm referring to the code below,
```
    // Update s based on success
	if resp.Success {
		// Update our replication state
		updateLastAppended(s, &req)

		// Clear any failures, allow pipelining
		s.failures = 0
		s.allowPipeline = true
	} else {
		atomic.StoreUint64(&s.nextIndex, max(min(s.nextIndex-1, resp.LastLog+1), 1))
		if resp.NoRetryBackoff {
			s.failures = 0
		} else {
			s.failures++
		}
		r.logger.Warn("appendEntries rejected, sending older logs", "peer", peer, "next", atomic.LoadUint64(&s.nextIndex))
	}
```

I've the following questions.
1. When "failed to append to logs" error occurred on the follower node, on the leader `atomic.StoreUint64(&s.nextIndex, max(min(s.nextIndex-1, resp.LastLog+1), 1))` should set `s.nextIndex`
to 151153297, and try to replicate again from Index=151153297. But in this case "appendEntries rejected, sending older logs" error didn't occur on the leader node. Does that mean the leader received `resp.Success=true` when an error occurred?

2. Does the leader consider it has replicated logs till Index 151153341 at the follower node? As from the logs on the follower, it is looking for previousLog at index=151153341

3. I don't understand the reason why replication stopped from Leader to Follower, i don't see "removed peer, stopping replication" in the logs on the leader node.

Note: when the leadershipTransfer happened again, a new leader was able to restart the replication.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Leader stopped sending appendEntries to the follower, but heartbeat was working #612

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Leader stopped sending appendEntries to the follower, but heartbeat was working #612

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions