Skip to content

Leader stopped sending appendEntries to the follower, but heartbeat was working #612

@pranayhere

Description

@pranayhere

Hi, Thanks for the great library! I really appreciate the efforts you guys have taken to build this project.

Issue Description:
We recently encountered an issue, leader stopped sending appendEntries to the follower but could send the heartbeat.
The screenshot below, taken from the metrics, shows the situation. Only one follower node was affected, and all other nodes were fine.

AppendEntried_heartbeat

Let me describe the issue in more detail, and share my findings so far on the issue.

At 09:44:36.106, Follower tries to storeLogs from Index: 151153297 to 151153341.
At 09:44:36.115, Follower received the error, "failed to append to logs error=unable to store logs in logStore, 
                          err=Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction"
                          // note that our logStore is backed by MySQL. 
At 09:44:36.126, On Follower, raft: failed to get previous log previous-index=151153341 last-index=151153296 error=log not found
At 09:44:36.136, On Follower, raft: failed to get previous log previous-index=151153342 last-index=151153296 error=log not found

The above timeline represents the events that occur at the time of the issue. At the same time, I didn't get any warn/error from the leader.

Following is the Stats on the Follower node at 9.45:29.236

{
	"applied_index":"151153288",
	"commit_index":"151153288",
	"fsm_pending":"0",
	"last_contact":"40.69452ms",
	"last_log_index":"151153296",
	"last_log_term":"34105",
	"last_snapshot_index":"151134079",
	"last_snapshot_term":"34105",
	"latest_configuration":"<Sorry, I can't share this. But I've double checked the follower node is available in the list>",
	"latest_configuration_index":"0",
	"num_peers":"8",
	"protocol_version":"3",
	"protocol_version_max":"3",
	"protocol_version_min":"0",
	"server_id":"FollowerNode",
	"snapshot_version_max":"1",
	"snapshot_version_min":"0",
	"state":"Follower",
	"term":"34105"
}

I'm referring to the code below,

    // Update s based on success
	if resp.Success {
		// Update our replication state
		updateLastAppended(s, &req)

		// Clear any failures, allow pipelining
		s.failures = 0
		s.allowPipeline = true
	} else {
		atomic.StoreUint64(&s.nextIndex, max(min(s.nextIndex-1, resp.LastLog+1), 1))
		if resp.NoRetryBackoff {
			s.failures = 0
		} else {
			s.failures++
		}
		r.logger.Warn("appendEntries rejected, sending older logs", "peer", peer, "next", atomic.LoadUint64(&s.nextIndex))
	}

I've the following questions.

  1. When "failed to append to logs" error occurred on the follower node, on the leader atomic.StoreUint64(&s.nextIndex, max(min(s.nextIndex-1, resp.LastLog+1), 1)) should set s.nextIndex
    to 151153297, and try to replicate again from Index=151153297. But in this case "appendEntries rejected, sending older logs" error didn't occur on the leader node. Does that mean the leader received resp.Success=true when an error occurred?

  2. Does the leader consider it has replicated logs till Index 151153341 at the follower node? As from the logs on the follower, it is looking for previousLog at index=151153341

  3. I don't understand the reason why replication stopped from Leader to Follower, i don't see "removed peer, stopping replication" in the logs on the leader node.

Note: when the leadershipTransfer happened again, a new leader was able to restart the replication.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions