Hi, thanks for the great library! I really appreciate the effort you've put into building this project.
Issue Description:
We recently encountered an issue: the leader stopped sending appendEntries to one follower but could still send heartbeats.
The screenshot below, taken from our metrics, shows the situation. Only one follower node was affected; all other nodes were fine.
Let me describe the issue in more detail and share my findings so far.
- At 09:44:36.106, the follower tries to storeLogs from index 151153297 to 151153341.
- At 09:44:36.115, the follower receives the error: "failed to append to logs error=unable to store logs in logStore, err=Error 1213 (40001): Deadlock found when trying to get lock; try restarting transaction" (note that our logStore is backed by MySQL).
- At 09:44:36.126, on the follower: raft: failed to get previous log previous-index=151153341 last-index=151153296 error=log not found
- At 09:44:36.136, on the follower: raft: failed to get previous log previous-index=151153342 last-index=151153296 error=log not found

The timeline above covers the events at the time of the issue. During the same window, I didn't get any warn/error logs from the leader.
The following are the stats on the follower node at 09:45:29.236:
{
"applied_index":"151153288",
"commit_index":"151153288",
"fsm_pending":"0",
"last_contact":"40.69452ms",
"last_log_index":"151153296",
"last_log_term":"34105",
"last_snapshot_index":"151134079",
"last_snapshot_term":"34105",
"latest_configuration":"<Sorry, I can't share this. But I've double checked the follower node is available in the list>",
"latest_configuration_index":"0",
"num_peers":"8",
"protocol_version":"3",
"protocol_version_max":"3",
"protocol_version_min":"0",
"server_id":"FollowerNode",
"snapshot_version_max":"1",
"snapshot_version_min":"0",
"state":"Follower",
"term":"34105"
}
I'm referring to the code below,
// Update s based on success
if resp.Success {
    // Update our replication state
    updateLastAppended(s, &req)
    // Clear any failures, allow pipelining
    s.failures = 0
    s.allowPipeline = true
} else {
    atomic.StoreUint64(&s.nextIndex, max(min(s.nextIndex-1, resp.LastLog+1), 1))
    if resp.NoRetryBackoff {
        s.failures = 0
    } else {
        s.failures++
    }
    r.logger.Warn("appendEntries rejected, sending older logs", "peer", peer, "next", atomic.LoadUint64(&s.nextIndex))
}
I have the following questions:

- When the "failed to append to logs" error occurred on the follower node, on the leader atomic.StoreUint64(&s.nextIndex, max(min(s.nextIndex-1, resp.LastLog+1), 1)) should set s.nextIndex to 151153297 and try to replicate again from index 151153297. But in this case the "appendEntries rejected, sending older logs" warning never appeared on the leader node. Does that mean the leader received resp.Success=true even though an error occurred on the follower?
- Does the leader consider the logs replicated up to index 151153341 on the follower node? The follower's logs show it looking for a previous log at index=151153341.
- I don't understand why replication stopped from the leader to this follower; I don't see "removed peer, stopping replication" in the logs on the leader node.
Note: when a leadership transfer happened again, the new leader was able to restart replication.