...
Replication Types and Node Outages
Get to know replication types and learn how to handle node outages.
Now that we have a basic understanding of replication, let’s discuss replication mechanisms and outages.
Just like in the previous lessons, let’s assume that My Cool App has a good amount of data, but all of it can be hosted on a single machine. Now we want to build replication mechanisms so that the database system can scale and handle more load.
Synchronous or asynchronous replication?#
When a leader receives some data to write, should the replication happen synchronously or asynchronously?
Note: In synchronous replication, a write is successful only when all the followers have successfully processed the write-request. In asynchronous replication, a write is successful as soon as the leader has processed the write-request, after which the followers can process it asynchronously.
Synchronous replication guarantees that every follower in the system contains the same up-to-date data as the leader. If the leader fails, the data is already available on the followers, so no acknowledged write is lost.
In the above diagram, we see a visualization of synchronous replication. Upon receiving a write-request, the leader processes it locally, then forwards the same write to the followers. Before reporting that the write was successful, the leader waits for both followers to report that the write succeeded on their end. This essentially means a write is only as fast as the slowest follower.
Now, what if one of the followers crashes and cannot report success? Unfortunately, in this case, that particular write-request cannot succeed and the leader will have to block all writes until the follower is back. This is a major bottleneck of synchronous replication. For many systems, this approach is too inflexible.
On the other hand, in asynchronous replication, a follower might not always be as up-to-date as the leader, and sometimes it might fall behind. Even worse, a situation could occur where all the followers have lagged behind, and only the leader has the latest data. This means upon receiving read requests, followers will only serve stale information to the clients.
As you can see, synchronous replication can be very problematic because failure or the unresponsiveness of one node will bring the whole system to a standstill. On the other hand, a completely asynchronous replication puts the system at risk of stale data, and if the leader itself fails, then the latest changes might be lost.
This is why many systems prefer a hybrid approach.
Note: In a hybrid approach, one or a few nodes are made synchronous, whereas other nodes are asynchronous. This means a write-request is successful as soon as a fixed number of followers have processed the write on their sides. Other followers can process it asynchronously.
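The three modes can be viewed as one knob: how many follower acknowledgements the leader waits for before reporting success. The following is a minimal in-process sketch under that assumption, not a real database implementation. The names `Follower`, `Leader`, `required_acks`, and `flush` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    name: str
    up: bool = True
    data: list = field(default_factory=list)

    def apply(self, record) -> bool:
        """Apply a write; the return value acts as the acknowledgement."""
        if not self.up:
            return False
        self.data.append(record)
        return True


@dataclass
class Leader:
    followers: list
    required_acks: int  # hypothetical knob: follower acks needed per write
    data: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # (follower, record) to send later

    def write(self, record) -> bool:
        self.data.append(record)  # the leader processes the write first
        acks = 0
        for follower in self.followers:
            if acks < self.required_acks:
                # Synchronous part: wait for this follower's answer.
                if follower.apply(record):
                    acks += 1
                else:
                    self.pending.append((follower, record))
            else:
                # Asynchronous part: queue the write, do not wait.
                self.pending.append((follower, record))
        return acks >= self.required_acks

    def flush(self) -> None:
        """Deliver queued writes to the followers that are reachable."""
        self.pending = [
            (follower, record)
            for follower, record in self.pending
            if not follower.apply(record)
        ]
```

With `required_acks` equal to the number of followers, this behaves like synchronous replication, and a single down follower makes `write` return `False`. With `required_acks=0` it is fully asynchronous, and any value in between gives the hybrid behavior described in the note.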
How to handle temporary follower outages#
In the previous section, we mentioned that followers might crash or become unresponsive. Apart from failures due to faults in nodes, followers might be down due to planned maintenance or network connectivity issues. If there are such temporary outages in a follower node, what happens when it comes back on?
In such a scenario, the follower will have to catch up with the leader when it comes back. Briefly, the process is as follows:
- Each follower has a log in its storage. The log keeps all the data changes received from the leader.
- When a follower comes back after a temporary outage, it can look up its log and decide from which point it needs to begin catching up.
- The follower then requests the corresponding changes from the leader and recovers from the outage.
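The catch-up steps above can be sketched as follows. This is a simplified illustration that assumes every change gets a consecutive sequence number, so the length of the follower's log tells it where to resume. `LeaderLog`, `FollowerLog`, `changes_since`, and `catch_up` are hypothetical names, not part of any real database API.

```python
class LeaderLog:
    """Leader-side log; the entry at index i has sequence number i + 1."""

    def __init__(self):
        self.entries = []

    def append(self, change):
        self.entries.append(change)
        return len(self.entries)  # sequence number of the new entry

    def changes_since(self, last_seq):
        """Return every change with a sequence number greater than last_seq."""
        return self.entries[last_seq:]


class FollowerLog:
    """Follower-side copy of the leader's log."""

    def __init__(self):
        self.entries = []

    def last_seq(self):
        # The last change this follower has applied (0 means none yet).
        return len(self.entries)

    def catch_up(self, leader):
        """Request the missing changes from the leader and apply them."""
        missing = leader.changes_since(self.last_seq())
        self.entries.extend(missing)
        return len(missing)
```

A follower that was down while the leader accepted two more writes would call `catch_up` once after restarting and receive exactly those two changes.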
How to handle leader outages#
Leaders might also go down for various reasons, right? But a leader outage is much more complicated to handle than a follower outage.
If a follower fails temporarily, it can recover. If it fails permanently, other followers can potentially take the extra load and keep the system up. If they can’t take the extra load, new followers can be added to the system.
On the flip side, if the leader fails, no write can proceed. During scheduled maintenance, we can tolerate this kind of downtime. But if a failure occurs during the business operational period and writes are blocked, that surely hurts the business.
So, how do we handle leader outages?
The idea is known as failover. We will now briefly discuss the process.
Detect leader failure#
If there is a failure in the leader, the first step is to detect that a failure has happened. Determining the type of failure is difficult, and most systems do not even try. Generally, a timeout is used to detect a failure in a node.
A timeout is a way of limiting how long a client will wait for a response from a server. For example, a timeout of 30 seconds means that a client waits for 30 seconds after making a request to the server, and if a response is not received within 30 seconds, the request is canceled, and possibly a timeout error is shown on the client’s side.
To detect a failure, we can use a separate controller node, which sends a simple request to the nodes periodically. For example, for the leader, if a timeout occurs for a request from the controller node, the leader is deemed to have failed, and the next steps are triggered to do a failover.
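As a rough illustration of timeout-based detection, the sketch below runs a health-check call in a worker thread and treats a missing answer within the timeout as a failure. It assumes the check is an ordinary Python callable. `responds_in_time`, `monitor`, `ping`, and `on_failure` are hypothetical names, and a real controller would send a network request instead.

```python
import concurrent.futures
import time


def responds_in_time(ping, timeout_s):
    """Return True if ping() answers within timeout_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(ping)
    try:
        future.result(timeout=timeout_s)
        return True
    except concurrent.futures.TimeoutError:
        return False  # no answer in time: the node is deemed to have failed
    except Exception:
        return False  # in this sketch an error reply also counts as a failure
    finally:
        pool.shutdown(wait=False)  # do not block on a hung request


def monitor(ping, timeout_s, on_failure, checks, interval_s=0.0):
    """Ping up to `checks` times; call on_failure() at the first timeout."""
    for _ in range(checks):
        if not responds_in_time(ping, timeout_s):
            on_failure()  # for the leader, this would start the failover
            return False
        time.sleep(interval_s)
    return True
```

A single missed response triggers `on_failure` here to keep the sketch short. Requiring several consecutive misses is a common way to reduce false alarms, at the cost of slower detection.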
❓Quick question: Can you think of a node in the system which can periodically ping the leader and determine whether the node has failed?
Promote a follower#
The next step is to promote one of the followers to be the new leader. If the system uses synchronous replication, any of the follower nodes could be promoted, because all of them hold the same data as the old leader. Otherwise, followers may have fallen behind by different amounts, so generally the most up-to-date follower should become the new leader.
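One simple way to pick the most up-to-date follower is to compare how far each follower has progressed in the replication log. The sketch below assumes each follower can report the sequence number of the last change it applied. The function name `choose_new_leader` and the input shape are hypothetical, and real systems often run a consensus-based election instead.

```python
def choose_new_leader(followers):
    """Pick the follower with the highest applied log sequence number.

    `followers` maps a follower name to the sequence number of the last
    change it applied. Ties go to the alphabetically first name so that
    the choice is deterministic.
    """
    if not followers:
        raise ValueError("no follower available to promote")
    return max(sorted(followers), key=lambda name: followers[name])
```

For example, `choose_new_leader({"f1": 41, "f2": 57, "f3": 57})` selects `"f2"`.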
Route write-requests#
Since the leader handles the write-requests, clients should be able to send these to the new leader. This means all write-requests coming to the system should now go to the new leader. If the old leader comes back, the system should not treat it as the leader.
Generally, after electing a new leader, the load balancer should be made aware of the new leader. Then write requests will be handled accordingly.
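The routing step can be sketched as a small piece of state that the load balancer keeps: the current leader, plus a counter that increases with every failover. The counter is an assumption added here to show one way a returning old leader can be recognized. `WriteRouter`, `term`, `fail_over`, and `role_for` are hypothetical names.

```python
class WriteRouter:
    """Tracks the current leader on behalf of a load balancer."""

    def __init__(self, leader):
        self.leader = leader
        self.term = 1  # hypothetical counter, incremented on every failover

    def fail_over(self, new_leader):
        """Record the newly promoted leader."""
        self.leader = new_leader
        self.term += 1

    def route_write(self, request):
        """Return (node, request): every write goes to the current leader."""
        return (self.leader, request)

    def role_for(self, node, node_term):
        """Role for a node that reconnects claiming leadership in node_term."""
        if node == self.leader and node_term == self.term:
            return "leader"
        return "follower"  # an old leader rejoins as a follower
```

After `fail_over("node-b")`, an old leader `"node-a"` that reconnects with its outdated term is assigned the follower role, and `route_write` keeps sending writes to `"node-b"`.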
Key takeaways#
- As an owner, it is critical that you choose the correct replication type for your system.
- Synchronous replication keeps every follower up to date with the leader, at the cost of slow writes and of writes being blocked while a follower is down.
- Asynchronous replication processes writes fast, but at the cost of stale data on followers and possible loss of the latest writes if the leader fails.
- If there are node outages in the system, replication can become even more complicated.