
Our modern society depends heavily on information provided by computers over the network. Mobile devices amplified that dependency, because people can access the network any time from anywhere. If you provide such services, it is very important that they are available most of the time.

We can mathematically define the availability as the ratio of (A), the total time a service is capable of being used during a given interval, to (B), the length of the interval. It is normally expressed as a percentage of uptime in a given year.

Table 1. Availability - Downtime per Year

Availability %      Downtime per year
99                  3.65 days
99.9                8.76 hours
99.99               52.56 minutes
99.999              5.26 minutes
99.9999             31.5 seconds
99.99999            3.15 seconds
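
To see where these numbers come from, take the 99.99% row as a worked example (a 365-day year is assumed, matching the rounding used above):

(1 - 0.9999) × 365 × 24 × 60 min = 52.56 minutes of downtime per year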

There are several ways to increase availability. The most elegant solution is to rewrite your software, so that you can run it on several hosts at the same time. The software itself needs to have a way to detect errors and do failover. If you only want to serve read-only web pages, then this is relatively simple. However, this is generally complex and sometimes impossible, because you cannot modify the software yourself. The following solutions work without modifying the software:

  • Use reliable “server” components

    Note Computer components with the same functionality can have varying reliability numbers, depending on the component quality. Most vendors sell components with higher reliability as “server” components - usually at higher price.

  • Eliminate single point of failure (redundant components)

    • use an uninterruptible power supply (UPS)

    • use redundant power supplies in your servers

    • use ECC-RAM

    • use redundant network hardware

    • use RAID for local storage

    • use distributed, redundant storage for VM data

  • Reduce downtime

    • rapidly accessible administrators (24/7)

    • availability of spare parts (other nodes in a Proxmox VE cluster)

    • automatic error detection (provided by ha-manager)

    • automatic failover (provided by ha-manager)

Virtualization environments like Proxmox VE make it much easier to reach high availability because they remove the “hardware” dependency. They also support the setup and use of redundant storage and network devices, so if one host fails, you can simply start those services on another host within your cluster.

Better still, Proxmox VE provides a software stack called ha-manager, which can do that automatically for you. It is able to automatically detect errors and do automatic failover.

Proxmox VE ha-manager works like an “automated” administrator. First, you configure what resources (VMs, containers, …) it should manage. Then, ha-manager observes the correct functionality, and handles service failover to another node in case of errors. ha-manager can also handle normal user requests which may start, stop, relocate and migrate a service.

But high availability comes at a price. High quality components are more expensive, and making them redundant doubles the costs at least. Additional spare parts increase costs further. So you should carefully calculate the benefits, and compare with those additional costs.

Tip Increasing availability from 99% to 99.9% is relatively simple. But increasing availability from 99.9999% to 99.99999% is very hard and costly. ha-manager has typical error detection and failover times of about 2 minutes, so you can get no more than 99.999% availability.

Requirements

You must meet the following requirements before you start with HA:

  • at least three cluster nodes (to get reliable quorum)

  • shared storage for VMs and containers

  • hardware redundancy (everywhere)

  • use reliable “server” components

  • hardware watchdog - if not available we fall back to the Linux kernel software watchdog (softdog)

  • optional hardware fencing devices

Resources

We call the primary management unit handled by ha-manager a resource. A resource (also called “service”) is uniquely identified by a service ID (SID), which consists of the resource type and a type specific ID, for example vm:100. That example would be a resource of type vm (virtual machine) with the ID 100.

For now we have two important resource types - virtual machines and containers. One basic idea here is that we can bundle related software into such a VM or container, so there is no need to compose one big service from other services, as was done with rgmanager. In general, an HA managed resource should not depend on other resources.

Management Tasks

This section provides a short overview of common management tasks. The first step is to enable HA for a resource. This is done by adding the resource to the HA resource configuration. You can do this using the GUI, or simply use the command-line tool, for example:

# ha-manager add vm:100

The HA stack now tries to start the resources and keep them running. Please note that you can configure the “requested” resources state. For example you may want the HA stack to stop the resource:

# ha-manager set vm:100 --state stopped

and start it again later:

# ha-manager set vm:100 --state started

You can also use the normal VM and container management commands. They automatically forward the commands to the HA stack, so

# qm start 100

simply sets the requested state to started. The same applies to qm stop, which sets the requested state to stopped.

Note The HA stack works fully asynchronous and needs to communicate with other cluster members. Therefore, it takes some seconds until you see the result of such actions.

To view the current HA resource configuration use:

# ha-manager config
vm:100
        state stopped

And you can view the actual HA manager and resource state with:

# ha-manager status
quorum OK
master node1 (active, Wed Nov 23 11:07:23 2016)
lrm elsa (active, Wed Nov 23 11:07:19 2016)
service vm:100 (node1, started)

You can also initiate resource migration to other nodes:

# ha-manager migrate vm:100 node2

This uses online migration and tries to keep the VM running. Online migration needs to transfer all used memory over the network, so it is sometimes faster to stop the VM, then restart it on the new node. This can be done using the relocate command:

# ha-manager relocate vm:100 node2

Finally, you can remove the resource from the HA configuration using the following command:

# ha-manager remove vm:100
Note This does not start or stop the resource.

But all HA related tasks can be done in the GUI, so there is no need to use the command line at all.

How It Works

This section provides a detailed description of the Proxmox VE HA manager internals. It describes all involved daemons and how they work together. To provide HA, two daemons run on each node:

pve-ha-lrm

The local resource manager (LRM), which controls the services running on the local node. It reads the requested states for its services from the current manager status file and executes the respective commands.

pve-ha-crm

The cluster resource manager (CRM), which makes the cluster-wide decisions. It sends commands to the LRM, processes the results, and moves resources to other nodes if something fails. The CRM also handles node fencing.

Note
Locks in the LRM & CRM
Locks are provided by our distributed configuration file system (pmxcfs). They are used to guarantee that each LRM is active once and working. As an LRM only executes actions when it holds its lock, we can mark a failed node as fenced if we can acquire its lock. This then lets us recover any failed HA services securely without any interference from the now unknown failed node. This all gets supervised by the CRM which currently holds the manager master lock.

Service States

The CRM uses a service state enumeration to record the current service state. This state is displayed on the GUI and can be queried using the ha-manager command-line tool:

# ha-manager status
quorum OK
master elsa (active, Mon Nov 21 07:23:29 2016)
lrm elsa (active, Mon Nov 21 07:23:22 2016)
service ct:100 (elsa, stopped)
service ct:102 (elsa, started)
service vm:501 (elsa, started)

Here is the list of possible states:

stopped

Service is stopped (confirmed by LRM). If the LRM detects a stopped service is still running, it will stop it again.

request_stop

Service should be stopped. The CRM waits for confirmation from the LRM.

stopping

Pending stop request. But the CRM did not get the request so far.

started

Service is active, and the LRM should start it ASAP if not already running. If the service fails and is detected as not running, the LRM restarts it (see Start Failure Policy).

starting

Pending start request. But the CRM has not received any confirmation from the LRM that the service is running.

fence

Wait for node fencing, as the service node is not inside the quorate cluster partition (see Fencing). As soon as the node gets fenced successfully, the service will be placed into the recovery state.

recovery

Wait for recovery of the service. The HA manager tries to find a new node where the service can run on. This search depends not only on the list of online and quorate nodes, but also on whether the service is a group member and how such a group is limited. As soon as a new available node is found, the service will be moved there and initially placed into the stopped state. If it is configured to run, the new node will do so.

freeze

Do not touch the service state. We use this state while we reboot a node, or when we restart the LRM daemon (see Package Updates).

ignored

Act as if the service were not managed by HA at all. Useful when full control over the service is temporarily desired, without removing it from the HA configuration.

migrate

Migrate service (live) to other node.

error

Service is disabled because of LRM errors. Needs manual intervention (see Error Recovery).

queued

Service is newly added, and the CRM has not seen it so far.

disabled

Service is stopped and marked as disabled.

Local Resource Manager

The local resource manager (pve-ha-lrm) is started as a daemon on boot and waits until the HA cluster is quorate and thus cluster-wide locks are working.

It can be in three states:

wait for agent lock

The LRM waits for our exclusive lock. This is also used as idle state if no service is configured.

active

The LRM holds its exclusive lock and has services configured.

lost agent lock

The LRM lost its lock; this means a failure happened and quorum was lost.

After the LRM enters the active state, it reads the manager status file in /etc/pve/ha/manager_status and determines the commands it has to execute for the services it owns. For each command a worker gets started; these workers run in parallel and are limited to at most 4 by default. This default setting may be changed through the datacenter configuration key max_worker. When finished, the worker process gets collected and its result saved for the CRM.

Note
Maximum Concurrent Worker Adjustment Tips
The default value of at most 4 concurrent workers may be unsuited for a specific setup. For example, 4 live migrations may occur at the same time, which can lead to network congestion with slower networks and/or big (memory-wise) services. Also, ensure that in the worst case, congestion is kept to a minimum, even if this means lowering the max_worker value. Conversely, if you have a particularly powerful, high-end setup you may also want to increase it.
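
As an illustrative sketch only, such an override is placed in /etc/pve/datacenter.cfg; the exact option name (the chapter refers to it as max_worker, while the datacenter configuration reference spells it max_workers) should be verified against your release before relying on it:

max_workers: 8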

Each command requested by the CRM is uniquely identifiable by a UID. When the worker finishes, its result will be processed and written in the LRM status file /etc/pve/nodes/<nodename>/lrm_status. There the CRM may collect it and let its state machine - with respect to the command's output - act on it.

The actions on each service between CRM and LRM are normally always synced. This means that the CRM requests a state uniquely marked by a UID, the LRM then executes this action one time and writes back the result, which is also identifiable by the same UID. This is needed so that the LRM does not execute an outdated command. The only exceptions to this behaviour are the stop and error commands; these two do not depend on the result produced and are always executed in the case of the stopped state and once in the case of the error state.

Note
Read the Logs
The HA Stack logs every action it makes. This helps to understand what happens in the cluster and also why. Here it's important to see what both daemons, the LRM and the CRM, did. You may use journalctl -u pve-ha-lrm on the node(s) where the service is and the same command for pve-ha-crm on the node which is the current master.
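
For example (run the first command on the node(s) hosting the service, the second on the node that is the current master):

# journalctl -u pve-ha-lrm
# journalctl -u pve-ha-crm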

Cluster Resource Manager

The cluster resource manager (pve-ha-crm) starts on each node and waits there for the manager lock, which can only be held by one node at a time. The node which successfully acquires the manager lock gets promoted to the CRM master.

It can be in three states:

wait for agent lock

The CRM waits for our exclusive lock. This is also used as idle state if no service is configured.

active

The CRM holds its exclusive lock and has services configured.

lost agent lock

The CRM lost its lock; this means a failure happened and quorum was lost.

Its main task is to manage the services which are configured to be highly available and try to always enforce the requested state. For example, a service with the requested state started will be started if it's not already running. If it crashes, it will be automatically started again. Thus the CRM dictates the actions the LRM needs to execute.

When a node leaves the cluster quorum, its state changes to unknown. If the current CRM can then secure the failed node’s lock, the services will be stolen and restarted on another node.

When a cluster member determines that it is no longer in the cluster quorum, the LRM waits for a new quorum to form. Until there is a cluster quorum, the node cannot reset the watchdog. If there are active services on the node, or if the LRM or CRM process is not scheduled or is killed, this will trigger a reboot after the watchdog has timed out (this happens after 60 seconds).

Note that if a node has an active CRM but the LRM is idle, a quorum loss will not trigger a self-fence reset. The reason for this is that all state files and configurations that the CRM accesses are backed up by the clustered configuration file system, which becomes read-only upon quorum loss. This means that the CRM only needs to protect itself against its process being scheduled for too long, in which case another CRM could take over unaware of the situation, causing corruption of the HA state. The open watchdog ensures that this cannot happen.

If no service is configured for more than 15 minutes, the CRM automatically returns to the idle state and closes the watchdog completely.

HA Simulator

screenshot/gui-ha-manager-status.png

By using the HA simulator you can test and learn all functionalities of the Proxmox VE HA solutions.

By default, the simulator allows you to watch and test the behaviour of a real-world 3-node cluster with 6 VMs. You can also add or remove additional VMs or containers.

You do not have to set up or configure a real cluster; the HA simulator runs out of the box.

Install with apt:

apt install pve-ha-simulator

You can even install the package on any Debian-based system without any other Proxmox VE packages. For that you will need to download the package and copy it to the system you want to run it on for installation. When you install the package with apt from the local file system it will also resolve the required dependencies for you.
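
As a sketch of that local installation step (the file name is illustrative; substitute the actual .deb you downloaded - installing with apt from a local path resolves the dependencies automatically):

# apt install ./pve-ha-simulator_*.deb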

To start the simulator on a remote machine you must have an X11 redirection to your current system.

If you are on a Linux machine you can use:

ssh root@<IPofPVE> -Y

On Windows, it works with MobaXterm.

After connecting to an existing Proxmox VE with the simulator installed or installing it on your local Debian-based system manually, you can try it out as follows.

First you need to create a working directory where the simulator saves its current state and writes its default config:

mkdir working

Then, simply pass the created directory as a parameter to pve-ha-simulator:

pve-ha-simulator working/

You can then start, stop, migrate the simulated HA services, or even check out what happens on a node failure.

Configuration

The HA stack is well integrated into the Proxmox VE API. So, for example, HA can be configured via the ha-manager command-line interface, or the Proxmox VE web interface - both interfaces provide an easy way to manage HA. Automation tools can use the API directly.

All HA configuration files are within /etc/pve/ha/, so they get automatically distributed to the cluster nodes, and all nodes share the same HA configuration.

Resources

screenshot/gui-ha-manager-status.png

The resource configuration file /etc/pve/ha/resources.cfg stores the list of resources managed by ha-manager. A resource configuration inside that list looks like this:

<type>: <name>
        <property> <value>
        ...

It starts with a resource type followed by a resource specific name, separated by a colon. Together this forms the HA resource ID, which is used by all ha-manager commands to uniquely identify a resource (example: vm:100 or ct:101). The next lines contain additional properties:

comment: <string>

Description.

group: <string>

The HA group identifier.

max_relocate: <integer> (0 - N) (default = 1)

Maximal number of service relocate tries when a service fails to start.

max_restart: <integer> (0 - N) (default = 1)

Maximal number of tries to restart the service on a node after its start failed.

state: <disabled | enabled | ignored | started | stopped> (default = started)

Requested resource state. The CRM reads this state and acts accordingly. Please note that enabled is just an alias for started.

started

The CRM tries to start the resource. Service state is set to started after successful start. On node failures, or when start fails, it tries to recover the resource. If everything fails, the service state is set to error.

stopped

The CRM tries to keep the resource in stopped state, but it still tries to relocate the resources on node failures.

disabled

The CRM tries to put the resource in stopped state, but does not try to relocate the resources on node failures. The main purpose of this state is error recovery, because it is the only way to move a resource out of the error state.

ignored

The resource gets removed from the manager status and so the CRM and the LRM do not touch the resource anymore. All Proxmox VE API calls affecting this resource will be executed directly, bypassing the HA stack. CRM commands will be thrown away while the resource is in this state. The resource will not get relocated on node failures.

Here is a real world example with one VM and one container. As you see, the syntax of those files is really simple, so it is even possible to read or edit those files using your favorite editor:

Configuration Example (/etc/pve/ha/resources.cfg)
vm: 501
    state started
    max_relocate 2

ct: 102
    # Note: use default settings for everything
screenshot/gui-ha-manager-add-resource.png

The above config was generated using the ha-manager command-line tool:

# ha-manager add vm:501 --state started --max_relocate 2
# ha-manager add ct:102

Groups  

screenshot/gui-ha-manager-groups-view.png

The HA group configuration file /etc/pve/ha/groups.cfg is used to define groups of cluster nodes. A resource can be restricted to run only on the members of such a group. A group configuration looks like this:

group: <group>
       nodes <node_list>
       <property> <value>
       ...
comment: <string>

Description.

nodes: <node>[:<pri>]{,<node>[:<pri>]}*

List of cluster node members, where a priority can be given to each node. A resource bound to a group will run on the available nodes with the highest priority. If there are more nodes in the highest priority class, the services will get distributed to those nodes. The priorities have a relative meaning only. The higher the number, the higher the priority.

nofailback: <boolean> (default = 0)

The CRM tries to run services on the node with the highest priority. If a node with higher priority comes online, the CRM migrates the service to that node. Enabling nofailback prevents that behavior.

restricted: <boolean> (default = 0)

Resources bound to restricted groups may only run on nodes defined by the group. The resource will be placed in the stopped state if no group node member is online. Resources on unrestricted groups may run on any cluster node if all group members are offline, but they will migrate back as soon as a group member comes online. One can implement a preferred node behavior using an unrestricted group with only one member.

screenshot/gui-ha-manager-add-group.png

A common requirement is that a resource should run on a specific node. Usually the resource is able to run on other nodes, so you can define an unrestricted group with a single member:

# ha-manager groupadd prefer_node1 --nodes node1

For bigger clusters, it makes sense to define a more detailed failover behavior. For example, you may want to run a set of services on node1 if possible. If node1 is not available, you want to run them equally split on node2 and node3. If those nodes also fail, the services should run on node4. To achieve this you could set the node list to:

# ha-manager groupadd mygroup1 -nodes "node1:2,node2:1,node3:1,node4"

Another use case is if a resource uses other resources only available on specific nodes, let's say node1 and node2. We need to make sure that the HA manager does not use other nodes, so we need to create a restricted group with said nodes:

# ha-manager groupadd mygroup2 -nodes "node1,node2" -restricted

The above commands created the following group configuration file:

Configuration Example (/etc/pve/ha/groups.cfg)
group: prefer_node1
       nodes node1

group: mygroup1
       nodes node2:1,node4,node1:2,node3:1

group: mygroup2
       nodes node2,node1
       restricted 1

The nofailback option is mostly useful to avoid unwanted resource movements during administration tasks. For example, if you need to migrate a service to a node which doesn’t have the highest priority in the group, you need to tell the HA manager not to instantly move this service back by setting the nofailback option.

Another scenario is when a service was fenced and it got recovered to another node. The admin tries to repair the fenced node and brings it up online again to investigate the cause of failure and check if it runs stably again. Setting the nofailback flag prevents the recovered services from moving straight back to the fenced node.
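
For example, the flag could be enabled on the prefer_node1 group created above (a sketch; this assumes the groupset subcommand of ha-manager, which modifies an existing group, with the boolean written as 1):

# ha-manager groupset prefer_node1 --nofailback 1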

Fencing

On node failures, fencing ensures that the erroneous node is guaranteed to be offline. This is required to make sure that no resource runs twice when it gets recovered on another node. This is a really important task, because without this, it would not be possible to recover a resource on another node.

If a node did not get fenced, it would be in an unknown state where it may still have access to shared resources. This is really dangerous! Imagine that every network but the storage one broke. Now, while not reachable from the public network, the VM still runs and writes to the shared storage.

If we then simply start up this VM on another node, we would get a dangerous race condition, because we write from both nodes. Such conditions can destroy all VM data and the whole VM could be rendered unusable. The recovery could also fail if the storage protects against multiple mounts.

How Proxmox VE Fences

There are different methods to fence a node, for example, fence devices which cut off the power from the node or disable their communication completely. Those are often quite expensive and bring additional critical components into a system, because if they fail you cannot recover any service.

We thus wanted to integrate a simpler fencing method, which does not require additional external hardware. This can be done using watchdog timers.

Possible Fencing Methods
  • external power switches

  • isolate nodes by disabling complete network traffic on the switch

  • self fencing using watchdog timers

Watchdog timers have been widely used in critical and dependable systems since the beginning of microcontrollers. They are often simple, independent integrated circuits which are used to detect and recover from computer malfunctions.

During normal operation, ha-manager regularly resets the watchdog timer to prevent it from elapsing. If, due to a hardware fault or program error, the computer fails to reset the watchdog, the timer will elapse and trigger a reset of the whole server (reboot).

Recent server motherboards often include such hardware watchdogs, but these need to be configured. If no watchdog is available or configured, we fall back to the Linux Kernel softdog. While still reliable, it is not independent of the server's hardware, and thus has a lower reliability than a hardware watchdog.

Configure Hardware Watchdog

By default, all hardware watchdog modules are blocked for security reasons. They are like a loaded gun if not correctly initialized. To enable a hardware watchdog, you need to specify the module to load in /etc/default/pve-ha-manager, for example:

# select watchdog module (default is softdog)
WATCHDOG_MODULE=iTCO_wdt

This configuration is read by the watchdog-mux service, which loads the specified module at startup.

Recover Fenced Services

After a node failed and its fencing was successful, the CRM tries to move services from the failed node to nodes which are still online.

The selection of nodes on which those services get recovered is influenced by the resource group settings, the list of currently active nodes, and their respective active service count.

The CRM first builds a set out of the intersection between user-selected nodes (from the group setting) and available nodes. It then chooses the subset of nodes with the highest priority, and finally selects the node with the lowest active service count. This minimizes the possibility of an overloaded node.

Caution On node failure, the CRM distributes services to the remaining nodes. This increases the service count on those nodes, and can lead to high load, especially on small clusters. Please design your cluster so that it can handle such worst case scenarios.

Start Failure Policy

The start failure policy comes into effect if a service failed to start on a node one or more times. It can be used to configure how often a restart should be triggered on the same node and how often a service should be relocated, so that it has an attempt to be started on another node. The aim of this policy is to circumvent temporary unavailability of shared resources on a specific node. For example, if a shared storage isn’t available on a quorate node anymore, for instance due to network problems, but is still available on other nodes, the relocate policy allows the service to start nonetheless.

There are two service start recovery policy settings which can be configured specifically for each resource.

max_restart

Maximum number of attempts to restart a failed service on the actual node. The default is set to one.

max_relocate

Maximum number of attempts to relocate the service to a different node. A relocate only happens after the max_restart value is exceeded on the actual node. The default is set to one.

Note The relocate count state will only reset to zero when the service had at least one successful start. That means if a service is re-started without fixing the error, only the restart policy gets repeated.
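
Both limits are ordinary resource properties and can be adjusted at any time; for example (a sketch reusing the resource ID from the earlier examples):

# ha-manager set vm:100 --max_restart 2 --max_relocate 2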

Error Recovery

If, after all attempts, the service state could not be recovered, it gets placed in an error state. In this state, the service won’t get touched by the HA stack anymore. The only way out is disabling a service:

# ha-manager set vm:100 --state disabled

This can also be done in the web interface.

To recover from the error state you should do the following:

  • bring the resource back into a safe and consistent state (e.g.: kill its process if the service could not be stopped)

  • disable the resource to remove the error flag

  • fix the error which led to this failure

  • after you fixed all errors you may request that the service starts again, as sketched in the example below
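
Put together, a typical recovery sequence could look like this (a sketch using the example resource from above; the middle step stands for whatever manual cleanup and fix your failure requires):

# ha-manager set vm:100 --state disabled
  ... fix the root cause of the failure ...
# ha-manager set vm:100 --state started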

Package Updates

When updating the ha-manager, you should do one node after the other, never all at once, for various reasons. First, while we test our software thoroughly, a bug affecting your specific setup cannot totally be ruled out. Updating one node after the other and checking the functionality of each node after finishing the update helps to recover from potential problems, while updating all at once could result in a broken cluster and is generally not good practice.

Also, the Proxmox VE HA stack uses a request acknowledge protocol to perform actions between the cluster and the local resource manager. For restarting, the LRM makes a request to the CRM to freeze all its services. This prevents them from getting touched by the Cluster during the short time the LRM is restarting. After that, the LRM may safely close the watchdog during a restart. Such a restart happens normally during a package update and, as already stated, an active master CRM is needed to acknowledge the requests from the LRM. If this is not the case the update process can take too long which, in the worst case, may result in a reset triggered by the watchdog.

Node Maintenance

Sometimes it is necessary to perform maintenance on a node, such as replacing hardware or simply installing a new kernel image. This also applies while the HA stack is in use.

The HA stack can support you mainly in two types of maintenance:

  • for general shutdowns or reboots, the behavior can be configured, see Shutdown Policy.

  • for maintenance that does not require a shutdown or reboot, or that should not be switched off automatically after only one reboot, you can enable the manual maintenance mode.

Maintenance Mode

You can use the manual maintenance mode to mark the node as unavailable for HA operation, prompting all services managed by HA to migrate to other nodes.

The target nodes for these migrations are selected from the other currently available nodes, and determined by the HA group configuration and the configured cluster resource scheduler (CRS) mode. During each migration, the original node will be recorded in the HA manager's state, so that the service can be moved back again automatically once the maintenance mode is disabled and the node is back online.

Currently you can enable or disable the maintenance mode using the ha-manager CLI tool.

Enabling maintenance mode for a node
# ha-manager crm-command node-maintenance enable NODENAME

This will queue a CRM command; when the manager processes this command, it will record the request for maintenance mode in the manager status. This allows you to submit the command on any node, not just on the one you want to place in or out of maintenance mode.

Once the LRM on the respective node picks the command up, it will mark itself as unavailable, but still process all migration commands. This means that the LRM self-fencing watchdog will stay active until all active services have been moved and all running workers have finished.

Note that the LRM status will read maintenance mode as soon as the LRM has picked up the requested state, not only when all services have been moved away; this user experience is planned to be improved in the future. For now, you can check for any active HA service left on the node, or watch for a log line like pve-ha-lrm[PID]: watchdog closed (disabled), to know when the node has finished its transition into maintenance mode.
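
A rough way to perform those checks from a shell (ha-manager status lists where each HA service currently runs, and the grep pattern simply matches the log line quoted above):

# ha-manager status
# journalctl -u pve-ha-lrm | grep -i 'watchdog closed'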

Note The manual maintenance mode is not automatically deleted on node reboot, but only if it is either manually deactivated using the ha-manager CLI or if the manager-status is manually cleared.
Disabling maintenance mode for a node
# ha-manager crm-command node-maintenance disable NODENAME

The process of disabling the manual maintenance mode is similar to enabling it. Using the ha-manager CLI command shown above will queue a CRM command that, once processed, marks the respective LRM node as available again.

If you deactivate the maintenance mode, all services that were on the node when the maintenance mode was activated will be moved back.

Shutdown Policy

Below you will find a description of the different HA policies for a node shutdown. Currently Conditional is the default due to backward compatibility. Some users may find that Migrate behaves more as expected.

The shutdown policy can be configured in the Web UI (Datacenter → Options → HA Settings), or directly in datacenter.cfg:

ha: shutdown_policy=<value>
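
For example, to always migrate services away on a node shutdown or reboot (the value names correspond to the policies described below, written in lower case):

ha: shutdown_policy=migrate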

Migrate

Once the Local Resource Manager (LRM) gets a shutdown request and this policy is enabled, it will mark itself as unavailable for the current HA manager. This triggers a migration of all HA Services currently located on this node. The LRM will try to delay the shutdown process, until all running services get moved away. But, this expects that the running services can be migrated to another node. In other words, the service must not be locally bound, for example by using hardware passthrough. As non-group member nodes are considered as runnable targets if no group member is available, this policy can still be used when making use of HA groups with only some nodes selected. But, marking a group as restricted tells the HA manager that the service cannot run outside of the chosen set of nodes. If all of those nodes are unavailable, the shutdown will hang until you manually intervene. Once the shut down node comes back online again, the previously displaced services will be moved back, if they were not already manually migrated in-between.

Note The watchdog is still active during the migration process on shutdown. If the node loses quorum it will be fenced and the services will be recovered.

If you start a (previously stopped) service on a node which is currently being maintained, the node needs to be fenced to ensure that the service can be moved and started on another available node.

Failover

This mode ensures that all services get stopped, but that they will also be recovered, if the current node is not online soon. It can be useful when doing maintenance on a cluster scale, where live-migrating VMs may not be possible if too many nodes are powered off at a time, but you still want to ensure HA services get recovered and started again as soon as possible.

Freeze

This mode ensures that all services get stopped and frozen, so that they won’t get recovered until the current node is online again.

Conditional

The Conditional shutdown policy automatically detects if a shutdown or a reboot is requested, and changes behaviour accordingly.

Shutdown

A shutdown (poweroff) is usually done if it is planned for the node to stay down for some time. The LRM stops all managed services in this case. This means that other nodes will take over those services afterwards.

Note Recent hardware has large amounts of memory (RAM). So we stop all resources, then restart them to avoid online migration of all that RAM. If you want to use online migration, you need to invoke that manually before you shut down the node.

Reboot

Node reboots are initiated with the reboot command. This is usually done after installing a new kernel. Please note that this is different from “shutdown”, because the node immediately starts again.

The LRM tells the CRM that it wants to restart, and waits until the CRM puts all resources into the freeze state (the same mechanism is used for Package Updates). This prevents those resources from being moved to other nodes. Instead, the CRM starts the resources after the reboot on the same node.

Manual Resource Movement
手动资源移动

Last but not least, you can also manually move resources to other nodes, before you shut down or restart a node. The advantage is that you have full control, and you can decide if you want to use online migration or not.

Note Please do not kill services like pve-ha-crm, pve-ha-lrm or watchdog-mux. They manage and use the watchdog, so this can result in an immediate node reboot or even reset.

Cluster Resource Scheduling

The cluster resource scheduler (CRS) mode controls how HA selects nodes for the recovery of a service as well as for migrations that are triggered by a shutdown policy. The default mode is basic; you can change it in the Web UI (Datacenter → Options), or directly in datacenter.cfg:

crs: ha=static
screenshot/gui-datacenter-options-crs.png

The change will be in effect starting with the next manager round (after a few seconds).

For each service that needs to be recovered or migrated, the scheduler iteratively chooses the best node among the nodes with the highest priority in the service’s group.

Note There are plans to add modes for (static and dynamic) load-balancing in the future.

Basic Scheduler

The number of active HA services on each node is used to choose a recovery node. Non-HA-managed services are currently not counted.

Static-Load Scheduler

Important The static mode is still a technology preview.

Static usage information from HA services on each node is used to choose a recovery node. Usage of non-HA-managed services is currently not considered.

For this selection, each node in turn is considered as if the service was already running on it, using CPU and memory usage from the associated guest configuration. Then for each such alternative, CPU and memory usage of all nodes are considered, with memory being weighted much more, because it’s a truly limited resource. For both, CPU and memory, highest usage among nodes (weighted more, as ideally no node should be overcommitted) and average usage of all nodes (to still be able to distinguish in case there already is a more highly committed node) are considered.

Important The more services there are, the more possible combinations exist, so it’s currently not recommended to use it if you have thousands of HA managed services.

CRS Scheduling Points

The CRS algorithm is not applied for every service in every round, since this would mean a large number of constant migrations. Depending on the workload, this could put more strain on the cluster than could be avoided by constant balancing. That’s why the Proxmox VE HA manager favors keeping services on their current node.

The CRS is currently used at the following scheduling points:

  • Service recovery (always active). When a node with active HA services fails, all its services need to be recovered to other nodes. The CRS algorithm will be used here to balance that recovery over the remaining nodes.

  • HA group config changes (always active). If a node is removed from a group, or its priority is reduced, the HA stack will use the CRS algorithm to find a new target node for the HA services in that group, matching the adapted priority constraints.

  • HA service stopped → start transition (opt-in). Requesting that a stopped service should be started is a good opportunity to check for the best suited node as per the CRS algorithm, as moving stopped services is cheaper to do than moving them started, especially if their disk volumes reside on shared storage. You can enable this by setting the ha-rebalance-on-start CRS option in the datacenter config, as sketched below. You can change that option also in the Web UI, under Datacenter → Options → Cluster Resource Scheduling.
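
A minimal datacenter.cfg sketch that selects the static scheduler and opts into rebalancing on start (the property-string syntax follows the crs line shown earlier; treat the exact spelling as an assumption to verify against your release):

crs: ha=static,ha-rebalance-on-start=1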