
Jade: A High-throughput Concurrent Copying Garbage Collector

Mingyu Wu$^{1}$, Liang Mao$^{2}$, Yude Lin$^{2}$, Yifeng Jin$^{2}$, Zhe Li$^{1}$, Hongtao Lyu$^{1}$, Jiawei Tang$^{2}$, Xiaowei Lu$^{2}$, Hao Tang$^{2}$, Denghui Dong$^{2}$, Haibo Chen$^{1,3}$, Binyu Zang$^{1,3}$
$^{1}$ Institute of Parallel and Distributed Systems, SEIEE, Shanghai Jiao Tong University
$^{2}$ Alibaba Group
$^{3}$ Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China

Abstract

Garbage collection (GC) pauses are a notorious issue threatening the latency of applications. To mitigate this problem, state-of-the-art concurrent copying collectors allow GC threads to run simultaneously with application threads (mutators) in nearly all GC phases. However, the design of concurrent copying collectors does not always lead to low application latency. To this end, this work studies the behaviors of mainstream concurrent copying collectors in OpenJDK and mainly focuses on long application pauses under heavy workloads. By analyzing the design of those collectors, this work uncovers that lengthy pre-reclamation cycles (including GC phases before actual memory release), high GC frequency, and large metadata maintenance overhead are major factors for long pauses. Therefore, this work proposes Jade, a concurrent copying collector aiming to achieve both short pauses and high GC efficiency. Compared with existing collectors, Jade provides a group-wise collection mechanism to shorten pre-reclamation cycles while controlling GC frequency. It also embraces a generational heap layout and a single-phase algorithm to maximize young GC's throughput. The evaluation results on representative latency-critical applications show that Jade can reach sub-millisecond-level pauses even under heavy workloads and significantly improve applications' peak throughput compared with state-of-the-art concurrent collectors.

CCS Concepts: • Software and its engineering → Garbage collection.
Keywords: Language Runtime, Garbage Collection

ACM Reference Format:

Mingyu Wu, Liang Mao, Yude Lin, Yifeng Jin, Zhe Li, Hongtao Lyu, Jiawei Tang, Xiaowei Lu, Hao Tang, Denghui Dong, Haibo Chen, Binyu Zang. 2024. Jade: A High-throughput Concurrent Copying Garbage Collector. In Nineteenth European Conference on Computer Systems (EuroSys '24), April 22-25, 2024, Athens, Greece. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3627703.3650087

1 Introduction

Garbage collectors are one of the most critical modules in managed runtimes like Java Virtual Machine (JVM), JavaScript V8, and Go runtime. By automatically detecting dead objects (or garbage) and reclaiming memory resources, collectors remove the burden of manual memory management but also induce performance overhead. Pauses are a notorious issue introduced by collectors: to perform garbage collection (GC), language runtimes may require application threads (namely mutators) to stop running. Collectors pausing mutators during GC are usually called stop-the-world (STW) collectors. As the memory demand of applications increases, the duration of STW pauses becomes longer since GC threads need to scan and collect a larger heap, which affects application performance, especially for latency-critical ones. To this end, recent collectors are designed to reach short and controllable pause times regardless of the heap size [8, 10, 15, 18, 25, 27, 28, 33]. In those concurrent copying collectors, GC threads are running simultaneously with mutators to find live objects and copy them to new addresses, which can reach nearly pauseless collections.
However, short pauses come at a cost. Prior work [7, 39, 41] has shown that the design of low-pause concurrent copying collectors does not always translate to short latency for applications due to their overhead like costly read/write barriers (code instrumentation), poor GC efficiency, and frequent interferences with mutators. Nevertheless, it is still not clear what exactly causes unsatisfying latency, especially under heavy workloads (e.g., serving a large number of requests within a short period of time). To this end, this work provides a detailed study of the memory behaviors of two state-of-the-art concurrent copying collectors in OpenJDK (ZGC [25] and Shenandoah [15]). We mainly focus on the behaviors when running applications with heavy workloads, since those collectors introduce long pauses to mutators (similar to pauses in STW collectors), which significantly affect the throughput and latency of applications.
To understand those pauses, we explore the designs inside mainstream concurrent collectors and analyze their performance issues under heavy workloads. For Shenandoah, its generation-wise collection releases memory resources only after all chosen live objects are evacuated and corresponding references are updated, which induces lengthy pre-reclamation cycles and thus causes long STW pauses due to free memory shortage. Meanwhile, the number of objects collected in each GC cycle is also restricted by the remaining free space size, which contributes to higher GC frequency and more pauses. As for ZGC, although its region-wise collection allows incremental memory reclamation and lazy reference updating, its reference memorization mechanism (called color pointers) introduces significant performance overhead and even longer mutator stalls (the same effect as pauses).
According to limitations found in existing concurrent collectors, this work introduces Jade, a collector to achieve high GC efficiency while maintaining low pauses even under heavy workloads. Although sharing similarities in design with ZGC (region-wise) and Shenandoah (generation-wise), Jade leverages groups (a set of regions) as the basic unit of collection. It further divides the collection into multiple rounds where each round collects and releases memory resources consumed by one group. Thanks to groups, Jade achieves per-group incremental reclamation and avoids overhead introduced by color pointers. However, it needs to handle performance issues brought by grouping. To this end, Jade provides a simulation-based algorithm to finish grouping in less than one millisecond and proposes group-based remembered sets and cross-region discover table (CRDT) for low-cost reference memorization.
To reduce the pre-reclamation phase, Jade also embraces a generational design and leverages young GC to collect newly created objects. Instead of reusing the group-wise collection, Jade provides a specialized single-phase mechanism to maximize the young GC throughput. When facing inevitable pauses or application stalls, Jade also provides a chasing mode to fully leverage idle CPU cores to complete them as soon as possible.
Jade has been implemented in OpenJDK and evaluated against state-of-the-art production garbage collectors (G1, ZGC, Shenandoah) and recent efforts (LXR [41]). The evaluation uses a series of representative latency-sensitive applications, including a widely used online service deployed in Alibaba. The results show that Jade can significantly improve the application's maximum throughput while still inducing controllable low pauses and satisfying application latency.
To summarize, the contributions of this work include:
  • A detailed analysis of pauses in concurrent collectors to uncover their deficiencies in design (§2);
  • Jade, a concurrent generational collector leveraging group-wise collection to achieve both high GC efficiency and low pauses (§3-4);
  • Comprehensive evaluation with representative latency-sensitive applications (including commercial online services) to show Jade's improvement over state-of-the-art collectors (§5).

2 Analysis on Concurrent Copying GC

2.1 Concurrent copying garbage collectors

Although garbage collectors are helpful by automatically reclaiming memory resources for later reuse, they also introduce performance overhead to applications. Stop-the-world (STW) pauses are one of the most severe problems introduced by collectors since they require application threads (known as mutators) to pause until GC finishes. Although GC threads can leverage STW pauses to monopolize CPU resources for high-throughput collection, the duration of pauses becomes intolerable for applications with large memory demands and strict latency requirements. Therefore, nowadays concurrent collectors are trying to minimize pauses by allowing GC threads to run concurrently with mutators. Concurrent copying collectors are one of the most important categories in concurrent collectors. The collection in concurrent copying collectors mainly consists of two phases: concurrent marking to locate live objects, and concurrent evacuation to copy live objects into new addresses in a compact way. Various language runtimes have embraced concurrent copying for their garbage collection mechanisms. Taking OpenJDK, the mainstream runtime for Java, as an example, it has introduced two concurrent copying collectors, Shenandoah [15] and ZGC [25], and both of them only induce sub-millisecond-level pauses. Other language runtimes like Azul JDK [33] and Android Runtime (ART) [29] also provide concurrent copying collectors.

2.2 Characterizing pauses in concurrent copying collectors

Interestingly, although existing concurrent copying collectors are designed to control the duration of pauses, they still induce large pauses, especially under heavy workload. We have analyzed the performance of two representative concurrent copying collectors in OpenJDK: ZGC and Shenandoah. The evaluated application derives from the DaCapo benchmark [2], which uses the TPC-C workload [34] to evaluate a relational database, H2 [17]. The application runs with eight physical cores to simulate a common web service scenario, which is widely deployed in Alibaba. The maximum Java heap size is configured as 8 GB, which is quite generous compared with the live data size (about 2 GB). Meanwhile, the number of concurrent GC threads is set to two for a fair comparison (the default number for ZGC is one while the others are two). Since JVMs require a warmup phase to improve application performance, we evaluate H2 with six iterations while the first one is used for warmup. Each iteration finishes when H2 has processed a fixed number of TPC-C transactions (256000), and the following results are average numbers among the five iterations.
We first run the application by repetitively sending requests to maximize the application's throughput and compare those concurrent copying collectors with G1 [13], which copies objects in a STW fashion and thus achieves better collection efficiency. Table 1 shows the throughput (queries per second) for three collectors under this setting. Compared with G1, both concurrent collectors cause the application's throughput to drop: they only reach 23.25% (ZGC) and 46.52% (Shenandoah) of G1's maximum throughput. When under their own maximum throughput, the p99 latency for both concurrent collectors is also worse than G1.
To further understand the performance, we analyze the memory behaviors in Shenandoah and ZGC by setting the application throughput close to the maximum (9600 and 4800 for Shenandoah and ZGC, respectively). Note that we fix the throughput by throttling clients to control the send rate (details in Section 5.5). As shown in Table 1, both ZGC and Shenandoah induce quite long pauses to mutators when under heavy workload. For Shenandoah, it triggers degenerated collections when facing heavy memory pressure, which pauses all mutators. As for ZGC, although it does not directly introduce STW pauses, mutators can be stalled when no free memory is available, which has the same effect as a pause in STW collectors. As shown in Table 1, the cumulative pause time for both Shenandoah and ZGC is larger than that of G1 (even though G1's application throughput is 20800). For ZGC, the cumulative pause time is even 30.36× larger than G1 (1.81× for Shenandoah). Meanwhile, their p99 pause time is also larger than G1, which suggests that those long pauses are not rare and thus significantly affect application latency. The results indicate that reducing those pauses is vital to improving the performance of both GC and applications for existing concurrent copying collectors. To this end, we further explore the reasons for long pauses by analyzing the design of those two collectors.
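The relative figures quoted above can be re-derived from the raw numbers in Table 1. The following is only a small arithmetic check; the dictionary layout and the helper name `relative_to_g1` are illustrative choices, not part of the evaluation setup.

```python
# Figures transcribed from Table 1 of this paper.
TABLE1 = {
    # collector: (max throughput in tx/s, cumulative pause time in ms)
    "G1": (21208, 447.32),
    "ZGC": (4931, 13581.87),
    "Shenandoah": (9865, 809.22),
}


def relative_to_g1(collector):
    """Return (throughput as a % of G1, cumulative pause as a multiple of G1)."""
    g1_thru, g1_pause = TABLE1["G1"]
    thru, pause = TABLE1[collector]
    return round(100 * thru / g1_thru, 2), round(pause / g1_pause, 2)


for name in ("ZGC", "Shenandoah"):
    pct, factor = relative_to_g1(name)
    print(f"{name}: {pct}% of G1 throughput, {factor}x G1 cumulative pause")
```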

2.3 Shenandoah: heap-wise collection

Shenandoah tends to treat the whole heap as a monolithic unit. Figure 1 illustrates three major phases in Shenandoah. The first marking phase traverses all equal-sized regions inside Shenandoah's heap to locate live objects and store their address information in a live bitmap. Afterward, Shenandoah evacuates objects marked in the live bitmap to free regions (marked with a different color). During evacuation, Shenandoah stores an object's new address into the header of its old copy. Since the number of free regions might be limited, only a part of the regions can be evacuated. Finally, Shenandoah scans the live bitmap again to update references so that they all point to an object's newest address (with the help of marked headers). Only after the reference updating phase can the memory resources be recycled.

Figure 1. Concurrent phases in Shenandoah (suppose only one region is collected). Colored rectangles stand for live objects while lines with arrows stand for references.
Although this design has advantages like low memory overhead (only one bitmap is used to memorize live objects), it introduces the following two performance issues.
  • Longer pre-reclamation cycle. Since Shenandoah only maintains live objects' information in a simple bitmap, it does not know exactly which objects hold references to copied ones. Therefore, GC threads have to check all live objects for reference updates, and free regions can be reclaimed only when all phases are finished, which lengthens the time before the memory resources can be reused (we refer to this cycle as a pre-reclamation cycle). Meanwhile, each live object needs to be accessed at least three times in one GC cycle, which adds more performance overhead and also affects the duration of pre-reclamation.
  • Higher collection frequency. Shenandoah's reclamation efficiency heavily relies on the number of free regions in its one-generation heap. If the heap only contains a few free regions, the number of reclaimed bytes is limited, and the frequency of GC would increase.
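To make the cost of the three-phase cycle concrete, the following toy model replays marking, evacuation, and reference updating on a three-object heap. It is a minimal sketch under simplifying assumptions, not Shenandoah's implementation: the `Obj` class, the explicit `forward` field (standing in for the address stored in the old copy's header), and all function names are hypothetical, and everything runs on a single thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based hashing, so objects can be kept in a set
class Obj:
    region: int                                       # region holding this copy
    refs: List["Obj"] = field(default_factory=list)   # outgoing references
    forward: Optional["Obj"] = None                   # new copy, recorded in the old header


def resolve(obj):
    """Follow the forwarding pointer if the object has been evacuated."""
    return obj.forward if obj.forward is not None else obj


def mark(roots):
    """Phase 1: trace from the roots and record every live object."""
    live, seen, stack = [], set(), list(roots)
    while stack:
        obj = stack.pop()
        if obj not in seen:
            seen.add(obj)
            live.append(obj)
            stack.extend(obj.refs)
    return live


def evacuate(live, from_region, to_region):
    """Phase 2: copy live objects out of from_region, leaving forwarding pointers."""
    for obj in live:
        if obj.region == from_region:
            obj.forward = Obj(to_region, list(obj.refs))


def update_references(live, roots):
    """Phase 3: visit every live object, since nothing records who points where."""
    visited = 0
    for obj in live:
        current = resolve(obj)
        current.refs = [resolve(r) for r in current.refs]
        visited += 1
    roots[:] = [resolve(r) for r in roots]
    return visited


# Toy heap: a (region 0) -> b (region 1) -> c (region 0); region 2 is free.
c = Obj(region=0)
b = Obj(region=1, refs=[c])
a = Obj(region=0, refs=[b])
roots = [a]

live = mark(roots)
evacuate(live, from_region=0, to_region=2)
visited = update_references(live, roots)
regions_in_use = {obj.region for obj in mark(roots)}
print(visited, sorted(regions_in_use))
```

In this toy heap only `b` holds a stale reference after evacuation, yet the update pass still visits all three live objects, and region 0 becomes unreachable (and thus reclaimable) only once that pass has finished.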
Table 1. Application and pause statistics for three mainstream collectors.

| Collectors | Max Thru. (tx/s) | p99 latency (ms) | Cumulative pause time (ms) | p99 pause time (ms) |
| :--- | :--- | :--- | :--- | :--- |
| G1 | 21208 | 34.20 | 447.32 | 82.77 |
| ZGC | 4931 | 128.11 | 13581.87 | 1963.38 |
| Shenandoah | 9865 | 37.65 | 809.22 | 297.39 |
Those two issues make GC a severe performance bottleneck with heavy workloads. When the application's allocation rate increases, mutators may exhaust all available free space and have to wait until the pre-reclamation cycle finishes. Since Shenandoah's pre-reclamation cycle is quite long, the pause time could also become longer. Meanwhile, higher collection frequency suggests more pre-reclamation cycles are required, which induces more performance overhead and potentially more pauses. As shown in Table 2, the three phases in Shenandoah (Marking and Other) together account for 24.89 seconds, which means GC threads are active for over 90% of execution time and potentially block mutators from memory allocation. Worse still, although collections are long and frequent, their marking results are actually quite similar. We have dumped the live bitmap generated in Shenandoah's marking cycles and analyzed regions not moved by the subsequent evacuation phase. By comparing those regions in the live bitmaps for two consecutive cycles, we find that only 0.53% of objects are different, which suggests that the high frequency of marking is actually unnecessary. Therefore, the key to reducing Shenandoah's overhead would be decreasing both the duration of pre-reclamation and the whole-heap GC frequency.
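As a quick check of the figures above, the share of execution time spent in pre-reclamation can be recomputed from the Table 2 breakdown. This is only an arithmetic sketch; the dictionary keys and function names are illustrative, and the missing Other column for ZGC is treated as zero.

```python
# Seconds, transcribed from Table 2 (Other is not reported for ZGC).
TABLE2 = {
    "ZGC": {"app": 56.95, "marking": 52.90, "other": 0.0},
    "Shenandoah": {"app": 27.40, "marking": 13.56, "other": 11.33},
}


def pre_reclamation_seconds(collector):
    """Total wall-clock time spent in marking plus the other pre-reclamation phases."""
    row = TABLE2[collector]
    return round(row["marking"] + row["other"], 2)


def active_share(collector):
    """Percentage of the overall execution time covered by pre-reclamation."""
    return round(100 * pre_reclamation_seconds(collector) / TABLE2[collector]["app"], 2)


for name in TABLE2:
    print(name, pre_reclamation_seconds(name), active_share(name))
```

The same calculation applied to the ZGC row gives the 92.89% share discussed in Section 2.4.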

2.4 ZGC: region-wise collection

Compared with Shenandoah, the algorithm of ZGC is more incremental and region-wise: regions can be reclaimed independently. As illustrated in Figure 2, the vanilla ZGC has a similar marking phase to Shenandoah, which marks the whole heap to find live objects. However, when all live objects in a region have been evacuated, the region can be immediately reclaimed without waiting for reference updating. ZGC achieves this by using color pointers, which encode information inside object references to determine if the referred object requires updating. Furthermore, it also maintains per-region forwarding tables storing mappings between an object’s old address and the new one. With this design, reference updating can be lazily conducted during mutator execution (for example, when a stale reference is loaded) or in the next marking phase (ultimately updating all stale references to live objects).
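The interplay between color pointers, per-region forwarding tables, and lazy reference updating can be sketched as follows. This is a deliberately simplified model rather than ZGC's actual scheme: it assumes only two color bits at hypothetical positions (the text mentions four), a 2 MiB region size, and single-threaded execution, and names such as `load_barrier` and `record_relocation` are made up for illustration.

```python
ADDRESS_BITS = 44                          # assumed width of the address part
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1
COLOR_A = 1 << ADDRESS_BITS                # two hypothetical color bits
COLOR_B = 1 << (ADDRESS_BITS + 1)
REGION_SIZE = 2 * 1024 * 1024              # assumed region size

good_color = COLOR_A                       # references carrying this color are up to date
forwarding_tables = {}                     # region index -> {old address: new address}


def start_relocation_phase():
    """Flip the good color so that every existing reference becomes stale."""
    global good_color
    good_color = COLOR_B if good_color == COLOR_A else COLOR_A


def record_relocation(old_addr, new_addr):
    """Remember where an evacuated object went, keyed by its old region."""
    forwarding_tables.setdefault(old_addr // REGION_SIZE, {})[old_addr] = new_addr


def load_barrier(slots, index):
    """Load a reference; if it is stale, remap it and heal the slot in place."""
    ref = slots[index]
    if ref & good_color:                   # fast path: already up to date
        return ref
    addr = ref & ADDRESS_MASK
    table = forwarding_tables.get(addr // REGION_SIZE, {})
    healed = table.get(addr, addr) | good_color
    slots[index] = healed                  # later loads take the fast path
    return healed


slots = [0x1000 | COLOR_A]                 # one heap slot referring to address 0x1000
start_relocation_phase()
record_relocation(0x1000, REGION_SIZE + 0x40)
first = load_barrier(slots, 0)
second = load_barrier(slots, 0)
print(hex(first & ADDRESS_MASK), first == second)
```

Because the mapping lives in the forwarding table rather than in the evacuated region itself, the old region can be reused before any slot pointing into it has been healed.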
Figure 2. Concurrent phases in ZGC.

This region-wise design seems to shorten the pre-reclamation cycle by moving evacuation and reference updating off the critical path. Unfortunately, the application stall becomes even longer as shown in Table 1. Although ZGC's pre-reclamation only contains the marking phase, its duration is much longer than Shenandoah (2.40×) and remains active in 92.89% of the overall execution time of H2 (Table 2). The overhead mainly comes from ZGC's color pointer design. Since it requires four extra bits in virtual addresses to encode reference-related information, the virtual memory size of Java heaps is enlarged by 16× and makes the reference compression optimization [21] (encoding 64-bit heap references to 32-bit offsets from the start address of heap) impractical, which affects both GC and application performance. Meanwhile, to fix all stale references, the marking phase in ZGC needs to recolor them by modifying their encoded bits, which introduces a large number of atomic instructions. To simulate the overhead, we have disabled the reference compression optimization and added a dummy compare-and-swap instruction before marking each object in Shenandoah. With those modifications, H2's peak throughput on Shenandoah drops from 9865 to 6052, which is close to that in ZGC (note that disabling compressed references also significantly affects mutators' performance in addition to GC).
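The 16× factor and the conflict with reference compression both follow from simple bit arithmetic, which the sketch below spells out. The position of the color bit (`COLOR_SHIFT`) and the heap base address are assumptions chosen for illustration, and the compression shown is the plain 32-bit offset form described in the text.

```python
COLOR_BITS = 4        # extra bits per reference (from the text)
COLOR_SHIFT = 44      # assumed position of the lowest color bit


def address_space_factor(color_bits):
    """Each extra bit doubles the range of values a reference can take."""
    return 2 ** color_bits


def can_compress(ref, heap_base):
    """A reference compresses only if its offset from the heap base fits in 32 bits."""
    return 0 <= ref - heap_base < 2 ** 32


def compress(ref, heap_base):
    if not can_compress(ref, heap_base):
        raise ValueError("offset does not fit in 32 bits")
    return ref - heap_base


def decompress(offset, heap_base):
    return heap_base + offset


HEAP_BASE = 0x4000_0000                        # hypothetical heap start
plain_ref = HEAP_BASE + 0x1000
colored_ref = plain_ref | (1 << COLOR_SHIFT)   # same object, one color bit set

print(address_space_factor(COLOR_BITS))
print(decompress(compress(plain_ref, HEAP_BASE), HEAP_BASE) == plain_ref)
print(can_compress(colored_ref, HEAP_BASE))
```

Once a color bit is set in the upper part of the reference, its distance from the heap base no longer fits in 32 bits, so the compressed encoding cannot represent it.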

2.5 Discussion: generational variants

To achieve better collection efficiency, both ZGC and Shenandoah have developed their own generational variants (referred to as GenZ and GenShen), which typically divide the heap into two generations. The young generation is used to serve memory allocation requests, and a separate young GC cycle is provided to only collect the young generation. Meanwhile, a full GC cycle is only triggered when memory resources in the whole heap become exhausted. Since the young generation is relatively smaller, the pre-reclamation cycle is largely shortened, which helps reduce application stalls and pauses. Nevertheless, the design of GenZ's and GenShen's young GC algorithm is similar to the corresponding single-generation one. For GenZ, the young GC algorithm still contains the overhead of color pointers. As for GenShen, it turns into a generation-wise collector which still collects the young generation in three phases. Therefore, they still induce considerable overhead. We will show the detailed results of GenZ and GenShen in Section 5.

Table 2. Breakdown analysis for concurrent copying collectors on the H2 application. The first three columns show the duration for overall execution (App.), GC marking (Marking), and other phases in the pre-reclamation cycle (Other, only exists in Shenandoah). The next two show the average time for marking and other phases for each GC cycle, while the last two show the cumulative pause time in marking and other phases. All results are wall-clock time in seconds.

| Collectors | App. | Marking | Other | Avg. Marking | Avg. Other | Cumulative Pause Marking | Cumulative Pause Other |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| ZGC | 56.95 | 52.90 | - | 2.40 | - | 13.39 | - |
| Shenandoah | 27.40 | 13.56 | 11.33 | 1.00 | 0.81 | 0.12 | 0.68 |

2.6 Summary

According to the performance analysis, we find that reducing the duration of pre-reclamation cycles is critical to control pauses when under a heavy workload. Meanwhile, the collector should also avoid introducing large runtime overhead. We therefore build our own collector to achieve both goals.

3 Group-wise design in Jade

3.1 Overview

This work introduces Jade, a high-throughput concurrent copying collector achieving both low GC pauses and satisfactory collection efficiency. Compared with prior concurrent copying collectors (heap/generation-wise or region-wise), Jade provides a new abstraction named group as a unit of reclamation, which is the core of Jade's design. Thanks to the group-wise collection mechanism, Jade can achieve incremental reclamation while inducing moderate performance overhead.
Analogous to prior concurrent copying collectors, Jade also adopts equal-sized regions to organize its heap. When GC is triggered, Jade also contains a concurrent marking phase which generates a live bitmap to memorize live objects in its regions. The live bitmap uses each bit to denote if a memory range of 8 bytes contains the header of a live object, so the bitmap's memory consumption is 1.56% of the heap size. Marking is followed by a special grouping phase, which divides the heap into collection groups where each group contains several regions. The reclamation phase is then divided into multiple rounds. Each round only evacuates regions in one group, i.e., copying live objects in those regions to free regions in the heap. Jade releases memory resources consumed by a group immediately after the corresponding round ends. To coordinate concurrent operations from mutators, Jade adopts loaded value barriers (also used in ZGC) to ensure mutators always hold references to an object's latest copy. With the group-wise design, Jade achieves group-level incremental reclamation since free regions can be recycled when each group is evacuated. Meanwhile, it also leverages the same marking results for all rounds in one collection cycle, which reduces collection frequency and performance overhead compared with an algorithm like Shenandoah which treats the whole heap/generation as the reclamation unit.
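The 1.56% figure follows from the bitmap's granularity: one bit per 8 bytes is one bit per 64 bits of heap, i.e. 1/64 ≈ 1.56%. The sketch below shows one possible way to index such a bitmap; the class and method names are hypothetical, and the 1 MiB heap is only a convenient example size, not a configuration used in this work.

```python
GRANULE_BYTES = 8  # one bit covers 8 bytes of heap (from the text)


class LiveBitmap:
    """One bit per 8-byte granule; a set bit means a live object's header starts there."""

    def __init__(self, heap_base, heap_size):
        self.heap_base = heap_base
        self.heap_size = heap_size
        n_bits = heap_size // GRANULE_BYTES
        self.bits = bytearray((n_bits + 7) // 8)

    def _bit_index(self, addr):
        return (addr - self.heap_base) // GRANULE_BYTES

    def mark(self, header_addr):
        i = self._bit_index(header_addr)
        self.bits[i // 8] |= 1 << (i % 8)

    def is_marked(self, header_addr):
        i = self._bit_index(header_addr)
        return bool(self.bits[i // 8] & (1 << (i % 8)))

    def overhead(self):
        """Bitmap size as a fraction of the heap size."""
        return len(self.bits) / self.heap_size


bitmap = LiveBitmap(heap_base=0x10000, heap_size=1 << 20)  # hypothetical 1 MiB heap
bitmap.mark(0x10000 + 64)
print(bitmap.is_marked(0x10000 + 64), bitmap.is_marked(0x10000 + 72))
print(f"{bitmap.overhead():.4%}")
```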
Nevertheless, the results on ZGC suggest that an incremental reclamation algorithm is not enough. The design of a group-wise algorithm should also control its runtime overhead, which mainly consists of two parts: (1) the grouping algorithm to assign regions to groups and (2) reference memorization for subsequent reference updating. To this end, Jade proposes designs for those two parts separately.

3.2 Simulation-based hand-over-hand grouping

As introduced before, the group abstraction can be seen as a middle ground between generation and region: when each group only contains one region, the evacuation is similar to ZGC; when only one group is generated in a GC cycle, the behavior is close to Shenandoah. Therefore, the grouping algorithm is important for Jade’s collection efficiency. Meanwhile, the time required for grouping should be short as it is included in the pre-reclamation cycle. To this end, Jade proposes a simulation-based hand-over-hand grouping algorithm.
After marking is finished, Jade simulates a hand-over-hand compaction [20] that assumes a former group's released memory is directly reused by the evacuation of a latter one. As illustrated by Algorithm 1, Jade first constructs a tracked list to include regions to be evacuated in the current GC cycle (lines 1-6). This step filters out regions whose live bytes exceed a preset threshold (85% by default) to avoid inducing large memory copying overhead. The tracked list is then sorted by the number of live bytes so that the evacuation can start with the regions containing the fewest live bytes. This part of the algorithm is similar to the construction of collection sets in collectors like G1 and Shenandoah, but Jade needs to further split the list into multiple groups to achieve group-wise collection. It starts by estimating the free space size available for evacuation (line 9, more details in Section 4.2) and adds regions to the first group until the accumulated live bytes exceed the overall free bytes (lines 13-22). Jade uses the size of the first group for all subsequent groups as well (line 23). This design is used to control the group size: since Jade compacts live objects in multiple used regions to free ones, the number of free regions gradually increases every time a group is reclaimed. If all released regions were used for a per-group evacuation, later groups would grow and the pre-reclamation time would become longer, which could hinder mutators from memory allocation. For subsequent groups, since the number of regions is fixed, Jade only needs to continuously fill them up until the tracked list is drained (lines 26-33). Lastly, Jade sets a maximum number of groups (16 by default) to avoid a long collection cycle with too many groups, and the regions remaining in the tracked list are not collected once Jade has constructed enough groups (lines 34-36). Note that since Jade's grouping mechanism is based on a simulation, it requires no data copying and thus only introduces trivial (microsecond-level) overhead. Other advanced policies (e.g., controlling a group's size by setting the maximum number of regions) are also possible to tune Jade's performance.
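
The grouping steps of Algorithm 1 can be sketched in runnable form as follows. This is a minimal illustration rather than Jade's implementation: the `Region` class and the `build_groups` function are hypothetical names, and the free-space estimate is passed in as a plain number instead of being computed as in Section 4.2.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical stand-in for a heap region; only the fields Algorithm 1 reads."""
    rid: int
    live_bytes: int
    size_bytes: int


def build_groups(old_regions, free_bytes, threshold=0.85, max_groups=16):
    """Simulate hand-over-hand grouping and return a list of groups (lists of regions)."""
    # Keep only regions sparse enough to be worth copying, fewest live bytes first.
    tracked = sorted(
        (r for r in old_regions if r.live_bytes / r.size_bytes < threshold),
        key=lambda r: r.live_bytes,
    )
    groups = []
    group_size = 0
    pos = 0  # index of the next unassigned region in `tracked`
    while pos < len(tracked) and len(groups) < max_groups:
        group = []
        if not groups:
            # First group: bounded by the estimated free space.
            while pos < len(tracked):
                region = tracked[pos]
                pos += 1  # as in the pseudocode, the region is taken before the check
                free_bytes -= region.live_bytes
                if free_bytes < 0:
                    break  # this region does not fit and is left uncollected
                group.append(region)
            group_size = len(group)
        else:
            # Later groups: reuse the first group's region count.
            while len(group) < group_size and pos < len(tracked):
                group.append(tracked[pos])
                pos += 1
        groups.append(group)
    return groups
```

No object is copied while the groups are computed, which is why the text can describe the grouping cost as trivial: the work is a sort plus one linear pass over the tracked list.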

3.3 Group-wise remembered sets

After grouping, Jade can start following its simulation process to evacuate live objects in each group. The remaining challenge is how to guarantee regions in a group are free to release after live objects have been moved to free regions. Considering the overhead of color pointers, Jade embraces remembered sets to memorize references at a coarser granularity.
Remembered sets are usually used to memorize incoming references to a given memory range (e.g., a region). Taking regions as an example, the collector can maintain a remembered set for each region (G1 has a similar design for its remembered sets), which is usually implemented with a bitmap where each bit corresponds to a 512-byte memory range (referred to as a card) in the heap. After the marking phase, GC threads can scan the live bitmap to locate cross-region references. Suppose a reference r in region x points to an object in another region y. GC threads then need to (1) acquire the remembered set for region y and (2) set the bit for the card where r resides. When a region is evacuated, GC threads only need to use its remembered set, scan the memory ranges whose bits are set, and update the incoming references therein.
Algorithm 1 Constructing collection groups
    for region in old_regions do
        if region.live_bytes/region.size_bytes <
            threshold then
            tracked_list.add(region)
        end if
    end for
    # Sorting the tracked regions (organized as a list) by the size of live bytes
    tracked_list.sort()
    free_bytes ← estimate_free_space()
    while !tracked_list.is_empty() do
        group ← empty
        if groups.is_empty() then
            # Constructing the first group
            while !tracked_list.is_empty() do
                region ← tracked_list.take_first()
                free_bytes ← free_bytes -
                        region.live_bytes
                if free_bytes < 0 then
                    break
                end if
                group.add(region)
            end while
            group_size ← group.size()
        else
            # Constructing subsequent groups
            while group.size() < group_size do
                if tracked_list.is_empty() then
                    break
                end if
                group.add(tracked_list.take_first())
            end while
        end if
        groups.add(group)
        if groups.size() >= MAX_GROUP then
            break
        end if
    end while
    # Output groups
Although they enable incremental reclamation, remembered sets have three disadvantages compared with color pointers. First, since the references are memorized at a coarse (card-level) granularity, GC threads have to scan more objects to find and update cross-region references. Second, the memory overhead is proportional to the number of regions, which can be quite large (usually more than thousands). Third, remembered sets need to be rebuilt after each marking phase, which potentially lengthens the pre-reclamation cycle. To mitigate the first two problems, Jade provides group-wise remembered sets, where regions in the same group share a remembered set. Since regions in the same group are released together, we do not need to memorize inter-region references inside them, which reduces the re-scanning overhead. Meanwhile, the memory overhead is also reduced as the number of groups is much smaller. Since each bitmap consumes only 1/4096 of the heap size, the overall memory overhead is only 0.39% for 16 groups.
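
A group-wise remembered set can be sketched as one card bitmap per group. The class below is an illustration under stated assumptions, not Jade's data layout: `GroupRememberedSets`, `record`, and `dirty_cards` are hypothetical names, addresses are plain integers starting at 0, and regions that are not being collected are simply absent from `group_of_region`.

```python
CARD_SIZE = 512  # bytes covered by one remembered-set bit, as in the text


class GroupRememberedSets:
    """One card bitmap per collection group (a sketch, not Jade's actual layout)."""

    def __init__(self, heap_bytes, region_bytes, group_of_region):
        self.heap_bytes = heap_bytes
        self.region_bytes = region_bytes
        self.group_of_region = group_of_region  # region index -> group id
        self.num_cards = heap_bytes // CARD_SIZE
        group_ids = set(group_of_region.values())
        # One bit per 512-byte card: each bitmap takes heap_bytes / 4096 bytes.
        self.bitmaps = {g: bytearray((self.num_cards + 7) // 8) for g in group_ids}

    def record(self, ref_addr, target_addr):
        """Remember that the slot at ref_addr points to the object at target_addr."""
        src_group = self.group_of_region.get(ref_addr // self.region_bytes)
        dst_group = self.group_of_region.get(target_addr // self.region_bytes)
        if dst_group is None or src_group == dst_group:
            return  # target is not evacuated, or both ends are released together
        card = ref_addr // CARD_SIZE
        self.bitmaps[dst_group][card // 8] |= 1 << (card % 8)

    def dirty_cards(self, group):
        """Cards that must be scanned to update references into `group`."""
        bitmap = self.bitmaps[group]
        return [c for c in range(self.num_cards) if bitmap[c // 8] & (1 << (c % 8))]

    def overhead_ratio(self):
        """Total bitmap bytes divided by the heap size."""
        return sum(len(b) for b in self.bitmaps.values()) / self.heap_bytes
```

With 16 groups, `overhead_ratio()` evaluates to 16/4096, which is the 0.39% quoted above. References between regions of the same group never set a bit, which is the source of the reduced re-scanning work.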

Figure 3. Heap layout in Jade.
Optimization: piggyback with marking. Nevertheless, rebuilding group-based remembered sets is still time-consuming as it needs to scan the remembered sets. Fortunately, this step can be further improved considering the fact that the number of inter-region references is usually limited. For example, our analysis on the Specjbb2015 workload [12] shows that 83.13% of dirty cards (containing inter-region references) in the whole memory range contain references to only one or two regions (excluding the region the card resides in), so this region-related information can be memorized with acceptable overhead. To this end, Jade proposes to collect the information in the preceding marking phase. This is achieved by a global cross-region discover table (CRDT), which memorizes inter-region references also at card granularity (512 bytes). CRDT maintains a mapping between cards and an integer (4 bytes). In contrast to the remembered set, CRDT maintains outgoing references to other regions by storing region IDs in the integer. For example, suppose Jade discovers an address in card x from region y which stores a reference to another region z; the region ID z is then stored into x's corresponding integer in CRDT. For each card, two outgoing references to the same region are only stored once. Since the number of regions is usually in the thousands, 4 bytes are enough to store two region numbers. If the number of discovered regions exceeds two, Jade marks the corresponding integer in the CRDT with a special value (currently -1) so that the card is rescanned in the remembered set building phase. With CRDT, the work required for building remembered sets is greatly reduced. First, cards not containing cross-region references do not need scanning since it is impossible for them to contain cross-group references. For those recording region IDs, Jade finds the groups containing the regions and marks the bit for the card in the group-wise remembered set. Because both remembered sets and CRDT use the same granularity for reference memorization, this step does not need card scanning. It also eliminates unnecessary scanning if the recorded regions are in the same group as the card. Lastly, only cards marked with -1 need to be thoroughly scanned, so the scanning frequency is greatly reduced (evaluated in Section 5.6). Meanwhile, the memory consumption of CRDT is also moderate: since one CRDT is enough to maintain cross-region references for all groups, its memory consumption is only 0.78% of the heap size.
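
The CRDT update during marking and the later construction of group-wise remembered sets can be sketched as below. The encoding is an assumption made for illustration: the text only states that a 4-byte integer holds two region IDs and that -1 marks overflow, so the two 16-bit halves, the offset by one, and the names `crdt_record`, `build_remembered_sets`, `region_of_card`, and `scan_card` are all hypothetical.

```python
OVERFLOW = -1  # special value from the text: the card must be rescanned
EMPTY = 0      # assumption: 0 means no cross-region reference was seen


def crdt_record(crdt, card, target_region):
    """Store target_region in the card's entry (two 16-bit slots, IDs offset by 1)."""
    entry = crdt[card]
    if entry == OVERFLOW:
        return
    tagged = target_region + 1  # offset so that region 0 differs from EMPTY
    lo, hi = entry & 0xFFFF, entry >> 16
    if tagged in (lo, hi):
        return  # a region is stored only once per card
    if lo == 0:
        crdt[card] = tagged
    elif hi == 0:
        crdt[card] = (tagged << 16) | lo
    else:
        crdt[card] = OVERFLOW  # more than two distinct target regions


def build_remembered_sets(crdt, region_of_card, group_of_region, scan_card):
    """Derive group-wise remembered sets (sets of card indices) from the CRDT."""
    remsets = {g: set() for g in set(group_of_region.values())}
    for card, entry in enumerate(crdt):
        if entry == EMPTY:
            continue  # no cross-region reference, so no cross-group reference
        if entry == OVERFLOW:
            targets = scan_card(card)  # only these cards need a real scan
        else:
            targets = [t - 1 for t in (entry & 0xFFFF, entry >> 16) if t]
        src_group = group_of_region.get(region_of_card(card))
        for region in targets:
            dst_group = group_of_region.get(region)
            if dst_group is not None and dst_group != src_group:
                remsets[dst_group].add(card)
    return remsets
```

At 4 bytes per 512-byte card the table costs 4/512 of the heap, which matches the 0.78% stated above.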

4 Generational collection

4.1 Single-phase young GC

To break long collection cycles into smaller ones, Jade also embraces a generational design and provides young GC to collect its young generation. As shown in Figure 3, the regions in Jade's heap can be free (unmarked), used by the young generation (marked as Y), or consumed by the old generation (marked as O). Compared with a multi-generation approach with more than two generations [5], Jade's two-generation design is simpler and enough to collect short-lived objects. When young GC is triggered, objects in young regions can be copied (or promoted) to the old generation or still reside in the young one. As for the old GC (the group-wise collection introduced before), objects are only evacuated to other old regions. Those two can run together as they reclaim different parts of memory resources.
Generational concurrent collectors like GenZ and C4 use the same algorithm for young and old GC, so the young GC still contains two phases: marking and incremental evacuation. This two-phase design helps to (1) prioritize regions with the most garbage for collection and (2) immediately reclaim a region once it is evacuated. However, both advantages are somewhat unnecessary for young GC. First, since the weak generational hypothesis [35] still holds for many modern applications, the survival rate for young regions is usually low and all young regions will be selected for reclamation, rendering the prioritization useless. Second, reclamation immediacy is not that important for young GC because it is much faster than old GC and hardly becomes a bottleneck for memory allocation. Therefore, Jade instead adopts a single-phase young GC algorithm to maximize its collection efficiency and thus saves more CPU cycles for the execution of mutators and old GC.
The single-phase design finishes the marking, evacuation, and reference updating in the same phase: Jade traverses the young generation and immediately copies an object to a survivor region once it is marked live. Instead of using a marking bitmap like the old GC, Jade directly stores an object's new address in its old header as the marking result and uses atomic instructions to avoid repetitive copying of the same object. Since young GC does not traverse old regions, Jade also needs to remember inter-generation references from them. This is achieved by an old-to-young remembered set, where each bit records whether the corresponding 512-byte memory range contains inter-generation references. Meanwhile, references inside live objects are pushed into a GC-thread-local marking stack so that they can be updated immediately after their referents have been evacuated. By coalescing multiple operations into one single phase, Jade reduces the number of memory accesses and significantly improves the GC efficiency.
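
The single-phase traversal can be illustrated with a toy, single-threaded model. Everything here is a simplification for exposition: `Obj`, its `space` and `forwardee` fields, and `young_gc` are hypothetical, promotion to the old generation is omitted, and the atomic installation of the forwarding address is only noted in a comment because the sketch uses one GC thread.

```python
class Obj:
    """Toy heap object; `forwardee` plays the role of the new address in the old header."""

    def __init__(self, name, space="young"):
        self.name = name
        self.space = space      # "young", "survivor", or "old"
        self.fields = []        # outgoing references
        self.forwardee = None   # set once the object has been copied


def young_gc(root_slots):
    """Mark, copy, and update references in a single traversal.

    root_slots is a list of (holder, index) pairs, where holder is a list of
    references: thread roots, or fields of old objects found through the
    old-to-young remembered set. Returns the list of survivor copies.
    """
    survivors = []
    stack = list(root_slots)  # GC-thread-local marking stack of reference slots
    while stack:
        holder, i = stack.pop()
        obj = holder[i]
        if obj is None or obj.space != "young":
            continue  # young GC neither traverses nor moves other objects
        if obj.forwardee is None:
            # First visit: copy the object and publish the new address in the
            # old header. A concurrent collector would do this atomically.
            copy = Obj(obj.name, space="survivor")
            copy.fields = list(obj.fields)
            obj.forwardee = copy
            survivors.append(copy)
            # References inside the fresh copy are processed later.
            stack.extend((copy.fields, j) for j in range(len(copy.fields)))
        holder[i] = obj.forwardee  # update the slot right away
    return survivors
```

Each live object is touched once for marking, copying, and slot updating together, whereas a two-phase design would visit it once to mark and again to evacuate.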

4.2 Free space estimation

To avoid blocking mutator allocation during old GC, Jade allows the co-running of both GC cycles. Therefore, the free space estimation in Algorithm 1 should consider two factors: (1) allocatable free bytes in the heap and (2) memory behaviors in the young generation. The first part is simple to calculate: since only completely free regions can be allocated to mutators, we can get the value by multiplying the number of free regions by the region size (line 2 of Algorithm 2). Meanwhile, since the young GC promotes objects to the old generation (Figure 3), Jade calculates the size of the promoted data with the historical promotion rate and the estimated remaining GC time (lines 3-4). As for the young generation, since its memory behaviors are complicated (perhaps including multi-round memory allocation and collection), Jade conservatively leaves a part for all memory-related activities within the young generation, whose size is determined by an empirical ratio (young_ratio, 85% by default). Only the remaining free bytes are treated as free space and used as destinations for evacuation in old GC.
Algorithm 2 Estimating free space size for collection
    function ESTIMATE_FREE_SPACE
        free_space ← free_region_count * region_size
        free_space ← free_space - promotion_ratio *
            estimated_gc_time
        free_bytes ← free_space * (1 - young_ratio)
        return free_bytes
    end function
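
Algorithm 2 translates almost directly into code. The sketch below assumes that the promotion term is a rate in bytes per second multiplied by a time in seconds; the parameter is named `promotion_rate` here (the pseudocode calls it `promotion_ratio`), and the units are an assumption since the text does not state them.

```python
def estimate_free_space(free_region_count, region_size, promotion_rate,
                        estimated_gc_time, young_ratio=0.85):
    """Bytes the old GC may use as evacuation destinations (mirrors Algorithm 2)."""
    free_space = free_region_count * region_size
    # Reserve room for objects the co-running young GC is expected to promote.
    free_space -= promotion_rate * estimated_gc_time
    # Leave young_ratio of the remainder to the young generation.
    return free_space * (1 - young_ratio)
```

For example, with 100 free regions of 1 MiB, a promotion rate of 10 MiB/s, 2 s of remaining GC time, and `young_ratio=0.5`, the estimate is (100 - 20) * 0.5 = 40 MiB.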

4.3 Chasing mode and full GC

When the allocation rate becomes too high, Jade may face inevitable mutator stalls due to free memory shortage. Since the number of running mutators becomes smaller, the idle CPU resources can be used to run more GC threads. However, since the default number of GC threads in ZGC and Shenandoah is usually small, they fail to leverage idle CPU cores in a mutator stall (e.g., ZGC only occupies 26.80% of overall CPU resources during application stalls in Table 1's experiment). Meanwhile, if the user manually increases the thread number, GC threads would compete for more CPU resources and affect application performance when no stall happens. To this end, Jade introduces a chasing mode, which launches more GC threads to concurrently evacuate groups only when mutators begin to stall. In the default setting, the number of concurrent GC threads in chasing mode is equal to the number of cores, which maximizes the GC throughput to finish processing the current collection group. Jade also supports full GC but only triggers it when extreme cases occur (e.g., consecutive long application stalls), and it also fully utilizes all available CPU resources to improve the collection efficiency.
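
The thread-count policy behind chasing mode can be stated in a few lines. This is only a sketch of the decision described above; `gc_thread_count` and its parameters are hypothetical names, and how Jade detects a stall is not covered.

```python
def gc_thread_count(stalled_mutators, num_cores, default_gc_threads):
    """Pick the number of concurrent GC threads for the current moment."""
    if stalled_mutators > 0:
        return num_cores        # chasing mode: use every core while mutators wait
    return default_gc_threads   # normal mode: leave cores to the application
```

The point of switching dynamically is that a fixed large thread count would steal CPU time from mutators when nothing is stalled, while a fixed small count leaves cores idle exactly when the collector is the bottleneck.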

4.4 Weak references handling

Jade supports handling weak references in both young and old GC. In the marking phase (mark-and-copy phase in the young GC), Jade puts objects into a discover list if they are pointed to by weak references. After all concurrent phases finish, Jade leverages an extra stop-the-world phase to check if objects in the list are marked through other strong references. If an object is not marked, Jade tries to reclaim it and invokes its corresponding callback function (if any). Since the number of weak references is not large for many applications, the induced pause time is trivial. We plan to modify this phase into a concurrent one in our future work.
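
The final stop-the-world check over the discover list can be sketched as follows. The representation is assumed for illustration: each entry pairs a referent with an optional callback, and `is_marked` and `reclaim` are hypothetical hooks standing in for the collector's marking state and reclamation logic.

```python
def process_discovered(discover_list, is_marked, reclaim):
    """Reclaim weak referents that no strong reference reached; return them."""
    reclaimed = []
    for referent, callback in discover_list:
        if is_marked(referent):
            continue            # still strongly reachable: keep it
        reclaim(referent)
        if callback is not None:
            callback(referent)  # the "corresponding callback function", if any
        reclaimed.append(referent)
    return reclaimed
```

The loop is linear in the length of the discover list, which is consistent with the observation that the pause stays short when an application holds few weak references.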

5 Evaluation

5.1 Experiment setup

Jade is implemented on OpenJDK (version 11.0.17.13) with about 30,000 lines of code. Some components of Jade derive from existing collectors in OpenJDK, such as snapshot-at-the-beginning (SATB) marking and card tables.
We mainly evaluate Jade on a bare-metal instance in the public cloud environment. The instance has dual Intel Xeon Platinum 8369B CPUs (32 physical cores with SMT enabled) and 512GB DRAM. To avoid NUMA issues, we bind applications to separate physical cores on the same socket. We leverage the following applications to evaluate Jade.
  • Specjbb2015 [12] is the de facto standard for Java server performance, especially for garbage collectors. It simulates an online supermarket serving various kinds of requests. Both ZGC and Shenandoah adopt this application to show their improvement against prior collectors upon their release [15, 22]. Meanwhile, Oracle leverages it as the primary application to show the progress of GC in OpenJDK [19, 30].
  • HBase [1] is a distributed big-data store for non-relational data. We mainly use it to study the single-machine performance, and the evaluated version is 2.4.14.
  • Shop is a real-world online service used in Alibaba's e-commerce platform. This service is used by tens of millions of users every day to serve their requests to access online shops.
  • DaCapo [2] is a benchmark suite containing various Java workloads. Compared with other evaluated applications, most workloads in DaCapo have smaller memory demand, so we mainly use them to show the performance of Jade with tight memory budgets.

With those applications, we mainly compare Jade with four different concurrent garbage collectors: G1, ZGC, Shenandoah, and LXR. G1 and LXR evacuate objects in an STW fashion, which helps improve application throughput. However, their STW pauses may cause worse application latency, especially when the size of live data grows larger. Both ZGC and Shenandoah concurrently evacuate objects and thus interfere more with application throughput. All those collectors are also evaluated under OpenJDK-11.0. Since G1 allows users to control its pause time by setting a soft limit (-XX:MaxGCPauseMillis), we manually set it to 10 ms for better application latency, in addition to the default setting (200 ms). We also evaluate the generational variants of ZGC and Shenandoah (namely GenZ and GenShen), and the respective versions are OpenJDK-21 and OpenJDK-22¹ since they are not backported to older versions. As GenZ and GenShen are implemented in different JDK versions, the evaluation results would be affected by factors other than GC (e.g., efficiency of JIT optimizations). LXR is built from its latest commit (the commit numbers for mmtk-core and mmtk-openjdk are 4ab99bb and cdbc8de, respectively), which is also based on OpenJDK-11.0 (commit 7caf8f7). The mmtk/opt cargo feature is also enabled to avoid performance issues. C4 is not included as it uses a similar algorithm to GenZ but a much different JDK implementation (Azul JDK).
For all evaluated applications, we mainly concentrate on two metrics: application throughput and latency. To this end, we collect the latency statistics when applications are under different throughput levels. Meanwhile, we also leverage various heap configurations for all applications but the online Shop service since we have no permission to adjust its heap size. To configure the heap size, we first calculate a minimum heap size for ZGC and simulate three scenarios: tight (1.5×), medium (2×), and large (4×). The minimum heap size for Specjbb2015 and HBase is 1941 MB and 1100 MB, respectively. For Shop, we directly use its fixed configuration (8 GB heap, approximately 4× the live data size).
As for DaCapo, we use a tighter memory configuration (details in Section 5.5). All applications use 8 physical cores regardless of heap size.
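
The heap sizes implied by the three scenarios follow from the stated minimums. The snippet below only does the multiplication; `heap_sizes_mb` is a hypothetical helper, and whether the authors rounded the resulting values is not stated in the text.

```python
MIN_HEAP_MB = {"Specjbb2015": 1941, "HBase": 1100}  # minimum heap sizes from the text
SCENARIOS = {"tight": 1.5, "medium": 2.0, "large": 4.0}


def heap_sizes_mb(app):
    """Heap size in MB for each scenario, as a multiple of the application's minimum."""
    return {name: MIN_HEAP_MB[app] * factor for name, factor in SCENARIOS.items()}
```

For Specjbb2015 this gives 2911.5 MB, 3882 MB, and 7764 MB; for HBase it gives 1650 MB, 2200 MB, and 4400 MB.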

5.2 Specjbb2015

We mainly evaluate Specjbb2015 with its default mode (HBIR_RT), which reports two scores for each execution: max-jops stands for the peak throughput while critical-jops stands for the maximum throughput satisfying p99 latency requirements.
As shown in Table 3, Jade has worse max-jops compared with the default setting of G1 (3.98% smaller in arithmetic mean) and LXR (6.81%), but the result is much better than other concurrent copying collectors. Meanwhile, Jade's critical-jops score is the best for all settings. The evaluation suggests that the critical-jops score of other concurrent collectors is significantly restricted by their maximum throughput, especially under tight heap configurations. As for G1, although setting a smaller soft limit (10 ms) helps improve the critical-jops score, the max-jops is significantly decreased due to more frequent collections and larger overhead.
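
The two arithmetic-mean figures can be reproduced from the max-jops values in Table 3. The calculation below is one reading that matches the reported numbers (the mean of per-heap-size relative gaps); the text does not spell out the exact formula, so treat it as an assumption. `mean_deficit_percent` is a hypothetical helper name.

```python
MAX_JOPS = {  # max-jops from Table 3 for the 1.5x, 2x and 4x heaps
    "Jade": (13101, 15257, 21454),
    "G1": (13433, 14926, 24297),
    "LXR": (16302, 17694, 18989),
}


def mean_deficit_percent(collector, baseline):
    """Arithmetic mean over heap sizes of how far `collector` falls below `baseline`."""
    gaps = [(1 - c / b) * 100 for c, b in zip(MAX_JOPS[collector], MAX_JOPS[baseline])]
    return sum(gaps) / len(gaps)
```

Under this reading Jade trails G1 by about 3.98% and LXR by about 6.81%. The per-configuration gaps differ in sign: Jade is ahead of G1 at the 2× heap and ahead of LXR at the 4× heap.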

Figure 4. P99 latency results under various throughput settings in Specjbb2015.
Figure 4 further shows the p99 latency under various throughput settings. The latency statistics are collected when Specjbb2015 gradually increases the application throughput to find the peak one. The results show that other concurrent copying collectors have smaller p99 latency than G1 when the throughput is moderate, but the latency soon becomes larger due to their poor GC efficiency. In contrast, Jade has optimized its collection phases and thus reaches satisfying latency even under heavy workloads. As for other percentiles, Jade's worst latency in the 2× heap configuration is better than GenZ's when QPS exceeds 9000 and better than GenShen's for almost all QPS settings.

Table 3. Maximum throughput numbers for various collectors. For Specjbb2015, we also show the critical-jops score (the first of the two numbers).

| Application | Collector | 1.5× heap | 2× heap | 4× heap |
| :--- | :--- | :--- | :--- | :--- |
| Specjbb2015 | Jade | 8299/13101 | 9497/15257 | 14149/21454 |
| | G1 | 4462/13433 | 4788/14926 | 7942/24297 |
| | G1-10ms | 4122/11761 | 5058/12936 | 8811/19786 |
| | ZGC | 1618/2774 | 3443/4563 | 8576/9931 |
| | Shenandoah | 3771/6120 | 4602/7245 | 11371/13599 |
| | LXR | 3170/16302 | 4038/17694 | 5310/18989 |
| | GenZ | 6675/8387 | 8654/11346 | 12258/15308 |
| | GenShen | 1605/11345 | 4313/10218 | 7212/10931 |
| HBase-Insert | Jade | 1128 | 1164 | 1284 |
| | G1 | 970 | 1141 | 1304 |
| | G1-10ms | 866 | 931 | 1294 |
| | ZGC | 668 | 690 | 1096 |
| | Shenandoah | 840 | 873 | 969 |
| | GenZ | 874 | 890 | 1045 |
| | GenShen | 726 | 528 | 777 |
| HBase-Mixed | Jade | 1334 | 1577 | 1929 |
| | G1 | 1384 | 1489 | 1820 |
| | G1-10ms | 1332 | 1313 | 1954 |
| | ZGC | 737 | 811 | 1302 |
| | Shenandoah | 843 | 734 | 944 |
| | GenZ | 1085 | 1146 | 1666 |
| | GenShen | 861 | 663 | 1218 |
| Shop (8GB) | Jade | - | - | 400 |
| | G1 | - | - | 350 |
| | ZGC | - | - | 150 |
| | Shenandoah | - | - | 300 |

5.3 HBase

HBase is evaluated using YCSB [11] with two workloads: an insert-only one and a mixed one (50% read and 50% insert). Furthermore, we use the throttle option to control the request rate and generate results under various throughputs. LXR is not included since it is based on MMTk, whose current version cannot run HBase [38]. As shown in Figure 5, when compared with other concurrent copying collectors, Jade has better latency results than Shenandoah and GenShen for nearly all configurations and remains comparable with ZGC and GenZ under moderate throughput. Meanwhile, Jade’s peak throughput is 1.63×, 1.63×, 1.27×, and 1.82× that of ZGC, Shenandoah, GenZ, and GenShen, respectively. When the pause soft limit is set to 10 ms, G1 has better p99 latency under moderate workloads, but its maximum throughput decreases.
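The four peak-throughput ratios quoted for HBase can be rechecked against the HBase rows of Table 3. The text does not state how the ratios are aggregated; the sketch below assumes an arithmetic mean of the per-configuration ratios over both workloads and all three heap sizes, which reproduces the reported values after rounding. The dictionary layout and the function name `mean_speedup` are illustrative.

```python
# Peak throughput from the HBase rows of Table 3, ordered as
# (1.5x, 2x, 4x heap) for HBase-Insert followed by HBase-Mixed.
PEAK = {
    "Jade":       [1128, 1164, 1284, 1334, 1577, 1929],
    "ZGC":        [668, 690, 1096, 737, 811, 1302],
    "Shenandoah": [840, 873, 969, 843, 734, 944],
    "GenZ":       [874, 890, 1045, 1085, 1146, 1666],
    "GenShen":    [726, 528, 777, 861, 663, 1218],
}


def mean_speedup(baseline: str, target: str = "Jade") -> float:
    """Arithmetic mean of target/baseline ratios (an assumed aggregation)."""
    ratios = [t / b for t, b in zip(PEAK[target], PEAK[baseline])]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    for name in ("ZGC", "Shenandoah", "GenZ", "GenShen"):
        print(f"Jade vs {name}: {mean_speedup(name):.2f}x")
```

Other aggregations (for example, a geometric mean) would give slightly different numbers, so the match only suggests, rather than confirms, how the figures were derived.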

5.4 Shop

To evaluate the online Shop service, we use four nodes to generate concurrent requests and simulate a stressful workload. Since the online service only supports JDK11, GenZ is not evaluated in this experiment. If the p99 latency is larger than one second, the Shop service automatically reports itself as unavailable for user-experience considerations. As shown in Figure 6, Jade achieves the best peak throughput among all collectors. Although G1 can endure high throughput, it triggers too many long pauses and thus induces a large p99 latency, so its peak throughput is worse than Jade’s. In contrast, Jade reaches p99 latency comparable with Shenandoah and ZGC under moderate throughput while outperforming G1 by 3.94× on average (from 100 to 350 QPS).
We also collect the average CPU utilization during application execution. The results in Figure 6b show that the CPU utilization under moderate throughput is similar among all collectors except ZGC. However, when the throughput is close to the maximum, the CPU utilization of both Shenandoah and ZGC quickly increases. As analyzed before, they introduce long pre-reclamation cycles, which consume much more CPU resources and put heavier burdens on mutators. Meanwhile, the p99 latency of Jade is relatively stable even when the CPU utilization reaches 85.78% (QPS is 350), which confirms that Jade can retain high GC efficiency and short pauses even under heavy CPU-intensive workloads.

Figure 5. P99 latency results under various throughput settings in HBase.

Figure 6. Evaluation results for Shop.

5.5 DaCapo

Table 4 first shows the normalized application execution time under two heap configurations (1.5× and 2× of G1’s minimal heap [37]) for all workloads in DaCapo. The results are averaged over ten executions, where each execution reports the numbers in the fifth iteration for each application (i.e., the other four iterations are used as warm-up). The relative standard error ranges from 0.002% to 51.49%, with 98.18% of data points having a standard error of 10% or less. Due to the tight memory configuration, ZGC and Shenandoah (including their generational variants) either fail to execute some applications due to out-of-memory errors or incur large performance overhead. In contrast, Jade can execute all applications and its performance remains relatively stable. Compared with them, LXR and G1 show better performance on DaCapo. Those two collectors both introduce STW pauses for evacuation to reach better collection efficiency. Since the memory budget for most applications in DaCapo is relatively small (less than 1 GB), the pause time is also short. As for concurrent copying collectors, their runtime overhead, such as load barriers, becomes more significant in those scenarios.
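The text does not define the relative standard error. A minimal sketch follows, assuming the usual definition: the sample standard deviation divided by √n, expressed as a percentage of the mean. The function name and the sample timings are illustrative and are not data from Table 4.

```python
import math
import statistics


def relative_standard_error(samples):
    """Return the standard error of the mean as a percentage of the mean."""
    n = len(samples)
    std_err = statistics.stdev(samples) / math.sqrt(n)  # sample std. dev. / sqrt(n)
    return 100.0 * std_err / statistics.mean(samples)


if __name__ == "__main__":
    # Hypothetical fifth-iteration execution times from ten runs of one benchmark.
    runs = [2902, 2911, 2895, 2930, 2899, 2905, 2921, 2890, 2908, 2915]
    print(f"RSE = {relative_standard_error(runs):.3f}%")
```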
For latency results, in applications like tomcat (142 MB heap size), LXR has 10.31% better p99 latency than Jade. However, for those with larger heap sizes, the application latency is largely affected by LXR’s pauses. Figure 7 shows the results for two different size configurations of H2 from DaCapo: normal (the default) and large (the corresponding minimal heap size is 4099 MB). To collect latency statistics at various throughput levels, we modify the metered latency measurements from the Chopin version of DaCapo [16] to model request queuing with adjustable QPS configurations (we refer to this application as H2-throttle). For both settings, Jade shows better p99 metered latency than G1 and LXR although its maximum throughput is smaller. By analyzing the GC log, we find that LXR’s average pause time under the large configuration and moderate throughput (8000 QPS) is 46.30 ms, while G1’s is 40.41 ms. In contrast, Jade’s average pause time is only 0.52 ms, which explains its better latency compared with the other two collectors.
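To illustrate why pause length matters for a throttled, metered measurement, the sketch below models the general idea under simplifying assumptions: a single worker, requests scheduled at a fixed interval of 1/QPS, and latency measured from the scheduled arrival time, so that requests queued behind a pause are charged for their waiting time. It is not the authors’ modified DaCapo harness; `metered_latencies`, `p99`, and all parameters are hypothetical.

```python
import math


def metered_latencies(service_times, qps, pauses=()):
    """Latency of each request, measured from its scheduled arrival time.

    service_times: processing time of each request, in seconds.
    qps: target request rate; request i is scheduled to arrive at i / qps.
    pauses: (start, duration) pairs sorted by start, during which the single
            worker makes no progress (a stand-in for GC pauses or stalls).
    """
    latencies = []
    worker_free_at = 0.0
    for i, service in enumerate(service_times):
        arrival = i / qps
        start = max(worker_free_at, arrival)  # queue behind earlier requests
        finish = start + service
        for p_start, p_len in pauses:
            p_end = p_start + p_len
            if p_end <= start:
                continue  # pause ended before this request started
            if p_start >= finish:
                break  # pause begins after this request completed
            finish += p_end - max(p_start, start)  # overlapping part delays it
        worker_free_at = finish
        latencies.append(finish - arrival)
    return latencies


def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(0.99 * len(ordered)) - 1]


if __name__ == "__main__":
    work = [0.0005] * 1000  # 0.5 ms of service time per request
    print("no pause   p99 (s):", p99(metered_latencies(work, 1000)))
    print("50ms pause p99 (s):", p99(metered_latencies(work, 1000, [(0.2, 0.05)])))
```

In this model a single long pause raises the latency of every request that arrives while the backlog drains, which is why tail latency is sensitive to pause duration and not only to pause count.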

5.6 Breakdown analysis

Group-related parameters. Figure 8 shows the p99 latency when varying the maximum group and region number. The evaluated application is Specjbb2015 in its preset mode, which runs with a fixed QPS (2000) for 10 minutes. When Jade is only allowed to evacuate one group in each collection cycle, the collection efficiency is affected, so the tail latency becomes worse. As for the region number, different configurations lead to similar results, which suggests that Jade’s performance is not sensitive to it.
Collection efficiency. We also compare the GC performance of GenZ and Jade since they both use a generational design and GenZ has better performance than GenShen (Table 3). The workload is the same as in Figure 8. To achieve a relatively fair comparison, we disable the compressed pointers in Jade and fix the number of GC threads to two (one for young and another for old, chasing mode disabled) and
Table 4. Application execution time for the DaCapo benchmark (normalized to G1), under 1.5× (left) and 2× (right) minimal heap configurations. N/A means the applications cannot run due to unsupported JDK versions.

| App | G1 | G1-10ms | Shen. | ZGC | GenShen | GenZ | LXR | Jade |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| avrora | 2902/2811 | 0.994/0.972 | 1.092/1.037 | OOM | 1.734/1.437 | OOM | 0.992/1.027 | 1.747/1.770 |
| batik | 1707/1735 | 1.004/0.978 | 1.152/0.976 | OOM | 1.389/1.269 | 1.314/1.044 | OOM/1.048 | 1.741/1.712 |
| biojava | 7487/7334 | 1.000/0.992 | 2.912/2.062 | OOM | 3.105/2.581 | OOM/3.781 | 0.954/0.969 | 1.825/1.803 |
| cassandra | 8812/7760 | 1.009/0.989 | 1.275/1.177 | OOM | N/A | N/A | 0.961/1.011 | 1.793/1.801 |
| eclipse | 12473/12215 | 0.999/0.996 | 1.068/1.047 | OOM/0.980 | 1.418/1.309 | 1.119/0.978 | 1.023/1.038 | 1.770/1.767 |
| fop | 1041/860 | 0.998/0.989 | 4.035/1.592 | OOM | 6.049/4.609 | OOM/8.059 | 0.731/0.829 | 1.655/1.700 |
| graphchi | 4252/4188 | 0.998/0.997 | 1.909/1.636 | OOM/1.403 | 4.414/3.987 | 1.405/1.227 | 0.949/0.956 | 1.825/1.831 |
| h2 | 4972/3790 | 0.989/1.005 | 7.104/5.642 | OOM/11.47 | 2.304/2.373 | 2.212/1.914 | 0.904/1.109 | 1.941/2.130 |
| h2o | 4573/3793 | 1.015/0.998 | 2.259/1.750 | OOM | N/A | N/A | 1.068/1.110 | 1.913/1.857 |
| jme | 6873/6873 | 1.000/1.000 | 1.008/1.005 | OOM | 1.088/1.097 | 1.006/1.005 | 0.999/1.000 | 1.733/1.734 |
| jython | 5890/5393 | 0.999/1.002 | 3.323/2.022 | OOM | 6.548/5.238 | OOM/1.954 | 0.958/1.031 | 1.883/1.829 |
| kafka | 5186/5200 | 0.999/0.996 | 1.000/0.996 | OOM/0.995 | 2.441/3.588 | 0.993/0.989 | 0.991/0.992 | 1.728/1.725 |
| luindex | 4290/4283 | 0.994/0.989 | 1.173/1.089 | OOM | 18.20/20.93 | 1.054/0.976 | 0.969/0.988 | 1.780/1.782 |
| lusearch | 5398/4688 | 1.021/0.987 | OOM | OOM | OOM | OOM/6.914 | 0.981/1.101 | 2.313/2.240 |
| pmd | 2549/2407 | 1.002/1.013 | 1.338/1.243 | OOM/1.432 | 10.95/31.76 | 1.794/1.469 | 0.959/0.987 | 1.793/1.797 |
| spring | 4414/3077 | 1.005/1.043 | 9.403/5.243 | OOM | 7.621/5.314 | OOM | 0.804/0.995 | 1.683/1.853 |
| sunflow | 8285/8100 | 1.000/0.967 | 8.371/2.729 | OOM | OOM/13.10 | 3.009/2.234 | 0.699/0.705 | 2.143/1.945 |
| tomcat | 13377/13330 | 1.001/0.999 | 1.494/1.198 | OOM | 4.510/3.245 | OOM/4.019 | 1.005/1.003 | 1.750/1.747 |
| tradebeans | 5984/5691 | 1.004/0.997 | 4.842/2.626 | OOM | OOM | OOM/9.141 | 1.089/1.072 | 2.081/2.076 |
| tradesoap | 4615/3031 | 0.989/1.007 | 3.022/2.480 | OOM | OOM | OOM/13.68 | 0.757/1.095 | 1.664/1.999 |
| xalan | 2747/1753 | 0.988/1.008 | 26.03/25.83 | OOM | 37.19/43.93 | 6.657/7.631 | 0.737/0.962 | 2.077/2.312 |
| zxing | 2432/2404 | 1.003/0.960 | 1.023/1.009 | 1.454/1.017 | 1.794/1.386 | 0.995/0.958 | 0.917/0.957 | 1.680/1.680 |
| geomean | - | 1.000/0.995 | 2.450/1.873 | 1.454/1.685 | 4.047/4.263 | 1.597/2.468 | 0.919/0.995 | 1.835/1.860 |

Figure 7. P99 latency results under various throughput and workload settings in H2-throttle.

Figure 8. P99 latency results with varying group-related parameters (the application is Specjbb2015).
Table 5. Breakdown GC statistics for Jade and GenZ, including average time (in milliseconds) and GC throughput (reclaimed megabytes per second) for both young and old GC. Again, GenZ and Jade are running on different JDK versions.

| Cycle | Collector | Phase | Avg. (ms) | Thru. (MB/s) |
| :--- | :--- | :--- | :--- | :--- |
| Young | Jade | Total | 111.90 | 8579.47 |
| | GenZ | Mark | 366.29 | 2259.70 |
| | | Evac. | 86.87 | |
| | | Total | 453.16 | |
| Old | Jade | Mark | 1124.64 | 1248.36 |
| | | Build | 218.88 | |
| | | Evac. | 763.23 | |
| | | Total | 2106.75 | |
| | GenZ | Mark | 2525.76 | 760.53 |
| | | Evac. | 932.31 | |
| | | Total | 3458.07 | |

the young generation size to 1 GB. Note that in this configuration, Jade’s application throughput is also affected (e.g., the max-jOPS for the 2× heap is 11761). According to the results in Table 5, Jade reaches 3.80× the GC throughput of GenZ (calculated as released memory size over collection time) in the young GC cycle. The speedup mainly comes from (1) the single-phase algorithm in Jade that avoids repetitive object graph traversals, (2) the inherent color pointer overhead in GenZ, and (3) the performance loss in GenZ due to disabling compressed pointers in the presence of color pointers. As for the old GC cycle, Jade performs better in both marking and evacuation than GenZ and thus reaches a 64.14% improvement in GC throughput. The results further confirm that Jade’s group-based design outperforms prior concurrent copying collectors even with similar generational layouts.
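The two headline numbers follow directly from the throughput column of Table 5. The sketch below recomputes them and also derives the amount of memory released per cycle, assuming the throughput column is computed over the total cycle time; that assumption and the helper names are ours.

```python
# Values copied from Table 5: average total cycle time (ms) and GC throughput (MB/s).
TABLE5 = {
    ("young", "Jade"): {"avg_ms": 111.90, "mb_per_s": 8579.47},
    ("young", "GenZ"): {"avg_ms": 453.16, "mb_per_s": 2259.70},
    ("old", "Jade"): {"avg_ms": 2106.75, "mb_per_s": 1248.36},
    ("old", "GenZ"): {"avg_ms": 3458.07, "mb_per_s": 760.53},
}


def throughput_ratio(cycle: str) -> float:
    """Jade's GC throughput divided by GenZ's for the given cycle type."""
    return TABLE5[(cycle, "Jade")]["mb_per_s"] / TABLE5[(cycle, "GenZ")]["mb_per_s"]


def implied_reclaimed_mb(cycle: str, collector: str) -> float:
    """Memory released per cycle implied by throughput x average total time."""
    row = TABLE5[(cycle, collector)]
    return row["mb_per_s"] * row["avg_ms"] / 1000.0


if __name__ == "__main__":
    print(f"young GC: {throughput_ratio('young'):.2f}x")
    print(f"old GC:   {(throughput_ratio('old') - 1) * 100:.2f}% higher")
    for key in TABLE5:
        print(key, f"{implied_reclaimed_mb(*key):.0f} MB per cycle (implied)")
```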
Duration of different phases. Table 6 shows different phases’ execution time under various heap configurations and maximum throughput for the H2 application. The workload is the same as in Table 2, making the statistics comparable with ZGC and Shenandoah. Note that when the heap configuration is more than 2×, Jade only has young GC, so we evaluate it with two tighter heap configurations: 1× and 1.2× of ZGC’s minimum heap size. As for the smallest heap size, although GC threads are mostly active, each pre-reclamation cycle is much shorter (averaging 0.09 s and 1.09 s for young and old GC), which allows Jade to quickly reclaim memory for mutators. Meanwhile, the accumulated pause time contributes less than 1% of the overall execution time, while the average pause time is less than 1 ms even with a tight heap size, which induces little interference on application latency. When the heap becomes larger, the pause is even shorter and does not restrict the application latency and throughput as it does in ZGC and Shenandoah.
Table 6. GC-related statistics in Jade, including time for applications (App.) and different GC phases.

| Time | Phase | 1× | 1.2× | 1.5× | 2× |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Total | App. (s) | 20.28 | 18.15 | 16.37 | 16.22 |
| | Mark (s) | 17.68 | 2.41 | 0 | 0 |
| | Build (s) | 11.82 | 0.62 | 0 | 0 |
| | Pause (s) | 0.13 | 0.08 | 0.04 | 0.03 |
| | Young Evac. (s) | 6.87 | 7.20 | 6.89 | 6.44 |
| | Old Evac. (s) | 1.39 | 0.47 | 0 | 0 |
| Avg. | Mark (s) | 0.85 | 0.60 | 0 | 0 |
| | Build (s) | 0.24 | 0.62 | 0 | 0 |
| | Pause (ms) | 0.72 | 0.71 | 0.55 | 0.66 |
| | Young Evac. (s) | 0.09 | 0.13 | 0.18 | 0.26 |
| | Old Evac. (s) | 0.20 | 0.26 | 0 | 0 |
| p99 | Pause (ms) | 3.21 | 1.64 | 1.08 | 1.39 |

CRDT. To show how CRDT helps reduce Jade’s concurrent GC time, we break down Jade’s GC cycle and compare it against G1, which uses region-wise remembered sets for its mixed collections (reclaiming both young and old generations) and thus also includes a concurrent marking and remembered set building phase. The workload is also the same as in Figure 8. The results in Table 7 show that Jade reduces the remembered set building time by 67.81%. This is mainly because CRDT reduces the number of cards to be scanned in the building phase by 64.63%. Meanwhile, although CRDT can introduce overhead in the marking phase, Jade still outperforms G1 by 24.95% in marking. The improvement mainly comes from Jade’s co-running design: during an old GC cycle, young GC threads can help by pushing young-to-old references into marking stacks. In contrast, since G1 conducts young GC in an STW fashion, it has to temporarily store those references in the live bitmap (similar to that in concurrent collectors), which needs rescanning in a future old marking cycle. Due to those two optimizations, Jade achieves a 40.48% improvement on the two concurrent phases together even compared with a high-throughput collector like G1, which further confirms Jade’s GC efficiency. As for the H2 benchmark analyzed before, CRDT also reduces the average number of scanned cards by 61.11%.
Table 7. Remembered set-related time breakdown in milliseconds, which mainly divides into two phases: marking (Mark) and remembered set re-building (Build).

| Collectors | Mark (ms) | Build (ms) | Total (ms) | No. of cards |
| :--- | :--- | :--- | :--- | :--- |
| G1 | 1369.02 | 777.93 | 2146.95 | 1215774 |
| Jade | 1027.36 | 250.43 | 1277.78 | 430041 |

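The percentages quoted for the CRDT comparison can be rechecked from Table 7 as relative reductions against G1. The sketch below uses the rounded table values, so the marking figure comes out at roughly 24.96% instead of the reported 24.95%; we assume the paper’s numbers were computed from unrounded measurements. The helper name `reduction_pct` is illustrative.

```python
# Values copied from Table 7 (times in milliseconds).
G1 = {"mark": 1369.02, "build": 777.93, "total": 2146.95, "cards": 1215774}
JADE = {"mark": 1027.36, "build": 250.43, "total": 1277.78, "cards": 430041}


def reduction_pct(metric: str) -> float:
    """Relative reduction of Jade against G1 for one column, in percent."""
    return (G1[metric] - JADE[metric]) / G1[metric] * 100.0


if __name__ == "__main__":
    for metric in ("build", "cards", "mark", "total"):
        print(f"{metric:>5}: {reduction_pct(metric):.2f}% lower than G1")
```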
Chasing mode. Thanks to Jade’s efficient marking and collection phases, we do not observe application stalls in most configurations. Therefore, we run Specjbb2015 with high throughput (13,000 for the 1.5× heap) for 15 minutes and find that the average pause time introduced by application stalls is 40.05 ms (p99 is 97.96 ms), which suggests that Jade does not induce large pauses even under extreme configurations. Meanwhile, the average CPU utilization within the chasing mode is 90.75%, showing that Jade sufficiently leverages CPU resources when mutators are stalled.

6.1 Concurrent collectors

As applications’ memory demands constantly grow, concurrent collectors are becoming popular because they provide controlled GC pause time regardless of heap size. The Garbage-First (G1) collector introduces soft limits and a concurrent marking phase, but its evacuation phase is still stop-the-world. Pauseless GC [10] divides the collection into three phases (marking, relocating, remapping), each of which allows co-execution with mutators; this inspires the design of today’s concurrent collectors. C4 [33] extends Pauseless GC with a generational design while Collie [18] proposes to use hardware transactional memory (HTM) to atomically relocate objects. Compressor [20] also uses hand-over-hand compaction to retain low physical memory overhead, but its reference calculation algorithm is costly. Block-free GC [26] introduces non-blocking handshakes for concurrent stack scanning and object copying. OpenJDK also introduced two concurrent copying collectors, Shenandoah and ZGC, which have been studied in this work. Cai et al. [7] also find that ZGC and Shenandoah can introduce long pauses under heavy workloads, but they do not explore the design of collectors to explain those pauses. Jade summarizes the deficiencies inside existing concurrent collectors and provides group-based evacuation and single-phase young GC to improve GC efficiency and application performance.

6.2 Reference counting

In contrast to tracing collectors, reference counting (RC) collectors record the number of incoming references for each object and can reclaim an object immediately when that number reaches zero. Immediacy is an appealing feature of RC, but RC also has two limitations: (1) the inability to handle cyclic references and (2) the large overhead of maintaining the per-object counter. The first is an inherent limitation of RC, so prior work mainly focuses on optimizing the maintenance overhead. Biased reference counting (BRC) [9] observes that most objects are only accessed by a single thread (namely the owner) and thus allows the owner to modify the counters without atomic instructions. Deferred RC [14] introduces a collection phase to RC, which only focuses on objects updated since the last collection and updates those objects’ counters. RCImmix [31, 32] further combines the deferred RC design with Immix’s heap layout [3] to reach performance comparable with trace-based collectors. LXR [41] finds that RC can be elegantly integrated with the concurrent marking algorithm of G1 and thus proposes to still use RC-based STW pauses for both GC efficiency and low application latency. Jade instead focuses on improving the performance of concurrent collectors.
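The immediacy and the cyclic-reference limitation described above can be seen in a toy model. The sketch below is a naive, single-threaded reference counter written only for illustration; it does not correspond to BRC, deferred RC, RCImmix, or LXR, and all names in it are hypothetical.

```python
class Obj:
    """A heap object with an incoming-reference counter."""

    def __init__(self, name):
        self.name = name
        self.rc = 0        # number of incoming references
        self.fields = []   # outgoing references
        self.freed = False


def inc(obj):
    obj.rc += 1


def dec(obj):
    """Drop one reference; reclaim immediately when the count reaches zero."""
    obj.rc -= 1
    if obj.rc == 0:
        obj.freed = True
        for child in obj.fields:  # releasing an object releases its references
            dec(child)
        obj.fields = []


def write_field(src, dst):
    """Store a reference from src to dst, updating dst's counter."""
    inc(dst)
    src.fields.append(dst)


def demo_chain():
    a, b = Obj("a"), Obj("b")
    inc(a)               # a root reference to a
    write_field(a, b)    # a -> b
    dec(a)               # drop the root: both are reclaimed at once
    return a.freed, b.freed


def demo_cycle():
    x, y = Obj("x"), Obj("y")
    inc(x)               # a root reference to x
    write_field(x, y)    # x -> y
    write_field(y, x)    # y -> x closes the cycle
    dec(x)               # drop the root: counters stay at 1, nothing is freed
    return x.freed, y.freed


if __name__ == "__main__":
    print("chain reclaimed:", demo_chain())
    print("cycle reclaimed:", demo_cycle())
```

Every `write_field` and `dec` touches a counter, which is the per-object maintenance overhead that the optimizations cited above try to reduce.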

6.3 GC optimizations

Another line of work proposes optimizations to collectors so that they can be adapted to various scenarios. Yak [24] provides an epoch-based GC design for big-data applications while NG2C [4-6] pre-tenures long-lived objects for similar workloads. Yang et al. [40] provide NVM-friendly GC designs according to the bandwidth characteristics of non-volatile memory devices. Mako [23] and MemLiner [36] optimize the performance of concurrent GC in far-memory scenarios. Our work mainly focuses on optimizing the performance of concurrent copying GC under heavy workloads and is thus orthogonal to those prior efforts.

7 Conclusion

Garbage collectors (GC) are among the most important modules in language runtimes. Recent concurrent collectors claim to be pauseless by allowing concurrent execution of mutators and GC threads, but they still induce long pauses under heavy workloads. To this end, this work proposes Jade, which provides corresponding designs to improve GC efficiency and reduce the duration of pauses. The evaluation results show that Jade can significantly improve the peak application throughput while retaining tail latency comparable with mainstream concurrent collectors.

Acknowledgments

We sincerely thank our shepherd Martin Maas and the anonymous EuroSys’24 reviewers for their insightful comments and feedback. We also thank Wenyu Zhao for helping us evaluate LXR. This work was supported in part by the National Natural Science Foundation of China (No. 62202295, 62172272, 61925206), and in part by Alibaba Group through the Alibaba Innovative Research Program. Corresponding author: Liang Mao (maoliang.ml@alibaba-inc.com).

References

[1] Apache. Welcome to apache hbase. https://hbase.apache.org/, 2022.

[2] Stephen M. Blackburn, Robin Garner, Chris Hoffmann, Asjad M. Khan, Kathryn S. McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony L. Hosking, Maria Jump, Han Bok Lee, J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanovic, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann. The dacapo benchmarks: java benchmarking development and analysis. In OOPSLA, pages 169-190. ACM, 2006.

[3] Stephen M. Blackburn and Kathryn S. McKinley. Immix: a mark-region garbage collector with space efficiency, fast collection, and mutator performance. In PLDI, pages 22-32. ACM, 2008.

[4] Rodrigo Bruno and Paulo Ferreira. POLM2: automatic profiling for object lifetime-aware memory management for hotspot big data applications. In Middleware, pages 147-160. ACM, 2017.

[5] Rodrigo Bruno, Luís Picciochi Oliveira, and Paulo Ferreira. NG2C: pretenuring garbage collection with dynamic generations for hotspot big data applications. In ISMM, pages 2-13. ACM, 2017.

[6] Rodrigo Bruno, Duarte Patrício, José Simão, Luís Veiga, and Paulo Ferreira. Runtime object lifetime profiler for latency sensitive big data applications. In EuroSys, pages 28:1-28:16. ACM, 2019.

[7] Zixian Cai, Stephen M. Blackburn, Michael D. Bond, and Martin Maas. Distilling the real cost of production garbage collectors. In ISPASS, pages 46-57. IEEE, 2022.

[8] Maria Carpen-Amarie, Yaroslav Hayduk, Pascal Felber, Christof Fetzer, Gaël Thomas, and Dave Dice. Towards an efficient pauseless java GC with selective htm-based access barriers. In ManLang, pages 85-91. ACM, 2017.

[9] Jiho Choi, Thomas Shull, and Josep Torrellas. Biased reference counting: minimizing atomic operations in garbage collection. In PACT, pages 35:1-35:12. ACM, 2018.

[10] Cliff Click, Gil Tene, and Michael Wolf. The pauseless GC algorithm. In VEE, pages 46-56. ACM, 2005.

[11] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In SoCC, pages 143-154. ACM, 2010.
[11] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears. 使用 YCSB 基准测试云服务系统. 发表于 SoCC 会议, 第 143-154 页. ACM 出版社, 2010 年.

[12] Standard Performance Evaluation Corporation. The specjbb2915 benchmark. https://www.spec.org/jbb2015/, 2021.
[12] 标准性能评估公司。SPECjbb2015 基准测试。https://www.spec.org/jbb2015/, 2021 年。

[13] David Detlefs, Christine H. Flood, Steve Heller, and Tony Printezis. Garbage-first garbage collection. In ISMM, pages 37-48. ACM, 2004.
[13] David Detlefs、Christine H. Flood、Steve Heller 和 Tony Printezis。G1 垃圾优先收集算法。发表于 ISMM 会议,第 37-48 页。ACM 出版社,2004 年。

[14] L. Peter Deutsch and Daniel G. Bobrow. An efficient, incremental, automatic garbage collector. Commun. ACM, 19(9):522-526, 1976.
[14] L. Peter Deutsch 与 Daniel G. Bobrow。一种高效、增量式、自动化的垃圾回收器。《ACM 通讯》,19(9):522-526 页,1976 年。

[15] Christine H. Flood, Roman Kennke, Andrew E. Dinn, Andrew Haley, and Roland Westrelin. Shenandoah: An open-source concurrent compacting garbage collector for openjdk. In PPP7, pages 13:1-13:9. ACM, 2016.
[15] Christine H. Flood、Roman Kennke、Andrew E. Dinn、Andrew Haley 和 Roland Westrelin。Shenandoah:OpenJDK 开源并发压缩垃圾回收器。发表于 PPP7 会议,第 13:1-13:9 页。ACM 出版社,2016 年。

[16] Dacapo Group. The dacapo benchmark suite (chopin development). https://github.com/dacapobench/dacapobench/tree/devchopin, 2022.
[16] Dacapo Group. Dacapo 基准测试套件(Chopin 开发版)。https://github.com/dacapobench/dacapobench/tree/devchopin, 2022 年。

[17] H2. H2 database engine. https://www.h2database.com/html/main.html, 2022.
[17] H2. H2 数据库引擎。https://www.h2database.com/html/main.html, 2022 年。

[18] Balaji Iyengar, Gil Tene, Michael Wolf, and Edward F. Gehringer. The collie: a wait-free compacting collector. In ISMM, pages 85-96. ACM, 2012.
[18] Balaji Iyengar, Gil Tene, Michael Wolf, 和 Edward F. Gehringer。The Collie:一种无等待的紧凑型垃圾收集器。发表于 ISMM 会议,第 85-96 页。ACM 出版社,2012 年。

[19] Stefan Johansson. Gc progress from jdk 8 to jdk 17. https://kstefanj.github.io/2021/11/24/gc-progress-8-17.html, 2021.
[19] Stefan Johansson。从 JDK 8 到 JDK 17 的 GC 进展。https://kstefanj.github.io/2021/11/24/gc-progress-8-17.html, 2021 年。

[20] Haim Kermany and Erez Petrank. The compressor: concurrent, incremental, and parallel compaction. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 354-363, 2006.
[20] 海姆·科尔曼尼与埃雷兹·佩特兰克。《压缩器:并发、增量式并行压缩技术》。收录于第 27 届 ACM SIGPLAN 编程语言设计与实现会议论文集,第 354-363 页,2006 年。

[21] Chris Lattner and Vikram S. Adve. Transparent pointer compression for linked data structures. In Memory System Performance, pages 2435. ACM, 2005.
[21] 克里斯·拉特纳与维克拉姆·S·阿德维。《链表数据结构的透明指针压缩技术》。收录于内存系统性能专题,第 2435 页,ACM 出版社,2005 年。

[22] Per Lidén and Stefan Karlsson. The z garbage collector - low latency gc for openjdk. http://cr.openjdk.java.net/ pliden/slides/ZGC-Jfokus2018.pdf, 2018.
[22] 佩尔·利登与斯特凡·卡尔松。《Z 垃圾收集器——OpenJDK 的低延迟 GC 技术》。http://cr.openjdk.java.net/ pliden/slides/ZGC-Jfokus2018.pdf,2018 年。

[23] Haoran Ma, Shi Liu, Chenxi Wang, Yifan Qiao, Michael D. Bond, Stephen M. Blackburn, Miryung Kim, and Guoqing Harry Xu. Mako: a low-pause, high-throughput evacuating collector for memorydisaggregated datacenters. In PLDI, pages 92-107. ACM, 2022.
[23] 马浩然、刘石、王晨曦、乔一凡、迈克尔·D·邦德、斯蒂芬·M·布莱克本、金美英与徐国庆。《Mako:面向内存解耦数据中心的高吞吐低暂停转移式回收器》。收录于 PLDI 会议论文集,第 92-107 页,ACM 出版社,2022 年。

[24] Khanh Nguyen, Lu Fang, Guoqing Xu, Brian Demsky, Shan Lu, Sanazsadat Alamian, and Onur Mutlu. Yak: A high-performance big-data-friendly garbage collector. In OSDI, pages 349-365. USENIX Association, 2016.
[24] Khanh Nguyen、Lu Fang、Guoqing Xu、Brian Demsky、Shan Lu、Sanazsadat Alamian 与 Onur Mutlu。Yak:面向高性能大数据场景的友好垃圾回收器。载于 OSDI 会议录,第 349-365 页。USENIX 协会,2016 年。

[25] OpenJDK. Zgc - the z garbage collector. https://openjdk.org/projects/zgc/, 2022.
[25] OpenJDK。ZGC——Z 垃圾回收器。https://openjdk.org/projects/zgc/,2022 年。

[26] Erik Österlund and Welf Löwe. Block-free concurrent GC: stack scanning and copying. In ISMM, pages 1-12. ACM, 2016.
[26] Erik Österlund 与 Welf Löwe。无阻塞并发 GC:栈扫描与复制技术。载于 ISMM 会议录,第 1-12 页。ACM,2016 年。

[27] Filip Pizlo, Daniel Frampton, Erez Petrank, and Bjarne Steensgaard. Stopless: a real-time garbage collector for multiprocessors. In ISMM, pages 159-172. ACM, 2007.
[27] Filip Pizlo、Daniel Frampton、Erez Petrank 与 Bjarne Steensgaard。Stopless:面向多处理器的实时垃圾回收器。载于 ISMM 会议录,第 159-172 页。ACM,2007 年。

[28] Filip Pizlo, Erez Petrank, and Bjarne Steensgaard. A study of concurrent real-time garbage collectors. In PLDI, pages 33-44. ACM, 2008.
[28] Filip Pizlo、Erez Petrank 和 Bjarne Steensgaard。《并发实时垃圾收集器研究》。发表于 PLDI 会议,第 33-44 页。ACM 出版社,2008 年。

[29] Android Open Source Project. Art gc overview. https://source.android.com/docs/core/runtime/gcdebug#art_gc_overview, 2022.
[29] Android 开源项目。《ART 垃圾回收概述》。https://source.android.com/docs/core/runtime/gcdebug#art_gc_overview,2022 年。

[30] Thomas Schatzl. Java garbage collection: The 10-release evolution from jdk 8 to jdk 18. https://blogs.oracle.com/javamagazine/post/java-garbage-collectors-evolution, 2022.
[30] Thomas Schatzl。《Java 垃圾回收:从 JDK 8 到 JDK 18 的 10 个版本演进》。https://blogs.oracle.com/javamagazine/post/java-garbage-collectors-evolution,2022 年。

[31] Rifat Shahriyar, Stephen M. Blackburn, and Daniel Frampton. Down for the count? getting reference counting back in the ring. In ISMM, pages 73-84. ACM, 2012.
[31] Rifat Shahriyar、Stephen M. Blackburn 和 Daniel Frampton。《引用计数卷土重来?》。发表于 ISMM 会议,第 73-84 页。ACM 出版社,2012 年。

[32] Rifat Shahriyar, Stephen M. Blackburn, Xi Yang, and Kathryn S. McKinley. Taking off the gloves with reference counting immix. In OOPSLA, pages 93-110. ACM, 2013.
[32] Rifat Shahriyar, Stephen M. Blackburn, Xi Yang, 和 Kathryn S. McKinley。通过引用计数 Immix 技术摘下手套。收录于 OOPSLA 会议论文集,第 93-110 页。ACM 出版社,2013 年。

[33] Gil Tene, Balaji Iyengar, and Michael Wolf. C4: the continuously concurrent compacting collector. In ISMM, pages 79-88. ACM, 2011.
[33] Gil Tene, Balaji Iyengar, 和 Michael Wolf。C4:持续并发压缩垃圾收集器。收录于 ISMM 会议论文集,第 79-88 页。ACM 出版社,2011 年。

[34] TPC. Tpc-c is an on-line transaction processing benchmark. https://www.tpc.org/tpcc/, 2022.
[34] TPC。TPC-C 在线事务处理基准测试规范。https://www.tpc.org/tpcc/,2022 年。

[35] David M. Ungar. Generation scavenging: A non-disruptive high performance storage reclamation algorithm. In Software Development Environments (SDE), pages 157-167. ACM, 1984.
[35] David M. Ungar。分代式回收:一种非破坏性高性能存储回收算法。收录于软件开发环境(SDE)会议论文集,第 157-167 页。ACM 出版社,1984 年。

[36] Chenxi Wang, Haoran Ma, Shi Liu, Yifan Qiao, Jonathan Eyolfson, Christian Navasca, Shan Lu, and Guoqing Harry Xu. Memliner: Lining up tracing and application for a far-memory-friendly runtime. In OSDI, pages 35-53. USENIX Association, 2022.
[36] 陈曦(音译)、马浩然(音译)、刘石(音译)、乔一凡(音译)、Jonathan Eyolfson、Christian Navasca、陆珊(音译)、徐国庆(音译)。Memliner:为远内存友好型运行时对齐追踪与应用。收录于《操作系统设计与实现研讨会论文集》,第 35-53 页。美国计算机协会,2022 年。

[37] wenyuzhao. Dacapo minheap values. https://gist.github.com/wenyuzhao/29e3e0e10bb68c4f2862851c874e0275, 2023.
[37] 赵文宇(音译)。Dacapo 最小堆数值。https://gist.github.com/wenyuzhao/29e3e0e10bb68c4f2862851c874e0275,2023 年。

[38] wenyuzhao. Incorrect heap usage reporting. https://github.com/mmtk/mmtk-openjdk/issues/270, 2024.
[38] 赵文宇(音译)。堆使用量报告错误。https://github.com/mmtk/mmtk-openjdk/issues/270,2024 年。

[39] Mingyu Wu, Ziming Zhao, Yanfei Yang, Haoyu Li, Haibo Chen, Binyu Zang, Haibing Guan, Sanhong Li, Chuansheng Lu, and Tongbao Zhang. Platinum: A cpu-efficient concurrent garbage collector for tail-reduction of interactive services. In USENIX Annual Technical Conference, pages 159-172. USENIX Association, 2020.
[39] 吴明宇(音译)、赵子铭(音译)、杨燕飞(音译)、李浩宇(音译)、陈海波(音译)、臧彬宇(音译)、管海兵(音译)、李三红(音译)、陆传胜(音译)、张同宝(音译)。Platinum:一种面向交互服务尾延迟优化的 CPU 高效并发垃圾回收器。收录于《USENIX 年度技术会议论文集》,第 159-172 页。美国计算机协会,2020 年。

[40] Yanfei Yang, Mingyu Wu, Haibo Chen, and Binyu Zang. Bridging the performance gap for copy-based garbage collectors atop non-volatile memory. In EuroSys, pages 343-358. ACM, 2021.
[40] 闫飞阳、吴明宇、陈海波、臧斌宇。《基于非易失性内存的复制式垃圾回收器性能优化》。载于 EuroSys 会议论文集,第 343-358 页。ACM 出版社,2021 年。

[41] Wenyu Zhao, Stephen M. Blackburn, and Kathryn S. McKinley. Lowlatency, high-throughput garbage collection. In PLDI, pages 76-91. ACM, 2022.
[41] 赵文宇、Stephen M. Blackburn、Kathryn S. McKinley。《低延迟高吞吐垃圾回收技术》。载于 PLDI 会议论文集,第 76-91 页。ACM 出版社,2022 年。

  1. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    EuroSys '24, April 22-25, 2024, Athens, Greece

    © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.

    ACM ISBN 979-8-4007-0437-6/24/04… $15.00

    https://doi.org/10.1145/3627703.3650087
  2. $^{1}$ Generational Shenandoah is still under development, so we download the latest version (commit f3c9eda) for evaluation.