Best Practices

Performance Benchmarking using trtexec

This section introduces how to use trtexec, a command-line tool designed for TensorRT performance benchmarking, to get the inference performance measurements of your deep learning models.

If you use the TensorRT NGC container, trtexec is installed at /opt/tensorrt/bin/trtexec.

If you manually installed TensorRT, trtexec is part of the installation.

Alternatively, you can build trtexec from source code using the TensorRT OSS repository.

Performance Benchmarking with an ONNX File

If your model is already in the ONNX format, the trtexec tool can measure its performance directly. In this example, we will use the ResNet-50 v1 ONNX model from the ONNX model zoo to showcase how to use trtexec to measure its performance.

For example, the trtexec command to measure the performance of ResNet-50 with batch size 4 is:

trtexec --onnx=resnet50-v1-12.onnx --shapes=data:4x3x224x224 --fp16 --noDataTransfers --useCudaGraph --useSpinWait

Where:

  • The --onnx flag specifies the path to the ONNX file.

  • The --shapes flag specifies the input tensor shapes.

  • The --fp16 flag enables FP16 tactics.

  • The other flags have been added to make performance results more stable.

The value of the --shapes flag uses the format name1:shape1,name2:shape2,... If you do not know the input tensor names and shapes, you can get this information by visualizing the ONNX model with a tool like Netron or by running a Polygraphy model inspection.

For example, running polygraphy inspect model resnet50-v1-12.onnx prints out:

[I] Loading model: /home/pohanh/trt/resnet50-v1-12.onnx
[I] ==== ONNX Model ====
    Name: mxnet_converted_model | ONNX Opset: 12
    ---- 1 Graph Input(s) ----
    {data [dtype=float32, shape=('N', 3, 224, 224)]}
    ---- 1 Graph Output(s) ----
    {resnetv17_dense0_fwd [dtype=float32, shape=('N', 1000)]}
    ---- 299 Initializer(s) ----
    ---- 175 Node(s) ----

It shows that the ONNX model has a graph input tensor named data whose shape is ('N', 3, 224, 224), where 'N' represents that the dimension can be dynamic. Therefore, the trtexec flag to specify the input shapes with batch size 4 would be --shapes=data:4x3x224x224.

After running the trtexec command, trtexec will parse your ONNX file, build a TensorRT plan file, measure the performance of this plan file, and then print a performance summary as follows:

[04/25/2024-23:57:45] [I] === Performance summary ===
[04/25/2024-23:57:45] [I] Throughput: 507.399 qps
[04/25/2024-23:57:45] [I] Latency: min = 1.96301 ms, max = 1.97534 ms, mean = 1.96921 ms, median = 1.96917 ms, percentile(90%) = 1.97122 ms, percentile(95%) = 1.97229 ms, percentile(99%) = 1.97424 ms
[04/25/2024-23:57:45] [I] Enqueue Time: min = 0.0032959 ms, max = 0.0340576 ms, mean = 0.00421173 ms, median = 0.00415039 ms, percentile(90%) = 0.00463867 ms, percentile(95%) = 0.00476074 ms, percentile(99%) = 0.0057373 ms
[04/25/2024-23:57:45] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
[04/25/2024-23:57:45] [I] GPU Compute Time: min = 1.96301 ms, max = 1.97534 ms, mean = 1.96921 ms, median = 1.96917 ms, percentile(90%) = 1.97122 ms, percentile(95%) = 1.97229 ms, percentile(99%) = 1.97424 ms
[04/25/2024-23:57:45] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
[04/25/2024-23:57:45] [I] Total Host Walltime: 3.00355 s
[04/25/2024-23:57:45] [I] Total GPU Compute Time: 3.00108 s
[04/25/2024-23:57:45] [I] Explanations of the performance metrics are printed in the verbose logs.

It prints many performance metrics, but the most important are Throughput and median Latency. In this case, the ResNet-50 model with batch size 4 can run with a throughput of 507 inferences per second (2028 images per second since the batch size is 4) and a median latency of 1.969 ms.

Refer to the Advanced Performance Measurement Techniques section for explanations about what Throughput and Latency mean to your deep learning inference applications. Refer to the trtexec section for detailed explanations about other trtexec flags and other performance metrics that trtexec reports.

Performance Benchmarking with ONNX+Quantization

To enjoy the additional performance benefits of quantization, Quantize/Dequantize operations need to be inserted into the ONNX model to tell TensorRT where to quantize/dequantize the tensors and what scaling factors to use.

Our recommended tool for ONNX quantization is the ModelOptimizer package. You can install it by running:

pip3 install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-modelopt

Using the ModelOptimizer, you can get a quantized ONNX model by running:

python3 -m modelopt.onnx.quantization --onnx_path resnet50-v1-12.onnx --quantize_mode int8 --output_path resnet50-v1-12-quantized.onnx

It loads the original ONNX model from resnet50-v1-12.onnx, runs calibration using random data, inserts Quantize/Dequantize ops into the graph, and then saves the ONNX model with Quantize/Dequantize ops to resnet50-v1-12-quantized.onnx.

Now that the new ONNX model contains the INT8 Quantize/Dequantize ops, we can run trtexec again using a similar command:

trtexec --onnx=resnet50-v1-12-quantized.onnx --shapes=data:4x3x224x224 --stronglyTyped --noDataTransfers --useCudaGraph --useSpinWait

We use the --stronglyTyped flag instead of the --fp16 flag to require TensorRT to strictly follow the data types in the quantized ONNX model, including all the INT8 Quantize/Dequantize ops.

Here is an example output after running this trtexec command with the quantized ONNX model:

[04/26/2024-00:31:43] [I] === Performance summary ===
[04/26/2024-00:31:43] [I] Throughput: 811.74 qps
[04/26/2024-00:31:43] [I] Latency: min = 1.22559 ms, max = 1.23608 ms, mean = 1.2303 ms, median = 1.22998 ms, percentile(90%) = 1.23193 ms, percentile(95%) = 1.23291 ms, percentile(99%) = 1.23395 ms
[04/26/2024-00:31:43] [I] Enqueue Time: min = 0.00354004 ms, max = 0.00997925 ms, mean = 0.00431524 ms, median = 0.00439453 ms, percentile(90%) = 0.00463867 ms, percentile(95%) = 0.00476074 ms, percentile(99%) = 0.00512695 ms
[04/26/2024-00:31:43] [I] H2D Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
[04/26/2024-00:31:43] [I] GPU Compute Time: min = 1.22559 ms, max = 1.23608 ms, mean = 1.2303 ms, median = 1.22998 ms, percentile(90%) = 1.23193 ms, percentile(95%) = 1.23291 ms, percentile(99%) = 1.23395 ms
[04/26/2024-00:31:43] [I] D2H Latency: min = 0 ms, max = 0 ms, mean = 0 ms, median = 0 ms, percentile(90%) = 0 ms, percentile(95%) = 0 ms, percentile(99%) = 0 ms
[04/26/2024-00:31:43] [I] Total Host Walltime: 3.00219 s
[04/26/2024-00:31:43] [I] Total GPU Compute Time: 2.99824 s
[04/26/2024-00:31:43] [I] Explanations of the performance metrics are printed in the verbose logs.

The Throughput is 811 inferences per second, and the median Latency is 1.23 ms. The Throughput has improved by 60% compared to the FP16 performance results in the previous section.

Per-Layer Runtime and Layer Information

In previous sections, we described using trtexec to measure the end-to-end latency. This section shows an example of per-layer runtime and per-layer information using trtexec. This will help you determine how much latency each layer contributes to the end-to-end latency and where the performance bottlenecks are.

This is an example trtexec command to print per-layer runtime and per-layer information using the quantized ResNet-50 ONNX model:

trtexec --onnx=resnet50-v1-12-quantized.onnx --shapes=data:4x3x224x224 --stronglyTyped --noDataTransfers --useCudaGraph --useSpinWait --profilingVerbosity=detailed --dumpLayerInfo --dumpProfile --separateProfileRun

The --profilingVerbosity=detailed flag enables detailed layer information capturing, the --dumpLayerInfo flag shows the per-layer information in the log, and the --dumpProfile and --separateProfileRun flags show the per-layer runtime latencies in the log.

The following code is an example log of the per-layer information for one of the convolution layers in the quantized ResNet-50 model:

Name: resnetv17_stage1_conv0_weight + resnetv17_stage1_conv0_weight_QuantizeLinear + resnetv17_stage1_conv0_fwd, LayerType: CaskConvolution, Inputs: [ { Name: resnetv17_pool0_fwd_QuantizeLinear_Output_1, Location: Device, Dimensions: [4,64,56,56], Format/Datatype: Thirty-two wide channel vectorized row major Int8 format }], Outputs: [ { Name: resnetv17_stage1_relu0_fwd_QuantizeLinear_Output, Location: Device, Dimensions: [4,64,56,56], Format/Datatype: Thirty-two wide channel vectorized row major Int8 format }], ParameterType: Convolution, Kernel: [1,1], PaddingMode: kEXPLICIT_ROUND_DOWN, PrePadding: [0,0], PostPadding: [0,0], Stride: [1,1], Dilation: [1,1], OutMaps: 64, Groups: 1, Weights: {"Type": "Int8", "Count": 4096}, Bias: {"Type": "Float", "Count": 64}, HasBias: 1, HasReLU: 1, HasSparseWeights: 0, HasDynamicFilter: 0, HasDynamicBias: 0, HasResidual: 0, ConvXAsActInputIdx: -1, BiasAsActInputIdx: -1, ResAsActInputIdx: -1, Activation: RELU, TacticName: sm80_xmma_fprop_implicit_gemm_interleaved_i8i8_i8i32_f32_nchw_vect_c_32kcrs_vect_c_32_nchw_vect_c_32_tilesize96x64x64_stage3_warpsize2x2x1_g1_tensor16x8x32_simple_t1r1s1, TacticValue: 0x483ad1560c6e5e27, StreamId: 0, Metadata: [ONNX Layer: resnetv17_stage1_conv0_fwd]

The log shows the layer name, the input and output tensor names, tensor shapes, tensor data types, convolution parameters, tactic names, and metadata. The Metadata field shows which ONNX ops this layer corresponds to. Since TensorRT has graph fusion optimizations, one engine layer may correspond to multiple ONNX ops in the original model.

The following code is an example log of the per-layer runtime latencies for the last few layers in the quantized ResNet-50 model:

[04/26/2024-00:42:55] [I]    Time(ms)     Avg.(ms)   Median(ms)   Time(%)   Layer
[04/26/2024-00:42:55] [I]       56.57       0.0255       0.0256       1.8   resnetv17_stage4_conv7_weight + resnetv17_stage4_conv7_weight_QuantizeLinear + resnetv17_stage4_conv7_fwd
[04/26/2024-00:42:55] [I]      103.86       0.0468       0.0471       3.3   resnetv17_stage4_conv8_weight + resnetv17_stage4_conv8_weight_QuantizeLinear + resnetv17_stage4_conv8_fwd
[04/26/2024-00:42:55] [I]       46.93       0.0211       0.0215       1.5   resnetv17_stage4_conv9_weight + resnetv17_stage4_conv9_weight_QuantizeLinear + resnetv17_stage4_conv9_fwd + resnetv17_stage4__plus2 + resnetv17_stage4_activation2
[04/26/2024-00:42:55] [I]       34.64       0.0156       0.0154       1.1   resnetv17_pool1_fwd
[04/26/2024-00:42:55] [I]       63.21       0.0285       0.0287       2.0   resnetv17_dense0_weight + resnetv17_dense0_weight_QuantizeLinear + transpose_before_resnetv17_dense0_fwd + resnetv17_dense0_fwd + resnetv17_dense0_bias + ONNXTRT_Broadcast + unsqueeze_node_after_resnetv17_dense0_bias + ONNXTRT_Broadcast_ONNXTRT_Broadcast_output + (Unnamed Layer* 851) [ElementWise]
[04/26/2024-00:42:55] [I]     3142.40       1.4149       1.4162     100.0   Total

It shows that the median latency of the resnetv17_pool1_fwd layer is 0.0154 ms and contributes 1.1% of the end-to-end latency. With this log, you can identify which layers take the largest portion of the end-to-end latency and are the performance bottlenecks.

The Total latency reported in the per-layer runtime log is the summation of the per-layer latencies. It is typically slightly longer than the reported end-to-end latency due to the overheads caused by measuring per-layer latencies. For example, the Total median latency is 1.4162 ms, but the end-to-end latency shown in the previous section is 1.23 ms.

Performance Benchmarking with TensorRT Plan File

If you construct the TensorRT INetworkDefinition using TensorRT APIs and build the plan file in a separate script, you can still use trtexec to measure the plan file’s performance.

For example, if the plan file is saved as resnet50-v1-12-quantized.plan, then you can run the trtexec command to measure the performance using this plan file:

trtexec --loadEngine=resnet50-v1-12-quantized.plan --shapes=data:4x3x224x224 --noDataTransfers --useCudaGraph --useSpinWait

The performance summary output is similar to the outputs shown in the previous sections.

Duration and Number of Iterations

By default, trtexec warms up for at least 200 ms and runs inference for at least 10 iterations or at least 3 seconds, whichever is longer. You can modify these parameters by adding the --warmUp=500, --iterations=100, and --duration=60 flags, which mean running the warm-up for at least 500 ms and running the inference for at least 100 iterations or at least 60 seconds, whichever is longer.
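
For example, combining these flags (using the illustrative values above) with the earlier ResNet-50 benchmark gives a command along these lines:

trtexec --onnx=resnet50-v1-12.onnx --shapes=data:4x3x224x224 --fp16 --noDataTransfers --useCudaGraph --useSpinWait --warmUp=500 --iterations=100 --duration=60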

Refer to the trtexec section or run trtexec --help for a detailed explanation about other trtexec flags.

Advanced Performance Measurement Techniques

Before starting any optimization effort with TensorRT, it is essential to determine what should be measured. Without measurements, it is impossible to make reliable progress or measure whether success has been achieved.

Latency: A performance measurement for network inference is how much time elapses from when an input is presented to the network until an output is available. This is the latency of the network for a single inference. Lower latencies are better. In some applications, low latency is a critical safety requirement. In other applications, latency is directly visible to users as a quality-of-service issue. For bulk processing, latency may not be important.

Throughput: Another performance measurement is how many inferences can be completed in a fixed time. This is the throughput of the network. Higher throughput is better. Higher throughputs indicate a more efficient utilization of fixed compute resources. For bulk processing, the total time taken will be determined by the network’s throughput.

Another way of looking at latency and throughput is to fix the maximum latency and measure throughput at that latency. A quality-of-service measurement like this can be a reasonable compromise between the user experience and system efficiency.

Before measuring latency and throughput, you must choose the exact points to start and stop timing. Different points make sense depending on the network and application.

In many applications, there is a processing pipeline, and the latency and throughput of the entire pipeline can measure the overall system performance. Because the pre- and post-processing steps depend so strongly on the particular application, this section considers the latency and throughput of the network inference only.

Wall-Clock Timing

The wall clock time (the elapsed time between the start of a computation and its end) can be useful for measuring the application’s overall throughput and latency and placing inference times in context within a larger system. To measure wall clock time, we can use std::chrono::steady_clock provided by the C++11 <chrono> standard library.

The following example code snippet shows measuring network inference host time:

// C++ (measure host wall-clock time around a single inference)
#include <chrono>

auto startTime = std::chrono::steady_clock::now();
context->enqueueV3(stream);
cudaStreamSynchronize(stream);
auto endTime = std::chrono::steady_clock::now();
float totalTime = std::chrono::duration<float, std::milli>
    (endTime - startTime).count();

# Python (measure host wall-clock time around a single inference)
import time
from cuda import cudart
err, stream = cudart.cudaStreamCreate()
start_time = time.time()
context.execute_async_v3(stream)
cudart.cudaStreamSynchronize(stream)
total_time = time.time() - start_time

If there is only one inference happening on the device at one time, then this can be a simple way of profiling the time various operations take. Inference is typically asynchronous, so ensure you add an explicit CUDA stream or device synchronization to wait for results to become available.

CUDA Events

One problem with timing on the host exclusively is that it requires host/device synchronization. Optimized applications may have many inferences running in parallel on the device with overlapping data movement. In addition, the synchronization adds some noise to timing measurements.

To help with these issues, CUDA provides an Event API. This API allows you to place events into CUDA streams that the GPU will time-stamp as they are encountered. Differences in timestamps can then tell you how long different operations took.

The following example code snippet shows computing the time between two CUDA events:

// C++ (time the inference with CUDA events recorded on the stream)
cudaEvent_t start, end;
cudaEventCreate(&start);
cudaEventCreate(&end);

cudaEventRecord(start, stream);
context->enqueueV3(stream);
cudaEventRecord(end, stream);

cudaEventSynchronize(end);
float totalTime;
cudaEventElapsedTime(&totalTime, start, end);

# Python (time the inference with CUDA events recorded on the stream)
from cuda import cudart
err, stream = cudart.cudaStreamCreate()
err, start = cudart.cudaEventCreate()
err, end = cudart.cudaEventCreate()
cudart.cudaEventRecord(start, stream)
context.execute_async_v3(stream)
cudart.cudaEventRecord(end, stream)
cudart.cudaEventSynchronize(end)
err, total_time = cudart.cudaEventElapsedTime(start, end)

Built-In TensorRT Profiling

Digging deeper into inference performance requires more fine-grained timing measurements within the optimized network.

TensorRT has a Profiler (C++, Python) interface, which you can implement to have TensorRT pass profiling information to your application. When called, the network will run in a profiling mode. After finishing the inference, the profiler object of your class is called to report the timing for each layer in the network. These timings can be used to locate bottlenecks, compare different versions of a serialized engine, and debug performance issues.

The profiling information can be collected from a regular inference enqueueV3() launch or a CUDA graph launch. Refer to IExecutionContext::setProfiler() and IExecutionContext::reportToProfiler() (C++, Python) for more information.
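
The following is a minimal Python sketch of this flow (not taken from the TensorRT samples): it assumes the TensorRT Python bindings, an already-built execution context named context, and a CUDA stream like the ones in the earlier timing examples.

import tensorrt as trt
from cuda import cudart

class LayerTimer(trt.IProfiler):
    """Accumulates the per-layer times that TensorRT reports after each profiled inference."""
    def __init__(self):
        trt.IProfiler.__init__(self)
        self.layer_times_ms = {}

    def report_layer_time(self, layer_name, ms):
        # Called by TensorRT once per layer after the inference finishes.
        self.layer_times_ms[layer_name] = self.layer_times_ms.get(layer_name, 0.0) + ms

profiler = LayerTimer()
context.profiler = profiler              # attach the profiler; inferences now run in profiling mode
context.execute_async_v3(stream)
cudart.cudaStreamSynchronize(stream)
for name, ms in sorted(profiler.layer_times_ms.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {ms:.3f} ms")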

Layers inside a loop are compiled into a single monolithic layer; therefore, separate timings for those layers are unavailable. Also, some subgraphs (especially with Transformer-like networks) are handled by a next-generation graph optimizer that has not yet been integrated with the Profiler APIs. For those networks, use the CUDA Profiling Tools to profile per-layer performance.

An example showing how to use the IProfiler interface is provided in the common sample code (common.h).

Given an input network or plan file, you can use trtexec to profile a network with TensorRT. For more information, refer to the trtexec section.

ONNX Profiling Tools

Nsight Deep Learning Designer is an integrated design environment for ONNX models. It is built on top of TensorRT. Its built-in profiler runs inference for an ONNX model through TensorRT and collects profiling data based on GPU performance metrics. The profiler report generated by Nsight Deep Learning Designer provides a comprehensive view of an ONNX model’s inference performance at the TensorRT layer level. Its GUI also helps developers correlate the performance of individual TensorRT layers with their originating ONNX operators.

Nsight Deep Learning Designer profiling typically begins with the GUI. Open the Nsight Deep Learning Designer application and click Start Activity from the Welcome screen. Select the target platform type from the list, and you may additionally define a remote connection if you wish to profile on a Linux or L4T target from a remote machine. Select Profile TensorRT Model as the activity type.

Start Activity Dialog

Profiler activity settings typically have analogs in trtexec and are split across four tabs in the GUI. Refer to the Nsight Deep Learning Designer documentation for details of each setting. The most frequently used settings are listed here:

  • Common: The ONNX model to profile, its corresponding TensorRT engine if one has already been built, and the location to save the profiler report.

  • Tactics: Typing mode (default typing, type constraints (Layer-Level Control of Precision), or strong typing (Strongly Typed Networks)) and on/off toggles for FP16, BF16, TF32, INT8, and FP8 precisions in weakly typed networks (Network-Level Control of Precision).

  • Optimizer: Refittable weights (Refitting an Engine), INT8 quantization cache path (Post-Training Quantization Using Calibration).

  • Profiler: Locking GPU clocks to base values (GPU Clock Locking and Floating Clock) and GPU counter sampling rate.

Networks using dynamic shapes (Working with Dynamic Shapes) should specify an optimization profile before profiling. This can be done by editing the ONNX network within Nsight Deep Learning Designer, profiling from the command line, or (for compatible networks) setting the Inferred Batch option in the Optimizer tab. When a batch size is provided, input shapes with a single leading wildcard will be automatically populated with the batch size. This feature works with input shapes of arbitrary rank.

To start the Nsight Deep Learning Designer profiler, click the Launch button. The tool will automatically deploy TensorRT and CUDA runtime libraries to the target as needed and then generate a profiling report:

Profiling Report Overview

Nsight Deep Learning Designer includes a command-line profiler; refer to the tool documentation for usage instructions.

Understanding Nsight Deep Learning Designer Timeline View

Sample Timeline View Showing Layer Execution and GPU Activity

In the Nsight Deep Learning Designer Timeline View, each network inference stream is shown as a row of the timeline alongside collected GPU metrics such as SM activity and PCIe throughput. Each layer executed on an inference stream is depicted as a range on the corresponding timeline view. Overhead sources such as tensor memory copies or reformats are highlighted in blue.

Understanding Nsight Deep Learning Designer Layer Table

Sample Profiling Report Layer Table

The Network Metrics table view shows all TensorRT layers executed by the network, their type, dimensions, precision, and inference time. Layer inference times are provided in the table as both raw time measurements and percentages of the inference pass. You can filter the table by name. Hyperlinks in the table indicate where a layer name references nodes in the original ONNX source model. Click the hyperlink or use the drop-down menu in a selected layer’s Name column to open the original ONNX model and highlight the layer in its context.

Pop-up Showing Originating and Related ONNX Nodes for a TensorRT Layer

Selecting a range of layers in the table aggregates their statistics into a higher-level summary. Each table column is represented in the summary area with the most common values observed within the selection, sorted by frequency. Hover the mouse cursor over an information icon to see the full list of values and associated frequencies. Inference time columns are shown as minimum, maximum, mean, and total, using absolute times, and inference pass percentages as units. Total times in this context can be used to quickly sum the inference time for layers within a single execution stream. Layers in the selection do not need to be contiguous.

Sample Aggregated Statistics for a Selection Range

Understanding Nsight Deep Learning Designer Network Graphs

Sample Average Inference Latency Graph for a Simple Network

Nsight Deep Learning Designer shows the average inference latency for each type of layer in the TensorRT engine. This can highlight areas where the network spends significant time on non-critical computations.

Example Graph Showing Tensor Precisions for Layers in a Network

Nsight Deep Learning Designer also shows the precisions used for each type of layer in the TensorRT engine. This can highlight potential opportunities for network quantization and visualize the effect of setting TensorRT’s tactic precision flags.

CUDA Profiling Tools

The recommended CUDA profiler is NVIDIA Nsight Systems. Some CUDA developers may be more familiar with nvprof and nvvp. However, these are being deprecated. These profilers can be used on any CUDA program to report timing information about the kernels launched during execution, data movement between host and device, and CUDA API calls used.

Nsight Systems can be configured to report timing information for only a portion of the program’s execution or to report traditional CPU sampling profile information and GPU information.

The basic usage of Nsight Systems is first to run the command nsys profile -o <OUTPUT> <INFERENCE_COMMAND>, then open the generated <OUTPUT>.nsys-rep file in the Nsight Systems GUI to visualize the captured profiling results.

Profile Only the Inference Phase

When profiling a TensorRT application, you should enable profiling only after the engine has been built. During the build phase, all possible tactics are tried and timed. Profiling this portion of the execution will not show any meaningful performance measurements and will include all possible kernels, not the ones selected for inference. One way to limit the scope of profiling is to:

  • First phase: Structure the application to build and then serialize the engines in one phase.

  • Second phase: Load the serialized engines, run inference in a second phase, and profile this second phase only.

If the application cannot serialize the engines or must run through the two phases consecutively, you can also add cudaProfilerStart() and cudaProfilerStop() CUDA APIs around the second phase and add the -c cudaProfilerApi flag to the Nsight Systems command to profile only the part between cudaProfilerStart() and cudaProfilerStop().
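
A minimal Python sketch of this pattern, assuming an execution context and CUDA stream like those in the earlier examples and an arbitrary iteration count, might look like:

from cuda import cudart

# Phase 1: build or deserialize the engine here (not captured by the profiler).

# Phase 2: with nsys run using -c cudaProfilerApi, only this region is profiled.
cudart.cudaProfilerStart()
for _ in range(50):                      # illustrative number of inference iterations
    context.execute_async_v3(stream)
    cudart.cudaStreamSynchronize(stream)
cudart.cudaProfilerStop()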

Understand Nsight Systems Timeline View

In the Nsight Systems Timeline View, the GPU activities are shown in the rows under CUDA HW, and the CPU activities are shown in the rows under Threads. By default, the rows under CUDA HW are collapsed. Therefore, you must click on it to expand the rows.

In a typical inference workflow, the application calls the context->enqueueV3() or context->executeV3() APIs to enqueue the jobs and then synchronizes on the stream to wait until the GPU completes the jobs. If you only look at the CPU activities, it may appear that the system is doing nothing for a while in the cudaStreamSynchronize() call. The GPU may be busy executing the enqueued jobs while the CPU waits. The following figure shows an example timeline of the inference of a query.

The trtexec tool uses a slightly more complicated approach to enqueue the jobs. It enqueues the next query while the GPU still executes the jobs from the previous query. For more information, refer to the trtexec section.

The following image shows a typical view of the normal inference workloads in the Nsight Systems timeline view, showing CPU and GPU activities on different rows.

Normal Inference Workloads in Nsight Systems Timeline View

Use the NVTX Tracing in Nsight Systems

Tracing uses the NVIDIA Tools Extension SDK (NVTX), a C-based API for marking events and ranges in your applications, and allows Nsight Compute and Nsight Systems to collect data generated by TensorRT applications.

Decoding the kernel names back to layers in the original network can be complicated. Because of this, TensorRT uses NVTX to mark a range for each layer, allowing the CUDA profilers to correlate each layer with the kernels called to implement it. In TensorRT, NVTX helps to correlate the runtime engine layer execution with CUDA kernel calls. Nsight Systems supports collecting and visualizing these events and ranges on the timeline. Nsight Compute also supports collecting and displaying the state of all active NVTX domains and ranges in a given thread when the application is suspended.

In TensorRT, each layer may launch one or more kernels to perform its operations. The exact kernels launched depend on the optimized network and the hardware present. Depending on the builder’s choices, multiple additional operations that reorder data may be interspersed with layer computations; these reformat operations may be implemented as device-to-device memory copies or custom kernels.

For example, the following screenshots are from Nsight Systems.

The Layer Execution and the Kernel Being Launched on the CPU Side

The kernels actually run on the GPU; the following image shows the correlation between the layer execution and kernel launches on the CPU side and their execution on the GPU side.

The Kernels Run on the GPU

Control the Level of Details in NVTX Tracing

By default, TensorRT only shows layer names in the NVTX markers, while users can control the level of detail by setting the ProfilingVerbosity in the IBuilderConfig when the engine is built. For example, to disable NVTX tracing, set the ProfilingVerbosity to kNONE:

// C++
builderConfig->setProfilingVerbosity(ProfilingVerbosity::kNONE);

# Python
builder_config.profiling_verbosity = trt.ProfilingVerbosity.NONE

On the other hand, you can choose to allow TensorRT to print more detailed layer information in the NVTX markers, including input and output dimensions, operations, parameters, tactic numbers, and so on, by setting the ProfilingVerbosity to kDETAILED:

// C++
builderConfig->setProfilingVerbosity(ProfilingVerbosity::kDETAILED);

# Python
builder_config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED

Note

Enabling detailed NVTX markers increases the latency of enqueueV3() calls and could result in a performance drop if the performance depends on the latency of enqueueV3() calls.

Run Nsight Systems with trtexec

Below is an example of the commands to gather Nsight Systems profiles using the trtexec tool:

trtexec --onnx=foo.onnx --profilingVerbosity=detailed --saveEngine=foo.plan
nsys profile -o foo_profile --capture-range cudaProfilerApi trtexec --profilingVerbosity=detailed --loadEngine=foo.plan --warmUp=0 --duration=0 --iterations=50

The first command builds and serializes the engine to foo.plan, and the second command runs the inference using foo.plan and generates a foo_profile.nsys-rep file, which can then be opened in the Nsight Systems user interface for visualization.

The --profilingVerbosity=detailed flag allows TensorRT to show more detailed layer information in the NVTX marking, and the --warmUp=0, --duration=0, and --iterations=50 flags allow you to control how many inference iterations to run. By default, trtexec runs inference for three seconds, which may result in a very large nsys-rep file.

If the CUDA graph is enabled, add the --cuda-graph-trace=node flag to the nsys command to see the per-kernel runtime information:

nsys profile -o foo_profile --capture-range cudaProfilerApi --cuda-graph-trace=node trtexec --profilingVerbosity=detailed --loadEngine=foo.plan --warmUp=0 --duration=0 --iterations=50 --useCudaGraph

(Optional) Enable GPU Metrics Sampling in Nsight Systems

On discrete GPU systems, add the --gpu-metrics-device all flag to the nsys command to sample GPU metrics, including GPU clock frequencies, DRAM bandwidth, Tensor Core utilization, and so on. If the flag is added, these GPU metrics appear in the Nsight Systems web interface.
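
For example, extending the earlier nsys command:

nsys profile -o foo_profile --capture-range cudaProfilerApi --gpu-metrics-device all trtexec --profilingVerbosity=detailed --loadEngine=foo.plan --warmUp=0 --duration=0 --iterations=50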

Profiling for DLA

To profile DLA, add the --accelerator-trace nvmedia flag when using the NVIDIA Nsight Systems CLI or enable Collect other accelerators trace when using the user interface. For example, the following command can be used with the NVIDIA Nsight Systems CLI:

nsys profile -t cuda,nvtx,nvmedia,osrt --accelerator-trace=nvmedia  --show-output=true trtexec --loadEngine=alexnet_int8.plan --warmUp=0 --duration=0 --iterations=20

Here is an example report:

  • NvMediaDLASubmit submits a DLA task for each DLA subgraph. The task’s runtime can be found in the DLA timeline under Other accelerators trace.

  • Because GPU fallback was allowed, TensorRT automatically added some CUDA kernels, like permutationKernelPLC3 and copyPackedKernel, which are used for data reformatting.

  • EGLStream APIs were executed because TensorRT uses EGLStream for data transfer between GPU memory and DLA.

To maximize GPU utilization, trtexec enqueues the queries one batch ahead of time.

Sample DLA Profiling Report

The runtime of the DLA task can be found under Other Accelerator API. Some CUDA kernels and EGLStream APIs are called for the interaction between the GPU and DLA.

Sample DLA Profiling Report

Tracking Memory

Tracking memory usage can be as important as execution performance. Usually, the device’s memory is more constrained than the host’s. To keep track of device memory, the recommended mechanism is to create a simple custom GPU allocator that internally keeps some statistics and then uses the regular CUDA memory allocation functions cudaMalloc and cudaFree.

A custom GPU allocator can be set for the builder IBuilder for network optimizations and IRuntime when deserializing engines using the IGpuAllocator APIs. One idea for the custom allocator is to keep track of the current amount of memory allocated and push an allocation event with a timestamp and other information onto a global list of allocation events. Looking through the list of allocation events allows profiling memory usage over time.
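
The following is a minimal Python sketch of that bookkeeping, using the cuda-python runtime bindings; the class and method names are illustrative, and in a real application these calls would sit behind an IGpuAllocator implementation registered with the builder or runtime:

import time
from cuda import cudart

class TrackingAllocator:
    """Wraps cudaMalloc/cudaFree and records allocation events with timestamps."""
    def __init__(self):
        self.current_bytes = 0
        self.peak_bytes = 0
        self.events = []                     # list of (timestamp, "alloc" or "free", size, ptr)
        self.sizes = {}                      # ptr -> size, needed to account for frees

    def allocate(self, size):
        err, ptr = cudart.cudaMalloc(size)
        self.current_bytes += size
        self.peak_bytes = max(self.peak_bytes, self.current_bytes)
        self.sizes[ptr] = size
        self.events.append((time.time(), "alloc", size, ptr))
        return ptr

    def deallocate(self, ptr):
        size = self.sizes.pop(ptr, 0)
        self.current_bytes -= size
        self.events.append((time.time(), "free", size, ptr))
        cudart.cudaFree(ptr)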

On mobile platforms, GPU memory and CPU memory share the system memory. On devices with very limited memory, like Nano, system memory might run out with large networks, even when the required GPU memory is smaller than the system memory. In this case, increasing the system swap size could solve some problems. An example script is:

echo "######alloc swap######"
if [ ! -e /swapfile ];then
    sudo fallocate -l 4G /swapfile
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo /bin/sh -c 'echo  "/swapfile \t none \t swap \t defaults \t 0 \t 0" >> /etc/fstab'
    sudo swapon -a
fi

Hardware/Software Environment for Performance Measurements

Performance measurements are influenced by many factors, including hardware environment differences like the machine’s cooling capability and software environment differences like GPU clock settings. This section summarizes a few items that may affect performance measurements.

Note that the items involving nvidia-smi are only supported on dGPU systems, not mobile ones.

GPU Information Query and GPU Monitoring

While measuring performance, it is recommended that you record and monitor the GPU status in parallel to the inference workload. Having the monitoring data allows you to identify possible root causes when you see unexpected performance measurement results.

Before the inference starts, call the nvidia-smi -q command to get detailed information on the GPU, including the product name, power cap, clock settings, etc. Then, while the inference workload is running, run the nvidia-smi dmon -s pcu -f <FILE> -c <COUNT> command in parallel to print out GPU clock frequencies, power consumption, temperature, and utilization to a file. Call nvidia-smi dmon --help for more options about the nvidia-smi device monitoring tool.
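
For example, a command of this form writes the monitoring data to a file (the file name and sample count below are placeholder values chosen for illustration):

nvidia-smi dmon -s pcu -f gpu_monitor.log -c 600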

GPU Clock Locking and Floating Clock

By default, the GPU clock frequency is floating, meaning it sits at the idle frequency when there is no active workload and boosts to the boost clock frequency when a workload starts. This is usually the desired behavior since it allows the GPU to generate less heat at idle and to run at maximum speed when there is an active workload.

Alternatively, you can lock the clock at a specific frequency by calling the sudo nvidia-smi -lgc <freq> command (and conversely, you can let the clock float again with the sudo nvidia-smi -rgc command). The sudo nvidia-smi -q -d SUPPORTED_CLOCKS command can find the supported clock frequencies. After the clock frequency is locked, it should stay at that frequency unless power or thermal throttling occurs, which will be explained in the next sections. When the throttling kicks in, the device behaves like the clock floats.
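
A typical benchmarking sequence might therefore look like the following, where the locked frequency is a hypothetical value picked from the supported-clocks list and the trtexec command is the one from the earlier sections:

sudo nvidia-smi -q -d SUPPORTED_CLOCKS
sudo nvidia-smi -lgc 1410
trtexec --loadEngine=resnet50-v1-12-quantized.plan --shapes=data:4x3x224x224 --noDataTransfers --useCudaGraph --useSpinWait
sudo nvidia-smi -rgc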

Running TensorRT workloads with floating clocks or with throttling taking place can lead to more non-determinism in tactic selections and unstable performance measurements across inferences because every CUDA kernel may run at slightly different clock frequencies, depending on which frequency the driver boosts or throttles the clock to at that moment. On the other hand, running TensorRT workloads with locked clocks allows more deterministic tactic selections and consistent performance measurements. Still, the average performance will not be as good as when the clock is floating or is locked at maximum frequency with throttling taking place.

There is no definite recommendation on whether the clock should be locked or which clock frequency to lock the GPU at while running TensorRT workloads. It depends on whether deterministic and stable performance or the best average performance is desired.

GPU Power Consumption and Power Throttling

Power throttling occurs when the average GPU power consumption reaches the power limit, which can be set by the sudo nvidia-smi -pl <power_cap> command. When this happens, the driver has to throttle the clock to a lower frequency to keep the average power consumption below the limit. The constantly changing clock frequencies may lead to unstable performance measurements if the measurements are taken within a short time, such as within 20ms.

Power throttling happens by design and is a natural phenomenon when the GPU clock is not locked or is locked at a higher frequency, especially for GPUs with lower power limits, such as NVIDIA T4 and NVIDIA A2 GPUs. To avoid performance variations caused by power throttling, you can lock the GPU clock at a lower frequency to stabilize the performance numbers. However, the average performance numbers will be lower than those with floating clocks or the clock locked at a higher frequency, even though power throttling would happen in this case.

Another issue with power throttling is that it may skew the performance numbers if there are gaps between inferences in your performance benchmarking applications. For example, if the application synchronizes at each inference, there will be periods when the GPU is idle between the inferences. The gaps cause the GPU to consume less power on average, so the clock is throttled less, and the GPU can run at higher clock frequencies on average. However, the throughput numbers measured this way are inaccurate because when the GPU is fully loaded with no gaps between inferences, the actual clock frequency will be lower, and the actual throughput will not reach the throughput numbers measured using the benchmarking application.

To avoid this, the trtexec tool is designed to maximize GPU execution by leaving nearly no gaps between GPU kernel executions so that it can measure the true throughput of a TensorRT workload. Therefore, if you see performance gaps between your benchmarking application and what trtexec reports, check if the power throttling and the gaps between inferences are the cause.

Lastly, power consumption can depend on the activation values, causing different inputs to produce different performance measurements. For example, if all the network input values are set to zeros or NaNs, the GPU consumes less power than when the inputs are normal values because of fewer bit-flips in DRAM and the L2 cache. To avoid this discrepancy, always use the input values that best represent the actual value distribution when measuring the performance. The trtexec tool uses random input values by default, but you can specify the input using the --loadInputs flag. For more information, refer to the trtexec section.

GPU Temperature and Thermal Throttling

Thermal throttling happens when the GPU temperature reaches a predefined threshold of around 85 degrees Celsius for most GPUs, and the driver has to throttle the clock to a lower frequency to prevent the GPU from overheating. You can tell this by seeing the temperature logged by the nvidia-smi dmon command gradually increasing while the inference workload runs until it reaches ~85C and the clock frequency drops.

If thermal throttling happens on actively cooled GPUs like Quadro A8000, then it is possible that the fans on the GPU are broken or obstacles are blocking the airflow.

If thermal throttling happens on passively cooled GPUs like NVIDIA A10, then it is likely that the GPUs are not properly cooled. Passively cooled GPUs require external fans or air conditioning to cool down the GPUs, and the airflow must go through the GPUs for effective cooling. Common cooling problems include installing GPUs in a server that is not designed for the GPUs or installing the wrong number of GPUs into the server. In some cases, the air flows through the “easy path” (the path with the least friction) around the GPUs instead of going through them. Fixing this requires examination of the airflow in the server and installation of airflow guidance if necessary.

Note that higher GPU temperature also leads to more leakage current in the circuits, which increases the power consumed by the GPU at a specific clock frequency. Therefore, for GPUs more likely to be power throttled like NVIDIA T4, poor cooling can lead to lower stabilized clock frequency with power throttling and, thus, worse performance, even if the GPU clocks have not been thermally throttled yet.

On the other hand, ambient temperature, the environment’s temperature around the server, does not usually affect GPU performance as long as the GPUs are properly cooled, except for GPUs with lower power limits whose performance may be slightly affected.

H2D/D2H Data Transfers and PCIe Bandwidth

On dGPU systems, the input data must often be copied from the host memory to the device memory (H2D) before an inference starts, and the output data must be copied back from the device memory to the host memory (D2H) after the inference. These H2D/D2H data transfers go through PCIe buses, and they can sometimes influence the inference performance or even become the performance bottleneck. The H2D/D2H copies can also be seen in the Nsight Systems profiles, appearing as cudaMemcpy() or cudaMemcpyAsync() CUDA API calls.

To achieve maximum throughput, the H2D/D2H data transfers should run in parallel with the GPU executions of other inferences so that the GPU does not sit idle when the H2D/D2H copies occur. This can be done by running multiple inferences in parallel streams or launching H2D/D2H copies in a different stream than the stream used for GPU executions and using CUDA events to synchronize between the streams. The trtexec tool shows an example of the latter implementation.

When the H2D/D2H copies run in parallel with GPU executions, they can interfere with the GPU executions, especially if the host memory is pageable, which is the default case. Therefore, it is recommended that you allocate pinned host memory for the input and output data using cudaHostAlloc() or cudaMallocHost() CUDA APIs.
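
The following is a minimal Python sketch of these two ideas, using the cuda-python bindings; the device buffer d_input, the size input_nbytes, and the execution context are assumed to exist, as in the earlier examples:

from cuda import cudart

# Pinned (page-locked) host memory so the H2D copy does not go through pageable memory.
err, h_input = cudart.cudaMallocHost(input_nbytes)

# Separate streams: one for H2D/D2H copies, one for GPU execution.
err, copy_stream = cudart.cudaStreamCreate()
err, compute_stream = cudart.cudaStreamCreate()
err, input_ready = cudart.cudaEventCreate()

# Launch the H2D copy on the copy stream and record an event when it completes.
cudart.cudaMemcpyAsync(d_input, h_input, input_nbytes,
                       cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, copy_stream)
cudart.cudaEventRecord(input_ready, copy_stream)

# The compute stream waits only for this copy, so copies for other queries can overlap with execution.
cudart.cudaStreamWaitEvent(compute_stream, input_ready, 0)
context.execute_async_v3(compute_stream)

The trtexec tool implements a more complete version of this overlap, as noted above.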

To check whether the PCIe bandwidth becomes the performance bottleneck, you can check the Nsight Systems profiles and see if the H2D or D2H copies of an inference query have longer latencies than the GPU execution part. If PCIe bandwidth becomes the performance bottleneck, here are a few possible solutions.
要检查 PCIe 带宽是否成为性能瓶颈,您可以检查 Nsight Systems 的分析结果,查看推理查询的 H2D 或 D2H 复制是否比 GPU 执行部分具有更长的延迟。如果 PCIe 带宽成为性能瓶颈,以下是一些可能的解决方案。

First, check whether the PCIe bus configuration of the GPU is correct in terms of which generation (for example, Gen3 or Gen4) and how many lanes (for example, x8 or x16) are used. Next, reduce the amount of data that must be transferred using the PCIe bus. For example, suppose the input images have high resolutions, and the H2D copies become the bottleneck. In that case, you can transmit JPEG-compressed images over the PCIe bus and decode the image on the GPUs before the inference workflow instead of transmitting raw pixels. Finally, consider using NVIDIA GPUDirect technology to load data directly from/to the network or the filesystems without going through the host memory.

In addition, if your system has AMD x86_64 CPUs, check the machine’s NUMA (Non-Uniform Memory Access) configuration with the numactl --hardware command. The PCIe bandwidth between a host memory and a device memory located on two different NUMA nodes is much more limited than the bandwidth between host/device memory located on the same NUMA node. Allocate the host memory on the same NUMA node as the GPU to which the data will be copied, and pin the CPU threads that trigger the H2D/D2H copies to that NUMA node.
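
For example, assuming the target GPU sits on NUMA node 0, the inference process could be bound with numactl as in the following sketch (the application name is a placeholder):

numactl --cpunodebind=0 --membind=0 ./my_inference_app

This binds both the CPU threads and the host memory allocations to NUMA node 0 so that the H2D/D2H copies stay local to that node's PCIe root complex.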

Note that the host and the device share the same memory on mobile platforms, so the H2D/D2H data transfers are not required if the host memory is allocated using CUDA APIs and is pinned instead of pageable.

By default, the trtexec tool measures the latencies of the H2D/D2H data transfers, which tells the user whether the H2D/D2H copies may bottleneck the TensorRT workload. However, if the H2D/D2H copies affect the stability of the GPU Compute Time measurements, you can add the --noDataTransfers flag to disable the H2D/D2H transfers and measure only the latencies of the GPU execution part.

TCC Mode and WDDM Mode#

On Windows machines, there are two driver modes: the GPU can be configured in either TCC mode or WDDM mode. The mode can be specified by calling the sudo nvidia-smi -dm [0|1] command, but a GPU connected to a display shall not be configured into TCC mode. For more information, refer to the TCC mode documentation.

In TCC mode, the GPU is configured to focus on computation work, and graphics support like OpenGL or monitor display is disabled. This is the recommended mode for GPUs that run TensorRT inference workloads. On the other hand, the WDDM mode tends to cause GPUs to have worse and unstable performance results when running inference workloads using TensorRT.

This does not apply to Linux-based operating systems.

Enqueue-Bound Workloads and CUDA Graphs#

The enqueueV3() function of IExecutionContext is asynchronous. That is, it returns immediately after all the CUDA kernels are launched without waiting for the completion of CUDA kernel executions. However, in some cases, the enqueueV3() time can take longer than the actual GPU executions, causing the latency of enqueueV3() calls to become the performance bottleneck. We say that this type of workload is “enqueue-bound.” Two reasons may cause a workload to be enqueue-bound.

First, if the workload is very tiny in terms of the number of computations, such as containing convolutions with small I/O sizes, matrix multiplications with small GEMM sizes, or mostly element-wise operations throughout the network, then the workload tends to be enqueue-bound. This is because most CUDA kernels take the CPU and the driver around 5-15 microseconds to launch per kernel, so if each CUDA kernel execution time is only several microseconds long on average, the kernel launching time becomes the main performance bottleneck.

To solve this, try increasing the computation per CUDA kernel by increasing the batch size. You can also use CUDA Graphs to capture the kernel launches into a graph and launch the graph instead of calling enqueueV3().

Second, a workload is naturally enqueue-bound if it contains operations requiring device synchronizations, such as loops or if-else conditions. Increasing the batch size may help improve the throughput without increasing the latency.

In trtexec, you can tell that a workload is enqueue-bound if the reported Enqueue Time is close to or longer than the reported GPU Compute Time. In this case, it is recommended that you add the --useCudaGraph flag to enable CUDA graphs in trtexec, which will reduce the Enqueue Time as long as the workload does not contain any synchronization operations.

BlockingSync and SpinWait Synchronization Modes#
BlockingSyncSpinWait 同步模式 #

If performance is measured with cudaStreamSynchronize() or cudaEventSynchronize(), synchronization overhead variations may lead to performance measurement variations. This section describes the causes of the variations and how to avoid them.

When cudaStreamSynchronize() is called, there are two ways in which the driver waits until the stream is completed. If the cudaDeviceScheduleBlockingSync flag has been set with a cudaSetDeviceFlags() call, then cudaStreamSynchronize() uses the blocking-sync mechanism. Otherwise, it uses the spin-wait mechanism.

A similar idea applies to CUDA events. If a CUDA event is created with the cudaEventDefault flag, then the cudaEventSynchronize() call uses the spin-wait mechanism. If a CUDA event is created with the cudaEventBlockingSync flag, then the cudaEventSynchronize() call will use the blocking-sync mechanism.
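
As a minimal sketch of how these modes are selected through the CUDA runtime (the choice of flags is illustrative):

// Select blocking-sync behavior for cudaStreamSynchronize() on this device.
// Must be called before the device is initialized by other CUDA runtime calls.
cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

// Events created with cudaEventDefault use spin-wait in cudaEventSynchronize().
cudaEvent_t spinEvent;
cudaEventCreateWithFlags(&spinEvent, cudaEventDefault);

// Events created with cudaEventBlockingSync use blocking-sync instead.
cudaEvent_t blockingEvent;
cudaEventCreateWithFlags(&blockingEvent, cudaEventBlockingSync);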

When the blocking-sync mode is used, the host thread yields to another thread until the device work is done. This allows the CPUs to sit idle to save power or to be used by other CPU workloads when the device is still executing. However, the blocking-sync mode tends to result in relatively unstable overheads in stream/event synchronizations in some OS, leading to variations in latency measurements.

On the other hand, when the spin-wait mode is used, the host thread is constantly polling until the device work is done. Using spin-wait makes the latency measurements more stable due to shorter and more stable overhead in stream/event synchronizations. Still, it consumes some CPU computation resources and leads to more power consumption by the CPUs.

Therefore, if you want to reduce CPU power consumption or do not want the stream/event synchronizations to consume CPU resources (for example, you are running other heavy CPU workloads in parallel), use the blocking-sync mode. If you care more about stable performance measurements, use the spin-wait mode.

In trtexec, the default synchronization mechanism is the blocking-sync mode. Add the --useSpinWait flag to enable synchronizations using the spin-wait mode for more stable latency measurements, at the cost of higher CPU utilization and power consumption.

Optimizing TensorRT Performance#

The following sections focus on the general inference flow on GPUs and some general strategies to improve performance. These ideas apply to most CUDA programmers but may not be as obvious to developers from other backgrounds.

Batching#

The most important optimization is to compute as many results in parallel as possible using batching. In TensorRT, a batch is a collection of inputs that can all be processed uniformly. Each instance in the batch has the same shape and flows through the network similarly. Therefore, each instance can be trivially computed in parallel.

Each network layer will have some overhead and synchronization required to compute forward inference. By computing more results in parallel, this overhead is paid off more efficiently. In addition, many layers are performance-limited by the smallest dimension in the input. If the batch size is one or small, this size can often be the performance-limiting dimension. For example, a fully connected layer with V inputs and K outputs can be implemented for one batch instance as a matrix multiply of a 1xV input matrix with a VxK weight matrix. If N instances are batched, this becomes an NxV matrix multiplied by the VxK weight matrix. The vector-matrix multiply becomes a matrix-matrix multiply, which is much more efficient.

Larger batch sizes are almost always more efficient on the GPU. Extremely large batches, such as N > 2^16, can sometimes require extended index computation and should be avoided if possible. But generally, increasing the batch size improves total throughput. In addition, when the network contains MatrixMultiply layers, batch sizes of multiples of 32 tend to have the best performance for FP16 and INT8 inference because of the utilization of Tensor Cores if the hardware supports them.

On NVIDIA Ada Lovelace or later GPUs, decreasing the batch size may improve the throughput significantly if the smaller batch sizes help the GPU cache the input/output values in the L2 cache. Therefore, various batch sizes should be tried to find the batch size that provides optimal performance.

Sometimes, batching inference work is impossible due to the application’s organization. In some common applications, such as a server that makes inferences per request, it is possible to implement opportunistic batching. For each incoming request, wait for a time T. If other requests come in, batch them together. Otherwise, continue with a single-instance inference. This strategy adds fixed latency to each request but can greatly improve the system’s maximum throughput.

The NVIDIA Triton Inference Server provides a simple way to enable dynamic batching with TensorRT engines.

Using Batching

The batch dimension is part of the tensor dimensions, and you can specify the range of the batch sizes and the batch size to optimize the engine by adding optimization profiles. For more information, refer to the Working With Dynamic Shapes section.
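
For example, a minimal sketch of specifying a batch-size range with an optimization profile; the input name "input", the 3x224x224 spatial shape, and the builder/config objects are placeholders for your own setup:

// Allow batch sizes from 1 to 32 and optimize the engine for batch size 8.
IOptimizationProfile* profile = builder->createOptimizationProfile();
profile->setDimensions("input", OptProfileSelector::kMIN, Dims4{1, 3, 224, 224});
profile->setDimensions("input", OptProfileSelector::kOPT, Dims4{8, 3, 224, 224});
profile->setDimensions("input", OptProfileSelector::kMAX, Dims4{32, 3, 224, 224});
config->addOptimizationProfile(profile);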

Within-Inference Multi-Streaming#

In general, CUDA programming streams are a way of organizing asynchronous work. Asynchronous commands put into a stream are guaranteed to run in sequence but may execute out of order with respect to other streams. In particular, asynchronous commands in two streams may be scheduled to run concurrently (subject to hardware limitations).

In the context of TensorRT and inference, each layer of the optimized final network will require work on the GPU. However, not all layers can fully use the hardware’s computation capabilities. Scheduling requests in separate streams allows work to be scheduled immediately as the hardware becomes available without unnecessary synchronization. Even if only some layers can be overlapped, overall performance will improve.

Use the IBuilderConfig::setMaxAuxStreams() API to set the maximum number of auxiliary streams TensorRT can use to run multiple layers in parallel. The auxiliary streams are in contrast to the mainstream provided in the enqueueV3() call. If enabled, TensorRT will run some layers on the auxiliary streams in parallel to those running on the mainstream.

For example, to run the inference on at most eight streams (that is, seven auxiliary streams and one mainstream) in total:

config->setMaxAuxStreams(7);
config.max_aux_streams = 7

Note that this only sets the maximum number of auxiliary streams. However, TensorRT may use fewer auxiliary streams than this number if it determines that using more streams does not help.

To get the actual number of auxiliary streams that TensorRT uses for an engine, run the following:

int32_t nbAuxStreams = engine->getNbAuxStreams();
num_aux_streams = engine.num_aux_streams

When an execution context is created from the engine, TensorRT automatically creates the auxiliary streams needed to run the inference. However, you can also specify the auxiliary streams you would like TensorRT to use:

int32_t nbAuxStreams = engine->getNbAuxStreams();
std::vector<cudaStream_t> streams(nbAuxStreams);
for (int32_t i = 0; i < nbAuxStreams; ++i)
{
    cudaStreamCreate(&streams[i]);
}
context->setAuxStreams(streams.data(), nbAuxStreams);
from cuda import cudart
num_aux_streams = engine.num_aux_streams
streams = []
for i in range(num_aux_streams):
    err, stream = cudart.cudaStreamCreate()
    streams.append(stream)
context.set_aux_streams(streams)

TensorRT will always insert event synchronizations between the mainstream provided via the enqueueV3() call and the auxiliary streams:

  • At the beginning of the enqueueV3() call, TensorRT will ensure that all the auxiliary streams wait on the activities on the mainstream.

  • At the end of the enqueueV3() call, TensorRT will ensure that the mainstream waits for the activities on all the auxiliary streams.

Enabling auxiliary streams may increase memory consumption because some activation buffers can no longer be reused.

Cross-Inference Multi-Streaming#

In addition to the within-inference streaming, you can enable streaming between multiple execution contexts. For example, you can build an engine with multiple optimization profiles and create an execution context per profile. Then, call the enqueueV3() function of the execution contexts on different streams to allow them to run in parallel.
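
A minimal sketch of this pattern, assuming the engine was built with two optimization profiles and that input/output tensor addresses have already been set on each context:

// Two execution contexts, each bound to its own profile and CUDA stream.
cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);

IExecutionContext* context0 = engine->createExecutionContext();
IExecutionContext* context1 = engine->createExecutionContext();
context0->setOptimizationProfileAsync(0, stream0);
context1->setOptimizationProfileAsync(1, stream1);

// Launch both inferences; the GPU may overlap their execution.
context0->enqueueV3(stream0);
context1->enqueueV3(stream1);

cudaStreamSynchronize(stream0);
cudaStreamSynchronize(stream1);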

Running multiple concurrent streams often leads to several streams sharing compute resources simultaneously. This means the network may have fewer compute resources available during inference than when the TensorRT engine was optimized. This difference in resource availability can cause TensorRT to choose a suboptimal kernel for the actual runtime conditions. To mitigate this effect, you can limit the amount of available compute resources during engine creation to resemble actual runtime conditions more closely. This approach generally promotes throughput at the expense of latency. For more information, refer to the Limiting Compute Resources section.

It is also possible to use multiple host threads with streams. A common pattern is incoming requests dispatched to a pool of worker threads waiting for work. In this case, the pool of worker threads will each have one execution context and CUDA stream. Each thread will request work in its stream as the work becomes available. Each thread will synchronize with its stream to wait for results without blocking other worker threads.

CUDA Graphs#

CUDA Graphs represent a sequence (or, more generally, a graph) of kernels in a way that allows CUDA to optimize their scheduling. This can be particularly useful when your application performance is sensitive to the CPU time to queue the kernels.

Using CUDA Graphs with TensorRT Execution Context#

TensorRT’s enqueueV3() method supports CUDA graph capture for models requiring no mid-pipeline CPU interaction. For example:

// Call enqueueV3() once after an input shape change to update internal state.
context->enqueueV3(stream);

// Capture a CUDA graph instance
cudaGraph_t graph;
cudaGraphExec_t instance;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
context->enqueueV3(stream);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&instance, graph, 0);

// To run inferences, launch the graph instead of calling enqueueV3().
for (int i = 0; i < iterations; ++i) {
    cudaGraphLaunch(instance, stream);
    cudaStreamSynchronize(stream);
}
from cuda import cudart
err, stream = cudart.cudaStreamCreate()

# Call execute_async_v3() once after an input shape change to update internal state.
context.execute_async_v3(stream)

# Capture a CUDA graph instance
cudart.cudaStreamBeginCapture(stream, cudart.cudaStreamCaptureMode.cudaStreamCaptureModeGlobal)
context.execute_async_v3(stream)
err, graph = cudart.cudaStreamEndCapture(stream)
err, instance = cudart.cudaGraphInstantiate(graph, 0)

# To run inferences, launch the graph instead of calling execute_async_v3().
for i in range(iterations):
    cudart.cudaGraphLaunch(instance, stream)
    cudart.cudaStreamSynchronize(stream)

Limitations of CUDA Graphs#

CUDA graphs cannot handle some operations, so graph capturing may fail if the execution context contains such operations. Typical deep learning operators unsupported by CUDA graphs include loops, conditionals, and layers requiring data-dependent shapes. In these cases, cudaStreamEndCapture() will return cudaErrorStreamCapture* errors, indicating that the graph capturing has failed, but the context can continue to be used for normal inference without CUDA graphs. Refer to the CUDA Programming Guide to learn more about the limitations of CUDA graphs.

Also, when capturing a graph, it is important to account for the two-phase execution strategy used in the presence of dynamic shapes.

  1. Update the model’s internal state to account for any changes in input size.

  2. Stream work to the GPU.

The first phase requires no per-invocation work for models where input size is fixed at build time. Otherwise, if the input sizes have changed since the last invocation, some work may be required to update derived properties.

The first phase of work is not designed to be captured, and even if the capture is successful, it may increase model execution time. Therefore, after changing the shapes of inputs or the values of shape tensors, call enqueueV3() once to flush deferred updates before capturing the graph.

Graphs captured with TensorRT are specific to the input size and the state of the execution context. Modifying the context from which the graph was captured will result in undefined behavior when executing the graph—in particular, if the application is providing its memory for activations using createExecutionContextWithoutDeviceMemory(), the memory address is also captured as part of the graph. Locations of input and output buffers are also captured as part of the graph.

Therefore, the best practice is to use one execution context per captured graph and to share memory across the contexts with createExecutionContextWithoutDeviceMemory().

trtexec allows you to check whether the TensorRT engine you built is compatible with CUDA graph capture. For more information, refer to the trtexec section.

Concurrent CUDA Activities with CUDA Graph Capture#

Launching a CUDA kernel on the CUDA legacy default stream or calling synchronous CUDA APIs like cudaMemcpy() while capturing a CUDA graph fails because these CUDA activities implicitly synchronize the CUDA streams used by TensorRT execution contexts.

To avoid breaking the CUDA graph capture, ensure other CUDA kernels are launched on non-default CUDA streams and use the asynchronous version of CUDA APIs, like cudaMemcpyAsync().

Alternatively, a CUDA stream can be created with the cudaStreamNonBlocking flag to capture the CUDA graph for an execution context. If the execution context uses auxiliary streams, make sure you also call the setAuxStreams() API using streams created with the cudaStreamNonBlocking flag. Refer to the Within-Inference Multi-Streaming section about how to set auxiliary streams in TensorRT execution contexts.
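
For example, a minimal sketch of creating such a stream (error handling omitted):

// A non-blocking stream does not implicitly synchronize with the legacy default
// stream, so unrelated default-stream work cannot break the graph capture.
cudaStream_t captureStream;
cudaStreamCreateWithFlags(&captureStream, cudaStreamNonBlocking);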

Enabling Fusion#

Layer Fusion#

TensorRT attempts to perform many different types of optimizations in a network during the build phase. In the first phase, layers are fused whenever possible. Fusions transform the network into a simpler form but preserve the same overall behavior. Internally, many layer implementations have extra parameters and options that are not directly accessible when creating the network. Instead, the fusion optimization step detects supported patterns of operations and fuses multiple layers into one layer with an internal options set.

Consider the common case of a convolution followed by ReLU activation. Creating a network with these operations involves adding a Convolution layer with addConvolutionNd and following it with an Activation layer using addActivation with an ActivationType of kRELU. The unoptimized graph will contain separate layers for convolution and activation. The internal implementation of convolution supports computing the ReLU function on the output in one step directly from the convolution kernel without requiring a second kernel call. The fusion optimization step will detect the convolution followed by ReLU, verify that the implementation supports the operations, and then fuse them into one layer.
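
A minimal sketch of how such a fusable pair is defined with the network definition API; the output map count, kernel size, and the convWeights/convBias objects are placeholders:

// Convolution followed by ReLU; the builder can fuse these into a single layer.
IConvolutionLayer* conv = network->addConvolutionNd(
    *input, /*nbOutputMaps=*/64, DimsHW{3, 3}, convWeights, convBias);
IActivationLayer* relu = network->addActivation(
    *conv->getOutput(0), ActivationType::kRELU);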

To investigate which fusions have occurred, the builder logs its operations to the logger object provided during construction. Optimization steps are at the kINFO log level. To see these messages, ensure you log them in the ILogger callback.

Fusions are normally handled by creating a new layer with a name containing the names of both of the layers that were fused. For example, a MatrixMultiply layer (InnerProduct) named ip1 is fused with a ReLU Activation layer named relu1 to create a new layer named ip1 + relu1.

Types of Fusions#

The following list describes the types of supported fusions.

Supported Layer Fusions

  • ReLU Activation: An Activation layer performing ReLU followed by another Activation layer performing ReLU will be replaced by a single Activation layer.

  • Convolution and ReLU Activation: The Convolution layer can be of any type, and values are not restricted. The Activation layer must be of the ReLU type.

  • Convolution and GELU Activation: The input and output precision should be the same, with both of them FP16 or INT8. The Activation layer must be of the GELU type. TensorRT must be running on an NVIDIA Turing or later GPU with CUDA version 10.0.

  • Convolution and Clip Activation: The Convolution layer can be any type, and values are not restricted. The Activation layer must be Clip type.

  • Scale and Activation: The Scale layer, followed by an Activation layer, can be fused into a single Activation layer.

  • Convolution and ElementWise Operation: A Convolution layer followed by a simple sum, min, or max in an ElementWise layer can be fused into the Convolution layer. The sum must not use broadcasting unless the broadcasting is across the batch size.

  • Padding and Convolution/Deconvolution: If all the padding sizes are non-negative, padding followed by a Convolution or Deconvolution can be fused into a single Convolution/Deconvolution layer.

  • Shuffle and Reduce: A Shuffle layer without reshaping, followed by a Reduce layer, can be fused into a single Reduce layer. The Shuffle layer can perform permutations but cannot perform any reshape operation. The Reduce layer must have keepDimensions set.

  • Shuffle and Shuffle: Each Shuffle layer consists of a transpose, a reshape, and a second transpose. A Shuffle layer followed by another can be replaced by a single Shuffle (or nothing). If both Shuffle layers perform reshape operations, this fusion is only allowed if the second transpose of the first shuffle is the inverse of the first transpose of the second shuffle.

  • Scale: A Scale layer that adds 0, multiplies by 1, or raises to the power of 1 can be erased.

  • Convolution and Scale: Adjusting the convolution weights can fuse a convolution layer followed by a Scale layer that is kUNIFORM or kCHANNEL into a single convolution. This fusion is disabled if the scale has a non-constant power parameter.

  • Convolution and Generic Activation: This fusion happens after the pointwise fusion mentioned below. A pointwise with one input and output can be called a generic activation layer. A convolution layer followed by a generic activation layer can be fused into a single convolution layer.

  • Reduce: A Reduce layer that performs average pooling will be replaced by a Pooling layer. The Reduce layer must have keepDimensions set and reduce across the H and W dimensions of a CHW input format before batching, using the kAVG operation.

  • Convolution and Pooling: The Convolution and Pooling layers must have the same precision. The Convolution layer may already have a fused activation operation from a previous fusion.

  • Depthwise Separable Convolution: A depthwise convolution with activation followed by a convolution with activation may sometimes be fused into a single optimized DepSepConvolution layer. The precision of both convolutions must be INT8, and the device’s compute capability must be 7.2 or later.

  • Softmax and Log: A Softmax followed by a Log can be fused into a single Softmax layer if the Softmax has not already been fused with a previous Log operation.

  • Softmax and TopK: A Softmax followed by a TopK can be fused into a single layer. The Softmax may or may not include a Log operation.

Supported Reduction Operation Fusions

  • GELU: A group of Unary and ElementWise layers representing the following equations can be fused into a single GELU reduction operation.

    \(0.5x\times \left( 1+\tanh\left( \sqrt{\frac{2}{\pi}}\left( x+0.044715x^{3} \right) \right) \right)\)

    Or the alternative representation:

    \(0.5x \times \left( 1+\operatorname{erf}\left( \frac{x}{\sqrt{2}} \right) \right)\)

  • L1Norm: A Unary layer kABS operation followed by a Reduce layer kSUM operation can be fused into a single L1Norm reduction operation.

  • Sum of Squares: A product ElementWise layer with the same input (square operation) followed by a kSUM reduction can be fused into a single square sum reduction operation.

  • L2Norm: A sum of squares operation followed by a kSQRT UnaryOperation can be fused into a single L2Norm reduction operation.

  • LogSum: A Reduce layer kSUM followed by a kLOG UnaryOperation can be fused into a single LogSum reduction operation.

  • LogSumExp: A Unary kEXP ElementWise operation followed by a LogSum fusion can be fused into a single LogSumExp reduction operation.

Pointwise Fusion#

Multiple adjacent Pointwise layers can be fused into a single Pointwise layer to improve performance.

The following types of Pointwise layers are supported, with some limitations:

  • Activation: Every ActivationType is supported.

  • Constant: Only constant with a single value (size == 1).

  • ElementWise: Every ElementWiseOperation is supported.

  • Pointwise: Pointwise itself is also a Pointwise layer.

  • Scale: Only ScaleMode::kUNIFORM is supported.

  • Unary: Every UnaryOperation is supported.

The size of the fused Pointwise layer is not unlimited, so some layers may not be fused.

Fusion creates a new layer with a name consisting of both fused layers. For example, an ElementWise layer named add1 is fused with a ReLU Activation layer named relu1, creating a new layer named fusedPointwiseNode(add1, relu1).

Q/DQ Fusion#

Refer to the Explicit Quantization section for suggestions on optimizing INT8 and FP8 networks containing QuantizeLinear and DequantizeLinear layers.

Multi-Head Attention Fusion#

We highly recommend tailoring your model to the restrictions listed below so that Multi-Head Attention (MHA) fusion happens. This is important because it supports large sequence lengths by significantly reducing the memory footprint from O(S^2) to O(S), where S is the sequence length. On top of that, it shares the common performance benefits of operator fusion, that is, reduced memory traffic, better hardware utilization, and lower kernel launch and synchronization overhead.

Multi-head attention (MHA) computes softmax(Q * K^T / scale + mask) * V, where:

  • Q is query embedding

  • K is key embedding

  • V is value embeddings

The shape of Q is [B, N, S_q, H], and the shapes of K and V are [B, N, S_kv, H], where:

  • B is batch size

  • N is the number of attention heads

  • H is the head/hidden size

  • S_q and S_kv are the sequence lengths of the query and key/value, respectively.

MHA Fusion Pattern for FP16 and BF16

TensorRT chooses the accumulation precision by default based on the input types and performance considerations. However, you can also control accumulation precision (refer to Control of Computational Precision).

MHA Fusion Pattern for FP8 and INT8

The MHA fusion captures common pointwise operators in series in MHA, as mentioned in the pointwise operation list. It also covers Q/DQ fusion following MHA for certain quantization modes and architectures (for example, FP16/BF16 to FP8/INT8 on the NVIDIA Ampere GPU architecture).

Supported MHA Fusion Types#

| Feature | FP16 | BF16 | INT8 | FP8 |
| --- | --- | --- | --- | --- |
| SM Version (V) | SM75 ≤ V ≤ SM90 | SM80 ≤ V ≤ SM90 | SM75 ≤ V ≤ SM90 | SM89, SM90 |
| Head Size (H) | 16 ≤ H ≤ 256, H % 8 == 0 | 16 ≤ H ≤ 256, H % 8 == 0 | 16, 32, 64 | 32 ≤ H ≤ 256, H % 16 == 0 |
| Sequence Length (S_q, S_kv) | No restriction | No restriction | S_{q,kv} ≤ 512 | S_q = 1 or 32 ≤ S_{q,kv} |
| Quantization | Not required | Not required | Specify Q/DQ layers in the MHA pattern for FP8 and INT8. | Specify Q/DQ layers in the MHA pattern for FP8 and INT8. |
| Accumulation Precision (BMM1) | FP16 or FP32 | FP32 | INT32 | FP32 |
| Accumulation Precision (BMM2) | FP16, or FP32 (if SM90 and S_q = S_kv) | FP32 | INT32 | FP32 |
| Supported Mask Type | Any masking (for example, Select operator in TensorRT) | Any masking (for example, Select operator in TensorRT) | Any masking (for example, Select operator in TensorRT) | Any masking (for example, Select operator in TensorRT) |
| Pointwise op | Activation, Constant, Elementwise (including SiLU), Pointwise (single input), Scale, and Unary | Activation, Constant, Elementwise (including SiLU), Pointwise (single input), Scale, and Unary | Activation, Constant, Elementwise (including SiLU), Pointwise (single input), Scale, and Unary | Activation, Constant, Elementwise (including SiLU), Pointwise (single input), Scale, and Unary |

Supported MHA Fusions on SM100#

| Feature | FP16 | BF16 | FP8 |
| --- | --- | --- | --- |
| SM Version (V) | SM100 | SM100 | SM100 |
| Head Size (H) | 8 ≤ H ≤ 128, H % 8 == 0 | 8 ≤ H ≤ 128, H % 8 == 0 | 16 ≤ H ≤ 128, H % 16 == 0 |
| Sequence Length (S_q, S_kv) | No restriction | No restriction | No restriction |
| Quantization | Not required | Not required | Specify Q/DQ layers in the MHA pattern for FP8 and INT8. |
| Accumulation Precision (BMM1) | FP32 | FP32 | FP32 |
| Accumulation Precision (BMM2) | FP32 | FP32 | FP32 |
| Supported Mask Type | 1d vector / scalar | 1d vector / scalar | 1d vector / scalar |
| Pointwise op | Activation, Constant, Elementwise (including SiLU), Pointwise (single input), Scale, and Unary | Activation, Constant, Elementwise (including SiLU), Pointwise (single input), Scale, and Unary | Activation, Constant, Elementwise (including SiLU), Pointwise (single input), Scale, and Unary |

TensorRT may decide not to fuse an MHA graph into a single kernel based on performance evaluations or other constraints.

Example Workflow: FP8 MHA Fusion#

Assume you have an ONNX model, vit_base_patch8_224_Opset17.onnx, and calibration data, calib.npy, on your local machine.

  1. Install the TensorRT model optimizer.

    pip3 install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-modelopt
    
  2. Quantize a model with TensorRT model optimizer. For more information, refer to these detailed instructions.

    python3 -m modelopt.onnx.quantization \
    --onnx_path=vit_base_patch8_224_Opset17.onnx \
    --quantize_mode=<fp8|int8> \
    --calibration_data=calib.npy \
    --calibration_method=<max|entropy> \
    --output_path=vit_base_patch8_224_Opset17.quant.onnx
    
  3. Compile the quantized model with TensorRT.

    trtexec --onnx=vit_base_patch8_224_Opset17.quant.onnx \
    --saveEngine=vit_base_patch8_224_Opset17.engine \
    --stronglyTyped --skipInference --profilingVerbosity=detailed
    
  4. Run the quantized model with TensorRT.

    trtexec --loadEngine=vit_base_patch8_224_Opset17.engine \
    --useCudaGraph --noDataTransfers --useSpinWait
    

    Add the following options if you want to check if MHA is fused. MHA should be fused if you find the mha op in the output.log file.

    trtexec --loadEngine=vit_base_patch8_224_Opset17.engine \
    --profilingVerbosity=detailed --dumpLayerInfo --skipInference &> output.log
    

    Tip

    • There are two ways to set the accumulation data type to FP32:

      1. Manually set computational precision. For more information, refer to these detailed instructions.

      2. Convert your ONNX model using TensorRT Model Optimizer, which adds the Cast ops automatically.

    • If the MHA has a head size (H) that is not a multiple of 16, do not add Q/DQ ops in the MHA so that it falls back to the FP16 MHA for better performance.

    • Given the restrictions, compare INT8 with FP8 for MHA fusion.

Limiting Compute Resources#

Limiting the number of compute resources available to TensorRT during engine creation is beneficial when the reduced amount better represents the expected conditions during runtime. For example, when the GPU is expected to be performing additional work in parallel to the TensorRT engine or when the engine is expected to be run on a different GPU with fewer resources (note that the recommended approach is to build the engine on the GPU that will be used for inference, but this may not always be feasible).

You can limit the number of available compute resources with the following steps:

  1. Start the CUDA MPS control daemon.

    nvidia-cuda-mps-control -d
    
  2. Set the number of computing resources to use with the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable. For example, export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50.

  3. Build the network engine.

  4. Stop the CUDA MPS control daemon.

    echo quit | nvidia-cuda-mps-control
    

The resulting engine is optimized to the reduced number of compute cores (50% in this example) and provides better throughput when using similar conditions during inference. You are encouraged to experiment with different amounts of streams and different MPS values to determine the best performance for your network.

For more details about nvidia-cuda-mps-control, refer to the nvidia-cuda-mps-control documentation and the relevant GPU requirements here.

Deterministic Tactic Selection#

TensorRT runs through all the possible tactics in the engine-building phase and selects the fastest ones. Since the selection is based on the tactics’ latency measurements, TensorRT may select different tactics across different runs if some have similar latencies. Therefore, different engines built from the same INetworkDefinition may behave slightly differently regarding output values and performance. You can inspect the selected tactics of an engine by using the engine inspector APIs or by turning on verbose logging while building the engine.
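
For example, a minimal sketch of dumping per-layer information with the engine inspector (full tactic details require building the engine with profiling verbosity set to kDETAILED):

// Inspect which layers and tactics ended up in the built engine.
auto inspector = std::unique_ptr<IEngineInspector>(engine->createEngineInspector());
std::cout << inspector->getEngineInformation(LayerInformationFormat::kJSON) << std::endl;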

If deterministic tactic selection is desired, the following lists a few suggestions that may help improve the determinism of tactic selection.

Locking GPU Clock Frequency

By default, the GPU’s clock frequency is not locked, meaning that the GPU normally sits at the idle clock frequency and only boosts to the max clock frequency when there are active GPU workloads. However, there is a latency for the clock to be boosted from the idle frequency, and that may cause performance variations while TensorRT is running through the tactics and selecting the best ones, resulting in non-deterministic tactic selections.

Therefore, locking the GPU clock frequency before building a TensorRT engine may improve the determinism of tactic selection. Refer to the Hardware/Software Environment for Performance Measurements section for more information about how to lock and monitor the GPU clock and the factors that may affect GPU clock frequencies.

Increasing Average Timing Iterations

By default, TensorRT runs each tactic for at least four iterations and takes the average latency. You can increase the number of iterations by calling the setAvgTimingIterations() API:

builderConfig->setAvgTimingIterations(8);
builder_config.avg_timing_iterations = 8

Increasing the number of average timing iterations may improve the determinism of tactic selections, but the required engine-building time will become longer.

Using Timing Cache

Timing Cache records the latencies of each tactic for a specific layer configuration. The tactic latencies are reused if TensorRT encounters another layer with an identical configuration. Therefore, by reusing the same timing cache across multiple engine-building runs with the same INetworkDefinition and builder config, you can make TensorRT select an identical set of tactics in the resulting engines.

Overhead of Shape Change and Optimization Profile Switching#

After the IExecutionContext switches to a new optimization profile or the shapes of the input bindings change, TensorRT must recompute the tensor shapes throughout the network and recompute the resources needed by some tactics for the new shapes before the next inference can start. That means the first enqueueV3() call after a shape/profile change may be longer than the subsequent enqueueV3() calls.
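
A minimal sketch of warming up after a shape change so that this recomputation is not attributed to the measured calls; the tensor name "input" and the shape are placeholders:

// The first enqueueV3() after a shape change absorbs the recomputation overhead.
context->setInputShape("input", Dims4{8, 3, 224, 224});
context->enqueueV3(stream);
cudaStreamSynchronize(stream);

// Subsequent calls reflect steady-state latency.
context->enqueueV3(stream);
cudaStreamSynchronize(stream);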

Optimizing Layer Performance#

The following descriptions detail how you can optimize the listed layers.

  • Gather: Use an axis of 0 to maximize the performance of a Gather layer. There are no fusions available for a Gather layer.

  • Reduce: To get the maximum performance out of a Reduce layer, perform the reduction across the last dimensions (tail reduce). This allows optimal memory to read/write patterns through sequential memory locations. If doing common reduction operations, express the reduction in a way that will be fused to a single operation.

  • RNN: Loop-based API provides a much more flexible mechanism for using general layers within recurrence. The ILoopLayer recurrence enables a rich set of automatic loop optimizations, including loop fusion, unrolling, and loop-invariant code motion, to name a few. For example, significant performance gains are often obtained when multiple instances of the same MatrixMultiply layer are properly combined to maximize machine utilization after loop unrolling along the sequence dimension. This works best if you can avoid a MatrixMultiply layer with a recurrent data dependence along the sequence dimension.

  • Shuffle: Shuffle operations equivalent to identity operations on the underlying data are omitted if the input tensor is only used in the shuffle layer and the input and output tensors of this layer are not input and output tensors of the network. TensorRT does not execute additional kernels or memory copies for such operations.

  • TopK: To get the maximum performance out of a TopK layer, use small values of K, reducing the last dimension of data to allow optimal sequential memory access. Reductions along multiple dimensions at once can be simulated using a Shuffle layer to reshape the data and then appropriately reinterpret the index values.

For more information about layers, refer to the TensorRT Operator documentation.

Optimizing for Tensor Cores#

Tensor Core is a key technology for delivering high-performance inference on NVIDIA GPUs. In TensorRT, Tensor Core operations are supported by all compute-intensive layers: MatrixMultiply, Convolution, and Deconvolution.

Tensor Core layers tend to achieve better performance if the I/O tensor dimensions are aligned to a certain minimum granularity:

  • The alignment requirement is on the I/O channel dimension in the Convolution and Deconvolution layers.

  • In the MatrixMultiply layer, the alignment requirement is on matrix dimensions K and N in a MatrixMultiply that is M x K times K x N.

The following table captures the suggested tensor dimension alignment for better Tensor Core performance.

Types of Tensor Cores#

| Tensor Core Operation Type | Suggested Tensor Dimension Alignment in Elements |
| --- | --- |
| TF32 | 4 |
| FP16 | 8 for dense math, 16 for sparse math |
| INT8 | 32 |

When using Tensor Core implementations in cases where these requirements are not met, TensorRT implicitly pads the tensors to the nearest multiple of the alignment. Consider rounding up the dimensions in the model definition instead, which allows for extra capacity in the model without increasing computation or memory traffic.

TensorRT always uses the fastest implementation for a layer, and thus, in some cases, it may not use a Tensor Core implementation even if it is available.

To check if Tensor Core is used for a layer, run Nsight Systems with the --gpu-metrics-device all flag while profiling the TensorRT application. The Tensor Core usage rate can be found in the profiling result in the Nsight Systems user interface under the SM instructions/Tensor Active row. Refer to the CUDA Profiling Tools for more information about using Nsight Systems to profile TensorRT applications.
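
For example, a profiling command along these lines (the engine path is a placeholder, and the exact flag syntax may vary between Nsight Systems versions):

nsys profile --gpu-metrics-device=all -o tensor_core_profile trtexec --loadEngine=resnet50.engine --useSpinWait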

It is impractical to expect a CUDA kernel to reach 100% Tensor Core usage since there are other overheads such as DRAM reads/writes, instruction stalls, other computation units, etc. The more computation-intensive an operation is, the higher the Tensor Core usage rate the CUDA kernel can achieve.

The following image is an example of Nsight Systems profiling.

Tensor Core Activities on an A100 GPU Running ResNet-50 with FP16 Enabled

Optimizing Plugins#

TensorRT provides a mechanism for registering custom plugins that perform layer operations. After a plugin creator is registered, you can search the registry to find the creator and add the corresponding plugin object to the network during serialization/deserialization.

Once the plugin library is loaded, all TensorRT plugins are automatically registered. For more information about custom plugins, refer to Extending TensorRT With Custom Layers.

Plugin performance depends on the CUDA code performing the plugin operation. Standard CUDA Best Practices apply. When developing plugins, starting with simple standalone CUDA applications that perform the plugin operation and verify correctness can be helpful. The plugin program can then be extended with performance measurements, more unit testing, and alternate implementations. After the code is working and optimized, it can be integrated as a plugin into TensorRT.

Supporting as many formats as possible in the plugin is important to get the best performance possible. This removes the need for internal reformat operations during the execution of the network. Refer to the Extending TensorRT With Custom Layers section for examples.

Optimizing Python Performance#

Most of the same performance considerations apply when using the Python API. When building engines, the builder optimization phase will normally be the performance bottleneck, not API calls to construct the network. Inference time should be nearly identical between the Python API and C++ API.

Setting up the input buffers in the Python API involves using pycuda or another CUDA Python library, like cupy, to transfer the data from the host to device memory. The details of how this works will depend on where the host data comes from. Internally, pycuda supports the Python Buffer Protocol, allowing efficient access to memory regions. This means that if the input data is available in a suitable format in numpy arrays or another type with support for the buffer protocol, it allows efficient access and transfer to the GPU. For even better performance, allocate a page-locked buffer using pycuda and write your final preprocessed input.

For more information about using the Python API, refer to the Python API documentation.

Improving Model Accuracy#

Depending on the builder configuration, TensorRT can execute a layer in FP32, FP16, BF16, FP8, or INT8 precision. By default, TensorRT chooses to run a layer in a precision that results in optimal performance. Sometimes, this can result in poor accuracy. Generally, running a higher-precision layer helps improve accuracy with some performance hits.

There are several steps that we can take to improve model accuracy:

  1. Validate layer outputs:

    1. Use Polygraphy to dump layer outputs and verify no NaNs or Infs. The --validate option can check for NaNs and Infs. Also, we can compare layer outputs with golden values from, for example, ONNX runtime (see the example commands after this list).

    2. For FP16 and BF16, a model might require retraining to ensure that intermediate layer output can be represented in FP16/BF16 precision without overflow or underflow.

    3. For INT8, consider recalibrating with a more representative calibration data set. If your model comes from PyTorch, we also provide the TensorRT Model Optimizer for QAT in the framework besides PTQ in TensorRT. You can try both approaches and choose the one with more accuracy.

  2. Manipulate layer precision:

    1. Sometimes, running a layer with a certain precision results in incorrect output. This can be due to inherent layer constraints (for example, LayerNorm output should not be INT8) or model constraints (the outputs diverge, resulting in poor accuracy).

    2. You can control layer execution precision and output precision.

    3. An experimental debug precision tool can help automatically find layers to run with high precision.

  3. Use the Editable Timing Cache to select a proper tactic.

    1. When accuracy changes between two built engines for the same model, it might be due to a bad tactic being selected for a layer.

    2. Use Editable Timing Cache to dump available tactics. Update the cache with a proper one.
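
As a rough illustration of the layer-output validation in step 1, commands along these lines can be used (the model path is a placeholder, and exact flags may vary between Polygraphy versions):

polygraphy run model.onnx --trt --onnxrt
polygraphy run model.onnx --trt --validate

The first command compares TensorRT outputs against ONNX Runtime golden values; the second checks the TensorRT outputs for NaNs and Infs.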

Accuracy from run-to-run variation should not change; once the engine is built for a specific GPU, it should result in bit-accurate outputs in multiple runs. If not, file a TensorRT bug.

Optimizing Builder Performance#

The TensorRT builder profiles each layer’s available tactics to search for the fastest inference engine plan. The builder time can be long if the model has many layers or complicated topology. The following sections provide options to reduce builder time.

Timing Cache#

TensorRT creates a layer-timing cache to reduce builder time and keep the layer-profiling information. The information it contains is specific to the targeted device, CUDA, TensorRT versions, and BuilderConfig parameters that can change the layer implementation, such as BuilderFlag::kTF32 or BuilderFlag::kREFIT.

The TensorRT builder skips profiling and reuses the cached result for the repeated layers if other layers have the same IO tensor configuration and layer parameters. If a timing query misses in the cache, the builder times the layer and updates the cache.

The timing cache can be serialized and deserialized. You can load a serialized cache from a buffer using IBuilderConfig::createTimingCache:

ITimingCache* cache =
 config->createTimingCache(cacheFile.data(), cacheFile.size());

Setting the buffer size to 0 creates a new empty timing cache.

You then attach the cache to a builder configuration before building.

config->setTimingCache(*cache, false);

Due to cache misses, the timing cache can be augmented with more information during the build. After the build, it can be serialized for use with another builder.

IHostMemory* serializedCache = cache->serialize();

If a builder does not have a timing cache attached, it creates its temporary local cache and destroys it when it is done.

The compilation cache is part of the timing cache, which caches JIT-compiled code and will be serialized as part of the timing cache by default. It can be disabled by setting the BuildFlag.

config->setFlag(BuilderFlag::kDISABLE_COMPILATION_CACHE);

Note

The timing cache supports the most frequently used layer types: Convolution, Deconvolution, Pooling, SoftMax, MatrixMultiply, ElementWise, Shuffle, and tensor memory layout conversion. More layer types will be added in future releases.

Builder Optimization Level#

Set the optimization level in the builder config to adjust how long TensorRT should spend searching for tactics with potentially better performance. By default, the optimization level is 3. Setting it to a smaller value results in much faster engine building time, but the engine’s performance may be worse. On the other hand, setting it to a larger value will increase the engine building time, but the resulting engine may perform better if TensorRT can find better tactics.

For example, to set the optimization level to 0 (the fastest):

config->setOptimizationLevel(0);
config.optimization_level = 0