TensorRT-LLM × PyTorch: A New Development Paradigm for High-Performance LLM Inference

Xue Boyang (薛博阳), Accelerated Computing Expert Team

Agenda
Introduction
Quick Start
Overview
LLM API Comparison
PyTorch Workflow Architecture in Detail
Code Structure
PyTorch-based Modeling
Modularized Python Runtime
Performance
Conclusion
Call for Action
Community and Resources

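Introduction

About TensorRT-LLM: TensorRT-LLM gives users an easy-to-use Python API for defining large language models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also includes Python and C++ components that orchestrate inference execution with high performance.
TensorRT-LLM v0.17 introduced a new PyTorch backend.
The new backend aims to address the usability problems of the existing TensorRT backend while still delivering top-tier performance. Its features include:
- PyTorch-based model implementation and execution
- A modularized Python runtime
- An LLM API compatible with HuggingFace checkpoints
- An OpenAI-compatible server (trtllm-serve)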

Quick Start

LLM API

Users can try the PyTorch workflow with a few lines of code.

# Import LLM API
from tensorrt_llm._torch import LLM

# Create an LLM object
llm = LLM(model="./Llama-3.1-8B-Instruct")

# Prepare prompts
prompts = [
    "Hi, pls tell me something about reasoning model",
    "Hi, pls tell me something about TensorRT-LLM"
]

# Generate output
output = llm.generate(prompts)

PyTorch workflow

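* The design of the LLM API is inspired by the vLLM team.

More examples and arguments:
More examples with additional arguments can be found in examples/pytorch/quickstart_advanced.py.

LLMArgs: model path, tokenizer, tensor parallel size, quantization, etc.
- All arguments can be passed directly to LLM() as kwargs.
PyTorchConfig: additional PyTorch-specific configuration, such as CUDA graph and backend selection.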

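Overview

The architecture diagram shows the overall structure of the TensorRT workflow and the new PyTorch workflow in TensorRT-LLM.

The two workflows share the same upper-level Serving and API interfaces but differ at the Runtime and Modeling layers.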

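LLM API Comparison

Version 1: Traditional TensorRT workflow

This workflow requires manual model conversion and engine building. New models are defined through Python wrappers, and inference runs on the C++ runtime.

Version 2: LLM API TensorRT workflow

A single-step workflow based on the LLM API, with TensorRT underneath. Python wrappers simplify building new models, and execution uses the C++ runtime through Python bindings.

Version 3: LLM API PyTorch workflow

A single-step workflow based on the LLM API, with PyTorch underneath. New models are built with a PyTorch-based modeling API and executed through a modularized runtime that reuses C++ runtime components.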

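PyTorch Workflow Architecture in Detail

The PyTorch workflow focuses on ease of use and flexibility. Its path is the highlighted part of the architecture diagram:

The flow starts at the LLM (Torch) API and goes through PyExecutor and the modularized runtime interface, which schedule the PyTorch Engine. The model layer (torch.nn.Module) can use PyTorch native ops and custom ops, and can reuse the underlying TRT-LLM kernels.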

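Code Structure

The TensorRT-LLM code base is divided into clearly separated functional modules.

API

llm.py subclasses the LLM API and is the user entry point of the PyTorch workflow.

Runtime

The pyexecutor/ directory contains the Python runtime implementation.

Modeling

Model definition code is spread across several modules:
* attention_backend/: implements several attention backends, such as Vanilla, flashinfer, TRT-LLM, and StarAttention.
* models/: implements models with PyTorch modules.
* modules/: basic PyTorch building blocks such as Linear, Norm, Attention, MLP, and MoE.

Full code structure overview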

tensorrt_llm/
- _torch/: all code for the PyTorch workflow.
  - __init__.py: Python module initialization.
  - attention_backend/: Vanilla, flashinfer, TRT-LLM, StarAttention.
  - compilation/: torch.compile related code.
  - custom_op/: custom op registration.
  - distributed/: allreduce, allgather, reducescatter.
  - llm.py: subclasses the LLM API.
  - metadata.py: currently used for KVCache metadata.
  - model_config.py: pretrained_config, device mapping, quant_config, attn_backend, moe_backend.
  - models/: models implemented with PyTorch modules.
  - modules/: PyTorch modules: Linear, Norm, Attention, MLP, MoE, etc.
  - peft/: LoRA support.
  - pyexecutor/: Python runtime.
  - speculative/: eagle3, mtp, ...

PyTorch-based Modeling

The architecture diagram shows where PyTorch-based modeling sits in the overall system. It is the bottom layer, responsible for model definition and execution, and it interacts with the Python runtime, the C++ runtime, and the serving layer (such as Triton Inference Server) above it.

The stack is organized into the following layers:

Serving: Triton Inference Server or OpenAI Server (trtllm-serve)
API: Dynamo
Runtime: LLM (Torch), GenerationExecutor
Python Runtime: PyExecutor, Modularized Runtime Interface, Runtime Impl / Binding Wrappers
C++ Runtime: Scheduler, KV Cache Manager
Modeling: PyTorch Engine, torch.nn.Module, PyTorch Custom Ops (combined with PyTorch Native Ops), and TRT-LLM Kernels. This part makes up the PyTorch workflow.

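Developing Models with PyTorch

Previously, TensorRT-LLM provided a PyTorch-like API for developing models on top of TensorRT.
- tensorrt_llm.Module corresponds to torch.nn.Module
- tensorrt_llm.functional corresponds to torch.nn.functional
- tensorrt_llm.Tensor corresponds to torch.Tensor
- TRT Plugins correspond to torch.ops.trtllm
This made model development simpler than using the native TensorRT API.
However, the development experience still falls short of native PyTorch:
- PyTorch supports eager-mode execution.
- The PyTorch Tensor class is more flexible and powerful.
- Extending PyTorch ops is much easier than extending TensorRT plugins (especially for developers familiar with PyTorch).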

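Adding a New Model

Model hierarchy

The model definition closely resembles the HuggingFace Transformers library.
The architecture at each level of the hierarchy can be customized in tensorrt_llm/_torch/models/modeling_XXX.

The model hierarchy is as follows.

From the outside in: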
1. PyTorchModelEngine
2. DecoderModelForCausalLM
3. LMHead
4. DecoderModel
* Embedding
* RMSNorm
* DecoderLayer x N
* RMSNorm
* Attention
* MLP

Documentation: https://nvidia.github.io/TensorRT-LLM/torch/adding_new_model.html

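Tensors

How data is represented in a neural network

TensorRT tensors are only proxies that define the computation graph.
- TensorRT tensors cannot be manipulated directly.
- For example, you cannot print their values or modify them in place.
- They serve only as input and output nodes of TensorRT layers.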

input_tensor: tensorrt_llm.Tensor
# Slicing (results in a new Tensor)
sliced_tensor = slice(input_tensor, starts=[1, 0], sizes=[2, 2])
# Indexing
indices = constant(np.array([0, 2], dtype=np.int32))
gathered_tensor = gather(input_tensor, dim=0, indices=indices)
# Boolean masking
mask = gt(input_tensor, 5)
masked_tensor = masked_select(input_tensor, mask)
# Unary Op
abs_tensor = input_tensor.abs() # Does not support abs()

* Every operation must produce new tensors.
* Efficient execution relies on graph optimization.

PyTorch tensors are materialized with real values during model execution.
- They support "Pythonic" operations:

input_tensor: torch.Tensor
# Slicing (creates a view, materialized when needed)
sliced_tensor = input_tensor[1:3, 0:2]
# Indexing
indexed_tensor = input_tensor[[0, 2]]
# Boolean masking
masked_tensor = input_tensor[input_tensor > 5]
# Unary Op generating a new Tensor
abs_tensor = input_tensor.abs()
# Unary Op with in-place modification
abs_tensor = input_tensor.abs_()

* Views of tensors are created where possible, and new tensors are materialized only when needed.
* Imperative programming feels more natural.

Functionals

Built-in operations in TensorRT and PyTorch

tensorrt_llm.functional implements the functions most commonly used in LLM inference.
- It uses out-of-the-box (OOTB) TensorRT operations or TensorRT plugins.
- Calling a built-in operation is tedious:
  - Get the network
  - Add a layer
  - Add parameters
  - Get the output tensor(s)

def softmax(input: Tensor, dim: Optional[int] = None) -> Tensor:
    axes = dim_to_trt_axes(dim)
    layer = default_trtnet().add_softmax(input.trt_tensor)
    layer.axes = axes
    return _create_tensor(layer.get_output(0), layer)

torch.nn.functional provides a wide range of operations.
- The API is well designed: just pass in tensors and other arguments.

def softmax(input: torch.Tensor, dim: Optional[int] = None) -> torch.Tensor:
    return F.softmax(input, dim=dim)

Using a Custom Kernel

Implementing a TensorRT plugin / PyTorch op

TensorRT plugin: create a plugin class.
- class Fp4GemmPlugin : public BasePlugin
- There is a lot of boilerplate (10+ member functions) to fill in.
  - Fp4GemmPlugin(...)
  - Fp4GemmPlugin(const void*, size_t)
  - ~Fp4GemmPlugin()
  - clone()
  - getOutputDimensions(...)
  - supportsFormatCombination(...)
  - configurePlugin(...)
  - getWorkspaceSize(...)
  - enqueue(...)
  - getOutputDataType(...)
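- The kernel call goes in enqueue.
- Be careful when handling pointers.
- (See code snippet 1 on slide page 22.)
Torch op: just a C++ function with a Python binding.
- TORCH_LIBRARY_IMPL(trtllm, CUDA, m)
- m.impl("fp4_gemm", &torch_ext::fp4_gemm);
- at::Tensor fp4_gemm(at::Tensor const& mat1, ...)
- Torch tensors are passed in as arguments and carry ample information such as shape, dtype, and data_ptr.
- Allocate tensors, e.g., the output tensor.
- Check types.
- Launch the kernel.
- (See code snippet 2 on slide page 22.)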

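Calling a TensorRT plugin / PyTorch op

TensorRT plugin
- Requires a lot of boilerplate code.
- (See code snippet 1 on slide page 23.)
PyTorch op
- Just call the op.
- (See code snippet 2 on slide page 23.)
Implementing a unit test for a TensorRT plugin is painful.
- Create the input tensors, define a small network, capture the output, and copy the data back to the CPU.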

Modules

Building blocks of a model

TensorRT-LLM aims to bring the development experience to a level similar to PyTorch.
A Module defines two main methods:
- __init__
- forward
tensorrt_llm.layers

class RmsNorm(Module):
    def __init__(self, ...):
        # ...
        if self.elementwise_affine:
            self.weight = Parameter(shape=self.normalized_shape, dtype=dtype)
        else:
            self.register_parameter('weight', None)
        
        self.eps = eps
        self.dtype = dtype

    def forward(self, x, ...):
        weight = None if self.weight is None else self.weight.value
        # normalized_shape is assumed to be one of the elided forward() arguments
        if normalized_shape is None:
            normalized_shape = self.normalized_shape
        return rms_norm(x, normalized_shape, self.num_groups, weight, self.eps)

tensorrt_llm._torch.modules

class RmsNorm(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size, dtype=dtype))
        self.variance_epsilon = eps

    def forward(self, 
                hidden_states: torch.Tensor,
                residual: Optional[torch.Tensor] = None
               ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        if IS_FLASHINFER_AVAILABLE:
            from ...custom_op import flashinfer_fused_add_rmsnorm
            # ...
            if residual is not None:
                flashinfer_fused_add_rmsnorm(hidden_states, residual,
                                             self.weight, self.variance_epsilon)
            return hidden_states, residual
        # ...

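Debugging Experience

PyTorch's eager-mode execution makes debugging much easier.
Consider debugging the value of a tensor, for example the output of an MLP.
In a TensorRT module:
- Register the tensor as a network output.
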
output = self.proj(inter)

self.register_network_output('mlp_output', output)

- Build the TensorRT engine with the debug flag.

trtllm-build ... --enable_debug_output

- Enable debug mode when creating the runner.

runner_kwargs = dict(debug_mode=True, ) # ...
model_runner = ModelRunner.from_dir(**runner_kwargs)

- Capture the output tensor in the runtime.

if self.debug_mode:
    #...
    print(self.debug_buffer['transformer.layers.0.mlp_output'])

In a PyTorch module:
- Tensors can be printed directly.

output = self.down_proj(inter)
print(output.shape, output[0])

- Breakpoints can be used in an IDE.

Built-in Modules

tensorrt_llm._torch.modules contains the common building blocks for LLMs, including:
- Linear
  - TP (tensor parallelism)
  - Quantized GEMM (currently FP8 and NVFP4)
  - Weight loading (with GEMM fusion support)
- RMSNorm
  - Fused residual add and RMSNorm (from flashinfer)
- MLP
  - gate_up_proj
  - Fused silu_and_mul (from flashinfer)
- FusedMoE
  - EP (expert parallelism)
  - MoE layers
- Attention
  - qkv_proj and o_proj
  - Attention with swappable backends (trtllm, flashinfer, etc.)
- DecoderLayer
  - Norm + Attention + MLP

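Linear

A unified linear module for GEMM

TensorRT workflow: inheritance
- Column/Row Linear and quantization are implemented in subclasses.
- Module replacement must be performed after init.
- (See the code on slide page 27.)
PyTorch workflow: composition
- There is only one class.
- load_weights handles TP, GEMM fusion, quantization, etc.
- (See the PyTorch code on slide page 27.)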

Linear: Weight Loading

Linear.load_weights handles weight loading in a unified way.
Its arguments are the weight tensors the module needs.
- For example, model.layers.0.self_attn.qkv_proj needs:
  - model.layers.0.self_attn.q_proj.weight
  - model.layers.0.self_attn.k_proj.weight
  - model.layers.0.self_attn.v_proj.weight
When TP is enabled, only the shard relevant to the current rank is loaded.
- For example, weight.shape=[4096, 4096], tp_size=4, tp_rank=1, tp_mode=COLUMN
- The loaded weight is weight[1024:2048, :] (note: the first dimension is out_features in PyTorch)
The required q, k, and v shards are fused and copied into the Parameter.
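
The shard arithmetic in the example above can be sketched in plain Python. This is a minimal illustration, not the Linear.load_weights implementation: the helper name shard_slices and the string values "COLUMN" and "ROW" are assumptions made for the sketch.

```python
from typing import Tuple


def shard_slices(shape: Tuple[int, int], tp_size: int, tp_rank: int,
                 tp_mode: str) -> Tuple[slice, slice]:
    """Return the (row, column) slices of a [out_features, in_features]
    weight that one tensor-parallel rank would load.

    Hypothetical helper: "COLUMN" splits out_features (dim 0),
    "ROW" splits in_features (dim 1).
    """
    if tp_mode not in ("COLUMN", "ROW"):
        raise ValueError(f"unknown tp_mode: {tp_mode}")
    out_features, in_features = shape
    split_dim_size = out_features if tp_mode == "COLUMN" else in_features
    if split_dim_size % tp_size != 0:
        raise ValueError("split dimension must be divisible by tp_size")
    shard = split_dim_size // tp_size
    owned = slice(tp_rank * shard, (tp_rank + 1) * shard)
    if tp_mode == "COLUMN":
        return owned, slice(None)
    return slice(None), owned


rows, cols = shard_slices((4096, 4096), tp_size=4, tp_rank=1, tp_mode="COLUMN")
print(rows, cols)  # slice(1024, 2048, None) slice(None, None, None)
```

With a real tensor, weight[rows, cols] would select the same region as weight[1024:2048, :] in the example.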

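Extending a Module: FP4 Support in Linear

FP4: E2M1
- 15 possible values: 0, ±0.5, ±1.0, ±1.5, ±2.0, ±3.0, ±4.0, ±6.0
Double quantization
- Every 16 FP4 E2M1 values share one FP8 E4M3 per-block scaling factor
- The per-block scaling factors share one FP32 per-tensor scaling factor
Quantization is supported by ModelOpt
- In an FP4 checkpoint, each linear module contains:
  - FP4 weights
  - Per-block SFs (scaling factors) and one per-tensor SF
  - One per-tensor SF for the input activation
Activation quantization is dynamic
- FP4 quantize: given an FP16/BF16 input and a per-tensor SF, return the FP4-quantized output and per-block SFs
Compared with FP8
- About 2x memory savings (4.5 bits per value: 4 bits plus an 8-bit SF shared by 16 values)
- 2x Tensor Core performance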
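
The value list and the 4.5-bit figure above can be checked with a few lines of Python. The sketch decodes all sixteen 4-bit codes under the usual E2M1 layout (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit); the function name decode_e2m1 is made up for illustration, and the FP32 per-tensor factor is ignored in the bit count because it is amortized over the whole tensor.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit."""
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        # Subnormal range: 0 or 0.5
        return sign * mantissa * 0.5
    return sign * (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)


# +0 and -0 compare equal, so 16 codes collapse to 15 distinct values.
values = sorted({decode_e2m1(code) for code in range(16)})
positive_values = [v for v in values if v > 0]

# 4 bits per value plus one 8-bit FP8 scaling factor shared by 16 values.
bits_per_value = 4 + 8 / 16
savings_vs_fp8 = 8 / bits_per_value

print(len(values), positive_values)
print(bits_per_value, round(savings_vs_fp8, 2))
```

The ratio comes out slightly below 2, which is why the slide says "about 2x".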

Declare the module's parameters in _create_weights (called by load_weights).

class Linear(nn.Module):
    def _create_weights(self):
        # Quantized weights
        self.weight = Parameter(
            torch.empty([self.out_features, self.in_features // 2],
                        dtype=fp4_utils.float4_e2m1x2,
                        device=device),
            requires_grad=False)

        # FP8 per-block scaling factors
        self.weight_scale = ...  # elided on the slide

        # FP32 per-tensor global scaling factor = 448*6 / amax_input
        self.input_scale = ...  # elided on the slide
        self.inv_input_scale = ...  # elided on the slide

        # (amax_input*amax_weight) / (448*6*448*6)
        self.alpha = ...  # elided on the slide

        self.profiler = torch.classes.trtllm.FP4GemmRunner.get_instance(
            self.dtype)
        self.needs_profiling = True

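Loading the weight scaling factors correctly
- For example, for an FP4 model, model.layers.0.self_attn.qkv_proj also receives:
  - model.layers.0.self_attn.q_proj.weight_scale
  - model.layers.0.self_attn.q_proj.weight_scale_2
  - model.layers.0.self_attn.q_proj.input_scale
- The same applies to k_proj and v_proj.

Block SFs are concatenated and swizzled as required by the GEMM.
Global SFs: the global SFs of Q, K, and V are made identical in ModelOpt.

Code example: loading and processing FP4 weight scaling factors
The code snippet (page 31) shows the load_weight_scales_nvfp4 function, which loads the weights and swizzles the weight scaling factors after concatenation.

Implementing apply_linear
- Dynamic activation quantization
- Takes a static global SF and BF16 activations.
- Returns FP4-quantized activations and FP8 block SFs.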
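
To make the dynamic quantization step more concrete, here is a simplified pure-Python sketch of two-level scaling: a per-tensor global SF (448*6 / amax, as in the comment above) plus a per-block SF, followed by rounding to the E2M1 grid. It is an illustration under stated assumptions, not the fp4_quantize kernel: the function names are hypothetical, the block SF is kept as a Python float rather than rounded to FP8 E4M3, and no swizzling is done.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative FP4 values
FP4_MAX = 6.0


def fp4_quantize_block(block, global_sf):
    """Quantize one block of up to 16 floats.

    global_sf is the per-tensor factor 448*6 / amax_tensor.
    Returns (values on the signed E2M1 grid, per-block scaling factor).
    """
    amax_block = max(abs(v) for v in block)
    if amax_block == 0.0:
        return [0.0] * len(block), 0.0
    # At most 448 whenever amax_block <= amax_tensor, so it fits the FP8 E4M3 range.
    block_sf = amax_block * global_sf / FP4_MAX
    quantized = []
    for v in block:
        scaled = abs(v) * global_sf / block_sf  # now within [0, 6]
        nearest = min(E2M1_GRID, key=lambda g: abs(g - scaled))
        quantized.append(math.copysign(nearest, v))
    return quantized, block_sf


def fp4_dequantize_block(quantized, block_sf, global_sf):
    """Invert the scaling: value = fp4 * block_sf / global_sf."""
    return [q * block_sf / global_sf for q in quantized]


amax_tensor = 8.0
global_sf = 448.0 * 6.0 / amax_tensor
block = [8.0, -4.0, 2.0, 0.0] + [0.0] * 12
q, sf = fp4_quantize_block(block, global_sf)
restored = fp4_dequantize_block(q, sf, global_sf)
print(q[:4], sf, restored[:4])
```

The inputs here were chosen to land exactly on the grid, so the round trip is lossless; arbitrary inputs would show rounding error.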

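Selecting the best GEMM configuration: based on profiling results.
Launch the GEMM with the following inputs:
- FP4 weights
- FP8 per-block weight SFs
- FP32 global weight SF
- FP32 alpha (the product of the activation and weight global dequantization scales, i.e. (amax_input*amax_weight) / (448*6*448*6))

Code example: implementing apply_linear with FP4 support
The code snippet (page 32) shows the implementation of apply_linear. The function first runs profiling, then quantizes the activations with torch.ops.trtllm.fp4_quantize, and finally calls run_gemm to perform the GEMM.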

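Attention Backends

Users can replace the attention implementation by implementing the AttentionBackend interface.
- Input: the projected QKV tensors and batch metadata.
- Operation: update the KV cache, run paged attention, and return the output tensor.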

class AttentionBackend(Generic[TMetadata]):
    def forward(self,
                q: torch.Tensor,
                k: Optional[torch.Tensor],
                v: Optional[torch.Tensor],
                metadata: TMetadata,
                *,
                attention_mask: AttentionMask = PredefinedAttentionMask.CAUSAL,
                **kwargs) -> torch.Tensor:
        ...  # implemented by each backend

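Available backends:
- trtllm: uses proprietary kernels; gptAttentionPlugin is wrapped as a PyTorch op and delivers the best performance.
- vanilla: uses torch SDPA (scaled dot-product attention); simple but slower.
- FlashInfer: optimized with FlashAttention, RoPE fusion, etc.
- StarFlashInfer: a customized backend built on FlashInfer (invented by NV Research).
Backends can be switched through PyTorchConfig.attn_backend.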

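Adding a New Model: Qwen3 as an Example

Use the Embedding, FusedMoE, and Linear modules from tensorrt_llm._torch.modules directly.
Inherit from the standard Attention module.
Initialize the superclass with parameters from Qwen3MoeConfig in Hugging Face transformers.

Code example: initializing the Qwen3 model and Attention module
The code snippet (page 34) shows the __init__ methods of the Qwen3Moe and Qwen3Attention classes and demonstrates how the model configuration is used to initialize the model components.

Qwen3MoeDecoderLayer: composed of Norm + Attention + MLP.
Qwen3MoeModel: composed of Embedding + Decoder Layers.

Code example: forward methods of the Qwen3 decoder layer and model
The code snippet (page 35) shows the forward methods of Qwen3MoeDecoderLayer and Qwen3MoeModel, illustrating how data flows between the model levels.

DecoderModelForCausalLM: the final model class, composed of a DecoderModel and a LogitsProcessor (that is, the LMHead, possibly with post-processing).

Code example: __init__ and forward methods of the Qwen3 causal language model
The code snippet (page 36) shows the implementation of Qwen3MoeForCausalLM, which combines Qwen3MoeModel and LogitsProcessor for causal language modeling.

Model Weight Loading

DecoderModelForCausalLM.load_weights is the entry point for loading Hugging Face checkpoint weights into PyTorch modules.
All differences in weight names or formats are handled here.
A default implementation covers the common cases.
- It recursively calls load_weights on every Linear module.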

class DecoderModelForCausalLM(nn.Module, ...):
    def load_weights(self, ...):
        # Fused module name -> Hugging Face sub-module names whose weights it consumes
        params_map = {
            'qkv_proj': ['q_proj', 'k_proj', 'v_proj'],
            'gate_up_proj': ['gate_proj', 'up_proj']
        }

However, if the default implementation does not fit a new model architecture, the method can be overridden.
- For example, Qwen3MoeForCausalLM overrides it.
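
As a rough illustration of how a params_map like the one above can drive weight lookup, the sketch below expands a fused module name into the Hugging Face checkpoint keys it would need. The helper required_weight_keys is hypothetical and only mirrors the naming convention shown in this section; the real load_weights does considerably more (sharding, fusion, scaling factors).

```python
PARAMS_MAP = {
    'qkv_proj': ['q_proj', 'k_proj', 'v_proj'],
    'gate_up_proj': ['gate_proj', 'up_proj'],
}


def required_weight_keys(module_name, params_map=PARAMS_MAP, suffix="weight"):
    """Expand a (possibly fused) module name into checkpoint keys.

    Hypothetical helper: the last path component is looked up in params_map;
    unfused modules map to themselves.
    """
    prefix, _, leaf = module_name.rpartition(".")
    base = prefix + "." if prefix else ""
    sources = params_map.get(leaf, [leaf])
    return [f"{base}{source}.{suffix}" for source in sources]


print(required_weight_keys("model.layers.0.self_attn.qkv_proj"))
print(required_weight_keys("model.layers.0.mlp.down_proj"))
```

Passing suffix="weight_scale" or suffix="input_scale" yields the extra keys an FP4 checkpoint would supply for the same fused module.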

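Greatly Simplified Modeling Code: Qwen as an Example

tensorrt_llm/models/qwen/model.py: ~500 lines of code (LOC)
- QwenDecoderLayer: ~100 LOC
- QwenModel: ~80 LOC
- QwenForCausalLM: ~250 LOC
- TensorRT-LLM-to-HF key conversion, quantization (module replacement), and other features make the code longer.
tensorrt_llm/models/qwen/convert.py: ~1200 LOC
- Legacy weight-sharing code that predates the unified converter.
- SmoothQuant / int8 KV cache quantization.
- Handling of special formats such as GPTQ.
examples/models/core/qwen/convert_checkpoint.py: ~300 LOC
- The script that runs checkpoint conversion.
- Uses functions from convert.py.
- Many command-line arguments.

tensorrt_llm/_torch/models/modeling_qwen3_moe.py: ~350 LOC
- QwenAttention: ~30 LOC
- QwenDecoderLayer: ~60 LOC
- QwenModel: ~50 LOC
- QwenForCausalLM: ~15 LOC
No extra checkpoint conversion code
- The model definition matches the keys in the HF safetensors weight dict, so no extra translation is needed.
- Weight sharding is handled by Linear.
- Post-training quantization is delegated to ModelOpt; only the quantized HF checkpoint needs to be loaded.
- This decouples the model definition from the loading process.

Note: this is not a strict apples-to-apples comparison. For example, the 1200-line convert.py contains legacy code that the TensorRT workflow no longer uses. The comparison is only meant to show that the PyTorch workflow's code base is cleaner.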

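Modularized Python Runtime

The architecture diagram shows where the runtime sits in the overall system, highlighting the Python runtime and its interaction with C++ components.

PyTorch modeling architecture diagram
The diagram (page 41) shows the architecture of the whole system. Requests enter from Triton Inference Server or the OpenAI server, possibly through Dynamo, and eventually reach the PyTorch-based LLM. The system is divided into a serving layer, an API layer, a runtime layer, and a modeling layer. The runtime layer contains GenerationExecutor and PyExecutor, implemented in Python, which interact with the scheduler and KV cache manager implemented in C++. The modeling layer is based on torch.nn.Module and uses PyTorch native and custom ops, which call TRT-LLM kernels underneath.

Python Runtime Overview

Python runtime overview diagram
The diagram (page 42) details the component structure of the Python runtime. PyExecutor is the top-level component. The modularized Python runtime interface defines core abstractions such as ModelEngine, RequestScheduler, BaseResourceManager, and Decoder. The Python runtime implements these interfaces, for example with PyTorchModelEngine and SimpleScheduler. Underneath, it calls components implemented in C++, such as the scheduler and KV cache manager in tensorrt_llm::batch_manager.

Modularized Runtime Modules

Many runtime modules are customizable.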
ResourceManager
- KVCache Manager
Scheduler
- Capacity Scheduler
  - requests -> fitting_requests, paused_requests
- MicroBatchScheduler
  - fitting requests -> context_requests, generation_requests
Many modules currently use C++ runtime components through Python bindings.
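
The two scheduling stages listed above can be pictured with a toy Python sketch. Everything here is hypothetical (the Request fields, the token-budget capacity rule, and the function names); the actual schedulers are the C++ components exposed through bindings, and their policies are more involved.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    req_id: int
    num_tokens: int   # hypothetical cost: KV-cache tokens the request needs
    is_context: bool  # True while the request is still in its context phase


def capacity_schedule(requests: List[Request],
                      max_tokens: int) -> Tuple[List[Request], List[Request]]:
    """requests -> fitting_requests, paused_requests (greedy token budget)."""
    fitting, paused, used = [], [], 0
    for request in requests:
        if used + request.num_tokens <= max_tokens:
            fitting.append(request)
            used += request.num_tokens
        else:
            paused.append(request)
    return fitting, paused


def micro_batch_schedule(
        fitting: List[Request]) -> Tuple[List[Request], List[Request]]:
    """fitting requests -> context_requests, generation_requests."""
    context = [r for r in fitting if r.is_context]
    generation = [r for r in fitting if not r.is_context]
    return context, generation


requests = [
    Request(0, 100, True),
    Request(1, 50, False),
    Request(2, 200, True),
    Request(3, 30, False),
]
fitting, paused = capacity_schedule(requests, max_tokens=200)
context, generation = micro_batch_schedule(fitting)
print([r.req_id for r in fitting], [r.req_id for r in paused])
print([r.req_id for r in context], [r.req_id for r in generation])
```

Keeping the two stages as separate functions reflects the modular design: either stage could be swapped for a different policy without touching the other.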

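Executor Loop

_executor_loop
- Prepare
  - Fetch new requests, schedule them, and batch context/generation requests for the next step.
  - Prepare resources (KV cache) for the scheduled batch.
- Launch
  - Run ModelEngine for one step.
- Sample (Sampler)
- Process
  - Handle responses.
_executor_loop_overlap
- Prepare/launch/decode batch N, then process batch N-1.
- Overlapping hides the host overhead of preparation and processing.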
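
The ordering of the overlapped loop can be sketched as a small Python function that only records the sequence of steps. It is a schematic of the control flow described above, not the PyExecutor code: the function name and the string events are invented for illustration, and the real loop launches GPU work asynchronously so that host-side processing of batch N-1 runs while batch N is still executing.

```python
def executor_loop_overlap_order(batches):
    """Return the order of host-side steps for an overlapped executor loop."""
    events = []
    previous = None
    for batch in batches:
        events.append(f"prepare {batch}")
        events.append(f"launch {batch}")
        events.append(f"sample {batch}")
        if previous is not None:
            # Host work for the previous batch, hidden behind GPU work of `batch`.
            events.append(f"process {previous}")
        previous = batch
    if previous is not None:
        # Drain: the last batch still needs its responses handled.
        events.append(f"process {previous}")
    return events


for event in executor_loop_overlap_order(["batch0", "batch1", "batch2"]):
    print(event)
```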

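CPU/GPU overlapped execution diagram
The diagram (page 44) shows the CPU and GPU workflows in the executor loop. Overlapping CPU tasks (prepare, process) with GPU tasks (compute, sample) effectively hides CPU overhead and improves overall efficiency.

*Credit for the overlap scheduler idea goes to the SGLang team: https://lmsys.org/blog/2024-12-04-sglang-v0-4/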

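CUDA Graph

CUDA Graph is used to reduce CPU overhead.
It is used only for generation-only steps
- Because sequence lengths vary, tensor shapes are not static in context or mixed (context plus generation) steps.
- Kernel launch overhead is more significant in the generation phase.
ModelEngine decides whether to use CUDA Graph for the currently scheduled requests.
- The forward() method of DecoderModelForCausalLM is captured and replayed.
One CUDA Graph is captured for each predefined batch size [1, 2, 3, 4, ..., 32, 64, 128].
- Graphs are captured during warm-up.
- If the requested batch size does not match a predefined size, it is padded to the closest size when cuda_graph_padding_enabled is True.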
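
The batch-size lookup described above might look roughly like the following sketch. It assumes that "closest size" means the next larger captured size (padding can only add dummy requests) and that a batch with no suitable graph falls back to eager execution; the function name and the None return convention are hypothetical.

```python
import bisect

# Predefined capture sizes from the text: 1..32, then 64 and 128.
CUDA_GRAPH_BATCH_SIZES = list(range(1, 33)) + [64, 128]


def pick_graph_batch_size(batch_size, padding_enabled,
                          sizes=CUDA_GRAPH_BATCH_SIZES):
    """Return the captured batch size to replay, or None to run without a graph."""
    index = bisect.bisect_left(sizes, batch_size)
    if index < len(sizes) and sizes[index] == batch_size:
        return batch_size  # exact match: replay directly
    if padding_enabled and index < len(sizes):
        return sizes[index]  # pad up to the next captured size
    return None  # no usable graph for this batch


print(pick_graph_batch_size(5, padding_enabled=False))
print(pick_graph_batch_size(40, padding_enabled=False))
print(pick_graph_batch_size(40, padding_enabled=True))
print(pick_graph_batch_size(200, padding_enabled=True))
```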

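End-to-End Example on the PyTorch Workflow: DeepSeek R1 Performance Optimization

Faster debugging
- A native MLA (Multi-head Latent Attention) implementation was introduced to speed up accuracy validation during MTP (Multi-Token Prediction) development by isolating kernel accuracy issues.
- Replacing custom communication fusion patterns with native PyTorch layers speeds up problem isolation.
Faster enablement of new features
- Performance engineers, even those less familiar with the TensorRT-LLM core code, can contribute quickly, for example by enabling a basic version of Attention DP to improve throughput.
- Faster integration and validation of fusion patterns such as top-K, GEMM fusion, and MoE fusion.
Less time spent on tedious checkpoint conversion and engine building.
The foundation for SOTA DeepSeek R1 performance on NVIDIA GPUs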
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md
- https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/
More details on DeepSeek R1 performance optimization will be shared in future sessions.

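Performance

Qwen3-235B-A22B FP8

Performance chart: TRT-LLM 8xH20 1k-1k performance
The chart (page 39) shows performance on 8 H20 GPUs with input and output sequence lengths of 1k each. It compares "output throughput per GPU" against "output throughput per user" for two configurations, DP8EP8 (data parallel) and TP8EP8 (tensor parallel).

Commit: a4c3359513dae5694a2a01955abffb7702b004ab
*For technical discussion only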

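Conclusion

The PyTorch workflow brings ease of use to out-of-the-box (OOTB) users
- It removes checkpoint conversion, engine building, and similar steps.
- A model can be tried with just a few lines of code.
It is also friendlier to developers
- PyTorch-based model development and a Python-based runtime.
- The modular design makes it easier to extend.
Because new features can be added quickly, its performance is on par with, or even better than, the existing TRT workflow.
The TensorRT and PyTorch workflows will coexist for now
- The TensorRT workflow is more feature-complete.
- The PyTorch workflow is better suited to future development of new models and features.
Support for new models and features will be prioritized in the PyTorch workflow.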

Call for Action

Please try TensorRT-LLM by following the instructions, and leave your comments in GitHub issues:
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/torch.md
- https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md
Call for contributions (the TensorRT-LLM team will support you in getting your contribution merged into the GitHub repository):
- Feature parity for the PyTorch backend
  - https://github.com/NVIDIA/TensorRT-LLM/issues/3704
- Tasks for inference-time compute support in TensorRT-LLM
  - https://github.com/NVIDIA/TensorRT-LLM/issues/3706

Community and Resources

GitHub repository:
https://github.com/NVIDIA/TensorRT-LLM

Join the NVIDIA Developer Discord community