A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well

Authors/Affiliations: Qiyuan Zhang¹*, Fuyuan Lyu²*, Zexu Sun³†, Lei Wang⁵†, Weixu Zhang²†, Wenyue Hua⁸†, Haolun Wu²,⁷†, Zhihan Guo⁴†, Yufei Wang⁶‡, Niklas Muennighoff⁷, Irwin King⁴, Xue Liu², Chen Ma¹ (¹City University of Hong Kong, ²McGill University & MILA, ³Gaoling School of Artificial Intelligence, Renmin University of China, ⁴The Chinese University of Hong Kong, ⁵Salesforce AI Research, ⁶Macquarie University, ⁷Stanford University, ⁸University of California, Santa Barbara)
Link: https://testtimescaling.github.io/

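A1 Main Contributions

Core problem: As the performance gains from scaling pretraining compute (data and parameters) gradually slow down, research focus has shifted to how to fully elicit the intelligence encoded in large language models (LLMs) at test time, so as to maximize their effectiveness in real-world use. Although research in this area has surged recently, a comprehensive survey that provides a systematic understanding is still urgently needed.

Research goal: This paper aims to fill that gap by systematically organizing the research area of test-time scaling (TTS) through a unified, hierarchical framework. The framework is built around four core dimensions: what to scale (What), how to scale (How), where to scale (Where), and how well to scale (How well).

Main contributions:
1. A unified multidimensional taxonomy: The paper proposes a taxonomy with four axes: what to scale, how to scale, where to scale, and how well to scale. The taxonomy supports structured classification, comparison, and extensibility of TTS methods.
2. Systematic literature organization and practical analysis: Using the taxonomy, the paper surveys the TTS literature comprehensively, analyzes representative methods, and offers guidelines for research applications and deployment.
3. Challenges, insights, and future directions: From this structured perspective, the paper identifies key challenges, ranging from pushing scaling further to clarifying the nature of the underlying techniques, and outlines promising research directions that may shape future progress. The unified framework helps map these open problems onto specific dimensions of TTS, enabling more targeted and impactful advances.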

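A3 Background / Key Observations / Design Principles

"What to scale" refers to the specific form that is scaled or adjusted at inference time to improve LLM performance. Researchers typically pick a particular "what to scale" target based on an empirical hypothesis. For example, some hypothesize that longer chains of thought (CoTs) improve complex reasoning and therefore force the LLM to produce longer outputs. Others rely on the self-consistency principle, assuming that generating multiple solutions to a reasoning task raises the likelihood of reaching the correct answer.

Figure 2: Taxonomy of test-time scaling research, covering what to scale, how to scale, where to scale, and how well to scale.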

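2.1 Parallel Scaling

Basic idea: Parallel scaling improves test-time performance by generating multiple outputs in parallel and then aggregating them into a final answer. The approach aims to raise the probability of producing at least one correct answer (coverage) and relies on a high-quality aggregation function to identify it. Formally, for a problem set $P$ and a collection of models $m \in \{1, \ldots, M\}$, each model generates $k_m$ candidate responses for a problem $p \in P$, yielding a solution set $S$ from which the final answer $\hat{s}$ is obtained:

$$S = \{\, s_{m,i} \mid m \in \{1, \ldots, M\},\ i \in \{1, \ldots, k_m\} \,\}, \qquad \hat{s} = A(S)$$

where $A$ is the aggregation function. The effectiveness of this approach depends on both coverage and aggregation quality. Research in cognitive science [Ref: Stanovich and West, 2000, Advancing the rationality debate, Behavioral and Brain Sciences] suggests that complex problems often admit multiple valid solution paths, and increasing the number of generated responses raises the chance of finding a correct answer [Ref: Li et al., 2025d, S*: Test time scaling for code generation, arXiv].

Categories and techniques: We divide parallel scaling into two common forms: (1) repeated sampling from a single model; (2) sampling across multiple models. Further techniques improve the diversity and reliability of solutions, such as adjusting hyperparameters (e.g., sampling temperature [Ref: Renze, 2024, The effect of sampling temperature on problem solving in large language models, Findings of EMNLP]) and modifying the input (e.g., prompt rephrasing [Ref: Lambert et al., 2025, Tulu 3: Pushing frontiers in open language model post-training, arXiv]).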

2.2 Sequential Scaling

Basic idea: Sequential scaling explicitly steers later computation based on intermediate steps, building a solution gradually by iteratively updating an intermediate state. We denote the partial solution states by $n_1, n_2, \ldots, n_T$, where each new state $n_{t+1} = R(n_t, p)$ combines the previous state with the problem context. Because many complex tasks require deliberation rather than immediate pattern matching, single-pass "System 1" generation [Ref: Yu et al., 2024c, Distilling system 2 into system 1, The First Workshop on System-2 Reasoning at Scale] often fails on complex reasoning tasks. Iterative methods mimic "System 2" by decomposing and refining the solution step by step.

Development and applications: Early work such as chain-of-thought prompting [Ref: Wei et al., 2022, Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems] encourages the model to solve problems step by step, i.e., $n_{t+1} = \text{AppendStep}(n_t, \text{new reasoning step})$. This led to methods that refine a response [Ref: Madaan et al., 2023, Self-refine: Iterative refinement with self-feedback, Conference on Neural Information Processing Systems], i.e., $n_{t+1} = \text{Refine}(n_t)$, or that decompose the problem systematically [Ref: Zhou et al., 2023a, Least-to-most prompting enables complex reasoning in large language models, ICLR; Zelikman et al., 2022, Star: Bootstrapping reasoning with reasoning, Advances in Neural Information Processing Systems], i.e., $n_{t+1} = \text{IntegrateSub}(n_t, \text{solution to next subproblem})$. Later studies show that iterative revision [Ref: Chen et al., 2024h, Teaching large language models to self-debug, ICLR; Gou et al., 2024, CRITIC: Large language models can self-correct with tool-interactive critiquing, ICLR; Chen et al., 2025d, Iterative deepening sampling for large language models, arXiv; Snell et al., 2024, Scaling llm test-time compute optimally can be more effective than scaling model parameters, arXiv] can trigger self-correction and improve accuracy on challenging tasks. Purely sequential methods, however, may be only one part of a broader solution.

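2.3 Hybrid Scaling

Basic idea: Hybrid scaling exploits the complementary strengths of parallel and sequential scaling. Parallel scaling casts a wide net to lower the risk that the model misses the right line of thought, while sequential scaling allows deep exploration once a line of thought looks promising. Formally, let $F_t$ be the set of candidate solutions at iteration $t$. Each iteration expands these candidates in parallel with an expansion function $E$ and filters them sequentially with a selection function $S$:

$$F_{t+1} = S\Big(\bigcup_{s \in F_t} E(s)\Big)$$

After $T$ iterations, an aggregator $A$ selects the final solution $\hat{s}$ from $F_T$. This combination resembles human problem solving, which first produces several hypotheses (divergent thinking) and then refines and evaluates them (convergent thinking).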
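The expand-then-select loop described above can be illustrated with a short Python sketch. It is only a toy under stated assumptions: the function names (`hybrid_search`, `toy_expand`, `toy_select`, `toy_aggregate`) are hypothetical, candidates are plain strings, and a digit-sum score stands in for whatever verifier or heuristic a real system would use.

```python
from typing import Callable, List


def hybrid_search(
    initial: List[str],
    expand: Callable[[str], List[str]],
    select: Callable[[List[str]], List[str]],
    aggregate: Callable[[List[str]], str],
    num_iters: int,
) -> str:
    """Run T rounds of parallel expansion followed by filtering, then aggregate."""
    frontier = list(initial)
    for _ in range(num_iters):
        # Parallel step: expand every candidate in the current frontier (E).
        expanded = [child for s in frontier for child in expand(s)]
        # Sequential step: keep only the candidates worth refining further (S).
        frontier = select(expanded)
    # Final step: pick one solution from the last frontier (A).
    return aggregate(frontier)


# Toy instantiation (hypothetical): grow digit strings, prefer large digit sums.
def toy_expand(s: str) -> List[str]:
    return [s + d for d in "123"]


def toy_score(s: str) -> int:
    return sum(int(ch) for ch in s)


def toy_select(cands: List[str], beam_width: int = 2) -> List[str]:
    return sorted(cands, key=toy_score, reverse=True)[:beam_width]


def toy_aggregate(cands: List[str]) -> str:
    return max(cands, key=toy_score)


best = hybrid_search([""], toy_expand, toy_select, toy_aggregate, num_iters=3)
print(best)
```

With a beam width of 2 this behaves like a small beam search; widening the beam moves the procedure toward pure parallel scaling, while a beam of 1 reduces it to a greedy sequential chain.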

Related work: Classical search algorithms such as iterative deepening [Ref: Chen et al., 2025d, Iterative deepening sampling for large language models, arXiv] and beam search [Ref: Snell et al., 2024, Scaling llm test-time compute optimally can be more effective than scaling model parameters, arXiv] embody this strategy of balancing exploration and exploitation. More recent work such as Tree of Thoughts (ToT) [Ref: Yao et al., 2023b, Tree of thoughts: Deliberate problem solving with large language models, Conference on Neural Information Processing Systems] branches at decision points to explore multiple reasoning paths. Follow-up methods such as Graph-of-Thoughts [Ref: Besta et al., 2024, Graph of thoughts: Solving elaborate problems with large language models, AAAI Conference on Artificial Intelligence], Algorithm-of-Thought [Ref: Sel et al., 2024, Algorithm of thoughts: Enhancing exploration of ideas in large language models, ICML], Forest-of-Thought [Ref: Bi et al., 2024, Forest-of-thought: Scaling test-time compute for enhancing llm reasoning, arXiv], Monte Carlo Tree Search (MCTS) [Ref: Lin et al., 2025, Leveraging constrained monte carlo tree search to generate reliable long chain-of-thought for mathematical reasoning, arXiv], and multi-agent reasoning [Ref: Wang et al., 2025a, Mixture-of-agents enhances large language model capabilities, ICLR; Chen et al., 2024f, RouterDC: Query-based router by dual contrastive learning for assembling large language models, arXiv] use similar but more elaborate hybrid patterns.

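2.4 Internal Scaling

Basic idea: Internal scaling lets the model decide on its own how much computation to allocate to reasoning at test time, relying on the model's internal parameters rather than on externally designed guidance strategies. Formally, a training procedure $\Phi : (M_0, D) \rightarrow M_1$ updates an initial model $M_0$ on data $D$ containing multi-step reasoning tasks to obtain a new model $M_1$. Surprisingly, reinforcement learning (RL) with outcome-oriented reward modeling [Ref: DeepSeek-AI, 2025, Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, arXiv; OpenAI, 2024b, Openai o1 system card, arXiv] enables the model to extend its reasoning process autonomously.

Mechanism: At test time, $M_1$ generates a sequence of internal states $z_1, z_2, \ldots, z_T$ as follows:

$$z_1 = M_1(p), \qquad z_{t+1} = M_1(p, z_{1:t})$$

The model's learned policy $\pi_\theta$ controls when to stop. This internal feedback loop can produce emergent behaviors such as more detailed reasoning chains or self-evaluation steps, without any external prompting or multi-call orchestration. In practice, because internal scaling concentrates computation on a single coherent reasoning trajectory, its performance often matches or even exceeds that of standard techniques.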

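A2 Method Details

3.1 Tuning-Based Methods

Directly tuning a model's parameters is an effective way to activate its ability to spend compute at test time. There are two approaches: 1) Supervised finetuning (SFT): the LLM is trained with next-token prediction on synthetic or distilled long chains of thought (CoT), so that it imitates and internalizes structured reasoning patterns. 2) Reinforcement learning (RL): the policy model is updated automatically using feedback from a reward model on reasoning tasks.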

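3.1.1 Supervised Finetuning (SFT)

Training an LLM with next-token prediction on synthetic or distilled long CoT lets it internalize structured reasoning patterns and effectively "think" through complex problems. By imitating extended reasoning processes, SFT reduces the reliance on explicit prompting at inference time.

Imitation. An important way to strengthen LLM reasoning via SFT is to generate long CoT demonstrations with a test-time "planner" algorithm and then fine-tune the model to imitate them. For example, STaR [Ref: Zelikman et al., 2022, Star: Bootstrapping reasoning with reasoning, Advances in Neural Information Processing Systems] has the model itself generate step-by-step solutions for a given problem, filters for correct outcomes, and fine-tunes on the verified solutions as new demonstrations. More structured search has been used to produce higher-quality trajectories: ReST-MCTS* [Ref: Zhang et al., 2024a, ReST-MCTS*: LLM self-training via process reward guided tree search, NeurIPS] integrates an MCTS planner (guided by a learned value model) to explore the space of possible reasoning steps; the model is then fine-tuned on these search-generated trajectories, that is, it learns to imitate the successful reasoning trajectories found by the planner.

Distillation. Whereas imitation methods improve a model using its own intermediate outputs, distillation aims to transfer the capabilities of a stronger model (or an ensemble of models) into the target model through supervised learning. As reported by Muennighoff et al. [Ref: Muennighoff et al., 2025, s1: Simple test-time scaling, arXiv] and Li et al. [Ref: Li et al., 2025e, Llms can easily learn to reason from demonstrations structure, not content, is what matters!, arXiv], a 32B model trained on a curated set of samples generated by a top-tier reasoner solves competition-level math problems almost as well as the teacher model, which indicates successful distillation of reasoning ability.

Warmup. SFT warmup [Ref: Luong et al., 2024, Reft: Reasoning with reinforced fine-tuning, arXiv] is an initial SFT stage applied after the LLM's unsupervised pretraining and before other post-training steps such as RL. This stage stabilizes subsequent training by providing a well-initialized model that adapts better to preference optimization and avoids instability caused by ungrounded behaviors [Ref: Zeng et al., 2025c, itool: Boosting tool use of large language models via iterative reinforced fine-tuning, arXiv]. Effective SFT warmup has several key ingredients: (i) a high-quality, task-relevant dataset [Ref: Luong et al., 2024]; (ii) a short duration; (iii) a tailored learning-rate schedule [Ref: Pareja et al., 2024, Unveiling the secret recipe: A guide for supervised fine-tuning small llms, arXiv]. Technically, SFT warmup is often combined with methods such as rejection sampling [Ref: Pareja et al., 2024], which uses the warmed-up model to generate high-quality data for further training.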

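3.1.2 Reinforcement Learning (RL)

Without a reward model. Recent advances in RL and preference optimization have markedly improved LLM performance, especially on reasoning and problem-solving tasks. A key innovation in this area is RL with verifiable rewards, introduced by DeepSeek R1 [Ref: DeepSeek-AI, 2025, Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, arXiv], which uses rule-based reward mechanisms to optimize the model efficiently and reliably. By providing dense feedback for policy optimization, this approach addresses challenges such as sparse rewards and training instability. Many methods have been developed to improve exploration and accuracy on reasoning tasks through preference optimization. For example, cDPO [Ref: Lin et al., 2024, Critical tokens matter: Token-level contrastive estimation enhances llm's reasoning capability, arXiv], CPL [Ref: Wang et al., 2024f, Cpl: Critical plan step learning boosts llm generalization in reasoning tasks, arXiv], Focused-DPO [Ref: Zhang et al., 2025b, Focused-dpo: Enhancing code generation through focused preference optimization on error-prone points, arXiv], DAPO [Ref: Liu et al., 2024b, Improving multi-step reasoning abilities of large language models with direct advantage policy optimization, arXiv], and RFTT [Ref: Zhang et al., 2025c, Reasoning with reinforced functional token tuning, arXiv] prioritize critical or error-prone regions, strengthening internal scaling and reasoning accuracy. In addition, Selective DPO [Ref: Gao et al., 2025b, Principled data selection for alignment: The hidden risks of difficult examples, arXiv] stresses the importance of aligning data difficulty with model capacity. VC-PPO [Ref: Yuan et al., 2025, What's behind ppo's collapse in long-cot? value optimization holds the secret, arXiv] investigates why PPO fails on long-CoT tasks and obtains better results with a pretrained value model. Light-R1 [Ref: Wen et al., 2025, Light-r1: Curriculum sft, dpo and rl for long cot from scratch and beyond, arXiv] proposes a curriculum learning framework that combines multi-stage post-training. SimPO [Ref: Meng et al., 2024, Simpo: Simple preference optimization with a reference-free reward, Advances in Neural Information Processing Systems] uses the average log-probability of a sequence as an implicit reward and removes the reference model used in DPO.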

Open-source frameworks. For mathematical problem solving, DQO [Ref: Ji et al., 2024, Enhancing multi-step reasoning abilities of language models through direct q-function optimization, arXiv] and OREO [Ref: Wang et al., 2024b, Offline reinforcement learning for llm multi-step reasoning, arXiv] propose novel value-function optimization techniques. DAPO [Ref: Yu et al., 2025, Dapo: An open-source llm reinforcement learning system at scale, arXiv] uses dynamic sampling in a large-scale RL system. A range of open-source training frameworks supports these advances. Early frameworks such as SimpleRL [Ref: Zeng et al., 2025b, 7b model and 8k examples: Emerging reasoning with reinforcement learning is both effective and efficient, Notion Blog] and DeepScaler [Ref: Luo et al., 2025b, Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, Notion Blog] quickly reproduced the DeepSeek R1 technology stack. SimpleRL-Zoo [Ref: Zeng et al., 2025a, Simplerlzoo: Investigating and taming zero reinforcement learning for open base models in the wild, arXiv] provides more experimental detail. Others, such as X-R1 [Ref: X-R1Team, 2025, X-r1, Github] and TinyZero [Ref: Pan et al., 2025b, Tinyzero, Github], focus on an intuitive and cost-effective user experience. Notably, Open-Reasoner-Zero [Ref: Hu et al., 2025b, Openreasoner-zero: An open source approach to scaling reinforcement learning on the base model, Github] reproduces the DeepSeek R1-zero training recipe with a 32B model. Further frameworks such as OpenR [Ref: Wang et al., 2024c, Openr: An open source framework for advanced reasoning with large language models, arXiv], OpenRLHF [Ref: Hu et al., 2024, Openrlhf: An easy-to-use, scalable and high-performance rlhf framework, arXiv], OpenR1 [Ref: HuggingFace, 2025, Open r1: A fully open reproduction of deepseek-r1], Logic-RL [Ref: Xie et al., 2025, Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning, arXiv], and AReaL [Ref: AntResearch-RL-Lab, 2025, Areal: Ant reasoning rl] further facilitate reproduction of internal scaling and academic research on it.

With a reward model. With a Bradley-Terry model optimized on human preferences [Ref: Zheng et al., 2023b, Secrets of rlhf in large language models part i: Ppo, arXiv] as the reward model, PPO [Ref: Schulman et al., 2017, Proximal policy optimization algorithms] has become one of the most influential algorithms for internal scaling thanks to its efficiency and stability. Building on PPO, ReMax [Ref: Li et al., 2023b, Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models, arXiv] introduces variance-reduction techniques and combines them with REINFORCE [Ref: Sutton et al., 1999, Policy gradient methods for reinforcement learning with function approximation, Advances in Neural Information Processing Systems] and RLOO [Ref: Ahmadian et al., 2024, Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms, ACL], which removes the need for an additional value model, lowers GPU memory usage, and speeds up training. GRPO [Ref: Shao et al., 2024, Deepseekmath: Pushing the limits of mathematical reasoning in open language models, arXiv] replaces the traditional value model with an improved sampling strategy. REINFORCE++ [Ref: Hu et al., 2025a, Reinforce++: A simple and efficient approach for aligning large language models, arXiv] further simplifies and strengthens GRPO training. DVPO [Ref: Huang et al., 2025a, Lean and mean: Decoupled value policy optimization with global value guidance, arXiv] proposes a streamlined framework that replaces the reward model with a pretrained global value model. PRIME [Ref: Cui et al., 2025, Process reinforcement through implicit rewards, arXiv] integrates the SFT model as a PRM within a unified RL framework. SPPD [Ref: Yi et al., 2025, Sppd: Self-training with process preference learning using dynamic value margin, arXiv] uses process preference learning for self-training. Recent work also addresses other challenges: UGDA [Ref: Sun et al., 2025, Uncertainty and influence aware reward model refinement for reinforcement learning from human feedback, ICLR] uses the uncertainty and influence of samples to refine the reward model iteratively. VinePPO [Ref: Kazemnejad et al., 2024, Vineppo: Unlocking rl potential for llm reasoning through refined credit assignment, arXiv] exploits the flexibility of language environments to compute unbiased Monte Carlo estimates. LCPO [Ref: Aggarwal and Welleck, 2025, L1: Controlling how long a reasoning model thinks with reinforcement learning, arXiv] focuses on optimizing both accuracy and adherence to user-specified length constraints. ReST-MCTS* [Ref: Zhang et al., 2024a, ReST-MCTS*: LLM self-training via process reward guided tree search, NeurIPS] uses tree-search-based RL to bypass the step-by-step manual annotation otherwise needed to train process rewards.

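3.2 Inference-Based Methods

Unlike training-based methods, which adjust model parameters offline, inference-based methods adjust computation dynamically during deployment. The paradigm has four basic components: (i) Stimulation, which encourages the model to generate longer outputs or multiple candidate outputs; (ii) Verification, which filters or scores outputs by correctness or other criteria; (iii) Search, which explores the sample space systematically; (iv) Aggregation, which consolidates multiple outputs into a final output. These four components are usually combined to allocate test-time compute more effectively and improve performance on complex reasoning tasks.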

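3.2.1 Stimulation

Core role: Stimulation is the first step in encouraging the model to allocate more computation to thinking. It mainly stimulates the LLM to generate (i) longer samples and (ii) more samples, instead of a single short sample from a plain prompt.

Prompt Strategy. The LLM is guided by prompts to scale at test time rather than produce an answer directly. For example, adding explicit instructions such as "Please think step by step" [Ref: Lightman et al., 2023, Let's verify step by step, ICLR] improves the model's ability to decompose complex problems. Other techniques [Ref: Wei et al., 2022, Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems; Ranaldi et al., 2025, Improving chain-of-thought reasoning via quasi-symbolic abstractions, arXiv] likewise rely on stating requirements explicitly in the prompt to stimulate sample generation.

Decode Strategy. The decoding process is modified to encourage the LLM to adaptively generate longer, more detailed samples. Techniques include injecting filler tokens [Ref: Pfau et al., 2024, Let's think dot by dot: Hidden computation in transformer language models, CoLM], adaptively injecting predefined phrases [Ref: Jin et al., 2020, What disease does this patient have? a large-scale open domain question answering dataset from medical exams, arXiv], budget forcing [Ref: Muennighoff et al., 2025, s1: Simple test-time scaling, arXiv], forcing intermediate generation [Ref: Li et al., 2025f, From drafts to answers: Unlocking llm potential via aggregation fine-tuning, arXiv], and predictive decoding [Ref: Ma et al., 2025a, Non-myopic generation of language models for reasoning and planning, ICLR].

Latent Strategy. Test-time compute is scaled by encouraging deeper or recurrent thinking in the hidden representations. For example, Hao et al. [Ref: Hao et al., 2024, Training large language models to reason in a continuous latent space, arXiv] propose that the model carry out reasoning steps in the hidden space. Kong et al. [Ref: Kong et al., 2025, Scalable language models with posterior inference of latent thought vectors, arXiv] introduce a latent-thought framework that generates text conditioned on inferred latent variables. Other methods [Ref: Saunshi et al., 2025, Reasoning with latent thoughts: On the power of looped transformers, arXiv] use looped inference to refine hidden states repeatedly.

Self-Repetition Strategy. Besides generating longer samples, another way to stimulate the LLM is to generate multiple samples. A common strategy is to prompt the LLM repeatedly at the decoding stage, i.e., self-repetition [Ref: Wang et al., 2023b, Self-consistency improves chain of thought reasoning in language models, ICLR]. Another strategy is to prompt the LLM sequentially to mimic a refinement process [Ref: Madaan et al., 2023, Self-refine: Iterative refinement with self-feedback, Conference on Neural Information Processing Systems] or correlation under constraints [Ref: Ferraz et al., 2024, Llm self-correction with decrim: Decompose, critique, and refine for enhanced following of instructions with multiple constraints, Findings of EMNLP].

Mixture-of-Model Strategy. The "wisdom of the crowd" is pooled by coordinating sampling across multiple models. These LLMs can play homogeneous roles [Ref: Wang et al., 2025a, Mixture-of-agents enhances large language model capabilities, ICLR] or heterogeneous roles [Ref: Chen et al., 2024i, Brain-inspired two-stage approach: Enhancing mathematical reasoning by imitating human thought processes, arXiv; He et al., 2025, Enhancing llm reasoning with multi-path collaborative reactive and reflection agents, arXiv].

Table 1: Summary of selected stimulation techniques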

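3.2.2 Verification

Core role: Verifying the correctness and consistency of LLM outputs during test-time scaling is essential. A robust verification process can be used to: select output samples directly under the parallel scaling paradigm; guide the stimulation process and decide when to stop under the sequential scaling paradigm; serve as the criterion during search; and decide which samples to aggregate and how.

Outcome Verification. Common approaches include scoring multiple candidate answers with a separate verifier model [Ref: Cobbe et al., 2021, Training verifiers to solve math word problems, arXiv], applying self-consistency and voting mechanisms [Ref: Wang et al., 2023b, Self-consistency improves chain of thought reasoning in language models, ICLR] and discriminator LMs [Ref: Chen et al., 2024j, When is tree search useful for LLM planning? it depends on the discriminator, ACL], and using tool assistance [Ref: Gou et al., 2024, CRITIC: Large language models can self-correct with tool-interactive critiquing, ICLR] or heuristic checks [Ref: DeepSeek-AI, 2025, Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, arXiv]. For specific tasks such as travel planning, functional scoring is also used [Ref: Lee et al., 2025, Evolving deeper llm thinking, arXiv]. Zhang et al. [Ref: Zhang et al., 2025d, Generative verifiers: Reward modeling as next-token prediction, arXiv] recast outcome verification as a next-token prediction task. Li et al. [Ref: Li et al., 2025g, Learning to reason from feedback at test-time, arXiv] formulate feedback utilization as an optimization problem.

Multi-perspective verification. Some outcome verification methods assess sample quality from several perspectives. Liu et al. [Ref: Liu et al., 2023b, Plan, verify and switch: Integrated reasoning with diverse x-of-thoughts, EMNLP] perform both passive verification with external tools and active verification through a reflection mechanism. Zhang et al. [Ref: Zhang et al., 2024c, Wrong-of-thought: An integrated reasoning framework with multi-perspective verification and wrong information, Findings of EMNLP] verify each sample along three aspects: assertion, process, and result. Lifshitz et al. [Ref: Lifshitz et al., 2025, Multi-agent verification: Scaling test-time compute with goal verifiers, Workshop on Reasoning and Planning for Large Language Models] scale the number of verifier agents to an arbitrary number.

Process Verification. Process verification methods check both the sample's result and the process that produced it, and are typically used for tasks with a formal, deductive process such as reasoning, coding, or mathematics. They are also known as process reward models (PRMs) or state verification. Lightman et al. [Ref: Lightman et al., 2023, Let's verify step by step, ICLR] train a PRM as step-level verification for math tasks. Yao et al. [Ref: Yao et al., 2023b, Tree of thoughts: Deliberate problem solving with large language models, Conference on Neural Information Processing Systems] use an LM-based state verifier to guide tree-structured search. Xie et al. [Ref: Xie et al., 2023, Self-evaluation guided beam search for reasoning, NeurIPS] prompt the same LM to evaluate the current step. Hosseini et al. [Ref: Hosseini et al., 2024, V-star: Training verifiers for self-taught reasoners, CoLM] suggest training the verifier on both accurate and inaccurate generated data. Ling et al. [Ref: Ling et al., 2023, Deductive verification of chain-of-thought reasoning, Advances in Neural Information Processing Systems] decompose the verification process deductively. Li et al. [Ref: Li et al., 2025b, START: Self-taught reasoner with tools, arXiv] rely on an external toolbox (such as a code interpreter) to verify the process.

Table 2: Summary of selected verification techniques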

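3.2.3 Search

Core role: Search is a commonly used component of test-time scaling. It explores the LLM's candidate options in a structured way to make full use of the model's capabilities. Existing search-based test-time scaling methods show marked performance gains on tasks such as complex mathematics.

Tree search. Yao et al. [Ref: Yao et al., 2023b, Tree of thoughts: Deliberate problem solving with large language models, Conference on Neural Information Processing Systems] decompose output samples into multiple thoughts organized as a tree, and achieve strong performance on reasoning tasks using only naive tree search algorithms (such as depth-first and breadth-first search). Monte Carlo Tree Search (MCTS) [Ref: Coulom, 2006, Efficient selectivity and backup operators in monte-carlo tree search, International conference on computers and games] is also widely used. Chaffin et al. [Ref: Chaffin et al., 2022, Ppl-mcts: Constrained textual generation through discriminator-guided mcts decoding, NAACL] apply discriminator-guided MCTS at the decoding stage for constrained text generation. Zhang et al. [Ref: Zhang et al., 2023b, Planning with large language models for code generation, ICLR] extend MCTS to code generation to strengthen planning. Wu et al. [Ref: Wu et al., 2024d, Scaling inference computation: Compute-optimal inference for problem-solving with language models, Workshop on Mathematical Reasoning and AI at NeurIPS'24] empirically analyze various search algorithms and design a reward-balanced search algorithm to achieve Pareto-optimal test-time scaling. Edward Beeching [Ref: Edward Beeching, 2024, Scaling test-time compute with open models] extends beam search by adding diversity considerations.

Other search structures. Beyond tree-structured search, Besta et al. [Ref: Besta et al., 2024, Graph of thoughts: Solving elaborate problems with large language models, AAAI Conference on Artificial Intelligence] model the output as a graph search problem. Xie et al. [Ref: Xie et al., 2023, Self-evaluation guided beam search for reasoning, NeurIPS] propose a stochastic beam search solution based on self-evaluation. Pan et al. [Ref: Pan et al., 2025a, Coat: Chain-of-associated-thoughts framework for enhancing large language models reasoning, arXiv] augment MCTS with a proposed associative memory so that the knowledge base can be updated dynamically. Li et al. [Ref: Li et al., 2025c, Reasoning-as-logic-units: Scaling test-time reasoning in large language models through logic unit alignment, arXiv] propose treating the reasoning process as the construction of a control-flow graph.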

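3.2.4 Aggregation

Core role: Aggregation consolidates multiple solutions into one final decision, improving the reliability and robustness of model predictions at test time. Depending on how the final output is produced, we distinguish two categories: (i) Selection, which picks the best-performing sample among all candidates; (ii) Fusion, which merges multiple samples into one through techniques such as weighting or generation.

Selection. This kind of aggregation can be viewed as a selection problem. A well-known example is choosing the most consistent answer, i.e., self-consistency [Ref: Wang et al., 2023b, Self-consistency improves chain of thought reasoning in language models, ICLR]. Because inaccurate samples degrade voting quality, several methods filter candidates before voting, for instance using an LM as a filter [Ref: Chen et al., 2024e, Are more LLM calls all you need? towards the scaling properties of compound AI systems, Conference on Neural Information Processing Systems] or length-filtered voting [Ref: Wu et al., 2025b, When more is less: Understanding chain-of-thought length in llms, arXiv]. Best-of-N [Ref: Irvine et al., 2023, Rewarding chatbots for real-world engagement with millions of users, arXiv] replaces the self-consistency criterion with a scalar score produced by an external verifier. Song et al. [Ref: Song et al., 2024, The good, the bad, and the greedy: Evaluation of llms should not ignore non-determinism, arXiv] show that Best-of-N on small LLMs can reach performance comparable to SOTA proprietary models. Munkhbat et al. [Ref: Munkhbat et al., 2025, Self-training elicits concise reasoning in large language models, arXiv] add few-shot conditioned filtering before Best-of-N selection. Sessa et al. [Ref: Sessa et al., 2024, Bond: Aligning llms with best-of-n distillation, arXiv] tune the results of Best-of-N into the LM through RLHF.

Fusion. When candidate samples are of low quality, direct selection may work poorly; fusion methods instead aim to merge multiple samples into one. Brown et al. [Ref: Brown et al., 2024, Large language monkeys: Scaling inference compute with repeated sampling, arXiv] and Li et al. [Ref: Li et al., 2023a, Making language models better reasoners with step-aware verifier, ACL] extend the Best-of-N idea by weighting each sample with the score from an external verifier. Jiang et al. [Ref: Jiang et al., 2023, LLM-blender: Ensembling large language models with pairwise ranking and generative fusion, ACL] directly prompt another LLM to act as a summarizer that merges multiple selected samples. Li et al. [Ref: Li et al., 2025j, Llms can generate a better answer by aggregating their own responses, arXiv] replace the majority vote in self-consistency with generative self-aggregation.

Table 3: Summary of selected aggregation techniques. BoN stands for Best-of-N.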

A4 Experimental Settings and Results

Experimental settings: Application scenarios of TTS (Where to Scale)

TTS can markedly improve LLM performance in many real-world scenarios. We categorize these scenarios systematically and list representative benchmarks.

4.1 Reasoning-Intensive Tasks

These tasks require structured, explicit multi-step reasoning, precision, and strict verification of correctness.

Mathematical reasoning: The challenge is to generate accurate step-by-step solutions and verify intermediate steps. Representative benchmarks include MiniF2F [Ref: Zheng et al., 2021, Minif2f: a cross-system benchmark for formal olympiad-level mathematics, arXiv], AIME 2024 [Ref: Google, 2025, Aime problems and solutions], MATH-500 [Ref: Zhang et al., 2024a, ReST-MCTS*: LLM self-training via process reward guided tree search, NeurIPS], AMC 2023 [Ref: Guan et al., 2025, rstar-math: Small llms can master math reasoning with self-evolved deep thinking, arXiv], PutnamBench [Ref: Tsoukalas et al., 2024, Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition], MUSTARD [Ref: Huang et al., 2024a, MUSTARD: mastering uniform synthesis of theorem and proof data, ICLR], and OlympiadBench [Ref: He et al., 2024a, OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems, ACL].
Programming and code generation: The challenge is to produce correct implementations and debug code iteratively. Representative datasets include Codeforces [Ref: codeforce, 2025, Codeforces], SWE-bench [Ref: Jimenez et al., 2024, SWE-bench: Can language models resolve real-world github issues?, ICLR], and LiveCodeBench [Ref: Jain et al., 2025, Livecodebench: Holistic and contamination free evaluation of large language models for code, ICLR].
Game playing and strategic reasoning: These tasks involve adaptive planning and interactive decision making. A representative benchmark is SysBench [Ref: Google, 2025].
Scientific reasoning: These tasks require integrating knowledge across disciplines. Representative benchmarks include GPQA Diamond [Ref: Rein et al., 2024, GPQA: A graduate-level google-proof q&a benchmark, CoLM] and MR-Ben [Ref: Zeng et al., 2024, MR-ben: A meta-reasoning benchmark for evaluating system-2 thinking in LLMs, NeurIPS].
Medical reasoning: These tasks require reliable, accurate reasoning that mirrors the decision logic of medical experts. Representative datasets include JAMA Clinical Challenge [Ref: Chen et al., 2025a, Benchmarking large language models on answering and explaining challenging medical questions, arXiv], Medbullets [Ref: Chen et al., 2025a], and MedQA [Ref: Jin et al., 2020, What disease does this patient have? a large-scale open domain question answering dataset from medical exams, arXiv].

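4.2 Agentic Tasks

Agent scaling falls into three categories: scaling as a design choice, scaling to analyze emergent behavior, and scaling through environment interaction.

Agent scaling as a design choice: This line of work studies scaling laws for collaborating agents, that is, how increasing the number of collaborating agents affects system performance. Studies find that a larger ensemble improves performance on a variety of tasks [Ref: Li et al., 2024a, More agents is all you need, arXiv], but that the relationship between the number of LLM calls and performance is non-monotonic [Ref: Chen et al., 2024d, Are more llm calls all you need? towards scaling laws of compound inference systems, arXiv].
Scaling agents for emergent social abilities: This line of work studies emergent behavior in large-scale simulations, particularly in social-science applications. Examples include the emergence of information cocoons [Ref: Zhang et al., 2025h, Understanding dynamic diffusion process of llm-based agents under information asymmetry, arXiv] and large-scale simulation of social life [Ref: Piao et al., 2025, Agentsociety: Large-scale simulation of llm-driven generative agents advances understanding of human behaviors and society, arXiv].
Scaling environment feedback: Here the agent's interaction with the environment is scaled to obtain richer feedback. Research finds that an agent's intrinsic performance follows a power law in model size and environment interaction [Ref: Hilton et al., 2023, Scaling laws for single-agent reinforcement learning, arXiv].
Simulation environments for agentic tasks: Representative benchmarks include WebShop [Ref: Yao et al., 2023a, Webshop: Towards scalable real-world web interaction with grounded language agents, arXiv], WebArena [Ref: Zhou et al., 2023c, Webarena: A realistic web environment for building autonomous agents, arXiv], SciWorld [Ref: Wang et al., 2022, Scienceworld: Is your agent smarter than a 5th grader?, EMNLP], and TextCraft [Ref: Prasad et al., 2024, Adapt: As-needed decomposition and planning with language models, Findings of NAACL].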

4.3 Other Tasks

General tasks: These evaluate a model's general-purpose performance. Representative benchmarks include AGIEval [Ref: Zhong et al., 2024, AGIEval: A human-centric benchmark for evaluating foundation models, Findings of NAACL], MMLU-Pro [Ref: Wang et al., 2024d, Measuring multimodal mathematical reasoning with MATH-vision dataset, NeurIPS Datasets and Benchmarks Track], and Gaokao [Ref: Guan et al., 2025, rstar-math: Small llms can master math reasoning with self-evolved deep thinking, arXiv].
Open-ended tasks: These evaluate subjective, open-ended, and general reasoning. Representative benchmarks include AlpacaEval2.0 [Ref: Dubois et al., 2024, Length-controlled alpacaeval: A simple way to debias automatic evaluators, arXiv], ArenaHard [Ref: Li et al., 2024c, From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline, arXiv], IF-Eval [Ref: Zhou et al., 2023b, Instruction-following evaluation for large language models, arXiv], and C-Eval [Ref: Huang et al., 2023, C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models, NeurIPS Datasets and Benchmarks Track].
Knowledge-intensive tasks: These require retrieving and synthesizing factual knowledge from external sources. Representative benchmarks include SimpleQA [Ref: Wei et al., 2024a, Measuring short-form factuality in large language models], C-SimpleQA [Ref: He et al., 2024c, Chinese simpleqa: A chinese factuality evaluation for large language models], and FRAMES [Ref: Krishna et al., 2025, Fact, fetch, and reason: A unified evaluation of retrieval-augmented generation].
Evaluation tasks: Here the LLM acts as a judge that assesses quality. Representative benchmarks include RewardBench [Ref: Lambert et al., 2024, Rewardbench: Evaluating reward models for language modeling, arXiv], JudgeBench [Ref: Tan et al., 2025, Judgebench: A benchmark for evaluating llm-based judges, arXiv], and RMBench [Ref: Liu et al., 2024c, Rm-bench: Benchmarking reward models of language models with subtlety and style, arXiv], among others.
Multimodal tasks: These require effective cross-modal integration and reasoning over visual and textual inputs. Representative benchmarks include MMMU [Ref: Yue et al., 2024, Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, arXiv], MathVista [Ref: Lu et al., 2024, Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, arXiv], and MathVision [Ref: Wang et al., 2024d, Measuring multimodal mathematical reasoning with MATH-vision dataset, NeurIPS Datasets and Benchmarks Track], among others.

Table 4: Summary of benchmarks

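Experimental results: Evaluation metrics for TTS (How Well to Scale)

This section groups the metrics for evaluating test-time scaling methods into four dimensions: performance, controllability, scalability, and efficiency.

5.1 Performance

Pass@1: One of the most common metrics for the correctness of a model's first output. It measures the fraction of problems for which the first generated solution is correct.
Pass@k (coverage): Measures whether at least one of k sampled outputs is correct. Its unbiased estimator is:

$$\text{Pass@}k = \frac{1}{n}\sum_{i=1}^{n}\left[1 - \frac{\binom{N - C_i}{k}}{\binom{N}{k}}\right]$$

where $n$ is the number of problems, $N$ is the total number of samples per problem, and $C_i$ is the number of correct samples for the $i$-th problem.
Cons@k (consensus@k): Measures the correctness of the majority vote over k independently sampled outputs.
Arena-based evaluation (pairwise win rate): Model outputs are compared pairwise by human or LLM-based judges.
Task-specific metrics: For example, Codeforces percentile and Elo rating measure coding ability in competitive programming settings.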
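The Pass@k estimator in this subsection is straightforward to compute. The sketch below assumes the standard unbiased form, in which each problem contributes 1 - C(N - C_i, k) / C(N, k); the function name `pass_at_k` and its argument layout are illustrative choices, not taken from the survey.

```python
from math import comb
from typing import Sequence


def pass_at_k(num_samples: int, num_correct: Sequence[int], k: int) -> float:
    """Unbiased Pass@k averaged over problems.

    num_samples: N, the number of samples drawn per problem.
    num_correct: C_i, the number of correct samples for each problem.
    """
    if not 1 <= k <= num_samples:
        raise ValueError("k must satisfy 1 <= k <= num_samples")
    if not num_correct:
        raise ValueError("num_correct must not be empty")
    total = 0.0
    for c in num_correct:
        # comb(N - c, k) is 0 when fewer than k incorrect samples exist,
        # so the problem then counts as certainly solved within k draws.
        total += 1.0 - comb(num_samples - c, k) / comb(num_samples, k)
    return total / len(num_correct)


# Two problems with 4 samples each: one never solved, one always solved.
print(pass_at_k(4, [0, 4], k=2))
# One problem with 2 correct samples out of 4.
print(pass_at_k(4, [2], k=1), pass_at_k(4, [2], k=2))
```

With k = 1 the estimator reduces to the average fraction of correct samples, which is how Pass@1 is commonly estimated from more than one sample per problem.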

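5.2 Efficiency

Efficiency covers both computational cost and the quality of the reasoning process. The core challenge is the trade-off between reasoning length and solution quality.

Key concern: Reasoning efficiency $\eta(M)$ is defined as the expected ratio of solution quality $Q$ to computational cost $C$:

$$\eta(M) = \mathbb{E}\left[\frac{Q}{C}\right]$$

Inefficient reasoning patterns include redundancy (repeating reasoning steps), underthinking (switching reasoning direction prematurely), and overthinking (over-verifying simple problems).
General computational cost metrics:
- Token cost: the total number of tokens generated during reasoning.
- FLOPs-based efficiency analysis: measured by plotting accuracy against total inference FLOPs.
- KV cache size: the total memory needed to store the key-value cache.
Reasoning efficiency metrics:
- Underthinking Score: quantifies cases where the model produces a correct intermediate thought early on but fails to reach the correct final answer.
  
  Equation 6
- Outcome Efficiency: quantifies how economically the model reaches the correct answer across multiple rounds of reasoning.
  
  Equation 7
- Process Efficiency: evaluates how efficiently the model explores different reasoning strategies across multiple rounds of solutions.
  
  Equation 8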

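5.3 Controllability

Controllability evaluates whether a test-time method can adhere to predefined resource constraints.

Control Metric: Measures the fraction of test-time compute values that stay within given upper and lower bounds.

Equation 9
Length deviation metrics:
- Mean deviation from the target length:
  
  Equation 10
- Root mean squared error (RMSE) of the length deviation:
  
  Equation 11
k–ϵ controllability: Quantifies whether the model can be steered to produce a target output within a bounded prompt length and an allowed deviation.

Equation 12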
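As an illustration of the control metric described in this subsection, the following sketch computes the fraction of observed compute values (for example, generated token counts) that fall inside a closed interval. The name `control_metric` and the choice of inclusive bounds are assumptions made for the example.

```python
from typing import Sequence


def control_metric(compute_values: Sequence[float], lower: float, upper: float) -> float:
    """Fraction of test-time compute values inside the closed range [lower, upper]."""
    if not compute_values:
        raise ValueError("compute_values must not be empty")
    inside = sum(1 for a in compute_values if lower <= a <= upper)
    return inside / len(compute_values)


# Hypothetical token counts from four runs under a 500-token budget.
print(control_metric([100, 250, 400, 900], lower=0, upper=500))
```

A value of 1.0 means every run respected the budget; lower values indicate how often the method overshoots or undershoots the requested compute.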

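5.4 Scalability

Scalability measures how effectively a test-time scaling method turns additional compute into better performance.

Scaling Metric: Captures the average slope of the performance gain as compute increases.

Equation 13
Scaling curve (accuracy vs. compute): Visualizes how metrics such as accuracy change as the compute budget grows, which helps reveal diminishing returns and performance saturation points.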
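One plausible reading of "average slope" is the mean slope over all pairs of evaluated compute budgets. The sketch below implements that pairwise form; it is an assumption made for illustration and may differ in detail from the survey's Equation 13. The name `scaling_metric` and the `(compute, performance)` tuple layout are likewise hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def scaling_metric(points: Sequence[Tuple[float, float]]) -> float:
    """Mean pairwise slope of performance with respect to compute.

    points: (compute, performance) pairs measured at different budgets.
    """
    pts: List[Tuple[float, float]] = sorted(points)
    slopes = [
        (perf_b - perf_a) / (b - a)
        for (a, perf_a), (b, perf_b) in combinations(pts, 2)
        if b != a  # skip duplicate budgets to avoid division by zero
    ]
    if not slopes:
        raise ValueError("need at least two distinct compute budgets")
    return sum(slopes) / len(slopes)


# Hypothetical accuracy measured at 1x, 2x, and 4x compute.
print(scaling_metric([(1, 0.2), (2, 0.4), (4, 0.5)]))
```

A positive value indicates that more compute helps on average; plotting the same points as a scaling curve shows where the gains begin to flatten.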

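Organization and Trends of Test-Time Scaling

Table 5: Combinations commonly used for inference scaling in the existing literature

Development path: From 2022 to 2023, researchers emphasized structured reasoning. In 2024, methods such as PRM and MCTS enabled automatic supervision of complex reasoning trajectories, providing rich annotated data for fine-tuning. Subsequent methods such as o1 and R1 showed that pure RL can also elicit comprehensive, logically sound reasoning.

Complementarity: These techniques are complementary rather than mutually exclusive. For example, R1 requires an SFT-based rejection-sampling warmup. Achieving stronger scaling requires integrating these methods systematically.
Optimal scaling: No single simple scaling recipe fits every problem. Researchers increasingly focus on compute-optimal scaling solutions [Ref: Wu et al., 2024d, Scaling inference computation: Compute-optimal inference for problem-solving with language models, Workshop on Mathematical Reasoning and AI at NeurIPS'24; Snell et al., 2024, Scaling llm test-time compute optimally can be more effective than scaling model parameters, arXiv].
Blurring boundaries: The line between inference-based and tuning-based methods is blurring, and the target of scaling changes from stage to stage. Some studies tune inference-based capabilities into the LLM by synthesizing high-quality data [Ref: Li et al., 2025f; Munkhbat et al., 2025], while others propose techniques that make better use of LLM capabilities in both training and inference [Ref: Wan et al., 2024, Alphazero-like tree-search can guide large language model decoding and training, ICML].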

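A5 Conclusions and Outlook

A Practical Guide to Test-Time Scaling

Tasks suited to TTS: Almost all tasks. Traditional reasoning tasks (such as math olympiad problems and complex coding) show the largest gains, but open-ended tasks (such as review generation) and complex real-world settings (such as medicine and law) also show potential.
Fast paths to implementing TTS: There are three main technical paths: i) deliberate reasoning procedures at inference time, ii) imitation of complex reasoning trajectories, and iii) RL-based incentivization. To get a quick sense of the upper bound of TTS, use a model trained with (iii) directly. To build a baseline at minimal cost, start with (i) and then use (ii) to validate and generalize.
Combining paths: The paths are not mutually exclusive and can be integrated seamlessly. For example, R1 requires SFT warmup.
Effect of training-time strategies: SFT can provide a strong reasoning prior and improve the stability and quality of scaling strategies. RL fine-tuning can incentivize concise, correct reasoning chains. The two can be combined, as in DeepSeek-R1.
Improving the efficiency of multi-round TTS: Efficiency can be improved at the model level (fine-tuning the model to produce concise reasoning), the output level (applying early-stopping strategies), and the prompt level (using token budgets, step limits, and so on).
Representative baseline methods:
- Parallel: self-consistency, Best-of-N
- Sequential: STaR, Self-Refine, PRM
- Hybrid: MCTS, ToT
- Internal: Distilled-R1, R1
Evaluation: Beyond accuracy, efficiency (the trade-off between performance and cost) is key. Robustness, safety, bias, and interpretability are also receiving growing attention.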

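Challenges and Opportunities

Pushing scaling further is the frontier:
- Parallel scaling: The challenge is to move from brute-force coverage scaling to a more guided and efficient process, such as smart coverage scaling and verifier-augmented parallel scaling.
- Sequential scaling: The challenge is to maintain coherence and prevent error accumulation. Future directions include structured self-refinement and verification-enhanced iterative scaling.
- Hybrid scaling: The challenge is to improve generalization. Future directions include generalized hybrid scaling architectures and multi-agent interactive scaling.
- Internal scaling: The challenges are effective compute allocation, stability and consistency, and interpretability and controllability.
Clarifying the nature of scaling techniques is the foundation:
- A deeper understanding is urgently needed of how the core techniques (SFT, RL, reward modeling) contribute to test-time scaling.
- Reward modeling needs to be re-evaluated, for example whether PRMs truly improve multi-step reasoning.
- The mathematical properties of test-time scaling should be explored, such as how performance scales with the number of reasoning steps.
- Adaptive test-time scaling should be studied so that models can adjust their reasoning process automatically to the problem at hand.
Optimizing scaling is key:
- New TTS methods need to be evaluated and optimized systematically along all aspects, including task accuracy, efficiency, robustness, bias, safety, and interpretability.
Cross-domain generalization is the mainstream direction:
- Test-time scaling is expected to extend to domains such as medicine, finance, and law that demand complex decision making and structured reasoning.
- Challenges include balancing cost and accuracy, ensuring domain-specific interpretability, and integrating external knowledge and real-world constraints.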

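Conclusion

This paper is the first survey to decompose TTS through a hierarchical taxonomy, offering a structured perspective that supports conceptual understanding and the identification of individual contributions. It emphasizes practicality by providing a hands-on guide aligned with each dimension of the taxonomy. Building on this framework, we outline the key trends, challenges, and opportunities that will shape the future of TTS research.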

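A6 Appendix

A Detailed Outcome Verification Methods

This appendix expands on the outcome verification techniques used with LLMs at test time. These techniques operate on the fly during inference, typically by generating multiple solutions and applying a "proposer-verifier" framework.

A.1 Scoring with verifier models. A verifier is usually trained on human feedback or supervised data and scores each candidate by its expected correctness or quality. Variants include: i) pairwise comparison verifiers, ii) weighted voting systems, iii) LLM-based verifiers such as LLM-as-a-Judge and critic-based models.

A.2 Self-consistency and voting mechanisms. Self-consistency generates multiple independent reasoning chains and selects the final answer by majority vote. The underlying assumption is that if several chains converge on the same answer, that answer is more likely to be correct. Voting can also be carried out across multiple models, forming an ensemble vote.

A.3 Tool-assisted and heuristic verification. In domains such as code generation or mathematical problem solving, outcome verification can be achieved through direct execution or rule-based checks.
* Execution-based verification: In programming tasks, correctness is tested by running the code.
* Fact checking via retrieval: In open-domain question answering, a search engine or knowledge base can serve as a strong verifier.
* Rule-based filters: Simple heuristic filters automatically reject bad outputs, for example by disallowing certain unsafe or nonsensical replies in a dialogue system.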
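Execution-based verification, the first item in the list above, can be sketched in a few lines. The example assumes candidates are already available as Python callables and that a small set of input/output test cases exists; the names `passes_tests` and `filter_by_execution` are hypothetical. A real system would run untrusted generated code in a sandbox with time and memory limits, which this toy omits.

```python
from typing import Any, Callable, List, Sequence, Tuple

# A test case is (positional arguments, expected result).
TestCase = Tuple[Tuple[Any, ...], Any]


def passes_tests(candidate: Callable[..., Any], tests: Sequence[TestCase]) -> bool:
    """Return True only if the candidate gives the expected result on every test."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crashing candidate is treated as a failed verification.
            return False
    return True


def filter_by_execution(
    candidates: Sequence[Callable[..., Any]], tests: Sequence[TestCase]
) -> List[Callable[..., Any]]:
    """Keep the candidates that pass all test cases."""
    return [c for c in candidates if passes_tests(c, tests)]


# Three hypothetical candidate solutions for "add two integers".
candidates = [
    lambda a, b: a + b,   # correct
    lambda a, b: a * b,   # passes (2, 2) by coincidence, fails (1, 3)
    lambda a, b: a // 0,  # always raises ZeroDivisionError
]
tests = [((2, 2), 4), ((1, 3), 4)]
verified = filter_by_execution(candidates, tests)
print(len(verified))
```

The second candidate shows why more than one test case matters: a single coincidental match would let an incorrect solution through the verifier.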

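B Representative Methods

B.1 Best-of-N. In this TTS method, the model generates N candidate outputs for a given input, and the best one is selected according to a chosen evaluation metric. Given an input $x$ and a model $f$, N independent outputs $y_1, \ldots, y_N \sim f(x)$ are drawn using different random seeds or sampling strategies, and the result $\hat{y} = \arg\max_{y_i,\ i \in \{1, \ldots, N\}} M(y_i)$ is selected, where $M$ is a quality scoring function. Increasing N raises the probability of obtaining a high-quality result.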
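A minimal Best-of-N sketch follows. The generator and scorer are toy stand-ins (a noisy numeric guess and a distance-to-target score); in practice `generate` would wrap a sampling call to an LLM and `score` a verifier or reward model. All names here are hypothetical.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    generate: Callable[[float, random.Random], float],
    score: Callable[[float], float],
    x: float,
    n: int,
    seed: int = 0,
) -> Tuple[float, List[float]]:
    """Draw n candidates for input x and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(x, rng) for _ in range(n)]
    return max(candidates, key=score), candidates


# Toy stand-ins: the "model" perturbs its input, the "verifier" prefers values near 3.0.
def toy_generate(x: float, rng: random.Random) -> float:
    return x + rng.uniform(-1.0, 1.0)


def toy_score(y: float) -> float:
    return -abs(y - 3.0)


best, candidates = best_of_n(toy_generate, toy_score, x=3.2, n=8)
print(best)
```

Because selection depends entirely on the scorer, the benefit of a larger n is bounded by how well the scoring function tracks true quality.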

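B.2 Majority voting. This is a basic ensemble strategy that aggregates multiple independent predictions into a final decision. Formally, given an ensemble of M models $h_1, h_2, \ldots, h_M$ and an input $x$, the majority-vote result is defined as:

$$\hat{y} = \arg\max_{c} \sum_{m=1}^{M} \mathbb{1}\{h_m(x) = c\}$$

where $\mathbb{1}\{\cdot\}$ is the indicator function and $c$ ranges over all possible classes or outputs.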
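Majority voting over already-extracted answers takes only a few lines of Python. The sketch assumes each prediction has been normalized to a hashable value such as a string; breaking ties in favor of the answer seen first is an implementation choice of this example, not something the definition prescribes.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(predictions: Sequence[Hashable]) -> Hashable:
    """Return the most frequent prediction; ties go to the one seen first."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    counts = Counter(predictions)
    # most_common orders equal counts by first occurrence.
    return counts.most_common(1)[0][0]


# Final answers extracted from four hypothetical reasoning chains.
print(majority_vote(["42", "41", "42", "7"]))
```

The same function covers self-consistency (several samples from one model) and ensemble voting (one sample from each of several models), since both reduce to counting final answers.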

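B.3 Process Reward Model (PRM). A PRM is a reward model designed to evaluate an entire reasoning trajectory step by step. Given an input problem $x$ and a sequence of reasoning steps $z_1, z_2, \ldots, z_T$, the full reasoning trajectory can be written as:

$$\tau = (x, z_1, z_2, \ldots, z_T)$$

The PRM is defined as a function that assigns a real-valued score to each step:

$$\text{PRM} : (x, z_1, \ldots, z_t) \mapsto r_t \in \mathbb{R}, \qquad t = 1, \ldots, T$$

A PRM is typically trained on per-step human or algorithmic annotations and thereby internalizes the notion of "partial credit".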

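B.4 Monte Carlo Tree Search (MCTS). MCTS is a simulation-based decision algorithm that builds a search tree incrementally by sampling many possible future trajectories (playouts). Each MCTS iteration has four phases: selection, expansion, simulation (rollout), and backpropagation.
1. Selection: Recursively choose the child action that maximizes a heuristic value until a leaf node is reached. A common policy is the Upper Confidence bound for Trees (UCT):

$$\text{UCT}(a) = \frac{w_a}{n_a} + c\sqrt{\frac{\ln N}{n_a}}$$

where $w_a$ is the total simulated reward, $n_a$ is the visit count of action $a$, $N$ is the total number of simulations of the parent state, and $c > 0$ is the exploration constant.
2. Expansion: After reaching a leaf state, create new child nodes by simulating unexplored actions.
3. Simulation (Rollout): Run a Monte Carlo simulation that plays out a complete episode to the end.
4. Backpropagation: Propagate the simulation result back up the path, updating the statistics of each node.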
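The selection phase can be made concrete with a small sketch of the UCT rule, assuming the common form w_a / n_a + c * sqrt(ln N / n_a). Two details are assumptions of this example: unvisited actions receive an infinite score so that each is tried once, and the parent count N is taken to be the sum of the children's visit counts. The helper names are hypothetical.

```python
import math
from typing import Dict, Tuple


def uct_score(w_a: float, n_a: int, n_parent: int, c: float = math.sqrt(2)) -> float:
    """UCT value of an action: mean reward plus an exploration bonus."""
    if n_a == 0:
        return math.inf  # assumption: always try unvisited actions first
    return w_a / n_a + c * math.sqrt(math.log(n_parent) / n_a)


def select_action(stats: Dict[str, Tuple[float, int]], c: float = math.sqrt(2)) -> str:
    """Pick the action with the highest UCT score.

    stats maps each action to (total simulated reward, visit count).
    """
    # Assumption: the parent's simulation count equals the sum of child visits.
    n_parent = sum(n for _, n in stats.values())
    return max(stats, key=lambda a: uct_score(stats[a][0], stats[a][1], n_parent, c))


stats = {"a": (3.0, 5), "b": (1.0, 1), "c": (0.0, 0)}
print(select_action(stats))                  # the unvisited action is chosen
print(select_action({"a": (3.0, 5), "b": (1.0, 1)}, c=0.0))  # pure exploitation
```

Setting c to 0 turns selection into greedy exploitation of the current mean rewards, while larger values of c push the search toward rarely visited actions.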

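B.5 Self-Refine. This TTS technique lets an LLM iteratively improve its own output through self-generated feedback. The model first produces an initial answer, then critiques or evaluates it, and finally uses the critique to refine the answer. This feedback-refinement loop can be repeated several times. Formally, writing $\mathcal{M}$ for the model and $x$ for the input, the procedure is:
1. Initial output generation: The model first produces an initial response:

$$y_0 = \mathcal{M}(x)$$

2. Feedback generation: At each refinement step $t$, the model evaluates the previous output and generates feedback:

$$\mathrm{fb}_t = \mathcal{M}(x, y_{t-1})$$

3. Refinement step: Using the generated feedback, the model updates its output:

$$y_t = \mathcal{M}(x, y_{t-1}, \mathrm{fb}_t)$$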
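The three steps above map directly onto a loop. In the sketch below, `generate`, `feedback`, and `refine` are hypothetical callables that would each wrap a differently prompted call to the same LLM; here they are replaced by a toy integer-square-root task so the loop can actually run. The optional `is_done` stopping check is an addition of this example.

```python
from typing import Callable, List, Optional, Tuple


def self_refine(
    generate: Callable[[int], int],
    feedback: Callable[[int, int], str],
    refine: Callable[[int, int, str], int],
    x: int,
    max_rounds: int,
    is_done: Optional[Callable[[str], bool]] = None,
) -> Tuple[int, List[int]]:
    """Generate an initial answer, then alternate feedback and refinement."""
    y = generate(x)                      # step 1: initial output
    history = [y]
    for _ in range(max_rounds):
        fb = feedback(x, y)              # step 2: critique the current output
        if is_done is not None and is_done(fb):
            break                        # stop once the critique is satisfied
        y = refine(x, y, fb)             # step 3: revise using the critique
        history.append(y)
    return y, history


# Toy task: find y with y * y == x, starting from 0 and moving by 1 per round.
def toy_generate(x: int) -> int:
    return 0


def toy_feedback(x: int, y: int) -> str:
    if y * y < x:
        return "too low"
    if y * y > x:
        return "too high"
    return "ok"


def toy_refine(x: int, y: int, fb: str) -> int:
    return y + 1 if fb == "too low" else y - 1


answer, history = self_refine(
    toy_generate, toy_feedback, toy_refine, x=49, max_rounds=10,
    is_done=lambda fb: fb == "ok",
)
print(answer, len(history))
```

Bounding the loop with `max_rounds` matters in practice, since self-generated feedback offers no guarantee of convergence.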

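B.6 Tree-of-Thought (ToT). ToT generalizes CoT to a branching search. At each reasoning step, the model can generate several candidate thoughts, forming a tree of possibilities. It evaluates these candidates and selects the most promising branches to expand further.
* Thought generation (state transition): At each step, the language model acts as a thought generator function $G$. Given the current state $s$, the model generates a set of next-step thoughts:

$$G(s) = \{t_1, t_2, \ldots, t_k\}$$

Each thought yields a new state, where $\oplus$ denotes appending the thought to the state:

$$s_i = s \oplus t_i, \qquad i = 1, \ldots, k$$

* State evaluation (heuristic function): ToT uses an evaluation function $f(s)$ to estimate the quality of a partial state $s$, with higher values indicating more promising states:

$$f : s \mapsto f(s) \in \mathbb{R}$$

* Search algorithm (tree expansion): ToT can adopt different search strategies, such as breadth-first search (BFS) or depth-first search (DFS).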
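The three components listed above (thought generator, state evaluator, search algorithm) can be combined into a breadth-first variant as sketched below. States are tuples of thoughts, and the toy task (pick numbers that sum to 9) replaces LLM calls; `tot_bfs` and the toy helpers are hypothetical names, and keeping a fixed number of states per level is one possible pruning rule among several.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]


def tot_bfs(
    root: State,
    generate_thoughts: Callable[[State], List[int]],
    evaluate: Callable[[State], float],
    is_goal: Callable[[State], bool],
    breadth: int,
    max_depth: int,
) -> Optional[State]:
    """Breadth-first ToT: expand all frontier states, keep the best `breadth`."""
    frontier: List[State] = [root]
    for _ in range(max_depth):
        # Thought generation: every frontier state proposes next-step thoughts.
        candidates = [s + (t,) for s in frontier for t in generate_thoughts(s)]
        if not candidates:
            return None
        # State evaluation: rank candidates by the heuristic and prune.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:breadth]
        for s in frontier:
            if is_goal(s):
                return s
    return None


# Toy task: choose numbers from {1, 3, 5} whose sum is exactly 9.
TARGET = 9
result = tot_bfs(
    root=(),
    generate_thoughts=lambda s: [1, 3, 5],
    evaluate=lambda s: -abs(TARGET - sum(s)),
    is_goal=lambda s: sum(s) == TARGET,
    breadth=2,
    max_depth=4,
)
print(result)
```

Replacing the level-by-level loop with a stack and backtracking on low-valued states would give the depth-first variant mentioned in the text.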

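B.7 Reinforcement Learning. The reasoning process itself can be formulated as a sequential decision problem. Training the model with RL makes it possible to reward outcomes that lead to correct or high-quality answers explicitly, which encourages the model to make better use of additional reasoning steps. Unlike relying on imitation learning alone, RL enables self-exploration: the model can try different reasoning paths and learn by trial and error which strategy yields the highest reward. This allows RL-trained language models to learn dynamic reasoning strategies, such as when to double-check intermediate results or how to backtrack and correct themselves when the reasoning goes astray.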