Published: 2025-09 · arXiv:2509.19128 (ServiceNow / Mila)
Alexandre Piché (ServiceNow AI Research), Ehsan Kamalloo (ServiceNow AI Research), Rafael Pardinas (ServiceNow AI Research), Xiaoyin Chen (Mila, Université de Montréal), Dzmitry Bahdanau (ServiceNow AI Research, Mila, McGill University)
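# Conventional RL Actor-Trainer logic
# Actor process
def Actor():
    Sprog = []  # in-flight (partially generated) sequences
    while True:
        # Hand finished sequences to the trainer via the shared queue.
        Sfin, Sprog = pop_finished_sequences(Sprog)
        Qtrain.put(Sfin)
        # Keep H sequences in flight at all times.
        if len(Sprog) < H:
            add_prompts_to_Sprog(H - len(Sprog))
        # Swap in new weights without discarding partial sequences.
        if Trainer_requests_weight_update:
            μ = receive_weight_update()
        Sprog = generate_next_tokens_with(μ)

# Trainer process
def Trainer(π, opt_state):
    batch = []
    while True:
        # Push the latest weights to the actor, then train on B sequences.
        request_actor_weight_update(π)
        batch = get_B_sequences_from(Qtrain)
        π, opt_state = optimizer_step(π, opt_state, batch)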
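To make the control flow above concrete, here is a toy, single-process sketch in Python. It is not the paper's implementation. It assumes that integer version numbers stand in for the weights μ and π, that every sequence has a fixed length, and that the two loops are interleaved deterministically instead of running as separate processes. The values of `H`, `B`, and `SEQ_LEN`, and the names `ToyActor` and `simulate`, are hypothetical.

```python
from collections import deque

H = 4        # in-flight sequences kept on the actor (hypothetical value)
B = 2        # sequences consumed per optimizer step (hypothetical value)
SEQ_LEN = 5  # toy fixed sequence length in tokens (assumption)


class ToyActor:
    """Each sequence is a list holding the weight version that produced each token."""

    def __init__(self):
        self.weights = 0       # stands in for the behavior weights μ
        self.in_progress = []  # stands in for Sprog

    def step(self, q_train, pending_update):
        # Pop finished sequences and put them on the training queue.
        finished = [s for s in self.in_progress if len(s) == SEQ_LEN]
        self.in_progress = [s for s in self.in_progress if len(s) < SEQ_LEN]
        q_train.extend(finished)
        # Refill so that H sequences are always in flight.
        self.in_progress += [[] for _ in range(H - len(self.in_progress))]
        # In-flight weight update: partial sequences are kept, not restarted.
        if pending_update is not None:
            self.weights = pending_update
        # Generate one token per sequence with the current weights.
        for seq in self.in_progress:
            seq.append(self.weights)


def simulate(num_optimizer_steps):
    """Interleave actor and trainer steps; return the batches the trainer saw."""
    actor = ToyActor()
    q_train = deque()
    policy_version = 0
    pending_update = policy_version  # the trainer requests an update first
    batches = []
    while len(batches) < num_optimizer_steps:
        actor.step(q_train, pending_update)
        pending_update = None
        if len(q_train) >= B:
            batches.append([q_train.popleft() for _ in range(B)])
            policy_version += 1              # stands in for optimizer_step
            pending_update = policy_version  # request the next weight update
    return batches


batches = simulate(3)
```

In this toy run the third batch contains sequences whose tokens come from several weight versions, which is the effect of updating weights while sequences are still being generated.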
/v1/chat/completions (for generation), /init_process_group (for creating the weight-transfer process group), and /request_weight_update (for starting an in-flight weight update). Optimizations include online sequence packing and ring buffers.
【Ahmadian et al., 2024, Back to basics: Revisiting REINFORCE style optimization for learning from human feedback in LLMs, arXiv】
【Roux et al., 2025, Tapered off-policy REINFORCE: Stable and efficient reinforcement learning for LLMs, arXiv】
【Munos et al., 2016, Safe and efficient off-policy reinforcement learning, NeurIPS】
【Espeholt et al., 2018, IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures, ICML】
【Kong, 1992, A note on importance sampling using standardized weights, University of Chicago】
【Schlegel et al., 2019, Importance resampling for off-policy prediction, NeurIPS】
【Fakoor et al., 2020, P3O: Policy-on policy-off policy optimization, UAI】
【Kwon et al., 2023b, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP】
【Noukhovitch et al., 2024, Asynchronous RLHF: Faster and more efficient off-policy RL for language models, arXiv】
【Hu et al., 2025, Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model, arXiv】
【Zeng et al., 2025, SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild, arXiv】
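The notes above name online sequence packing and ring buffers as optimizations but do not describe them. The following is a generic Python sketch of both techniques, not the paper's code. `RingBuffer`, `pack_online`, the token budget of 8, and the way the two are combined are all assumptions made for illustration.

```python
class RingBuffer:
    """Fixed-capacity FIFO that overwrites the oldest item when full."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._head = 0  # index of the oldest stored item
        self._size = 0

    def push(self, item):
        cap = len(self._slots)
        tail = (self._head + self._size) % cap
        self._slots[tail] = item
        if self._size == cap:
            # Buffer was full: the oldest item was overwritten, so advance head.
            self._head = (self._head + 1) % cap
        else:
            self._size += 1

    def pop(self):
        if self._size == 0:
            raise IndexError("pop from empty RingBuffer")
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._size -= 1
        return item

    def __len__(self):
        return self._size


def pack_online(sequences, max_tokens):
    """Greedily pack a stream of token lists into batches of at most max_tokens."""
    current, used = [], 0
    for seq in sequences:
        if len(seq) > max_tokens:
            raise ValueError("sequence longer than the token budget")
        if used + len(seq) > max_tokens:
            # The next sequence does not fit: emit the current pack and start anew.
            yield current
            current, used = [], 0
        current.append(seq)
        used += len(seq)
    if current:
        yield current


# Usage: push four toy sequences into a buffer of capacity 3, then pack the rest.
buf = RingBuffer(capacity=3)
for length in [3, 5, 2, 4]:
    buf.push(list(range(length)))  # the length-3 sequence gets overwritten
drained = [buf.pop() for _ in range(len(buf))]
packs = list(pack_online(drained, max_tokens=8))
pack_lengths = [[len(seq) for seq in pack] for pack in packs]
```

Packing as sequences arrive avoids padding every sequence to the longest one, at the cost of a greedy rather than optimal fill. A bounded buffer keeps memory fixed; whether dropping the oldest data is acceptable depends on the training setup.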