Robust and Efficient CUDA C++ Concurrency with Stream-Ordered Allocation

Mark Harris, NVIDIA
作者信息未提供
日期信息未提供

目录 (Table of Contents)

议程 (Agenda)
通过流水线实现并发 (Concurrency Through Pipelining)
CUDA 流回顾 (CUDA Streams Review)
资源 (Resources)
流安全性：它安全吗？ (Stream Safety: Is it Safe?)
流安全性：使其安全 (Stream Safety: Make it Safe)
CUDA 缺失的部分 (The missing piece of CUDA)
cudaMalloc / cudaFree：非流序 (cudaMalloc / cudaFree: not stream-ordered)
使用 cudaMalloc/cudaFree 实现并发 (Concurrency with cudaMalloc/cudaFree)
流序内存分配 (Stream-Ordered Memory Allocation)
使用流序分配实现并发 (Concurrency with Stream-Ordered Allocation)
流序分配的语义 (Semantics of Stream-Ordered Allocation)
流序释放的语义 (Semantics of Stream-Ordered Deallocation)
流序内存分配中的安全性问题
- 扩展示例 1：流序内存分配
- 流序分配的流安全性
- 可能发生什么？未定义行为
- 确保安全
- 小结与挑战
RAII 与设备缓冲区
- RAII：资源获取即初始化 (Resource Acquisition Is Initialization)
- RAII：流序设备缓冲区
- 安全性考量：流序设备缓冲区示例
- 安全性分析：流序设备缓冲区示例
- 确保安全：RAII 析构函数在对象超出作用域时调用
- 确保安全：对依赖操作使用相同的流
Thrust Vector 与流序分配
- 回顾：Thrust Vectors
- 流序 Thrust device_vector：如何提供自定义流序分配器
- 安全性考量：使用流序 thrust::device_vector
- 安全性分析：使用流序 thrust::device_vector
- 确保安全性：使用相同流的分配器
- thrust::device_vector 的问题
- 安全性、异步性和速度问题：主机容器与设备容器混合
  - 代码示例
  - 问题
- rmm::device_uvector：RAPIDS 库中重要的容器
- rmm::device_uvector 基准测试设置
- device_vector 构造会阻止并发
- device_vector 构造阻止并发，即使使用流有序分配
- rmm::device_uvector 构造允许完全并发
Stream Safety and API Design
- 主机容器的安全性问题
- 确保主机容器的安全性
- 作为参数的主机容器的安全性
- 异步性与速度问题
- 异步实现与安全性考量
- 安全但同步的实现
- 让调用者决定：API 设计准则
- Libcudf 的方法：异步，仅处理设备数据
- API 设计最佳实践：RAPIDS libcudf 的方法
流安全性优先！总结

议程 (Agenda)

本演示文稿的议程包括：
* CUDA 流 (Streams) 和流安全性 (Stream Safety)
* 流序内存分配 (Stream-Ordered Memory Allocation)
* C++ 对象生命周期语义 (C++ Object Lifetime Semantics)
* 数据容器和并发 (Data Containers and Concurrency)
* 流安全 API 设计 (Stream-Safe API Design)

通过流水线实现并发 (Concurrency Through Pipelining)

串行 (Serial)：操作按顺序执行，例如 cudaMemcpyAsync (H2D) -> Kernel -> cudaMemcpyAsync (D2H)。
并发 (Concurrent)：通过重叠内核和内存复制操作来提高性能。如图所示，不同的流 (Stream 1, Stream 2, Stream 3, Stream 4) 可以并行执行 H2D、内核 (K) 和 D2H 操作，从而显著缩短总执行时间，实现性能提升。

CUDA 流回顾 (CUDA Streams Review)

流是 CUDA 中实现并发的主要机制。
流是设备工作队列：
- 内核启动和异步 CUDA API 调用会将工作放入队列并立即返回。
- 设备会从流中调度工作到可用资源。
流内的操作是有序的 (FIFO) 且不能重叠。
不同流中的操作是无序的且可以重叠。

资源 (Resources)

本演示文稿中所有示例均可在 GitHub 上获取：https://github.com/harrism/stream-safety-first
- 代码片段右侧的数字对应示例编号，例如 [example_1]。
其他资源：
- RAPIDS 内存管理器：https://github.com/rapidsai/rmm
- CUDA 编程指南中关于流序内存管理的章节：https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-ordered-memory-allocator
- 流序内存管理博客文章：https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/

流安全性：它安全吗？ (Stream Safety: Is it Safe?)

目标：确保行为明确并避免数据竞争。

上图展示了一个数据竞争的例子：

stream_a 中启动了 kernel_a 和 kernel_b。
stream_b 中启动了 cudaMemcpyAsync 从设备复制数据到主机 (D2H)。
问题：stream_b 尝试在 kernel_b 写入 foo 之前或同时读取 foo，这会导致数据竞争。

流安全性：使其安全 (Stream Safety: Make it Safe)

目标：确保行为明确并避免数据竞争。

解决数据竞争的方法是使用 cudaEventRecord 和 cudaStreamWaitEvent 进行显式同步：
* 在 stream_a 中 kernel_a 之后记录一个事件 (event_a)。
* 在 stream_b 中，在执行 cudaMemcpyAsync 之前，等待 event_a 完成。
* 这样可以确保 stream_b 中的 D2H 操作在 stream_a 中的 kernel_a 完成之后才开始，从而避免数据竞争。

CUDA 缺失的部分 (The missing piece of CUDA)

CUDA 的核心组件包括：
* 内核执行 (Kernel Execution)：通过流进行管理。
* 内存复制 (Memory Copies)：通过流进行管理。
* 内存分配 (Memory Allocation)：传统上它不直接通过流进行管理，是并发工作流中缺失的一环。

`cudaMalloc` / `cudaFree`：非流序 (cudaMalloc / cudaFree: not stream-ordered)

cudaFree 会隐式地同步整个 CUDA 上下文。
频繁的分配和释放内存会严重影响性能。
图中，cudaMalloc 和 cudaFree 被描绘为在流之外，并且有一个“STOP”标志，表示它们会阻碍并发。
*然而，你不应该依赖于这种行为。

使用 `cudaMalloc`/`cudaFree` 实现并发 (Concurrency with cudaMalloc/cudaFree)

串行 (Serial)：所有操作包括内存分配和释放都顺序执行。
半并发 (Semi-concurrent)：cudaFree 会暂停整个系统。图中 F1（流 1 的 cudaFree）导致后续流的操作（例如 DH2、F2）被阻塞，直到 F1 完成，从而降低了整体并发性能，损失了大部分潜在的性能提升。

流序内存分配 (Stream-Ordered Memory Allocation)

解决方案：cudaMallocAsync 和 cudaFreeAsync。
这些功能在 CUDA 11.2 中引入。
还增加了内存池分配器 (pool allocators)，以实现更高的性能。
图中，内存分配现在也被纳入了流中，与内核执行和内存复制无缝集成，实现了全面的流序操作。

使用流序分配实现并发 (Concurrency with Stream-Ordered Allocation)

串行 (Serial)：作为基准，所有操作顺序执行。
并发 (Concurrent)：所有操作都在流上。图中展示了 cudaMallocAsync (MA) 和 cudaFreeAsync (FA) 操作与数据传输 (DH) 和内核 (K) 操作在多个流之间完全并行执行，显著提高了整体并发性和性能。

流序分配的语义 (Semantics of Stream-Ordered Allocation)

注意流的语义。
* 将分配/释放操作视为内核 (kernels)。
* 在流序的分配/释放之间，可以自由使用内存。
* cudaMallocAsync(&ptr, size, stream);
* ptr 在该流上立即有效，否则需要同步。

流序释放的语义 (Semantics of Stream-Ordered Deallocation)

注意重用语义。

cudaFreeAsync(ptr, stream);
释放的内存可以在同一流上的其他分配中重用。
驱动程序可以在其他情况下重用内存。
参阅前面链接的博客文章了解更多细节。

流序内存分配中的安全性问题

扩展示例 1：流序内存分配

为了探讨流序内存分配的安全性，我们扩展了示例 1。
以下代码展示了一个使用 cudaMallocAsync 和 cudaFreeAsync 进行流序内存分配的 CUDA 程序片段：

__global__ void kernel(int *in, int *out) { ... }                 [example_2]

cudaMallocAsync(&foo, bytes, stream_a);
cudaMallocAsync(&bar, bytes, stream_a);

kernel_foo<<<..., stream_a>>>(input, foo);
cudaEventRecord(event_a, stream_a);
kernel_bar<<<..., stream_a>>>(input, bar);
cudaStreamWaitEvent(stream_b, event_a);
cudaMemcpyAsync(h_output, foo, bytes, D2H, stream_b);

cudaFreeAsync(bar, stream_a);
cudaFreeAsync(foo, stream_a);

流序分配的流安全性

本节旨在避免使用已释放内存 (use-after-free) 和在分配前使用内存 (use-before-alloc) 的竞态条件。
考虑 example_2 的以下部分：

...
cudaStreamWaitEvent(stream_b, event_a);
cudaMemcpyAsync(h_output, foo, bytes, D2H, stream_b);

cudaFreeAsync(bar, stream_a);
cudaFreeAsync(foo, stream_a);

下图展示了 stream_a 和 stream_b 上的操作时间线，揭示了潜在的“使用已释放内存”问题。

问题描述: stream_b 在 stream_a 释放 foo 之后尝试对 foo 进行数据传输 (D2H)，导致 foo 在 stream_b 上被释放后使用。

可能发生什么？未定义行为

当发生上述竞态条件时，可能导致未定义行为：
* 驱动程序可能尝试重用先前通过 cudaFreeAsync() 释放的内存。
* 在某个流中释放的内存可以立即被同一流中的后续分配请求重用。

考虑以下代码片段：

cudaMemcpyAsync(h_output, foo, bytes, D2H, stream_b);
cudaFreeAsync(foo, stream_a);
...
cudaMallocAsync(baz, size, stream_a);
cudaMemsetAsync(baz, 0, size, stream_a);

在这种情况下，foo 和 baz 可能指向同一块内存。这意味着 memcpyAsync 在读取 foo 的同时，memsetAsync 可能会写入 foo，从而导致数据损坏或程序崩溃。

确保安全

为避免“使用已释放内存”和“在分配前使用内存”的竞态条件，需要修正内存释放操作。
将 example_2 中的 cudaFreeAsync(foo, stream_a); 修改为 cudaFreeAsync(foo, stream_b);：

...
cudaStreamWaitEvent(stream_b, event_a);
cudaMemcpyAsync(h_output, foo, bytes, D2H, stream_b);

cudaFreeAsync(bar, stream_a);
cudaFreeAsync(foo, stream_b); // 修正：在 stream_b 上释放 foo

关键原则是：在内存最后使用的流上释放它，或者同步该流。下图展示了修正后的时间线，消除了竞态条件。

小结与挑战

相同的流安全性原则适用于：
- 流序分配 (Stream-ordered allocation)
- 核函数执行 (Kernel execution)
- 数据拷贝 (Data copies)
但是，当引入 C++ 对象生命周期语义时，情况会变得复杂。

RAII 与设备缓冲区

RAII：资源获取即初始化 (Resource Acquisition Is Initialization)

RAII 是一种重要的 C++ 技术，其核心思想是：
* 将资源的生命周期（如内存、线程、文件句柄等）绑定到对象的生命周期。
* 在类的构造函数 (ctor) 中获取资源。
* 在类的析构函数 (dtor) 中释放资源。
* 请记住：当对象超出作用域时，析构函数 (dtor) 会被调用。

RAII：流序设备缓冲区

以下是一个 device_buffer 类的实现，它利用 RAII 来管理流序内存：

class device_buffer {
public:
  device_buffer(std::size_t size, cudaStream_t stream) : _size(size), _stream(stream) {
    cudaMallocAsync(&_data, _size, _stream); // 构造函数拷贝流并按流序分配
  }

  ~device_buffer() { cudaFreeAsync(_data, _stream); } // 析构函数使用保存的流进行流序释放

  void* data() { return _data; }

private:
  void* _data{};
  std::size_t _size;
  cudaStream_t _stream;
};

关键思想：析构函数使用保存的流进行流序释放。

安全性考量：流序设备缓冲区示例

考虑一个使用 device_buffer 的示例：

{
  device_buffer input(100, stream_a);                               [example_3]
  cudaMemcpyAsync(input.data(), h_input.data(),
                  bytes, H2D, stream_a);
  cudaStreamSynchronize(stream_a);
  kernel<<<..., stream_b>>>(input.data(), ...); // 启动核函数在 stream_b
}

安全性分析：流序设备缓冲区示例

继续对 example_3 进行安全性分析。

问题描述：input 对象在 stream_a 上进行构造和数据拷贝，但在 stream_a 同步后，kernel 在 stream_b 上启动。当 input 对象超出作用域时，其析构函数会尝试在 stream_a 上释放内存，但 stream_b 上的 kernel 可能尚未完成对 input.data() 的使用，导致“使用已释放内存”错误。

确保安全：RAII 析构函数在对象超出作用域时调用

为了解决 example_3 中的“使用已释放内存”问题，我们需要确保在析构函数释放内存之前，所有依赖该内存的操作都已完成。一种方法是在对象超出作用域之前同步相关的流。

{
  device_buffer input(100, stream_a);                               [example_3]
  cudaMemcpyAsync(input.data(), h_input.data(),
                  bytes, H2D, stream_a);
  cudaStreamSynchronize(stream_a);
  kernel<<<..., stream_b>>>(input.data(), ...);
  cudaStreamSynchronize(stream_b); // 修正：在对象超出作用域前同步 stream_b
}

下图展示了修正后的时间线，显示了通过同步 stream_b 来确保安全。

确保安全：对依赖操作使用相同的流

另一种更简洁且通常更有效的方法是确保所有依赖操作都在同一个流上执行，从而自然地维护操作顺序。

{
  device_buffer input(100, stream_a);                               [example_3]
  cudaMemcpyAsync(input.data(), h_input.data(),
                  bytes, H2D, stream_a);
  kernel<<<..., stream_a>>>(input.data(), ...); // 修正：在 stream_a 上启动核函数
}

下图展示了这种方法的时间线，所有操作都在 stream_a 上按序执行，stream_b 未使用。

Thrust Vector 与流序分配

回顾：Thrust Vectors

Thrust 是一个 C++ 模板库，提供高性能并行算法和数据结构。
thrust::host_vector 和 device_vector 是其核心数据结构。

以下示例展示了 thrust::host_vector 和 thrust::device_vector 的基本用法：

// Generate 32M random numbers serially.
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), rand);

// Transfer data to the device.
thrust::device_vector<int> d_vec = h_vec;

// Sort data on the device.
thrust::sort(d_vec.begin(), d_vec.end());

// Transfer data back to host.
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

更多信息请访问：https://nvidia.github.io/thrust/

流序 Thrust `device_vector`：如何提供自定义流序分配器

为了将 Thrust 与流序内存分配结合，可以提供一个自定义的流序分配器。以下代码展示了 async_thrust_allocator 的实现：

template <typename T>
class async_thrust_allocator : public thrust::device_malloc_allocator<T> {
public:
  async_thrust_allocator(cudaStream_t stream = cudaStream_t(0)) : _stream(stream) {}
  ~async_thrust_allocator() { cudaStreamSynchronize(_stream); } // 析构函数同步流

  pointer allocate(size_type num) {
    T* ptr;
    cudaMallocAsync(&ptr, num * sizeof(T), _stream); // 默认使用 "null stream" 进行分配
    return thrust::device_pointer_cast<T>(ptr);
  }

  void deallocate(pointer ptr, size_type num) { cudaFreeAsync(ptr, _stream); } // 实现 allocate() 和 deallocate() 使用流序分配

private:
  cudaStream_t _stream;
};

template <typename T>
using stream_device_vector = thrust::device_vector<T, async_thrust_allocator<T>>;

关键点：
* 默认使用“null stream”进行分配。
* 实现了 allocate() 和 deallocate() 方法，使用流序分配。

安全性考量：使用流序 `thrust::device_vector`

考虑一个使用 stream_device_vector 的示例：

{
  stream_device_vector<int> v(100);                                 [example_4]
  kernel<<<..., stream_b>>>(v.data().get(), ...); // 在 stream_b 上启动核函数
}

安全性分析：使用流序 `thrust::device_vector`

在使用 Stream-Ordered thrust::device_vector 时，可能会出现 use-after-free 的问题。

Stream-Ordered thrust::device_vector 的安全性问题

问题描述：stream_device_vector 默认在默认流上进行分配，其析构函数也在默认流上释放内存。如果核函数 kernel 在 stream_b 上启动，并且 v 对象在其析构函数执行时 kernel 尚未完成，则可能导致“使用已释放内存”错误。这是因为析构函数会同步其关联的流（默认流），而不是 stream_b。

实际执行过程可能如下：
- Default stream (默认流): constructor: cudaMallocAsync -> uninitialized_fill -> sync -> destructor: cudaMallocAsync
- stream_b (流 b): kernel1(v.data().get(), ...)

由于 cudaMallocAsync 和 cudaFreeAsync (析构函数隐式调用) 发生在默认流上，而 kernel1 在 stream_b 上执行。如果 kernel1 在 default stream 的析构函数完成之前仍在使用 v.data().get() 指向的内存，就会导致 use-after-free。

确保安全性：使用相同流的分配器

为了解决上述安全性问题，需要实例化一个使用相同流的分配器。
修正后的代码示例：

{
    async_thrust_allocator<T> alloc(stream_b);
    stream_device_vector<int> v(100, alloc);
    some_kernel<<<..., stream_b>>>(v.data().get(), ...);
}

此时，所有内存操作都在 stream_b 上按流顺序执行：
- stream_b (流 b): constructor: cudaMallocAsync -> kernel1(v.data().get(), ...) -> destructor: cudaMallocAsync

这样可以确保 v 的内存不会在 kernel1 完成之前被释放，从而避免了 use-after-free 问题。

`thrust::device_vector` 的问题

thrust::device_vector 存在一些问题，它会“停止世界”：
- device_vector 是同步的，而非流有序的。
- device_vector 的构造函数会默认初始化其内容。
- 设备内存的初始化会调用 uninitialized_fill。
- uninitialized_fill 必须在默认流上运行。
- 构造函数随后必须同步默认流。

尽管有这些缺点，它们在某些情况下也可能是优点。使用 device_vector 的场景包括：

当您需要支持非平凡可复制（non-trivially copyable）类型的内容时。
当您需要默认初始化内容时。

安全性、异步性和速度问题：主机容器与设备容器混合

代码示例

{
    thrust::host_vector<int> h_data(...);
    stream_device_vector<int> d_data = h_data;
}

或

{
    std::vector<int> h_data(...);
    stream_device_vector<int> d_data = h_data;
}

问题

避免混合使用流有序和非流有序的容器。
很难推断其安全性。
很难实现快速/异步。

`rmm::device_uvector`：RAPIDS 库中重要的容器

rmm::device_uvector 是 RAPIDS 库中一个重要的容器，具有以下特点：
- 它是一个流有序的、未初始化（uninitialized）的向量容器。
- 所有分配或复制数据的操作都接受一个显式流参数。
- 更容易推断流安全性及其性能。
- 仅支持平凡可复制（trivially copyable）类型（例如 int、float、简单结构体）。

代码示例：

auto input = rmm::device_uvector<int32_t>(n, stream_a);
cudaMemsetAsync(input.data(), 0, n * sizeof(int32_t), stream_a);

// output target need not be default-initialized
auto output = rmm::device_uvector<int32_t>(num_elements, stream_a);
kernel<<<..., stream_a>>>(input.data(), output.data(), num_elements);

更多信息请访问: https://github.com/rapidsai/rmm

`rmm::device_uvector` 基准测试设置

接下来的幻灯片将使用 RMM 的 device_uvector 基准测试。
基准测试流程：
- 在设备上创建输入数据向量。
- 在 4 个并发流上运行内核。
- 内核读取输入并计算输出向量。

将比较三个版本：
- device_vector
- stream_device_vector
- rmm::device_uvector

使用 nsight-systems 检查并发性。

`device_vector` 构造会阻止并发

即使在流中使用内核，device_vector 的构造也会阻止并发。
代码示例：

auto in = thrust::device_vector<int32_t>(n, 0);

for (cudaStream_t stream : streams) {
    auto vec = thrust::device_vector<int32_t>(n); // 每次迭代都在内部构造
    kernel<<<n_blocks, block_sz, 0, stream>>>(
        input.data().get(), vec.data().get(), n);
}

图中显示：
- 无并发：多个内核任务没有并行执行。
- 存在分配瓶颈。
- 总耗时：1930 us。

`device_vector` 构造阻止并发，即使使用流有序分配

即使使用流有序分配，stream_device_vector 的构造也会因隐式同步而阻止并发。
代码示例：

auto in = stream_device_vector<int32_t>(n, 0, async_allocator(input_stream));
cudaStreamSynchronize(input_stream);

for (cudaStream_t stream : streams) {
    auto out = stream_device_vector<int32_t>(n, async_allocator(stream));
    kernel<<<n_blocks, block_sz, 0, stream>>>(
        in.data().get(), out.data().get(), n);
}

图中显示：
- 无并发：多个内核任务没有并行执行。
- 存在同步瓶颈。
- 总耗时：639 us。虽然比 device_vector 有改进，但仍未实现并发。

`rmm::device_uvector` 构造允许完全并发

rmm::device_uvector 的构造允许完全并发。
代码示例：

rmm::device_uvector<int32_t> input(num_elements, input_stream);
cudaMemsetAsync(input.data(), 0, num_elements * sizeof(int32_t), input_stream);
cudaStreamSynchronize(input_stream);

for (rmm::cuda_stream_view stream : streams) {
    auto output = rmm::device_uvector<int32_t>(num_elements, stream);
    kernel<<<num_blocks, 0, stream.value()>>>(
        input.data(), output.data(), num_elements);
}

图中显示：
- 完全并发：多个内核任务并行执行。
- 实现异步分配和内核。
- 总耗时：208 us。

Stream Safety and API Design

主机容器的安全性问题

当涉及到主机容器时，可能会出现安全性问题。
代码示例：

void foo(cudaStream_t stream) {
    std::vector<int> v = some_host_function(...);
    rmm::device_uvector<int> d_v(v.size(), stream);
    cudaMemcpyAsync(d_v.data(), v.data(), v.size() * sizeof(int), cudaMemcpyDefault, stream);
    kernel<<<..., stream>>>(d_v.data());
}

不安全！主机向量 v 可能在数据复制到设备之前就被销毁了。因为 v 是一个局部变量，其生命周期可能在 cudaMemcpyAsync 或 kernel 完成之前结束。

确保主机容器的安全性

为了确保主机容器的安全性，需要等待流上的所有工作完成。
修正后的代码示例：

void foo(cudaStream_t stream) {
    std::vector<int> v = some_host_function(...);
    rmm::device_uvector<int> d_v(v.size(), stream);
    cudaMemcpyAsync(d_v.data(), v.data(), v.size() * sizeof(int), cudaMemcpyDefault, stream);
    kernel<<<..., stream>>>(d_v.data());
    cudaStreamSynchronize(stream); // 添加这一行确保安全
}

向量 v 在流上的所有工作完成之前不会被销毁。

作为参数的主机容器的安全性

当主机容器作为参数传入时，安全性问题更为复杂。
代码示例：

void bar(std::vector<int> const& v, cudaStream_t stream) {
    device_uvector<int> d_v(v.size(), stream);
    cudaMemcpyAsync(d_v.data(), v.data(), v.size() * sizeof(int), cudaMemcpyDefault, stream);
    kernel<<<..., stream>>>(d_v.data());
    cudaMemcpyAsync(v.data(), d_v.data(), v.size() * sizeof(int), cudaMemcpyDefault, stream);
}

潜在不安全！
- 调用者在读取、写入或销毁向量 v 之前必须同步流。
- API 语义必须清晰并有明确的文档。

异步性与速度问题

某些操作可能看起来简单，但实际上是不安全、同步且缓慢的。
代码示例：

// Create widget from host vector
std::unique_ptr<widget> make_widget(std::vector<int> const& input, cudaStream_t stream) {
    stream_device_vector<int> d_data(input); // 隐式同步，从主机复制到设备
    return std::make_unique<widget>(d_data, stream);
}

结果是不安全、同步且缓慢的！ 因为 stream_device_vector<int> d_data(input); 的构造函数可能会隐式地执行同步的主机到设备数据复制操作。

异步实现与安全性考量

幻灯片提出了一个问题：虽然可以将操作实现为异步且快速，但这是否安全？

下面是一个使用 cudaMemcpyAsync 创建 widget 的示例代码：

// Create widget from host vector
std::unique_ptr<widget> make_widget(std::vector<int> const& input,
                                     rmm::cuda_stream_view stream)
{
  device_uvector<int> d_data(input.size(), stream);
  cudaMemcpyAsync(d_data.data(), input.data(),
                  input.size() * sizeof(int), cudaMemcpyDefault, stream);
  return std::make_unique<widget>(d_data, stream);
}

潜在的不安全性：
- 调用者在使用返回的 widget 之前必须同步流。
- 调用者在销毁向量输入之前必须同步流。

安全但同步的实现

幻灯片探讨了实现安全但同步操作的可能性，并提出是否应该由API来决定同步。

下面是一个在异步拷贝后显式进行流同步的示例代码：

// Create widget from host vector
std::unique_ptr<widget> make_widget(std::vector<int> const& input,
                                     cudaStream_t stream)
{
  device_uvector<int> d_data(input.size(), stream);
  cudaMemcpyAsync(d_data.data(), input.data(),
                  input.size() * sizeof(int), cudaMemcpyDefault, stream);
  auto w = std::make_unique<widget>(d_data, stream);
  cudaStreamSynchronize(stream);
  return w;
}

这样做是安全的，但是否应该由API来决定同步行为？

让调用者决定：API 设计准则

幻灯片提出了一项API设计准则：将同步决策权交给调用者。

同步操作可能显著影响性能。
API的调用者可能拥有更多信息，使其能够优化并安全地使用API。
RAPIDS libcudf 遵循以下经验法则：
- libcudf 函数应该是流有序的，并且不进行同步（除非算法上必要）。
YMMV? (您的具体情况可能不同？)

Libcudf 的方法：异步，仅处理设备数据

Libcudf 采用异步且仅处理设备数据的方法。

下面是一个从设备内存范围创建 widget 的示例代码：

// Create widget from device memory span
std::unique_ptr<widget> make_widget_async(
    device_span<int> const input, cudaStream_t stream)
{
  return std::make_unique<widget>(input, stream);
}

函数明确命名为 “async”（意味着它不同步传入的流）。
device_span 是 std::span 的设备内存版本：
- 非拥有（non-owning）、轻量级的内存/对象范围的“视图”。
- 使API在不同设备容器之间保持灵活性，例如 device_vector、device_uvector 等。

API 设计最佳实践：RAPIDS libcudf 的方法

幻灯片总结了RAPIDS libcudf 的API设计最佳实践，明确了应该采纳和避免的做法。

API Design Best Practices RAPIDS libcudf's approach

流安全性优先！总结

幻灯片总结了流安全性的重要方面。

并发具有性能优势：
- 使用CUDA流来重叠（解）分配、计算和数据拷贝。
- 使用 cudaMallocAsync 实现更快的分配和更好的GPU利用率。
- 采用流有序的设备数据容器。
理解流安全性和排序语义：
- 消除未定义行为。
在您的API中采纳流安全最佳实践。