RDMA Glance Notes

What is RDMA?

RDMA (Remote Direct Memory Access):
a technology that allows one computer or device to access another's memory directly, without involving the remote CPU

RDMA allows data to be directly transferred from the memory of one computer to another computer’s memory, bypassing traditional networking protocols and reducing CPU involvement. This approach reduces the processing overhead, minimizes latency, and improves network efficiency.

UCX

https://github.com/openucx/ucx

UCX exposes a set of abstract communication primitives that utilize the best of available hardware resources and offloads. These include RDMA (InfiniBand and RoCE), TCP, GPUs, shared memory, and network atomic operations.
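To make this concrete, here is a minimal sketch of bringing up a context and worker with UCX's high-level (UCP) C API; error handling is abbreviated, and the endpoint creation and actual transfers, which need an out-of-band address exchange, are elided:

```c
/* Minimal sketch: initialize a UCP context and worker.
 * Build (assuming UCX is installed): cc ucx_init.c -lucp -lucs */
#include <stdio.h>
#include <ucp/api/ucp.h>

int main(void) {
    /* Read the default UCX configuration (honors UCX_* env vars). */
    ucp_config_t *config;
    if (ucp_config_read(NULL, NULL, &config) != UCS_OK)
        return 1;

    /* Ask for tag-matching send/recv and one-sided RMA (RDMA) support. */
    ucp_params_t params = {
        .field_mask = UCP_PARAM_FIELD_FEATURES,
        .features   = UCP_FEATURE_TAG | UCP_FEATURE_RMA,
    };
    ucp_context_h context;
    ucs_status_t status = ucp_init(&params, config, &context);
    ucp_config_release(config);
    if (status != UCS_OK) {
        fprintf(stderr, "ucp_init failed: %s\n", ucs_status_string(status));
        return 1;
    }

    /* A worker is the progress engine; endpoints to peers hang off it. */
    ucp_worker_params_t wparams = {
        .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
        .thread_mode = UCS_THREAD_MODE_SINGLE,
    };
    ucp_worker_h worker;
    if (ucp_worker_create(context, &wparams, &worker) != UCS_OK) {
        ucp_cleanup(context);
        return 1;
    }

    /* ... exchange worker addresses out of band, create endpoints,
     *     and issue ucp_tag_send_nbx / ucp_put_nbx operations ... */

    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return 0;
}
```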

UCC

https://github.com/openucx/ucc

Unified Collective Communication Library

Design Goals

  • Highly scalable and performant collectives for HPC, AI/ML and I/O workloads
  • Nonblocking collective operations that cover a variety of programming models
  • Flexible resource allocation model
  • Support for relaxed ordering model
  • Flexible synchronous model
  • Repetitive collective operations (init once and invoke multiple times)
  • Hardware collectives are a first-class citizen

Why UCX/UCC?

  1. It saves the time you would otherwise spend writing and debugging an RDMA communication layer yourself.
  2. Little or no application code needs to change, since it can plug into NCCL, MPICH, OpenMPI, or PyTorch.

How to use UCC + UCX?

Examples:
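A minimal sketch of the core UCC call sequence, assuming the library, context, and team have already been created (team creation needs an out-of-band bootstrap, typically MPI); names follow the public ucc.h header, and the persistent flag is what allows one request to be posted repeatedly:

```c
/* Sketch: a persistent UCC allreduce, initialized once and posted many
 * times (the "init once and invoke multiple times" design goal above).
 * Assumes `team` and `ctx` were created earlier via ucc_init,
 * ucc_context_create, and ucc_team_create_post/_test. */
#include <ucc/api/ucc.h>

ucc_status_t allreduce_many_times(ucc_team_h team, ucc_context_h ctx,
                                  float *src, float *dst, size_t count,
                                  int iterations)
{
    ucc_coll_args_t args = {0};
    args.mask              = UCC_COLL_ARGS_FIELD_FLAGS;
    args.flags             = UCC_COLL_ARGS_FLAG_PERSISTENT; /* reusable request */
    args.coll_type         = UCC_COLL_TYPE_ALLREDUCE;
    args.src.info.buffer   = src;
    args.src.info.count    = count;
    args.src.info.datatype = UCC_DT_FLOAT32;
    args.src.info.mem_type = UCC_MEMORY_TYPE_HOST;
    args.dst.info.buffer   = dst;
    args.dst.info.count    = count;
    args.dst.info.datatype = UCC_DT_FLOAT32;
    args.dst.info.mem_type = UCC_MEMORY_TYPE_HOST;
    args.op                = UCC_OP_SUM;

    ucc_coll_req_h req;
    ucc_status_t st = ucc_collective_init(&args, &req, team);
    if (st != UCC_OK)
        return st;

    for (int i = 0; i < iterations; i++) {
        ucc_collective_post(req);
        /* Drive progress until this invocation completes. */
        while (ucc_collective_test(req) == UCC_INPROGRESS)
            ucc_context_progress(ctx);
    }
    return ucc_collective_finalize(req);
}
```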

Personal view on UCC + UCX?

Pros & cons: it appears easy to use, but it lacks documentation.

(Another potential con: these libraries seem to compile only on Intel processors.)

Maybe a better way is to use NCCL directly with plugins (e.g., the nccl-rdma-sharp-plugins that Mellanox/NVIDIA publish on GitHub).

RDMA 2023 Hackathon

Task

Implement reduce-scatter on SHARP-based UCC

Implementation

Reduce-scatter = All-reduce + Data-discard
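In plain MPI terms, the recipe looks like the sketch below (a sketch of the idea only; the hackathon version implements it inside UCC's SHARP transport layer instead):

```c
/* Sketch of "Reduce-scatter = All-reduce + Data-discard", written with
 * plain MPI for clarity. Assumes count is divisible by the number of
 * ranks. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void reduce_scatter_via_allreduce(const float *src, float *dst,
                                  size_t count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Step 1: all-reduce the full buffer, so every rank holds the
     * complete reduced result (with SHARP this runs in the network). */
    float *full = malloc(count * sizeof(float));
    MPI_Allreduce(src, full, (int)count, MPI_FLOAT, MPI_SUM, comm);

    /* Step 2: data discard -- keep only this rank's block. */
    size_t block = count / (size_t)size;
    memcpy(dst, full + (size_t)rank * block, block * sizeof(float));
    free(full);
}
```

This trades extra traffic (every rank receives the whole reduced buffer) for being able to reuse an existing in-network allreduce.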

What is SHARP?

In-Network Computing with NVIDIA SHARP
SHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol

It offloads simple tasks (such as reductions and aggregations) to a switch or other network-level computing resources.

Comments

Through the hackathon, other shortcomings were exposed to us.

Misleading Performance

First, RDMA + SHARP does not seem as good as NVIDIA announced on their website. After running my own tests, the performance results are listed here:

| Size (bytes) | Latency (µs) |
| --- | --- |
| 131072 | 124.188 |
| 262144 | 121.844 |
| 524288 | 120.188 |
| 1048576 | 120.25 |
| 2097152 | 122.375 |
| 4194304 | 119.188 |
| 8388608 | 118.562 |

For comparison, here is MPI over TCP only:

But RDMA can provide benefits as the data scales to larger sizes.

| Size (bytes) | Latency (µs) |
| --- | --- |
| 134217728 | 118.562 |
| 134217728000 | 132.688 |
| 1342177280000 | 258.375 |
| 2684354560000 | 393.031 |
| 5368709120000 | 698.75 |

Limited API

Collective APIs currently provided by UCC's SHARP transport layer:

  • ucc_tl_sharp_allreduce_init
  • ucc_tl_sharp_barrier_init
  • ucc_tl_sharp_bcast_init

But developing other modules is not as hard as I expected. The correct development path is (a sketch follows the note below):

  1. Develop the SHARP-based API
  2. Develop the UCC-based API
  3. Plug them into torch/openmpi/NCCL…

[!note]

  • sharp_coll.c: the module through which UCC calls SHARP
  • ucc_tl_sharp_reduce_scatter_start: the API entry point where UCC calls into SHARP
  • sharp_coll_do_xxx(v)_nb: APIs provided by NVIDIA SHARP; the v suffix denotes the vector variant
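To make the note concrete, here is a heavily simplified skeleton of such an entry point. Apart from the two names above (ucc_tl_sharp_reduce_scatter_start, and sharp_coll_do_allreduce_nb as one instance of sharp_coll_do_xxx(v)_nb), every type, field, and helper below is an illustrative placeholder, not the real UCC/SHARP definition:

```c
#include <stddef.h>

/* --- Illustrative placeholders; the real definitions live in UCC's and
 * SHARP's headers, only the overall shape matters here. --- */
typedef int ucc_status_t;
#define UCC_INPROGRESS 1
typedef struct {
    void  *sharp_comm;   /* SHARP communicator handle                 */
    void  *reduce_spec;  /* SHARP reduce descriptor (buffers, dtype, op) */
    void  *sharp_req;    /* SHARP non-blocking request handle         */
    float *scratch;      /* full-size buffer for the allreduce result */
    float *dst;          /* destination for this rank's block         */
} tl_sharp_task_t;

/* Provided by NVIDIA SHARP (signature simplified here). */
int sharp_coll_do_allreduce_nb(void *comm, void *spec, void **req);

ucc_status_t ucc_tl_sharp_reduce_scatter_start(tl_sharp_task_t *task)
{
    /* Step 1 (SHARP-based API): non-blocking in-network allreduce of
     * the full source buffer into the scratch buffer. */
    sharp_coll_do_allreduce_nb(task->sharp_comm, task->reduce_spec,
                               &task->sharp_req);

    /* Step 2 runs in the completion path: copy only this rank's block
     * from task->scratch into task->dst and discard the rest. */
    return UCC_INPROGRESS;
}
```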

Further work

After the hackathon, there are two things I want to explore in the future:

  1. Can I move sharp_coll_do_xxx(v) to the GPU? When the collective task becomes more challenging, we cannot demand too much of the compute resources on the switch.
  2. The NCCL plugin is untested, so I am not sure about its performance.
