RDMA Glance Notes
What is RDMA?
RDMA:
a technology that allows direct memory access between computers or devices without involving the CPU
RDMA allows data to be transferred directly from the memory of one computer to another computer’s memory, bypassing the operating system’s network stack on both ends and minimizing CPU involvement. This reduces processing overhead, minimizes latency, and improves network efficiency.
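To make the "direct memory access" part concrete, here is a minimal libibverbs sketch (my own illustration, not taken from any project) that opens an RDMA device and registers a buffer so the NIC can read and write it directly. Queue-pair creation and the out-of-band exchange of the buffer address/rkey with the peer are omitted.

```c
/* Minimal libibverbs sketch: open an RDMA device and register memory that the
 * NIC can access directly. QP creation and connection setup are omitted.
 * Build (assumption): gcc rdma_sketch.c -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register a 1 MiB buffer; the returned rkey is what a remote peer would
     * use to RDMA-read/write this memory without involving our CPU. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);

    printf("registered %zu bytes: lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```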
UCX
https://github.com/openucx/ucx
UCX exposes a set of abstract communication primitives that utilize the best of available hardware resources and offloads. These include RDMA (InfiniBand and RoCE), TCP, GPUs, shared memory, and network atomic operations.
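For a quick taste of the UCP layer, here is a minimal sketch (my own example, error handling trimmed) that creates a UCX context and a worker; endpoint creation needs an out-of-band exchange of worker addresses and is left out.

```c
/* Minimal UCX (UCP) initialization sketch: context + worker only.
 * Endpoints require exchanging worker addresses out-of-band, omitted here.
 * Build (assumption): gcc ucx_sketch.c -lucp -lucs */
#include <ucp/api/ucp.h>
#include <stdio.h>

int main(void) {
    ucp_config_t *config;
    ucp_config_read(NULL, NULL, &config);

    /* Declare the features we need; UCX picks the best available transport
     * (RDMA, shared memory, TCP, ...) underneath. */
    ucp_params_t params = {
        .field_mask = UCP_PARAM_FIELD_FEATURES,
        .features   = UCP_FEATURE_TAG | UCP_FEATURE_RMA
    };
    ucp_context_h context;
    ucp_init(&params, config, &context);
    ucp_config_release(config);

    ucp_worker_params_t wparams = {
        .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
        .thread_mode = UCS_THREAD_MODE_SINGLE
    };
    ucp_worker_h worker;
    ucp_worker_create(context, &wparams, &worker);

    printf("UCX context and worker created\n");

    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return 0;
}
```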
UCC
https://github.com/openucx/ucc
Unified Collective Communication Library
Design Goals
- Highly scalable and performant collectives for HPC, AI/ML and I/O workloads
- Nonblocking collective operations that cover a variety of programming models
- Flexible resource allocation model
- Support for relaxed ordering model
- Flexible synchronous model
- Repetitive collective operations (init once and invoke multiple times)
- Hardware collectives are a first-class citizen
Why UCX/UCC?
- Saves time: the low-level RDMA communication layer is already written and debugged for you.
- No need to modify application code, since UCC/UCX can plug into NCCL, MPICH, Open MPI, or PyTorch.
How to use UCC + UCX?
Examples:
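I did not keep the full hackathon code, so the block below is only a sketch of what posting a UCC collective looks like. It assumes the UCC library, context, and team (`ctx`, `team`) have already been created; team creation needs an out-of-band allgather, usually borrowed from MPI, which is omitted here.

```c
/* Sketch: posting a nonblocking UCC allreduce on an existing team.
 * Library/context/team setup (with an MPI-based out-of-band allgather)
 * is assumed to have happened elsewhere. */
#include <ucc/api/ucc.h>

static ucc_status_t do_allreduce(ucc_team_h team, ucc_context_h ctx,
                                 float *sbuf, float *rbuf, size_t count)
{
    ucc_coll_args_t args = {
        .mask      = 0,
        .coll_type = UCC_COLL_TYPE_ALLREDUCE,
        .src.info  = { .buffer = sbuf, .count = count,
                       .datatype = UCC_DT_FLOAT32,
                       .mem_type = UCC_MEMORY_TYPE_HOST },
        .dst.info  = { .buffer = rbuf, .count = count,
                       .datatype = UCC_DT_FLOAT32,
                       .mem_type = UCC_MEMORY_TYPE_HOST },
        .op        = UCC_OP_SUM,
    };
    ucc_coll_req_h req;
    ucc_status_t   st;

    /* "Init once, invoke multiple times": the request from ucc_collective_init
     * can be posted repeatedly. */
    st = ucc_collective_init(&args, &req, team);
    if (st != UCC_OK) return st;

    ucc_collective_post(req);
    while (ucc_collective_test(req) == UCC_INPROGRESS) {
        ucc_context_progress(ctx);   /* drive communication progress */
    }
    return ucc_collective_finalize(req);
}
```

In practice you rarely call this directly: Open MPI and PyTorch (via torch-ucc) can route their collectives through UCC, which is the "no code changes" benefit mentioned above.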
Personal view on UCC + UCX?
Pros & Cons: It appears easy to use, but lacks documentation.
(Another potential con: these libraries may only compile on Intel processors.)
Maybe a better way is to use NCCL directly with plugins:
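To keep things concrete, here is a minimal single-node NCCL sketch (my own example, error checking trimmed): the application code stays the same whether the transport is TCP, RDMA, or RDMA + SHARP, since that part is selected by the NCCL network plugin and environment rather than by the API.

```c
/* Sketch: NCCL allreduce across the GPUs visible to one process.
 * The RDMA/SHARP backend would come from the NCCL net plugin; this code
 * does not change. Build (assumption): nvcc nccl_sketch.c -lnccl */
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 8) ndev = 8;                  /* keep the fixed arrays simple */

    ncclComm_t   comms[8];
    cudaStream_t streams[8];
    float       *buf[8];
    size_t       count = 1 << 20;

    ncclCommInitAll(comms, ndev, NULL);      /* one communicator per local GPU */

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaMalloc((void **)&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    /* In-place allreduce; group calls let one thread drive all devices. */
    ncclGroupStart();
    for (int i = 0; i < ndev; i++)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("allreduce finished on %d GPUs\n", ndev);
    return 0;
}
```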
RDMA 2023 Hackathon
Task
Implement reduce-scatter on SHARP-based UCC
Implementation
Reduce-scatter = All-reduce + Data-discard
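In other words (a plain MPI sketch of the same idea, not the actual SHARP/UCC code): run a full allreduce, then let each rank keep only its own block of the reduced result. This wastes receive bandwidth, but it reuses the allreduce offload that SHARP already provides.

```c
/* Sketch of the hackathon trick using MPI: emulate reduce-scatter with a full
 * allreduce followed by "data discard" (each rank keeps only its block). */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Each of `size` ranks contributes `block * size` floats and receives the
 * reduced `block` floats that correspond to its own rank. */
static void reduce_scatter_via_allreduce(const float *sendbuf, float *recvbuf,
                                         int block, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    float *tmp = malloc((size_t)block * size * sizeof(float));

    /* Step 1: reduce the whole vector on every rank (this is what SHARP
     * offloads to the switch). */
    MPI_Allreduce(sendbuf, tmp, block * size, MPI_FLOAT, MPI_SUM, comm);

    /* Step 2: data discard -- keep only this rank's block. */
    memcpy(recvbuf, tmp + (size_t)rank * block, (size_t)block * sizeof(float));

    free(tmp);
}
```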
What is SHARP?
In-Network Computing with NVIDIA SHARP
SHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol
SHARP finishes some simple tasks (such as aggregation and reduction) on the switch or other network-level computing resources.
Comments
Through the hackathon, some other shortcomings were exposed to us.
Misleading Performance
First, RDMA + SHARP does not seem as good as NVIDIA claims on their website. After running my own tests, the performance results are listed here:
Size | Latency (µs) |
---|---|
131072 | 124.188 |
262144 | 121.844 |
524288 | 120.188 |
1048576 | 120.25 |
2097152 | 122.375 |
4194304 | 119.188 |
8388608 | 118.562 |
Here are the results using MPI with TCP only. But RDMA can provide benefits as the data scales to larger sizes:
Size | Latency (µs) |
---|---|
134217728 | 118.562 |
134217728000 | 132.688 |
1342177280000 | 258.375 |
2684354560000 | 393.031 |
5368709120000 | 698.75 |
Limited API
APIs provided by SHARP:
- ucc_tl_sharp_allreduce_init
- ucc_tl_sharp_barrier_init
- ucc_tl_sharp_bcast_init
But developing other modules is not as hard as I expected. The correct development path is:
- develop sharp-based api
- develop ucc-based api
- plug them into torch/openmpi/NCCL…
[!note]
- sharp_coll.c: the module through which UCC calls SHARP
- ucc_tl_sharp_reduce_scatter_start: the API entry point where UCC calls into SHARP
- sharp_coll_do_xxx(v)_nb: APIs provided by NVIDIA SHARP; the v suffix denotes the vector API
Further work
After the hackathon, there are two things that I want to look into in the future:
- Can I move sharp_coll_do_xx(v) to the GPU? When the collective task becomes more challenging, we cannot demand too much of the computation resources on the switch.
- The NCCL plugin has not been tested yet, so I am not sure about its performance.