RDMA Glance Notes
What is RDMA?
RDMA:
a technology that allows direct memory access between computers or devices without involving the CPU
RDMA allows data to be transferred directly from the memory of one computer to another computer’s memory, bypassing the operating system’s network stack on both ends and minimizing CPU involvement. This reduces processing overhead, minimizes latency, and improves network efficiency.
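To make the "direct memory access" part concrete, here is a minimal libibverbs sketch (my own illustration, not taken from any project) that opens an RDMA device and registers a buffer so the NIC can read and write it directly. Queue-pair creation and the out-of-band exchange of the buffer address/rkey with the peer are omitted.

```c
/* Minimal libibverbs sketch: open an RDMA device and register memory that the
 * NIC can access directly. QP creation and connection setup are omitted.
 * Build (assumption): gcc rdma_sketch.c -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register a 1 MiB buffer; the returned rkey is what a remote peer would
     * use to RDMA-read/write this memory without involving our CPU. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);

    printf("registered %zu bytes: lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```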
UCX
https://github.com/openucx/ucx
UCX exposes a set of abstract communication primitives that utilize the best of available hardware resources and offloads. These include RDMA (InfiniBand and RoCE), TCP, GPUs, shared memory, and network atomic operations.
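For a quick taste of the UCP layer, here is a minimal sketch (my own example, error handling trimmed) that creates a UCX context and a worker; endpoint creation needs an out-of-band exchange of worker addresses and is left out.

```c
/* Minimal UCX (UCP) initialization sketch: context + worker only.
 * Endpoints require exchanging worker addresses out-of-band, omitted here.
 * Build (assumption): gcc ucx_sketch.c -lucp -lucs */
#include <ucp/api/ucp.h>
#include <stdio.h>

int main(void) {
    ucp_config_t *config;
    ucp_config_read(NULL, NULL, &config);

    /* Declare the features we need; UCX picks the best available transport
     * (RDMA, shared memory, TCP, ...) underneath. */
    ucp_params_t params = {
        .field_mask = UCP_PARAM_FIELD_FEATURES,
        .features   = UCP_FEATURE_TAG | UCP_FEATURE_RMA
    };
    ucp_context_h context;
    ucp_init(&params, config, &context);
    ucp_config_release(config);

    ucp_worker_params_t wparams = {
        .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
        .thread_mode = UCS_THREAD_MODE_SINGLE
    };
    ucp_worker_h worker;
    ucp_worker_create(context, &wparams, &worker);

    printf("UCX context and worker created\n");

    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return 0;
}
```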
UCC
https://github.com/openucx/ucc
Unified Collective Communication Library
Design Goals
- Highly scalable and performant collectives for HPC, AI/ML and I/O workloads
- Nonblocking collective operations that cover a variety of programming models
- Flexible resource allocation model
- Support for relaxed ordering model
- Flexible synchronous model
- Repetitive collective operations (init once and invoke multiple times)
- Hardware collectives are a first-class citizen
Why UCX/UCC?
- Saves time: the low-level RDMA communication layer is already written and debugged for you.
- No need to modify application code, since UCC/UCX can plug into NCCL, MPICH, Open MPI, or PyTorch.
How to use UCC + UCX?
Examples:
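I did not keep the full hackathon code, so the block below is only a sketch of what posting a UCC collective looks like. It assumes the UCC library, context, and team (`ctx`, `team`) have already been created; team creation needs an out-of-band allgather, usually borrowed from MPI, which is omitted here.

```c
/* Sketch: posting a nonblocking UCC allreduce on an existing team.
 * Library/context/team setup (with an MPI-based out-of-band allgather)
 * is assumed to have happened elsewhere. */
#include <ucc/api/ucc.h>

static ucc_status_t do_allreduce(ucc_team_h team, ucc_context_h ctx,
                                 float *sbuf, float *rbuf, size_t count)
{
    ucc_coll_args_t args = {
        .mask      = 0,
        .coll_type = UCC_COLL_TYPE_ALLREDUCE,
        .src.info  = { .buffer = sbuf, .count = count,
                       .datatype = UCC_DT_FLOAT32,
                       .mem_type = UCC_MEMORY_TYPE_HOST },
        .dst.info  = { .buffer = rbuf, .count = count,
                       .datatype = UCC_DT_FLOAT32,
                       .mem_type = UCC_MEMORY_TYPE_HOST },
        .op        = UCC_OP_SUM,
    };
    ucc_coll_req_h req;
    ucc_status_t   st;

    /* "Init once, invoke multiple times": the request from ucc_collective_init
     * can be posted repeatedly. */
    st = ucc_collective_init(&args, &req, team);
    if (st != UCC_OK) return st;

    ucc_collective_post(req);
    while (ucc_collective_test(req) == UCC_INPROGRESS) {
        ucc_context_progress(ctx);   /* drive communication progress */
    }
    return ucc_collective_finalize(req);
}
```

In practice you rarely call this directly: Open MPI and PyTorch (via torch-ucc) can route their collectives through UCC, which is the "no code changes" benefit mentioned above.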
Personal view on UCC + UCX?
Pros & Cons: It appears easy to use, but lacks documentation.
(Another potential con: these libraries may only compile on Intel processors.)
Maybe a better way is to use NCCL directly with plugins:
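To keep things concrete, here is a minimal single-node NCCL sketch (my own example, error checking trimmed): the application code stays the same whether the transport is TCP, RDMA, or RDMA + SHARP, since that part is selected by the NCCL network plugin and environment rather than by the API.

```c
/* Sketch: NCCL allreduce across the GPUs visible to one process.
 * The RDMA/SHARP backend would come from the NCCL net plugin; this code
 * does not change. Build (assumption): nvcc nccl_sketch.c -lnccl */
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 8) ndev = 8;                  /* keep the fixed arrays simple */

    ncclComm_t   comms[8];
    cudaStream_t streams[8];
    float       *buf[8];
    size_t       count = 1 << 20;

    ncclCommInitAll(comms, ndev, NULL);      /* one communicator per local GPU */

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaMalloc((void **)&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    /* In-place allreduce; group calls let one thread drive all devices. */
    ncclGroupStart();
    for (int i = 0; i < ndev; i++)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("allreduce finished on %d GPUs\n", ndev);
    return 0;
}
```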
RDMA 2023 Hackathon
Task
Implement reduce-scatter on SHARP-based UCC
Implementation
Reduce-scatter = All-reduce + Data-discard
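In other words (a plain MPI sketch of the same idea, not the actual SHARP/UCC code): run a full allreduce, then let each rank keep only its own block of the reduced result. This wastes receive bandwidth, but it reuses the allreduce offload that SHARP already provides.

```c
/* Sketch of the hackathon trick using MPI: emulate reduce-scatter with a full
 * allreduce followed by "data discard" (each rank keeps only its block). */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Each of `size` ranks contributes `block * size` floats and receives the
 * reduced `block` floats that correspond to its own rank. */
static void reduce_scatter_via_allreduce(const float *sendbuf, float *recvbuf,
                                         int block, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    float *tmp = malloc((size_t)block * size * sizeof(float));

    /* Step 1: reduce the whole vector on every rank (this is what SHARP
     * offloads to the switch). */
    MPI_Allreduce(sendbuf, tmp, block * size, MPI_FLOAT, MPI_SUM, comm);

    /* Step 2: data discard -- keep only this rank's block. */
    memcpy(recvbuf, tmp + (size_t)rank * block, (size_t)block * sizeof(float));

    free(tmp);
}
```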
What is SHARP?
In-Network Computing with NVIDIA SHARP
SHARP: In-Network Scalable Hierarchical Aggregation and Reduction Protocol
SHARP finishes some simple tasks (such as aggregation and reduction) on the switch or other network-level computing resources.
Comments
Through the hackathon, some other shortcomings were exposed to us.
Misleading Performance
First, RDMA + SHARP does not seem as good as NVIDIA claims on their website. After running my own tests, the performance results are listed here:
Size | Latency (µs) |
---|---|
131072 | 124.188 |
262144 | 121.844 |
524288 | 120.188 |
1048576 | 120.25 |
2097152 | 122.375 |
4194304 | 119.188 |
8388608 | 118.562 |
Here are the results using MPI with TCP only. But RDMA can provide benefits as the data scales to larger sizes:
Size | Latency (µs) |
---|---|
134217728 | 118.562 |
134217728000 | 132.688 |
1342177280000 | 258.375 |
2684354560000 | 393.031 |
5368709120000 | 698.75 |
Limited API
APIs provided by SHARP:
- ucc_tl_sharp_allreduce_init
- ucc_tl_sharp_barrier_init
- ucc_tl_sharp_bcast_init
But developing other modules is not as hard as I expected. The correct development path is:
- develop sharp-based api
- develop ucc-based api
- plug them into torch/openmpi/NCCL…
[!note]
- sharp_coll.c: the module through which UCC calls SHARP
- ucc_tl_sharp_reduce_scatter_start: the API entry point where UCC calls into SHARP
- sharp_coll_do_xxx(v)_nb: APIs provided by NVIDIA SHARP; the v suffix denotes the vector API
Further work
After the hackathon, there are two things that I want to look into in the future:
- Can I move sharp_coll_do_xx(v) to the GPU? When the collective task becomes more challenging, we cannot demand too much of the computation resources on the switch.
- The NCCL plugin has not been tested yet, so I am not sure about its performance.