```dockerfile
# Copy your application code to the container
COPY . /app

# Set the GPU environment variables
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

# set USTC mirror for apt
RUN sed -i 's@//.*archive.ubuntu.com@//mirrors.ustc.edu.cn@g' /etc/apt/sources.list
# update sources
RUN apt-get update
# install git for model download, and iputils-ping for ping tests
RUN apt-get install -y iputils-ping git
# update pip
RUN python -m pip install --upgrade pip
# set up PyPI mirror
RUN pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# install the dependencies from the requirements file listed above
RUN pip install -r requirements.txt
RUN pip install "ray[serve]" requests torch diffusers
# install huggingface transformers
RUN pip install git+https://github.com/huggingface/transformers

# Set the entry point command (modify as per your needs)
CMD ["bash"]
```
Then build the image from the Dockerfile:
```bash
docker build -t ray_test_image .
```
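Before wiring up the cluster, it can be worth checking that containers from this image can actually see the GPUs. A quick hedged check, assuming the NVIDIA Container Toolkit is installed on the host (with `utility` in the driver capabilities, `nvidia-smi` is mounted into the container automatically):

```bash
# one-off container: should print the GPU table, then exit
docker run --rm --gpus all ray_test_image nvidia-smi
```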
Start head and workers:
```bash
# Start head node
docker run --name head -d -t -p 6379:6379 -p 8265:8265 --network ray-network ray_test_image

# Fetch head node IP
HEAD_IP=`docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' head`

# Start Ray on the head node and attach the workers
docker exec head sh -c "ray start --head --num-gpus=0 --num-cpus=12"
docker exec worker0 sh -c "ray start --address=\"$HEAD_IP:6379\""
docker exec worker1 sh -c "ray start --address=\"$HEAD_IP:6379\""
docker exec worker2 sh -c "ray start --address=\"$HEAD_IP:6379\""
docker exec worker3 sh -c "ray start --address=\"$HEAD_IP:6379\""
```
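The commands above assume that the `ray-network` bridge network and the worker containers already exist. If you are starting from scratch, a minimal sketch of creating them (the names `worker0`..`worker3` are taken from the `docker exec` calls above; `--gpus all` is an assumption, adjust to your hardware):

```bash
# create the user-defined bridge network the containers attach to
docker network create ray-network

# start four worker containers on the same network
for i in 0 1 2 3; do
    docker run --name worker$i -d -t --gpus all --network ray-network ray_test_image
done
```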
Port 6379 is the Ray cluster (GCS) address that workers and clients connect to, and port 8265 serves the Ray dashboard (open http://localhost:8265 in a browser).
After the head node and the worker nodes are started in their containers, we can check the Ray cluster status from the host (since port 6379 was published earlier, the host can reach the cluster directly):
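For example, assuming Ray is also installed on the host:

```bash
# query the cluster through the published GCS port
ray status --address 127.0.0.1:6379
```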
A refresh script might be helpful when the head node is down:
```bash
# Stop Ray on every node
docker exec head sh -c "ray stop"
docker exec worker0 sh -c "ray stop"
docker exec worker1 sh -c "ray stop"
docker exec worker2 sh -c "ray stop"
docker exec worker3 sh -c "ray stop"

# Fetch head node IP
HEAD_IP=`docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' head`

# Restart Ray on the head node and re-attach the workers
docker exec head sh -c "ray start --head --num-gpus=0 --num-cpus=12"
docker exec worker0 sh -c "ray start --address=\"$HEAD_IP:6379\""
docker exec worker1 sh -c "ray start --address=\"$HEAD_IP:6379\""
docker exec worker2 sh -c "ray start --address=\"$HEAD_IP:6379\""
docker exec worker3 sh -c "ray start --address=\"$HEAD_IP:6379\""
```
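If you want this to happen automatically, a hypothetical watchdog loop on the host could save the commands above as `refresh.sh` and re-run them whenever the head stops responding (the 60-second interval is arbitrary):

```bash
# hypothetical watchdog: re-run refresh.sh whenever the head node is unreachable
while true; do
    if ! docker exec head ray status > /dev/null 2>&1; then
        bash refresh.sh
    fi
    sleep 60
done
```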
## Serve Models and Send Requests
Here is an example adapted from the official Ray docs: we use a simple Stable Diffusion service to show how Ray Serve works.
First we copy the serve script (stable.py) and the request script (request.py) to the head container; the exact commands are shown at the end of this section.
The deployment configuration is the most important part; its two key fields are described below, and a sketch follows the list:

- `ray_actor_options` lists the required resources for each replica. In addition to the number of GPUs, you can also set the number of CPUs, memory, accelerator type, and more. For more information, refer to the Ray Actor Options documentation.
- `autoscaling_config` specifies the autoscaling rules for Ray Serve. For more information, refer to the Autoscaling config documentation.
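For reference, here is a minimal sketch of what `stable.py` might look like, loosely following the Stable Diffusion example in the Ray Serve docs and assuming a recent Ray version where awaiting a deployment handle call returns the result directly; the model id, replica bounds, and resource numbers are illustrative, not prescriptive:

```python
from io import BytesIO

from fastapi import FastAPI
from fastapi.responses import Response
from ray import serve

app = FastAPI()


@serve.deployment(num_replicas=1)
@serve.ingress(app)
class APIIngress:
    def __init__(self, diffusion_handle):
        self.handle = diffusion_handle

    @app.get("/imagine")
    async def imagine(self, prompt: str) -> Response:
        # forward the prompt to a StableDiffusion replica
        png_bytes = await self.handle.generate.remote(prompt)
        return Response(content=png_bytes, media_type="image/png")


@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # required resources per replica
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
)
class StableDiffusion:
    def __init__(self):
        import torch
        from diffusers import StableDiffusionPipeline

        # illustrative checkpoint; swap in whichever model you downloaded
        self.pipe = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
        ).to("cuda")

    def generate(self, prompt: str) -> bytes:
        image = self.pipe(prompt).images[0]
        buf = BytesIO()
        image.save(buf, format="PNG")
        return buf.getvalue()


# `serve run stable:entrypoint` below refers to this bound application
entrypoint = APIIngress.bind(StableDiffusion.bind())
```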
The request script (request.py) fires a batch of concurrent requests at the endpoint:

```python
import asyncio

import requests


async def get_image(id):
    prompt = "a cute cat is dancing on the grass."
    input = "%20".join(prompt.split(" "))
    resp = requests.get(f"http://127.0.0.1:8000/imagine?prompt={input}")
    with open(f"output{id}.png", 'wb') as f:
        f.write(resp.content)


async def main():
    tasks = []
    for i in range(50):
        task = asyncio.create_task(get_image(i))
        tasks.append(task)
    responses = await asyncio.gather(*tasks)
    print(responses)


asyncio.run(main())
```
```bash
# copy files to head node
docker cp stable.py head:/app
docker cp request.py head:/app
docker exec -it head bash
# the following lines run inside the head container
cd /app
serve run stable:entrypoint
```
[!note] Do not terminate the server process above. Use tmux or other background tools to keep this process alive.
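Once the application is up, the request script can be run from a second shell on the host; a sketch, assuming the paths used above:

```bash
# in a second terminal on the host: run the client inside the head container
docker exec -it head sh -c "cd /app && python request.py"
# copy one of the generated images back to the host to inspect it
docker cp head:/app/output0.png .
```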