Ray on Kubernetes

1. Install the KubeRay Operator

The KubeRay Operator manages the lifecycle of Ray clusters on Kubernetes.

# Add the Helm repository
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Install the KubeRay Operator (this also installs the CRDs)
helm install kuberay-operator kuberay/kuberay-operator \
  --namespace kube-system \
  --version 1.4.2

Verify the installation:

kubectl get pods -n kube-system | grep kuberay-operator
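
Since the chart also installs the CRDs, you can confirm they were registered:

kubectl get crds | grep ray.io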

2. Deploy a RayCluster

RayCluster is the custom resource (CRD) through which KubeRay describes a Ray cluster on Kubernetes.
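
For orientation, the following is a minimal sketch of the resource the chart renders (the name ray-cluster-demo and the plain rayproject/ray image are illustrative placeholders):

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster-demo   # illustrative name
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.48.0
  workerGroupSpecs:
  - groupName: workergroup
    replicas: 2
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.48.0

Note that workers live under spec.workerGroupSpecs; this is the field to edit when scaling through the CR (see section 4).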

  1. Pull the Helm chart:
helm pull kuberay/ray-cluster --untar --version 1.4.2
  2. Edit values.yaml to configure the Head node and the Worker nodes, for example:
head:
  resources:
    requests:
      cpu: "4"
      memory: "8Gi"
    limits:
      cpu: "4"
      memory: "8Gi"

worker:
  replicas: 2
  resources:
    requests:
      cpu: "12"
      memory: "40Gi"
      huawei.com/Ascend910B: "1"
      huawei.com/Ascend910B-memory: "16384"
    limits:
      cpu: "12"
      memory: "40Gi"
      huawei.com/Ascend910B: "1"
      huawei.com/Ascend910B-memory: "16384"
  3. Deploy the cluster:
helm -n kube-system install ray-cluster ./ray-cluster -f ./values.yaml
  4. Verify (a quick smoke test follows this list):
kubectl get pods -n kube-system -l ray.io/cluster=ray-cluster
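
Once the Pods are Running, a quick smoke test is to query the CR state and run ray status inside the head Pod (the selector uses KubeRay's standard ray.io/node-type label):

kubectl get raycluster ray-cluster -n kube-system

HEAD_POD=$(kubectl get pods -n kube-system \
  -l 'ray.io/cluster=ray-cluster,ray.io/node-type=head' \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n kube-system "$HEAD_POD" -- ray status

ray status should list the head node plus the configured workers and their resources.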

3. Worker Node Configuration (Reference Deployment)

In some cases the Helm values cannot express special worker configuration (Ascend NPU driver mounts, probe settings, and so on), so the worker Pods have to be managed by a separate Deployment.

Example Deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ray-cluster-npu-workergroup
  namespace: kube-system
  labels:
    ray.io/cluster: ray-cluster
    ray.io/group: workergroup
spec:
  replicas: 2
  selector:
    matchLabels:
      ray.io/identifier: ray-cluster-npu-worker
  template:
    metadata:
      labels:
        ray.io/identifier: ray-cluster-npu-worker
        ray.io/cluster: ray-cluster
        ray.io/node-type: worker
    spec:
      containers:
      - name: ray-worker
        image: harbor.xx.net/library/rayproject/ray:2.48.0.py311-npu-vjepa2-aarch64-20250911
        command: ["/bin/bash", "-c", "--"]
        # Raise the file-descriptor limit, then join the existing cluster via
        # the head Service; --resources registers a custom logical resource
        # named "ascend" that Ray tasks can request explicitly.
        args:
          - >-
            ulimit -n 65536;
            ray start --address=ray-cluster-head-svc.kube-system.svc.cluster.local:6379
            --block --dashboard-agent-listen-port=52365
            --memory=20000000000 --metrics-export-port=8080
            --num-cpus=12 --resources='{"ascend":20}'
        resources:
          requests:
            cpu: "12"
            memory: 40Gi
            huawei.com/Ascend910B: "1"
            huawei.com/Ascend910B-memory: "16384"
          limits:
            cpu: "12"
            memory: 40Gi
            huawei.com/Ascend910B: "1"
            huawei.com/Ascend910B-memory: "16384"
        livenessProbe:
          exec:
            command:
              - bash
              - -c
              - wget --tries 1 -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
          initialDelaySeconds: 30
          periodSeconds: 5
          failureThreshold: 120
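
Apply the manifest and check that the worker Pods come up (ray-worker-deployment.yaml is a hypothetical filename for the YAML above):

kubectl apply -f ray-worker-deployment.yaml
kubectl get pods -n kube-system -l ray.io/identifier=ray-cluster-npu-worker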

4. Common Operations

View the Ray Dashboard

The Ray head Pod exposes the dashboard (port 8265 by default):

kubectl port-forward svc/ray-cluster-head-svc -n kube-system 8265:8265

Open: http://localhost:8265
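
The same port-forward also exposes the Ray Job API, so a quick end-to-end test can be submitted through it (a minimal sketch; assumes the ray CLI is installed locally):

ray job submit --address http://localhost:8265 \
  -- python -c "import ray; ray.init(); print(ray.cluster_resources())"

If the workers from section 3 registered correctly, the printed resources should include the custom "ascend" entry.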

Scale the Number of Workers

If the workers are managed by a Deployment:

kubectl scale deployment ray-cluster-npu-workergroup -n kube-system --replicas=4

If using the RayCluster CR:

kubectl edit raycluster ray-cluster -n kube-system
# Change replicas under spec.workerGroupSpecs (in the CR, workers are nested
# in workerGroupSpecs; spec.worker.replicas is the Helm values path)
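
The same change can be made non-interactively with a JSON patch (assuming the target group is the first entry in spec.workerGroupSpecs):

kubectl patch raycluster ray-cluster -n kube-system --type json \
  -p '[{"op": "replace", "path": "/spec/workerGroupSpecs/0/replicas", "value": 4}]'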

5. References

  • Official documentation:
  • KubeRay Helm Charts: