Ray on Kubernetes

1. Install the KubeRay Operator

The KubeRay Operator manages the lifecycle of Ray clusters on Kubernetes.

# Add the Helm repository
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update

# Install the KubeRay Operator (this also installs the CRDs)
helm install kuberay-operator kuberay/kuberay-operator \
  --namespace kube-system \
  --version 1.4.2

Verify the installation:

kubectl get pods -n kube-system | grep kuberay-operator
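
Since the chart also installs the CRDs, you can confirm they were registered:

kubectl get crds | grep ray.io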

2. Deploy a RayCluster

RayCluster is the custom resource (CRD) through which KubeRay describes a Ray cluster on Kubernetes.
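
For orientation, the following is a minimal sketch of the resource the chart renders (the name ray-cluster-demo and the plain rayproject/ray image are illustrative placeholders):

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster-demo   # illustrative name
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.48.0
  workerGroupSpecs:
  - groupName: workergroup
    replicas: 2
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.48.0

Note that workers live under spec.workerGroupSpecs; this is the field to edit when scaling through the CR (see section 4).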

  1. Pull the Helm chart:
helm pull kuberay/ray-cluster --untar --version 1.4.2
  2. Edit values.yaml to configure the Head node and the Worker nodes, for example:
head:
  resources:
    requests:
      cpu: "4"
      memory: "8Gi"
    limits:
      cpu: "4"
      memory: "8Gi"

worker:
  replicas: 2
  resources:
    requests:
      cpu: "12"
      memory: "40Gi"
      huawei.com/Ascend910B: "1"
      huawei.com/Ascend910B-memory: "16384"
    limits:
      cpu: "12"
      memory: "40Gi"
      huawei.com/Ascend910B: "1"
      huawei.com/Ascend910B-memory: "16384"
  3. Deploy the cluster:
helm -n kube-system install ray-cluster ./ray-cluster -f ./values.yaml
  4. Verify (a quick smoke test follows this list):
kubectl get pods -n kube-system -l ray.io/cluster=ray-cluster
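
Once the Pods are Running, a quick smoke test is to query the CR state and run ray status inside the head Pod (the selector uses KubeRay's standard ray.io/node-type label):

kubectl get raycluster ray-cluster -n kube-system

HEAD_POD=$(kubectl get pods -n kube-system \
  -l 'ray.io/cluster=ray-cluster,ray.io/node-type=head' \
  -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n kube-system "$HEAD_POD" -- ray status

ray status should list the head node plus the configured workers and their resources.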

3. Worker Node Configuration (Reference Deployment)

In some cases the Helm values cannot express special worker configuration (Ascend NPU driver mounts, probe settings, and so on), so the worker Pods have to be managed by a separate Deployment.

Example Deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ray-cluster-npu-workergroup
  namespace: kube-system
  labels:
    ray.io/cluster: ray-cluster
    ray.io/group: workergroup
spec:
  replicas: 2
  selector:
    matchLabels:
      ray.io/identifier: ray-cluster-npu-worker
  template:
    metadata:
      labels:
        ray.io/identifier: ray-cluster-npu-worker
        ray.io/cluster: ray-cluster
        ray.io/node-type: worker
    spec:
      containers:
      - name: ray-worker
        image: harbor.xx.net/library/rayproject/ray:2.48.0.py311-npu-vjepa2-aarch64-20250911
        command: ["/bin/bash", "-c", "--"]
        # Raise the file-descriptor limit, then join the existing cluster via
        # the head Service; --resources registers a custom logical resource
        # named "ascend" that Ray tasks can request explicitly.
        args:
          - >-
            ulimit -n 65536;
            ray start --address=ray-cluster-head-svc.kube-system.svc.cluster.local:6379
            --block --dashboard-agent-listen-port=52365
            --memory=20000000000 --metrics-export-port=8080
            --num-cpus=12 --resources='{"ascend":20}'
        resources:
          requests:
            cpu: "12"
            memory: 40Gi
            huawei.com/Ascend910B: "1"
            huawei.com/Ascend910B-memory: "16384"
          limits:
            cpu: "12"
            memory: 40Gi
            huawei.com/Ascend910B: "1"
            huawei.com/Ascend910B-memory: "16384"
        livenessProbe:
          exec:
            command:
              - bash
              - -c
              - wget --tries 1 -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
          initialDelaySeconds: 30
          periodSeconds: 5
          failureThreshold: 120
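
Apply the manifest and check that the worker Pods come up (ray-worker-deployment.yaml is a hypothetical filename for the YAML above):

kubectl apply -f ray-worker-deployment.yaml
kubectl get pods -n kube-system -l ray.io/identifier=ray-cluster-npu-worker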

4. Common Operations

View the Ray Dashboard

The Ray head Pod exposes the dashboard (port 8265 by default):

kubectl port-forward svc/ray-cluster-head-svc -n kube-system 8265:8265

Open: http://localhost:8265
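
The same port-forward also exposes the Ray Job API, so a quick end-to-end test can be submitted through it (a minimal sketch; assumes the ray CLI is installed locally):

ray job submit --address http://localhost:8265 \
  -- python -c "import ray; ray.init(); print(ray.cluster_resources())"

If the workers from section 3 registered correctly, the printed resources should include the custom "ascend" entry.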

Scale the Number of Workers

If the workers are managed by a Deployment:

kubectl scale deployment ray-cluster-npu-workergroup -n kube-system --replicas=4

If using the RayCluster CR:

kubectl edit raycluster ray-cluster -n kube-system
# Change replicas under spec.workerGroupSpecs (in the CR, workers are nested
# in workerGroupSpecs; spec.worker.replicas is the Helm values path)
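
The same change can be made non-interactively with a JSON patch (assuming the target group is the first entry in spec.workerGroupSpecs):

kubectl patch raycluster ray-cluster -n kube-system --type json \
  -p '[{"op": "replace", "path": "/spec/workerGroupSpecs/0/replicas", "value": 4}]'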

5. References

  • Official documentation:
  • KubeRay Helm Charts: