Ray on Kubernetes
1. Install the KubeRay Operator
The KubeRay operator manages the lifecycle of Ray clusters on Kubernetes.
# Add the KubeRay Helm repository
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install the KubeRay operator (this also installs the Ray CRDs)
helm install kuberay-operator kuberay/kuberay-operator \
  --namespace kube-system \
  --version 1.4.2
Verify the installation:
kubectl get pods -n kube-system | grep kuberay-operator
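The operator chart also registers the Ray CRDs. As an additional check, list them (the names below are the standard ones shipped with KubeRay):
# Expect rayclusters.ray.io, rayjobs.ray.io and rayservices.ray.io
kubectl get crd | grep ray.io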
2. Deploy a RayCluster
RayCluster is the custom resource (CRD) that represents a Ray cluster on Kubernetes.
- Pull the Helm chart:
helm pull kuberay/ray-cluster --untar --version 1.4.2
- Edit values.yaml to configure the head and worker nodes, for example:
head:
  resources:
    requests:
      cpu: "4"
      memory: "8Gi"
    limits:
      cpu: "4"
      memory: "8Gi"
worker:
  replicas: 2
  resources:
    requests:
      cpu: "12"
      memory: "40Gi"
      huawei.com/Ascend910B: "1"
      huawei.com/Ascend910B-memory: "16384"
    limits:
      cpu: "12"
      memory: "40Gi"
      huawei.com/Ascend910B: "1"
      huawei.com/Ascend910B-memory: "16384"
- Deploy the cluster:
helm -n kube-system install ray-cluster ./ray-cluster -f ./values.yaml
- Verify that the Pods are running:
kubectl get pods -n kube-system -l ray.io/cluster=ray-cluster
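Besides the Pods, the RayCluster resource itself reports the cluster state (head service, desired and available workers). A quick look, assuming the release name ray-cluster used above:
kubectl get raycluster -n kube-system
kubectl describe raycluster ray-cluster -n kube-system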
3. Worker node configuration (reference Deployment)
In some cases, the Helm values cannot express worker-specific settings (such as Ascend NPU driver mounts or probe configuration), so the worker Pods need to be managed by a separate Deployment instead.
Example Deployment YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ray-cluster-npu-workergroup
  namespace: kube-system
  labels:
    ray.io/cluster: ray-cluster
    ray.io/group: workergroup
spec:
  replicas: 2
  selector:
    matchLabels:
      ray.io/identifier: ray-cluster-npu-worker
  template:
    metadata:
      labels:
        ray.io/identifier: ray-cluster-npu-worker
        ray.io/cluster: ray-cluster
        ray.io/node-type: worker
    spec:
      containers:
        - name: ray-worker
          image: harbor.xx.net/library/rayproject/ray:2.48.0.py311-npu-vjepa2-aarch64-20250911
          command: ["/bin/bash", "-c", "--"]
          args:
            # Join the existing cluster via the head service, then block so the container keeps running
            - >-
              ulimit -n 65536;
              ray start --address=ray-cluster-head-svc.kube-system.svc.cluster.local:6379
              --block --dashboard-agent-listen-port=52365
              --memory=20000000000 --metrics-export-port=8080
              --num-cpus=12 --resources='{"ascend":20}'
          resources:
            # Ascend 910B NPU resources exposed by the NPU device plugin
            requests:
              cpu: "12"
              memory: 40Gi
              huawei.com/Ascend910B: "1"
              huawei.com/Ascend910B-memory: "16384"
            limits:
              cpu: "12"
              memory: 40Gi
              huawei.com/Ascend910B: "1"
              huawei.com/Ascend910B-memory: "16384"
          # Probe the local raylet health endpoint served by the dashboard agent
          livenessProbe:
            exec:
              command:
                - bash
                - -c
                - wget --tries 1 -T 2 -q -O- http://localhost:52365/api/local_raylet_healthz | grep success
            initialDelaySeconds: 30
            periodSeconds: 5
            failureThreshold: 120
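A sketch of rolling out this Deployment and confirming the NPU workers joined the cluster; the file name is only an example, and the head Pod is looked up via the ray.io/node-type=head label that KubeRay sets:
# Apply the worker Deployment (file name is just an example)
kubectl apply -f ray-cluster-npu-workergroup.yaml
# Run `ray status` inside the head Pod; the workers should appear with their CPU/memory/ascend resources
HEAD_POD=$(kubectl get pods -n kube-system -l ray.io/node-type=head -o name | head -n 1)
kubectl exec -n kube-system -it $HEAD_POD -- ray status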
4. Common operations
View the Ray dashboard
The Ray head Pod exposes the dashboard (port 8265 by default):
kubectl port-forward svc/ray-cluster-head-svc -n kube-system 8265:8265
Open http://localhost:8265 in a browser.
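With the port-forward still running, a small smoke test can be submitted through the dashboard's job API (this assumes the Ray CLI is installed locally, e.g. via pip install "ray[default]"):
ray job submit --address http://localhost:8265 -- python -c "import ray; ray.init(); print(ray.cluster_resources())"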
Scale the number of workers
If the workers are managed by a Deployment:
kubectl scale deployment ray-cluster-npu-workergroup -n kube-system --replicas=4
If the workers are managed by the RayCluster CR:
kubectl edit raycluster ray-cluster -n kube-system
# change spec.workerGroupSpecs[*].replicas for the target worker group
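Alternatively, the replica count can be patched non-interactively; this assumes the group to scale is the first entry in spec.workerGroupSpecs:
kubectl patch raycluster ray-cluster -n kube-system --type json \
  -p '[{"op": "replace", "path": "/spec/workerGroupSpecs/0/replicas", "value": 4}]'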
5. References
- Official documentation: https://docs.ray.io/en/latest/cluster/kubernetes/index.html
- KubeRay Helm charts: https://ray-project.github.io/kuberay-helm/