# Prerequisites
## Kubernetes tools
### kubectl
This section describes how to install kubectl. See Kubernetes / Install and Set Up kubectl on Linux for more details.
You can install the kubectl binary with curl on Linux as follows. Please replace <kubernetesVersion> with your desired Kubernetes version (for example, v1.32.9) and <kubeconfigPath> with the path to your kubeconfig file.
```bash
KUBECTL_VERSION=<kubernetesVersion>
curl -LO https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
export KUBECONFIG=<kubeconfigPath>
```
You can verify the installation by running the following command. Note that the printed versions may vary depending on your installed kubectl and cluster versions.
```bash
kubectl version
```
```
Client Version: v1.32.9
Kustomize Version: v5.5.0
Server Version: v1.32.8
```
### Helm
You can install Helm by running the following command. See Helm / Installing Helm for more details.
```bash
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
```
You can verify the installation by running the following command. Note that the printed version may vary depending on the Helm version installed.
```bash
helm version
```
```
version.BuildInfo{Version:"v3.19.0", GitCommit:"3d8990f0836691f0229297773f3524598f46bda6", GitTreeState:"clean", GoVersion:"go1.24.7"}
```
## Monitoring components
For the monitoring features of the MoAI Inference Framework, you need to install Prometheus, the Prometheus Operator, Node Exporter, and Grafana using the kube-prometheus-stack Helm chart. First, add the Prometheus Community Helm chart repository.
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update prometheus-community
```
Since the Prometheus stack installs many components by default, we recommend disabling unnecessary ones to achieve a minimal installation. Create a prometheus-stack-values.yaml file as follows. Please replace <storageClassName> with your own StorageClass name.
```yaml
defaultRules:
  create: false
windowsMonitoring:
  enabled: false
alertmanager:
  enabled: false
grafana:
  enabled: true
kubernetesServiceMonitors:
  enabled: false
kubeApiServer:
  enabled: false
kubelet:
  enabled: false
kubeControllerManager:
  enabled: false
coreDns:
  enabled: false
kubeDns:
  enabled: false
kubeEtcd:
  enabled: false
kubeScheduler:
  enabled: false
kubeProxy:
  enabled: false
kubeStateMetrics:
  enabled: false
nodeExporter:
  enabled: true
prometheusOperator:
  enabled: true
  tls:
    enabled: false
prometheus:
  enabled: true
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: "<storageClassName>"
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
```
Install the Prometheus stack.
```bash
helm upgrade -i prometheus-stack prometheus-community/kube-prometheus-stack \
  --version 77.11.1 \
  -n prometheus-stack \
  --create-namespace \
  -f prometheus-stack-values.yaml
```
You can verify that the Prometheus stack pods are running using the following command.
```bash
kubectl get pods -n prometheus-stack
```
```
NAME                                                   READY   STATUS    RESTARTS   AGE
prometheus-prometheus-stack-kube-prom-prometheus-0     2/2     Running   0          100s
prometheus-stack-grafana-7c655db89f-9ltch              3/3     Running   0          116s
prometheus-stack-kube-prom-operator-56d44cb7db-8w5v5   1/1     Running   0          116s
prometheus-stack-prometheus-node-exporter-ppsgg        1/1     Running   0          116s
```
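If you want to inspect the collected metrics in Grafana, you can port-forward the Grafana service and open http://localhost:3000 in a browser. The service name below follows the default naming for the `prometheus-stack` release used above and may differ in your cluster.
```bash
kubectl port-forward -n prometheus-stack svc/prometheus-stack-grafana 3000:80
```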
## AMD GPU operator
This section describes how to set up the AMD GPU Operator on a Kubernetes cluster. See AMD GPU Operator / Kubernetes (Helm) for more details.
### Certificate management
The AMD GPU Operator requires cert-manager to be installed in the cluster. First, add the Jetstack Helm chart repository.
```bash
helm repo add jetstack https://charts.jetstack.io
helm repo update jetstack
```
Create a cert-manager-values.yaml file as shown below, then install cert-manager using this file.
```yaml
crds:
  enabled: true
```
```bash
helm upgrade -i cert-manager jetstack/cert-manager \
  --version v1.18.3 \
  -n cert-manager \
  --create-namespace \
  -f cert-manager-values.yaml
```
You can verify that the cert-manager pods are running using the following command.
```bash
kubectl get pods -n cert-manager
```
```
NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager-74b7f6cbbc-hc587              1/1     Running   0          5m
cert-manager-cainjector-58c9d76cb8-cgx5t   1/1     Running   0          5m
cert-manager-webhook-5875b545cf-7x8tc      1/1     Running   0          5m
```
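As an additional check, you can confirm that the cert-manager CRDs were installed (these are created because `crds.enabled` is set to `true` above).
```bash
kubectl get crds | grep cert-manager.io
```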
### GPU operator installation
Add ROCm's GPU Operator Helm chart repository.
```bash
helm repo add rocm https://rocm.github.io/gpu-operator
helm repo update rocm
```
Create a namespace for the AMD GPU Operator.
```bash
kubectl create namespace amd-gpu
```
Create a Docker registry secret in the amd-gpu namespace. Please replace <registry>, <username>, and <password> with your own values.
```bash
kubectl create secret -n amd-gpu \
  docker-registry private-registry \
  --docker-server=<registry> \
  --docker-username=<username> \
  --docker-password=<password>
```
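You can confirm that the secret was created before proceeding.
```bash
kubectl get secret -n amd-gpu private-registry
```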
Create a gpu-operator-values.yaml file with the following content. Please replace <registry> and <repository> with your own values.
> **Warning**
>
> When `deviceConfig.spec.driver.enable` is set to `true`, the gpu-operator checks whether the image exists at the specified `<registry>/<repository>:<tag>`. If the image does not exist, the gpu-operator builds and pushes the image to that location. For more details, see AMD GPU Operator / Preparing Pre-compiled Driver Images.
```yaml
deviceConfig:
  spec:
    driver:
      enable: true
      version: "6.4.3"
      blacklist: true
      image: "<registry>/<repository>"
      imageRegistrySecret:
        name: private-registry
      imageRegistryTLS:
        insecure: false
        insecureSkipTLSVerify: false
      tolerations: &tolerations
        - key: amd.com/gpu
          operator: Exists
          effect: NoSchedule
    devicePlugin:
      devicePluginTolerations: *tolerations
    metricsExporter:
      prometheus:
        serviceMonitor:
          enabled: true
          interval: 10s
          labels:
            release: prometheus-stack
      tolerations: *tolerations
node-feature-discovery:
  worker:
    tolerations: *tolerations
```
You can install the AMD GPU Operator as follows.
```bash
helm upgrade -i gpu-operator rocm/gpu-operator-charts \
  --version v1.4.0 \
  -n amd-gpu \
  -f gpu-operator-values.yaml
```
You can verify that the gpu-operator pods are running using the following command.

> **Tip**
>
> The installation of the operator and GPU drivers may take some time. You can monitor the progress using the following command instead.
>
> ```bash
> kubectl get pods -n amd-gpu -w
> ```

```bash
kubectl get pods -n amd-gpu
```
```
NAME                                                              READY   STATUS    RESTARTS   AGE
default-device-plugin-fxj66                                       1/1     Running   0          108s
default-metrics-exporter-r2l6h                                    1/1     Running   0          108s
default-node-labeller-qhqdl                                       1/1     Running   0          2m35s
gpu-operator-gpu-operator-charts-controller-manager-69856dhd67k   1/1     Running   0          4m20s
gpu-operator-kmm-controller-7b5dd7b48b-fpcv6                      1/1     Running   0          4m20s
gpu-operator-kmm-webhook-server-c7bfc864-tfqdb                    1/1     Running   0          4m20s
gpu-operator-node-feature-discovery-gc-7649c47d5d-55rcn           1/1     Running   0          4m20s
gpu-operator-node-feature-discovery-master-fc889959c-sx7wv        1/1     Running   0          4m20s
gpu-operator-node-feature-discovery-worker-4tnns                  1/1     Running   0          4m20s
```
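Once the driver and device plugin are running, each GPU node should advertise the `amd.com/gpu` resource in its allocatable capacity. As a quick check, replace `<nodeName>` with one of your GPU nodes and run:
```bash
kubectl describe node <nodeName> | grep amd.com/gpu
```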
## RDMA device plugin
### Host driver installation
You need to install the device drivers and OFED software for InfiniBand or RoCE NICs on the host OS. This must be completed before joining the node to the Kubernetes cluster. It is recommended to follow the instructions provided by your hardware vendor.
**For InfiniBand (Mellanox) NICs**

Check that the InfiniBand NIC has been detected properly.
```bash
lspci -vvv | grep -i Mellanox | grep -i ConnectX
```
Set the environment variables.
```bash
OS_VER=$(. /etc/os-release;echo $ID$VERSION_ID)
KERNEL_VERSION=$(uname -r)
BASE_URL=https://content.mellanox.com/ofed
OFED_VER=23.10-3.2.2.0
OFED_TGZ_FILE=MLNX_OFED_LINUX-$OFED_VER-$OS_VER-x86_64.tgz
OFED_PREFIX_DIR=MLNX_OFED-$OFED_VER
OFED_DIR=MLNX_OFED_LINUX-$OFED_VER-$OS_VER-x86_64
```
Install the required dependency packages.
```bash
sudo apt-get update -y
sudo apt-get install -y lm-sensors xfsprogs net-tools libnuma-dev \
  ocl-icd-opencl-dev sqlite3 libsqlite3-dev libboost-all-dev libbz2-dev \
  openmpi-bin libtinfo-dev universal-ctags cscope nmon sox google-perftools \
  libssl-dev pstack libomp-dev libmsgpack-dev clang llvm llvm-12-dev \
  clang-format-12 libclang-12-dev libstdc++-12-dev
```
Download and install the OFED driver.
```bash
wget $BASE_URL/$OFED_PREFIX_DIR/$OFED_TGZ_FILE
tar -xzvf $OFED_TGZ_FILE
cd ./$OFED_DIR
./mlnxofedinstall --without-fw-update --all --ovs-dpdk --upstream-libs --with-nfsrdma --without-ucx --without-openmpi --force --kernel $KERNEL_VERSION
```
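After the installer finishes, you can sanity-check the driver stack; both `ofed_info` and `ibstat` are installed as part of MLNX_OFED. Note that a reboot or a restart of the `openibd` service may be required before the ports come up.
```bash
ofed_info -s   # print the installed OFED version
ibstat         # list HCA ports and their link state
```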
**For RoCE (Broadcom) NICs**

Check that the RoCE NIC has been detected properly.
```bash
lspci -vvv | grep -i 'Broadcom' | grep -i 'Ethernet controller'
```
Install the required dependency packages.
```bash
sudo apt-get update -y
sudo apt-get install -y ca-certificates htop net-tools vim zip wget curl \
  iputils-ping pciutils python3 infiniband-diags iproute2 binutils perftest \
  git make autoconf sudo libtool g++ bc
```
Download and install the RoCE NIC driver.
```bash
wget https://docs.broadcom.com/docs-and-downloads/ethernet-network-adapters/NXE/Thor2/GCA2/bcm5760x_231.2.63.0a.zip
unzip ./bcm5760x_231.2.63.0a.zip
cd ./bcm5760x_231.2.63.0a/utils/linux_installer
sudo bash ./install.sh -i \
  "$(lspci -vvv 2> /dev/null | grep 'Broadcom' | grep 'Ethernet controller' | cut -d ' ' -f 1)"
```
### RDMA device plugin installation
This section describes how to install the rdma-shared-device-plugin. See k8s-rdma-shared-dev-plugin / README for more details.
First, create an rdma-shared-device-plugin.yaml file as follows. Please replace <device> with your RDMA NIC's network interface name.
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rdma-devices
  namespace: kube-system
  labels:
    app.kubernetes.io/name: rdma-shared-device-plugin
    app.kubernetes.io/version: v1.5.2
    app.kubernetes.io/instance: rdma-shared-device-plugin
data:
  config.json: |
    {
      "periodicUpdateInterval": 300,
      "configList": [
        {
          "resourcePrefix": "mellanox",
          "resourceName": "hca",
          "rdmaHcaMax": 1000,
          "devices": [
            "<device>"
          ]
        }
      ]
    }
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: rdma-shared-device-plugin
  namespace: kube-system
  labels:
    app.kubernetes.io/name: rdma-shared-device-plugin
    app.kubernetes.io/version: v1.5.2
    app.kubernetes.io/instance: rdma-shared-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: rdma-shared-device-plugin
      app.kubernetes.io/instance: rdma-shared-device-plugin
  updateStrategy:
    rollingUpdate:
      maxUnavailable: "30%"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: rdma-shared-device-plugin
        app.kubernetes.io/version: v1.5.2
        app.kubernetes.io/instance: rdma-shared-device-plugin
    spec:
      hostNetwork: true
      priorityClassName: system-node-critical
      tolerations:
        - key: amd.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: device-plugin
          image: ghcr.io/mellanox/k8s-rdma-shared-dev-plugin:v1.5.2
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: plugins-registry
              mountPath: /var/lib/kubelet/plugins_registry
            - name: config
              mountPath: /k8s-rdma-shared-dev-plugin
            - name: devs
              mountPath: /dev/
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: plugins-registry
          hostPath:
            path: /var/lib/kubelet/plugins_registry
        - name: config
          configMap:
            name: rdma-devices
            items:
              - key: config.json
                path: config.json
        - name: devs
          hostPath:
            path: /dev/
```
Then, create an rdma-shared-device-plugin DaemonSet using the following command.
```bash
kubectl apply -f rdma-shared-device-plugin.yaml
```
You can verify that the rdma-shared-device-plugin pods are running using the following command.
```bash
kubectl get pods -n kube-system -l app.kubernetes.io/instance=rdma-shared-device-plugin
```
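Once the plugin pods are running, each node with an RDMA NIC should advertise the shared resource. With the configuration above, the resource name follows `<resourcePrefix>/<resourceName>`, i.e. `mellanox/hca`. Replace `<nodeName>` with one of your nodes and run:
```bash
kubectl describe node <nodeName> | grep mellanox/hca
```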
## Gateway
Add the Gateway API and Gateway API Inference Extension CRDs.
```bash
kubectl apply --server-side -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.1.0/manifests.yaml
```
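You can verify that the CRDs were created; the grep pattern below is a loose match on the Gateway API and Inference Extension API groups.
```bash
kubectl get crds | grep -E 'gateway.networking|inference.networking'
```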
You can use any gateway controller compatible with the Gateway API Inference Extension. We recommend using either Istio or Kgateway; installation instructions for both are provided below, but you only need to install one of them.
**Option 1: Istio**

Add the Istio Helm chart repository.
```bash
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update istio
```
Install the Istio base chart.
```bash
helm upgrade -i istio-base istio/base \
  --version 1.28.0 \
  -n istio-system \
  --create-namespace
```
Create an istiod-values.yaml file and install the Istio control plane.
```yaml
pilot:
  env:
    PILOT_ENABLE_ALPHA_GATEWAY_API: "true"
    ENABLE_GATEWAY_API_INFERENCE_EXTENSION: "true"
```
```bash
helm upgrade -i istiod istio/istiod \
  --version 1.28.0 \
  -n istio-system \
  -f istiod-values.yaml
```
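You can verify that the Istio control plane pods are running.
```bash
kubectl get pods -n istio-system
```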
**Option 2: Kgateway**

Install the Kgateway CRDs.
```bash
helm upgrade -i kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds \
  --version v2.1.1 \
  -n kgateway-system \
  --create-namespace
```
Create a kgateway-values.yaml file and install the Kgateway controller.
```yaml
inferenceExtension:
  enabled: true
```
```bash
helm upgrade -i kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway \
  --version v2.1.1 \
  -n kgateway-system \
  -f kgateway-values.yaml
```
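You can verify that the Kgateway controller pods are running.
```bash
kubectl get pods -n kgateway-system
```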
## Amazon ECR token for Moreh's container image repository
The container images of the MoAI Inference Framework are distributed through a private repository on Amazon ECR. To download them, you need to obtain an authorization token. First, store your AWS credentials as Kubernetes secrets as follows.
The AWS credentials should have been provided to you along with your purchase or trial issuance of the MoAI Inference Framework. If you did not receive this information, please contact your point of purchase separately.
```bash
kubectl create namespace mif
kubectl create secret -n mif generic aws-credentials \
  --from-literal=AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
  --from-literal=AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
```
Then, create the following aws-ecr-token-refresher.yaml file and apply it.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-ecr-token-refresher
  namespace: mif
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: aws-ecr-token-refresher
  namespace: mif
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "delete", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: aws-ecr-token-refresher
  namespace: mif
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: aws-ecr-token-refresher
subjects:
  - kind: ServiceAccount
    name: aws-ecr-token-refresher
    namespace: mif
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: aws-ecr-token-refresher
  namespace: mif
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          serviceAccountName: aws-ecr-token-refresher
          containers:
            - name: aws-ecr-token-refresher
              image: heyvaldemar/aws-kubectl:58dad7caa5986ceacd1bc818010a5e132d80452b
              command:
                - bash
                - -c
                - |
                  kubectl create secret -n ${NAMESPACE} docker-registry moreh-registry \
                    --docker-server=255250787067.dkr.ecr.ap-northeast-2.amazonaws.com \
                    --docker-username=AWS \
                    --docker-password=$(aws ecr get-login-password --region ${AWS_REGION}) \
                    --dry-run=client -o yaml | \
                    kubectl apply -f -
                  echo "ECR token refreshed at $(date)"
              env:
                - name: NAMESPACE
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.namespace
                - name: AWS_REGION
                  value: ap-northeast-2
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: AWS_ACCESS_KEY_ID
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-credentials
                      key: AWS_SECRET_ACCESS_KEY
```
```bash
kubectl apply -f aws-ecr-token-refresher.yaml
```
This CronJob runs every 6 hours to refresh the ECR token. To create the initial moreh-registry secret, you can run the following command.
```bash
kubectl create job -n mif initial-aws-ecr-token-refresh \
  --from=cronjob/aws-ecr-token-refresher
```
You can check whether the moreh-registry secret has been created using the following command.
```bash
kubectl get secret -n mif moreh-registry
```
```
NAME             TYPE                             DATA   AGE
moreh-registry   kubernetes.io/dockerconfigjson   1      101s
```
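Workloads that pull MoAI Inference Framework images can then reference this secret through `imagePullSecrets`. A minimal sketch follows; the pod name is hypothetical, and the `<repository>` and `<tag>` placeholders should be replaced with the image coordinates provided with your MoAI Inference Framework distribution.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example            # hypothetical pod name
  namespace: mif
spec:
  imagePullSecrets:
    - name: moreh-registry # the secret created by the CronJob above
  containers:
    - name: example
      image: 255250787067.dkr.ecr.ap-northeast-2.amazonaws.com/<repository>:<tag>
```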