# Quickstart
In this quickstart, we will launch two vLLM instances (pods) serving the Llama 3.2 1B Instruct model and expose them through a single endpoint. Please make sure to install all prerequisites before starting this guide.
## Gateway
First, create a gateway.yaml file to add the Gateway resource to the mif namespace. The contents of gateway.yaml are provided below for both Istio and Kgateway; use the variant that matches your gateway implementation.
**Istio**

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mif-gateway-infrastructure
  namespace: mif
data:
  service: |
    spec:
      type: ClusterIP
  deployment: |
    spec:
      template:
        metadata:
          annotations:
            proxy.istio.io/config: |
              accessLogFile: /dev/stdout
              accessLogEncoding: JSON
        spec:
          containers:
            - name: istio-proxy
              resources:
                limits: null
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: mif
  namespace: mif
spec:
  gatewayClassName: istio
  infrastructure:
    parametersRef:
      group: ""
      kind: ConfigMap
      name: mif-gateway-infrastructure
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
```
**Kgateway**

```yaml
apiVersion: gateway.kgateway.dev/v1alpha1
kind: GatewayParameters
metadata:
  name: mif-gateway-infrastructure
  namespace: mif
spec:
  kube:
    service:
      type: ClusterIP
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: mif
  namespace: mif
spec:
  gatewayClassName: kgateway
  infrastructure:
    parametersRef:
      group: gateway.kgateway.dev
      kind: GatewayParameters
      name: mif-gateway-infrastructure
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
```
```sh
kubectl apply -f gateway.yaml
```
You can verify that the gateway pod created for the Gateway resource is running using the following command.
```sh
kubectl get pod -n mif -l gateway.networking.k8s.io/gateway-name=mif
```

```
NAME                         READY   STATUS    RESTARTS   AGE
mif-istio-78789865b7-747nz   1/1     Running   0          27s
```
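You can also check the Gateway resource itself; once its infrastructure has been provisioned, kubectl should report an address and a Programmed condition (the exact columns depend on your kubectl version).

```sh
kubectl get gateway -n mif mif
```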
## Deployment
Add the Moreh Helm chart repository.
```sh
helm repo add moreh https://moreh-dev.github.io/helm-charts
helm repo update moreh
```
First, create a heimdall-values.yaml file as shown below and install the Heimdall scheduler using this file. If you are using Kgateway instead of Istio, set `gateway.gatewayClassName` to `kgateway`.
```yaml
global:
  imagePullSecrets:
    - name: moreh-registry
config:
  apiVersion: inference.networking.x-k8s.io/v1alpha1
  kind: EndpointPickerConfig
  plugins:
    - type: single-profile-handler
    - type: queue-scorer
    - type: max-score-picker
  schedulingProfiles:
    - name: default
      plugins:
        - pluginRef: queue-scorer
        - pluginRef: max-score-picker
gateway:
  name: mif
  gatewayClassName: istio
serviceMonitor:
  labels:
    release: prometheus-stack
```
```sh
helm upgrade -i heimdall moreh/heimdall \
  --version v0.5.0 \
  -n mif \
  -f heimdall-values.yaml
```
You can verify that the Heimdall pods are running as follows.
```sh
kubectl get all -n mif -l app.kubernetes.io/instance=heimdall
```

```
NAME                            READY   STATUS    RESTARTS   AGE
pod/heimdall-7d54fcbfff-chw94   1/1     Running   0          70s

NAME               TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
service/heimdall   ClusterIP   10.110.35.57   <none>        9002/TCP,9090/TCP,5557/TCP   70s

NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/heimdall   1/1     1            1           70s

NAME                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/heimdall-7d54fcbfff   1         1         1       70s
```
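If the Heimdall pod does not become ready, its logs are the first place to look (standard kubectl, nothing chart-specific).

```sh
kubectl logs -n mif deployment/heimdall
```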
Before deploying an inference service, create a Hugging Face access token from Hugging Face / Access Tokens. Then, to download the meta-llama/Llama-3.2-1B-Instruct model from Hugging Face, you must accept the model license on the meta-llama/Llama-3.2-1B-Instruct model page. A similar approval process may be required for other gated open-source models as well.
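As a quick sanity check, you can verify the token against the Hugging Face Hub API before deploying; the whoami-v2 endpoint returns your account information when the token is valid. Replace <huggingfaceToken> with your own token.

```sh
curl -s -H "Authorization: Bearer <huggingfaceToken>" https://huggingface.co/api/whoami-v2
```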
Create an inference-service-values.yaml file with the following contents, replacing <huggingfaceToken> with your own token. Note that the _common block defines a YAML anchor (&common) whose settings are merged into both the decode and prefill sections via merge keys (<<: *common).
```yaml
global:
  imagePullSecrets:
    - name: moreh-registry
extraArgs:
  - meta-llama/Llama-3.2-1B-Instruct
  - --quantization
  - "None"
  - --tensor-parallel-size
  - "2"
  - --max-num-batched-tokens
  - "8192"
  - --no-enable-prefix-caching
  - --no-enable-log-requests
  - --disable-uvicorn-access-log
extraEnvVars:
  - name: HF_TOKEN
    value: "<huggingfaceToken>"
_common: &common
  image:
    repository: 255250787067.dkr.ecr.ap-northeast-2.amazonaws.com/quickstart/moreh-vllm
    tag: "20250915.1"
  resources:
    requests:
      amd.com/gpu: "2"
    limits:
      amd.com/gpu: "2"
  podMonitor:
    labels:
      release: prometheus-stack
decode:
  replicas: 2
  <<: *common
prefill:
  enabled: false
  <<: *common
```
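If you prefer not to keep the token in a plain values file, one option is to store it in a Kubernetes Secret and reference it from extraEnvVars. This is a sketch that assumes the chart renders extraEnvVars entries into the container spec verbatim (common for Helm charts, but not confirmed here); the hf-token Secret name is arbitrary.

```sh
kubectl create secret generic hf-token -n mif --from-literal=token=<huggingfaceToken>
```

```yaml
# Assumes extraEnvVars entries are passed through to the container spec as-is.
extraEnvVars:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token   # Secret created above (name is arbitrary)
        key: token
```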
After that, you can install the Odin inference service by running the following command.
```sh
helm upgrade -i inference-service moreh/inference-service \
  --version v0.3.1 \
  -n mif \
  -f inference-service-values.yaml
```
You can verify that the inference service pods are running as follows.
```sh
kubectl get all -n mif -l app.kubernetes.io/instance=inference-service
```

```
NAME                                           READY   STATUS    RESTARTS   AGE
pod/inference-service-decode-fd954dc5d-6xmjj   1/1     Running   0          7m40s
pod/inference-service-decode-fd954dc5d-7wjhh   1/1     Running   0          7m40s

NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/inference-service-decode   2/2     2            2           7m40s

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/inference-service-decode-fd954dc5d   2         2         2       7m40s
```
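Model download and engine initialization can take several minutes, so the pods may not become ready immediately. You can block until both replicas are ready with standard kubectl (adjust the timeout to your environment).

```sh
kubectl wait pod -n mif \
  -l app.kubernetes.io/instance=inference-service \
  --for=condition=Ready --timeout=10m
```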
## Usage
You can set up port forwarding as follows to send API requests to the inference endpoint from your local machine.
```sh
SERVICE=$(kubectl -n mif get service -l gateway.networking.k8s.io/gateway-name=mif -o name)
kubectl -n mif port-forward $SERVICE 8000:80
```
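With the port-forward running, a lightweight first check is the OpenAI-compatible model listing endpoint that vLLM exposes; assuming the gateway forwards this path like the chat endpoint, the response should include meta-llama/Llama-3.2-1B-Instruct.

```sh
curl http://localhost:8000/v1/models
```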
You can send a request to the inference endpoint as follows. Note that jq is used only to format the JSON response for better readability and is not required for the request to function.
```sh
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "messages": [
      {
        "role": "developer",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }' | jq '.'
```
```json
{
  "id": "chatcmpl-5613ccb4-d168-40df-a5b7-842ab4a00d6a",
  "object": "chat.completion",
  "created": 1761484035,
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today? Do you have a specific question or problem you'd like to talk about, or are you just looking for some information on a particular topic?",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 48,
    "total_tokens": 86,
    "completion_tokens": 38,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
```
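Because the endpoint follows the OpenAI chat completions API, you can also request a streaming response by setting "stream": true; the server then returns the answer incrementally as server-sent events. The prompt here is only illustrative.

```sh
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```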
## Cleanup
To delete all the resources created in this quickstart, run the following commands.
```sh
helm uninstall -n mif inference-service
helm uninstall -n mif heimdall
kubectl delete gateway -n mif mif
```
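Note that kubectl delete gateway removes only the Gateway itself; the ConfigMap or GatewayParameters resource defined in gateway.yaml is left behind. To remove everything created from that file at once, you can delete by file instead.

```sh
kubectl delete -f gateway.yaml
```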