# Quickstart
Please make sure to install all prerequisites before starting this quickstart guide.
## Deployment
Add the Moreh Helm chart repository.
```bash
helm repo add moreh https://moreh-dev.github.io/helm-charts
helm repo update moreh
```
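If you want to confirm that the repository was added correctly, you can list the charts it provides (the chart names and versions shown will depend on the repository contents):

```bash
# List charts available from the newly added Moreh repository.
helm search repo moreh
```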
First, create a `heimdall-values.yaml` file as shown below and install the Heimdall scheduler using this file. If you are using Istio instead of Kgateway, set `gateway.gatewayClassName` to `istio`.
```yaml
global:
  imagePullSecrets:
    - name: moreh-registry
config:
  apiVersion: inference.networking.x-k8s.io/v1alpha1
  kind: EndpointPickerConfig
  plugins:
    - type: single-profile-handler
    - type: queue-scorer
    - type: max-score-picker
      parameters:
        maxNumOfEndpoints: 2
  schedulingProfiles:
    - name: default
      plugins:
        - pluginRef: queue-scorer
          weight: 1
        - pluginRef: max-score-picker
gateway:
  name: mif
  gatewayClassName: kgateway
```
```bash
helm install heimdall moreh/heimdall \
  --version TODO \
  -n mif \
  -f heimdall-values.yaml
```
You can verify that the Heimdall pods are running as follows.
```bash
kubectl get all -n mif -l app.kubernetes.io/instance=heimdall
```
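If you prefer to block until the scheduler is up, a `kubectl wait` along the following lines returns once every Heimdall pod reports Ready (the timeout value is just an example):

```bash
# Wait up to 5 minutes for all Heimdall pods to become Ready.
kubectl -n mif wait pod \
  -l app.kubernetes.io/instance=heimdall \
  --for=condition=Ready --timeout=300s
```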
Before deploying an inference service, create your own Hugging Face access token under Hugging Face > Access Tokens. Then, to download the `meta-llama/Llama-3.2-1B-Instruct` model from Hugging Face, you need to accept the model license on the meta-llama/Llama-3.2-1B-Instruct model page. You may need to go through a similar approval process for other open-source models as well.
Create an `inference-service-values.yaml` file with the following contents. Replace `<huggingfaceToken>`, `<repository>`, and `<tag>` with your own values.
```yaml
global:
  imagePullSecrets:
    - name: moreh-registry
inferenceModel:
  modelName: meta-llama/Llama-3.2-1B-Instruct
  poolRef:
    name: heimdall
extraArgs:
  - "{{ .Values.inferenceModel.modelName }}"
  - --quantization
  - "None"
  - --tensor-parallel-size
  - "2"
  - --max-num-batched-tokens
  - "8192"
  - --no-enable-prefix-caching
  - --no-enable-log-requests
  - --disable-uvicorn-access-log
extraEnvVars:
  - name: HF_TOKEN
    value: "<huggingfaceToken>"
_tolerations: &tolerations
  - key: amd.com/gpu
    operator: Exists
    effect: NoSchedule
_resources: &resources
  limits:
    amd.com/gpu: "2"
  requests:
    amd.com/gpu: "2"
_monitorLabels: &monitorLabels
  prometheus: mif
decode:
  replicas: 2
  image:
    repository: "<repository>"
    tag: "<tag>"
  containerPorts:
    http: 8000
  resources: *resources
  tolerations: *tolerations
  podMonitor:
    labels: *monitorLabels
prefill:
  enabled: false
```
After that, you can install the Odin inference service by running the following command.
```bash
helm install inference-service moreh/inference-service \
  --version v0.1.0 \
  -n mif \
  -f inference-service-values.yaml
```
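If you would rather not store the token in the values file, one possible variant (a sketch, assuming the chart consumes `extraEnvVars` exactly as shown above) is to supply the environment variable at install time with Helm's array index `--set` syntax. Because `--set` replaces the whole `extraEnvVars` list from the values file, both the name and the value are given here:

```bash
# Hypothetical alternative: inject the Hugging Face token at install time
# instead of writing it into inference-service-values.yaml.
export HF_TOKEN=<huggingfaceToken>
helm install inference-service moreh/inference-service \
  --version v0.1.0 \
  -n mif \
  -f inference-service-values.yaml \
  --set "extraEnvVars[0].name=HF_TOKEN,extraEnvVars[0].value=$HF_TOKEN"
```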
You can verify that the inference service pods are running as follows.
```bash
kubectl get all -n mif -l app.kubernetes.io/instance=inference-service
```
```
NAME                                           READY   STATUS    RESTARTS   AGE
pod/inference-service-decode-fd954dc5d-6xmjj   1/1     Running   0          7m40s
pod/inference-service-decode-fd954dc5d-7wjhh   1/1     Running   0          7m40s

NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/inference-service-decode   2/2     2            2           7m40s

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/inference-service-decode-fd954dc5d   2         2         2       7m40s
```
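Model weights are downloaded from Hugging Face when the decode pods start, so the server may take a while to become responsive. One way to check progress is to tail the logs of the decode deployment shown in the listing above:

```bash
# Show the most recent log lines from a decode pod.
kubectl -n mif logs deployment/inference-service-decode --tail=50
```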
## Port forwarding
You can set up port forwarding as follows to send API requests to the inference endpoint from your local machine.
```bash
SERVICE=$(kubectl -n mif get service -l gateway.networking.k8s.io/gateway-name=mif -o name)
kubectl -n mif port-forward $SERVICE 8000:80
```
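With the port-forward in place, a quick way to confirm the endpoint is reachable is to list the served models. This sketch assumes the gateway routes the OpenAI-compatible `/v1/models` path through to the model servers:

```bash
# List the models served behind the gateway (jq is optional, for readability).
curl http://localhost:8000/v1/models | jq '.'
```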
## Usage
You can send a request to the inference endpoint as follows. Note that `jq` is used only to format the JSON response for better readability and is not required for the request to function.
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "messages": [
      {
        "role": "developer",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }' | jq '.'
```
```json
{
  "id": "chatcmpl-5613ccb4-d168-40df-a5b7-842ab4a00d6a",
  "object": "chat.completion",
  "created": 1761484035,
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today? Do you have a specific question or problem you'd like to talk about, or are you just looking for some information on a particular topic?",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 48,
    "total_tokens": 86,
    "completion_tokens": 38,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
```
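The endpoint is OpenAI-compatible, so streaming also works by adding `"stream": true` to the request body; the response then arrives as a sequence of server-sent `data:` chunks instead of a single JSON object. A minimal sketch:

```bash
# Stream tokens as they are generated instead of waiting for the full response.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "stream": true
  }'
```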