# Quickstart

Please make sure to install all prerequisites before starting this quickstart guide.

# Deployment

Add the Moreh Helm chart repository.

```bash
helm repo add moreh https://moreh-dev.github.io/helm-charts
helm repo update moreh
```
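
If you want to double-check that the repository was added correctly, `helm search repo` is a standard Helm command that lists the charts it serves:

```bash
# Optional: list the charts published in the moreh repository
helm search repo moreh
```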

First, create a `heimdall-values.yaml` file as shown below and install the Heimdall scheduler using this file. If you are using Istio instead of Kgateway, set `gateway.gatewayClassName` to `istio`.

heimdall-values.yaml

```yaml
global:
  imagePullSecrets:
    - name: moreh-registry

config:
  apiVersion: inference.networking.x-k8s.io/v1alpha1
  kind: EndpointPickerConfig
  plugins:
    - type: single-profile-handler
    - type: queue-scorer
    - type: max-score-picker
      parameters:
        maxNumOfEndpoints: 2
  schedulingProfiles:
    - name: default
      plugins:
        - pluginRef: queue-scorer
          weight: 1
        - pluginRef: max-score-picker

gateway:
  name: mif
  gatewayClassName: kgateway
```

Then install the Heimdall chart with this values file:

```bash
helm install heimdall moreh/heimdall \
    --version TODO \
    -n mif \
    -f heimdall-values.yaml
```
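
If you prefer not to edit the values file for the Istio case, the same setting can be overridden on the command line instead; `--set` is a standard Helm flag:

```bash
# Alternative for Istio users: override the gateway class at install time
helm install heimdall moreh/heimdall \
    --version TODO \
    -n mif \
    -f heimdall-values.yaml \
    --set gateway.gatewayClassName=istio
```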

You can verify that the Heimdall pods are running as follows.

```bash
kubectl get all -n mif -l app.kubernetes.io/instance=heimdall
```
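
If you prefer to block until the scheduler is ready rather than re-running the command above, `kubectl wait` works with the same label selector:

```bash
# Wait up to two minutes for all Heimdall pods to become Ready
kubectl -n mif wait pod \
  -l app.kubernetes.io/instance=heimdall \
  --for=condition=Ready --timeout=120s
```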

Before deploying an inference service, create a Hugging Face access token from Hugging Face / Access Tokens. To download the meta-llama/Llama-3.2-1B-Instruct model from Hugging Face, you also need to accept the model license at meta-llama/Llama-3.2-1B-Instruct. A similar approval process may be required for other open-source models as well.
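
If you want to confirm up front that your token can access the gated model, a quick check like the one below can help. This assumes the `huggingface_hub` CLI is installed locally and only fetches the model's config file:

```bash
# Succeeds only if the token is valid and the model license has been accepted
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct config.json \
  --token <huggingfaceToken>
```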

Create an `inference-service-values.yaml` file with the following contents, replacing `<huggingfaceToken>`, `<repository>`, and `<tag>` with your own values.

inference-service-values.yaml

```yaml
global:
  imagePullSecrets:
    - name: moreh-registry

inferenceModel:
  modelName: meta-llama/Llama-3.2-1B-Instruct
  poolRef:
    name: heimdall

extraArgs:
  - "{{ .Values.inferenceModel.modelName }}"
  - --quantization
  - "None"
  - --tensor-parallel-size
  - "2"
  - --max-num-batched-tokens
  - "8192"
  - --no-enable-prefix-caching
  - --no-enable-log-requests
  - --disable-uvicorn-access-log

extraEnvVars:
  - name: HF_TOKEN
    value: "<huggingfaceToken>"

_tolerations: &tolerations
  - key: amd.com/gpu
    operator: Exists
    effect: NoSchedule

_resources: &resources
  limits:
    amd.com/gpu: "2"
  requests:
    amd.com/gpu: "2"

_monitorLabels: &monitorLabels
  prometheus: mif

decode:
  replicas: 2

  image:
    repository: "<repository>"
    tag: "<tag>"

  containerPorts:
    http: 8000

  resources: *resources
  tolerations: *tolerations
  podMonitor:
    labels: *monitorLabels

prefill:
  enabled: false
```
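
The decode pods each request two `amd.com/gpu` devices, so it is worth confirming beforehand that your nodes actually advertise this extended resource. This is a generic kubectl query, independent of the chart:

```bash
# Show how many amd.com/gpu devices each node reports as allocatable
kubectl get nodes \
  -o custom-columns='NODE:.metadata.name,AMD_GPU:.status.allocatable.amd\.com/gpu'
```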

After that, you can install the Odin inference service by running the following command.

```bash
helm install inference-service moreh/inference-service \
    --version v0.1.0 \
    -n mif \
    -f inference-service-values.yaml
```

You can verify that the inference service pods are running as follows.

```bash
kubectl get all -n mif -l app.kubernetes.io/instance=inference-service
```

Expected output

```
NAME                                           READY   STATUS    RESTARTS   AGE
pod/inference-service-decode-fd954dc5d-6xmjj   1/1     Running   0          7m40s
pod/inference-service-decode-fd954dc5d-7wjhh   1/1     Running   0          7m40s

NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/inference-service-decode   2/2     2            2           7m40s

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/inference-service-decode-fd954dc5d   2         2         2       7m40s
```
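
If the pods stay in a non-Ready state for a while (for example, while model weights are still being downloaded), the decode deployment's logs are the first place to look. The deployment name comes from the output above:

```bash
# Follow the decode server logs to watch model loading progress
kubectl -n mif logs deployment/inference-service-decode --tail=50 -f
```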

# Port forwarding

You can set up port forwarding as follows to send API requests to the inference endpoint from your local machine.

```bash
SERVICE=$(kubectl -n mif get service -l gateway.networking.k8s.io/gateway-name=mif -o name)
kubectl -n mif port-forward $SERVICE 8000:80
```
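
Once the port-forward is running, listing the served models is a quick way to confirm that requests reach the inference endpoint. `/v1/models` is part of the OpenAI-compatible API; this assumes the gateway routes that path to the model server:

```bash
# Should list meta-llama/Llama-3.2-1B-Instruct if routing works end to end
curl http://localhost:8000/v1/models | jq '.'
```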

# Usage

You can send a request to the inference endpoint as follows. Note that jq is used only to format the JSON response for better readability and is not required for the request to function.

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "messages": [
      {
        "role": "developer",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }' | jq '.'
```
Response

```json
{
  "id": "chatcmpl-5613ccb4-d168-40df-a5b7-842ab4a00d6a",
  "object": "chat.completion",
  "created": 1761484035,
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today? Do you have a specific question or problem you'd like to talk about, or are you just looking for some information on a particular topic?",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 48,
    "total_tokens": 86,
    "completion_tokens": 38,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
```
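
Because the endpoint exposes an OpenAI-compatible chat completions API, streaming responses should also work by adding `"stream": true` to the request body. A minimal sketch of the same request in streaming mode:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```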