# Quickstart

In this quickstart, we will launch two vLLM instances (pods) serving the Llama 3.2 1B Instruct model and expose them through a single endpoint as an example. Make sure all prerequisites are installed before starting this quickstart guide.

# Gateway

First, create a gateway.yaml file to add the Gateway resource to the mif namespace. The contents of the gateway.yaml file are provided for both Istio and Kgateway; use the variant that matches the gateway implementation installed in your cluster.
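
This guide assumes the mif namespace already exists; it is typically created while setting up the prerequisites. If it does not exist yet, a plain namespace is enough for this quickstart.

kubectl create namespace mif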

gateway.yaml (Istio)
apiVersion: v1
kind: ConfigMap
metadata:
  name: mif-gateway-infrastructure
  namespace: mif
data:
  service: |
    spec:
      type: ClusterIP
  deployment: |
    spec:
      template:
        metadata:
          annotations:
            proxy.istio.io/config: |
              accessLogFile: /dev/stdout
              accessLogEncoding: JSON
        spec:
          containers:
            - name: istio-proxy
              resources:
                limits: null

---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: mif
  namespace: mif
spec:
  gatewayClassName: istio
  infrastructure:
    parametersRef:
      group: ""
      kind: ConfigMap
      name: mif-gateway-infrastructure
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All

gateway.yaml (Kgateway)
apiVersion: gateway.kgateway.dev/v1alpha1
kind: GatewayParameters
metadata:
  name: mif-gateway-infrastructure
  namespace: mif
spec:
  kube:
    service:
      type: ClusterIP

---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: mif
  namespace: mif
spec:
  gatewayClassName: kgateway
  infrastructure:
    parametersRef:
      group: gateway.kgateway.dev
      kind: GatewayParameters
      name: mif-gateway-infrastructure
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All

Apply the manifest to create the Gateway.

kubectl apply -f gateway.yaml

You can verify that the gateway pod has been deployed for the Gateway resource using the following command.

kubectl get pod -n mif -l gateway.networking.k8s.io/gateway-name=mif
Expected output
NAME                         READY   STATUS    RESTARTS   AGE
mif-istio-78789865b7-747nz   1/1     Running   0          27s
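
Optionally, you can also inspect the Gateway resource itself and wait for it to report the Programmed condition defined by the Gateway API; the timeout below is an arbitrary example value.

kubectl wait --for=condition=Programmed -n mif gateway/mif --timeout=120s
kubectl get gateway -n mif mif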

# Deployment

Add the Moreh Helm chart repository.

helm repo add moreh https://moreh-dev.github.io/helm-charts
helm repo update moreh
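
To confirm that the repository was added and to see which chart versions are available, you can optionally search it; the versions shown in this guide may differ from the latest published ones.

helm search repo moreh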

First, create a heimdall-values.yaml file as shown below and install the Heimdall scheduler using this file. If you are using Kgateway instead of Istio, set gateway.gatewayClassName to kgateway in the values file.

heimdall-values.yaml
global:
  imagePullSecrets:
    - name: moreh-registry

config:
  apiVersion: inference.networking.x-k8s.io/v1alpha1
  kind: EndpointPickerConfig
  plugins:
    - type: single-profile-handler
    - type: queue-scorer
    - type: max-score-picker
  schedulingProfiles:
    - name: default
      plugins:
        - pluginRef: queue-scorer
        - pluginRef: max-score-picker

gateway:
  name: mif
  gatewayClassName: istio

serviceMonitor:
  labels:
    release: prometheus-stack
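
Both this file and the inference service values below reference an image pull secret named moreh-registry, which is assumed to already exist in the mif namespace (it is normally created as part of the prerequisites). If it is missing, the pods will fail to pull their images. A secret of this form can be created as follows; the registry address and credentials are placeholders that must be replaced with the values provided to you.

kubectl create secret docker-registry moreh-registry \
    -n mif \
    --docker-server=<registry-server> \
    --docker-username=<username> \
    --docker-password=<password>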

Install the chart with the values file you just created.

helm upgrade -i heimdall moreh/heimdall \
    --version v0.5.0 \
    -n mif \
    -f heimdall-values.yaml

You can verify that the Heimdall pods are running as follows.

kubectl get all -n mif -l app.kubernetes.io/instance=heimdall
Expected output
NAME                            READY   STATUS    RESTARTS   AGE
pod/heimdall-7d54fcbfff-chw94   1/1     Running   0          70s

NAME               TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
service/heimdall   ClusterIP   10.110.35.57   <none>        9002/TCP,9090/TCP,5557/TCP   70s

NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/heimdall   1/1     1            1           70s

NAME                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/heimdall-7d54fcbfff   1         1         1       70s
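
If the Heimdall pod does not reach the Running state, its logs are usually the quickest way to find the cause.

kubectl -n mif logs deployment/heimdall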

Before deploying an inference service, create your own Hugging Face access token from Hugging Face / Access Tokens. To download the meta-llama/Llama-3.2-1B-Instruct model from Hugging Face, you also need to accept the model license at meta-llama/Llama-3.2-1B-Instruct. A similar approval process may be required for other open models as well.
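
To confirm that your token has been granted access to the gated model before deploying, you can optionally query the Hugging Face Hub API; with an authorized token the command returns the model metadata, otherwise it returns an authorization error. Replace <huggingfaceToken> with your own token, as in the values file below.

curl -s -H "Authorization: Bearer <huggingfaceToken>" \
    https://huggingface.co/api/models/meta-llama/Llama-3.2-1B-Instruct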

Create an inference-service-values.yaml file with the following contents, replacing <huggingfaceToken> with your own token. The _common section defines a YAML anchor (&common) whose settings are merged into both the decode and prefill sections via <<: *common, so the image, resource, and monitoring settings only need to be written once.

inference-service-values.yaml
global:
  imagePullSecrets:
    - name: moreh-registry

extraArgs:
  - meta-llama/Llama-3.2-1B-Instruct
  - --quantization
  - "None"
  - --tensor-parallel-size
  - "2"
  - --max-num-batched-tokens
  - "8192"
  - --no-enable-prefix-caching
  - --no-enable-log-requests
  - --disable-uvicorn-access-log

extraEnvVars:
  - name: HF_TOKEN
    value: "<huggingfaceToken>"

_common: &common
  image:
    repository: 255250787067.dkr.ecr.ap-northeast-2.amazonaws.com/quickstart/moreh-vllm
    tag: "20250915.1"

  resources:
    requests:
      amd.com/gpu: "2"
    limits:
      amd.com/gpu: "2"

  podMonitor:
    labels:
      release: prometheus-stack

decode:
  replicas: 2

  <<: *common

prefill:
  enabled: false

  <<: *common

After that, you can install the Odin inference service by running the following command.

helm upgrade -i inference-service moreh/inference-service \
    --version v0.3.1 \
    -n mif \
    -f inference-service-values.yaml

You can verify that the inference service pods are running as follows.

kubectl get all -n mif -l app.kubernetes.io/instance=inference-service
Expected output
NAME                                           READY   STATUS    RESTARTS   AGE
pod/inference-service-decode-fd954dc5d-6xmjj   1/1     Running   0          7m40s
pod/inference-service-decode-fd954dc5d-7wjhh   1/1     Running   0          7m40s

NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/inference-service-decode   2/2     2            2           7m40s

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/inference-service-decode-fd954dc5d   2         2         2       7m40s
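
Note that the decode pods download the model from Hugging Face on startup, so it can take a few minutes before they are ready to serve requests. You can follow the startup progress in the logs.

kubectl -n mif logs deployment/inference-service-decode -f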

# Usage

You can set up port forwarding as follows to send API requests to the inference endpoint from your local machine. The port-forward command runs in the foreground, so keep it running in a separate terminal while you send requests.

SERVICE=$(kubectl -n mif get service -l gateway.networking.k8s.io/gateway-name=mif -o name)
kubectl -n mif port-forward $SERVICE 8000:80

You can send a request to the inference endpoint as follows. Note that jq is used only to format the JSON response for better readability and is not required for the request to function.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "messages": [
      {
        "role": "developer",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }' | jq '.'
Response
{
  "id": "chatcmpl-5613ccb4-d168-40df-a5b7-842ab4a00d6a",
  "object": "chat.completion",
  "created": 1761484035,
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today? Do you have a specific question or problem you'd like to talk about, or are you just looking for some information on a particular topic?",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 48,
    "total_tokens": 86,
    "completion_tokens": 38,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}
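
The endpoint is OpenAI-compatible, so you can also request a streaming response by setting "stream": true in the request body; tokens are then returned as server-sent events. The example below uses curl -N to disable output buffering and omits jq, since the streamed chunks are not a single JSON document.

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-1B-Instruct",
    "stream": true,
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'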

# Cleanup

To delete all the resources created in this quickstart, run the following commands.

helm uninstall -n mif inference-service
helm uninstall -n mif heimdall
kubectl delete -f gateway.yaml
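
If you no longer need the Moreh chart repository on your machine, you can optionally remove it as well.

helm repo remove moreh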