# Resource allocation

This document describes how to allocate resources (accelerators and NICs) to your inference containers, select specific nodes using node selectors and node affinity, and handle taints. When an InferenceService is created, it generates a Deployment or a LeaderWorkerSet, which ultimately results in the creation of Pods. Therefore, placing a Pod is synonymous with placing an inference container in this context.

Before starting this guide, make sure to install all prerequisites.


# Allocation examples

Before guiding you through individual settings, let's look at a complete example of allocating the desired resources. This example allocates 8x AMD MI300X GPUs and all RDMA NIC devices to the InferenceService:

```yaml
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceService
spec:
  template:
    spec:
      containers:
        - name: main
          resources:
            limits:
              amd.com/gpu: "8"
              mellanox/hca: "1"
            requests:
              amd.com/gpu: "8"
              mellanox/hca: "1"
      nodeSelector:
        moai.moreh.io/accelerator.vendor: amd
        moai.moreh.io/accelerator.model: mi300x
      tolerations:
        - key: "amd.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
```

If you use a LeaderWorkerSet, you can allocate resources to the worker containers by specifying them in the `workerTemplate` as well:

```yaml
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceService
spec:
  workerTemplate:
    spec:
      containers:
        - name: main
          resources:
            limits:
              amd.com/gpu: "8"
              mellanox/hca: "1"
            requests:
              amd.com/gpu: "8"
              mellanox/hca: "1"
      nodeSelector:
        moai.moreh.io/accelerator.vendor: amd
        moai.moreh.io/accelerator.model: mi300x
      tolerations:
        - key: "amd.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
```

# Resource requests and limits

You can request resources by setting `requests` and `limits` in the `resources` section. Each resource type is identified by the following resource names:

- AMD GPU: `amd.com/gpu`
- NVIDIA GPU: `nvidia.com/gpu`
- RDMA NIC: `mellanox/hca`

For example, you can specify resources as follows. Note that, for RDMA NICs, you do not specify the number to use; instead, once requested, the Pod is granted access to all RDMA NICs on the node.

```yaml
# AMD GPUs
resources:
  limits:
    amd.com/gpu: "4"
  requests:
    amd.com/gpu: "4"
```

```yaml
# NVIDIA GPUs
resources:
  limits:
    nvidia.com/gpu: "4"
  requests:
    nvidia.com/gpu: "4"
```

```yaml
# RDMA NICs: requesting "1" grants access to all RDMA NICs on the node
resources:
  limits:
    mellanox/hca: "1"
  requests:
    mellanox/hca: "1"
```
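For multi-node or RDMA-enabled inference, a container typically requests GPUs and the RDMA NIC resource together, as in the complete example at the top of this document. A minimal sketch of a combined `resources` section, assuming AMD GPUs (substitute `nvidia.com/gpu` for NVIDIA nodes):

```yaml
resources:
  limits:
    amd.com/gpu: "8"     # all eight GPUs on the node
    mellanox/hca: "1"    # grants access to all RDMA NICs on the node
  requests:
    amd.com/gpu: "8"
    mellanox/hca: "1"
```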

# Node selection

Resource names alone cannot distinguish GPU types. If you require a specific GPU type for inference containers, you must additionally use either nodeSelector or affinity. You only need to use one of the two options. Node selectors are suitable if you simply want to select GPU types, while node affinity allows you to define more complex conditions.

# Accelerator labels

MoAI Inference Framework automatically detects the vendors and models of accelerators in the cluster and assigns the following labels accordingly, so they can be used with node selectors or node affinity.

- `moai.moreh.io/accelerator.vendor`: The vendor of the accelerator (e.g., amd).
- `moai.moreh.io/accelerator.model`: The specific model of the accelerator.

| Vendor | Models |
| --- | --- |
| amd | mi355x, mi350x, mi325x, mi300x, mi308x, mi250x, mi250, mi210, mi100 |
| nvidia | h200-sxm, h100-sxm, a100-80gb-sxm, a100-80gb-pcie, a100-40gb-sxm, a100-40gb-pcie |

For more detailed information, please refer to the supported devices.

# Example: Selecting MI300X nodes

To schedule your workload specifically on nodes with AMD MI300X GPUs, use the nodeSelector or affinity fields as follows.

Using nodeSelector:

```yaml
nodeSelector:
  moai.moreh.io/accelerator.vendor: amd
  moai.moreh.io/accelerator.model: mi300x
```

Using affinity:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: moai.moreh.io/accelerator.vendor
              operator: In
              values:
                - amd
            - key: moai.moreh.io/accelerator.model
              operator: In
              values:
                - mi300x
            - key: kubernetes.io/hostname
              operator: In
              values:
                - <nodeName>
                - <nodeName>
```
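The same approach works for any vendor and model in the accelerator labels table above. For example, to target NVIDIA H100 nodes (a sketch using the `h100-sxm` model label):

```yaml
nodeSelector:
  moai.moreh.io/accelerator.vendor: nvidia
  moai.moreh.io/accelerator.model: h100-sxm
```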

# Taints and tolerations

Nodes equipped with GPUs are tainted to prevent accidental scheduling of non-GPU workloads. To schedule a Pod on these nodes, you must include a matching toleration in your Pod specification. Note that a toleration allows a Pod to be scheduled on a tainted node; it does not require the node to be tainted.

AMD node taint:

```yaml
taints:
  - key: "amd.com/gpu"
    effect: "NoSchedule"
```

Matching Pod toleration:

```yaml
tolerations:
  - key: "amd.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```

NVIDIA node taint:

```yaml
taints:
  - key: "nvidia.com/gpu"
    effect: "NoSchedule"
```

Matching Pod toleration:

```yaml
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```
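In an InferenceService, tolerations belong under `spec.template.spec` (and under `spec.workerTemplate.spec` when using a LeaderWorkerSet), alongside the node selector. A minimal sketch for NVIDIA nodes, combining the fields shown earlier:

```yaml
apiVersion: odin.moreh.io/v1alpha1
kind: InferenceService
spec:
  template:
    spec:
      nodeSelector:
        moai.moreh.io/accelerator.vendor: nvidia
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
```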