# 5. Changing the Number of GPUs

Let's rerun the fine-tuning task with a different number of GPUs. MoAI Platform abstracts GPU resources into a single accelerator and automatically performs parallel processing. Therefore, there is no need to modify the PyTorch script even when changing the number of GPUs.

# Changing the Accelerator Type

Switch the accelerator type using the moreh-switch-model tool. For instructions on changing the accelerator, refer back to 3. Model fine-tuning.

$ moreh-switch-model

Please contact your infrastructure provider and choose one of the following options before proceeding.

  • AMD MI250 GPU with 32 units
    • When using Moreh's trial container: select 8xlarge
    • When using KT Cloud's Hyperscale AI Computing: select 8xLarge.4096GB
  • AMD MI210 GPU with 64 units
  • AMD MI300X GPU with 16 units

# Training Parameters

Run the train_llm.py script again.

python train_llm.py \
    --lr 0.000001 \
    --model meta-llama/Meta-Llama-3-8B \
    --dataset bitext/Bitext-customer-support-llm-chatbot-training-dataset \
    --train-batch-size 1024 \
    --eval-batch-size 1024 \
    --sequence-length 1024 \
    --log-interval 10 \
    --num-epochs 5 \
    --output-dir llama3-finetuned

Since the available GPU memory has doubled, let's increase the batch size from the previous 256 to 1024 and run the code again.

...
[info] Got DBs from backend for auto config.
[info] Requesting resources for MoAI Accelerator from the server...
[info] Initializing the worker daemon for MoAI Accelerator
[info] [1/8] Connecting to resources on the server (192.168.xxx.x:xxxxx)...
[info] [2/8] Connecting to resources on the server (192.168.xxx.x:xxxxx)...
[info] [3/8] Connecting to resources on the server (192.168.xxx.x:xxxxx)...
[info] [4/8] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] [5/8] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] [6/8] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] [7/8] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] [8/8] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] Establishing links to the resources...
[info] MoAI Accelerator is ready to use.
[info] Moreh Version: 24.11.0
[info] Moreh Job ID: 991551
[info] The number of candidates is 102.
[info] Parallel Graph Compile start...
[info] Elapsed Time to compile all candidates = 164346 [ms]
[info] Parallel Graph Compile finished.
[info] The number of possible candidates is 35.
[info] SelectBestGraphFromCandidates start...
[info] Elapsed Time to compute cost for survived candidates = 27038 [ms]
[info] SelectBestGraphFromCandidates finished.
[info] Configuration for parallelism is selected.
[info] No PP, No TP, micro_batching_enabled : true, num_micro_batches : 2, batch_per_device : 8, recomputation : default(1), distribute_param : true, distribute_low_prec_param : true
[info] train: true
{'warmup_duration': 501.28}
{'loss': 4.4444, 'grad_norm': 1044.89, 'learning_rate': 0.0, 'epoch': 0.4, 'step': 10, 'tps': 63017.95, 'duration': 149.75, 'max_seq_len': 1024}
{'loss': 3.1531, 'grad_norm': 847.93, 'learning_rate': 0.0, 'epoch': 0.8, 'step': 20, 'tps': 63390.05, 'duration': 165.42, 'max_seq_len': 1024}
...
Saving model checkpoint to $SAVE_DIR
Configuration saved in $SAVE_DIR/config.json
Configuration saved in $SAVE_DIR/generation_config.json

If training proceeds normally, you will see logs similar to those of the previous run, but with higher throughput because the number of GPUs has doubled.

  • AMD MI250 GPU, 16 → 32 units: throughput rises from approximately 40,000 tokens/sec to approximately 63,000 tokens/sec.
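
To put the throughput figures in context, the following is a minimal sketch that reads the tps values from the metric lines in the log above and compares them with the approximate 16-GPU figure. It assumes the metric lines are Python-style dictionary literals, as they appear in the log; the helper names parse_metrics and scaling_summary are hypothetical and are not part of MoAI Platform or the training script. Because the batch size also changed between the two runs, treat the result as a rough indication rather than a controlled measurement.

```python
import ast

# Lines copied from the training log above; the warmup line has no 'tps' key.
LOG_LINES = [
    "{'warmup_duration': 501.28}",
    "{'loss': 4.4444, 'grad_norm': 1044.89, 'learning_rate': 0.0, 'epoch': 0.4, "
    "'step': 10, 'tps': 63017.95, 'duration': 149.75, 'max_seq_len': 1024}",
    "{'loss': 3.1531, 'grad_norm': 847.93, 'learning_rate': 0.0, 'epoch': 0.8, "
    "'step': 20, 'tps': 63390.05, 'duration': 165.42, 'max_seq_len': 1024}",
]


def parse_metrics(lines):
    """Return the dict-style log lines that report a 'tps' value (hypothetical helper)."""
    records = []
    for line in lines:
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue  # skip [info] lines and other non-metric output
        try:
            record = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue  # skip lines that are not valid literals
        if isinstance(record, dict) and "tps" in record:
            records.append(record)
    return records


def scaling_summary(tps_before, tps_after, gpus_before, gpus_after):
    """Return (speedup, efficiency), where efficiency is speedup per GPU-count ratio."""
    speedup = tps_after / tps_before
    efficiency = speedup / (gpus_after / gpus_before)
    return speedup, efficiency


records = parse_metrics(LOG_LINES)
mean_tps = sum(r["tps"] for r in records) / len(records)

BASELINE_TPS = 40_000  # approximate 16-GPU throughput quoted above
speedup, efficiency = scaling_summary(BASELINE_TPS, mean_tps, 16, 32)

print(f"mean tps: {mean_tps:.0f}")
print(f"speedup: {speedup:.2f}x, scaling efficiency: {efficiency:.0%}")
```

With the two logged steps, this prints a mean of about 63,204 tokens/sec, a speedup of about 1.58x, and a scaling efficiency of about 79% relative to perfect 2x scaling.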