# 5. Changing the Number of GPUs

Let's rerun the fine-tuning task with a different number of GPUs. MoAI Platform abstracts GPU resources into a single accelerator and automatically performs parallel processing. Therefore, there is no need to modify the PyTorch script even when changing the number of GPUs.

## Changing the Accelerator Type

Switch the accelerator type using the `moreh-switch-model` tool. For detailed instructions on changing the accelerator, refer back to section 3, Model fine-tuning.

$ moreh-switch-model

Please contact your infrastructure provider and choose one of the following options before proceeding.

- AMD MI250 GPU with 32 units
  - When using Moreh's trial container: select 8xlarge
  - When using KT Cloud's Hyperscale AI Computing: select 8xLarge.4096GB
- AMD MI210 GPU with 64 units
- AMD MI300X GPU with 16 units

## Training Parameters

Run the `train_llm.py` script again, this time with the following parameters.

python train_llm.py \
    --lr 0.000001 \
    --model meta-llama/Meta-Llama-3-8B \
    --dataset bitext/Bitext-customer-support-llm-chatbot-training-dataset \
    --train-batch-size 1024 \
    --eval-batch-size 1024 \
    --sequence-length 1024 \
    --log-interval 10 \
    --num-epochs 5 \
    --output-dir llama3-finetuned

Because the number of GPUs has doubled, the available GPU memory has doubled as well, so this run increases the batch size from the previous 256 to 1024.
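To get a feel for what this batch size means in practice, the following sketch computes how many tokens one training step processes. It uses only the `--train-batch-size` and `--sequence-length` values from the command above. It assumes that every sample is counted at the full sequence length, which is an illustrative simplification rather than documented behavior. The `seconds_per_step` helper is hypothetical and not part of the tutorial script.

```python
# Values taken from the training command above.
train_batch_size = 1024   # --train-batch-size
sequence_length = 1024    # --sequence-length

# Assumption: every sample is counted at the full sequence length.
tokens_per_step = train_batch_size * sequence_length
print(f"tokens per step: {tokens_per_step:,}")


def seconds_per_step(tokens_per_sec: float) -> float:
    """Hypothetical helper: rough time for one step at a given throughput."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return tokens_per_step / tokens_per_sec


# Example with a throughput of 63,000 tokens/sec.
print(f"approx. seconds per step: {seconds_per_step(63_000):.1f}")
```

Under this assumption, a single step covers about one million tokens, so each step takes several seconds even at tens of thousands of tokens per second.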

...
[info] Got DBs from backend for auto config.
[info] Requesting resources for MoAI Accelerator from the server...
[info] Initializing the worker daemon for MoAI Accelerator
[info] [1/8] Connecting to resources on the server (192.168.xxx.x:xxxxx)...
[info] [2/8] Connecting to resources on the server (192.168.xxx.x:xxxxx)...
[info] [3/8] Connecting to resources on the server (192.168.xxx.x:xxxxx)...
[info] [4/8] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] [5/8] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] [6/8] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] [7/8] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] [8/8] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] Establishing links to the resources...
[info] MoAI Accelerator is ready to use.
[info] Moreh Version: 24.11.0
[info] Moreh Job ID: 991551
[info] The number of candidates is 102.
[info] Parallel Graph Compile start...
[info] Elapsed Time to compile all candidates = 164346 [ms]
[info] Parallel Graph Compile finished.
[info] The number of possible candidates is 35.
[info] SelectBestGraphFromCandidates start...
[info] Elapsed Time to compute cost for survived candidates = 27038 [ms]
[info] SelectBestGraphFromCandidates finished.
[info] Configuration for parallelism is selected.
[info] No PP, No TP, micro_batching_enabled : true, num_micro_batches : 2, batch_per_device : 8, recomputation : default(1), distribute_param : true, distribute_low_prec_param : true
[info] train: true
{'warmup_duration': 501.28}                                                        
{'loss': 4.4444, 'grad_norm': 1044.89, 'learning_rate': 0.0, 'epoch': 0.4, 'step': 10, 'tps': 63017.95, 'duration': 149.75, 'max_seq_len': 1024}
{'loss': 3.1531, 'grad_norm': 847.93, 'learning_rate': 0.0, 'epoch': 0.8, 'step': 20, 'tps': 63390.05, 'duration': 165.42, 'max_seq_len': 1024}
...
Saving model checkpoint to $SAVE_DIR
Configuration saved in $SAVE_DIR/config.json
Configuration saved in $SAVE_DIR/generation_config.json

If training proceeds normally, the logs will resemble those of the previous run, but with higher throughput because the number of GPUs has doubled.
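To compare throughput between runs without reading the log by eye, the metric lines can be parsed programmatically. The sketch below assumes that each metric line is a valid Python dict literal, as the two sample lines in the log above are. The `parse_metrics` function is a hypothetical helper, not part of the MoAI Platform tooling.

```python
import ast

# A few lines copied from the log output above.
log_text = """\
[info] train: true
{'warmup_duration': 501.28}
{'loss': 4.4444, 'grad_norm': 1044.89, 'learning_rate': 0.0, 'epoch': 0.4, 'step': 10, 'tps': 63017.95, 'duration': 149.75, 'max_seq_len': 1024}
{'loss': 3.1531, 'grad_norm': 847.93, 'learning_rate': 0.0, 'epoch': 0.8, 'step': 20, 'tps': 63390.05, 'duration': 165.42, 'max_seq_len': 1024}
"""


def parse_metrics(text: str) -> list[dict]:
    """Return the dict-style log lines that report a 'tps' value."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        # Skip '[info] ...' lines and anything else that is not a dict literal.
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            record = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue
        if isinstance(record, dict) and "tps" in record:
            records.append(record)
    return records


records = parse_metrics(log_text)
mean_tps = sum(r["tps"] for r in records) / len(records)
print(f"{len(records)} records, mean throughput {mean_tps:,.0f} tokens/sec")
# prints: 2 records, mean throughput 63,204 tokens/sec
```

`ast.literal_eval` is used instead of `eval` because it only accepts literals, which makes it safe to run on log text.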

- AMD MI250 GPU, 16 → 32 units: throughput increases from approximately 40,000 tokens/sec to approximately 63,000 tokens/sec.
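As a rough way to read these numbers, the sketch below computes the speedup and the scaling efficiency relative to ideal linear scaling. The throughput figures are the approximate values quoted above. Note that the two runs also use different batch sizes (256 and 1024), so the result is an indicative estimate rather than a controlled benchmark.

```python
# Approximate figures quoted above for AMD MI250.
gpus_before, gpus_after = 16, 32
tps_before, tps_after = 40_000, 63_000  # tokens/sec, approximate

# Measured speedup versus the speedup perfect linear scaling would give.
speedup = tps_after / tps_before
ideal_speedup = gpus_after / gpus_before
scaling_efficiency = speedup / ideal_speedup

print(f"speedup: {speedup:.3f}x (ideal: {ideal_speedup:.1f}x)")
print(f"scaling efficiency: {scaling_efficiency * 100:.2f}%")
```

A scaling efficiency below 100% is expected when adding GPUs, since communication between devices takes up a larger share of each step.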
