# 3. Model Fine-tuning

Now, we will actually execute the fine-tuning process.

# Setting Accelerator Flavor

In MoAI Platform, physical GPUs are not directly exposed to users. Instead, virtual MoAI Accelerators are provided, which are available for use in PyTorch. By setting the accelerator's flavor, you can determine how much of the physical GPU will be utilized by PyTorch. Since the total training time and GPU usage cost vary depending on the selected accelerator flavor, users should make decisions based on their training scenarios.

Please refer to the document above or reach out to your infrastructure provider to inquire about the GPU types and quantities corresponding to each flavor.

AMD MI250 GPU with 16 units:
- Select 4xlarge when using Moreh's trial container.
- Select 4xLarge.2048GB when using KT Cloud's Hyperscale AI Computing.
AMD MI210 GPU with 32 units.
AMD MI300X GPU with 8 units.

Remember when we checked the MoAI Accelerator in the Llama3 8B Fine-tuning - Getting Started? Now let's set up the accelerator needed for learning.

First, we'll use the moreh-smi command to check the currently used MoAI Accelerator.

$ moreh-smi
+----------------------------------------------------------------------------------------------------+
|                                                  Current Version: 24.11.0  Latest Version: 24.11.0 |
+----------------------------------------------------------------------------------------------------+
|  Device  |        Name         |      Model     |  Memory Usage  |  Total Memory  |  Utilization   |
+====================================================================================================+
|  * 0     |   MoAI Accelerator  |  xLarge.512GB  |  -             |  -             |  -             |
+----------------------------------------------------------------------------------------------------+

The current MoAI Accelerator in use has a memory size of 512GB.

You can utilize the moreh-switch-model command to review the available accelerator flavors on the current system. For seamless model training, consider using the moreh-switch-modelcommand to switch to a MoAI Accelerator with larger memory capacity.

$ moreh-switch-model
Current MoAI Accelerator: xLarge.512GB

Small.64GB
Medium.128GB
Large.256GB
xLarge.512GB  *
1.5xLarge.768GB
2xLarge.1024GB
3xLarge.1536GB
4xLarge.2048GB
6xLarge.3072GB
8xLarge.4096GB
12xLarge.6144GB
24xLarge.12288GB
48xLarge.24576GB

You can enter the number to switch to a different flavor.

In this tutorial, we will use a 2048GB-sized MoAI Accelerator.

Therefore, after switching from the initially set Large.256GB flavor to 4xLarge.2048GB , we will use the moreh-smi command to confirm that the change has been successfully applied.

Enter 8 to use 4xLarge.2048GB

Selection (1-13, q, Q): 8
The MoAI Accelerator model is successfully switched to  "4xLarge.2048GB".

Small.64GB
Medium.128GB
Large.256GB
xLarge.512GB
1.5xLarge.768GB
2xLarge.1024GB
3xLarge.1536GB
4xLarge.2048GB  *
6xLarge.3072GB
8xLarge.4096GB
12xLarge.6144GB
24xLarge.12288GB
48xLarge.24576GB

Selection (1-13, q, Q): q

Enter q to complete the change.

To confirm that the changes have been successfully applied, use the moreh-smi command again to check the currently used MoAI Accelerator.

$ moreh-smi
+------------------------------------------------------------------------------------------------------+
|                                                    Current Version: 24.11.0  Latest Version: 24.11.0 |
+------------------------------------------------------------------------------------------------------+
|  Device  |        Name         |       Model      |  Memory Usage  |  Total Memory  |  Utilization   |
+======================================================================================================+
|  * 0     |   MoAI Accelerator  |  4xLarge.2048GB  |  -             |  -             |  -             |
+------------------------------------------------------------------------------------------------------+

Now you can see that it has been successfully changed to 4xLarge.2048GB .

# Training Execution

Execute the train_llama3.py script below.

cd ~/quickstart
python train_llm.py \
    --lr 0.000001 \
    --model meta-llama/Meta-Llama-3-8B \
    --dataset bitext/Bitext-customer-support-llm-chatbot-training-dataset \
    --train-batch-size 256 \
    --eval-batch-size 256 \
    --sequence-length 1024 \
    --log-interval 10 \
    --num-epochs 5 \
    --output-dir $SAVE_DIR

If the training proceeds smoothly, you should see the following logs. By going through this logs, you can verify that the Advanced Parallelism feature, which determines the optimal parallelization settings, is functioning properly. It's worth noting that, apart from the single line of AP code we looked at earlier in the PyTorch script, there is no handling for using multiple GPUs simultaneously in other parts of the script.

...
[info] Got DBs from backend for auto config.
[info] Requesting resources for MoAI AI Accelerator from the server...
[info] Initializing the worker daemon for MoAI Accelerator
[info] [1/4] Connecting to resources on the server (192.168.xxx.x:xxxxx)...
[info] [2/4] Connecting to resources on the server (192.168.xxx.x:xxxxx)...
[info] [3/4] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] [4/4] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] Establishing links to the resources...
[info] MoAI Accelerator is ready to use.
[info] Moreh Version: 24.11.0
[info] Moreh Job ID: 991528
[info] The number of candidates is 72.
[info] Parallel Graph Compile start...
[info] Elapsed Time to compile all candidates = 36509 [ms]
[info] Parallel Graph Compile finished.
[info] The number of possible candidates is 26.
[info] SelectBestGraphFromCandidates start...
[info] Elapsed Time to compute cost for survived candidates = 4833 [ms]
[info] SelectBestGraphFromCandidates finished.
[info] Configuration for parallelism is selected.
[info] No PP, No TP,  recomputation : default(1), distribute_param : true, distribute_low_prec_param : false
[info] train: true
{'warmup_duration':304.07}                                                                    
{'loss': 4.4514, 'grad_norm': 520.17, 'learning_rate': 0.0, 'epoch': 0.1, 'step': 10, 'tps': 41385.87, 'duration': 57.01, 'max_seq_len': 1024}                                                                                       
{'loss': 3.1578, 'grad_norm': 423.98, 'learning_rate': 0.0, 'epoch': 0.2, 'step': 20, 'tps': 41340.6, 'duration': 63.41, 'max_seq_len': 1024}                                                                                        
{'loss': 0.3286, 'grad_norm': 3.22, 'learning_rate': 0.0, 'epoch': 0.3, 'step': 30, 'tps': 41023.32, 'duration': 63.9, 'max_seq_len': 1024}                                                                                          
{'loss': 0.2046, 'grad_norm': 1.49, 'learning_rate': 0.0, 'epoch': 0.4, 'step': 40, 'tps': 40651.5, 'duration': 64.49, 'max_seq_len': 1024}                                                                                          
{'loss': 0.154, 'grad_norm': 0.86, 'learning_rate': 0.0, 'epoch': 0.5, 'step': 50, 'tps': 40511.73, 'duration': 64.71, 'max_seq_len': 1024}                                                                                          
{'loss': 0.1325, 'grad_norm': 0.75, 'learning_rate': 0.0, 'epoch': 0.6, 'step': 60, 'tps': 41035.12, 'duration': 63.88, 'max_seq_len': 1024} 
...
Saving model checkpoint to $SAVE_DIR
Configuration saved in $SAVE_DIR/config.json
Configuration saved in $SAVE_DIR/generation_config.json

Upon checking the training logs, you can confirm that the training is progressing smoothly.

The throughput displayed during training indicates how many tokens per second the script is training through this PyTorch script.

When using AMD MI250 GPU with 16 GPUs: approximately 33,000 tokens/sec

The approximate training time depending on the GPU type and quantity is as follows:

When using AMD MI250 GPU with 16 GPUs: approximately 1,480 minutes for 1 epoch (total steps: 1,121, batch size: 256)

# Checking Accelerator Status During Training

During training, open another terminal and connect to the container. Then, execute the moreh-smi command to observe the MoAI Accelerator occupying memory and the training script running. Make sure to check this while the initialization process is completed and the training loss appears in the execution logs.

$ moreh-smi
+------------------------------------------------------------------------------------------------------+
|                                                    Current Version: 24.11.0  Latest Version: 24.11.0 |
+------------------------------------------------------------------------------------------------------+
|  Device  |        Name         |       Flavor     |  Memory Usage  |  Total Memory  |  Utilization   |
+======================================================================================================+
|  * 0     |  MoAI Accelerator   |  4xLarge.2048GB  |  1472925 MiB   |  2096640 MiB   |    100%        |
+------------------------------------------------------------------------------------------------------+

tutorial llama3_8b