# 3. Model Fine-tuning

Now we will run the fine-tuning process itself.

# Setting Accelerator Flavor

In MoAI Platform, physical GPUs are not directly exposed to users. Instead, virtual MoAI Accelerators are provided, which are available for use in PyTorch. By setting the accelerator's flavor, you can determine how much of the physical GPU will be utilized by PyTorch. Since the total training time and GPU usage cost vary depending on the selected accelerator flavor, users should make decisions based on their training scenarios.

Please refer to the document above or reach out to your infrastructure provider for the types and numbers of GPUs corresponding to each flavor.

Select one of the following flavors to continue:

- Using 32 AMD MI250 GPUs
  - Select 4xlarge when using Moreh's trial container.
  - Select 4xLarge.2048GB when using KT Cloud's Hyperscale AI Computing.
- Using 64 AMD MI210 GPUs
- Using 16 AMD MI300X GPUs

Remember when we checked the MoAI Accelerator in Llama3 70B Fine-tuning - Getting Started? Now let's set up the accelerator needed for training.

First, use the moreh-smi command to check the current MoAI Accelerator in use.

$ moreh-smi
+----------------------------------------------------------------------------------------------------+
|                                                  Current Version: 24.11.0  Latest Version: 24.11.0 |
+----------------------------------------------------------------------------------------------------+
|  Device  |        Name         |      Model     |  Memory Usage  |  Total Memory  |  Utilization   |
+====================================================================================================+
|  * 0     |   MoAI Accelerator  |  xLarge.512GB  |  -             |  -             |  -             |
+----------------------------------------------------------------------------------------------------+

The current MoAI Accelerator in use has a memory size of 512GB.
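To check the flavor from a script rather than by reading the table, the moreh-smi output can be parsed as plain text. The following is a minimal sketch that assumes the table layout shown above stays the same. The names parse_moreh_smi and SAMPLE are illustrative and not part of any Moreh tool. In practice the text could be captured by running moreh-smi through Python's subprocess.run.

```python
def parse_moreh_smi(output):
    """Parse the table printed by moreh-smi into one dict per device row.

    Assumes the layout shown in this tutorial: border lines start with '+',
    a one-cell version banner, a header row, then one row per device.
    """
    rows = []
    header = None
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, the shell prompt, and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) == 1:
            continue  # version banner row
        if header is None:
            header = cells  # first multi-cell row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Sample text copied from the output above (hypothetical variable name).
SAMPLE = """
+----------------------------------------------------------------------------------------------------+
|                                                  Current Version: 24.11.0  Latest Version: 24.11.0 |
+----------------------------------------------------------------------------------------------------+
|  Device  |        Name         |      Model     |  Memory Usage  |  Total Memory  |  Utilization   |
+====================================================================================================+
|  * 0     |   MoAI Accelerator  |  xLarge.512GB  |  -             |  -             |  -             |
+----------------------------------------------------------------------------------------------------+
"""

devices = parse_moreh_smi(SAMPLE)
print(devices[0]["Model"])
```

The same function can be reused after switching flavors to confirm that the Model column shows the flavor you selected.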

You can use the moreh-switch-model command to review the accelerator flavors available on the current system. For seamless model training, use the moreh-switch-model command to switch to a MoAI Accelerator with larger memory capacity.

$ moreh-switch-model
Current MoAI Accelerator: xLarge.512GB

Small.64GB
Medium.128GB
Large.256GB
xLarge.512GB  *
1.5xLarge.768GB
2xLarge.1024GB
3xLarge.1536GB
4xLarge.2048GB
6xLarge.3072GB
8xLarge.4096GB
12xLarge.6144GB
24xLarge.12288GB
48xLarge.24576GB

You can enter a number here to switch to a different flavor.
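The number to enter depends on the position of the flavor in the list. The sketch below assumes the menu is numbered from 1 in the order printed above, which matches the 13 entries and the 1-13 selection range, though the tool itself is not inspected here. FLAVORS, menu_number, and memory_gb are hypothetical helper names.

```python
# Flavor names copied from the moreh-switch-model listing above.
FLAVORS = [
    "Small.64GB",
    "Medium.128GB",
    "Large.256GB",
    "xLarge.512GB",
    "1.5xLarge.768GB",
    "2xLarge.1024GB",
    "3xLarge.1536GB",
    "4xLarge.2048GB",
    "6xLarge.3072GB",
    "8xLarge.4096GB",
    "12xLarge.6144GB",
    "24xLarge.12288GB",
    "48xLarge.24576GB",
]


def menu_number(flavor):
    """Return the 1-based menu number, assuming list order equals menu order."""
    return FLAVORS.index(flavor) + 1


def memory_gb(flavor):
    """Extract the memory size in GB from a name such as '4xLarge.2048GB'."""
    size = flavor.rsplit(".", 1)[1]  # text after the last dot, e.g. '2048GB'
    if not size.endswith("GB"):
        raise ValueError(f"unexpected flavor name: {flavor}")
    return int(size[:-2])


for name in ("4xLarge.2048GB", "8xLarge.4096GB"):
    print(name, "-> enter", menu_number(name), "for", memory_gb(name), "GB")
```

Under this assumption, 4xLarge.2048GB is entry 8 and 8xLarge.4096GB is entry 10.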

For this tutorial, we will use the 2048GB MoAI Accelerator.

Therefore, we will switch from the initially set xLarge.512GB flavor to 4xLarge.2048GB and then use the moreh-smi command to verify that the change has been applied correctly.

Enter 8 to use 4xLarge.2048GB.

Selection (1-13, q, Q): 8
The MoAI Accelerator model is successfully switched to  "4xLarge.2048GB".

Small.64GB
Medium.128GB
Large.256GB
xLarge.512GB
1.5xLarge.768GB
2xLarge.1024GB
3xLarge.1536GB
4xLarge.2048GB  *
6xLarge.3072GB
8xLarge.4096GB
12xLarge.6144GB
24xLarge.12288GB
48xLarge.24576GB

Selection (1-13, q, Q): q

Enter q to complete the change.

To confirm that the changes have been successfully applied, use the moreh-smi command again to check the currently used MoAI Accelerator.

$ moreh-smi
+------------------------------------------------------------------------------------------------------+
|                                                    Current Version: 24.11.0  Latest Version: 24.11.0 |
+------------------------------------------------------------------------------------------------------+
|  Device  |        Name         |       Model      |  Memory Usage  |  Total Memory  |  Utilization   |
+======================================================================================================+
|  * 0     |   MoAI Accelerator  |  4xLarge.2048GB  |  -             |  -             |  -             |
+------------------------------------------------------------------------------------------------------+

You can see that it has been successfully switched to 4xLarge.2048GB.

# Training Execution

Run the provided train_llama3_70b.py script.

$ cd ~/quickstart
~/quickstart$ python tutorial/train_llama3_70b.py

If the training is running correctly, you will see logs similar to the following. These logs indicate that the Advanced Parallelism (AP) feature, which finds the optimal parallelization settings, is functioning correctly. Note that the PyTorch script we reviewed earlier needed no additional handling for using multiple GPUs simultaneously, apart from the single line of code that enables AP.

...
[info] Got DBs from backend for auto config.
[info] Requesting resources for MoAI Accelerator from the server...
[info] Initializing the worker daemon for MoAI Accelerator
[info] [1/4] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] [2/4] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] [3/4] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] [4/4] Connecting to resources on the server (192.168.xxx.xx:xxxxx)...
[info] Establishing links to the resources...
[info] MoAI Accelerator is ready to use.
[info] Moreh Version: 24.11.0
[info] Moreh Job ID: 991802
[info] The number of candidates is 78.
[info] Parallel Graph Compile start...
[info] Elapsed Time to compile all candidates = 521039 [ms]
[info] Parallel Graph Compile finished.
[info] The number of possible candidates is 4.
[info] SelectBestGraphFromCandidates start...
[info] Elapsed Time to compute cost for survived candidates = 29258 [ms]
[info] SelectBestGraphFromCandidates finished.
[info] Configuration for parallelism is selected.
[info] num_stages : 4, num_micro_batches : 64, batch_per_device : 1, No TP,  recomputation : default(1), distribute_param : true, distribute_low_prec_param : true
[info] train: true

| INFO     | __main__:main:259 - Model load and warmup done. Duration: 1926.33
| INFO     | __main__:main:269 - [Step 10/560] | Loss: 1.7265 | Duration: 1877.21 | 2.45 | Throughput: 2513.62 tokens/sec
| INFO     | __main__:main:269 - [Step 20/560] | Loss: 1.5937 | Duration: 2099.69 | 2.44 | Throughput: 2496.98 tokens/sec
| INFO     | __main__:main:269 - [Step 30/560] | Loss: 1.5078 | Duration: 2137.83 | 2.39 | Throughput: 2452.44 tokens/sec
| INFO     | __main__:main:269 - [Step 40/560] | Loss: 1.8515 | Duration: 2162.07 | 2.37 | Throughput: 2424.93 tokens/sec
...
Training Done
Saving Model...
Model saved in ./llama3_70b_summarization

From the training logs, you can confirm that the training is progressing smoothly.

The throughput displayed during training is the number of tokens the PyTorch script processes per second.

Using 16 AMD MI250 GPUs (32 devices): approximately 2400 tokens/sec
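To track throughput over a long run, the step lines can be extracted from the log with a regular expression. This is a minimal sketch that assumes the log line format shown above. LOG_PATTERN, parse_step_logs, and SAMPLE_LOG are illustrative names, and the unlabeled field between Duration and Throughput is skipped because the logs do not say what it is.

```python
import re

# Matches lines such as:
# [Step 10/560] | Loss: 1.7265 | Duration: 1877.21 | 2.45 | Throughput: 2513.62 tokens/sec
LOG_PATTERN = re.compile(
    r"\[Step (?P<step>\d+)/(?P<total>\d+)\] \| Loss: (?P<loss>[\d.]+) \| "
    r"Duration: (?P<duration>[\d.]+) \|.*Throughput: (?P<throughput>[\d.]+) tokens/sec"
)


def parse_step_logs(text):
    """Return one dict per training step line found in the log text."""
    records = []
    for line in text.splitlines():
        match = LOG_PATTERN.search(line)
        if match:
            records.append({
                "step": int(match["step"]),
                "total": int(match["total"]),
                "loss": float(match["loss"]),
                "duration": float(match["duration"]),
                "throughput": float(match["throughput"]),
            })
    return records


# Step lines copied from the training log above.
SAMPLE_LOG = """
| INFO     | __main__:main:269 - [Step 10/560] | Loss: 1.7265 | Duration: 1877.21 | 2.45 | Throughput: 2513.62 tokens/sec
| INFO     | __main__:main:269 - [Step 20/560] | Loss: 1.5937 | Duration: 2099.69 | 2.44 | Throughput: 2496.98 tokens/sec
| INFO     | __main__:main:269 - [Step 30/560] | Loss: 1.5078 | Duration: 2137.83 | 2.39 | Throughput: 2452.44 tokens/sec
| INFO     | __main__:main:269 - [Step 40/560] | Loss: 1.8515 | Duration: 2162.07 | 2.37 | Throughput: 2424.93 tokens/sec
"""

records = parse_step_logs(SAMPLE_LOG)
mean_throughput = sum(r["throughput"] for r in records) / len(records)
print(f"{len(records)} step lines, mean throughput {mean_throughput:.0f} tokens/sec")
```

For the four lines shown, the mean is close to the 2400 tokens/sec quoted above, with throughput declining slightly over the first 40 steps.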

Estimated training time based on GPU type and count is as follows:

Using 16 AMD MI250 GPUs (32 devices): approximately 32 hours
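The estimate can be cross-checked against the training log. The sketch below rests on an assumption the logs do not state explicitly: that each Duration value is the number of seconds taken by the preceding 10-step logging interval. estimate_total_hours is a hypothetical helper, and the one-time model load and warmup time is excluded.

```python
def estimate_total_hours(interval_seconds, steps_per_interval, total_steps):
    """Extrapolate total training time from a few logged intervals.

    Assumes each entry in interval_seconds covers steps_per_interval steps
    and that the step time stays roughly constant for the rest of the run.
    """
    logged_steps = len(interval_seconds) * steps_per_interval
    seconds_per_step = sum(interval_seconds) / logged_steps
    return seconds_per_step * total_steps / 3600


# Duration values from the log lines for steps 10, 20, 30, and 40.
durations = [1877.21, 2099.69, 2137.83, 2162.07]
hours = estimate_total_hours(durations, steps_per_interval=10, total_steps=560)
print(f"Estimated training time: {hours:.1f} hours")
```

Under that assumption the extrapolation lands near the 32 hours quoted above. Later intervals are slightly slower than the first, so the real figure may be somewhat higher.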

# Checking Accelerator Status During Training

During training, you can open another terminal and connect to the container. Then, run the moreh-smi command to see the MoAI Accelerator’s memory being utilized by the training script, as shown below.

$ moreh-smi
+-----------------------------------------------------------------------------------------------------+
|                                                 Current Version: 24.11.0  Latest Version: 24.11.0   |
+-----------------------------------------------------------------------------------------------------+
|  Device  |        Name         |       Model      |  Memory Usage  |  Total Memory  |  Utilization  |
+=====================================================================================================+
|  * 0     |   MoAI Accelerator  |  4xLarge.2048GB  |  1469731 MiB   |  2096640 MiB   |  100 %        |
+-----------------------------------------------------------------------------------------------------+
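The Memory Usage and Total Memory columns can be turned into a usage fraction to see how much headroom the selected flavor leaves. This is a small sketch using the values from the table above. parse_mib is a hypothetical helper that assumes cells of the form "<number> MiB".

```python
def parse_mib(cell):
    """Convert a cell such as '1469731 MiB' to an integer number of MiB."""
    value, unit = cell.split()
    if unit != "MiB":
        raise ValueError(f"unexpected unit: {unit}")
    return int(value)


# Values copied from the moreh-smi output above.
used_mib = parse_mib("1469731 MiB")
total_mib = parse_mib("2096640 MiB")

usage = used_mib / total_mib
print(f"Memory in use: {usage:.1%} of {total_mib / 1024:.1f} GiB")
```

For this run, roughly 70% of the accelerator memory is in use.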
