# Automatic Parallelization

The Automatic Parallelization (AP) feature automatically finds the optimal parallelization configuration for your model and accelerator. Using AP on multi-node accelerators is recommended.

# How to Apply the AP Feature

The AP feature can be applied by adding a single line of code after `import torch`:

```python
import torch

torch.moreh.option.enable_advanced_parallelization()
...
```
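For context, here is a minimal sketch of a training step with AP enabled. Only the `torch.moreh.option.enable_advanced_parallelization()` call comes from the snippet above; the model choice (the Llama 2 13B model used later in this tutorial) and the toy training step are illustrative assumptions, not the actual example code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The single AP line: call it before building the model so the MoAI
# compiler can capture and parallelize the whole training graph.
torch.moreh.option.enable_advanced_parallelization()

# Illustrative model choice (the model analyzed later in this tutorial).
model_name = "meta-llama/Llama-2-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Toy batch for illustration only.
batch = tokenizer(["Hello, MoAI!"] * 2, return_tensors="pt", padding=True).to("cuda")

# One hypothetical training step; AP transparently parallelizes it.
model.train()
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```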

Check out the example below and see how a single line of code can bring about a significant transformation!

## Checking the Flavor
First, check that you are ready to use the MoAI Platform by verifying the current MoAI Platform version and flavor with `moreh-smi`. This tutorial uses the xLarge.512GB flavor. If your flavor is not xLarge.512GB, please change it using `moreh-switch-model`. For detailed instructions, please visit here.

```
$ moreh-smi
+-----------------------------------------------------------------------------------------------------+
|                                                  Current Version: 24.11.0  Latest Version: 24.11.0  |
+-----------------------------------------------------------------------------------------------------+
|  Device  |        Name         |       Model      |  Memory Usage  |  Total Memory  |  Utilization  |
+=====================================================================================================+
|    0     |   MoAI Accelerator  |   xLarge.512GB   |  -             |  -             |  -            |
+-----------------------------------------------------------------------------------------------------+
```
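Changing the flavor is done with `moreh-switch-model` from the same shell. As a sketch (the exact prompt depends on your MoAI Platform version), running it without arguments presents the list of available flavors to select from:

```
$ moreh-switch-model
```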

## Cloning the Repo
Now we'll check how Automatic Parallelization is applied on the MoAI Platform. First, let's get the code ready:

```bash
git clone https://github.com/moreh-dev/quickstart.git
cd quickstart/moai-example
```

For more detailed examples, please refer to the sample code in our quickstart repository.

## Run the AP Example Code
Let’s run the command with AP enabled. By passing the `--with-ap` argument, `enable_advanced_parallelization()` is applied within the example code.

```bash
python AP_example.py --with-ap
```
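The actual example code lives in the quickstart repository cloned above. As a hedged sketch, a flag like `--with-ap` can gate the AP call along these lines; the names below are illustrative, not the repository's exact code:

```python
import argparse

import torch


def main():
    parser = argparse.ArgumentParser()
    # Hypothetical flag mirroring the tutorial's `--with-ap` switch.
    parser.add_argument("--with-ap", action="store_true",
                        help="Enable Automatic Parallelization")
    args = parser.parse_args()

    if args.with_ap:
        # The single AP line from the beginning of this tutorial.
        torch.moreh.option.enable_advanced_parallelization()

    # ... build the model and run the training loop as usual ...


if __name__ == "__main__":
    main()
```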

Voila! We’ve successfully executed the code with AP enabled! 🎉

## Let's Check the Result!

The execution log below shows the actual results with Automatic Parallelization:

```
[info] Got DBs from backend for auto config.
[info] Requesting resources for MoAI Accelerator from the server...
[info] Initializing the worker daemon for MoAI Accelerator
[info] [1/1] Connecting to resources on the server (192.168.xxx.x:xxxxx)...
[info] Establishing links to the resources...
[info] MoAI Accelerator is ready to use.
[info] Moreh Version: 24.11.0
[info] Moreh Job ID: xxxxxx
[info] The number of candidates is 24.
[info] Parallel Graph Compile start...
[info] Elapsed Time to compile all candidates = 20107 [ms]
[info] Parallel Graph Compile finished.
[info] The number of possible candidates is 5.
[info] SelectBestGraphFromCandidates start...
[info] Elapsed Time to compute cost for survived candidates = 1399 [ms]
[info] SelectBestGraphFromCandidates finished.
[info] Configuration for parallelism is selected.
[info] No PP, No TP, micro_batching_enabled : true, num_micro_batches : 4, batch_per_device : 2, recomputation : none(0), distribute_param : true, distribute_low_prec_param : true
[info] train: true
| INFO     | __main__:main:159 - [Step 2/30] Loss: 1.6328125
| INFO     | __main__:main:159 - [Step 4/30] Loss: 1.625
| INFO     | __main__:main:159 - [Step 6/30] Loss: 1.6171875
| INFO     | __main__:main:159 - [Step 8/30] Loss: 1.5625
...
```

Let’s do a simple analysis of the AP logs.

Up to line 8 (`Moreh Job ID`), the log follows the same process as a standard MoAI execution. From line 9, AP begins: it analyzes your model’s training pattern, generates parallelization graph candidates based on those patterns (24 here), compiles them, and selects the best graph from the candidates that survive compilation (5 here). As a result, line 18 shows the optimal parallelization configuration selected for the meta-llama/Llama-2-13b-hf model on xLarge.512GB: no pipeline or tensor parallelism, micro-batching enabled with 4 micro-batches and a per-device batch size of 2, no recomputation, and both regular and low-precision parameters distributed across devices.

As you can see, adding just a single line of code to apply Automatic Parallelization (AP) gives you the optimal parallelization configuration for your GPU resources, resulting in a successful run. Train your models with the powerful capabilities of AP and experience the difference! 🚀