# Automatic Parallelization
The Automatic Parallelization (AP) feature automatically finds the optimal parallelization configuration for your model. We recommend using multi-node accelerators when using AP.
## How to Apply the AP Feature

The AP feature can be applied by adding a single line of code right after `import torch`:

```python
import torch
torch.moreh.option.enable_advanced_parallelization()
...
```
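For context, here is a minimal sketch of where that call sits in an ordinary PyTorch training script. Only `torch.moreh.option.enable_advanced_parallelization()` comes from this tutorial; the model, data, and training loop below are illustrative placeholders, not code from the MoAI example.

```python
import torch

# Enable AP immediately after importing torch, before any model or data is created.
torch.moreh.option.enable_advanced_parallelization()

import torch.nn as nn

# Placeholder model and synthetic data -- replace with your own training code.
model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(10):
    x = torch.randn(8, 1024).cuda()
    y = torch.randn(8, 1024).cuda()

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```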
Check out the example below and see how a single line of code can bring about a significant transformation!
## Checking the Flavor
First, check that you are ready to use the MoAI Platform by verifying the current MoAI Platform version and flavor with `moreh-smi`.

In this tutorial, we will use the `xLarge.512GB` flavor. If your flavor is not `xLarge.512GB`, please change it using `moreh-switch-model`. For detailed instructions, please visit here.
```
$ moreh-smi
+-----------------------------------------------------------------------------------------------------+
| Current Version: 24.11.0 Latest Version: 24.11.0 |
+-----------------------------------------------------------------------------------------------------+
| Device | Name | Model | Memory Usage | Total Memory | Utilization |
+=====================================================================================================+
| 0 | MoAI Accelerator | xLarge.512GB | - | - | - |
+-----------------------------------------------------------------------------------------------------+
```
## Cloning the Repo
Now, we'll check how Automatic Parallelization is applied in MoAI Platform. First, let's get the code ready.
```bash
git clone https://github.com/moreh-dev/quickstart.git
cd quickstart/moai-example
```
For more detailed examples, please refer to the sample code in our quickstart repository.
## Run the AP Example Code
Let’s run the command with AP enabled. By passing the `--with-ap` argument, `enable_advanced_parallelization()` is applied within the example code.

```bash
python AP_example.py --with-ap
```
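The flag-based toggle is conceptually simple. Below is a sketch of how such a switch could be wired up with `argparse`; the actual `AP_example.py` in the quickstart repository may differ, and the `--with-ap` argument name is the only detail taken from this tutorial.

```python
import argparse

import torch


def main():
    parser = argparse.ArgumentParser()
    # --with-ap toggles Automatic Parallelization; it defaults to off.
    parser.add_argument("--with-ap", action="store_true",
                        help="Enable Automatic Parallelization")
    args = parser.parse_args()

    if args.with_ap:
        # The single line that enables AP, as shown earlier in this tutorial.
        torch.moreh.option.enable_advanced_parallelization()

    # ... build the model, load data, and run the training loop here ...


if __name__ == "__main__":
    main()
```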
Voila! We’ve successfully executed the code with AP enabled! 🎉
## Let's Check the Result!
The execution log below shows the actual result of running with Automatic Parallelization enabled.
```
[info] Got DBs from backend for auto config.
[info] Requesting resources for MoAI Accelerator from the server...
[info] Initializing the worker daemon for MoAI Accelerator
[info] [1/1] Connecting to resources on the server (192.168.xxx.x:xxxxx)...
[info] Establishing links to the resources...
[info] MoAI Accelerator is ready to use.
[info] Moreh Version: 24.11.0
[info] Moreh Job ID: xxxxxx
[info] The number of candidates is 24.
[info] Parallel Graph Compile start...
[info] Elapsed Time to compile all candidates = 20107 [ms]
[info] Parallel Graph Compile finished.
[info] The number of possible candidates is 5.
[info] SelectBestGraphFromCandidates start...
[info] Elapsed Time to compute cost for survived candidates = 1399 [ms]
[info] SelectBestGraphFromCandidates finished.
[info] Configuration for parallelism is selected.
[info] No PP, No TP, micro_batching_enabled : true, num_micro_batches : 4, batch_per_device : 2, recomputation : none(0), distribute_param : true, distribute_low_prec_param : true
[info] train: true
| INFO | __main__:main:159 - [Step 2/30] Loss: 1.6328125
| INFO | __main__:main:159 - [Step 4/30] Loss: 1.625
| INFO | __main__:main:159 - [Step 6/30] Loss: 1.6171875
| INFO | __main__:main:159 - [Step 8/30] Loss: 1.5625
...
```
Let’s do a simple analysis of the AP log. Up to line 8, it follows the same process as a standard MoAI execution. From line 9 onward, AP begins: it analyzes your model’s training pattern, generates graph candidates based on those patterns, compiles them, and then selects the best graph from the candidates. As a result, in line 18, the optimal parallelization configuration for the `meta-llama/Llama-2-13b-hf` model on the `xLarge.512GB` flavor is selected.
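To make those log messages easier to read, here is a purely conceptual sketch of the search pattern they describe: compile a set of candidate parallelization configurations, estimate a cost for each surviving candidate, and pick the cheapest one. This illustrates the idea only; it is not Moreh's implementation, and all names below are made up.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical parallelization configuration (field names are illustrative)."""
    num_micro_batches: int
    batch_per_device: int
    distribute_param: bool


def select_best(candidates, compiles, estimate_cost):
    """Keep candidates that compile, then return the one with the lowest estimated cost."""
    survivors = [c for c in candidates if compiles(c)]  # "possible candidates"
    return min(survivors, key=estimate_cost)            # "SelectBestGraphFromCandidates"


# Toy usage with made-up candidates and a dummy cost model.
cands = [Candidate(4, 2, True), Candidate(8, 1, True), Candidate(2, 4, False)]
best = select_best(cands, compiles=lambda c: True,
                   estimate_cost=lambda c: c.num_micro_batches)
print(best)
```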
As you can see, adding just a single line of code to apply Automatic Parallelization (AP) gives you the optimal parallelization for your GPU resources, as the successful execution log above shows. Train your models with the powerful capabilities of AP and experience the difference! 🚀