# moreh-smi

We provide the Moreh System Management Interface to help you use the MoAI Accelerator efficiently. It is a command-line tool for managing and monitoring MoAI Accelerators on the MoAI Platform. With just a few commands (such as moreh-smi and moreh-switch-model), users can manage MoAI Accelerators and easily update the MoAI Platform.

# moreh-smi

moreh-smi is a command-line tool that allows users to manage and monitor the MoAI Accelerator. You can run it in a conda environment where MoAI Platform PyTorch is installed.

$ moreh-smi
+------------------------------------------------------------------------------------------------------+
|                                                    Current Version: 24.11.0  Latest Version: 24.11.0 |
+------------------------------------------------------------------------------------------------------+
|  Device  |        Name         |       Model      |  Memory Usage  |  Total Memory  |  Utilization   |
+======================================================================================================+
|  * 0     |  MoAI Accelerator   |  xLarge.512GB    |  -             |  -             |  -             |
+------------------------------------------------------------------------------------------------------+

If you are currently running a training session using the MoAI Accelerator, running moreh-smi in another terminal session will display the running process information as follows. You can also use moreh-smi to quickly identify your Job ID, allowing for faster support response from MoAI Platform in case of training or inference issues. In the example below, the Job ID is 976356.

$ moreh-smi
+------------------------------------------------------------------------------------------------------+
|                                                    Current Version: 24.11.0  Latest Version: 24.11.0 |
+------------------------------------------------------------------------------------------------------+
|  Device  |        Name         |       Model      |  Memory Usage  |  Total Memory  |  Utilization   |
+======================================================================================================+
|  * 0     |  MoAI Accelerator   |  xLarge.512GB    |    397 MiB     |  524160 MiB    |      0 %       |
+------------------------------------------------------------------------------------------------------+

Processes:
+-----------------------------------------------------------------------------------+
|  Device  |  Job ID  |    PID    |             Process            |  Memory Usage  |
+===================================================================================+
|       0  |  976356  |  1548305  |  python tutorial/train_gpt.py  |    397 MiB     |
+-----------------------------------------------------------------------------------+
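
When you need the Job ID in a script (for example, to attach it to a support request), the Processes table can be parsed as plain text. The following is a minimal sketch, assuming the table layout matches the example above; parse_processes is a hypothetical helper and not part of moreh-smi, and the layout may differ between versions.

```python
# Sketch only: parse the "Processes:" table printed by moreh-smi.
# Assumes the column layout shown in the example output above.
SAMPLE = """\
|  * 0     |  MoAI Accelerator   |  xLarge.512GB    |    397 MiB     |  524160 MiB    |      0 %       |

Processes:
+-----------------------------------------------------------------------------------+
|  Device  |  Job ID  |    PID    |             Process            |  Memory Usage  |
+===================================================================================+
|       0  |  976356  |  1548305  |  python tutorial/train_gpt.py  |    397 MiB     |
+-----------------------------------------------------------------------------------+
"""


def parse_processes(output: str) -> list[dict[str, str]]:
    """Return one dict per row of the Processes table, keyed by column header."""
    rows = []
    header = None
    in_processes = False
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Processes:"):
            in_processes = True
            continue
        # Skip everything before the Processes table and all border lines.
        if not in_processes or not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first "|" row after "Processes:" is the header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


if __name__ == "__main__":
    print([row["Job ID"] for row in parse_processes(SAMPLE)])  # ['976356']
```

In practice, the text passed to parse_processes would come from capturing the output of moreh-smi instead of the SAMPLE string.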

# moreh-smi -p

You can check the information about the nodes assigned to the currently selected MoAI Accelerator using the moreh-smi -p command. The information that can be retrieved is as follows:

  • Dev Temp: GPU Temperature
  • Mem Temp: GPU Memory Temperature
  • Dev Util: GPU Utilization
  • Mem Util: GPU Memory Utilization
  • Mem Usage: GPU Memory Usage

The moreh-smi -p command reads this information from the MoAI Accelerator that is currently running a job, so a process must be actively running, as shown below:

$ moreh-smi
+-----------------------------------------------------------------------------------------------------+
|                                                         Current Version:   Latest Version: 24.11.0  |
+-----------------------------------------------------------------------------------------------------+
|  Device  |        Name         |       Model      |  Memory Usage  |  Total Memory  |  Utilization  |
+=====================================================================================================+
|  * 0     |   MoAI Accelerator  |  4xLarge.2048GB  |  113217 MiB    |  2096640 MiB   |    0 %        |
+-----------------------------------------------------------------------------------------------------+

Processes:
+----------------------------------------------------------------------------------------+
|  Device  |  Job ID  |    PID    |              Process                |  Memory Usage  |
+========================================================================================+
|       0  |  ******  |  *******  |       python train_llama3.py        |  113217 MiB    |
+----------------------------------------------------------------------------------------+

Now, let’s enter the moreh-smi -p command while the process is running, as follows:

$ moreh-smi -p
+--------------------------------------------------------------------------------------+
|  Dev  |  Location  |  Dev Temp  |  Mem Temp  |  Dev Util  |  Mem Util  |  Mem Usage  |
+======================================================================================+
|    0  |  back01:0  |    31 C    |    40 C    |     0 %    |     0 %    |     1 %     |
|    1  |  back01:1  |    33 C    |    41 C    |     0 %    |     0 %    |     1 %     |
|    2  |  back01:2  |    34 C    |    45 C    |     0 %    |     0 %    |     1 %     |
|    3  |  back01:3  |    32 C    |    46 C    |     0 %    |     0 %    |     1 %     |
                                               .
                                               .
                                               .
|   27  |  back51:3  |    37 C    |    46 C    |     0 %    |     0 %    |     0 %     |
|   28  |  back51:4  |    34 C    |    45 C    |     0 %    |     0 %    |     0 %     |
|   29  |  back51:5  |    36 C    |    47 C    |     0 %    |     0 %    |     0 %     |
|   30  |  back51:6  |    42 C    |    55 C    |     0 %    |     0 %    |     0 %     |
|   31  |  back51:7  |    41 C    |    51 C    |     0 %    |     0 %    |     0 %     |
+--------------------------------------------------------------------------------------+
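
For monitoring, the per-device rows can be converted into numbers and scanned for hot spots. The following is a rough sketch, assuming the seven-column layout shown above; parse_device_stats and the dictionary keys are hypothetical names chosen for illustration.

```python
# Sketch only: turn `moreh-smi -p` rows into numeric records.
# Assumes the seven columns shown in the example output above.
SAMPLE = """\
+--------------------------------------------------------------------------------------+
|  Dev  |  Location  |  Dev Temp  |  Mem Temp  |  Dev Util  |  Mem Util  |  Mem Usage  |
+======================================================================================+
|    0  |  back01:0  |    31 C    |    40 C    |     0 %    |     0 %    |     1 %     |
|    1  |  back01:1  |    33 C    |    41 C    |     0 %    |     0 %    |     1 %     |
|   30  |  back51:6  |    42 C    |    55 C    |     0 %    |     0 %    |     0 %     |
|   31  |  back51:7  |    41 C    |    51 C    |     0 %    |     0 %    |     0 %     |
+--------------------------------------------------------------------------------------+
"""


def first_number(cell: str) -> int:
    """Return the leading integer of a cell such as '31 C' or '0 %'."""
    return int(cell.split()[0])


def parse_device_stats(output: str) -> list[dict]:
    """Return one record per device row; header and border lines are skipped."""
    stats = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 7 or not cells[0].isdigit():
            continue  # header row or unexpected layout
        stats.append({
            "dev": int(cells[0]),
            "location": cells[1],
            "dev_temp_c": first_number(cells[2]),
            "mem_temp_c": first_number(cells[3]),
            "dev_util_pct": first_number(cells[4]),
            "mem_util_pct": first_number(cells[5]),
            "mem_usage_pct": first_number(cells[6]),
        })
    return stats


if __name__ == "__main__":
    hottest = max(parse_device_stats(SAMPLE), key=lambda s: s["mem_temp_c"])
    print(hottest["dev"], hottest["location"], hottest["mem_temp_c"])  # 30 back51:6 55
```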

# moreh-smi -r

You can terminate the process using the moreh-smi --reset or moreh-smi -r command. Here is an example where the process is terminated during training.

$ moreh-smi -r
Device release success.

# enter the command once more
$ moreh-smi -r 
Device release failed. (Not running job.)

The message "Device release success." confirms that the process was terminated. If no process is running, the command fails with the message "Device release failed. (Not running job.)"
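
A script that resets the device can branch on these two messages. The following is a minimal sketch; release_succeeded and reset_device are hypothetical helpers, and it is an assumption that moreh-smi prints the message to standard output.

```python
# Sketch only: interpret the message printed by `moreh-smi -r`.
import subprocess


def release_succeeded(message: str) -> bool:
    """Map the two documented messages to True/False; reject anything else."""
    text = message.strip()
    if text.startswith("Device release success"):
        return True
    if text.startswith("Device release failed"):
        return False
    raise ValueError(f"unrecognized moreh-smi output: {text!r}")


def reset_device() -> bool:
    """Run `moreh-smi -r` and report whether a job was released.

    Assumption: the message is written to stdout.
    """
    result = subprocess.run(["moreh-smi", "-r"], capture_output=True, text=True)
    return release_succeeded(result.stdout)
```

Calling reset_device() requires moreh-smi to be on the PATH; release_succeeded can be used on its own with output captured elsewhere.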

# moreh-smi -t

The Token value is a hash used to identify the user, and it is unique for each user. The Token is typically located within the user’s workspace, and the server connected to the MoAI Platform identifies the user and runs the training based on the Token value. Without the Token, GPU operations and Python applications cannot be executed.

You can check the Token configuration on the VM using the moreh-smi --token or moreh-smi -t command.

$ moreh-smi -t
+--------------------------------------------------------------------------------+
|  Device  |        Name         |            Token           |       Model      |
+================================================================================+
|  * 0     |   MoAI Accelerator  |  ************************  |  4xLarge.2048GB  |
+--------------------------------------------------------------------------------+

# moreh-smi device

moreh-smi device is a toolkit within the Moreh System Management Interface specifically designed for managing MoAI Accelerators. It allows you to add, remove, and switch MoAI Accelerators directly. The following sections describe each operation.

# moreh-smi device --add

By default, if users do not configure anything, there will only be one MoAI Accelerator in a VM or container environment. However, there may be cases where you want to run multiple processes concurrently in the same environment. In such cases, you can create multiple MoAI Accelerators within a single token using moreh-smi. When you enter the moreh-smi device --add command to use two or more MoAI Accelerators, you will see the following interface.

$ moreh-smi device --add
1. Small.64GB
2. Medium.128GB
3. Large.256GB
4. xLarge.512GB
5. 1.5xLarge.768GB
6. 2xLarge.1024GB
7. 3xLarge.1536GB
8. 4xLarge.2048GB
9. 6xLarge.3072GB
10. 8xLarge.4096GB
11. 12xLarge.6144GB
12. 24xLarge.12288GB
13. 48xLarge.24576GB

Selection (1-13, q, Q):

When you enter the number (1 to 13) of the model you want to use, a MoAI Accelerator of that model is created and the message "Create device success." is displayed. Within one environment, you can create a maximum of 5 MoAI Accelerators. If you need to create more MoAI Accelerators, please contact your infrastructure administrator.
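
If you want to pick a menu number programmatically, the model names can be parsed for their memory size. The following is a small sketch, assuming the GB suffix in each model name reflects the accelerator's total memory; parse_menu and smallest_model_for are hypothetical helpers, and the menu on your system may list different models.

```python
# Sketch only: choose the smallest model that offers at least N GB.
# Assumes the menu text printed by `moreh-smi device --add` shown above.
import re

MENU = """\
1. Small.64GB
2. Medium.128GB
3. Large.256GB
4. xLarge.512GB
5. 1.5xLarge.768GB
6. 2xLarge.1024GB
7. 3xLarge.1536GB
8. 4xLarge.2048GB
9. 6xLarge.3072GB
10. 8xLarge.4096GB
11. 12xLarge.6144GB
12. 24xLarge.12288GB
13. 48xLarge.24576GB
"""

# "<number>. <size name>.<memory>GB", e.g. "5. 1.5xLarge.768GB"
LINE = re.compile(r"\s*(\d+)\.\s+(\S+?)\.(\d+)GB\s*")


def parse_menu(text: str) -> list[tuple[int, str, int]]:
    """Return (menu number, model name, memory in GB) for each menu line."""
    models = []
    for line in text.splitlines():
        match = LINE.fullmatch(line)
        if match:
            number, size, gb = match.groups()
            models.append((int(number), f"{size}.{gb}GB", int(gb)))
    return models


def smallest_model_for(required_gb: int, text: str = MENU):
    """Return the smallest model with at least required_gb, or None."""
    candidates = [m for m in parse_menu(text) if m[2] >= required_gb]
    return min(candidates, key=lambda m: m[2]) if candidates else None


if __name__ == "__main__":
    print(smallest_model_for(300))  # (4, 'xLarge.512GB', 512)
```

The first element of the result is the number to enter at the Selection prompt or to pass to moreh-smi device --add.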

In the example below, let's add model number 10, the 8xLarge.4096GB MoAI Accelerator:

$ moreh-smi device --add 10
+-----------------------------------------------------------------------------------------------------+
|                                                  Current Version: 24.11.0  Latest Version: 24.11.0  |
+-----------------------------------------------------------------------------------------------------+
|  Device  |        Name         |       Model      |  Memory Usage  |  Total Memory  |  Utilization  |
+=====================================================================================================+
|  * 0     |   MoAI Accelerator  |  Large.256GB     |  -             |  -             |  -            |
|    1     |   MoAI Accelerator  |  8xLarge.4096GB  |  -             |  -             |  -            |
+-----------------------------------------------------------------------------------------------------+

You can confirm that the 8xLarge accelerator has been successfully added. However, note that the accelerator has been added but not selected yet. The method to select and use the added accelerator will be covered in the next section with the moreh-smi device --switch command.


# moreh-smi device --switch

moreh-smi device --switch {Device_ID} is a command that allows you to change the default MoAI Accelerator.

First, let’s check the currently selected MoAI Accelerator using the moreh-smi command.

$ moreh-smi
+-----------------------------------------------------------------------------------------------------+
|                                                  Current Version: 24.11.0  Latest Version: 24.11.0  |
+-----------------------------------------------------------------------------------------------------+
|  Device  |        Name         |       Model      |  Memory Usage  |  Total Memory  |  Utilization  |
+=====================================================================================================+
|  * 0     |   MoAI Accelerator  |  Large.256GB     |  -             |  -             |  -            |
|    1     |   MoAI Accelerator  |  8xLarge.4096GB  |  -             |  -             |  -            |
+-----------------------------------------------------------------------------------------------------+

Now, let's switch from device 0 to device 1. This can be done by using the --switch argument:

$ moreh-smi device --switch 1

+---------------------------------------------------+
|  Device  |        Name         |       Model      |
+===================================================+
|    0     |   MoAI Accelerator  |  Large.256GB     |
|  * 1     |   MoAI Accelerator  |  8xLarge.4096GB  |
+---------------------------------------------------+
Switch Current Device success.

This means that the current default MoAI Accelerator has been changed to Accelerator 1.

Selection (0-4, q, Q): q

$ moreh-smi
+-----------------------------------------------------------------------------------------------------+
|                                                  Current Version: 24.11.0  Latest Version: 24.11.0  |
+-----------------------------------------------------------------------------------------------------+
|  Device  |        Name         |       Model      |  Memory Usage  |  Total Memory  |  Utilization  |
+=====================================================================================================+
|    0     |   MoAI Accelerator  |  Large.256GB     |  -             |  -             |  -            |
|  * 1     |   MoAI Accelerator  |  8xLarge.4096GB  |  -             |  -             |  -            |
+-----------------------------------------------------------------------------------------------------+
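
To check from a script which accelerator is currently selected, the device table can be parsed for the row marked with an asterisk. The following is a minimal sketch, assuming the asterisk marks the default device as the outputs above suggest; parse_devices is a hypothetical helper.

```python
# Sketch only: list devices from moreh-smi output and find the selected one.
# Assumes the row prefixed with "*" is the current default device.
SAMPLE = """\
+-----------------------------------------------------------------------------------------------------+
|                                                  Current Version: 24.11.0  Latest Version: 24.11.0  |
+-----------------------------------------------------------------------------------------------------+
|  Device  |        Name         |       Model      |  Memory Usage  |  Total Memory  |  Utilization  |
+=====================================================================================================+
|    0     |   MoAI Accelerator  |  Large.256GB     |  -             |  -             |  -            |
|  * 1     |   MoAI Accelerator  |  8xLarge.4096GB  |  -             |  -             |  -            |
+-----------------------------------------------------------------------------------------------------+
"""


def parse_devices(output: str) -> list[dict]:
    """Return id, name, model, and selected flag for each device row."""
    devices = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Processes:"):
            break  # rows after this belong to the process table
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) < 3:
            continue  # version banner
        selected = cells[0].startswith("*")
        device_id = cells[0].lstrip("*").strip()
        if not device_id.isdigit():
            continue  # header row
        devices.append({
            "id": int(device_id),
            "name": cells[1],
            "model": cells[2],
            "selected": selected,
        })
    return devices


if __name__ == "__main__":
    current = next(d for d in parse_devices(SAMPLE) if d["selected"])
    print(current["id"], current["model"])  # 1 8xLarge.4096GB
```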

# moreh-smi device --rm

Next, let's remove the accelerator with a given device ID using the command moreh-smi device --rm {Device_ID}.

$ moreh-smi device --rm 1
+---------------------------------------------------+
|  Device  |        Name         |       Model      |
+===================================================+
|    0     |  MoAI Accelerator   |  Large.256GB     |
|  * 1     |  MoAI Accelerator   |  8xLarge.4096GB  |
+---------------------------------------------------+
Remove device failed. (Cannot remove current Device.)

The command fails because the currently selected device cannot be removed. Switch back to device 0 first, then run the command again.

$ moreh-smi device --switch 0
$ moreh-smi device --rm 1
+------------------------------------------------+
|  Device  |        Name         |     Model     |
+================================================+
|  * 0     |   MoAI Accelerator  |  Large.256GB  |
+------------------------------------------------+
Remove device success.

The MoAI Accelerator with Device ID 1 (8xLarge.4096GB) has been removed. To confirm, run moreh-smi again and check that the device no longer appears in the list.
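
The switch-then-remove sequence above can be expressed as a small planning step. The following is a sketch under stated assumptions: removal_commands is a hypothetical helper, the device records use illustrative "id" and "selected" keys, and exactly one device is assumed to be selected at any time.

```python
# Sketch only: build the commands needed to remove a device safely.
# The current device cannot be removed, so switch away from it first.


def removal_commands(devices: list[dict], target_id: int) -> list[str]:
    """Return the moreh-smi commands to run, in order, to remove target_id."""
    ids = {d["id"] for d in devices}
    if target_id not in ids:
        raise ValueError(f"no device with ID {target_id}")
    current = next(d["id"] for d in devices if d["selected"])
    commands = []
    if current == target_id:
        others = sorted(ids - {target_id})
        if not others:
            # Assumption: the only device is always the current one.
            raise ValueError("cannot remove the only device")
        commands.append(f"moreh-smi device --switch {others[0]}")
    commands.append(f"moreh-smi device --rm {target_id}")
    return commands


if __name__ == "__main__":
    devices = [
        {"id": 0, "selected": False},
        {"id": 1, "selected": True},
    ]
    for command in removal_commands(devices, 1):
        print(command)
    # moreh-smi device --switch 0
    # moreh-smi device --rm 1
```

The sketch switches to the lowest remaining device ID; any other remaining device would work equally well.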

# Detailed Usage

You can use the --help option to see what options are available.

$ moreh-smi --help

Usage: moreh-smi [-h | --help] [-r | --reset] [-s | --server-version] [-v | --version] [-t | --token] [-i | --idx]
                 [device {--add [model_id] | --rm [device_id] | --switch [device_id]}]

Basic Options:
  -h, --help             provide information about available command switches and their options
  -r, --reset            stop the running process
  -s, --server-version   print Moreh Framework version information
  -v, --version          print current software version information
  -t, --token            print Moreh Solution token information
  -i, --idx              select a device to print

Device Options:
  device --list                 list available models for adding device
  device --add [model_id]       add a device corresponding to model_id
  device --rm  [device_id]      remove a device corresponding to device_id
  device --switch [device_id]   switch to the device corresponding to [device_id]

Device Options operate interactively if there are no optional arguments([model_id], [device_id]).

Device Example:
  moreh-smi device --list
  moreh-smi device --add
  moreh-smi device --add 2
  moreh-smi device --switch 1
  moreh-smi -i 2