From 83aba383972b2918bf7fd42eb498ddac2880d7f3 Mon Sep 17 00:00:00 2001
From: senzhen
Date: Fri, 31 Oct 2025 10:48:38 +0800
Subject: [PATCH] Update the weight conversion documentation description
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../docs/source_en/function/transform_weight.md    | 10 +++++-----
 .../docs/source_zh_cn/function/transform_weight.md | 10 +++++-----
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/docs/mindformers/docs/source_en/function/transform_weight.md b/docs/mindformers/docs/source_en/function/transform_weight.md
index 28f257f3f9..f3b8d37385 100644
--- a/docs/mindformers/docs/source_en/function/transform_weight.md
+++ b/docs/mindformers/docs/source_en/function/transform_weight.md
@@ -144,11 +144,11 @@ bash transform_checkpoint.sh \
 
 ### Multi-Node Multi-Device Training on Physical Machines
 
-Training a large-scale model usually needs a cluster of servers. In the multi-node multi-device scenario, if there is a shared disk between servers, the automatic conversion function can be used. Otherwise, only offline conversion can be used. The following example is a training that uses two servers and 16 GPUs.
+Training a large-scale model usually needs a cluster of servers. In the multi-node multi-device scenario, if a unified shared storage path (such as the NFS-mounted /worker directory) is configured between servers, the automatic conversion function can be used. Otherwise, only offline conversion can be used. The following example is a training that uses two servers and 16 GPUs.
 
-#### Scenario 1: A shared disk exists between servers.
+#### Scenario 1: A shared storage path is configured between servers.
 
-If there is a shared disk between servers, you can use MindSpore Transformers to automatically convert a weight before multi-node multi-device training. Assume that `/data` is the shared disk between the servers and the MindSpore Transformers project code is stored in the `/data/mindformers` directory.
+If a unified shared storage path (such as the NFS-mounted /worker directory) is configured between servers, you can use MindSpore Transformers to automatically convert a weight before multi-node multi-device training.
 
 - **Single-process conversion**
 
@@ -209,9 +209,9 @@ If there is a shared disk between servers, you can use MindSpore Transformers to
     16 8 ${ip} ${port} 1 output/msrun_log False 300
   ```
 
-#### Scenario 2: No shared disk exists between servers.
+#### Scenario 2: No shared path exists between servers.
 
-If there is no shared disk between servers, you need to use the offline weight conversion tool to convert the weight. The following steps describe how to perform offline weight conversion and start a multi-node multi-device training task.
+If there is no shared path between servers, you need to use the offline weight conversion tool to convert the weight. The following steps describe how to perform offline weight conversion and start a multi-node multi-device training task.
 
 - **Obtain the distributed policy file.**
 
diff --git a/docs/mindformers/docs/source_zh_cn/function/transform_weight.md b/docs/mindformers/docs/source_zh_cn/function/transform_weight.md
index 93adfaecd4..ddf5b6610f 100644
--- a/docs/mindformers/docs/source_zh_cn/function/transform_weight.md
+++ b/docs/mindformers/docs/source_zh_cn/function/transform_weight.md
@@ -144,11 +144,11 @@ bash transform_checkpoint.sh \
 
 ### 物理机多机多卡训练
 
-大规模模型通常需要通过多台服务器组成的集群进行训练。在这种多机多卡的场景下,如果服务器之间存在共享盘,则可以使用自动转换功能,否则只能使用离线转换。下面以两台服务器、16卡训练为例进行说明。
+大规模模型通常需要通过多台服务器组成的集群进行训练。在这种多机多卡的场景下,如果服务器之间配置了统一的共享存储路径(如NFS挂载的/worker目录),则可以使用自动转换功能,否则只能使用离线转换。下面以两台服务器、16卡训练为例进行说明。
 
-#### 场景一:服务器之间有共享盘
+#### 场景一:服务器之间配置有共享存储路径
 
-在服务器之间有共享盘的场景下,可以使用 MindSpore Transformers 的自动权重转换功能在多机多卡训练之前自动进行权重转换。假设 `/data` 为服务器的共享盘,且 MindSpore Transformers 的工程代码位于 `/data/mindformers` 路径下。
+在服务器之间配置了统一的共享存储路径(如NFS挂载的/worker目录),可以使用 MindSpore Transformers 的自动权重转换功能在多机多卡训练之前自动进行权重转换。
 
 - **单进程转换**
 
@@ -209,9 +209,9 @@ bash transform_checkpoint.sh \
     16 8 ${ip} ${port} 1 output/msrun_log False 300
   ```
 
-#### 场景二:服务器之间无共享盘
+#### 场景二:服务器之间无共享路径
 
-在服务器之间无共享盘的情况下,需要使用离线权重转换工具进行权重转换。以下步骤描述了如何进行离线权重转换,并启动多机多卡训练任务。
+在服务器之间无共享路径的情况下,需要使用离线权重转换工具进行权重转换。以下步骤描述了如何进行离线权重转换,并启动多机多卡训练任务。
 
 - **获取分布式策略文件**
 
--
Gitee