From 0e525eeb1c3b413db88655b533c466bd87223a7e Mon Sep 17 00:00:00 2001
From: zhangyihuiben
Date: Mon, 15 Dec 2025 16:26:11 +0800
Subject: [PATCH] Revise the accuracy documentation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../accuracy_comparison.md | 138 +++++++++---------
 .../accuracy_comparison.md |   2 +-
 2 files changed, 70 insertions(+), 70 deletions(-)

diff --git a/docs/mindformers/docs/source_en/advanced_development/accuracy_comparison.md b/docs/mindformers/docs/source_en/advanced_development/accuracy_comparison.md
index 4c688eb4e8..892a5f0d66 100644
--- a/docs/mindformers/docs/source_en/advanced_development/accuracy_comparison.md
+++ b/docs/mindformers/docs/source_en/advanced_development/accuracy_comparison.md
@@ -55,75 +55,75 @@ The following tables describe the configuration comparison with Megatron-LM.

     This document supports only the precision comparison of the mcore model. Therefore, `--use-mcore-model` must be configured for Megatron-LM, and `use_legacy: False` must be configured for MindSpore Transformers.

-    | Megatron-LM | Description | MindSpore Transformers | Description |
-    |--------------------------------------------|---------------------------------------------|--------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
-    | `use-legacy-model` and `use-mcore-model` | Specifies whether to use the mcore model. | `use_legacy` | Specifies whether to use the mcore model. `use_legacy: False` is equivalent to `--use-mcore-model`. |
-    | `num-layers` | Number of network layers, that is, number of transformer layers. | `num_layers` | Number of network layers, that is, number of transformer layers. |
-    | `encoder-num-layers` | Number of encoder layers. | Not supported. | |
-    | `decoder-num-layers` | Number of decoder layers. | Not supported. | |
-    | `hidden-size` | Size of the hidden layer, which is the dimension in the hidden state. | `hidden_size` | Size of the hidden layer, which is the dimension in the hidden state. |
-    | `ffn-hidden-size` | Size of the hidden layer in the feedforward network. | `intermediate_size` | Size of the hidden layer in the feedforward network. |
-    | `num-attention-heads` | Number of attention heads. | `num_heads` | Number of attention heads. |
-    | `kv-channels` | Number of key/value tensor channels. | `head_dim` | Number of key/value tensor channels. |
-    | `group-query-attention` | Specifies whether to enable group query attention. | `use_gqa` | Specifies whether to enable group query attention. |
-    | `num-query-groups` | Number of query groups. | `n_kv_heads` | Number of query groups. |
-    | `max-position-embeddings` | Maximum position encoding length. | `max_position_embeddings` | Maximum position encoding length. |
-    | `position-embedding-type` | Position encoding type, such as learned_absolute and rope. | `position_embedding_type` | Position encoding type, such as learned_absolute and rope. |
-    | `use-rotary-position-embeddings` | Specifies whether to use rotary position embedding (RoPE). | Specified by `position_embedding_type`==`rope` | Specifies whether to use RoPE. |
-    | `rotary-base` | Rotary base used for RoPE. | `rotary_base` | Rotary base used for RoPE. |
-    | `rotary-percent` | RoPE usage ratio. | `rotary_percent` | RoPE usage ratio. |
-    | `rotary-interleaved` | Specifies whether to use interleaved RoPE.
| `rotary_interleaved` | Specifies whether to use interleaved RoPE. | - | `rotary-seq-len-interpolation-factor` | Rotary sequence length interpolation factor. | `rotary_seq_len_interpolation_factor` | Rotary sequence length interpolation factor. | - | `use-rope-scaling` | Specifies whether to enable RoPE scaling. | `use_rope_scaling` | Specifies whether to enable RoPE scaling. | - | `rope-scaling-factor` | RoPE scaling factor. | `scaling_factor` | RoPE scaling factor. | - | `no-position-embedding` | Specifies whether to disable location encoding. | `no-position-embedding` | Specifies whether to disable location encoding. | - | `disable-bias-linear` | Disables bias in linear layers. | `add_bias_linear` | Enables bias in linear layers. | - | `mrope-section` | Information of multiple RoPE sections. | Not supported. | | - | `make-vocab-size-divisible-by` | Divides the size of the word table by a specified number. | Not supported. | By default, the dictionary size is not changed. | - | `init-method-std` | Standard deviation of the normal distribution used during model parameter initialization. | `init_method_std` | Standard deviation of the normal distribution used during model parameter initialization. | - | `attention-dropout` | Dropout probability applied in the multi-head self-attention mechanism. | `attention_dropout` | Dropout probability applied in the multi-head self-attention mechanism. | - | `hidden-dropout` | Dropout probability in the hidden layer. | `hidden_dropout` | Dropout probability in the hidden layer. | - | `normalization` | Normalization method, which can be LayerNorm or RMSNorm. | `normalization` | Normalization method, which can be LayerNorm or RMSNorm. | - | `norm-epsilon` | Normalized stability factor (epsilon). | `rms_norm_eps` | RMSNorm stability factor. | - | `apply-layernorm-1p` | Specifies whether to add 1 after LayerNorm. | Not supported. | | - | `apply-residual-connection-post-layernorm` | Specifies whether the residual connection is applied after LayerNorm. | `apply_residual_connection_post_layernorm` | Specifies whether the residual connection is applied after LayerNorm. | - | `openai-gelu` | Specifies whether to use the GELU activation function of the OpenAI version. | Not supported. | | - | `squared-relu` | Specifies whether to use the square ReLU activation function. | Not supported. | | - | Specified by `swiglu`, `openai-gelu`, and `squared-relu` | The default value is **torch.nn.functional.gelu**. | `hidden_act` | Activation function type. | - | `gated_linear_unit` | Specifies whether to use gate linear unit in multi-layer perceptron (MLP). | `gated_linear_unit` | Specifies whether to use gate linear unit in MLP. | - | `swiglu` | Specifies whether to use the SwiGLU activation function. | `hidden_act`==`silu` and `gated_linear_unit`| Specifies whether to use the SwiGLU activation function. | - | `no-persist-layer-norm` | Disables persistence layer normalization. | Not supported. | | - | `untie-embeddings-and-output-weights` | Specifies whether to decouple the weights of the input embedding layer and output layer. | `untie_embeddings_and_output_weights` | Specifies whether to decouple the weights of the input embedding layer and output layer. | - | Specified by `fp16` and `bf16` | Tensor compute precision during training. | `compute_dtype` | Tensor compute precision during training. | - | `grad-reduce-in-bf16` | Gradient reduction using BFloat16. | Not supported. | | - | Not supported. | By default, the initialization tensor is generated in BFloat16 format. 
| `param_init_type` | Initial precision of the weight tensor. The default value is **Float32**, which ensures that the backward gradient is updated in Float32. | - | Not supported. | By default, layer normalization is calculated in Float32. | `layernorm_compute_type` | Layer normalization tensor calculation precision. | - | `attention-softmax-in-fp32` | Executes **attention softmax** in Float32. | `softmax_compute_type` | Softmax tensor calculation precision. | - | Not supported. | | `rotary_dtype` | Position encoding tensor calculation precision. | - | `loss-scale` | Overall loss scaling factor. | `loss_scale_value` | Overall loss scaling factor, which is configured in **runner_wrapper**. If `compute_dtype` is set to **BFloat16**, the value is usually set to **1.0**. | - | `initial-loss-scale` | Initial loss scaling factor. | Not supported. | | - | `min-loss-scale` | Minimum loss scaling factor. | Not supported. | | - | `loss-scale-window` | Dynamic window size scaling. | `loss_scale_window` | Dynamic window size scaling. | - | `hysteresis` | Loss scale hysteresis parameter. | Not supported. | | - | `fp32-residual-connection` | Uses Float32 for residual connection. | Not supported. | | - | `accumulate-allreduce-grads-in-fp32` | Accumulates and reduces gradients using Float32. | Not supported. | Accumulates and reduces gradients using Float32 by default. | - | `fp16-lm-cross-entropy` | Uses Float16 to execute the cross entropy of the LLM. | Not supported. | Uses Float32 to execute the cross entropy of the LLM by default. | - | `q-lora-rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. | `q_lora_rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. | - | `kv-lora-rank` | LoRA rank of the key/value projection layer, which is used when KV-LoRA is enabled. | `kv_lora_rank` | LoRA rank of the key/value projection layer, which is used when KV-LoRA is enabled. | - | `qk-head-dim` | Number of dimensions per Q/K head. | `qk_nope_head_dim` | Number of dimensions per Q/K head. | - | `qk-pos-emb-head-dim` | Number of relative position embedding dimensions per Q/K head. | `qk_rope_head_dim` | Number of relative position embedding dimensions per Q/K head. | - | `v-head-dim` | Number of dimensions per value projection (V head). | `v_head_dim` | Number of dimensions per value projection (V head). | - | `rotary-scaling-factor` | RoPE scaling coefficient.| `scaling_factor` | RoPE scaling coefficient. | - | `use-precision-aware-optimizer` | Enables the optimizer with precision awareness to automatically manage parameter updates of different data types. | Not supported. | | - | `main-grads-dtype` | Data type of the main gradient. | Not supported. | By default, Float32 is used as the data type of the main gradient. | - | `main-params-dtype` | Data type of the main parameter. | Not supported. | By default, Float32 is used as the data type of the main parameter. | - | `exp-avg-dtype` | Data type of the exponential moving average (EMA). | Not supported. | | - | `exp-avg-sq-dtype` | Data type of the EMA square item. | Not supported. | | - | `first-last-layers-bf16` | Specifies whether to forcibly use BFloat16 at the first and last layers. | Not supported. | | - | `num-layers-at-start-in-bf16` | Number of layers that start with BFloat16. | Not supported. | | - | `num-layers-at-end-in-bf16` | Number of layers that end with BFloat16. | Not supported. 
| | - | `multi-latent-attention` | Specifies whether to enable the multi-hidden variable attention mechanism. | `multi_latent_attention` | Specifies whether to enable the multi-hidden variable attention mechanism. | - | `qk-layernorm` | Enables query/key layer normalization. | `qk-layernorm` | Enables query/key layer normalization. | + | Megatron-LM | Description | MindSpore Transformers | Description | + |--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------| + | `use-legacy-model` and `use-mcore-model` | Specifies whether to use the mcore model. | `use_legacy` | Specifies whether to use the mcore model. `use_legacy: False` is equivalent to `--use-mcore-model`. | + | `num-layers` | Number of network layers, that is, number of transformer layers. | `num_layers` | Number of network layers, that is, number of transformer layers. | + | `encoder-num-layers` | Number of encoder layers. | Not supported. | | + | `decoder-num-layers` | Number of decoder layers. | Not supported. | | + | `hidden-size` | Size of the hidden layer, which is the dimension in the hidden state. | `hidden_size` | Size of the hidden layer, which is the dimension in the hidden state. | + | `ffn-hidden-size` | Size of the hidden layer in the feedforward network. | `intermediate_size` | Size of the hidden layer in the feedforward network. | + | `num-attention-heads` | Number of attention heads. | `num_heads` | Number of attention heads. | + | `kv-channels` | Number of key/value tensor channels. | `head_dim` | Number of key/value tensor channels. | + | `group-query-attention` | Specifies whether to enable group query attention. | `use_gqa` | Specifies whether to enable group query attention. | + | `num-query-groups` | Number of query groups. | `n_kv_heads` | Number of query groups. | + | `max-position-embeddings` | Maximum position encoding length. | `max_position_embeddings` | Maximum position encoding length. | + | `position-embedding-type` | Position encoding type, such as learned_absolute and rope. | `position_embedding_type` | Position encoding type, such as learned_absolute and rope. | + | `use-rotary-position-embeddings` | Specifies whether to use rotary position embedding (RoPE). | Specified by `position_embedding_type`==`rope` | Specifies whether to use RoPE. | + | `rotary-base` | Rotary base used for RoPE. | `rotary_base` | Rotary base used for RoPE. | + | `rotary-percent` | RoPE usage ratio. | `rotary_percent` | RoPE usage ratio. | + | `rotary-interleaved` | Specifies whether to use interleaved RoPE. | `rotary_interleaved` | Specifies whether to use interleaved RoPE. | + | `rotary-seq-len-interpolation-factor` | Rotary sequence length interpolation factor. | `rotary_seq_len_interpolation_factor` | Rotary sequence length interpolation factor. | + | `use-rope-scaling` | Specifies whether to enable RoPE scaling. | `use_rope_scaling` | Specifies whether to enable RoPE scaling. | + | `rope-scaling-factor` | RoPE scaling factor. | `scaling_factor` | RoPE scaling factor. | + | `no-position-embedding` | Specifies whether to disable location encoding. | `no-position-embedding` | Specifies whether to disable location encoding. | + | `disable-bias-linear` | Disables bias in linear layers. 
| `add_bias_linear` | Enables bias in linear layers. | + | `mrope-section` | Information of multiple RoPE sections. | Not supported. | | + | `make-vocab-size-divisible-by` | Divides the size of the word table by a specified number. | Not supported. | By default, the dictionary size is not changed. | + | `init-method-std` | Standard deviation of the normal distribution used during model parameter initialization. | `init_method_std` | Standard deviation of the normal distribution used during model parameter initialization. | + | `attention-dropout` | Dropout probability applied in the multi-head self-attention mechanism. | `attention_dropout` | Dropout probability applied in the multi-head self-attention mechanism. | + | `hidden-dropout` | Dropout probability in the hidden layer. | `hidden_dropout` | Dropout probability in the hidden layer. | + | `normalization` | Normalization method, which can be LayerNorm or RMSNorm. | `normalization` | Normalization method, which can be LayerNorm or RMSNorm. | + | `norm-epsilon` | Normalized stability factor (epsilon). | `rms_norm_eps` | RMSNorm stability factor. | + | `apply-layernorm-1p` | Specifies whether to add 1 after LayerNorm. | Not supported. | | + | `apply-residual-connection-post-layernorm` | Specifies whether the residual connection is applied after LayerNorm. | `apply_residual_connection_post_layernorm` | Specifies whether the residual connection is applied after LayerNorm. | + | `openai-gelu` | Specifies whether to use the GELU activation function of the OpenAI version. | Not supported. | | + | `squared-relu` | Specifies whether to use the square ReLU activation function. | Not supported. | | + | Specified by `swiglu`, `openai-gelu`, and `squared-relu` | The default value is **torch.nn.functional.gelu**. | `hidden_act` | Activation function type. | + | `gated_linear_unit` | Specifies whether to use gate linear unit in multi-layer perceptron (MLP). | `gated_linear_unit` | Specifies whether to use gate linear unit in MLP. | + | `swiglu` | Specifies whether to use the SwiGLU activation function. | `hidden_act` == `silu` and `gated_linear_unit` | Specifies whether to use the SwiGLU activation function. | + | `no-persist-layer-norm` | Disables persistence layer normalization. | Not supported. | | + | `untie-embeddings-and-output-weights` | Specifies whether to decouple the weights of the input embedding layer and output layer. | `untie_embeddings_and_output_weights` | Specifies whether to decouple the weights of the input embedding layer and output layer. | + | Specified by `fp16` and `bf16` | Tensor compute precision during training. | `compute_dtype` | Tensor compute precision during training. | + | `grad-reduce-in-bf16` | Gradient reduction using BFloat16. | Not supported. | | + | Not supported. | By default, the initialization tensor is generated in BFloat16 format. | `param_init_type` | Initial precision of the weight tensor. The default value is **Float32**, which ensures that the backward gradient is updated in Float32. | + | Not supported. | By default, layer normalization is calculated in Float32. | `layernorm_compute_type` | Layer normalization tensor calculation precision. | + | `attention-softmax-in-fp32` | Executes **attention softmax** in Float32. | `softmax_compute_type` | Softmax tensor calculation precision. | + | Not supported. | | `rotary_dtype` | Position encoding tensor calculation precision. | + | `loss-scale` | Overall loss scaling factor. 
| `loss_scale_value` | Overall loss scaling factor, which is configured in **runner_wrapper**. If `compute_dtype` is set to **BFloat16**, the value is usually set to **1.0**. | + | `initial-loss-scale` | Initial loss scaling factor. | Not supported. | | + | `min-loss-scale` | Minimum loss scaling factor. | Not supported. | | + | `loss-scale-window` | Dynamic window size scaling. | `loss_scale_window` | Dynamic window size scaling. | + | `hysteresis` | Loss scale hysteresis parameter. | Not supported. | | + | `fp32-residual-connection` | Uses Float32 for residual connection. | `fp32_residual_connection` | Uses Float32 for residual connection. | + | `accumulate-allreduce-grads-in-fp32` | Accumulates and reduces gradients using Float32. | Not supported. | Accumulates and reduces gradients using Float32 by default. | + | `fp16-lm-cross-entropy` | Uses Float16 to execute the cross entropy of the LLM. | Not supported. | Uses Float32 to execute the cross entropy of the LLM by default. | + | `q-lora-rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. | `q_lora_rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. | + | `kv-lora-rank` | LoRA rank of the key/value projection layer, which is used when KV-LoRA is enabled. | `kv_lora_rank` | LoRA rank of the key/value projection layer, which is used when KV-LoRA is enabled. | + | `qk-head-dim` | Number of dimensions per Q/K head. | `qk_nope_head_dim` | Number of dimensions per Q/K head. | + | `qk-pos-emb-head-dim` | Number of relative position embedding dimensions per Q/K head. | `qk_rope_head_dim` | Number of relative position embedding dimensions per Q/K head. | + | `v-head-dim` | Number of dimensions per value projection (V head). | `v_head_dim` | Number of dimensions per value projection (V head). | + | `rotary-scaling-factor` | RoPE scaling coefficient. | `scaling_factor` | RoPE scaling coefficient. | + | `use-precision-aware-optimizer` | Enables the optimizer with precision awareness to automatically manage parameter updates of different data types. | Not supported. | | + | `main-grads-dtype` | Data type of the main gradient. | Not supported. | By default, Float32 is used as the data type of the main gradient. | + | `main-params-dtype` | Data type of the main parameter. | Not supported. | By default, Float32 is used as the data type of the main parameter. | + | `exp-avg-dtype` | Data type of the exponential moving average (EMA). | Not supported. | | + | `exp-avg-sq-dtype` | Data type of the EMA square item. | Not supported. | | + | `first-last-layers-bf16` | Specifies whether to forcibly use BFloat16 at the first and last layers. | Not supported. | | + | `num-layers-at-start-in-bf16` | Number of layers that start with BFloat16. | Not supported. | | + | `num-layers-at-end-in-bf16` | Number of layers that end with BFloat16. | Not supported. | | + | `multi-latent-attention` | Specifies whether to enable the multi-hidden variable attention mechanism. | `multi_latent_attention` | Specifies whether to enable the multi-hidden variable attention mechanism. | + | `qk-layernorm` | Enables query/key layer normalization. | `qk-layernorm` | Enables query/key layer normalization. 
|

  - Optimizer and learning rate scheduling configurations

diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/accuracy_comparison.md b/docs/mindformers/docs/source_zh_cn/advanced_development/accuracy_comparison.md
index a6ab5b0aff..3f9cd7b6c0 100644
--- a/docs/mindformers/docs/source_zh_cn/advanced_development/accuracy_comparison.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/accuracy_comparison.md
@@ -105,7 +105,7 @@ Megatron-LM is a mature framework for large-scale training tasks, with a high degree of
     | `min-loss-scale`                     | Minimum loss scaling factor.                          | Not supported.             |                                                                  |
     | `loss-scale-window`                  | Dynamic window size scaling.                          | `loss_scale_window`        | Dynamic window size scaling.                                     |
     | `hysteresis`                         | Loss scale hysteresis parameter.                      | Not supported.             |                                                                  |
-    | `fp32-residual-connection`           | Uses Float32 for residual connection.                 | Not supported.             |                                                                  |
+    | `fp32-residual-connection`           | Uses Float32 for residual connection.                 | `fp32_residual_connection` | Uses Float32 for residual connection.                            |
     | `accumulate-allreduce-grads-in-fp32` | Accumulates and reduces gradients using Float32.      | Not supported.             | Accumulates and reduces gradients using Float32 by default.      |
     | `fp16-lm-cross-entropy`              | Uses Float16 to execute the cross entropy of the LLM. | Not supported.             | Uses Float32 to execute the cross entropy of the LLM by default. |
     | `q-lora-rank`                        | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. | `q_lora_rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. |
--
Gitee
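To make the mapping above concrete, the fragment below is a minimal, illustrative sketch of the precision-related MindSpore Transformers YAML settings covered by the table, with the corresponding Megatron-LM flag noted in each comment. The nesting under `model.model_config` and `runner_wrapper.scale_sense`, as well as every value shown, are assumptions for illustration only, not a complete or verified configuration; the released YAML file of each model remains the authoritative reference.

```yaml
# Illustrative sketch only: the keys follow the MindSpore Transformers column of
# the table above; the values and the exact nesting are assumed for illustration.
use_legacy: False                                # equivalent to Megatron-LM --use-mcore-model

model:
  model_config:
    num_layers: 32                               # --num-layers
    hidden_size: 4096                            # --hidden-size
    intermediate_size: 11008                     # --ffn-hidden-size
    num_heads: 32                                # --num-attention-heads
    use_gqa: True                                # --group-query-attention
    n_kv_heads: 8                                # --num-query-groups
    position_embedding_type: "rope"              # --position-embedding-type rope
    rotary_base: 10000                           # --rotary-base
    normalization: "RMSNorm"                     # --normalization RMSNorm
    rms_norm_eps: 1.0e-5                         # --norm-epsilon
    hidden_act: "silu"                           # together with gated_linear_unit, matches --swiglu
    gated_linear_unit: True
    untie_embeddings_and_output_weights: True    # --untie-embeddings-and-output-weights
    # Precision-related dtypes: keep these consistent with the Megatron-LM run
    # (--bf16, --attention-softmax-in-fp32, and so on) before comparing losses.
    compute_dtype: "bfloat16"                    # Megatron-LM --bf16
    param_init_type: "float32"                   # no direct Megatron-LM flag; see the table note
    layernorm_compute_type: "float32"
    softmax_compute_type: "float32"              # --attention-softmax-in-fp32
    rotary_dtype: "float32"
    fp32_residual_connection: False              # --fp32-residual-connection (set on both sides or neither)

# Loss scaling is configured under runner_wrapper rather than in model_config.
runner_wrapper:
  scale_sense:
    loss_scale_value: 1.0                        # --loss-scale; usually 1.0 when compute_dtype is bfloat16
    loss_scale_window: 1000                      # --loss-scale-window
```

When aligning precision, the useful check is that every dtype-related entry above resolves to the same effective behavior on both sides; knobs that only one framework exposes (for example `param_init_type`) should stay at the defaults documented in the table.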