From 0e525eeb1c3b413db88655b533c466bd87223a7e Mon Sep 17 00:00:00 2001
From: zhangyihuiben
Date: Mon, 15 Dec 2025 16:26:11 +0800
Subject: [PATCH] Revise the accuracy documentation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../accuracy_comparison.md | 138 +++++++++---------
 .../accuracy_comparison.md |   2 +-
 2 files changed, 70 insertions(+), 70 deletions(-)

diff --git a/docs/mindformers/docs/source_en/advanced_development/accuracy_comparison.md b/docs/mindformers/docs/source_en/advanced_development/accuracy_comparison.md
index 4c688eb4e8..892a5f0d66 100644
--- a/docs/mindformers/docs/source_en/advanced_development/accuracy_comparison.md
+++ b/docs/mindformers/docs/source_en/advanced_development/accuracy_comparison.md
@@ -55,75 +55,75 @@ The following tables describe the configuration comparison with Megatron-LM.

     This document supports only the precision comparison of the mcore model. Therefore, `--use-mcore-model` must be configured for Megatron-LM, and `use_legacy: False` must be configured for MindSpore Transformers.

-    | Megatron-LM | Description | MindSpore Transformers | Description |
-    |--------------------------------------------|---------------------------------------------|--------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
-    | `use-legacy-model` and `use-mcore-model` | Specifies whether to use the mcore model. | `use_legacy` | Specifies whether to use the mcore model. `use_legacy: False` is equivalent to `--use-mcore-model`. |
-    | `num-layers` | Number of network layers, that is, number of transformer layers. | `num_layers` | Number of network layers, that is, number of transformer layers. |
-    | `encoder-num-layers` | Number of encoder layers. | Not supported. | |
-    | `decoder-num-layers` | Number of decoder layers. | Not supported. | |
-    | `hidden-size` | Size of the hidden layer, which is the dimension in the hidden state. | `hidden_size` | Size of the hidden layer, which is the dimension in the hidden state. |
-    | `ffn-hidden-size` | Size of the hidden layer in the feedforward network. | `intermediate_size` | Size of the hidden layer in the feedforward network. |
-    | `num-attention-heads` | Number of attention heads. | `num_heads` | Number of attention heads. |
-    | `kv-channels` | Number of key/value tensor channels. | `head_dim` | Number of key/value tensor channels. |
-    | `group-query-attention` | Specifies whether to enable group query attention. | `use_gqa` | Specifies whether to enable group query attention. |
-    | `num-query-groups` | Number of query groups. | `n_kv_heads` | Number of query groups. |
-    | `max-position-embeddings` | Maximum position encoding length. | `max_position_embeddings` | Maximum position encoding length. |
-    | `position-embedding-type` | Position encoding type, such as learned_absolute and rope. | `position_embedding_type` | Position encoding type, such as learned_absolute and rope. |
-    | `use-rotary-position-embeddings` | Specifies whether to use rotary position embedding (RoPE). | Specified by `position_embedding_type`==`rope` | Specifies whether to use RoPE. |
-    | `rotary-base` | Rotary base used for RoPE. | `rotary_base` | Rotary base used for RoPE. |
-    | `rotary-percent` | RoPE usage ratio. | `rotary_percent` | RoPE usage ratio. |
-    | `rotary-interleaved` | Specifies whether to use interleaved RoPE.
| `rotary_interleaved` | Specifies whether to use interleaved RoPE. | - | `rotary-seq-len-interpolation-factor` | Rotary sequence length interpolation factor. | `rotary_seq_len_interpolation_factor` | Rotary sequence length interpolation factor. | - | `use-rope-scaling` | Specifies whether to enable RoPE scaling. | `use_rope_scaling` | Specifies whether to enable RoPE scaling. | - | `rope-scaling-factor` | RoPE scaling factor. | `scaling_factor` | RoPE scaling factor. | - | `no-position-embedding` | Specifies whether to disable location encoding. | `no-position-embedding` | Specifies whether to disable location encoding. | - | `disable-bias-linear` | Disables bias in linear layers. | `add_bias_linear` | Enables bias in linear layers. | - | `mrope-section` | Information of multiple RoPE sections. | Not supported. | | - | `make-vocab-size-divisible-by` | Divides the size of the word table by a specified number. | Not supported. | By default, the dictionary size is not changed. | - | `init-method-std` | Standard deviation of the normal distribution used during model parameter initialization. | `init_method_std` | Standard deviation of the normal distribution used during model parameter initialization. | - | `attention-dropout` | Dropout probability applied in the multi-head self-attention mechanism. | `attention_dropout` | Dropout probability applied in the multi-head self-attention mechanism. | - | `hidden-dropout` | Dropout probability in the hidden layer. | `hidden_dropout` | Dropout probability in the hidden layer. | - | `normalization` | Normalization method, which can be LayerNorm or RMSNorm. | `normalization` | Normalization method, which can be LayerNorm or RMSNorm. | - | `norm-epsilon` | Normalized stability factor (epsilon). | `rms_norm_eps` | RMSNorm stability factor. | - | `apply-layernorm-1p` | Specifies whether to add 1 after LayerNorm. | Not supported. | | - | `apply-residual-connection-post-layernorm` | Specifies whether the residual connection is applied after LayerNorm. | `apply_residual_connection_post_layernorm` | Specifies whether the residual connection is applied after LayerNorm. | - | `openai-gelu` | Specifies whether to use the GELU activation function of the OpenAI version. | Not supported. | | - | `squared-relu` | Specifies whether to use the square ReLU activation function. | Not supported. | | - | Specified by `swiglu`, `openai-gelu`, and `squared-relu` | The default value is **torch.nn.functional.gelu**. | `hidden_act` | Activation function type. | - | `gated_linear_unit` | Specifies whether to use gate linear unit in multi-layer perceptron (MLP). | `gated_linear_unit` | Specifies whether to use gate linear unit in MLP. | - | `swiglu` | Specifies whether to use the SwiGLU activation function. | `hidden_act`==`silu` and `gated_linear_unit`| Specifies whether to use the SwiGLU activation function. | - | `no-persist-layer-norm` | Disables persistence layer normalization. | Not supported. | | - | `untie-embeddings-and-output-weights` | Specifies whether to decouple the weights of the input embedding layer and output layer. | `untie_embeddings_and_output_weights` | Specifies whether to decouple the weights of the input embedding layer and output layer. | - | Specified by `fp16` and `bf16` | Tensor compute precision during training. | `compute_dtype` | Tensor compute precision during training. | - | `grad-reduce-in-bf16` | Gradient reduction using BFloat16. | Not supported. | | - | Not supported. | By default, the initialization tensor is generated in BFloat16 format. 
| `param_init_type` | Initial precision of the weight tensor. The default value is **Float32**, which ensures that the backward gradient is updated in Float32. | - | Not supported. | By default, layer normalization is calculated in Float32. | `layernorm_compute_type` | Layer normalization tensor calculation precision. | - | `attention-softmax-in-fp32` | Executes **attention softmax** in Float32. | `softmax_compute_type` | Softmax tensor calculation precision. | - | Not supported. | | `rotary_dtype` | Position encoding tensor calculation precision. | - | `loss-scale` | Overall loss scaling factor. | `loss_scale_value` | Overall loss scaling factor, which is configured in **runner_wrapper**. If `compute_dtype` is set to **BFloat16**, the value is usually set to **1.0**. | - | `initial-loss-scale` | Initial loss scaling factor. | Not supported. | | - | `min-loss-scale` | Minimum loss scaling factor. | Not supported. | | - | `loss-scale-window` | Dynamic window size scaling. | `loss_scale_window` | Dynamic window size scaling. | - | `hysteresis` | Loss scale hysteresis parameter. | Not supported. | | - | `fp32-residual-connection` | Uses Float32 for residual connection. | Not supported. | | - | `accumulate-allreduce-grads-in-fp32` | Accumulates and reduces gradients using Float32. | Not supported. | Accumulates and reduces gradients using Float32 by default. | - | `fp16-lm-cross-entropy` | Uses Float16 to execute the cross entropy of the LLM. | Not supported. | Uses Float32 to execute the cross entropy of the LLM by default. | - | `q-lora-rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. | `q_lora_rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. | - | `kv-lora-rank` | LoRA rank of the key/value projection layer, which is used when KV-LoRA is enabled. | `kv_lora_rank` | LoRA rank of the key/value projection layer, which is used when KV-LoRA is enabled. | - | `qk-head-dim` | Number of dimensions per Q/K head. | `qk_nope_head_dim` | Number of dimensions per Q/K head. | - | `qk-pos-emb-head-dim` | Number of relative position embedding dimensions per Q/K head. | `qk_rope_head_dim` | Number of relative position embedding dimensions per Q/K head. | - | `v-head-dim` | Number of dimensions per value projection (V head). | `v_head_dim` | Number of dimensions per value projection (V head). | - | `rotary-scaling-factor` | RoPE scaling coefficient.| `scaling_factor` | RoPE scaling coefficient. | - | `use-precision-aware-optimizer` | Enables the optimizer with precision awareness to automatically manage parameter updates of different data types. | Not supported. | | - | `main-grads-dtype` | Data type of the main gradient. | Not supported. | By default, Float32 is used as the data type of the main gradient. | - | `main-params-dtype` | Data type of the main parameter. | Not supported. | By default, Float32 is used as the data type of the main parameter. | - | `exp-avg-dtype` | Data type of the exponential moving average (EMA). | Not supported. | | - | `exp-avg-sq-dtype` | Data type of the EMA square item. | Not supported. | | - | `first-last-layers-bf16` | Specifies whether to forcibly use BFloat16 at the first and last layers. | Not supported. | | - | `num-layers-at-start-in-bf16` | Number of layers that start with BFloat16. | Not supported. | | - | `num-layers-at-end-in-bf16` | Number of layers that end with BFloat16. | Not supported. 
| | - | `multi-latent-attention` | Specifies whether to enable the multi-hidden variable attention mechanism. | `multi_latent_attention` | Specifies whether to enable the multi-hidden variable attention mechanism. | - | `qk-layernorm` | Enables query/key layer normalization. | `qk-layernorm` | Enables query/key layer normalization. | + | Megatron-LM | Description | MindSpore Transformers | Description | + |--------------------------------------------|-------------------------------------------------------------------------------------------------------------------|------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------| + | `use-legacy-model` and `use-mcore-model` | Specifies whether to use the mcore model. | `use_legacy` | Specifies whether to use the mcore model. `use_legacy: False` is equivalent to `--use-mcore-model`. | + | `num-layers` | Number of network layers, that is, number of transformer layers. | `num_layers` | Number of network layers, that is, number of transformer layers. | + | `encoder-num-layers` | Number of encoder layers. | Not supported. | | + | `decoder-num-layers` | Number of decoder layers. | Not supported. | | + | `hidden-size` | Size of the hidden layer, which is the dimension in the hidden state. | `hidden_size` | Size of the hidden layer, which is the dimension in the hidden state. | + | `ffn-hidden-size` | Size of the hidden layer in the feedforward network. | `intermediate_size` | Size of the hidden layer in the feedforward network. | + | `num-attention-heads` | Number of attention heads. | `num_heads` | Number of attention heads. | + | `kv-channels` | Number of key/value tensor channels. | `head_dim` | Number of key/value tensor channels. | + | `group-query-attention` | Specifies whether to enable group query attention. | `use_gqa` | Specifies whether to enable group query attention. | + | `num-query-groups` | Number of query groups. | `n_kv_heads` | Number of query groups. | + | `max-position-embeddings` | Maximum position encoding length. | `max_position_embeddings` | Maximum position encoding length. | + | `position-embedding-type` | Position encoding type, such as learned_absolute and rope. | `position_embedding_type` | Position encoding type, such as learned_absolute and rope. | + | `use-rotary-position-embeddings` | Specifies whether to use rotary position embedding (RoPE). | Specified by `position_embedding_type`==`rope` | Specifies whether to use RoPE. | + | `rotary-base` | Rotary base used for RoPE. | `rotary_base` | Rotary base used for RoPE. | + | `rotary-percent` | RoPE usage ratio. | `rotary_percent` | RoPE usage ratio. | + | `rotary-interleaved` | Specifies whether to use interleaved RoPE. | `rotary_interleaved` | Specifies whether to use interleaved RoPE. | + | `rotary-seq-len-interpolation-factor` | Rotary sequence length interpolation factor. | `rotary_seq_len_interpolation_factor` | Rotary sequence length interpolation factor. | + | `use-rope-scaling` | Specifies whether to enable RoPE scaling. | `use_rope_scaling` | Specifies whether to enable RoPE scaling. | + | `rope-scaling-factor` | RoPE scaling factor. | `scaling_factor` | RoPE scaling factor. | + | `no-position-embedding` | Specifies whether to disable location encoding. | `no-position-embedding` | Specifies whether to disable location encoding. | + | `disable-bias-linear` | Disables bias in linear layers. 
| `add_bias_linear` | Enables bias in linear layers. | + | `mrope-section` | Information of multiple RoPE sections. | Not supported. | | + | `make-vocab-size-divisible-by` | Divides the size of the word table by a specified number. | Not supported. | By default, the dictionary size is not changed. | + | `init-method-std` | Standard deviation of the normal distribution used during model parameter initialization. | `init_method_std` | Standard deviation of the normal distribution used during model parameter initialization. | + | `attention-dropout` | Dropout probability applied in the multi-head self-attention mechanism. | `attention_dropout` | Dropout probability applied in the multi-head self-attention mechanism. | + | `hidden-dropout` | Dropout probability in the hidden layer. | `hidden_dropout` | Dropout probability in the hidden layer. | + | `normalization` | Normalization method, which can be LayerNorm or RMSNorm. | `normalization` | Normalization method, which can be LayerNorm or RMSNorm. | + | `norm-epsilon` | Normalized stability factor (epsilon). | `rms_norm_eps` | RMSNorm stability factor. | + | `apply-layernorm-1p` | Specifies whether to add 1 after LayerNorm. | Not supported. | | + | `apply-residual-connection-post-layernorm` | Specifies whether the residual connection is applied after LayerNorm. | `apply_residual_connection_post_layernorm` | Specifies whether the residual connection is applied after LayerNorm. | + | `openai-gelu` | Specifies whether to use the GELU activation function of the OpenAI version. | Not supported. | | + | `squared-relu` | Specifies whether to use the square ReLU activation function. | Not supported. | | + | Specified by `swiglu`, `openai-gelu`, and `squared-relu` | The default value is **torch.nn.functional.gelu**. | `hidden_act` | Activation function type. | + | `gated_linear_unit` | Specifies whether to use gate linear unit in multi-layer perceptron (MLP). | `gated_linear_unit` | Specifies whether to use gate linear unit in MLP. | + | `swiglu` | Specifies whether to use the SwiGLU activation function. | `hidden_act` == `silu` and `gated_linear_unit` | Specifies whether to use the SwiGLU activation function. | + | `no-persist-layer-norm` | Disables persistence layer normalization. | Not supported. | | + | `untie-embeddings-and-output-weights` | Specifies whether to decouple the weights of the input embedding layer and output layer. | `untie_embeddings_and_output_weights` | Specifies whether to decouple the weights of the input embedding layer and output layer. | + | Specified by `fp16` and `bf16` | Tensor compute precision during training. | `compute_dtype` | Tensor compute precision during training. | + | `grad-reduce-in-bf16` | Gradient reduction using BFloat16. | Not supported. | | + | Not supported. | By default, the initialization tensor is generated in BFloat16 format. | `param_init_type` | Initial precision of the weight tensor. The default value is **Float32**, which ensures that the backward gradient is updated in Float32. | + | Not supported. | By default, layer normalization is calculated in Float32. | `layernorm_compute_type` | Layer normalization tensor calculation precision. | + | `attention-softmax-in-fp32` | Executes **attention softmax** in Float32. | `softmax_compute_type` | Softmax tensor calculation precision. | + | Not supported. | | `rotary_dtype` | Position encoding tensor calculation precision. | + | `loss-scale` | Overall loss scaling factor. 
| `loss_scale_value` | Overall loss scaling factor, which is configured in **runner_wrapper**. If `compute_dtype` is set to **BFloat16**, the value is usually set to **1.0**. | + | `initial-loss-scale` | Initial loss scaling factor. | Not supported. | | + | `min-loss-scale` | Minimum loss scaling factor. | Not supported. | | + | `loss-scale-window` | Dynamic window size scaling. | `loss_scale_window` | Dynamic window size scaling. | + | `hysteresis` | Loss scale hysteresis parameter. | Not supported. | | + | `fp32-residual-connection` | Uses Float32 for residual connection. | `fp32_residual_connection` | Uses Float32 for residual connection. | + | `accumulate-allreduce-grads-in-fp32` | Accumulates and reduces gradients using Float32. | Not supported. | Accumulates and reduces gradients using Float32 by default. | + | `fp16-lm-cross-entropy` | Uses Float16 to execute the cross entropy of the LLM. | Not supported. | Uses Float32 to execute the cross entropy of the LLM by default. | + | `q-lora-rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. | `q_lora_rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. | + | `kv-lora-rank` | LoRA rank of the key/value projection layer, which is used when KV-LoRA is enabled. | `kv_lora_rank` | LoRA rank of the key/value projection layer, which is used when KV-LoRA is enabled. | + | `qk-head-dim` | Number of dimensions per Q/K head. | `qk_nope_head_dim` | Number of dimensions per Q/K head. | + | `qk-pos-emb-head-dim` | Number of relative position embedding dimensions per Q/K head. | `qk_rope_head_dim` | Number of relative position embedding dimensions per Q/K head. | + | `v-head-dim` | Number of dimensions per value projection (V head). | `v_head_dim` | Number of dimensions per value projection (V head). | + | `rotary-scaling-factor` | RoPE scaling coefficient. | `scaling_factor` | RoPE scaling coefficient. | + | `use-precision-aware-optimizer` | Enables the optimizer with precision awareness to automatically manage parameter updates of different data types. | Not supported. | | + | `main-grads-dtype` | Data type of the main gradient. | Not supported. | By default, Float32 is used as the data type of the main gradient. | + | `main-params-dtype` | Data type of the main parameter. | Not supported. | By default, Float32 is used as the data type of the main parameter. | + | `exp-avg-dtype` | Data type of the exponential moving average (EMA). | Not supported. | | + | `exp-avg-sq-dtype` | Data type of the EMA square item. | Not supported. | | + | `first-last-layers-bf16` | Specifies whether to forcibly use BFloat16 at the first and last layers. | Not supported. | | + | `num-layers-at-start-in-bf16` | Number of layers that start with BFloat16. | Not supported. | | + | `num-layers-at-end-in-bf16` | Number of layers that end with BFloat16. | Not supported. | | + | `multi-latent-attention` | Specifies whether to enable the multi-hidden variable attention mechanism. | `multi_latent_attention` | Specifies whether to enable the multi-hidden variable attention mechanism. | + | `qk-layernorm` | Enables query/key layer normalization. | `qk-layernorm` | Enables query/key layer normalization. 
|

  - Optimizer and learning rate scheduling configurations

diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/accuracy_comparison.md b/docs/mindformers/docs/source_zh_cn/advanced_development/accuracy_comparison.md
index a6ab5b0aff..3f9cd7b6c0 100644
--- a/docs/mindformers/docs/source_zh_cn/advanced_development/accuracy_comparison.md
+++ b/docs/mindformers/docs/source_zh_cn/advanced_development/accuracy_comparison.md
@@ -105,7 +105,7 @@ Megatron-LM is a mature framework for large-scale training tasks, with a high degree of
     | `min-loss-scale`                     | Minimum loss scaling factor.                          | Not supported.             |                                                                  |
     | `loss-scale-window`                  | Dynamic window size scaling.                          | `loss_scale_window`        | Dynamic window size scaling.                                     |
     | `hysteresis`                         | Loss scale hysteresis parameter.                      | Not supported.             |                                                                  |
-    | `fp32-residual-connection`           | Uses Float32 for residual connection.                 | Not supported.             |                                                                  |
+    | `fp32-residual-connection`           | Uses Float32 for residual connection.                 | `fp32_residual_connection` | Uses Float32 for residual connection.                            |
     | `accumulate-allreduce-grads-in-fp32` | Accumulates and reduces gradients using Float32.      | Not supported.             | Accumulates and reduces gradients using Float32 by default.      |
     | `fp16-lm-cross-entropy`              | Uses Float16 to execute the cross entropy of the LLM. | Not supported.             | Uses Float32 to execute the cross entropy of the LLM by default. |
     | `q-lora-rank`                        | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. | `q_lora_rank` | LoRA rank of the query projection layer, which is used when Q-LoRA is enabled. |
--
Gitee
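To make the mapping above concrete, the fragment below is a minimal, illustrative sketch of the precision-related MindSpore Transformers YAML settings covered by the table, with the corresponding Megatron-LM flag noted in each comment. The nesting under `model.model_config` and `runner_wrapper.scale_sense`, as well as every value shown, are assumptions for illustration only, not a complete or verified configuration; the released YAML file of each model remains the authoritative reference.

```yaml
# Illustrative sketch only: the keys follow the MindSpore Transformers column of
# the table above; the values and the exact nesting are assumed for illustration.
use_legacy: False                                # equivalent to Megatron-LM --use-mcore-model

model:
  model_config:
    num_layers: 32                               # --num-layers
    hidden_size: 4096                            # --hidden-size
    intermediate_size: 11008                     # --ffn-hidden-size
    num_heads: 32                                # --num-attention-heads
    use_gqa: True                                # --group-query-attention
    n_kv_heads: 8                                # --num-query-groups
    position_embedding_type: "rope"              # --position-embedding-type rope
    rotary_base: 10000                           # --rotary-base
    normalization: "RMSNorm"                     # --normalization RMSNorm
    rms_norm_eps: 1.0e-5                         # --norm-epsilon
    hidden_act: "silu"                           # together with gated_linear_unit, matches --swiglu
    gated_linear_unit: True
    untie_embeddings_and_output_weights: True    # --untie-embeddings-and-output-weights
    # Precision-related dtypes: keep these consistent with the Megatron-LM run
    # (--bf16, --attention-softmax-in-fp32, and so on) before comparing losses.
    compute_dtype: "bfloat16"                    # Megatron-LM --bf16
    param_init_type: "float32"                   # no direct Megatron-LM flag; see the table note
    layernorm_compute_type: "float32"
    softmax_compute_type: "float32"              # --attention-softmax-in-fp32
    rotary_dtype: "float32"
    fp32_residual_connection: False              # --fp32-residual-connection (set on both sides or neither)

# Loss scaling is configured under runner_wrapper rather than in model_config.
runner_wrapper:
  scale_sense:
    loss_scale_value: 1.0                        # --loss-scale; usually 1.0 when compute_dtype is bfloat16
    loss_scale_window: 1000                      # --loss-scale-window
```

When aligning precision, the useful check is that every dtype-related entry above resolves to the same effective behavior on both sides; knobs that only one framework exposes (for example `param_init_type`) should stay at the defaults documented in the table.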