diff --git a/docs/mindformers/docs/source_en/advanced_development/images/mcore_infer_migration_process.png b/docs/mindformers/docs/source_en/advanced_development/images/mcore_infer_migration_process.png new file mode 100644 index 0000000000000000000000000000000000000000..1463717fbead3cb5761b0b980b6710430dfdf497 Binary files /dev/null and b/docs/mindformers/docs/source_en/advanced_development/images/mcore_infer_migration_process.png differ diff --git a/docs/mindformers/docs/source_en/advanced_development/inference_llm_migration.md b/docs/mindformers/docs/source_en/advanced_development/inference_llm_migration.md new file mode 100644 index 0000000000000000000000000000000000000000..30638626c643b2c8ceb2a001ad8293d6752ddac4 --- /dev/null +++ b/docs/mindformers/docs/source_en/advanced_development/inference_llm_migration.md @@ -0,0 +1,564 @@ +# LLM Inference Model Construction Tutorial + +[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_en/advanced_development/infer_llm_migration.md) + +## Overview + +This tutorial aims to guide developers to complete the adaptation of the inference model based on Mcore architecture. + +The core of model adaptation lies in understanding the composition and construction of model modules. Specifically, it involves the use of the [`GPTModel`](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/base_models/gpt/gpt_model.py#L37) class—the unified entry point for LLM models—and the `ModuleSpec` mechanism for building Megatron-style models. + +## Adaptation Workflow + +The core workflow for adapting an MCore inference model is as follows: + +1. Call the `get_spec` method to construct the `transformer_layer_spec` + +2. Initialize a `GPTModel` instance using this specification and other key parameters + +3. Create an instance of the `InferModelForCausalLM` class + +4. Assign the `GPTModel` object to the `model` member variable of `InferForCausalLM` + +The corresponding workflow diagram is shown below: + +![MCore_infer_building_process](./images/mcore_infer_migration_process.png) + +## Model Spec Construction Guide + +Transformer-based large language models share similar architectural patterns. MCore leverages two core functions to generalize the construction of various Transformer-style models (e.g., Qwen, DeepSeek): + + ● [get_gpt_layer_local_spec](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/base_models/gpt/gpt_layer_specs.py#L50): Constructs the module specification (`ModuleSpec`) for a single Transformer layer. Suitable for models with uniform layer structures (e.g., Qwen3). + + ● [get_gpt_decoder_block_spec](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/base_models/gpt/gpt_layer_specs.py#L141): Constructs the specification for the entire decoder block (i.e., multiple Transformer layers), supporting mixed MoE and Dense layers. Ideal for models with heterogeneous layer structures (e.g., DeepSeekV3). + +By flexibly configuring parameters, MCore enables efficient construction of diverse Transformer-like architectures. 
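+To make the four-step workflow above concrete before diving into the two functions, the following skeleton sketches how the pieces fit together. The wrapper class name and its constructor arguments are illustrative placeholders (a real wrapper also inherits the model's `PreTrainedModel` class and `InferModelMixin`, as shown in the adaptation practice section below); only `GPTModel` and `get_gpt_layer_local_spec` follow the interfaces described in this tutorial.
+
+```python
+# Skeleton of the adaptation workflow; names other than GPTModel and
+# get_gpt_layer_local_spec are illustrative placeholders.
+from mindformers.parallel_core.inference.base_models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
+from mindformers.parallel_core.inference.base_models.gpt.gpt_model import GPTModel
+
+
+class MyModelForCausalLM:  # Step 3: the *ForCausalLM wrapper class
+    def __init__(self, transformer_config, vocab_size, max_sequence_length):
+        # Step 1: build the per-layer spec from the model's structural features
+        transformer_layer_spec = get_gpt_layer_local_spec(
+            normalization="RMSNorm",
+            qk_layernorm=False,
+        )
+        # Steps 2 and 4: initialize GPTModel with the spec and key parameters,
+        # and keep it as the wrapper's `model` member
+        self.model = GPTModel(
+            config=transformer_config,
+            transformer_layer_spec=transformer_layer_spec,
+            vocab_size=vocab_size,
+            max_sequence_length=max_sequence_length,
+            position_embedding_type='rope',
+        )
+```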
+ +### get_gpt_layer_local_spec + +#### Functionality + +This function defines a `ModuleSpec` for a single Transformer layer, supporting various architectural features: + +● MoE(Mixture of Experts) + +● Multi Latent Attention + +● Q/K LayerNorm + +● Flash Attention + +● Sandwich Norm + +#### Parameters + +| PARAMETER NAME | TYPE | DESCRIPTION | DEFAULT VALUE | +| ---------------------- | ------------- | ------------------------------------------------------------ | ------------- | +| num_experts | Optional[int] | Number of experts in MoE; `None`indicates a Dense layer | `None` | +| moe_grouped_gemm | bool | Whether to use Grouped GEMM to accelerate MoE computation (enabled only) | `True` | +| qk_layernorm | bool | Whether to apply normalization to Query and Key | `False` | +| gated_linear_unit | bool | Whether to enable gated linear unit | `True` | +| multi_latent_attention | bool | Whether to use Multi Latent Attention (MLA) | `False` | +| normalization | Optional[str] | Type of normalization (e.g.,`"RMSNorm"`,`"LayerNorm"`) | `None` | +| qk_l2_norm | bool | Whether to apply L2 normalization (currently unsupported) | `False` | +| use_flash_attention | bool | Whether to enable Flash Attention (enabled by default) | `True` | +| sandwich_norm | bool | Whether to enable Sandwich Norm (specific to GLM models) | `False` | +| use_alltoall | bool | Whether to use AllToAll communication | `False` | +| use_fused_mla | bool | Whether to use fused MLA operator (for quantization) | `False` | + +#### Usage Example + +```python +from mindformers.parallel_core.inference.base_models.gpt.gpt_layer_specs import get_gpt_layer_local_spec + +# Define a spec for a single Transformer layer with: +# SelfAttention (with Flash Attention), Dense MLP, RMSNorm, no QK layernorm and no Sandwich norm +dense_layer_spec = get_gpt_layer_local_spec( + multi_latent_attention=False, # not enable MLA + normalization="RMSNorm", # use RMSNorm + qk_layernorm=False, # not enable QK layernorm + use_flash_attention=True, # use Flash Attention + sandwich_norm=False # not enable Sandwich Norm +) +``` + +### get_gpt_decoder_block_spec + +#### Functionality + +This function constructs the specification for the entire Transformer decoder block. It supports: + +● Mixed construction of MoE and Dense layers + +● Reading configuration to define layer arrangements + +Based on the `first_k_dense_replace` parameter in the config, it generates a distribution pattern for MoE and Dense layers. It then constructs individual `ModuleSpec`s for each layer (using `get_gpt_layer_local_spec`) and aggregates them into a `layer_specs` list, forming the full network specification. 
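+To make this layer-layout logic concrete, here is a simplified sketch of how such a `layer_specs` list could be assembled. It is not the actual implementation; `config.num_layers`, `config.first_k_dense_replace`, and `config.num_moe_experts` are assumed to be available on the `TransformerConfig` object (the latter two are the fields referenced elsewhere in this tutorial).
+
+```python
+# Simplified sketch of the Dense/MoE layer layout described above (not the actual implementation).
+from mindformers.parallel_core.inference.base_models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
+
+
+def build_layer_specs(config, normalization="RMSNorm"):
+    layer_specs = []
+    for layer_id in range(config.num_layers):
+        # The first `first_k_dense_replace` layers stay Dense; the remaining layers use MoE.
+        is_dense = layer_id < config.first_k_dense_replace
+        layer_specs.append(
+            get_gpt_layer_local_spec(
+                num_experts=None if is_dense else config.num_moe_experts,
+                normalization=normalization,
+            )
+        )
+    return layer_specs
+```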
+ +#### Parameters + +| PARAMETER NAME | TYPE | DESCRIPTION | DEFAULT VALUE | +| -------------- | ----------------- | ------------------------------------------------------------ | ------------- | +| config | TransformerConfig | Model configuration object | - | +| normalization | Optional[str] | Type of normalization to apply | `None` | +| qk_l2_norm | Optional[bool] | Whether to enable L2 normalization for Q/K (currently unsupported) | `None` | + +#### Usage Example + +```python +from mindformers.parallel_core.inference.base_models.gpt.gpt_layer_specs import get_gpt_decoder_block_spec + +# Define a Transformer block spec +transformer_block_spec = get_gpt_decoder_block_spec( + config=config, # Contains structural info; used to build layer specs + normalization="RMSNorm", # Use RMSNorm +) +``` + +## **GPTModel Introduction and Usage + +[`GPTModel`](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/base_models/gpt/gpt_model.py#L37) serves as the unified entry point for MCore LLMs, supporting construction of any Transformer-based architecture. Below is a brief overview of its initialization parameters and usage. + +### Initialization Parameters + +| PARAMETER NAME | TYPE | DESCRIPTION | DEFAULT VALUE | +| ----------------------------------- | ---------------------- | ------------------------------------------------------------ | ------------------------ | +| config | TransformerConfig | Global model config (hidden size, num heads, etc.) | - | +| transformer_layer_spec | ModuleSpec | Module spec defining Transformer layer components (e.g., Attention, MLP) | - | +| vocab_size | int | Vocabulary size, determines embedding and output layer dimensions | - | +| max_sequence_length | int | Maximum sequence length, used for position encoding initialization | - | +| pre_process | bool | Whether to compute the embedding layer | `True` | +| post_process | bool | Whether to compute output logits | `True` | +| fp16_lm_cross_entropy | bool | Whether to use FP16 for cross-entropy (not supported in inference) | `False` | +| parallel_output | bool | Whether to keep output in sharded tensor parallel format | `True` | +| share_embeddings_and_output_weights | bool | Whether to tie embedding and output weights | `False` | +| position_embedding_type | str | Type of position embedding (e.g.,`'rope'`) | `'none'` | +| rotary_percent | float | Proportion of dimensions to apply rotary encoding | `1.0` | +| rotary_base | int | Base value for RoPE | `10000` | +| rope_scaling | bool | Whether to enable RoPE extrapolation (controlled via`position_embedding_type='llama3'`) | `False` | +| seq_len_interpolation_factor | float | Sequence length interpolation factor | `None` | +| mtp_block_spec | ModuleSpec | MTP block specification | `None` | +| model_comm_pgs | ModelCommProcessGroups | Communication group information | `default_model_comm_pgs` | +| quant_config | QuantizationConfig | Quantization configuration | `None` | + +### Usage Example + +```python +# InferModelForCausalLM init() instantiates GPTModel when declaring self.model +self.model = GPTModel( + config=config, # Converted from model config + transformer_layer_spec=get_gpt_layer_local_spec(), # Get transformer_layer spec + vocab_size=152064, # Vocabulary size of model + max_sequence_length=32768, # Maximum sequence length + position_embedding_type='rope', # Use RoPE + rotary_base=10000, # RoPE base + share_embeddings_and_output_weights=False, # not tie embedding and output weights + post_process=True # compute output logits + 
)
+```
+
+## LLM Adaptation Practice
+
+[Qwen2](https://gitee.com/mindspore/mindformers/blob/master/mindformers/models/qwen2/modeling_qwen2_infer.py#L36), [Qwen3](https://gitee.com/mindspore/mindformers/blob/master/mindformers/models/qwen3/modeling_qwen3_infer.py#L36), and [Qwen3Moe](https://gitee.com/mindspore/mindformers/blob/master/mindformers/models/qwen3_moe/modeling_qwen3_moe_infer.py#L37) have already been adapted in MindSpore Transformers. Refer to the corresponding `modeling_*_infer.py` files for the full implementations.
+
+### Qwen2 Adaptation
+
+In `modeling_qwen2_infer.py`, the `InferenceQwen2ForCausalLM` class is defined as:
+
+```python
+@MindFormerRegister.register(MindFormerModuleType.MODELS)
+# The inference MCore model must inherit InferModelMixin and the corresponding PreTrainedModel
+class InferenceQwen2ForCausalLM(Qwen3PreTrainedModel, InferModelMixin):
+    def __init__(self, config):
+        super().__init__(config, auto_prefix=False)
+        self.config = config
+        # Convert the model config into a TransformerConfig
+        config: TransformerConfig = convert_to_transformer_config(self.config)
+
+        # Some steps omitted
+
+        # Create a GPTModel with the Qwen2 model structure
+        self.model = GPTModel(
+            config=config,
+            # Key step: build the Qwen2 Transformer spec
+            transformer_layer_spec=get_gpt_layer_local_spec(),
+            vocab_size=self.vocab_size,
+            max_sequence_length=self.max_position_embeddings,
+            position_embedding_type=config.position_embedding_type,
+            rotary_base=self.config.rope_theta,
+            share_embeddings_and_output_weights=self.config.tie_word_embeddings,
+            pre_process=self.config.pre_process,
+            post_process=self.config.post_process,
+            model_comm_pgs=self.model_comm_pgs
+        )
+```
+
+Structural analysis of Qwen2 shows: a standard self-attention module (`multi_latent_attention=False`), a dense feed-forward MLP (`num_experts=None`), RMSNorm normalization (`normalization="RMSNorm"`), no QK LayerNorm and no Sandwich Norm (`qk_layernorm=False`, `sandwich_norm=False`), with an identical structure across all Transformer layers. The Qwen2 spec is therefore built with `get_gpt_layer_local_spec()`:
+
+```python
+transformer_layer_spec = get_gpt_layer_local_spec(
+    # multi_latent_attention defaults to False
+    # num_experts defaults to None
+    # sandwich_norm defaults to False
+    normalization=config.normalization,  # normalization is configured as "RMSNorm"
+    qk_layernorm=False
+)
+```
+
+### Qwen3 Adaptation
+
+Compared to Qwen2, Qwen3 introduces **QK LayerNorm**. The spec is adapted as:
+
+```python
+transformer_layer_spec = get_gpt_layer_local_spec(
+    # multi_latent_attention defaults to False
+    # num_experts defaults to None
+    # sandwich_norm defaults to False
+    normalization=config.normalization,  # normalization is configured as "RMSNorm"
+    qk_layernorm=True  # Different from Qwen2, enabled in Qwen3
+)
+```
+
+### Qwen3MoE Adaptation
+
+Qwen3MoE replaces the dense MLP with an **MoE module**.
The spec must include the number of experts: + +```python +transformer_layer_spec = get_gpt_layer_local_spec( + # multi_latent_attention defaults to False + # sandwich_norm defaults to False + num_experts=config.num_moe_experts, # Different from Qwen2, Qwen3Moe needs to set the number of experts + normalization=config.normalization, # Normalization is configured as "RMSNorm" + qk_layernorm=True # enabled in Qwen3MoE +) +``` + +## Transformer Modules + +After adapting the overall architecture, understanding and correctly using the core modules in the MindFormers MCore inference architecture is essential for ensuring full model functionality. These modules form the building blocks of Transformer models. MCore provides standardized, high-performance interfaces that align with mainstream inference frameworks like vLLM. + +Below is a detailed introduction to the key modules, their functions, correspondence with vLLM, and usage examples. Developers can reuse these modules directly, avoiding reinvention and significantly accelerating development. + +### Module Overview + +The following table lists the main core modules and their core functions in MindSpore Transformers Mcore architecture: + +| MODULE TYPE | MCORE MODULE NAME | CORE FUNCTIONALITY | +| ------------------ | ------------------------------ | ------------------------------------------------------------ | +| Normalization | LayerNorm,RMSNorm | Normalize input features to stabilize inference | +| Activation | SiLU | Introduce non-linearity to enhance model expressiveness | +| RoPE | RotaryEmbedding, etc. | Generate rotary position encodings for sequence order awareness | +| Attention | SelfAttention,MLASelfAttention | Compute internal sequence correlations | +| Feed-Forward | MLP | Non-linear transformation after attention | +| Mixture of Experts | MoELayer | Dynamic expert routing to scale model capacity efficiently | + +### Norm Module + +#### Overview + +Normalization layers stabilize training and inference by normalizing feature distributions. + +MCore supports two norm interfaces: [LayerNorm](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/transformer/norm.py#L26) and [RMSNorm](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/transformer/norm.py#L73). The norm type is controlled via the `normalization` field in config. 
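+The table below lists the related configuration items. For reference, the standard RMSNorm computation, and the role `layernorm_epsilon` plays in it, can be summarized with a small NumPy sketch (this is the textbook definition, not the MCore kernel):
+
+```python
+import numpy as np
+
+def rms_norm(x, weight, eps=1e-5):
+    # Normalize by the root mean square over the last dimension, then apply a learned scale.
+    rms = np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + eps)
+    return x / rms * weight
+```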
+ +| CONFIG NAME | DEFAULT | MEANING | +| ----------------- | ----------- | ------------------------------------ | +| normalization | "LayerNorm" | Norm type (`LayerNorm` or `RMSNorm`) | +| layernorm_epsilon | 1e-5 | Small value for numerical stability | + +#### vLLM Comparison + +The correspondence between the module interface and vLLM is shown in the following table (functional accuracy is aligned with the vLLM interface) + +| | Mcore | vLLM([v0.8.4](https://github.com/vllm-project/vllm/tree/v0.8.3)) | +| ------------------- | --------- | ------------------------------------------------------------ | +| Layer Normalization | LayerNorm | [LayerNorm](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/models/commandr.py#L70) | +| RMS Normalization | RMSNorm | [RMSNorm](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/layers/layernorm.py#L82) | + +#### Usage Example + +```python +from mindformers.parallel_core.inference.transformer.norm import get_norm_cls +from mindformers.parallel_core.utils.spec_utils import build_module + +# LayerNorm instantiation +layernorm_cls = get_norm_cls("LayerNorm") +layernorm = build_module( + layernorm_cls, + config=config, # TransformerConfig configuration class object + hidden_size=config.hidden_size, + eps=config.layernorm_epsilon +) +# RMSNorm instantiation +rmsnorm_cls = get_norm_cls("RMSNorm") +rmsnorm = build_module( + rmsnorm_cls, + config=config, # TransformerConfig configuration class object + hidden_size=config.hidden_size, + eps=config.layernorm_epsilon +) +``` + +### Activation Module + +#### Overview + +Activation functions introduce non-linearity, crucial for model expressiveness. + +MCore currently supports only [SiLU](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/transformer/activation.py#L23), but extensible via the `hidden_act` config - Support future type expansion. + +| CONFIG NAME | DEFAULT | MEANING | +| ----------- | ------- | ------------------- | +| hidden_act | "gelu" | Activation function | + +#### vLLM Comparison + +The correspondence between the module interface and vLLM is shown in the following table (after superimposing Matmul, the functional accuracy is aligned with the vLLM combination interface) + +| | Mcore | vLLM([v0.8.4](https://github.com/vllm-project/vllm/tree/v0.8.3)) | +| ------------- | ----- | ------------------------------------------------------------ | +| σ Linear Unit | SiLU | [SiLUAndMul](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/layers/activation.py#L55) | + +#### Usage Example + +```python +from mindformers.parallel_core.inference.transformer.activation import get_act_func + +# SiLU instantiation +silu = get_act_func("silu") +``` + +### RoPE Module + +#### Overview + +Rotary Position Embedding (RoPE) encodes positional information via rotation matrices, enabling relative position awareness. + +Supported interfaces: + +● [RotaryEmbedding](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/base_models/common/embeddings/rotary_pos_embedding.py#L36) + +● [Llama3RotaryEmbedding](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/base_models/common/embeddings/rotary_pos_embedding.py#L173) + +● [YaRNScalingRotaryEmbedding](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/base_models/common/embeddings/yarn_rotary_pos_embedding.py#L30) + +Controlled by `position_embedding_type`. 
+ +| CONFIG NAME | DEFAULT | MEANING | +| ----------------------- | ------- | --------------------------------- | +| kv_channels | None | Key/Value channel dimension | +| rotary_percent | 1.0 | Proportion of dimensions for RoPE | +| rotary_base | 10000 | Base frequency for rotation | +| rotary_cos_format | 0 | ApplyRotaryPosEmb大算子Format | +| rotary_dtype | fp16 | Data type of rotation matrix | +| max_position_embeddings | 4096 | Maximum sequence length | + +#### vLLM Comparison + +The correspondence between the module interface and vLLM is shown in the following table (functional accuracy is aligned with the vLLM interface) + +| | Mcore | vLLM([v0.8.4](https://github.com/vllm-project/vllm/tree/v0.8.3)) | +| -------------------- | -------------------------- | ------------------------------------------------------------ | +| Basic RoPE | RotaryEmbedding | [RotaryEmbedding](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/layers/rotary_embedding.py#L79) | +| Llama3 Extrapolation | Llama3RotaryEmbedding | [Llama3RotaryEmbedding](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/layers/rotary_embedding.py#L808) | +| Yarn Extrapolation | YaRNScalingRotaryEmbedding | [DeepseekScalingRotaryEmbedding](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/layers/rotary_embedding.py#L698) | + +#### Usage Example + +```python +from mindformers.parallel_core.inference.base_models.common.embeddings.rope_utils import get_rope + +rotary_pos_emb = get_rope( + transformer_config, + hidden_dim=hidden_dim, + rotary_percent=rotary_percent, + rotary_base=rotary_base, + rotary_dtype=rotary_dtype, + position_embedding_type="rope", + original_max_position_embeddings=8192, +) +``` + +### Attention Module + +#### Overview + +Attention is the core mechanism for capturing long-range dependencies. + +MCore supports two interfaces: [SelfAttention](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/transformer/attention.py#L283)和[MLASelfAttention](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/transformer/multi_latent_attention.py#L249), `SelfAttention` supports MHA and GQA variants and can be used as the Attention module of Llama and Qwen models; `MLASelfAttention` supports low-rank compression and can be used as the Attention module of DeepSeek models. + +Both support bias via `add_qkv_bias` and `add_bias_linear`. 
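+As background for `num_query_groups` in the `SelfAttention` table below: in grouped-query attention (GQA) the query heads are divided into groups that share key/value heads, and plain multi-head attention is the special case where the group count equals the head count. A quick illustration (the head counts are assumptions, not defaults):
+
+```python
+num_attention_heads = 32  # query heads (illustrative value)
+num_query_groups = 8      # key/value head groups; set equal to num_attention_heads for plain MHA
+queries_per_kv_head = num_attention_heads // num_query_groups
+print(queries_per_kv_head)  # 4 -> every key/value head is shared by 4 query heads
+```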
+ +The main configuration parameters of `SelfAttention` are as follows: + +| CONFIG NAME | DEFAULT | MEANING | +| ------------------- | ------- | -------------------------- | +| hidden_size | 0 | Hidden dimension | +| num_attention_heads | 0 | Number of attention heads | +| num_query_groups | None | GQA group count | +| add_qkv_bias | False | Add bias in QKV projection | +| add_bias_linear | True | Add bias in Linear layers | + +The main configuration parameters of `MLASelfAttention` are as follows: + +| CONFIG NAME | DEFAULT | MEANING | +| ------------------- | ------- | ----------------------------- | +| hidden_size | 0 | Hidden dimension | +| num_attention_heads | 0 | Number of attention heads | +| add_qkv_bias | False | Add bias in QKV projection | +| add_bias_linear | True | Add bias in Linear layers | +| q_lora_rank | 512 | LoRA rank for Query | +| kv_lora_rank | 512 | LoRA rank for Key/Value | +| qk_head_dim | 128 | Head dimension for Q/K | +| qk_pos_emb_head_dim | 64 | Position-aware head dimension | +| v_head_dim | 128 | Value head dimension | + +#### vLLM Comparison + +The correspondence between the module interface and vLLM is shown in the following table (functional accuracy is aligned with the vLLM interface) + +| | Mcore | vLLM([v0.8.4](https://github.com/vllm-project/vllm/tree/v0.8.3)) | +|-----------------------------| ---------------- | ------------------------------------------------------------ | +| Self Attention | SelfAttention | [Qwen3Attention](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/models/qwen3.py#L54)[LlamaAttention](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/models/llama.py#L98) | +| Multi-head Latent Attention | MLASelfAttention | [DeepseekV2MLAAttention](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/models/deepseek_v2.py#L189) | + +#### Usage Example + +Build the Mcore Attention module based on `ModuleSpec`. Taking Qwen3 as an example, specify module (using `SeflAttention`) and submodules (the basic interface used by each submodule of SelfAttention), and instantiate the Attention module through the `build_module` method. + +```python +from mindformers.parallel_core.utils.spec_utils import ModuleSpec, build_module +from mindformers.parallel_core.inference.transformer.attention import SelfAttention, SelfAttentionSubmodules + +# Mcore Qwen3Attention ModuleSpec Construction +self_attn_spec = ModuleSpec( + module=SelfAttention, # Qwen3 use SelfAttention + submodules=SelfAttentionSubmodules( + # Interfaces used by each submodule of SelfAttention + core_attention=FlashAttention, + linear_proj=RowParallelLinear, + linear_qkv=QKVParallelLinear, + q_layernorm=get_norm_cls(normalization), + k_layernorm=get_norm_cls(normalization), + ) + ) + +# Mcore Qwen3Attention instantiation +self_attention = build_module( + self_attn_spec, + config=transformer_config, + layer_number=layer_number, + model_comm_pgs=model_comm_pgs, + ) +``` + +### MLP Module + +#### Overview + +The MLP performs non-linear transformation after attention, typically with two linear layers and an activation. + +MCore uses [MLP](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/transformer/mlp.py#L50) for dense models. Bias is controlled via `add_bias_linear`. 
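+For context on `gated_linear_unit` in the table below: when it is enabled, the MLP computes a SwiGLU-style gated transformation in which the first projection produces a gate branch and an up branch, the activation is applied to the gate, and the element-wise product is projected back down. A minimal NumPy sketch with illustrative shapes follows; separate gate/up weights are shown here for clarity, whereas implementations commonly fuse them into a single `linear_fc1` projection.
+
+```python
+import numpy as np
+
+def silu(x):
+    return x / (1.0 + np.exp(-x))
+
+def gated_mlp(x, w_gate, w_up, w_down):
+    # gated_linear_unit=True: act(x @ W_gate) * (x @ W_up), then project back via fc2.
+    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
+
+# Illustrative shapes: hidden_size=4, ffn_hidden_size=8
+x = np.random.randn(2, 4)
+out = gated_mlp(x, np.random.randn(4, 8), np.random.randn(4, 8), np.random.randn(8, 4))
+```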
+ +| CONFIG NAME | DEFAULT | MEANING | +| ----------------- | --------------- | -------------------------- | +| hidden_size | 0 | Hidden dimension | +| ffn_hidden_size | 4 * hidden_size | fc1 linear layer dimension | +| gated_linear_unit | False | Whether to use gated unit | +| add_bias_linear | True | Add bias in Linear layers | +| hidden_act | "gelu" | Activation function | + +#### vLLM Comparison + +The correspondence between the module interface and vLLM is shown in the following table (functional accuracy is aligned with the vLLM interface) + +| | Mcore | vLLM([v0.8.4](https://github.com/vllm-project/vllm/tree/v0.8.3)) | +| -------------------- | ----- | ------------------------------------------------------------ | +| Feed-Forward Network | MLP | [Qwen2MLP](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/models/qwen2.py#L64) | + +#### Usage Example + +```python +from mindformers.parallel_core.utils.spec_utils import ModuleSpec, build_module +from mindformers.parallel_core.inference.transformer.attention import MLP, MLPSubmodules + +# Mcore Qwen3MLP ModuleSpec Construction +mlp_spec = ModuleSpec( + module=MLP, # Qwen3 use MLP + submodules=MLPSubmodules( + # Interfaces used by each submodule of MLP + linear_fc1=ColumnParallelLinear, + linear_fc2=RowParallelLinear, ) + ) + +# Mcore Qwen3MLP instantiation +mlp = build_module( + mlp_spec, + config=transformer_config, + ) +``` + +### MoE Module + +#### Overview + +MoE (Mixture of Experts) dynamically routes inputs to different expert networks, increasing model capacity without proportional compute cost. + +MCore implements [MoELayer](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/transformer/moe/moe_layer.py#L100) with: + +- Dynamic routing (Top-K selection) +- Flexible configuration via `TransformerConfig` + +| 配置名称 | 默认值 | 含义 | +| ------------------------- | --------- | -------------------------------------------------- | +| hidden_size | 0 | Hidden dimension | +| moe_ffn_hidden_size | None | Expert feed-forward dimension | +| num_moe_experts | None | Number of experts | +| moe_router_topk | 2 | Number of experts per token | +| hidden_act | "gelu" | Activation function | +| moe_router_score_function | "softmax" | Routing score function type | +| moe_router_fusion | False | Whether the router module uses the fusion operator | + +#### vLLM Comparison + +The correspondence between the module interface and vLLM is shown in the following table (functional accuracy is aligned with the vLLM interface) + +| | Mcore | vLLM([v0.8.4](https://github.com/vllm-project/vllm/tree/v0.8.3)) | +| ------------------ | -------- | ------------------------------------------------------------ | +| Mixture of Experts | MoELayer | [DeepseekV2MoE](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/models/deepseek_v2.py#L96) | + +#### Usage Example + +```python +from mindformers.parallel_core.utils.spec_utils import ModuleSpec, build_module +from mindformers.parallel_core.inference.transformer.moe.moe_layer import MoELayer, MoESubmodules +from mindformers.parallel_core.inference.transformer.mlp import MLPSubmodules +from mindformers.parallel_core.inference.transformer.moe.experts import GroupedMLP +from mindformers.parallel_core.inference.transformer.moe.shared_experts import SharedExpertMLP + +# Mcore DeepseekV3MoE ModuleSpec Construction +moe_spec = ModuleSpec( + module=MoELayer, # DeepseekV3 starts using MoELayer from the 4th layer + submodules=MoESubmodules( + # experts building + 
experts=ModuleSpec( + module=GroupedMLP, + submodules=None, + ), + # shared experts building + shared_experts=ModuleSpec( + module=SharedExpertMLP, + params={"gate": False}, + submodules=MLPSubmodules( + linear_fc1=ColumnParallelLinear, + linear_fc2=RowParallelLinear, + ) + ) + ) + ) + +# Mcore DeepseekV3MoE instantiation +moe = build_module( + moe_spec, + config=transformer_config, + ) +``` + diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/images/mcore_infer_migration_process.png b/docs/mindformers/docs/source_zh_cn/advanced_development/images/mcore_infer_migration_process.png new file mode 100644 index 0000000000000000000000000000000000000000..56effff344a9c6989094ad44a680c5e1c570a504 Binary files /dev/null and b/docs/mindformers/docs/source_zh_cn/advanced_development/images/mcore_infer_migration_process.png differ diff --git a/docs/mindformers/docs/source_zh_cn/advanced_development/infer_llm_migration.md b/docs/mindformers/docs/source_zh_cn/advanced_development/infer_llm_migration.md new file mode 100644 index 0000000000000000000000000000000000000000..f1111c6a0790e3d70ed77a05dd2587dcb2449588 --- /dev/null +++ b/docs/mindformers/docs/source_zh_cn/advanced_development/infer_llm_migration.md @@ -0,0 +1,565 @@ +# LLM推理模型搭建教程 + +[![View Source On Gitee](https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/website-images/master/resource/_static/logo_source_en.svg)](https://gitee.com/mindspore/docs/blob/master/docs/mindformers/docs/source_zh_cn/advanced_development/infer_llm_migration.md) + +## 概述 + +本教程旨在指导开发者完成基于Mcore架构的推理模型的适配工作。 + +适配的核心在于掌握模型的模块组成及其搭建方式,具体为Transformer结构模型类[GPTModel](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/base_models/gpt/gpt_model.py#L37)使用(LLM模型统一入口)和类Megatron模型搭建机制ModuleSpec。 + +## 适配流程 + +Mcore推理模型适配的核心流程如下: + +1.调用get_spec方法构建transformer_layer_spec + +2.使用该配置及其他关键参数初始化GPTModel 实例 + +3.创建InferModelForCausalLM类实例 + +4.将GPTModel对象赋值给InferForCausalLM的model成员变量 + +对应的流程图如下: + +![MCore_infer_building_process](./images/mcore_infer_migration_process.png) + +## 模型Spec构建说明 + +Transformer结构的大模型在模型搭建上具有相似性,Mcore推理通过以下两个核心函数实现了对多种 Transformer 类大模型(如 Qwen、DeepSeek)的泛化构建: + + ● [get_gpt_layer_local_spec](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/base_models/gpt/gpt_layer_specs.py#L50):构建单个 Transformer 层的模块规范,适用于每层结构相同的模型搭建(如Qwen3) + + ● [get_gpt_decoder_block_spec](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/base_models/gpt/gpt_layer_specs.py#L141):构建整个解码器块的模块规范,即多层Transformer模块,支持MoE、Dense混合层构建,适用于存在不同Transformer层结构的模型搭建(如DeepSeekV3) + +通过灵活的参数配置适配多种模型结构,实现类Transformer模型的高效搭建。下面详细介绍这两个核心函数的使用说明。 + +### get_gpt_layer_local_spec + +#### 功能说明 + +该函数用于定义一个 Transformer 层的模块规范(ModuleSpec),支持多种结构特性,如: + +● MoE(Mixture of Experts) + +● Multi Latent Attention + +● Q/K LayerNorm + +● Flash Attention + +● Sandwich Norm + +#### 参数说明 + +| 参数名 | 类型 | 描述 | 默认值 | +| ---------------------- | ------------- | ------------------------------------------------- | ------ | +| num_experts | Optional[int] | MoE 中专家数量,为None表示 Dense 层 | None | +| moe_grouped_gemm | bool | 是否使用 Grouped GEMM 加速 MoE 计算(仅支持开启) | True | +| qk_layernorm | bool | 是否对 Query 和 Key 做 归一化 | False | +| gated_linear_unit | bool | 是否开启门控线性单元 | True | +| multi_latent_attention | bool | 是否使用 Multi Latent Attention(MLA)模块 | False | +| normalization | Optional[str] | 使用的归一化方式(如"RMSNorm"、"LayerNorm") | None | +| qk_l2_norm | bool | 是否使用 L2 归一化(目前不支持) | False | +| use_flash_attention 
| bool | 是否启用 Flash Attention(默认启用) | True | +| sandwich_norm | bool | 是否启用 Sandwich Norm(GLM模型特有结构) | False | +| use_alltoall | bool | 是否使用AllToAll通信 | False | +| use_fused_mla | bool | 是否使用融合mla算子(量化) | False | + +#### 使用实例 + +```python +from mindformers.parallel_core.inference.base_models.gpt.gpt_layer_specs import get_gpt_layer_local_spec + +# 定义一个模块规范为: SelfAttention(使用FA)、MLP(稠密)、RMSNorm、无qk norm和sandwich norm的单层Transformer结构 +dense_layer_spec = get_gpt_layer_local_spec( + multi_latent_attention=False, # 不使用MLA + normalization="RMSNorm", # 归一化使用RMSNorm + qk_layernorm=False, # 不开启QK norm + use_flash_attention=True, # 使用FA加速推理 + sandwich_norm=False # 不开启sandwich norm +) +``` + +### get_gpt_decoder_block_spec + +#### 功能说明 + +该函数用于构建整个 Transformer 解码器块(Decoder Block)的结构。支持: + +● MoE和Dense层混合搭建 + +● 读取配置信息定义层间排布 + +根据配置中first_k_dense_replace参数,生成MoE层和Dense层的分布模式,按照分布模式逐层构建MoE层和Dense层的ModuleSpec(单层Transformer构建的方式参考get_gpt_layer_local_spec方法的使用说明),并将每一层的spec组合成layer_specs列表,从而得到模型整网的spec结构。 + +#### 参数说明 + +| 参数名 | 类型 | 描述 | 默认值 | +| ------------- | ----------------- | --------------------------------------- | ------ | +| config | TransformerConfig | 模型配置 | - | +| normalization | Optional[str] | 指定归一化类型 | None | +| qk_l2_norm | Optional[bool] | 是否启用 Q/K 的 L2 归一化(目前不支持) | None | + +#### 使用实例 + +```python +from mindformers.parallel_core.inference.base_models.gpt.gpt_layer_specs import get_gpt_decoder_block_spec + +# 定义一个TransformerBlock模块spec +transformer_block_spec = get_gpt_decoder_block_spec( + config=config, # transformer_config中包含模型结构的相关信息,通过读取相应配置调用get_gpt_layer_local_spec()逐层构建 + normalization="RMSNorm", # 归一化使用RMSNorm +) +``` + +## **GPTModel介绍及使用** + +[GPTModel](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/base_models/gpt/gpt_model.py#L37)作为Mcore大语言模型统一入口,支持Transformer结构的模型构建能力,下面简要介绍其init入参和调用示例。 + +### init入参说明 + +GPTModel实例化输入参数表如下所示,具体说明可参考相关文档注释 + +| 参数名 | 类型 | 描述 | 默认值 | +| ----------------------------------- | ---------------------- | ------------------------------------------------------------ | ---------------------- | +| config | TransformerConfig | 模型全局配置对象,包含隐藏层大小、注意力头数等核心参数 | - | +| transformer_layer_spec | ModuleSpec | 定义Transformer层结构的模块规范,指定SelfAttention/MLP等组件 | - | +| vocab_size | int | 词汇表大小,决定嵌入层和输出层的维度 | - | +| max_sequence_length | int | 序列最大长度,用于位置编码的初始化 | - | +| pre_process | bool | 是否执行嵌入层计算 | True | +| post_process | bool | 是否计算输出层logits | True | +| fp16_lm_cross_entropy | bool | 是否使用FP16计算交叉熵(推理不支持) | False | +| parallel_output | bool | 是否分散输出(保持张量并行分片) | True | +| share_embeddings_and_output_weights | bool | 是否共享嵌入层与输出层权重 | False | +| position_embedding_type | str | 位置编码类型 | 'none' | +| rotary_percent | float | 应用旋转编码的维度比例 | 1.0 | +| rotary_base | int | RoPE基础值 | 10000 | +| rope_scaling | bool | 是否启用RoPE外推算法(通过position_embedding_type='llama3'控制) | False | +| seq_len_interpolation_factor | float | 序列长度插值因子 | None | +| mtp_block_spec | ModuleSpec | MTP块规范 | None | +| model_comm_pgs | ModelCommProcessGroups | 通信域信息 | default_model_comm_pgs | +| quant_config | QuantizationConfig | 量化配置 | None | + +### 调用示例 + +```python +# InferModelForCausalLM init()声明self.model时实例化GPTModel +self.model = GPTModel( + config=config, # 读取模型配置转换得到transformer_config + transformer_layer_spec=get_gpt_layer_local_spec(), # 模型TransformerLayerSpec配置 + vocab_size=152064, # 模型词表大小 + max_sequence_length=32768, # 序列最大长度 + position_embedding_type='rope', # 使用基础RoPE位置编码 + rotary_base=10000, # RoPE基础值 + share_embeddings_and_output_weights=False, # 
不共享嵌入层与输出层权重 + post_process=True # 开启后处理流程计算logits + ) +``` + +## LLM模型适配实践 + +[Qwen2](https://gitee.com/mindspore/mindformers/blob/master/mindformers/models/qwen2/modeling_qwen2_infer.py#L36)、[Qwen3](https://gitee.com/mindspore/mindformers/blob/master/mindformers/models/qwen3/modeling_qwen3_infer.py#L36)、[Qwen3Moe](https://gitee.com/mindspore/mindformers/blob/master/mindformers/models/qwen3_moe/modeling_qwen3_moe_infer.py#L37)目前已适配上库,可参考对应modeling_infer.py代码实现 + +### Qwen2适配 + +modeling_qwen2_infer.py中InferenceQwen2ForCausalLM实现代码如下: + +```python +@MindFormerRegister.register(MindFormerModuleType.MODELS) +class InferenceQwen2ForCausalLM(Qwen3PreTrainedModel, InferModelMixin): # 推理Mcore模型需继承InferModelMixin和对应的PreTrainedModel + # init初始化 + def __init__(self, config): + super().__init__(config, auto_prefix=False) + self.config = config + # 调用转换方法将模型Config转换为TransformerConfig + config: TransformerConfig = convert_to_transformer_config(self.config) + + # 省略部分步骤 + + # 实例化Qwen2模型结构的GPTModel + self.model = GPTModel( + config=config, + # 重点:构建Qwen2 Transformer Spec + transformer_layer_spec=get_gpt_layer_local_spec(), + vocab_size=self.vocab_size, + max_sequence_length=self.max_position_embeddings, + position_embedding_type=config.position_embedding_type, + rotary_base=self.config.rope_theta, + share_embeddings_and_output_weights=self.config.tie_word_embeddings, + pre_process=self.config.pre_process, + post_process=self.config.post_process, + model_comm_pgs=self.model_comm_pgs + ) +``` + +Qwen2模型结构分析结果为:自注意力模块(multi_latent_attention=False),前馈网络MLP模块(num_experts=None),RMSNorm模块(normalization="RMSNorm")、无qk norm和sandwich_norm(qk_layernorm=False && sandwich_norm=False),且所有Transformer层结构一致,故通过get_gpt_layer_local_spec()方法实现Qwen2的Spec搭建,实现代码如下: + +```python +transformer_layer_spec = get_gpt_layer_local_spec( + # multi_latent_attention参数默认为False + # num_experts参数默认为None + # sandwich_norm参数默认为False + normalization=config.normalization, # normalization配置为"RMSNorm" + qk_layernorm=False +) +``` + +### Qwen3适配 + +Qwen3对比Qwen2,模型结构的区别有且仅有一处:Qwen3具有QK Norm结构,故Qwen3的Spec搭建只需要打开qk_layernorm开关即可,实现代码如下: + +```python +transformer_layer_spec = get_gpt_layer_local_spec( + # multi_latent_attention参数默认为False + # num_experts参数默认为None + # sandwich_norm参数默认为False + normalization=config.normalization, # normalization配置为"RMSNorm" + qk_layernorm=True # 区别于Qwen2,Qwen3 qk_layernorm设为True +) +``` + +### Qwen3Moe适配 + +Qwen3Moe对比Qwen3稠密模型,模型结构的区别在于将MLP模块替换为MoE模块,故Qwen3 Moe的Spec搭建需要传入专家数参数num_experts,实现代码如下: + +```python +transformer_layer_spec = get_gpt_layer_local_spec( + # multi_latent_attention参数默认为False + # sandwich_norm参数默认为False + num_experts=config.num_moe_experts, # 区别于Qwen3稠密,Qwen3Moe需设置专家数参数 + normalization=config.normalization, # normalization配置为"RMSNorm" + qk_layernorm=True # 区别于Qwen2,Qwen3 qk_layernorm设为True +) +``` + +## Transformer模块介绍 + +在完成模型的整体架构适配后,理解并正确使用Mindformers Mcore推理框架中的核心模块,是确保模型功能完整的关键。这些核心模块是构成Transformer模型的基础组件,Mcore通过提供标准化、高可用、高性能的接口,实现了与主流推理框架vLLM的功能对齐。 + +本节将详细介绍Mindformers Mcore推理中涉及的几大核心模块,包括其功能、与vLLM的对应关系以及调用示例。开发者在进行模型适配时,可以直接复用这些模块,无需从零实现,从而大幅提升开发效率。 + +### **模块概览** + +下表列出了Mindformers Mcore推理框架中主要的核心模块及其核心功能: + +| 模块类型 | MindformersMcore模块名称 | 核心功能 | +| ---------------------- | ------------------------------ | ------------------------------------------------------------ | +| 归一化 (Norm) | LayerNorm,RMSNorm | 对输入特征进行归一化处理,稳定模型训练和推理过程。 | +| 激活函数 (Activation) | SiLU | 在神经网络层之间引入非线性,增强模型的表达能力。 | +| 位置编码 (RoPE) | RotaryEmbedding等 | 为输入序列中的每个位置生成旋转位置编码,使模型能够感知序列的顺序信息。 | +| 注意力机制 (Attention) | 
SelfAttention,MLASelfAttention | 计算序列内部各位置间的相关性,是模型捕捉长距离依赖的核心。 | +| 前馈网络 (MLP) | MLP | Transformer层中的前馈神经网络,负责对注意力层的输出进行非线性变换。 | +| 专家混合 (MoE) | MoELayer | 实现专家混合(Mixture of Experts)架构,通过路由机制在不同专家网络间进行选择,提升模型容量和效率。 | + +### Norm模块 + +#### 介绍 + +Norm模块(归一化层)核心作用是规范化输入的特征分布 。 + +Mcore当前支持[LayerNorm](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/transformer/norm.py#L26)和[RMSNorm](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/transformer/norm.py#L73)两种归一化接口作为Norm模块的调用接口,可通过在模型配置类中指定normalization参数来控制全局Norm模块的接口类型 + +主要参数配置如下: + +| 配置名称 | 默认值 | 含义 | +| ----------------- | ----------- | ------------------------------------ | +| normalization | "LayerNorm" | Norm类型,可选"LayerNorm"和"RMSNorm" | +| layernorm_epsilon | 1e-5 | eps value | + +#### 对比vLLM + +模块接口与vLLM的对应关系如下表(功能精度与vLLM接口对齐) + +| | Mcore | vLLM([v0.8.4](https://github.com/vllm-project/vllm/tree/v0.8.3)) | +| ------------ | --------- | ------------------------------------------------------------ | +| 层归一化 | LayerNorm | [LayerNorm](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/models/commandr.py#L70) | +| 均方根归一化 | RMSNorm | [RMSNorm](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/layers/layernorm.py#L82) | + +#### 使用示例 + +```python +from mindformers.parallel_core.inference.transformer.norm import get_norm_cls +from mindformers.parallel_core.utils.spec_utils import build_module + +# LayerNorm 调用 +layernorm_cls = get_norm_cls("LayerNorm") +layernorm = build_module( + layernorm_cls, + config=config, # TransformerConfig配置类对象 + hidden_size=config.hidden_size, + eps=config.layernorm_epsilon +) +# RMSNorm 调用 +rmsnorm_cls = get_norm_cls("RMSNorm") +rmsnorm = build_module( + rmsnorm_cls, + config=config, # TransformerConfig配置类对象 + hidden_size=config.hidden_size, + eps=config.layernorm_epsilon +) +``` + +### Activation模块 + +#### 介绍 + +Activation模块(激活函数)是神经网络中引入非线性特性的核心组件,直接影响模型的表达能力和训练效率。 + +Mcore当前仅支持[SiLU](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/transformer/activation.py#L23)激活函数接口,为保证可扩展性,仍可通过在模型配置类中指定hidden_act参数来控制全局Activation模块的接口类型 + +主要配置参数如下: + +| 配置名称 | 默认值 | 含义 | +| ---------- | ------ | ------------ | +| hidden_act | "gelu" | 激活函数类型 | + +#### 对比vLLM + +模块接口与vLLM的对应关系如下表(在叠加Matmul之后功能精度与vLLM组合接口对齐) + +| | Mcore | vLLM([v0.8.4](https://github.com/vllm-project/vllm/tree/v0.8.3)) | +| --------- | ----- | ------------------------------------------------------------ | +| σ线性单元 | SiLU | [SiLUAndMul](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/layers/activation.py#L55) | + +#### 使用示例 + +```python +from mindformers.parallel_core.inference.transformer.activation import get_act_func + +# SiLU 实例化 +silu = get_act_func("silu") +``` + +### RoPE模块 + +#### 介绍 + +RoPE(Rotary Position Embedding,旋转位置编码)是一种基于绝对位置编码的改进方法,通过将位置信息编码为旋转矩阵,使模型能够感知序列中元素的相对位置关系。 + +Mcore当前支持[RotaryEmbedding](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/base_models/common/embeddings/rotary_pos_embedding.py#L36)、[Llama3RotaryEmbedding](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/base_models/common/embeddings/rotary_pos_embedding.py#L173)、[YaRNScalingRotaryEmbedding](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/base_models/common/embeddings/yarn_rotary_pos_embedding.py#L30)三种旋转位置编码接口作为RoPE模块的调用接口,可通过在模型配置中position_embedding_type参数来控制RoPE模块的接口类型 + +主要参数配置如下: + 
+| 配置名称 | 默认值 | 含义 | +| ----------------------- | ------ | ----------------------------- | +| kv_channels | None | Key 和 Value 的通道数 | +| rotary_percent | 1.0 | 旋转编码维度比例 | +| rotary_base | 10000 | 旋转周期 | +| rotary_cos_format | 0 | ApplyRotaryPosEmb大算子Format | +| rotary_dtype | fp16 | 旋转矩阵的数据类型 | +| max_position_embeddings | 4096 | 最大位置嵌入长度 | + +#### 对比vLLM + +模块接口与vLLM的对应关系如下表(功能精度与vLLM接口对齐) + +| | Mcore | vLLM([v0.8.4](https://github.com/vllm-project/vllm/tree/v0.8.3)) | +| -------------- | -------------------------- | ------------------------------------------------------------ | +| 基础RoPE | RotaryEmbedding | [RotaryEmbedding](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/layers/rotary_embedding.py#L79) | +| Llama3外推RoPE | Llama3RotaryEmbedding | [Llama3RotaryEmbedding](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/layers/rotary_embedding.py#L808) | +| Yarn外推RoPE | YaRNScalingRotaryEmbedding | [DeepseekScalingRotaryEmbedding](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/layers/rotary_embedding.py#L698) | + +#### 使用示例 + +```python +from mindformers.parallel_core.inference.base_models.common.embeddings.rope_utils import get_rope + +rotary_pos_emb = get_rope( + transformer_config, + hidden_dim=hidden_dim, + rotary_percent=rotary_percent, + rotary_base=rotary_base, + rotary_dtype=rotary_dtype, + position_embedding_type="rope", + original_max_position_embeddings=8192, +) +``` + +### Attention模块 + +#### 介绍 + +Attention模块 (注意力机制)是大语言模型的核心组件之一,其核心功能是通过计算序列中所有位置之间的相关性,动态捕捉长距离依赖关系。 + +Mcore当前支持[SelfAttention](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/transformer/attention.py#L283)和[MLASelfAttention](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/transformer/multi_latent_attention.py#L249)作为Attention模块的调用接口,其中SelfAttention支持MHA、GQA变体,可作为类Llama、Qwen模型的Attention模块;MLASelfAttention支持低秩压缩,可作为类DeepSeek模型的Attention模块。接口均支持添加偏置bias,可通过在模型配置类中指定add_qkv_bias和add_bias_linear参数取值来控制是否打开。 + +SelfAttention主要配置参数如下: + +| 配置名称 | 默认值 | 含义 | +| ------------------- | ------ | ---------------------------- | +| hidden_size | 0 | 隐藏层维度 | +| num_attention_heads | 0 | 注意力头数 | +| num_query_groups | None | GQA组数,默认不开启 | +| add_qkv_bias | False | 计算QKV时,是否添加bias | +| add_bias_linear | True | 是否给所有的Linear层添加bias | + +MLASelfAttention主要配置参数如下: + +| 配置名称 | 默认值 | 含义 | +| ------------------- | ------ | ---------------------------------- | +| hidden_size | 0 | 隐藏层维度 | +| num_attention_heads | 0 | 注意力头数 | +| add_qkv_bias | False | 计算QKV时,是否添加bias | +| add_bias_linear | True | 是否给所有的Linear层添加bias | +| q_lora_rank | 512 | Query的压缩维度 | +| kv_lora_rank | 512 | Key和Value的压缩维度 | +| qk_head_dim | 128 | Query和Key的head_dim | +| qk_pos_emb_head_dim | 64 | Query和Key的带有位置信息的head_dim | +| v_head_dim | 128 | Value的head_dim | + +#### 对比vLLM + +模块接口与vLLM的对应关系如下表(功能精度与vLLM接口对齐) + +| | Mcore | vLLM([v0.8.4](https://github.com/vllm-project/vllm/tree/v0.8.3)) | +| -------------- | ---------------- | ------------------------------------------------------------ | +| 自注意力 | SelfAttention | [Qwen3Attention](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/models/qwen3.py#L54)[LlamaAttention](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/models/llama.py#L98) | +| 多头潜在注意力 | MLASelfAttention | [DeepseekV2MLAAttention](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/models/deepseek_v2.py#L189) | + +#### 使用示例 + +基于ModuleSpec搭建(详细说明可参考Mcore 
module spec的搭建原理和教程)Mcore Attention模块,以Qwen3为例,指定module(使用SeflAttention)和submodules(SelfAttention各子模块使用的基础接口),通过build_module方法实例化Attention模块。 + +```python +from mindformers.parallel_core.utils.spec_utils import ModuleSpec, build_module +from mindformers.parallel_core.inference.transformer.attention import SelfAttention, SelfAttentionSubmodules + +# Mcore Qwen3Attention ModuleSpec搭建 +self_attn_spec = ModuleSpec( + module=SelfAttention, # Qwen3使用SelfAttention + submodules=SelfAttentionSubmodules( + # SelfAttention各子模块使用的接口 + core_attention=FlashAttention, + linear_proj=RowParallelLinear, + linear_qkv=QKVParallelLinear, + q_layernorm=get_norm_cls(normalization), + k_layernorm=get_norm_cls(normalization), + ) + ) + +# Mcore Qwen3Attention 实例化 +self_attention = build_module( + self_attn_spec, + config=transformer_config, + layer_number=layer_number, + model_comm_pgs=model_comm_pgs, + ) +``` + +### MLP模块 + +#### 介绍 + +MLP模块(前馈神经网络)是Transformer模型中的核心组件之一,负责对注意力层的输出进行非线性变换,其标准结构由两个线性变换和一个激活函数组成。 + +Mcore实现了[MLP](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/transformer/mlp.py#L50)作为MLP模块的调用接口,适用于稠密模型的搭建。接口支持添加偏置bias,可通过在模型配置类中指定add_bias_linear参数来控制是否打开。 + +主要参数配置如下: + +| 配置名称 | 默认值 | 含义 | +| ----------------- | --------------- | ---------------------------- | +| hidden_size | 0 | 隐藏层维度 | +| ffn_hidden_size | 4 * hidden_size | 第一个Linear层的映射维度 | +| gated_linear_unit | False | 是否使用门控单元 | +| add_bias_linear | True | 是否给所有的Linear层添加bias | +| hidden_act | "gelu" | 激活函数 | + +#### 对比vLLM + +模块接口与vLLM的对应关系如下表(功能精度与vLLM接口对齐) + +| | Mcore | vLLM([v0.8.4](https://github.com/vllm-project/vllm/tree/v0.8.3)) | +| -------- | ----- | ------------------------------------------------------------ | +| 前馈网络 | MLP | [Qwen2MLP](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/models/qwen2.py#L64) | + +#### 使用示例 + +```python +from mindformers.parallel_core.utils.spec_utils import ModuleSpec, build_module +from mindformers.parallel_core.inference.transformer.attention import MLP, MLPSubmodules + +# Mcore Qwen3MLP ModuleSpec搭建 +mlp_spec = ModuleSpec( + module=MLP, # Qwen3稠密使用MLP + submodules=MLPSubmodules( + # MLP各子模块使用的接口 + linear_fc1=ColumnParallelLinear, + linear_fc2=RowParallelLinear, ) + ) + +# Mcore Qwen3MLP 实例化 +mlp = build_module( + mlp_spec, + config=transformer_config, + ) +``` + +### MoE模块 + +#### 介绍 + +MoE(Mixture of Experts,专家混合)模块是一种通过动态选择专家网络(Experts)提升模型容量和效率的架构。其核心思想是为不同输入分配不同的专家网络处理,从而在保持计算资源可控的前提下扩展模型规模。 + +Mcore实现了[MoELayer](https://gitee.com/mindspore/mindformers/blob/master/mindformers/parallel_core/inference/transformer/moe/moe_layer.py#L100)作为MoE模块的调用接口,目前MoE模块支持以下特性: + +● 动态路由机制 :基于输入内容动态选择Top-K专家网络。 + +● 灵活配置 :支持专家数量、激活专家数等,通过TransformerConfig配置项控制 + +主要参数配置如下: + +| 配置名称 | 默认值 | 含义 | +| ------------------- | ------ | -------------------------------- | +| hidden_size | 0 | 隐藏层维度 | +| moe_ffn_hidden_size | None | 每个专家第一个Linear层的映射维度 | +| num_moe_experts | None | 专家数 | +| moe_router_topk | 2 | 每个token选择的专家数 | +| hidden_act | "gelu" | 激活函数 | +|moe_router_score_function|"softmax"|路由得分函数类型 +|moe_router_fusion|False|是否router模块使用融合算子 + +#### 对比vLLM + +模块接口与vLLM的对应关系如下表 + +| | Mcore | vLLM([v0.8.4](https://github.com/vllm-project/vllm/tree/v0.8.3)) | +| -------- | -------- | ------------------------------------------------------------ | +| 前馈网络 | MoELayer | [DeepseekV2MoE](https://github.com/vllm-project/vllm/blob/v0.8.4/vllm/model_executor/models/deepseek_v2.py#L96) | + +#### 使用示例 + +```python +from mindformers.parallel_core.utils.spec_utils 
import ModuleSpec, build_module +from mindformers.parallel_core.inference.transformer.moe.moe_layer import MoELayer, MoESubmodules +from mindformers.parallel_core.inference.transformer.mlp import MLPSubmodules +from mindformers.parallel_core.inference.transformer.moe.experts import GroupedMLP +from mindformers.parallel_core.inference.transformer.moe.shared_experts import SharedExpertMLP + +# Mcore DeepseekV3MoE ModuleSpec搭建 +moe_spec = ModuleSpec( + module=MoELayer, # DeepseekV3使用MoELayer + submodules=MoESubmodules( + # experts路由专家搭建 + experts=ModuleSpec( + module=GroupedMLP, + submodules=None, + ), + # shared experts共享专家搭建 + shared_experts=ModuleSpec( + module=SharedExpertMLP, + params={"gate": False}, + submodules=MLPSubmodules( + linear_fc1=ColumnParallelLinear, + linear_fc2=RowParallelLinear, + ) + ) + ) + ) + +# Mcore DeepseekV3MoE 实例化 +moe = build_module( + moe_spec, + config=transformer_config, + ) +```