Error when deploying DeepSeek-R1-Distill-Qwen-7B with Docker

I. Problem description (with error log context):
NPU information:
npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2                 Version: 24.1.rc2                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 2     910ProB             | OK            | 74.2        58                0    / 0             |
| 0                         | 0000:01:00.0  | 0           2377 / 15038      1    / 32768         |
+===========================+===============+====================================================+
| 5     910ProB             | OK            | 75.0        59                0    / 0             |
| 0                         | 0000:81:00.0  | 0           2354 / 15038      1    / 32768         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
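
When comparing the host hardware against what the image tag targets (the tag used below contains 800I-A2), it can help to pull the chip names out of the table programmatically. The following is a minimal sketch, assuming the `npu-smi info` layout shown above; the helper name `parse_npu_rows` and the regular expression are illustrative and not part of any Ascend tooling.

```python
import re

# Sample rows copied from the npu-smi output above.
SAMPLE = """\
| 2     910ProB             | OK            | 74.2        58                0    / 0             |
| 0                         | 0000:01:00.0  | 0           2377 / 15038      1    / 32768         |
| 5     910ProB             | OK            | 75.0        59                0    / 0             |
| 0                         | 0000:81:00.0  | 0           2354 / 15038      1    / 32768         |
"""

# Matches only the first row of each NPU entry: "| <id>  <name> | <health> |".
# Chip rows have no name in the first column, so they do not match.
NPU_ROW = re.compile(r"^\|\s*(\d+)\s+([^\s|]+)\s*\|\s*(\w+)\s*\|")


def parse_npu_rows(text):
    """Return one dict per NPU with its id, chip name and health status."""
    rows = []
    for line in text.splitlines():
        match = NPU_ROW.match(line)
        if match:
            rows.append({
                "npu": int(match.group(1)),
                "name": match.group(2),
                "health": match.group(3),
            })
    return rows


if __name__ == "__main__":
    for row in parse_npu_rows(SAMPLE):
        print(row)
```

In practice the text would come from running `npu-smi info` on the host rather than from the hard-coded sample.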

I followed the steps in this guide:
https://gitee.com/ascend/ModelZoo-PyTorch/tree/master/MindIE/LLM/DeepSeek/DeepSeek-R1-Distill-Qwen-7B#deepseek-r1-distill-qwen-7b
1. Pull the Docker image:
docker pull --platform=arm64 swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.0.RC2-800I-A2-py311-openeuler24.03-lts
2. Create the container:
docker run -it -d --net=host --shm-size=1g \
    --privileged \
    --name new7b \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device=/dev/devmm_svm \
    -v /home/user/Ascend/driver:/usr/local/Ascend/driver:ro \
    -v /home/user/sbin:/usr/local/sbin:ro \
    -v /home/user/DeepSeek-R1-Distill-Qwen-7B:/usr/local/DeepSeek-R1-Distill-Qwen-7B:ro \
    swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.0.RC2-800I-A2-py311-openeuler24.03-lts \
    bash
3. Enter the container:
docker exec -it new7b bash

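Quantization kept failing, so I skipped it and went straight to the chat test:

4. Run the chat test: first change to the llm_model directory, then source the environment variables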
cd $ATB_SPEED_HOME_PATH
# Source the environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /usr/local/Ascend/atb-models/set_env.sh
source /usr/local/Ascend/mindie/set_env.sh
# Run the chat test
export MINDIE_LOG_TO_STDOUT=1

torchrun --nproc_per_node 1 \
         --master_port 20037 \
         -m examples.run_pa \
         --model_path /usr/local/DeepSeek-R1-Distill-Qwen-7B \
         --max_output_length 20

The console output, including the error, is as follows:
The old environment variable ATB_LOG_TO_STDOUT will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible.
The old environment variable ATB_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible.
The old environment variable MINDIE_LLM_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible.
The old environment variable MINDIE_LLM_PYTHON_LOG_TO_STDOUT will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible.
The old environment variable MINDIE_LLM_LOG_TO_STDOUT will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible.
The old environment variable MINDIE_LLM_PYTHON_LOG_PATH will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_PATH as soon as possible.
The old environment variable ATB_LOG_TO_FILE will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_FILE as soon as possible.
The old environment variable MINDIE_LLM_PYTHON_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible.
The old environment variable MINDIE_LLM_PYTHON_LOG_TO_FILE will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_FILE as soon as possible.
The old environment variable MINDIE_LLM_LOG_TO_FILE will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_FILE as soon as possible.
The old environment variable OCK_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible.
The old environment variable OCK_LOG_TO_STDOUT will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_TO_STDOUT as soon as possible.
[2025-10-24 11:25:19,698] [1106] [281473819319232] [llmmodels] [INFO] [cpu_binding.py-254] : rank_id: 0, device_id: 0, numa_id: 0, shard_devices: [0], cpus: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
[2025-10-24 11:25:19,701] [1106] [281473819319232] [llmmodels] [INFO] [cpu_binding.py-280] : process 1106, new_affinity is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], cpu count 32
[2025-10-24 11:25:20,430] [1106] [281473819319232] [llmmodels] [INFO] [model_runner.py-154] : model_runner.quantize: None, model_runner.kv_quant_type: None, model_runner.fa_quant_type: None, model_runner.dtype: torch.float16
[2025-10-24 11:25:28,965] [1106] [281473819319232] [llmmodels] [INFO] [dist.py-81] : initialize_distributed has been Set
[2025-10-24 11:25:28,966] [1106] [281473819319232] [llmmodels] [INFO] [model_runner.py-176] : init tokenizer done
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/Ascend/atb-models/examples/run_pa.py", line 545, in <module>
    pa_runner = PARunner(**input_dict)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Ascend/atb-models/examples/run_pa.py", line 106, in __init__
    self.model.load_weights(**kw_args)
  File "/usr/local/Ascend/atb-models/atb_llm/runner/model_runner.py", line 203, in load_weights
    self.model = self.model_cls(self.config,
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Ascend/atb-models/atb_llm/models/qwen2/flash_causal_qwen2.py", line 29, in __init__
    super().__init__(config, weights, **kwargs)
  File "/usr/local/Ascend/atb-models/atb_llm/models/base/flash_causal_lm.py", line 109, in __init__
    self.placeholder = torch.zeros(1, dtype=self.dtype, device="npu")
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: call aclnnInplaceZero failed, detail:EZ9999: Inner Error!
EZ9999: [PID: 1106] 2025-10-24-11:25:29.064.499 Parse dynamic kernel config fail.
        TraceBack (most recent call last):
       AclOpKernelInit failed opType
       ZerosLike ADD_TO_LAUNCHER_LIST_AICORE failed.

[ERROR] 2025-10-24-11:25:29 (PID:1106, Device:0, RankID:0) ERR01100 OPS call acl api failed
[2025-10-24 11:25:48,432] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1106) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
examples.run_pa FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-10-24_11:25:48
  host      : user
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1106)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
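
Most of the console output above is deprecation notices, which makes the actual failure easy to miss. The following is a minimal sketch of one way to drop those notices and isolate the first `RuntimeError` block. The function names `strip_deprecation_noise` and `first_error_block` are hypothetical helpers, and the sample text is a shortened copy of the log above.

```python
def strip_deprecation_noise(log_text):
    """Return the log lines with 'will be deprecated on' notices removed."""
    return [line for line in log_text.splitlines()
            if "will be deprecated on" not in line]


def first_error_block(lines, marker="RuntimeError"):
    """Return lines from the first one containing `marker` up to the next blank line."""
    block = []
    for line in lines:
        if block:
            if not line.strip():
                break  # a blank line ends the error block
            block.append(line)
        elif marker in line:
            block.append(line)
    return block


# Shortened excerpt of the console output above.
SAMPLE_LOG = """\
The old environment variable ATB_LOG_LEVEL will be deprecated on 2025/12/31. Please use the new environment variable MINDIE_LOG_LEVEL as soon as possible.
RuntimeError: call aclnnInplaceZero failed, detail:EZ9999: Inner Error!
EZ9999: [PID: 1106] 2025-10-24-11:25:29.064.499 Parse dynamic kernel config fail.
        TraceBack (most recent call last):
       AclOpKernelInit failed opType
       ZerosLike ADD_TO_LAUNCHER_LIST_AICORE failed.

[ERROR] 2025-10-24-11:25:29 (PID:1106, Device:0, RankID:0) ERR01100 OPS call acl api failed
"""

if __name__ == "__main__":
    for line in first_error_block(strip_deprecation_noise(SAMPLE_LOG)):
        print(line)
```

Applied to the full log, this leaves the `aclnnInplaceZero` failure and its `EZ9999` detail lines, which are the part worth quoting when asking for help.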

I'm not sure how to resolve this. Any help would be appreciated 🙏
