spec-500-502-ref performance issue summary && Huawei's follow-up optimization plan
Status: To do
#I45PLQ
yi_jiang (member)
Created on 2021-08-16 10:38
#### 500 performance issue summary

(1) The hottest function in 500 is S_regmatch. The maple build of this function is more than 20% slower than the gcc build, and replacing it with the gcc version noticeably improves overall performance.

User time after replacing the hottest function, S_regmatch, with the gcc version (in seconds):

> [Note] In gcc, S_regmatch is a called_once callee, so at O2 it is inlined by default and the function itself is deleted. To keep the symbol, inlining was turned off with gcc -fno-inline, and the gcc version of S_regmatch was then extracted.

| 500 ref | gcc | maple | maple with gcc S_regmatch | Overall improvement after replacement |
| :-----: | :-----: | :-----: | :------------------------: | :------------: |
| case 1 | 218.499 | 280.514 | 269.002 | 5.3% |
| case 2 | 218.499 | 178.329 | 170.264 | 3.7% |
| case 3 | 218.499 | 204.044 | 196.889 | 3.3% |

The improvement figures equal the time saved divided by the gcc time, for example (280.514 - 269.002) / 218.499 ≈ 5.3% for case 1.

(2) The S_regmatch performance problem is concentrated in RA (register allocation). Taking 500 ref case 1 as an example, the table below compares gcc S_regmatch and maple S_regmatch on total cycles samples, allocated stack frame size, and cycles samples spent on stack operations. maple spills more variables to the stack.

| 500 ref case 1 | gcc S_regmatch | maple S_regmatch | Gap |
| :-----------------: | :------------: | :--------------: | :----: |
| cycles samples | 294063 | 415720 | 41.4% |
| allocated stack frame size (bytes) | 464 | 1232 | 165.5% |
| stack ldr/str samples | 109701 | 195745 | 78.4% |
| share of stack operations | 37.3% | 47.1% | 26.2% |

(3) A store merging issue; see the [issue](https://gitee.com/openarkcompiler/OpenArkCompiler/issues/I43ZJ5).

Details follow.

### 500 ref hot functions

Columns 1 and 2 are the cpu-cycles percent and cache-misses percent.
Columns 3 and 4 are the cpu-cycles samples and cache-misses samples.
Column 5 is the name of the current executable and column 6 is the symbol name.
Only symbols with a cycles percentage above 0.5% are shown.

#### 500 ref case 1

Sampling command:

```sh
perf record -e cpu-cycles,cache-misses -g ./perlbench_r -I./lib checkspam.pl 2500 5 25 11 150 1 1 1 1
```

perf report:

```sh
Samples: 1M of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 717386962088, DSO: perlbench_r
        Overhead         Samples  Command      Symbol
+  37.26%   5.62%  415720   43157  perlbench_r  [.] S_regmatch
##### gcc S_regmatch #####
+  32.77%   4.71%  294063   22615  perlbench_r  [.] S_regmatch
##########################
+  11.14%   1.26%  124221   19733  perlbench_r  [.] S_find_byclass
+   9.54%   3.13%  106545   28123  perlbench_r  [.] Perl_leave_scope
+   4.91%   0.77%   54747    9168  perlbench_r  [.] S_regtry
+   2.84%   8.35%   31996   46504  perlbench_r  [.] Perl_hv_common
+   1.71%   0.55%   19115    4868  perlbench_r  [.] Perl_save_pushptr
+   1.53%   3.04%   17275   15945  perlbench_r  [.] Perl_sv_setsv_flags
+   1.50%   3.30%   16938   18312  perlbench_r  [.] Perl_pp_nextstate
+   1.45%   2.56%   16268   14864  perlbench_r  [.] Perl_regexec_flags
+   1.42%   5.73%   16006   33863  perlbench_r  [.] Perl_pp_entersub
+   1.35%   1.52%   15101   10375  perlbench_r  [.] Perl_fbm_instr
+   1.23%   4.35%   13812   23788  perlbench_r  [.] Perl_pp_multideref
+   1.12%   2.79%   12621   15979  perlbench_r  [.] Perl_pp_match
+   1.01%   2.35%   11312   13897  perlbench_r  [.] Perl_pp_and
+   0.95%   0.09%   10548     758  perlbench_r  [.] Perl_ckwarn
+   0.73%   2.67%    8243   13344  perlbench_r  [.] Perl_pp_padsv
+   0.60%   1.62%    6819    7586  perlbench_r  [.] Perl_sv_clear
+   0.57%   1.40%    6389    7741  perlbench_r  [.] Perl_re_intuit_start
+   0.55%   0.44%    6146    2897  perlbench_r  [.] Perl_pp_iter
+   0.53%   1.23%    6001    6175  perlbench_r  [.] Perl_sv_upgrade
+   0.53%   0.30%    5939    1698  perlbench_r  [.] S_regrepeat
```

#### 500 ref case 2

Sampling command:

```sh
perf record -e cpu-cycles,cache-misses -g ./perlbench_r -I./lib diffmail.pl 4 800 10 17 19 300
```

perf report:

```sh
Samples: 1M of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 445421299381, DSO: perlbench_r
        Overhead         Samples  Command      Symbol
+  18.78%   5.74%  130396   40373  perlbench_r  [.] S_regmatch
##### replaced with gcc S_regmatch #####
+  15.62%   5.29%  103537   35774  perlbench_r  [.] S_regmatch
################################
+   6.04%   5.41%   41928   38142  perlbench_r  [.] Perl_pp_padsv
+   4.14%   0.53%   28786    3703  perlbench_r  [.] Perl_pp_substr
+   4.05%   1.62%   28165   11492  perlbench_r  [.] Perl_leave_scope
+   3.66%   6.10%   25454   42784  perlbench_r  [.] Perl_sv_setsv_flags
+   3.23%   1.09%   22426    7609  perlbench_r  [.] S_regrepeat
+   3.12%   3.71%   21652   26126  perlbench_r  [.] Perl_pp_nextstate
+   3.01%   2.89%   20909   20195  perlbench_r  [.] Perl_pp_and
+   2.18%   1.35%   15164    9522  perlbench_r  [.] Perl_pp_enter
+   2.14%   1.08%   14889    8040  perlbench_r  [.] Perl_sv_setpvn
+   2.13%   5.91%   14812   41528  perlbench_r  [.] Perl_hv_common
+   2.12%   0.71%   14752    4978  perlbench_r  [.] Perl_pp_leave
+   1.80%   0.37%   12513    3988  perlbench_r  [.] Perl_pp_preinc
+   1.66%   1.25%   11526    8877  perlbench_r  [.] Perl_runops_standard
+   1.64%   1.43%   11430    9884  perlbench_r  [.] Perl_sv_upgrade
+   1.58%   0.11%   10948     763  perlbench_r  [.] Perl_pp_ord
+   1.55%   3.73%   10796   26195  perlbench_r  [.] Perl_pp_multideref
+   1.48%   0.29%   10300    2018  perlbench_r  [.] S_setup_EXACTISH_ST_c1_c2
+   1.39%   1.12%    9623    7917  perlbench_r  [.] Perl_pp_const
+   1.33%   0.10%    9247     685  perlbench_r  [.] Perl_translate_substr_offsets
+   1.20%   0.16%    8345    1062  perlbench_r  [.] Perl_pp_eq
+   1.20%   1.39%    8319    8779  perlbench_r  [.] Perl_sv_clear
+   1.11%   0.34%    7739    2365  perlbench_r  [.] Perl_pp_lt
+   1.10%   0.11%    7662     782  perlbench_r  [.] Perl_pp_unstack
+   1.09%   2.93%    7561   20829  perlbench_r  [.] Perl_regexec_flags
+   1.05%   0.70%    7316    4828  perlbench_r  [.] Perl_sv_setiv
+   0.96%   2.56%    6635   17999  perlbench_r  [.] Perl_pp_sassign
+   0.95%   2.83%    6598   20339  perlbench_r  [.] Perl_pp_entersub
+   0.94%   0.29%    6532    2083  perlbench_r  [.] Perl_pp_rv2sv
+   0.85%   1.06%    5901    7537  perlbench_r  [.] S_find_byclass
+   0.67%   0.52%    4642    3636  perlbench_r  [.] Perl_save_strlen
+   0.67%   0.83%    4637    5924  perlbench_r  [.] Perl_push_scope
+   0.59%   0.25%    4079    1780  perlbench_r  [.] S_glob_assign_glob
+   0.58%   1.08%    4031    7654  perlbench_r  [.] Perl_pp_match
+   0.56%   0.91%    3906    6196  perlbench_r  [.] Perl_pp_aassign
+   0.56%   0.31%    3899    2239  perlbench_r  [.] Perl_pop_scope
    0.51%   0.19%    3550    1342  perlbench_r  [.] S_share_hek_flags
```

#### 500 ref case 3

Sampling command:

```sh
perf record -e cpu-cycles,cache-misses -g ./perlbench_r -I./lib splitmail.pl 6400 12 26 16 100 0
```

perf report:

```sh
Samples: 1M of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 525004111325, DSO: perlbench_r
        Overhead         Samples  Command      Symbol
+  16.62%   1.46%  136048    9904  perlbench_r  [.] S_regmatch
##### replaced with gcc S_regmatch #####
+  14.74%   0.91%  116766    6159  perlbench_r  [.] S_regmatch
################################
+  11.01%   8.10%   90158   41508  perlbench_r  [.] Perl_hv_common
+  10.37%   8.89%   84833   72495  perlbench_r  [.] Perl_hv_common
+   9.56%   2.87%   78186   21477  perlbench_r  [.] Perl_pp_multideref
+   7.47%  12.85%   61105  103171  perlbench_r  [.] Perl_regexec_flags
+   4.99%   4.99%   40880   35093  perlbench_r  [.] Perl_leave_scope
+   3.42%   0.32%   27961    2702  perlbench_r  [.] Perl_pp_unstack
+   3.33%   2.24%   27210   18355  perlbench_r  [.] Perl_av_fetch
+   3.13%   0.81%   25635    4816  perlbench_r  [.] Perl_pp_gvsv
+   2.36%   1.30%   19343   11693  perlbench_r  [.] Perl_pp_or
+   1.98%   2.22%   16174   18391  perlbench_r  [.] S_regtry
+   1.93%   0.23%   15778    1870  perlbench_r  [.] Perl_pp_preinc
+   1.89%   9.10%   15470   74589  perlbench_r  [.] S_cleanup_regmatch_info_aux
+   1.76%   6.79%   14509   19775  perlbench_r  [.] Perl_sv_cmp_flags
+   1.28%   1.63%   10469   13453  perlbench_r  [.] Perl_vivify_ref
+   0.99%   1.46%    8119    7799  perlbench_r  [.] Perl_pp_nextstate
+   0.97%   0.01%    7895     103  perlbench_r  [.] Perl_pp_stub
+   0.93%   1.88%    7707   11304  perlbench_r  [.] Perl_sv_setsv_flags
+   0.83%   0.46%    6797    2361  perlbench_r  [.] Perl_runops_standard
+   0.74%   1.23%    6085    5575  perlbench_r  [.] Perl_pp_padsv
+   0.73%   1.87%    6002   15451  perlbench_r  [.] Perl_save_destructor_x
+   0.65%   0.08%    5355     729  perlbench_r  [.] Perl_ckwarn
+   0.61%   1.18%    5022    8563  perlbench_r  [.] Perl_sv_eq_flags
+   0.61%   1.29%    5009    5938  perlbench_r  [.] Perl_hv_iternext_flags
+   0.60%   2.03%    4946    9101  perlbench_r  [.] Perl_newSVhek
+   0.56%   0.21%    4629    1670  perlbench_r  [.] Perl_pp_iter
+   0.56%   0.74%    4561    3200  perlbench_r  [.] Perl_pp_and
+   0.52%   1.43%    4277    4404  perlbench_r  [.] Perl_sortsv_flags
```

### 502 hot functions

#### 502 ref case 1

Sampling command:

```sh
perf record -e cpu-cycles,cache-misses -g ./cpugcc_r gcc-pp.c -O3 -finline-limit=0 -fif-conversion -fif-conversion2 -o gcc-pp.opts-O3_-finline-limit_0_-fif-conversion_-fif-conversion2.s
```

perf report:

```sh
Samples: 625K of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 215393196829, DSO: cpugcc_r
        Overhead         Samples  Command   Symbol
+   2.12%   2.20%    7125    4695  cpugcc_r  [.] df_worklist_dataflow_doublequeue
+   1.86%   2.17%    6302    6310  cpugcc_r  [.] bitmap_set_bit
+   1.42%   1.47%    4806    4281  cpugcc_r  [.] bitmap_bit_p
+   1.42%   2.13%    4795    5699  cpugcc_r  [.] df_note_compute
+   1.34%   2.04%    4526    4327  cpugcc_r  [.] bitmap_ior_into
+   1.01%   2.94%    3408    6348  cpugcc_r  [.] compute_transp
+   0.92%   1.42%    3102    2952  cpugcc_r  [.] bitmap_ior_and_compl
+   0.88%   1.97%    2972    4091  cpugcc_r  [.] bitmap_and
+   0.87%   0.50%    2950    1675  cpugcc_r  [.] htab_find_slot_with_hash
+   0.85%   0.05%    2855     253  cpugcc_r  [.] record_reg_classes
+   0.72%   2.09%    2421    3721  cpugcc_r  [.] bitmap_and_into
+   0.71%   0.08%    2389     337  cpugcc_r  [.] find_reloads
+   0.71%   0.23%    2380     923  cpugcc_r  [.] sorted_array_from_bitmap_set
+   0.70%   0.61%    2421    2223  cpugcc_r  [.] ggc_alloc_stat
+   0.69%   0.24%    2326     838  cpugcc_r  [.] constrain_operands
+   0.67%   1.16%    2265    2783  cpugcc_r  [.] bitmap_copy
+   0.67%   0.97%    2254    2725  cpugcc_r  [.] fast_dce
+   0.66%   0.95%    2211    2334  cpugcc_r  [.] find_reg_note
+   0.64%   0.90%    2154    2723  cpugcc_r  [.] bitmap_clear_bit
+   0.61%   0.60%    2074    1909  cpugcc_r  [.] pool_alloc
+   0.59%   0.46%    2005    1596  cpugcc_r  [.] extract_insn
+   0.59%   0.49%    1998    1317  cpugcc_r  [.] regstat_bb_compute_ri
+   0.59%   0.87%    1984    1770  cpugcc_r  [.] inverted_post_order_compute
+   0.58%   0.74%    1953    1519  cpugcc_r  [.] bitmap_elt_insert_after
+   0.54%   0.89%    1835    2505  cpugcc_r  [.] df_lr_bb_local_compute
+   0.54%   0.34%    1822    1286  cpugcc_r  [.] df_ref_create_structure
```

#### 502 ref case 2

Sampling command:

```sh
perf record -e cpu-cycles,cache-misses -g ./cpugcc_r gcc-pp.c -O2 -finline-limit=36000 -fpic -o gcc-pp.opts-O2_-finline-limit_36000_-fpic.s
```

perf report:

```sh
Samples: 720K of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 242373601238, DSO: cpugcc_r
        Overhead         Samples  Command   Symbol
+   4.26%   7.34%   16053   19824  cpugcc_r  [.] compute_transp
+   2.09%   1.92%    7895    5406  cpugcc_r  [.] df_worklist_dataflow_doublequeue
+   1.74%   1.99%    6616    7195  cpugcc_r  [.] bitmap_set_bit
+   1.61%   2.17%    6074    6043  cpugcc_r  [.] bitmap_ior_into
+   1.29%   1.79%    4884    6110  cpugcc_r  [.] df_note_compute
+   1.28%   1.25%    4864    4440  cpugcc_r  [.] bitmap_bit_p
+   1.04%   1.96%    3950    5176  cpugcc_r  [.] bitmap_and
+   1.04%   1.07%    3903    4045  cpugcc_r  [.] ix86_delegitimize_address
+   1.01%   1.47%    3837    4040  cpugcc_r  [.] bitmap_ior_and_compl
+   0.95%   2.22%    3603    4768  cpugcc_r  [.] bitmap_and_into
+   0.91%   0.07%    3453     426  cpugcc_r  [.] find_reloads
+   0.85%   0.96%    3189    3656  cpugcc_r  [.] delegitimize_mem_from_attrs
+   0.81%   0.86%    3049    2337  cpugcc_r  [.] bitmap_elt_insert_after
+   0.75%   0.04%    2861     245  cpugcc_r  [.] record_reg_classes
+   0.75%   0.36%    2862    1539  cpugcc_r  [.] htab_find_slot_with_hash
+   0.71%   0.54%    2733    2430  cpugcc_r  [.] ggc_alloc_stat
+   0.70%   1.06%    2646    3314  cpugcc_r  [.] bitmap_copy
+   0.67%   0.17%    2562     844  cpugcc_r  [.] constrain_operands
+   0.66%   0.83%    2495    2553  cpugcc_r  [.] find_reg_note
+   0.66%   0.64%    2468    2646  cpugcc_r  [.] find_base_term
+   0.62%   0.13%    2340     487  cpugcc_r  [.] rtx_equal_for_memref_p
+   0.62%   0.41%    2349    1888  cpugcc_r  [.] extract_insn
+   0.61%   0.06%    2305     289  cpugcc_r  [.] ao_ref_from_mem
+   0.59%   2.02%    2235    1944  cpugcc_r  [.] pre_expr_reaches_here_p_work
+   0.59%   0.42%    2229    1514  cpugcc_r  [.] regstat_bb_compute_ri
+   0.58%   0.76%    2218    2722  cpugcc_r  [.] fast_dce
+   0.57%   0.48%    2187    1911  cpugcc_r  [.] pool_alloc
+   0.57%   0.70%    2145    1848  cpugcc_r  [.] inverted_post_order_compute
+   0.56%   0.09%    2118     360  cpugcc_r  [.] get_ref_base_and_extent
+   0.56%   0.79%    2129    2804  cpugcc_r  [.] df_lr_bb_local_compute
+   0.56%   0.18%    2106     665  cpugcc_r  [.] memrefs_conflict_p
+   0.55%   0.69%    2091    2638  cpugcc_r  [.] bitmap_clear_bit
    0.51%   0.31%    1941    1832  cpugcc_r  [.] reload
    0.51%   0.09%    1951     413  cpugcc_r  [.] ix86_decompose_address
    0.51%   0.04%    1923     167  cpugcc_r  [.] get_alias_set
    0.50%   0.45%    1904    1832  cpugcc_r  [.] canon_rtx
```

#### 502 ref case 3

Sampling command:

```sh
perf record -e cpu-cycles,cache-misses -g ./cpugcc_r gcc-smaller.c -O3 -fipa-pta -o gcc-smaller.opts-O3_-fipa-pta.s
```

perf report:

```sh
Samples: 727K of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 241888277315, DSO: cpugcc_r
        Overhead         Samples  Command   Symbol
+  18.62%  18.04%   70999   71934  cpugcc_r  [.] bitmap_ior_into
+   3.91%   2.34%   14901   12303  cpugcc_r  [.] do_complex_constraint
+   2.33%   2.42%    8879    8819  cpugcc_r  [.] bitmap_set_bit
+   1.90%   1.79%    7161    4511  cpugcc_r  [.] df_worklist_dataflow_doublequeue
+   1.23%   1.19%    4644    4173  cpugcc_r  [.] bitmap_bit_p
+   1.04%   2.66%    3936    7043  cpugcc_r  [.] compute_transp
+   1.00%   1.41%    3788    4329  cpugcc_r  [.] df_note_compute
+   0.91%   0.66%    3463    3318  cpugcc_r  [.] find
+   0.85%   1.37%    3210    3331  cpugcc_r  [.] bitmap_ior_and_compl
+   0.81%   1.71%    3054    4181  cpugcc_r  [.] bitmap_and
+   0.73%   0.42%    2786    1667  cpugcc_r  [.] htab_find_slot_with_hash
+   0.72%   0.19%    2744     925  cpugcc_r  [.] sorted_array_from_bitmap_set
+   0.72%   0.87%    2895    2578  cpugcc_r  [.] bitmap_elt_insert_after
+   0.66%   1.85%    2479    3938  cpugcc_r  [.] bitmap_and_into
+   0.60%   0.03%    2274     155  cpugcc_r  [.] record_reg_classes
+   0.59%   0.98%    2254    2773  cpugcc_r  [.] bitmap_copy
+   0.58%   1.05%    2193    2321  cpugcc_r  [.] inverted_post_order_compute
+   0.58%   0.14%    2194     542  cpugcc_r  [.] constrain_operands
+   0.57%   0.05%    2173     228  cpugcc_r  [.] find_reloads
+   0.54%   0.40%    2087    1711  cpugcc_r  [.] ggc_alloc_stat
+   0.53%   0.69%    2024    2276  cpugcc_r  [.] fast_dce
    0.52%   0.74%    1959    2104  cpugcc_r  [.] find_reg_note
```

#### 502 ref case 4

Sampling command:

```sh
perf record -e cpu-cycles,cache-misses -g ./cpugcc_r ref32.c -O5 -o ref32.opts-O5.s
```

perf report:

```sh
Samples: 713K of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 231080095228, DSO: cpugcc_r
        Overhead         Samples  Command   Symbol
+   3.48%  11.50%   12592   12824  cpugcc_r  [.] rtl_split_edge
+   3.30%   2.51%   11951   10183  cpugcc_r  [.] bitmap_set_bit
+   3.03%   2.02%   10912    8214  cpugcc_r  [.] bitmap_bit_p
+   2.79%   2.06%   10061    7459  cpugcc_r  [.] df_worklist_dataflow_doublequeue
+   1.95%   1.79%    7037    7040  cpugcc_r  [.] bitmap_ior_into
+   1.75%   1.63%    6312    5954  cpugcc_r  [.] df_note_compute
+   1.31%   1.78%    4736    5763  cpugcc_r  [.] bitmap_and
+   1.25%   1.32%    4498    5164  cpugcc_r  [.] find_reg_note
+   1.23%   2.13%    4433    5700  cpugcc_r  [.] et_splay
+   1.20%   0.56%    4314    2589  cpugcc_r  [.] vrp_visit_phi_node
+   1.13%   1.22%    4113    4435  cpugcc_r  [.] bitmap_copy
+   1.03%   1.20%    3725    4147  cpugcc_r  [.] bitmap_ior_and_compl
+   1.03%   1.26%    3711    3563  cpugcc_r  [.] inverted_post_order_compute
+   1.01%   0.87%    3684    3026  cpugcc_r  [.] sbitmap_a_or_b
+   0.86%   0.73%    3103    3298  cpugcc_r  [.] fast_dce
+   0.85%   0.81%    3080    2784  cpugcc_r  [.] df_lr_bb_local_compute
+   0.85%   0.60%    3088    2501  cpugcc_r  [.] last_stmt
+   0.82%   1.06%    2975    2946  cpugcc_r  [.] find_unreachable_blocks
+   0.81%   0.97%    2961    2618  cpugcc_r  [.] calc_idoms
+   0.76%   1.05%    2759    3089  cpugcc_r  [.] bitmap_clear
+   0.72%   0.58%    2623    2107  cpugcc_r  [.] pool_alloc
+   0.71%   0.70%    2563    2652  cpugcc_r  [.] gsi_start_phis
+   0.70%   0.78%    2522    2277  cpugcc_r  [.] post_order_compute
+   0.69%   1.33%    2482    3392  cpugcc_r  [.] bitmap_and_into
+   0.67%   0.05%    2411     662  cpugcc_r  [.] record_reg_classes
+   0.66%   0.84%    2409    2409  cpugcc_r  [.] df_live_bb_local_compute
+   0.65%   0.75%    2353    2566  cpugcc_r  [.] remove_unused_locals
+   0.62%   0.79%    2274    2042  cpugcc_r  [.] calc_dfs_tree_nonrec
+   0.57%   0.51%    2079    2139  cpugcc_r  [.] bitmap_clear_bit
+   0.57%   0.14%    2073     717  cpugcc_r  [.] fold_binary_loc
+   0.57%   0.76%    2050    1750  cpugcc_r  [.] flow_loops_find
+   0.57%   0.25%    2051    1464  cpugcc_r  [.] df_ref_create_structure
+   0.55%   0.26%    1988    1220  cpugcc_r  [.] htab_find_slot_with_hash
+   0.54%   0.51%    1948    1761  cpugcc_r  [.] init_alias_analysis
+   0.54%   0.42%    1933    1929  cpugcc_r  [.] regstat_bb_compute_ri
    0.51%   0.50%    1853    1684  cpugcc_r  [.] mark_all_vars_used_1
```

#### 502 ref case 5

Sampling command:

```sh
perf record -e cpu-cycles,cache-misses -g ./cpugcc_r ref32.c -O3 -fselective-scheduling -fselective-scheduling2 -o ref32.opts-O3_-fselective-scheduling_-fselective-scheduling2.s
```

perf report:

```sh
Samples: 878K of events 'cpu-cycles:u, cache-misses:u', Event count (approx.): 261867226907, DSO: cpugcc_r
        Overhead         Samples  Command   Symbol
+   3.22%   1.86%   13193   10837  cpugcc_r  [.] bitmap_set_bit
+   3.06%   7.75%   12508   12685  cpugcc_r  [.] rtl_split_edge
+   2.78%   1.38%   11333    7981  cpugcc_r  [.] bitmap_bit_p
+   2.46%   1.35%   10036    7181  cpugcc_r  [.] df_worklist_dataflow_doublequeue
+   1.90%   1.25%    7770    7175  cpugcc_r  [.] bitmap_ior_into
+   1.55%   1.09%    6329    5771  cpugcc_r  [.] df_note_compute
+   1.27%   0.26%    5180    1235  cpugcc_r  [.] sched_analyze_insn
+   1.16%   1.24%    4718    5824  cpugcc_r  [.] bitmap_and
+   1.12%   0.92%    4549    5434  cpugcc_r  [.] find_reg_note
+   1.11%   1.47%    4525    5725  cpugcc_r  [.] et_splay
+   1.07%   0.88%    4392    4590  cpugcc_r  [.] bitmap_copy
+   0.98%   0.38%    3963    2569  cpugcc_r  [.] vrp_visit_phi_node
+   0.94%   0.90%    3823    4541  cpugcc_r  [.] bitmap_ior_and_compl
+   0.90%   0.85%    3679    3532  cpugcc_r  [.] inverted_post_order_compute
+   0.87%   0.49%    3559    2541  cpugcc_r  [.] pool_alloc
+   0.84%   0.83%    3457    3622  cpugcc_r  [.] bitmap_clear
+   0.81%   0.54%    3336    2717  cpugcc_r  [.] sbitmap_a_or_b
+   0.76%   0.53%    3097    2613  cpugcc_r  [.] df_lr_bb_local_compute
+   0.74%   0.68%    3032    2679  cpugcc_r  [.] calc_idoms
+   0.73%   0.52%    2992    3374  cpugcc_r  [.] fast_dce
+   0.71%   0.72%    2919    2910  cpugcc_r  [.] find_unreachable_blocks
+   0.71%   0.42%    2897    2545  cpugcc_r  [.] last_stmt
+   0.64%   0.41%    2615    3021  cpugcc_r  [.] extract_insn
+   0.62%   0.95%    2530    3621  cpugcc_r  [.] bitmap_and_into
    0.61%   0.53%    2492    2251  cpugcc_r  [.] post_order_compute
    0.60%   0.46%    2473    2581  cpugcc_r  [.] gsi_start_phis
    0.60%   0.03%    2428     496  cpugcc_r  [.] record_reg_classes
    0.60%   0.59%    2452    2474  cpugcc_r  [.] df_live_bb_local_compute
    0.59%   0.52%    2413    2636  cpugcc_r  [.] remove_unused_locals
    0.55%   0.54%    2273    2043  cpugcc_r  [.] calc_dfs_tree_nonrec
    0.55%   0.41%    2230    2506  cpugcc_r  [.] bitmap_clear_bit
    0.52%   0.21%    2118    1693  cpugcc_r  [.] df_ref_create_structure
    0.50%   0.09%    2061     713  cpugcc_r  [.] fold_binary_loc
```

### **Main performance optimization work in progress at HW**

#### 1. Global-scope instruction scheduling before RA

#### 2. Mid-end CFG optimization

 1) For scenarios like the one in this [link](https://gitee.com/openarkcompiler/OpenArkCompiler/issues/I43ZMH), simplify logical operations as far as possible, reduce the number of condition checks, and simplify the CFG

 2) Redundant jump simplification based on value range

#### 3. IPA: enhanced alias analysis or function multi-versioning

#### 4. Enabling sink/hoist optimization

#### 5. Back-end support for SR-add opt [link](https://gitee.com/openarkcompiler/OpenArkCompiler/issues/I45F06)

#### 6. Alignment for loop/label/jump/data

#### 7. strldopt redundancy elimination after RA

### [Appendix] Using perf

Sampling with perf:

```sh
perf record -e cpu-cycles,cache-misses -g [cmd]
```

`-e` specifies the events to sample; here `cpu-cycles` and `cache-misses` are sampled together. `-g` records call-graph information so that the call stack can be inspected in perf report. When sampling finishes, perf.data is written to the current directory. View it with perf report:

```sh
perf report -n --no-children --group -f -i perf.data
```

`-n` shows the number of samples for each symbol. `--no-children` counts only the samples of the caller itself, not those of its callees. `--group` displays all sampled events in perf.data together. `-f` skips the ownership check on the input file. `-i` specifies the input file name.

The interactive perf report interface shows the sample statistics for the hot functions. Typically you select a symbol that belongs to the current DSO and then use the hotkeys `d` and `F`. `d` shows only symbols from the current DSO, so unrelated symbols (such as malloc in libc.so) are hidden. `F` recomputes the sample percentages over the symbols currently displayed. Select a symbol and press `a` to open its assembly code. The assembly is annotated, so you can see how the samples are distributed across the instructions.
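
The gap figures in the S_regmatch comparison table can be re-derived from the raw sample counts that `perf report -n` shows for 500 ref case 1. The sketch below does that arithmetic with `awk`. The helper names `pct_gap` and `pct_share` are made up for this illustration and are not part of perf or OpenArkCompiler; the only assumption is a POSIX shell with `awk` available.

```sh
#!/bin/sh
# Hypothetical helpers for illustration only (not part of perf).

# pct_gap BASE NEW: how much larger NEW is than BASE, in percent.
pct_gap() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# pct_share PART TOTAL: PART as a percentage of TOTAL.
pct_share() {
    awk -v part="$1" -v total="$2" \
        'BEGIN { printf "%.1f\n", part / total * 100 }'
}

# Figures taken from the 500 ref case 1 table (gcc value first, maple second).
pct_gap 294063 415720      # cycles samples
pct_gap 464 1232           # allocated stack frame size in bytes
pct_gap 109701 195745      # stack ldr/str samples
pct_share 109701 294063    # gcc: share of samples on stack ldr/str
pct_share 195745 415720    # maple: share of samples on stack ldr/str
```

Run as a script, this prints 41.4, 165.5, 78.4, 37.3 and 47.1, which match the table. The same two helpers can be pointed at the S_regmatch rows of the case 2 and case 3 reports to compare the maple and gcc sample counts there.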
calc_dfs_tree_nonrec 0.55% 0.41% 2230 2506 cpugcc_r [.] bitmap_clear_bit 0.52% 0.21% 2118 1693 cpugcc_r [.] df_ref_create_structure 0.50% 0.09% 2061 713 cpugcc_r [.] fold_binary_loc ``` ### **HW进行中的主要性能优化相关开发** #### 1. RA前global scope的instruction scheduling #### 2. 中端CFG优化  1)类似[链接](https://gitee.com/openarkcompiler/OpenArkCompiler/issues/I43ZMH) 中的场景,尽可能化简逻辑运算,减少判断次数,化简CFG  2)基于Value Range的冗余跳转化简 #### 3. IPA增强alias或函数多版本 #### 4. sink/hoist优化使能 #### 5. SR-add opt的后端支持 [链接](https://gitee.com/openarkcompiler/OpenArkCompiler/issues/I45F06) #### 6. alignment for loop/label/jump/data #### 7. RA后strldopt冗余消除 ### [附录] perf 使用 perf 采样 ```sh perf record -e cpu-cycles,cache-misses -g [cmd] ``` `-e` 指定采样事件,这里同时采样 `cpu-cycles` 和 `cache-misses`,`-g` 记录 call-graph 信息,方便 perf report 的时候查看调用栈信息。采样结束后,在当前目录生成 perf.data,使用 perf report 查看: ```sh perf report -n --no-children --group -f -i perf.data ``` `-n` 显示每个 symbol 的采样数量,`--no-children` 只统计 caller 本身的采样数量,不统计它的 callee。`--group` 会将 perf.data 中所有的采样事件都显示出来。`-i` 指定输入文件名。 进入 perf report 交互界面后,可以看到热点函数的采样统计,一般会选中某个当前 DSO 的 symbol,然后使用快捷键 `d` 和 `F`,`d` 表示仅显示当前 DSO 的 symbol,无关 symbol(比如 libc.so 中的 malloc)会隐藏,`F` 表示对当前显示的 symbol 重新计算采样百分比。选中某个 symbol,按 `a` 可以进入该 symbol 的汇编代码,汇编代码是标记过的,可以看到每段汇编指令的采样分布。
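The listings in this issue only show symbols whose cycles share is above 0.5%. For scripted comparisons, that filter can be approximated on a report saved as text (for instance one produced with `perf report --stdio`). The following is a minimal sketch, not part of the original workflow: `hot_symbols` is a hypothetical helper name, and it assumes each row has the same layout as the listings above (an optional `+` marker, two percentage columns, two sample counts, the command, `[.]`, and the symbol).

```sh
#!/bin/sh
# hot_symbols LIMIT
# Hypothetical helper: reads perf-report rows on stdin and prints
# "symbol cycles%" for every row whose first (cpu-cycles) percentage
# is at least LIMIT. Assumed row layout, as in the listings above:
#   [+] cycles% misses% cycles_samples misses_samples command [.] symbol
hot_symbols() {
    awk -v limit="$1" '
        {
            i = ($1 == "+") ? 2 : 1         # skip the "+" expand marker if present
            pct = $i
            if (pct !~ /^[0-9.]+%$/) next   # ignore header lines and blank lines
            sub(/%$/, "", pct)              # strip the percent sign for comparison
            if (pct + 0 >= limit + 0) print $NF, $i
        }'
}

# Example: keep only symbols with at least 1% of the cpu-cycles samples.
hot_symbols 1.0 <<'EOF'
Overhead Samples Command Symbol
+ 18.62% 18.04% 70999 71934 cpugcc_r [.] bitmap_ior_into
+ 3.91% 2.34% 14901 12303 cpugcc_r [.] do_complex_constraint
  0.52% 0.74% 1959 2104 cpugcc_r [.] find_reg_note
EOF
```

With the three sample rows above, only `bitmap_ior_into` and `do_complex_constraint` pass the 1% limit. The text written by `perf report --stdio` may contain extra comment lines or a different column layout depending on the perf version, so the field positions may need adjusting.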