diff --git a/docs/mindformers/docs/source_en/feature/monitor.md b/docs/mindformers/docs/source_en/feature/monitor.md
index 516349875e5f133d30a46da71c40e8d99c6aec50..5002eb4153f4433cfb3c5a4faf84085aaec7e7cb 100644
--- a/docs/mindformers/docs/source_en/feature/monitor.md
+++ b/docs/mindformers/docs/source_en/feature/monitor.md
@@ -27,6 +27,10 @@ monitor_config:
     device_local_norm_format: ['log', 'tensorboard']
     optimizer_state_format: null
     weight_state_format: null
+    weight_stable_rank_format: null
+    weight_eigenvalue_format: null
+    weight_aggregation: False
+    experts_abstract: False
     throughput_baseline: null
     print_struct: False
     check_for_global_norm: False
@@ -57,6 +61,10 @@ callbacks:
 | monitor_config.device_local_norm_format | Sets the logging form of the indicator `device_local_norm` | str or list[str] |
 | monitor_config.optimizer_state_format | Sets the logging form of the indicator `optimizer_state` | str or list[str] |
 | monitor_config.weight_state_format | Sets the logging form of the indicator `weight L2-norm` | str or list[str] |
+| monitor_config.weight_stable_rank_format | Sets the logging form of the indicator `weight_stable_rank` | str or list[str] |
+| monitor_config.weight_eigenvalue_format | Sets the logging form of the indicator `weight_eigenvalue` | str or list[str] |
+| monitor_config.weight_aggregation | Sets whether to aggregate weights via communication before calculating the metrics `weight_stable_rank` and `weight_eigenvalue` | bool |
+| monitor_config.experts_abstract | Sets how the indicators `weight_stable_rank` and `weight_eigenvalue` are shown in the log for MoE models: whether to display only statistical values (for MoE models, these two metrics do not support TensorBoard display) | bool |
 | monitor_config.throughput_baseline | Sets the baseline value for the metric `throughput linearity`, which needs to be positive. It will be written to both TensorBoard and logs. Defaults to `null` when not set, indicating that the metric is not monitored | int or float |
 | monitor_config.print_struct | Sets whether to print all trainable parameter names for the model. If `True`, it will print the names of all trainable parameters at the start of the first step and exit training at the end of the step. Default is `False`. | bool |
 | monitor_config.check_for_global_norm | Sets whether to enable anomaly monitoring for indicator `global norm`. Default is `False` | bool |
diff --git a/docs/mindformers/docs/source_zh_cn/feature/monitor.md b/docs/mindformers/docs/source_zh_cn/feature/monitor.md
index 5aa787fe140f693a68aed88f73d6be8915dc2061..de13f075193d4c90fcd4d846d2128e6f6687bcea 100644
--- a/docs/mindformers/docs/source_zh_cn/feature/monitor.md
+++ b/docs/mindformers/docs/source_zh_cn/feature/monitor.md
@@ -27,6 +27,10 @@ monitor_config:
     device_local_norm_format: ['log', 'tensorboard']
     optimizer_state_format: null
     weight_state_format: null
+    weight_stable_rank_format: null
+    weight_eigenvalue_format: null
+    weight_aggregation: False
+    experts_abstract: False
     throughput_baseline: null
     print_struct: False
     check_for_global_norm: False
@@ -57,6 +61,10 @@ callbacks:
 | monitor_config.device_local_norm_format | 设置指标`device_local_norm`的记录形式 | str或list[str] |
 | monitor_config.optimizer_state_format | 设置指标`optimizer_state`的记录形式 | str或list[str] |
 | monitor_config.weight_state_format | 设置指标`权重L2-norm`的记录形式 | str或list[str] |
+| monitor_config.weight_stable_rank_format | 设置指标`weight_stable_rank`的记录形式 | str或list[str] |
+| monitor_config.weight_eigenvalue_format | 设置指标`weight_eigenvalue`的记录形式 | str或list[str] |
+| monitor_config.weight_aggregation | 设置计算指标`weight_stable_rank`和`weight_eigenvalue`时是否先通信做权重聚合 | bool |
+| monitor_config.experts_abstract | MOE模型下设置展示指标`weight_stable_rank`和`weight_eigenvalue`在log中的展示形式:是否只展示统计值(MOE模型下指标`weight_stable_rank`和`weight_eigenvalue`不支持Tensorboard显示) | bool |
 | monitor_config.throughput_baseline | 设置指标`吞吐量线性度`的基线值,需要为正数。会同时写入到 TensorBoard 和日志。未设置时默认为`null`,表示不监控该指标 | int或float |
 | monitor_config.print_struct | 设置是否打印模型的全部可训练参数名。若为`True`,则会在第一个step开始时打印所有可训练参数的名称,并在step结束后退出训练。默认为`False` | bool |
 | monitor_config.check_for_global_norm | 设置是否开启指标`global norm`的异常监测。默认为`False` | bool |
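
Reviewer note: a minimal sketch of how the four new options might be enabled together for an MoE model. It assumes that `'log'` is accepted as a single string value (inferred from the documented `str or list[str]` type and the existing `device_local_norm_format: ['log', 'tensorboard']` entry) and that the keys sit at the same indentation as the other `monitor_config` fields. Only `'log'` is used here because the table states that these two metrics do not support TensorBoard display for MoE models.

```yaml
monitor_config:
    # Record the two new weight metrics in the training log only
    # (assumed value; TensorBoard display is not supported for MoE models).
    weight_stable_rank_format: 'log'
    weight_eigenvalue_format: 'log'
    # Aggregate weights via communication before computing the metrics.
    weight_aggregation: True
    # For MoE models, display only statistical values in the log.
    experts_abstract: True
```

Leaving both `*_format` keys at `null` (the default shown in the diff) appears to keep these metrics disabled, matching the behavior of the other `*_format` options.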