From 5ad20a9bdf5313595bd18f384ed58bdf9e3a6848 Mon Sep 17 00:00:00 2001
From: Ming Wang
Date: Thu, 24 Apr 2025 15:58:30 +0800
Subject: [PATCH 1/4] mm/hugetlb: LoongArch: Return NULL from huge_pte_offset() for none PMD

commit bd51834d1cf65a2c801295d230c220aeebf87a73 upstream.

LoongArch's huge_pte_offset() currently returns a pointer to a PMD slot
even if the underlying entry points to invalid_pte_table (indicating no
mapping). Callers like smaps_hugetlb_range() fetch this invalid entry
value (the address of invalid_pte_table) via this pointer.

The generic is_swap_pte() check then incorrectly identifies this address
as a swap entry on LoongArch, because it satisfies the !pte_present() &&
!pte_none() conditions. This misinterpretation, combined with a
coincidental match by is_migration_entry() on the address bits, leads to
kernel crashes in pfn_swap_entry_to_page().

Fix this at the architecture level by modifying huge_pte_offset() to
check the PMD entry's content using pmd_none() before returning. If the
entry is none (i.e. it points to invalid_pte_table), return NULL instead
of the pointer to the slot.

Co-developed-by: Hongchen Zhang
Signed-off-by: Hongchen Zhang
Signed-off-by: Huacai Chen
Signed-off-by: Ming Wang
---
 arch/loongarch/mm/hugetlbpage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/loongarch/mm/hugetlbpage.c b/arch/loongarch/mm/hugetlbpage.c
index 62ddcea0aa14..aca52c42e94e 100644
--- a/arch/loongarch/mm/hugetlbpage.c
+++ b/arch/loongarch/mm/hugetlbpage.c
@@ -47,7 +47,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr,
 				pmd = pmd_offset(pud, addr);
 		}
 	}
-	return (pte_t *) pmd;
+	return pmd_none(pmdp_get(pmd)) ? NULL : (pte_t *) pmd;
 }
 
 int pmd_huge(pmd_t pmd)
--
Gitee
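For reference, the misinterpretation described above comes from the generic
swap-entry test, which only looks at the present/none bits of a PTE value.
A minimal kernel-style sketch of that logic (the helper name is illustrative;
the body mirrors the generic is_swap_pte() check):

static inline int looks_like_swap_pte(pte_t pte)
{
	/*
	 * A huge-PTE slot whose value is the address of invalid_pte_table
	 * is neither present nor none, so without the pmd_none() filter
	 * added above it would pass this test and be treated as a swap
	 * (possibly migration) entry.
	 */
	return !pte_none(pte) && !pte_present(pte);
}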
From 09ec17ffaee23a7283ace964baa4bf395bc46e91 Mon Sep 17 00:00:00 2001
From: Tianyang Zhang
Date: Wed, 16 Apr 2025 16:24:05 +0800
Subject: [PATCH 2/4] mm/page_alloc.c: avoid infinite retries caused by cpuset race

commit 7bc688db8dd2e7b2cc0b0169ef0d98e2a238c96f upstream.

__alloc_pages_slowpath() has no change detection for ac->nodemask in its
retry path, while cpuset can modify it in parallel. For processes whose
mempolicy is MPOL_BIND this means ac->nodemask can change underneath the
allocation: should_reclaim_retry() then judges against the latest
nodemask and jumps to retry, while get_page_from_freelist() still only
traverses the zonelist from ac->preferred_zoneref, which was selected
from the now-stale nodemask. In some cases this leads to infinite
retries:

cpu 64:
__alloc_pages_slowpath {
        /* ..... */
retry:
        /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */
        if (alloc_flags & ALLOC_KSWAPD)
                wake_all_kswapds(order, gfp_mask, ac);

        /* cpu 1:
        cpuset_write_resmask
            update_nodemask
                update_nodemasks_hier
                    update_tasks_nodemask
                        mpol_rebind_task
                            mpol_rebind_policy
                                mpol_rebind_nodemask
        // mempolicy->nodes has been modified,
        // which ac->nodemask points to
        */

        /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */
        if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                                 did_some_progress > 0, &no_progress_loops))
                goto retry;
}

Simultaneously starting multiple cpuset01 instances from LTP can quickly
reproduce this issue on a multi-node server once maximum memory pressure
is reached and swap is enabled.

Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn
Fixes: 902b62810a57 ("mm, page_alloc: fix more premature OOM due to race with cpuset update")
Signed-off-by: Tianyang Zhang
Cc: Vlastimil Babka
Cc: Suren Baghdasaryan
Cc: Michal Hocko
Cc: Brendan Jackman
Cc: Johannes Weiner
Cc: Zi Yan
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Ming Wang
---
 mm/page_alloc.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aa292df1c282..6fee765b26d0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4205,6 +4205,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 retry:
+	/*
+	 * Deal with possible cpuset update races or zonelist updates to avoid
+	 * infinite retries.
+	 */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+	    check_retry_zonelist(zonelist_iter_cookie))
+		goto restart;
+
 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
--
Gitee
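For reference, the restart taken by the new check follows the cookie pattern
already used at the top of the slowpath: a sequence cookie is sampled before
the allocation state is derived, and a change in that cookie means the state
must be rebuilt rather than retried. A simplified sketch of that pattern
(control flow condensed; read_mems_allowed_begin()/read_mems_allowed_retry()
are the existing seqcount helpers):

	unsigned int cpuset_mems_cookie;

restart:
	/* Sample the cpuset generation, then derive everything from it. */
	cpuset_mems_cookie = read_mems_allowed_begin();
	/* ... recompute alloc_flags and ac->preferred_zoneref here ... */

retry:
	/*
	 * If a concurrent cpuset update invalidated the sampled state, go
	 * back to restart and rebuild ac->preferred_zoneref instead of
	 * looping on retry with a stale zonelist starting point.
	 */
	if (read_mems_allowed_retry(cpuset_mems_cookie))
		goto restart;

	/* ... reclaim/compaction; should_reclaim_retry() may goto retry ... */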
From 9dd6b08b7fb0c54b1d776435bbe2ab004fc85485 Mon Sep 17 00:00:00 2001
From: Zi Yan
Date: Tue, 17 Oct 2023 12:31:28 -0400
Subject: [PATCH 3/4] mm/migrate: correct nr_failed in migrate_pages_sync()

commit a259945efe6ada94087ef666e9b38f8e34ea34ba upstream.

nr_failed was missing the large folio splits from migrate_pages_batch()
and can cause a mismatch between the migrate_pages() return value and
the number of not-migrated pages, i.e., when the return value of
migrate_pages() is 0, there are still pages left in the from page list.

It will happen when a non-PMD THP large folio fails to migrate due to
-ENOMEM and is split successfully, but not all of the split pages are
migrated: migrate_pages_batch() returns non-zero, yet astats.nr_thp_split
is 0, so nr_failed stays 0 and is returned to the caller of
migrate_pages(), while the not-migrated pages are left on the from page
list without being added back to the LRU lists.

Fix it by adding a new nr_split counter for large folio splits and adding
it to nr_failed in migrate_pages_sync() after migrate_pages_batch() is
done.

Link: https://lkml.kernel.org/r/20231017163129.2025214-1-zi.yan@sent.com
Fixes: 2ef7dbb26990 ("migrate_pages: try migrate in batch asynchronously firstly")
Signed-off-by: Zi Yan
Acked-by: Huang Ying
Reviewed-by: Baolin Wang
Cc: David Hildenbrand
Cc: Matthew Wilcox
Signed-off-by: Andrew Morton
Change-Id: Iba2d57d235bd7cb131c7d8a7fe07cec94861126b
Signed-off-by: Ming Wang
---
 mm/migrate.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 1004b1def1c2..4ed470885217 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1504,6 +1504,7 @@ struct migrate_pages_stats {
 	int nr_thp_succeeded;	/* THP migrated successfully */
 	int nr_thp_failed;	/* THP failed to be migrated */
 	int nr_thp_split;	/* THP split before migrating */
+	int nr_split;	/* Large folio (include THP) split before migrating */
 };
 
 /*
@@ -1623,6 +1624,7 @@ static int migrate_pages_batch(struct list_head *from,
 	int nr_retry_pages = 0;
 	int pass = 0;
 	bool is_thp = false;
+	bool is_large = false;
 	struct folio *folio, *folio2, *dst = NULL, *dst2;
 	int rc, rc_saved = 0, nr_pages;
 	LIST_HEAD(unmap_folios);
@@ -1638,7 +1640,8 @@ static int migrate_pages_batch(struct list_head *from,
 		nr_retry_pages = 0;
 
 		list_for_each_entry_safe(folio, folio2, from, lru) {
-			is_thp = folio_test_large(folio) && folio_test_pmd_mappable(folio);
+			is_large = folio_test_large(folio);
+			is_thp = is_large && folio_test_pmd_mappable(folio);
 			nr_pages = folio_nr_pages(folio);
 
 			cond_resched();
@@ -1658,6 +1661,7 @@ static int migrate_pages_batch(struct list_head *from,
 				stats->nr_thp_failed++;
 				if (!try_split_folio(folio, split_folios)) {
 					stats->nr_thp_split++;
+					stats->nr_split++;
 					continue;
 				}
 				stats->nr_failed_pages += nr_pages;
@@ -1686,11 +1690,12 @@ static int migrate_pages_batch(struct list_head *from,
 				nr_failed++;
 				stats->nr_thp_failed += is_thp;
 				/* Large folio NUMA faulting doesn't split to retry. */
-				if (folio_test_large(folio) && !nosplit) {
+				if (is_large && !nosplit) {
 					int ret = try_split_folio(folio, split_folios);
 
 					if (!ret) {
 						stats->nr_thp_split += is_thp;
+						stats->nr_split += is_large;
 						break;
 					} else if (reason == MR_LONGTERM_PIN &&
 						   ret == -EAGAIN) {
@@ -1836,6 +1841,7 @@ static int migrate_pages_sync(struct list_head *from, new_folio_t get_new_folio,
 	stats->nr_succeeded += astats.nr_succeeded;
 	stats->nr_thp_succeeded += astats.nr_thp_succeeded;
 	stats->nr_thp_split += astats.nr_thp_split;
+	stats->nr_split += astats.nr_split;
 	if (rc < 0) {
 		stats->nr_failed_pages += astats.nr_failed_pages;
 		stats->nr_thp_failed += astats.nr_thp_failed;
@@ -1843,7 +1849,11 @@ static int migrate_pages_sync(struct list_head *from, new_folio_t get_new_folio,
 		return rc;
 	}
 	stats->nr_thp_failed += astats.nr_thp_split;
-	nr_failed += astats.nr_thp_split;
+	/*
+	 * Do not count rc, as pages will be retried below.
+	 * Count nr_split only, since it includes nr_thp_split.
+	 */
+	nr_failed += astats.nr_split;
 	/*
 	 * Fall back to migrate all failed folios one by one synchronously. All
 	 * failed folios except split THPs will be retried, so their failure
--
Gitee
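As a concrete example of the accounting this fixes: if one 16-page,
non-PMD-mappable large folio fails with -ENOMEM during the async pass and is
split, nr_thp_split stays 0, so before this change the split contributed
nothing to nr_failed and migrate_pages() could report 0 while the remaining
pages sat on the from list without being returned to the LRU. With the new
counter the split folio contributes 1 via nr_split (which also covers
PMD-sized THPs, hence the comment that it already includes nr_thp_split),
and the caller sees a non-zero result.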
From 0babf107deccfb051c75a862b0accdc3a2d2f656 Mon Sep 17 00:00:00 2001
From: Tianyang Zhang
Date: Wed, 7 May 2025 18:06:24 +0800
Subject: [PATCH 4/4] LoongArch: Prevent cond_resched() occurring within kernel-fpu

When CONFIG_PREEMPT_COUNT is not configured (i.e. CONFIG_PREEMPT_NONE /
CONFIG_PREEMPT_VOLUNTARY), preempt_disable() / preempt_enable() merely
act as a barrier(). However, in these cases cond_resched() can still
trigger a context switch and modify CSR.EUEN, so a do_fpu() exception
can be raised inside a kernel-fpu critical section, as demonstrated by
the following path:

dcn32_calculate_wm_and_dlg()
  DC_FP_START()
    dcn32_calculate_wm_and_dlg_fpu()
      dcn32_find_dummy_latency_index_for_fw_based_mclk_switch()
        dcn32_internal_validate_bw()
          dcn32_enable_phantom_stream()
            dc_create_stream_for_sink()
              kzalloc(GFP_KERNEL)
                __kmem_cache_alloc_node()
                  __cond_resched()
  DC_FP_END()

This patch is similar to commit d02198550423a0b ("x86/fpu: Improve crypto
performance by making kernel-mode FPU reliably usable in softirqs"): it
uses local_bh_disable() instead of preempt_disable() for non-RT kernels,
which avoids the cond_resched() issue and also extends kernel-fpu usage
to softirq context.

Link: https://lore.kernel.org/all/20250509020626.23414-1-zhangtianyang@loongson.cn/
Signed-off-by: Tianyang Zhang
Signed-off-by: Huacai Chen
Signed-off-by: Ming Wang
---
 arch/loongarch/kernel/kfpu.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/loongarch/kernel/kfpu.c b/arch/loongarch/kernel/kfpu.c
index ec5b28e570c9..4c476904227f 100644
--- a/arch/loongarch/kernel/kfpu.c
+++ b/arch/loongarch/kernel/kfpu.c
@@ -18,11 +18,28 @@ static unsigned int euen_mask = CSR_EUEN_FPEN;
 static DEFINE_PER_CPU(bool, in_kernel_fpu);
 static DEFINE_PER_CPU(unsigned int, euen_current);
 
+static inline void fpregs_lock(void)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+	else
+		local_bh_disable();
+}
+
+static inline void fpregs_unlock(void)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
+	else
+		local_bh_enable();
+}
+
 void kernel_fpu_begin(void)
 {
 	unsigned int *euen_curr;
 
-	preempt_disable();
+	if (!irqs_disabled())
+		fpregs_lock();
 
 	WARN_ON(this_cpu_read(in_kernel_fpu));
 
@@ -73,7 +90,8 @@ void kernel_fpu_end(void)
 
 	this_cpu_write(in_kernel_fpu, false);
 
-	preempt_enable();
+	if (!irqs_disabled())
+		fpregs_unlock();
 }
 EXPORT_SYMBOL_GPL(kernel_fpu_end);
--
Gitee
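For reference, a minimal sketch of the usage pattern this change protects,
mirroring the DC_FP_START()/DC_FP_END() path quoted above (the wrapper
function below is a made-up example, not code from this patch;
kernel_fpu_begin()/kernel_fpu_end() come from the arch FPU header):

static void example_simd_worker(void)
{
	kernel_fpu_begin();	/* with this patch: softirqs disabled on non-RT kernels */

	/*
	 * FP/SIMD work goes here. Nothing in this region may sleep or call
	 * cond_resched(), and any allocation must avoid GFP_KERNEL, since a
	 * context switch would clobber CSR.EUEN and the live FPU state.
	 */

	kernel_fpu_end();
}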