From 08115cdf70b06f33ff3689720b301ee68ab3b80b Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Fri, 13 Jun 2025 19:26:50 +0200 Subject: [PATCH 01/49] UPSTREAM: posix-cpu-timers: fix race between handle_posix_cpu_timers() and posix_cpu_timer_del() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit commit f90fff1e152dedf52b932240ebbd670d83330eca upstream. If an exiting non-autoreaping task has already passed exit_notify() and calls handle_posix_cpu_timers() from IRQ, it can be reaped by its parent or debugger right after unlock_task_sighand(). If a concurrent posix_cpu_timer_del() runs at that moment, it won't be able to detect timer->it.cpu.firing != 0: cpu_timer_task_rcu() and/or lock_task_sighand() will fail. Add the tsk->exit_state check into run_posix_cpu_timers() to fix this. This fix is not needed if CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, because exit_task_work() is called before exit_notify(). But the check still makes sense, task_work_add(&tsk->posix_cputimers_work.work) will fail anyway in this case. Bug: 425282960 Cc: stable@vger.kernel.org Reported-by: Benoît Sevens Fixes: 0bdd2ed4138e ("sched: run_posix_cpu_timers: Don't check ->exit_state, use lock_task_sighand()") Signed-off-by: Oleg Nesterov Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman (cherry picked from commit c29d5318708e67ac13c1b6fc1007d179fb65b4d7) Signed-off-by: Lee Jones Change-Id: I2a9b8114abf2647c346e763edee1d424a07e86fe --- kernel/time/posix-cpu-timers.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c index e9c6f9d0e42c..9af1f2a72a0a 100644 --- a/kernel/time/posix-cpu-timers.c +++ b/kernel/time/posix-cpu-timers.c @@ -1437,6 +1437,15 @@ void run_posix_cpu_timers(void) lockdep_assert_irqs_disabled(); + /* + * Ensure that release_task(tsk) can't happen while + * handle_posix_cpu_timers() is running. Otherwise, a concurrent + * posix_cpu_timer_del() may fail to lock_task_sighand(tsk) and + * miss timer->it.cpu.firing != 0. + */ + if (tsk->exit_state) + return; + /* * If the actual expiry is deferred to task work context and the * work is already scheduled there is no point to do anything here. From 3b5bd5416eb3ab4dccd436690affae5036b90953 Mon Sep 17 00:00:00 2001 From: Shiming Cheng Date: Fri, 30 May 2025 09:26:08 +0800 Subject: [PATCH 02/49] UPSTREAM: net: fix udp gso skb_segment after pull from frag_list Commit a1e40ac5b5e9 ("net: gso: fix udp gso fraglist segmentation after pull from frag_list") detected invalid geometry in frag_list skbs and redirects them from skb_segment_list to more robust skb_segment. But some packets with modified geometry can also hit bugs in that code. We don't know how many such cases exist. Addressing each one by one also requires touching the complex skb_segment code, which risks introducing bugs for other types of skbs. Instead, linearize all these packets that fail the basic invariants on gso fraglist skbs. That is more robust. If only part of the fraglist payload is pulled into head_skb, it will always cause exception when splitting skbs by skb_segment. For detailed call stack information, see below. Valid SKB_GSO_FRAGLIST skbs - consist of two or more segments - the head_skb holds the protocol headers plus first gso_size - one or more frag_list skbs hold exactly one segment - all but the last must be gso_size Optional datapath hooks such as NAT and BPF (bpf_skb_pull_data) can modify fraglist skbs, breaking these invariants. 
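The geometry rules above can be expressed as a small standalone check. The sketch below is illustrative only, not kernel code; the struct and helper names are invented here. It encodes the listed invariants and shows how pulling one byte of payload into the head turns a valid (11, 11, 10) chain with gso_size 11 into (12, 10, 10), the case walked through next, which no longer satisfies them and therefore must be linearized before regular skb_segment():

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy model of a fraglist skb: the head payload plus chained segments. */
struct seg {
    size_t len;        /* payload bytes in this segment */
    struct seg *next;  /* next frag_list entry, NULL for the last one */
};

/*
 * True when the chain still satisfies the SKB_GSO_FRAGLIST rules quoted
 * above: the head carries exactly gso_size bytes of payload and every
 * frag_list segment except the last is exactly gso_size.
 */
static bool fraglist_geometry_ok(const struct seg *head, size_t gso_size)
{
    const struct seg *s;

    if (head->len != gso_size)
        return false;
    for (s = head->next; s && s->next; s = s->next)
        if (s->len != gso_size)
            return false;
    return true;
}

int main(void)
{
    size_t gso_size = 11;

    /* Valid geometry: (11, 11, 10). */
    struct seg last  = { .len = 10, .next = NULL };
    struct seg mid   = { .len = 11, .next = &last };
    struct seg head  = { .len = 11, .next = &mid };

    /* After a one-byte pull into the head: (12, 10, 10). */
    struct seg last2 = { .len = 10, .next = NULL };
    struct seg mid2  = { .len = 10, .next = &last2 };
    struct seg head2 = { .len = 12, .next = &mid2 };

    printf("before pull: %s\n",
           fraglist_geometry_ok(&head, gso_size) ? "list fast path" : "linearize");
    printf("after pull:  %s\n",
           fraglist_geometry_ok(&head2, gso_size) ? "list fast path" : "linearize");
    return 0;
}
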
In extreme cases they pull one part of data into skb linear. For UDP, this causes three payloads with lengths of (11,11,10) bytes were pulled tail to become (12,10,10) bytes. The skbs no longer meets the above SKB_GSO_FRAGLIST conditions because payload was pulled into head_skb, it needs to be linearized before pass to regular skb_segment. skb_segment+0xcd0/0xd14 __udp_gso_segment+0x334/0x5f4 udp4_ufo_fragment+0x118/0x15c inet_gso_segment+0x164/0x338 skb_mac_gso_segment+0xc4/0x13c __skb_gso_segment+0xc4/0x124 validate_xmit_skb+0x9c/0x2c0 validate_xmit_skb_list+0x4c/0x80 sch_direct_xmit+0x70/0x404 __dev_queue_xmit+0x64c/0xe5c neigh_resolve_output+0x178/0x1c4 ip_finish_output2+0x37c/0x47c __ip_finish_output+0x194/0x240 ip_finish_output+0x20/0xf4 ip_output+0x100/0x1a0 NF_HOOK+0xc4/0x16c ip_forward+0x314/0x32c ip_rcv+0x90/0x118 __netif_receive_skb+0x74/0x124 process_backlog+0xe8/0x1a4 __napi_poll+0x5c/0x1f8 net_rx_action+0x154/0x314 handle_softirqs+0x154/0x4b8 [118.376811] [C201134] rxq0_pus: [name:bug&]kernel BUG at net/core/skbuff.c:4278! [118.376829] [C201134] rxq0_pus: [name:traps&]Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP [118.470774] [C201134] rxq0_pus: [name:mrdump&]Kernel Offset: 0x178cc00000 from 0xffffffc008000000 [118.470810] [C201134] rxq0_pus: [name:mrdump&]PHYS_OFFSET: 0x40000000 [118.470827] [C201134] rxq0_pus: [name:mrdump&]pstate: 60400005 (nZCv daif +PAN -UAO) [118.470848] [C201134] rxq0_pus: [name:mrdump&]pc : [0xffffffd79598aefc] skb_segment+0xcd0/0xd14 [118.470900] [C201134] rxq0_pus: [name:mrdump&]lr : [0xffffffd79598a5e8] skb_segment+0x3bc/0xd14 [118.470928] [C201134] rxq0_pus: [name:mrdump&]sp : ffffffc008013770 Fixes: a1e40ac5b5e9 ("gso: fix udp gso fraglist segmentation after pull from frag_list") Bug: 426014478 Change-Id: Ib9d9c84b6f20afc1e1d129ceb59c9c3a7eb8e6de (cherry picked from commit 3382a1ed7f778db841063f5d7e317ac55f9e7f72) Signed-off-by: Shiming Cheng Reviewed-by: Willem de Bruijn Signed-off-by: David S. Miller --- net/ipv4/udp_offload.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c index 132cfc3b2c84..3870b59f5400 100644 --- a/net/ipv4/udp_offload.c +++ b/net/ipv4/udp_offload.c @@ -332,6 +332,7 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb, bool copy_dtor; __sum16 check; __be16 newlen; + int ret = 0; mss = skb_shinfo(gso_skb)->gso_size; if (gso_skb->len <= sizeof(*uh) + mss) @@ -354,6 +355,10 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb, if (skb_pagelen(gso_skb) - sizeof(*uh) == skb_shinfo(gso_skb)->gso_size) return __udp_gso_segment_list(gso_skb, features, is_ipv6); + ret = __skb_linearize(gso_skb); + if (ret) + return ERR_PTR(ret); + /* Setup csum, as fraglist skips this in udp4_gro_receive. */ gso_skb->csum_start = skb_transport_header(gso_skb) - gso_skb->head; gso_skb->csum_offset = offsetof(struct udphdr, check); From 279274c126be34b41ed69765f9d23ff9f3031e76 Mon Sep 17 00:00:00 2001 From: Mukesh Pilaniya Date: Fri, 27 Jun 2025 13:33:16 +0530 Subject: [PATCH 03/49] ANDROID: virt: gunyah: Replace arm_smccc_1_1_smc with arm_smccc_1_1_invoke Replace arm_smccc_1_1_smc with arm_smccc_1_1_invoke because arm_smccc_1_1_invoke() determines the conduit (hvc/smc/none) before making an SMC, which may not be supported on some virtual platforms. 
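The behavioural difference can be sketched with a small standalone model. The code below is illustrative only; the enum, helpers and function ID are stand-ins rather than the real arm-smccc definitions. It only shows the idea of dispatching on the conduit detected at boot instead of hard-coding an SMC instruction:

#include <stdio.h>

/* Stand-ins for the real SMCCC conduit values and call helpers. */
enum conduit { CONDUIT_NONE, CONDUIT_SMC, CONDUIT_HVC };

struct res { unsigned long a0, a1, a2, a3; };

static void call_smc(unsigned long fn_id, struct res *r) { (void)fn_id; r->a0 = 0; }
static void call_hvc(unsigned long fn_id, struct res *r) { (void)fn_id; r->a0 = 0; }

/*
 * Simplified model of conduit-aware invocation: route the call through
 * whichever conduit the platform advertised, and fail cleanly when no
 * conduit is available at all.
 */
static int invoke(enum conduit c, unsigned long fn_id, struct res *r)
{
    switch (c) {
    case CONDUIT_HVC:
        call_hvc(fn_id, r);
        return 0;
    case CONDUIT_SMC:
        call_smc(fn_id, r);
        return 0;
    default:
        r->a0 = (unsigned long)-1;  /* "not supported", nothing was trapped */
        return -1;
    }
}

int main(void)
{
    struct res r;
    unsigned long fn_id = 0;  /* placeholder function ID */

    /* A guest whose platform only offers HVC still gets a valid call. */
    if (invoke(CONDUIT_HVC, fn_id, &r))
        puts("no conduit available");
    return 0;
}
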
Bug: 428106948 Change-Id: Ib21c7790b03996e73caa0874dc826d78e7b1c3d8 Signed-off-by: Mukesh Pilaniya --- drivers/virt/gunyah/gunyah_qcom.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/virt/gunyah/gunyah_qcom.c b/drivers/virt/gunyah/gunyah_qcom.c index f2342d51a018..622d6a07db02 100644 --- a/drivers/virt/gunyah/gunyah_qcom.c +++ b/drivers/virt/gunyah/gunyah_qcom.c @@ -187,7 +187,7 @@ static bool gunyah_has_qcom_extensions(void) uuid_t uuid; u32 *up; - arm_smccc_1_1_smc(GUNYAH_QCOM_EXT_CALL_UUID_ID, &res); + arm_smccc_1_1_invoke(GUNYAH_QCOM_EXT_CALL_UUID_ID, &res); up = (u32 *)&uuid.b[0]; up[0] = lower_32_bits(res.a0); From cb35713803cd6692440647d8c8998c149760848e Mon Sep 17 00:00:00 2001 From: VAMSHI GAJJELA Date: Tue, 1 Jul 2025 11:30:16 +0000 Subject: [PATCH 04/49] ANDROID: scsi: ufs: add UFSHCD_ANDROID_QUIRK_NO_IS_READ_ON_H8 Add UFSHCD_ANDROID_QUIRK_NO_IS_READ_ON_H8 for host controllers which break when the Interrupt Status register is re-read after entering hibern8. In such cases after hibern8 entry is reported, no further register access will occur in the interrupt handler. Bug: 350576949 Change-Id: I8e810c96203a97f030216aae39253a2e102c7ebf Signed-off-by: VAMSHI GAJJELA --- drivers/ufs/core/ufshcd.c | 5 +++++ include/ufs/ufshcd.h | 3 +++ 2 files changed, 8 insertions(+) diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c index b21d96365b65..0c4eb4a48fc0 100644 --- a/drivers/ufs/core/ufshcd.c +++ b/drivers/ufs/core/ufshcd.c @@ -7006,6 +7006,11 @@ static irqreturn_t ufshcd_intr(int irq, void *__hba) if (enabled_intr_status) retval |= ufshcd_sl_intr(hba, enabled_intr_status); + if (hba->android_quirks & + UFSHCD_ANDROID_QUIRK_NO_IS_READ_ON_H8 && + intr_status & UIC_HIBERNATE_ENTER) + break; + intr_status = ufshcd_readl(hba, REG_INTERRUPT_STATUS); } diff --git a/include/ufs/ufshcd.h b/include/ufs/ufshcd.h index 66bd5c15375e..cde9ad6489b2 100644 --- a/include/ufs/ufshcd.h +++ b/include/ufs/ufshcd.h @@ -704,6 +704,9 @@ enum ufshcd_android_quirks { /* Set IID to one. */ UFSHCD_ANDROID_QUIRK_SET_IID_TO_ONE = 1 << 30, + + /* Do not read IS after H8 enter */ + UFSHCD_ANDROID_QUIRK_NO_IS_READ_ON_H8 = 1 << 31, }; enum ufshcd_caps { From e0a00524db094fcb6240182ab0638a0ccb44efd5 Mon Sep 17 00:00:00 2001 From: "T.J. Mercier" Date: Fri, 28 Mar 2025 22:05:04 +0000 Subject: [PATCH 05/49] ANDROID: gki_defconfig: Enable CONFIG_UDMABUF The main use case is to allow very large O_DIRECT writes into a memfd, which can then be converted into a udmabuf. (O_DIRECT writes into regular dmabufs are not possible.) Bug: 303531391 Bug: 389839576 Change-Id: Ifd970826ed1ecb4fe2d365854bcd19276b07f614 Signed-off-by: T.J. 
Mercier (cherry picked from commit 2f84f21fd838bd4203626fe300404d3ce923f770) Bug: 423003849 --- arch/arm64/configs/gki_defconfig | 1 + arch/x86/configs/gki_defconfig | 1 + 2 files changed, 2 insertions(+) diff --git a/arch/arm64/configs/gki_defconfig b/arch/arm64/configs/gki_defconfig index aee331a1430b..7b89c07f23b5 100644 --- a/arch/arm64/configs/gki_defconfig +++ b/arch/arm64/configs/gki_defconfig @@ -581,6 +581,7 @@ CONFIG_RTC_CLASS=y CONFIG_RTC_LIB_KUNIT_TEST=m CONFIG_RTC_DRV_PL030=y CONFIG_RTC_DRV_PL031=y +CONFIG_UDMABUF=y CONFIG_DMABUF_HEAPS=y CONFIG_DMABUF_SYSFS_STATS=y CONFIG_DMABUF_HEAPS_DEFERRED_FREE=y diff --git a/arch/x86/configs/gki_defconfig b/arch/x86/configs/gki_defconfig index c7bd6055c20b..6fe64231eb62 100644 --- a/arch/x86/configs/gki_defconfig +++ b/arch/x86/configs/gki_defconfig @@ -535,6 +535,7 @@ CONFIG_LEDS_TRIGGER_TRANSIENT=y CONFIG_EDAC=y CONFIG_RTC_CLASS=y CONFIG_RTC_LIB_KUNIT_TEST=m +CONFIG_UDMABUF=y CONFIG_DMABUF_HEAPS=y CONFIG_DMABUF_SYSFS_STATS=y CONFIG_DMABUF_HEAPS_DEFERRED_FREE=y From f45ef0a06f85390cbc8574c5b29c2dd0af4adf7d Mon Sep 17 00:00:00 2001 From: Vlastimil Babka Date: Thu, 25 Apr 2024 20:56:04 -0700 Subject: [PATCH 06/49] UPSTREAM: mm: page_alloc: change move_freepages() to __move_freepages_block() The function is now supposed to be called only on a single pageblock and checks start_pfn and end_pfn accordingly. Rename it to make this more obvious and drop the end_pfn parameter which can be determined trivially and none of the callers use it for anything else. Also make the (now internal) end_pfn exclusive, which is more common. Link: https://lkml.kernel.org/r/81b1d642-2ec0-49f5-89fc-19a3828419ff@suse.cz Signed-off-by: Vlastimil Babka Reviewed-by: Zi Yan Acked-by: Johannes Weiner Cc: David Hildenbrand Cc: "Huang, Ying" Cc: Mel Gorman Signed-off-by: Andrew Morton Bug: 420836317 (cherry picked from commit e1f42a577f63647dadf1abe4583053c03d6be045) Change-Id: I1e9ecd1670fda3edafff834849fbac2705a36324 Signed-off-by: yipeng xiang --- mm/page_alloc.c | 43 ++++++++++++++++++++----------------------- 1 file changed, 20 insertions(+), 23 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 152b0424fcbf..7f9f3f3df9f9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1766,18 +1766,18 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone, * Change the type of a block and move all its free pages to that * type's freelist. 
*/ -static int move_freepages(struct zone *zone, unsigned long start_pfn, - unsigned long end_pfn, int old_mt, int new_mt) +static int __move_freepages_block(struct zone *zone, unsigned long start_pfn, + int old_mt, int new_mt) { struct page *page; - unsigned long pfn; + unsigned long pfn, end_pfn; unsigned int order; int pages_moved = 0; VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1)); - VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn); + end_pfn = pageblock_end_pfn(start_pfn); - for (pfn = start_pfn; pfn <= end_pfn;) { + for (pfn = start_pfn; pfn < end_pfn;) { page = pfn_to_page(pfn); if (!PageBuddy(page)) { pfn++; @@ -1803,14 +1803,13 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn, static bool prep_move_freepages_block(struct zone *zone, struct page *page, unsigned long *start_pfn, - unsigned long *end_pfn, int *num_free, int *num_movable) { unsigned long pfn, start, end; pfn = page_to_pfn(page); start = pageblock_start_pfn(pfn); - end = pageblock_end_pfn(pfn) - 1; + end = pageblock_end_pfn(pfn); /* * The caller only has the lock for @zone, don't touch ranges @@ -1821,16 +1820,15 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page, */ if (!zone_spans_pfn(zone, start)) return false; - if (!zone_spans_pfn(zone, end)) + if (!zone_spans_pfn(zone, end - 1)) return false; *start_pfn = start; - *end_pfn = end; if (num_free) { *num_free = 0; *num_movable = 0; - for (pfn = start; pfn <= end;) { + for (pfn = start; pfn < end;) { page = pfn_to_page(pfn); if (PageBuddy(page)) { int nr = 1 << buddy_order(page); @@ -1856,13 +1854,12 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page, static int move_freepages_block(struct zone *zone, struct page *page, int old_mt, int new_mt) { - unsigned long start_pfn, end_pfn; + unsigned long start_pfn; - if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, - NULL, NULL)) + if (!prep_move_freepages_block(zone, page, &start_pfn, NULL, NULL)) return -1; - return move_freepages(zone, start_pfn, end_pfn, old_mt, new_mt); + return __move_freepages_block(zone, start_pfn, old_mt, new_mt); } #ifdef CONFIG_MEMORY_ISOLATION @@ -1933,10 +1930,9 @@ static void split_large_buddy(struct zone *zone, struct page *page, bool move_freepages_block_isolate(struct zone *zone, struct page *page, int migratetype) { - unsigned long start_pfn, end_pfn, pfn; + unsigned long start_pfn, pfn; - if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, - NULL, NULL)) + if (!prep_move_freepages_block(zone, page, &start_pfn, NULL, NULL)) return false; /* No splits needed if buddies can't span multiple blocks */ @@ -1967,8 +1963,9 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page, return true; } move: - move_freepages(zone, start_pfn, end_pfn, - get_pfnblock_migratetype(page, start_pfn), migratetype); + __move_freepages_block(zone, start_pfn, + get_pfnblock_migratetype(page, start_pfn), + migratetype); return true; } #endif /* CONFIG_MEMORY_ISOLATION */ @@ -2068,7 +2065,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page, unsigned int alloc_flags, bool whole_block) { int free_pages, movable_pages, alike_pages; - unsigned long start_pfn, end_pfn; + unsigned long start_pfn; int block_type; block_type = get_pageblock_migratetype(page); @@ -2101,8 +2098,8 @@ steal_suitable_fallback(struct zone *zone, struct page *page, goto single_page; /* moving whole block can fail due to zone boundary conditions */ - if (!prep_move_freepages_block(zone, page, &start_pfn, 
&end_pfn, - &free_pages, &movable_pages)) + if (!prep_move_freepages_block(zone, page, &start_pfn, &free_pages, + &movable_pages)) goto single_page; /* @@ -2132,7 +2129,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page, */ if (free_pages + alike_pages >= (1 << (pageblock_order-1)) || page_group_by_mobility_disabled) { - move_freepages(zone, start_pfn, end_pfn, block_type, start_type); + __move_freepages_block(zone, start_pfn, block_type, start_type); return __rmqueue_smallest(zone, order, start_type); } From bf5861fc36df819b62a20e06a5885e178954d8fa Mon Sep 17 00:00:00 2001 From: Huan Yang Date: Mon, 26 Aug 2024 14:40:48 +0800 Subject: [PATCH 07/49] UPSTREAM: mm: page_alloc: simpify page del and expand When page del from buddy and need expand, it will account free_pages in zone's migratetype. The current way is to subtract the page number of the current order when deleting, and then add it back when expanding. This is unnecessary, as when migrating the same type, we can directly record the difference between the high-order pages and the expand added, and then subtract it directly. This patch merge that, only when del and expand done, then account free_pages. Link: https://lkml.kernel.org/r/20240826064048.187790-1-link@vivo.com Signed-off-by: Huan Yang Reviewed-by: Vlastimil Babka Signed-off-by: Andrew Morton Bug: 420836317 (cherry picked from commit 94deaf69dcd33462c61fa8cabb0883e3085a1046) Change-Id: I26196bc41cbf0f64dc9a9bc2249c9c814ca055d0 Signed-off-by: yipeng xiang --- mm/page_alloc.c | 35 +++++++++++++++++++++++++---------- 1 file changed, 25 insertions(+), 10 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7f9f3f3df9f9..1c1098a37e48 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1518,11 +1518,11 @@ struct page *__pageblock_pfn_to_page(unsigned long start_pfn, * * -- nyc */ -static inline void expand(struct zone *zone, struct page *page, - int low, int high, int migratetype) +static inline unsigned int expand(struct zone *zone, struct page *page, int low, + int high, int migratetype) { - unsigned long size = 1 << high; - unsigned long nr_added = 0; + unsigned int size = 1 << high; + unsigned int nr_added = 0; while (high > low) { high--; @@ -1542,7 +1542,19 @@ static inline void expand(struct zone *zone, struct page *page, set_buddy_order(&page[size], high); nr_added += size; } - account_freepages(zone, nr_added, migratetype); + + return nr_added; +} + +static __always_inline void page_del_and_expand(struct zone *zone, + struct page *page, int low, + int high, int migratetype) +{ + int nr_pages = 1 << high; + + __del_page_from_free_list(page, zone, high, migratetype); + nr_pages -= expand(zone, page, low, high, migratetype); + account_freepages(zone, -nr_pages, migratetype); } static void check_new_page_bad(struct page *page) @@ -1727,8 +1739,9 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, page = get_page_from_free_area(area, migratetype); if (!page) continue; - del_page_from_free_list(page, zone, current_order, migratetype); - expand(zone, page, order, current_order, migratetype); + + page_del_and_expand(zone, page, order, current_order, + migratetype); trace_mm_page_alloc_zone_locked(page, order, migratetype, pcp_allowed_order(order) && migratetype < MIGRATE_PCPTYPES); @@ -2079,9 +2092,12 @@ steal_suitable_fallback(struct zone *zone, struct page *page, /* Take ownership for orders >= pageblock_order */ if (current_order >= pageblock_order) { + unsigned int nr_added; + del_page_from_free_list(page, zone, 
current_order, block_type); change_pageblock_range(page, current_order, start_type); - expand(zone, page, order, current_order, start_type); + nr_added = expand(zone, page, order, current_order, start_type); + account_freepages(zone, nr_added, start_type); return page; } @@ -2134,8 +2150,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page, } single_page: - del_page_from_free_list(page, zone, current_order, block_type); - expand(zone, page, order, current_order, block_type); + page_del_and_expand(zone, page, order, current_order, block_type); return page; } From 4e131ac87c4127c9b02838e58e2b5a87559c81d1 Mon Sep 17 00:00:00 2001 From: gaoxiang17 Date: Fri, 20 Sep 2024 20:20:30 +0800 Subject: [PATCH 08/49] UPSTREAM: mm/page_alloc: add some detailed comments in can_steal_fallback mm/page_alloc: add some detailed comments in can_steal_fallback [akpm@linux-foundation.org: tweak grammar, fit to 80 cols] Link: https://lkml.kernel.org/r/20240920122030.159751-1-gxxa03070307@gmail.com Signed-off-by: gaoxiang17 Signed-off-by: Andrew Morton Bug: 420836317 (cherry picked from commit 6025ea5abbe5d813d6a41c78e6ea14259fb503f4) Change-Id: Ib4a77bf96edeba6ce2c6627c99aacaf148b07d92 Signed-off-by: yipeng xiang --- mm/page_alloc.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1c1098a37e48..f887c9bc0152 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2018,6 +2018,14 @@ static bool can_steal_fallback(unsigned int order, int start_mt) if (order >= pageblock_order) return true; + /* + * Movable pages won't cause permanent fragmentation, so when you alloc + * small pages, you just need to temporarily steal unmovable or + * reclaimable pages that are closest to the request size. After a + * while, memory compaction may occur to form large contiguous pages, + * and the next movable allocation may not need to steal. Unmovable and + * reclaimable allocations need to actually steal pages. + */ if (order >= pageblock_order / 2 || start_mt == MIGRATE_RECLAIMABLE || start_mt == MIGRATE_UNMOVABLE || From 65b7c505d9e1780c7110cfaf9f26a1513c845fc0 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Mon, 24 Feb 2025 19:08:24 -0500 Subject: [PATCH 09/49] BACKPORT: mm: page_alloc: don't steal single pages from biggest buddy MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The fallback code searches for the biggest buddy first in an attempt to steal the whole block and encourage type grouping down the line. The approach used to be this: - Non-movable requests will split the largest buddy and steal the remainder. This splits up contiguity, but it allows subsequent requests of this type to fall back into adjacent space. - Movable requests go and look for the smallest buddy instead. The thinking is that movable requests can be compacted, so grouping is less important than retaining contiguity. c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion") enforces freelist type hygiene, which restricts stealing to either claiming the whole block or just taking the requested chunk; no additional pages or buddy remainders can be stolen any more. The patch mishandled when to switch to finding the smallest buddy in that new reality. As a result, it may steal the exact request size, but from the biggest buddy. This causes fracturing for no good reason. Fix this by committing to the new behavior: either steal the whole block, or fall back to the smallest buddy. 
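In outline, the resulting policy looks like the standalone toy model below. It is illustrative only, not the kernel's fallback code: the freelists are reduced to counters and the claim heuristics are collapsed into a single flag, leaving just the two-phase search order described above:

#include <stdio.h>

#define NR_ORDERS 11
#define NR_TYPES  3   /* toy migratetypes: 0 movable, 1 unmovable, 2 reclaimable */

/* free_blocks[order][type]: free buddies of that order on each toy freelist. */
static int free_blocks[NR_ORDERS][NR_TYPES];

/*
 * Phase 1: top-down, look for the biggest fallback buddy to claim whole.
 * "claim_allowed" stands in for the heuristics that decide whether a
 * whole-block claim is worthwhile for this request.
 */
static int find_block_to_claim(int order, int start_type, int claim_allowed)
{
    if (!claim_allowed)
        return -1;
    for (int o = NR_ORDERS - 1; o >= order; o--)
        for (int t = 0; t < NR_TYPES; t++)
            if (t != start_type && free_blocks[o][t])
                return o;
    return -1;
}

/* Phase 2: bottom-up, steal the smallest fallback buddy that still fits. */
static int find_smallest_fallback(int order, int start_type)
{
    for (int o = order; o < NR_ORDERS; o++)
        for (int t = 0; t < NR_TYPES; t++)
            if (t != start_type && free_blocks[o][t])
                return o;
    return -1;
}

int main(void)
{
    int order = 0, start_type = 0;

    free_blocks[9][1] = 1;  /* a large unmovable buddy */
    free_blocks[2][2] = 1;  /* a small reclaimable buddy */

    for (int claim_allowed = 1; claim_allowed >= 0; claim_allowed--) {
        int o = find_block_to_claim(order, start_type, claim_allowed);

        if (o >= 0)
            printf("claim whole block from order %d\n", o);
        else if ((o = find_smallest_fallback(order, start_type)) >= 0)
            printf("steal smallest buddy from order %d\n", o);
    }
    return 0;
}
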
Remove single-page stealing from steal_suitable_fallback(). Rename it to try_to_steal_block() to make the intentions clear. If this fails, always fall back to the smallest buddy. The following is from 4 runs of mmtest's thpchallenge. "Pollute" is single page fallback, "steal" is conversion of a partially used block. The numbers for free block conversions (omitted) are comparable. vanilla patched @pollute[unmovable from reclaimable]: 27 106 @pollute[unmovable from movable]: 82 46 @pollute[reclaimable from unmovable]: 256 83 @pollute[reclaimable from movable]: 46 8 @pollute[movable from unmovable]: 4841 868 @pollute[movable from reclaimable]: 5278 12568 @steal[unmovable from reclaimable]: 11 12 @steal[unmovable from movable]: 113 49 @steal[reclaimable from unmovable]: 19 34 @steal[reclaimable from movable]: 47 21 @steal[movable from unmovable]: 250 183 @steal[movable from reclaimable]: 81 93 The allocator appears to do a better job at keeping stealing and polluting to the first fallback preference. As a result, the numbers for "from movable" - the least preferred fallback option, and most detrimental to compactability - are down across the board. Link: https://lkml.kernel.org/r/20250225001023.1494422-2-hannes@cmpxchg.org Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion") Signed-off-by: Johannes Weiner Suggested-by: Vlastimil Babka Reviewed-by: Brendan Jackman Reviewed-by: Vlastimil Babka Signed-off-by: Andrew Morton Bug: 420836317 (cherry picked from commit c2f6ea38fc1b640aa7a2e155cc1c0410ff91afa2) [ MAX_PAGE_ORDER is not defined in linux-6.6, so it is replaced with MAX_ORDER. The original patch: - VM_BUG_ON(current_order > MAX_PAGE_ORDER); linux-6.6 patch: - VM_BUG_ON(current_order > MAX_ORDER); ] Change-Id: I44a62580f1fcb53a2baff6ce3a8af08e9a20fdc0 Signed-off-by: yipeng xiang --- mm/page_alloc.c | 80 +++++++++++++++++++++---------------------------- 1 file changed, 34 insertions(+), 46 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f887c9bc0152..1eddb7b66336 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2077,13 +2077,12 @@ static inline bool boost_watermark(struct zone *zone) * can claim the whole pageblock for the requested migratetype. If not, we check * the pageblock for constituent pages; if at least half of the pages are free * or compatible, we can still claim the whole block, so pages freed in the - * future will be put on the correct free list. Otherwise, we isolate exactly - * the order we need from the fallback block and leave its migratetype alone. + * future will be put on the correct free list. */ static struct page * -steal_suitable_fallback(struct zone *zone, struct page *page, - int current_order, int order, int start_type, - unsigned int alloc_flags, bool whole_block) +try_to_steal_block(struct zone *zone, struct page *page, + int current_order, int order, int start_type, + unsigned int alloc_flags) { int free_pages, movable_pages, alike_pages; unsigned long start_pfn; @@ -2096,7 +2095,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page, * highatomic accounting. 
*/ if (is_migrate_highatomic(block_type)) - goto single_page; + return NULL; /* Take ownership for orders >= pageblock_order */ if (current_order >= pageblock_order) { @@ -2117,14 +2116,10 @@ steal_suitable_fallback(struct zone *zone, struct page *page, if (boost_watermark(zone) && (alloc_flags & ALLOC_KSWAPD)) set_bit(ZONE_BOOSTED_WATERMARK, &zone->flags); - /* We are not allowed to try stealing from the whole block */ - if (!whole_block) - goto single_page; - /* moving whole block can fail due to zone boundary conditions */ if (!prep_move_freepages_block(zone, page, &start_pfn, &free_pages, &movable_pages)) - goto single_page; + return NULL; /* * Determine how many pages are compatible with our allocation. @@ -2157,9 +2152,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page, return __rmqueue_smallest(zone, order, start_type); } -single_page: - page_del_and_expand(zone, page, order, current_order, block_type); - return page; + return NULL; } /* @@ -2351,14 +2344,19 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, } /* - * Try finding a free buddy page on the fallback list and put it on the free - * list of requested migratetype, possibly along with other pages from the same - * block, depending on fragmentation avoidance heuristics. Returns true if - * fallback was found so that __rmqueue_smallest() can grab it. + * Try finding a free buddy page on the fallback list. + * + * This will attempt to steal a whole pageblock for the requested type + * to ensure grouping of such requests in the future. + * + * If a whole block cannot be stolen, regress to __rmqueue_smallest() + * logic to at least break up as little contiguity as possible. * * The use of signed ints for order and current_order is a deliberate * deviation from the rest of this file, to make the for loop * condition simpler. + * + * Return the stolen page, or NULL if none can be found. */ static __always_inline struct page * __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, @@ -2392,45 +2390,35 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, if (fallback_mt == -1) continue; - /* - * We cannot steal all free pages from the pageblock and the - * requested migratetype is movable. In that case it's better to - * steal and split the smallest available page instead of the - * largest available page, because even if the next movable - * allocation falls back into a different pageblock than this - * one, it won't cause permanent fragmentation. - */ - if (!can_steal && start_migratetype == MIGRATE_MOVABLE - && current_order > order) - goto find_smallest; + if (!can_steal) + break; - goto do_steal; + page = get_page_from_free_area(area, fallback_mt); + page = try_to_steal_block(zone, page, current_order, order, + start_migratetype, alloc_flags); + if (page) + goto got_one; } - return NULL; + if (alloc_flags & ALLOC_NOFRAGMENT) + return NULL; -find_smallest: + /* No luck stealing blocks. 
Find the smallest fallback page */ for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) { area = &(zone->free_area[current_order]); fallback_mt = find_suitable_fallback(area, current_order, start_migratetype, false, &can_steal); - if (fallback_mt != -1) - break; + if (fallback_mt == -1) + continue; + + page = get_page_from_free_area(area, fallback_mt); + page_del_and_expand(zone, page, order, current_order, fallback_mt); + goto got_one; } - /* - * This should not happen - we already found a suitable fallback - * when looking for the largest page. - */ - VM_BUG_ON(current_order > MAX_ORDER); - -do_steal: - page = get_page_from_free_area(area, fallback_mt); - - /* take off list, maybe claim block, expand remainder */ - page = steal_suitable_fallback(zone, page, current_order, order, - start_migratetype, alloc_flags, can_steal); + return NULL; +got_one: trace_mm_page_alloc_extfrag(page, order, current_order, start_migratetype, fallback_mt); From 707dfe67d68fdbf6b1c6dbd0e7fc5564fb3c71e6 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Mon, 24 Feb 2025 19:08:25 -0500 Subject: [PATCH 10/49] UPSTREAM: mm: page_alloc: remove remnants of unlocked migratetype updates The freelist hygiene patches made migratetype accesses fully protected under the zone->lock. Remove remnants of handling the race conditions that existed before from the MIGRATE_HIGHATOMIC code. Link: https://lkml.kernel.org/r/20250225001023.1494422-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Reviewed-by: Brendan Jackman Reviewed-by: Vlastimil Babka Signed-off-by: Andrew Morton Bug: 420836317 (cherry picked from commit 020396a581dc69be2d30939fabde6c029d847034) Change-Id: Ia1266c34f09db1c404df7f37c1a9ff06d61c0cce Signed-off-by: yipeng xiang --- mm/page_alloc.c | 50 ++++++++++++++++--------------------------------- 1 file changed, 16 insertions(+), 34 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1eddb7b66336..e5c9acfd999f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2082,20 +2082,10 @@ static inline bool boost_watermark(struct zone *zone) static struct page * try_to_steal_block(struct zone *zone, struct page *page, int current_order, int order, int start_type, - unsigned int alloc_flags) + int block_type, unsigned int alloc_flags) { int free_pages, movable_pages, alike_pages; unsigned long start_pfn; - int block_type; - - block_type = get_pageblock_migratetype(page); - - /* - * This can happen due to races and we want to prevent broken - * highatomic accounting. - */ - if (is_migrate_highatomic(block_type)) - return NULL; /* Take ownership for orders >= pageblock_order */ if (current_order >= pageblock_order) { @@ -2280,33 +2270,22 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, spin_lock_irqsave(&zone->lock, flags); for (order = 0; order < NR_PAGE_ORDERS; order++) { struct free_area *area = &(zone->free_area[order]); - int mt; + unsigned long size; page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC); if (!page) continue; - mt = get_pageblock_migratetype(page); /* - * In page freeing path, migratetype change is racy so - * we can counter several free pages in a pageblock - * in this loop although we changed the pageblock type - * from highatomic to ac->migratetype. So we should - * adjust the count once. + * It should never happen but changes to + * locking could inadvertently allow a per-cpu + * drain to add pages to MIGRATE_HIGHATOMIC + * while unreserving so be safe and watch for + * underflows. 
*/ - if (is_migrate_highatomic(mt)) { - unsigned long size; - /* - * It should never happen but changes to - * locking could inadvertently allow a per-cpu - * drain to add pages to MIGRATE_HIGHATOMIC - * while unreserving so be safe and watch for - * underflows. - */ - size = max(pageblock_nr_pages, 1UL << order); - size = min(size, zone->nr_reserved_highatomic); - zone->nr_reserved_highatomic -= size; - } + size = max(pageblock_nr_pages, 1UL << order); + size = min(size, zone->nr_reserved_highatomic); + zone->nr_reserved_highatomic -= size; /* * Convert to ac->migratetype and avoid the normal @@ -2318,10 +2297,12 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, * may increase. */ if (order < pageblock_order) - ret = move_freepages_block(zone, page, mt, + ret = move_freepages_block(zone, page, + MIGRATE_HIGHATOMIC, ac->migratetype); else { - move_to_free_list(page, zone, order, mt, + move_to_free_list(page, zone, order, + MIGRATE_HIGHATOMIC, ac->migratetype); change_pageblock_range(page, order, ac->migratetype); @@ -2395,7 +2376,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, page = get_page_from_free_area(area, fallback_mt); page = try_to_steal_block(zone, page, current_order, order, - start_migratetype, alloc_flags); + start_migratetype, fallback_mt, + alloc_flags); if (page) goto got_one; } From c746bc1949da759ee8a3362bafcb01872f7f30d5 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Mon, 24 Feb 2025 19:08:26 -0500 Subject: [PATCH 11/49] BACKPORT: mm: page_alloc: group fallback functions together The way the fallback rules are spread out makes them hard to follow. Move the functions next to each other at least. Link: https://lkml.kernel.org/r/20250225001023.1494422-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Reviewed-by: Brendan Jackman Reviewed-by: Vlastimil Babka Signed-off-by: Andrew Morton Bug: 420836317 (cherry picked from commit a4138a2702a4428317ecdb115934554df4b788b4) [ 1. In the original patch of the find_suitable_fallback function, replace MIGRATE_PCPTYPES with MIGRATE_FALLBACKS.; 2. Keep the hook function in the reserve_highatomic_pageblock and unreserve_highatomic_pageblock functions. ] Change-Id: I069e8dd7f8b009c686daef4459f9f1452b3f4c2c Signed-off-by: yipeng xiang --- mm/page_alloc.c | 414 ++++++++++++++++++++++++------------------------ 1 file changed, 207 insertions(+), 207 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e5c9acfd999f..5dd4970d2485 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1994,6 +1994,43 @@ static void change_pageblock_range(struct page *pageblock_page, } } +static inline bool boost_watermark(struct zone *zone) +{ + unsigned long max_boost; + + if (!watermark_boost_factor) + return false; + /* + * Don't bother in zones that are unlikely to produce results. + * On small machines, including kdump capture kernels running + * in a small area, boosting the watermark can cause an out of + * memory situation immediately. + */ + if ((pageblock_nr_pages * 4) > zone_managed_pages(zone)) + return false; + + max_boost = mult_frac(zone->_watermark[WMARK_HIGH], + watermark_boost_factor, 10000); + + /* + * high watermark may be uninitialised if fragmentation occurs + * very early in boot so do not boost. We do not fall + * through and boost by pageblock_nr_pages as failing + * allocations that early means that reclaim is not going + * to help and it may even be impossible to reclaim the + * boosted watermark resulting in a hang. 
+ */ + if (!max_boost) + return false; + + max_boost = max(pageblock_nr_pages, max_boost); + + zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages, + max_boost); + + return true; +} + /* * When we are falling back to another migratetype during allocation, try to * steal extra free pages from the same pageblocks to satisfy further @@ -2035,41 +2072,38 @@ static bool can_steal_fallback(unsigned int order, int start_mt) return false; } -static inline bool boost_watermark(struct zone *zone) +/* + * Check whether there is a suitable fallback freepage with requested order. + * If only_stealable is true, this function returns fallback_mt only if + * we can steal other freepages all together. This would help to reduce + * fragmentation due to mixed migratetype pages in one pageblock. + */ +int find_suitable_fallback(struct free_area *area, unsigned int order, + int migratetype, bool only_stealable, bool *can_steal) { - unsigned long max_boost; + int i; + int fallback_mt; - if (!watermark_boost_factor) - return false; - /* - * Don't bother in zones that are unlikely to produce results. - * On small machines, including kdump capture kernels running - * in a small area, boosting the watermark can cause an out of - * memory situation immediately. - */ - if ((pageblock_nr_pages * 4) > zone_managed_pages(zone)) - return false; + if (area->nr_free == 0) + return -1; - max_boost = mult_frac(zone->_watermark[WMARK_HIGH], - watermark_boost_factor, 10000); + *can_steal = false; + for (i = 0; i < MIGRATE_FALLBACKS - 1 ; i++) { + fallback_mt = fallbacks[migratetype][i]; + if (free_area_empty(area, fallback_mt)) + continue; - /* - * high watermark may be uninitialised if fragmentation occurs - * very early in boot so do not boost. We do not fall - * through and boost by pageblock_nr_pages as failing - * allocations that early means that reclaim is not going - * to help and it may even be impossible to reclaim the - * boosted watermark resulting in a hang. - */ - if (!max_boost) - return false; + if (can_steal_fallback(order, migratetype)) + *can_steal = true; - max_boost = max(pageblock_nr_pages, max_boost); + if (!only_stealable) + return fallback_mt; - zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages, - max_boost); + if (*can_steal) + return fallback_mt; + } - return true; + return -1; } /* @@ -2145,185 +2179,6 @@ try_to_steal_block(struct zone *zone, struct page *page, return NULL; } -/* - * Check whether there is a suitable fallback freepage with requested order. - * If only_stealable is true, this function returns fallback_mt only if - * we can steal other freepages all together. This would help to reduce - * fragmentation due to mixed migratetype pages in one pageblock. 
- */ -int find_suitable_fallback(struct free_area *area, unsigned int order, - int migratetype, bool only_stealable, bool *can_steal) -{ - int i; - int fallback_mt; - - if (area->nr_free == 0) - return -1; - - *can_steal = false; - for (i = 0; i < MIGRATE_FALLBACKS - 1 ; i++) { - fallback_mt = fallbacks[migratetype][i]; - if (free_area_empty(area, fallback_mt)) - continue; - - if (can_steal_fallback(order, migratetype)) - *can_steal = true; - - if (!only_stealable) - return fallback_mt; - - if (*can_steal) - return fallback_mt; - } - - return -1; -} - -/* - * Reserve the pageblock(s) surrounding an allocation request for - * exclusive use of high-order atomic allocations if there are no - * empty page blocks that contain a page with a suitable order - */ -static void reserve_highatomic_pageblock(struct page *page, int order, - struct zone *zone) -{ - int mt; - unsigned long max_managed, flags; - bool bypass = false; - - /* - * The number reserved as: minimum is 1 pageblock, maximum is - * roughly 1% of a zone. But if 1% of a zone falls below a - * pageblock size, then don't reserve any pageblocks. - * Check is race-prone but harmless. - */ - if ((zone_managed_pages(zone) / 100) < pageblock_nr_pages) - return; - max_managed = ALIGN((zone_managed_pages(zone) / 100), pageblock_nr_pages); - if (zone->nr_reserved_highatomic >= max_managed) - return; - trace_android_vh_reserve_highatomic_bypass(page, &bypass); - if (bypass) - return; - - spin_lock_irqsave(&zone->lock, flags); - - /* Recheck the nr_reserved_highatomic limit under the lock */ - if (zone->nr_reserved_highatomic >= max_managed) - goto out_unlock; - - /* Yoink! */ - mt = get_pageblock_migratetype(page); - /* Only reserve normal pageblocks (i.e., they can merge with others) */ - if (!migratetype_is_mergeable(mt)) - goto out_unlock; - - if (order < pageblock_order) { - if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1) - goto out_unlock; - zone->nr_reserved_highatomic += pageblock_nr_pages; - } else { - change_pageblock_range(page, order, MIGRATE_HIGHATOMIC); - zone->nr_reserved_highatomic += 1 << order; - } - -out_unlock: - spin_unlock_irqrestore(&zone->lock, flags); -} - -/* - * Used when an allocation is about to fail under memory pressure. This - * potentially hurts the reliability of high-order allocations when under - * intense memory pressure but failed atomic allocations should be easier - * to recover from than an OOM. - * - * If @force is true, try to unreserve pageblocks even though highatomic - * pageblock is exhausted. - */ -static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, - bool force) -{ - struct zonelist *zonelist = ac->zonelist; - unsigned long flags; - struct zoneref *z; - struct zone *zone; - struct page *page; - int order; - bool skip_unreserve_highatomic = false; - int ret; - - for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->highest_zoneidx, - ac->nodemask) { - /* - * Preserve at least one pageblock unless memory pressure - * is really high. 
- */ - if (!force && zone->nr_reserved_highatomic <= - pageblock_nr_pages) - continue; - - trace_android_vh_unreserve_highatomic_bypass(force, zone, - &skip_unreserve_highatomic); - if (skip_unreserve_highatomic) - continue; - - spin_lock_irqsave(&zone->lock, flags); - for (order = 0; order < NR_PAGE_ORDERS; order++) { - struct free_area *area = &(zone->free_area[order]); - unsigned long size; - - page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC); - if (!page) - continue; - - /* - * It should never happen but changes to - * locking could inadvertently allow a per-cpu - * drain to add pages to MIGRATE_HIGHATOMIC - * while unreserving so be safe and watch for - * underflows. - */ - size = max(pageblock_nr_pages, 1UL << order); - size = min(size, zone->nr_reserved_highatomic); - zone->nr_reserved_highatomic -= size; - - /* - * Convert to ac->migratetype and avoid the normal - * pageblock stealing heuristics. Minimally, the caller - * is doing the work and needs the pages. More - * importantly, if the block was always converted to - * MIGRATE_UNMOVABLE or another type then the number - * of pageblocks that cannot be completely freed - * may increase. - */ - if (order < pageblock_order) - ret = move_freepages_block(zone, page, - MIGRATE_HIGHATOMIC, - ac->migratetype); - else { - move_to_free_list(page, zone, order, - MIGRATE_HIGHATOMIC, - ac->migratetype); - change_pageblock_range(page, order, - ac->migratetype); - ret = 1; - } - /* - * Reserving the block(s) already succeeded, - * so this should not fail on zone boundaries. - */ - WARN_ON_ONCE(ret == -1); - if (ret > 0) { - spin_unlock_irqrestore(&zone->lock, flags); - return ret; - } - } - spin_unlock_irqrestore(&zone->lock, flags); - } - - return false; -} - /* * Try finding a free buddy page on the fallback list. * @@ -3215,6 +3070,151 @@ noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) } ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE); +/* + * Reserve the pageblock(s) surrounding an allocation request for + * exclusive use of high-order atomic allocations if there are no + * empty page blocks that contain a page with a suitable order + */ +static void reserve_highatomic_pageblock(struct page *page, int order, + struct zone *zone) +{ + int mt; + unsigned long max_managed, flags; + bool bypass = false; + + /* + * The number reserved as: minimum is 1 pageblock, maximum is + * roughly 1% of a zone. But if 1% of a zone falls below a + * pageblock size, then don't reserve any pageblocks. + * Check is race-prone but harmless. + */ + if ((zone_managed_pages(zone) / 100) < pageblock_nr_pages) + return; + max_managed = ALIGN((zone_managed_pages(zone) / 100), pageblock_nr_pages); + if (zone->nr_reserved_highatomic >= max_managed) + return; + trace_android_vh_reserve_highatomic_bypass(page, &bypass); + if (bypass) + return; + + spin_lock_irqsave(&zone->lock, flags); + + /* Recheck the nr_reserved_highatomic limit under the lock */ + if (zone->nr_reserved_highatomic >= max_managed) + goto out_unlock; + + /* Yoink! 
*/ + mt = get_pageblock_migratetype(page); + /* Only reserve normal pageblocks (i.e., they can merge with others) */ + if (!migratetype_is_mergeable(mt)) + goto out_unlock; + + if (order < pageblock_order) { + if (move_freepages_block(zone, page, mt, MIGRATE_HIGHATOMIC) == -1) + goto out_unlock; + zone->nr_reserved_highatomic += pageblock_nr_pages; + } else { + change_pageblock_range(page, order, MIGRATE_HIGHATOMIC); + zone->nr_reserved_highatomic += 1 << order; + } + +out_unlock: + spin_unlock_irqrestore(&zone->lock, flags); +} + +/* + * Used when an allocation is about to fail under memory pressure. This + * potentially hurts the reliability of high-order allocations when under + * intense memory pressure but failed atomic allocations should be easier + * to recover from than an OOM. + * + * If @force is true, try to unreserve pageblocks even though highatomic + * pageblock is exhausted. + */ +static bool unreserve_highatomic_pageblock(const struct alloc_context *ac, + bool force) +{ + struct zonelist *zonelist = ac->zonelist; + unsigned long flags; + struct zoneref *z; + struct zone *zone; + struct page *page; + int order; + bool skip_unreserve_highatomic = false; + int ret; + + for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->highest_zoneidx, + ac->nodemask) { + /* + * Preserve at least one pageblock unless memory pressure + * is really high. + */ + if (!force && zone->nr_reserved_highatomic <= + pageblock_nr_pages) + continue; + + trace_android_vh_unreserve_highatomic_bypass(force, zone, + &skip_unreserve_highatomic); + if (skip_unreserve_highatomic) + continue; + + spin_lock_irqsave(&zone->lock, flags); + for (order = 0; order < NR_PAGE_ORDERS; order++) { + struct free_area *area = &(zone->free_area[order]); + unsigned long size; + + page = get_page_from_free_area(area, MIGRATE_HIGHATOMIC); + if (!page) + continue; + + /* + * It should never happen but changes to + * locking could inadvertently allow a per-cpu + * drain to add pages to MIGRATE_HIGHATOMIC + * while unreserving so be safe and watch for + * underflows. + */ + size = max(pageblock_nr_pages, 1UL << order); + size = min(size, zone->nr_reserved_highatomic); + zone->nr_reserved_highatomic -= size; + + /* + * Convert to ac->migratetype and avoid the normal + * pageblock stealing heuristics. Minimally, the caller + * is doing the work and needs the pages. More + * importantly, if the block was always converted to + * MIGRATE_UNMOVABLE or another type then the number + * of pageblocks that cannot be completely freed + * may increase. + */ + if (order < pageblock_order) + ret = move_freepages_block(zone, page, + MIGRATE_HIGHATOMIC, + ac->migratetype); + else { + move_to_free_list(page, zone, order, + MIGRATE_HIGHATOMIC, + ac->migratetype); + change_pageblock_range(page, order, + ac->migratetype); + ret = 1; + } + /* + * Reserving the block(s) already succeeded, + * so this should not fail on zone boundaries. + */ + WARN_ON_ONCE(ret == -1); + if (ret > 0) { + spin_unlock_irqrestore(&zone->lock, flags); + return ret; + } + } + spin_unlock_irqrestore(&zone->lock, flags); + } + + return false; +} + static inline long __zone_watermark_unusable_free(struct zone *z, unsigned int order, unsigned int alloc_flags) { From 59eb95395c83d3e304302b95cff6d804aaf62b86 Mon Sep 17 00:00:00 2001 From: Brendan Jackman Date: Fri, 28 Feb 2025 09:52:17 +0000 Subject: [PATCH 12/49] BACKPORT: mm/page_alloc: clarify terminology in migratetype fallback code Patch series "mm/page_alloc: Some clarifications for migratetype fallback", v4. 
A couple of patches to try and make the code easier to follow. This patch (of 2): This code is rather confusing because: 1. "Steal" is sometimes used to refer to the general concept of allocating from a from a block of a fallback migratetype (steal_suitable_fallback()) but sometimes it refers specifically to converting a whole block's migratetype (can_steal_fallback()). 2. can_steal_fallback() sounds as though it's answering the question "am I functionally permitted to allocate from that other type" but in fact it is encoding a heuristic preference. 3. The same piece of data has different names in different places: can_steal vs whole_block. This reinforces point 2 because it looks like the different names reflect a shift in intent from "am I allowed to steal" to "do I want to steal", but no such shift exists. Fix 1. by avoiding the term "steal" in ambiguous contexts. Start using the term "claim" to refer to the special case of stealing the entire block. Fix 2. by using "should" instead of "can", and also rename its parameters and add some commentary to make it more explicit what they mean. Fix 3. by adopting the new "claim" terminology universally for this set of variables. Link: https://lkml.kernel.org/r/20250228-clarify-steal-v4-0-cb2ef1a4e610@google.com Link: https://lkml.kernel.org/r/20250228-clarify-steal-v4-1-cb2ef1a4e610@google.com Signed-off-by: Brendan Jackman Reviewed-by: Vlastimil Babka Cc: Johannes Weiner Cc: Mel Gorman Cc: Michal Hocko Cc: Yosry Ahmed Signed-off-by: Andrew Morton Bug: 420836317 (cherry picked from commit e47f1f56dd82cc6d91f5c4d914a534aa03cd12ca) [In the original patch of the find_suitable_fallback function, replace MIGRATE_PCPTYPES with MIGRATE_FALLBACKS.;] Change-Id: I8f1b57aebf308f378f50cd1381f31d249362078e Signed-off-by: yipeng xiang --- mm/compaction.c | 4 +-- mm/internal.h | 2 +- mm/page_alloc.c | 72 ++++++++++++++++++++++++------------------------- 3 files changed, 39 insertions(+), 39 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 75ee7750ce2a..7e0f264e46d8 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -2279,7 +2279,7 @@ static enum compact_result __compact_finished(struct compact_control *cc) ret = COMPACT_NO_SUITABLE_PAGE; for (order = cc->order; order < NR_PAGE_ORDERS; order++) { struct free_area *area = &cc->zone->free_area[order]; - bool can_steal; + bool claim_block; /* Job done if page is free of the right migratetype */ if (!free_area_empty(area, migratetype)) @@ -2296,7 +2296,7 @@ static enum compact_result __compact_finished(struct compact_control *cc) * other migratetype buddy lists. */ if (find_suitable_fallback(area, order, migratetype, - true, &can_steal) != -1) + true, &claim_block) != -1) /* * Movable pages are OK in any pageblock. 
If we are * stealing for a non-movable allocation, make sure diff --git a/mm/internal.h b/mm/internal.h index da8bd4bfbb3e..313f6e6ea62e 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -815,7 +815,7 @@ void init_cma_reserved_pageblock(struct page *page); #endif /* CONFIG_COMPACTION || CONFIG_CMA */ int find_suitable_fallback(struct free_area *area, unsigned int order, - int migratetype, bool only_stealable, bool *can_steal); + int migratetype, bool claim_only, bool *claim_block); static inline bool free_area_empty(struct free_area *area, int migratetype) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5dd4970d2485..53fd6c8d6611 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2033,22 +2033,22 @@ static inline bool boost_watermark(struct zone *zone) /* * When we are falling back to another migratetype during allocation, try to - * steal extra free pages from the same pageblocks to satisfy further - * allocations, instead of polluting multiple pageblocks. + * claim entire blocks to satisfy further allocations, instead of polluting + * multiple pageblocks. * - * If we are stealing a relatively large buddy page, it is likely there will - * be more free pages in the pageblock, so try to steal them all. For - * reclaimable and unmovable allocations, we steal regardless of page size, - * as fragmentation caused by those allocations polluting movable pageblocks - * is worse than movable allocations stealing from unmovable and reclaimable - * pageblocks. + * If we are stealing a relatively large buddy page, it is likely there will be + * more free pages in the pageblock, so try to claim the whole block. For + * reclaimable and unmovable allocations, we try to claim the whole block + * regardless of page size, as fragmentation caused by those allocations + * polluting movable pageblocks is worse than movable allocations stealing from + * unmovable and reclaimable pageblocks. */ -static bool can_steal_fallback(unsigned int order, int start_mt) +static bool should_try_claim_block(unsigned int order, int start_mt) { /* * Leaving this order check is intended, although there is * relaxed order check in next check. The reason is that - * we can actually steal whole pageblock if this condition met, + * we can actually claim the whole pageblock if this condition met, * but, below check doesn't guarantee it and that is just heuristic * so could be changed anytime. */ @@ -2061,7 +2061,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt) * reclaimable pages that are closest to the request size. After a * while, memory compaction may occur to form large contiguous pages, * and the next movable allocation may not need to steal. Unmovable and - * reclaimable allocations need to actually steal pages. + * reclaimable allocations need to actually claim the whole block. */ if (order >= pageblock_order / 2 || start_mt == MIGRATE_RECLAIMABLE || @@ -2074,12 +2074,14 @@ static bool can_steal_fallback(unsigned int order, int start_mt) /* * Check whether there is a suitable fallback freepage with requested order. - * If only_stealable is true, this function returns fallback_mt only if - * we can steal other freepages all together. This would help to reduce + * Sets *claim_block to instruct the caller whether it should convert a whole + * pageblock to the returned migratetype. + * If only_claim is true, this function returns fallback_mt only if + * we would do this whole-block claiming. This would help to reduce * fragmentation due to mixed migratetype pages in one pageblock. 
*/ int find_suitable_fallback(struct free_area *area, unsigned int order, - int migratetype, bool only_stealable, bool *can_steal) + int migratetype, bool only_claim, bool *claim_block) { int i; int fallback_mt; @@ -2087,19 +2089,16 @@ int find_suitable_fallback(struct free_area *area, unsigned int order, if (area->nr_free == 0) return -1; - *can_steal = false; + *claim_block = false; for (i = 0; i < MIGRATE_FALLBACKS - 1 ; i++) { fallback_mt = fallbacks[migratetype][i]; if (free_area_empty(area, fallback_mt)) continue; - if (can_steal_fallback(order, migratetype)) - *can_steal = true; + if (should_try_claim_block(order, migratetype)) + *claim_block = true; - if (!only_stealable) - return fallback_mt; - - if (*can_steal) + if (*claim_block || !only_claim) return fallback_mt; } @@ -2107,14 +2106,14 @@ int find_suitable_fallback(struct free_area *area, unsigned int order, } /* - * This function implements actual steal behaviour. If order is large enough, we - * can claim the whole pageblock for the requested migratetype. If not, we check - * the pageblock for constituent pages; if at least half of the pages are free - * or compatible, we can still claim the whole block, so pages freed in the - * future will be put on the correct free list. + * This function implements actual block claiming behaviour. If order is large + * enough, we can claim the whole pageblock for the requested migratetype. If + * not, we check the pageblock for constituent pages; if at least half of the + * pages are free or compatible, we can still claim the whole block, so pages + * freed in the future will be put on the correct free list. */ static struct page * -try_to_steal_block(struct zone *zone, struct page *page, +try_to_claim_block(struct zone *zone, struct page *page, int current_order, int order, int start_type, int block_type, unsigned int alloc_flags) { @@ -2182,11 +2181,12 @@ try_to_steal_block(struct zone *zone, struct page *page, /* * Try finding a free buddy page on the fallback list. * - * This will attempt to steal a whole pageblock for the requested type + * This will attempt to claim a whole pageblock for the requested type * to ensure grouping of such requests in the future. * - * If a whole block cannot be stolen, regress to __rmqueue_smallest() - * logic to at least break up as little contiguity as possible. + * If a whole block cannot be claimed, steal an individual page, regressing to + * __rmqueue_smallest() logic to at least break up as little contiguity as + * possible. 
* * The use of signed ints for order and current_order is a deliberate * deviation from the rest of this file, to make the for loop @@ -2203,7 +2203,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, int min_order = order; struct page *page; int fallback_mt; - bool can_steal; + bool claim_block; /* * Do not steal pages from freelists belonging to other pageblocks @@ -2222,15 +2222,15 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, --current_order) { area = &(zone->free_area[current_order]); fallback_mt = find_suitable_fallback(area, current_order, - start_migratetype, false, &can_steal); + start_migratetype, false, &claim_block); if (fallback_mt == -1) continue; - if (!can_steal) + if (!claim_block) break; page = get_page_from_free_area(area, fallback_mt); - page = try_to_steal_block(zone, page, current_order, order, + page = try_to_claim_block(zone, page, current_order, order, start_migratetype, fallback_mt, alloc_flags); if (page) @@ -2240,11 +2240,11 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, if (alloc_flags & ALLOC_NOFRAGMENT) return NULL; - /* No luck stealing blocks. Find the smallest fallback page */ + /* No luck claiming pageblock. Find the smallest fallback page */ for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) { area = &(zone->free_area[current_order]); fallback_mt = find_suitable_fallback(area, current_order, - start_migratetype, false, &can_steal); + start_migratetype, false, &claim_block); if (fallback_mt == -1) continue; From ae27d6c79c4ece8ff2c103cd2548c6263c4b25da Mon Sep 17 00:00:00 2001 From: Brendan Jackman Date: Fri, 28 Feb 2025 09:52:18 +0000 Subject: [PATCH 13/49] UPSTREAM: mm/page_alloc: clarify should_claim_block() commentary There's lots of text here but it's a little hard to follow, this is an attempt to break it up and align its structure more closely with the code. Reword the top-level function comment to just explain what question the function answers from the point of view of the caller. Break up the internal logic into different sections that can have their own commentary describing why that part of the rationale is present. Note the page_group_by_mobility_disabled logic is not explained in the commentary, that is outside the scope of this patch... Link: https://lkml.kernel.org/r/20250228-clarify-steal-v4-2-cb2ef1a4e610@google.com Signed-off-by: Brendan Jackman Reviewed-by: Vlastimil Babka Cc: Johannes Weiner Cc: Mel Gorman Cc: Michal Hocko Cc: Yosry Ahmed Signed-off-by: Andrew Morton Bug: 420836317 (cherry picked from commit a14efee04796dd3f614eaf5348ca1ac099c21349) Change-Id: I6c7f908a4e9f025726dadab210c2d59004fe1946 Signed-off-by: yipeng xiang --- mm/page_alloc.c | 46 ++++++++++++++++++++++++++-------------------- 1 file changed, 26 insertions(+), 20 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 53fd6c8d6611..af4ca23861e8 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2032,16 +2032,9 @@ static inline bool boost_watermark(struct zone *zone) } /* - * When we are falling back to another migratetype during allocation, try to - * claim entire blocks to satisfy further allocations, instead of polluting - * multiple pageblocks. - * - * If we are stealing a relatively large buddy page, it is likely there will be - * more free pages in the pageblock, so try to claim the whole block. 
For - * reclaimable and unmovable allocations, we try to claim the whole block - * regardless of page size, as fragmentation caused by those allocations - * polluting movable pageblocks is worse than movable allocations stealing from - * unmovable and reclaimable pageblocks. + * When we are falling back to another migratetype during allocation, should we + * try to claim an entire block to satisfy further allocations, instead of + * polluting multiple pageblocks? */ static bool should_try_claim_block(unsigned int order, int start_mt) { @@ -2056,19 +2049,32 @@ static bool should_try_claim_block(unsigned int order, int start_mt) return true; /* - * Movable pages won't cause permanent fragmentation, so when you alloc - * small pages, you just need to temporarily steal unmovable or - * reclaimable pages that are closest to the request size. After a - * while, memory compaction may occur to form large contiguous pages, - * and the next movable allocation may not need to steal. Unmovable and - * reclaimable allocations need to actually claim the whole block. + * Above a certain threshold, always try to claim, as it's likely there + * will be more free pages in the pageblock. */ - if (order >= pageblock_order / 2 || - start_mt == MIGRATE_RECLAIMABLE || - start_mt == MIGRATE_UNMOVABLE || - page_group_by_mobility_disabled) + if (order >= pageblock_order / 2) return true; + /* + * Unmovable/reclaimable allocations would cause permanent + * fragmentations if they fell back to allocating from a movable block + * (polluting it), so we try to claim the whole block regardless of the + * allocation size. Later movable allocations can always steal from this + * block, which is less problematic. + */ + if (start_mt == MIGRATE_RECLAIMABLE || start_mt == MIGRATE_UNMOVABLE) + return true; + + if (page_group_by_mobility_disabled) + return true; + + /* + * Movable pages won't cause permanent fragmentation, so when you alloc + * small pages, we just need to temporarily steal unmovable or + * reclaimable pages that are closest to the request size. After a + * while, memory compaction may occur to form large contiguous pages, + * and the next movable allocation may not need to steal. + */ return false; } From b5b61c9e5781847fb6311900b54a060c3d7af420 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Mon, 7 Apr 2025 14:01:53 -0400 Subject: [PATCH 14/49] BACKPORT: mm: page_alloc: speed up fallbacks in rmqueue_bulk() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy") as the root cause of a 56.4% regression in vm-scalability::lru-file-mmap-read. Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion"), as the root cause for a regression in worst-case zone->lock+irqoff hold times. Both of these patches modify the page allocator's fallback path to be less greedy in an effort to stave off fragmentation. The flip side of this is that fallbacks are also less productive each time around, which means the fallback search can run much more frequently. Carlos' traces point to rmqueue_bulk() specifically, which tries to refill the percpu cache by allocating a large batch of pages in a loop. It highlights how once the native freelists are exhausted, the fallback code first scans orders top-down for whole blocks to claim, then falls back to a bottom-up search for the smallest buddy to steal. 
For the next batch page, it goes through the same thing again. This can be made more efficient. Since rmqueue_bulk() holds the zone->lock over the entire batch, the freelists are not subject to outside changes; when the search for a block to claim has already failed, there is no point in trying again for the next page. Modify __rmqueue() to remember the last successful fallback mode, and restart directly from there on the next rmqueue_bulk() iteration. Oliver confirms that this improves beyond the regression that the test robot reported against c2f6ea38fc1b: commit: f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap") c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy") acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net") 2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()") <--- your patch f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5 ---------------- --------------------------- --------------------------- --------------------------- %stddev %change %stddev %change %stddev %change %stddev \ | \ | \ | \ 25525364 ± 3% -56.4% 11135467 -57.8% 10779336 +31.6% 33581409 vm-scalability.throughput Carlos confirms that worst-case times are almost fully recovered compared to before the earlier culprit patch: 2dd482ba627d (before freelist hygiene): 1ms c0cd6f557b90 (after freelist hygiene): 90ms next-20250319 (steal smallest buddy): 280ms this patch : 8ms [jackmanb@google.com: comment updates] Link: https://lkml.kernel.org/r/D92AC0P9594X.3BML64MUKTF8Z@google.com [hannes@cmpxchg.org: reset rmqueue_mode in rmqueue_buddy() error loop, per Yunsheng Lin] Link: https://lkml.kernel.org/r/20250409140023.GA2313@cmpxchg.org Link: https://lkml.kernel.org/r/20250407180154.63348-1-hannes@cmpxchg.org Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion") Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy") Signed-off-by: Johannes Weiner Signed-off-by: Brendan Jackman Reported-by: kernel test robot Reported-by: Carlos Song Tested-by: Carlos Song Tested-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com Reviewed-by: Brendan Jackman Tested-by: Shivank Garg Acked-by: Zi Yan Reviewed-by: Vlastimil Babka Cc: [6.10+] Signed-off-by: Andrew Morton Bug: 420836317 (cherry picked from commit 90abee6d7895d5eef18c91d870d8168be4e76e9d) [Resolve conflicts caused by cma_redirect_restricted ] Change-Id: I4bf9e270886716b0a3f11f9edce9a73e855b1fe9 Signed-off-by: yipeng xiang --- mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++--------------- 1 file changed, 81 insertions(+), 35 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index af4ca23861e8..97dc7a5280a2 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2185,23 +2185,15 @@ try_to_claim_block(struct zone *zone, struct page *page, } /* - * Try finding a free buddy page on the fallback list. - * - * This will attempt to claim a whole pageblock for the requested type - * to ensure grouping of such requests in the future. - * - * If a whole block cannot be claimed, steal an individual page, regressing to - * __rmqueue_smallest() logic to at least break up as little contiguity as - * possible. + * Try to allocate from some fallback migratetype by claiming the entire block, + * i.e. converting it to the allocation's start migratetype. 
* * The use of signed ints for order and current_order is a deliberate * deviation from the rest of this file, to make the for loop * condition simpler. - * - * Return the stolen page, or NULL if none can be found. */ static __always_inline struct page * -__rmqueue_fallback(struct zone *zone, int order, int start_migratetype, +__rmqueue_claim(struct zone *zone, int order, int start_migratetype, unsigned int alloc_flags) { struct free_area *area; @@ -2239,14 +2231,29 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, page = try_to_claim_block(zone, page, current_order, order, start_migratetype, fallback_mt, alloc_flags); - if (page) - goto got_one; + if (page) { + trace_mm_page_alloc_extfrag(page, order, current_order, + start_migratetype, fallback_mt); + return page; + } } - if (alloc_flags & ALLOC_NOFRAGMENT) - return NULL; + return NULL; +} + +/* + * Try to steal a single page from some fallback migratetype. Leave the rest of + * the block as its current migratetype, potentially causing fragmentation. + */ +static __always_inline struct page * +__rmqueue_steal(struct zone *zone, int order, int start_migratetype) +{ + struct free_area *area; + int current_order; + struct page *page; + int fallback_mt; + bool claim_block; - /* No luck claiming pageblock. Find the smallest fallback page */ for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) { area = &(zone->free_area[current_order]); fallback_mt = find_suitable_fallback(area, current_order, @@ -2256,25 +2263,28 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, page = get_page_from_free_area(area, fallback_mt); page_del_and_expand(zone, page, order, current_order, fallback_mt); - goto got_one; + trace_mm_page_alloc_extfrag(page, order, current_order, + start_migratetype, fallback_mt); + return page; } return NULL; - -got_one: - trace_mm_page_alloc_extfrag(page, order, current_order, - start_migratetype, fallback_mt); - - return page; } +enum rmqueue_mode { + RMQUEUE_NORMAL, + RMQUEUE_CMA, + RMQUEUE_CLAIM, + RMQUEUE_STEAL, +}; + /* * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held. */ static __always_inline struct page * __rmqueue(struct zone *zone, unsigned int order, int migratetype, - unsigned int alloc_flags) + unsigned int alloc_flags, enum rmqueue_mode *mode) { struct page *page = NULL; @@ -2297,16 +2307,48 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, } } - page = __rmqueue_smallest(zone, order, migratetype); - if (unlikely(!page)) { - if (!cma_redirect_restricted() && alloc_flags & ALLOC_CMA) + /* + * First try the freelists of the requested migratetype, then try + * fallbacks modes with increasing levels of fragmentation risk. + * + * The fallback logic is expensive and rmqueue_bulk() calls in + * a loop with the zone->lock held, meaning the freelists are + * not subject to any outside changes. Remember in *mode where + * we found pay dirt, to save us the search on the next call. 
+ */ + switch (*mode) { + case RMQUEUE_NORMAL: + page = __rmqueue_smallest(zone, order, migratetype); + if (page) + return page; + fallthrough; + case RMQUEUE_CMA: + if (!cma_redirect_restricted() && alloc_flags & ALLOC_CMA) { page = __rmqueue_cma_fallback(zone, order); - - if (!page) - page = __rmqueue_fallback(zone, order, migratetype, - alloc_flags); + if (page) { + *mode = RMQUEUE_CMA; + return page; + } + } + fallthrough; + case RMQUEUE_CLAIM: + page = __rmqueue_claim(zone, order, migratetype, alloc_flags); + if (page) { + /* Replenished preferred freelist, back to normal mode. */ + *mode = RMQUEUE_NORMAL; + return page; + } + fallthrough; + case RMQUEUE_STEAL: + if (!(alloc_flags & ALLOC_NOFRAGMENT)) { + page = __rmqueue_steal(zone, order, migratetype); + if (page) { + *mode = RMQUEUE_STEAL; + return page; + } + } } - return page; + return NULL; } /* @@ -2318,6 +2360,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, unsigned long count, struct list_head *list, int migratetype, unsigned int alloc_flags) { + enum rmqueue_mode rmqm = RMQUEUE_NORMAL; unsigned long flags; int i; @@ -2333,7 +2376,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, if (cma_redirect_restricted() && is_migrate_cma(migratetype)) page = __rmqueue_cma_fallback(zone, order); else - page = __rmqueue(zone, order, migratetype, alloc_flags); + page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm); if (unlikely(page == NULL)) break; @@ -2889,9 +2932,12 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, alloc_flags & ALLOC_CMA) page = __rmqueue_cma_fallback(zone, order); - if (!page) + if (!page) { + enum rmqueue_mode rmqm = RMQUEUE_NORMAL; + page = __rmqueue(zone, order, migratetype, - alloc_flags); + alloc_flags, &rmqm); + } /* * If the allocation fails, allow OOM handling and * order-0 (atomic) allocs access to HIGHATOMIC From 2bc327484ee44b9e74be35b014e1e8a09d470bbd Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Mon, 7 Apr 2025 14:01:54 -0400 Subject: [PATCH 15/49] BACKPORT: mm: page_alloc: tighten up find_suitable_fallback() find_suitable_fallback() is not as efficient as it could be, and somewhat difficult to follow. 1. should_try_claim_block() is a loop invariant. There is no point in checking fallback areas if the caller is interested in claimable blocks but the order and the migratetype don't allow for that. 2. __rmqueue_steal() doesn't care about claimability, so it shouldn't have to run those tests. Different callers want different things from this helper: 1. __compact_finished() scans orders up until it finds a claimable block 2. __rmqueue_claim() scans orders down as long as blocks are claimable 3. __rmqueue_steal() doesn't care about claimability at all Move should_try_claim_block() out of the loop. Only test it for the two callers who care in the first place. Distinguish "no blocks" from "order + mt are not claimable" in the return value; __rmqueue_claim() can stop once order becomes unclaimable, __compact_finished() can keep advancing until order becomes claimable. 
Before: Performance counter stats for './run case-lru-file-mmap-read' (5 runs): 85,294.85 msec task-clock # 5.644 CPUs utilized ( +- 0.32% ) 15,968 context-switches # 187.209 /sec ( +- 3.81% ) 153 cpu-migrations # 1.794 /sec ( +- 3.29% ) 801,808 page-faults # 9.400 K/sec ( +- 0.10% ) 733,358,331,786 instructions # 1.87 insn per cycle ( +- 0.20% ) (64.94%) 392,622,904,199 cycles # 4.603 GHz ( +- 0.31% ) (64.84%) 148,563,488,531 branches # 1.742 G/sec ( +- 0.18% ) (63.86%) 152,143,228 branch-misses # 0.10% of all branches ( +- 1.19% ) (62.82%) 15.1128 +- 0.0637 seconds time elapsed ( +- 0.42% ) After: Performance counter stats for './run case-lru-file-mmap-read' (5 runs): 84,380.21 msec task-clock # 5.664 CPUs utilized ( +- 0.21% ) 16,656 context-switches # 197.392 /sec ( +- 3.27% ) 151 cpu-migrations # 1.790 /sec ( +- 3.28% ) 801,703 page-faults # 9.501 K/sec ( +- 0.09% ) 731,914,183,060 instructions # 1.88 insn per cycle ( +- 0.38% ) (64.90%) 388,673,535,116 cycles # 4.606 GHz ( +- 0.24% ) (65.06%) 148,251,482,143 branches # 1.757 G/sec ( +- 0.37% ) (63.92%) 149,766,550 branch-misses # 0.10% of all branches ( +- 1.22% ) (62.88%) 14.8968 +- 0.0486 seconds time elapsed ( +- 0.33% ) Link: https://lkml.kernel.org/r/20250407180154.63348-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Reviewed-by: Brendan Jackman Tested-by: Shivank Garg Reviewed-by: Vlastimil Babka Cc: Carlos Song Cc: Mel Gorman Signed-off-by: Andrew Morton Bug: 420836317 Change-Id: I2886de9da0fd99047cf5c675cd2ae7c386267770 (cherry picked from commit ee414bd97b3fa0a4f74e40004e3b4191326bd46c) [In the original patch, the variable MIGRATE_PCPTYPES in the find_suitable_fallback function should be MIGRATE_FALLBACKS in Linux 6.6, causing the patch to fail to apply directly.] Signed-off-by: yipeng xiang --- mm/compaction.c | 4 +--- mm/internal.h | 2 +- mm/page_alloc.c | 31 +++++++++++++------------------ 3 files changed, 15 insertions(+), 22 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 7e0f264e46d8..89570cd884c7 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -2279,7 +2279,6 @@ static enum compact_result __compact_finished(struct compact_control *cc) ret = COMPACT_NO_SUITABLE_PAGE; for (order = cc->order; order < NR_PAGE_ORDERS; order++) { struct free_area *area = &cc->zone->free_area[order]; - bool claim_block; /* Job done if page is free of the right migratetype */ if (!free_area_empty(area, migratetype)) @@ -2295,8 +2294,7 @@ static enum compact_result __compact_finished(struct compact_control *cc) * Job done if allocation would steal freepages from * other migratetype buddy lists. */ - if (find_suitable_fallback(area, order, migratetype, - true, &claim_block) != -1) + if (find_suitable_fallback(area, order, migratetype, true) >= 0) /* * Movable pages are OK in any pageblock. 
If we are * stealing for a non-movable allocation, make sure diff --git a/mm/internal.h b/mm/internal.h index 313f6e6ea62e..3fb4222fc3c9 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -815,7 +815,7 @@ void init_cma_reserved_pageblock(struct page *page); #endif /* CONFIG_COMPACTION || CONFIG_CMA */ int find_suitable_fallback(struct free_area *area, unsigned int order, - int migratetype, bool claim_only, bool *claim_block); + int migratetype, bool claimable); static inline bool free_area_empty(struct free_area *area, int migratetype) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 97dc7a5280a2..d1ecd3793e40 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2080,31 +2080,25 @@ static bool should_try_claim_block(unsigned int order, int start_mt) /* * Check whether there is a suitable fallback freepage with requested order. - * Sets *claim_block to instruct the caller whether it should convert a whole - * pageblock to the returned migratetype. - * If only_claim is true, this function returns fallback_mt only if + * If claimable is true, this function returns fallback_mt only if * we would do this whole-block claiming. This would help to reduce * fragmentation due to mixed migratetype pages in one pageblock. */ int find_suitable_fallback(struct free_area *area, unsigned int order, - int migratetype, bool only_claim, bool *claim_block) + int migratetype, bool claimable) { int i; - int fallback_mt; + + if (claimable && !should_try_claim_block(order, migratetype)) + return -2; if (area->nr_free == 0) return -1; - *claim_block = false; for (i = 0; i < MIGRATE_FALLBACKS - 1 ; i++) { - fallback_mt = fallbacks[migratetype][i]; - if (free_area_empty(area, fallback_mt)) - continue; + int fallback_mt = fallbacks[migratetype][i]; - if (should_try_claim_block(order, migratetype)) - *claim_block = true; - - if (*claim_block || !only_claim) + if (!free_area_empty(area, fallback_mt)) return fallback_mt; } @@ -2201,7 +2195,6 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype, int min_order = order; struct page *page; int fallback_mt; - bool claim_block; /* * Do not steal pages from freelists belonging to other pageblocks @@ -2220,11 +2213,14 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype, --current_order) { area = &(zone->free_area[current_order]); fallback_mt = find_suitable_fallback(area, current_order, - start_migratetype, false, &claim_block); + start_migratetype, true); + + /* No block in that order */ if (fallback_mt == -1) continue; - if (!claim_block) + /* Advanced into orders too low to claim, abort */ + if (fallback_mt == -2) break; page = get_page_from_free_area(area, fallback_mt); @@ -2252,12 +2248,11 @@ __rmqueue_steal(struct zone *zone, int order, int start_migratetype) int current_order; struct page *page; int fallback_mt; - bool claim_block; for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) { area = &(zone->free_area[current_order]); fallback_mt = find_suitable_fallback(area, current_order, - start_migratetype, false, &claim_block); + start_migratetype, false); if (fallback_mt == -1) continue; From 6d61bc2d2d7fcb6e4eef25e5182563b0ae79e4c3 Mon Sep 17 00:00:00 2001 From: Richard Chang Date: Wed, 2 Jul 2025 07:16:49 +0000 Subject: [PATCH 16/49] ANDROID: restricted vendor_hook: add swap_readpage_bdev_sync Add restricted vendor hook to optimize the swap-in latency. 
Bug: 401975249 Bug: 428209185 Change-Id: I1a2be1a309769590cb427e13762e29d8c8fa9cf6 Signed-off-by: Richard Chang --- drivers/android/vendor_hooks.c | 1 + include/trace/hooks/mm.h | 4 ++++ mm/page_io.c | 13 +++++++++++++ 3 files changed, 18 insertions(+) diff --git a/drivers/android/vendor_hooks.c b/drivers/android/vendor_hooks.c index d3f7ff4fde56..00ffd7ed2ffc 100644 --- a/drivers/android/vendor_hooks.c +++ b/drivers/android/vendor_hooks.c @@ -625,6 +625,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_migration_target_bypass); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_shrink_node_memcgs); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_swap_writepage); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_swap_readpage_bdev_sync); +EXPORT_TRACEPOINT_SYMBOL_GPL(android_rvh_swap_readpage_bdev_sync); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_dpm_wait_start); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_dpm_wait_finish); EXPORT_TRACEPOINT_SYMBOL_GPL(android_vh_sync_irq_wait_start); diff --git a/include/trace/hooks/mm.h b/include/trace/hooks/mm.h index 65eb40c00944..8087138ba33c 100644 --- a/include/trace/hooks/mm.h +++ b/include/trace/hooks/mm.h @@ -549,6 +549,10 @@ DECLARE_HOOK(android_vh_swap_readpage_bdev_sync, TP_PROTO(struct block_device *bdev, sector_t sector, struct page *page, bool *read), TP_ARGS(bdev, sector, page, read)); +DECLARE_RESTRICTED_HOOK(android_rvh_swap_readpage_bdev_sync, + TP_PROTO(struct block_device *bdev, sector_t sector, + struct page *page, bool *read), + TP_ARGS(bdev, sector, page, read), 4); DECLARE_HOOK(android_vh_alloc_flags_cma_adjust, TP_PROTO(gfp_t gfp_mask, unsigned int *alloc_flags), TP_ARGS(gfp_mask, alloc_flags)); diff --git a/mm/page_io.c b/mm/page_io.c index 648fd53303a9..a3feadd1ba9e 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -471,6 +471,19 @@ static void swap_readpage_bdev_sync(struct folio *folio, struct bio bio; bool read = false; + trace_android_rvh_swap_readpage_bdev_sync(sis->bdev, + swap_page_sector(&folio->page) + get_start_sect(sis->bdev), + &folio->page, &read); + if (read) { + count_vm_events(PSWPIN, folio_nr_pages(folio)); + return; + } + + /* + * trace_android_vh_swap_readpage_bdev_sync is deprecated, and + * should not be carried over into later kernels. + * Use trace_android_rvh_swap_readpage_bdev_sync instead. 
+ */ trace_android_vh_swap_readpage_bdev_sync(sis->bdev, swap_page_sector(&folio->page) + get_start_sect(sis->bdev), &folio->page, &read); From cc8b083f6fb69d104e028513ddf46ef77adb2723 Mon Sep 17 00:00:00 2001 From: Richard Chang Date: Wed, 2 Jul 2025 07:58:27 +0000 Subject: [PATCH 17/49] ANDROID: ABI: Update pixel symbol list Adding the following symbols: - __traceiter_android_rvh_swap_readpage_bdev_sync - __tracepoint_android_rvh_swap_readpage_bdev_sync Bug: 401975249 Bug: 428209185 Change-Id: Ibdad385b2a9dc36e585ff3aa1ee9334680c57a20 Signed-off-by: Richard Chang --- android/abi_gki_aarch64.stg | 20 ++++++++++++++++++++ android/abi_gki_aarch64_pixel | 2 ++ 2 files changed, 22 insertions(+) diff --git a/android/abi_gki_aarch64.stg b/android/abi_gki_aarch64.stg index d8ed1e76d857..41b693bf1454 100644 --- a/android/abi_gki_aarch64.stg +++ b/android/abi_gki_aarch64.stg @@ -360789,6 +360789,15 @@ elf_symbol { type_id: 0x9baf3eaf full_name: "__traceiter_android_rvh_show_max_freq" } +elf_symbol { + id: 0xb80ecc98 + name: "__traceiter_android_rvh_swap_readpage_bdev_sync" + is_defined: true + symbol_type: FUNCTION + crc: 0xecf99d88 + type_id: 0x9bab3090 + full_name: "__traceiter_android_rvh_swap_readpage_bdev_sync" +} elf_symbol { id: 0x3b650ee3 name: "__traceiter_android_rvh_tcp_rcv_spurious_retrans" @@ -367899,6 +367908,15 @@ elf_symbol { type_id: 0x18ccbd2c full_name: "__tracepoint_android_rvh_show_max_freq" } +elf_symbol { + id: 0x64ce7cd6 + name: "__tracepoint_android_rvh_swap_readpage_bdev_sync" + is_defined: true + symbol_type: OBJECT + crc: 0x72fbf2a6 + type_id: 0x18ccbd2c + full_name: "__tracepoint_android_rvh_swap_readpage_bdev_sync" +} elf_symbol { id: 0x5380a8d5 name: "__tracepoint_android_rvh_tcp_rcv_spurious_retrans" @@ -436958,6 +436976,7 @@ interface { symbol_id: 0x1228e7e9 symbol_id: 0x73c83ef4 symbol_id: 0x46515de8 + symbol_id: 0xb80ecc98 symbol_id: 0x3b650ee3 symbol_id: 0xcf016f05 symbol_id: 0x79480d0a @@ -437748,6 +437767,7 @@ interface { symbol_id: 0x8a4070f7 symbol_id: 0x00b7ed82 symbol_id: 0xe8cacf26 + symbol_id: 0x64ce7cd6 symbol_id: 0x5380a8d5 symbol_id: 0x1f12a317 symbol_id: 0x454d16cc diff --git a/android/abi_gki_aarch64_pixel b/android/abi_gki_aarch64_pixel index d64fef8faf50..d17b443f7c9e 100644 --- a/android/abi_gki_aarch64_pixel +++ b/android/abi_gki_aarch64_pixel @@ -2669,6 +2669,7 @@ __traceiter_android_rvh_setscheduler_prio __traceiter_android_rvh_set_task_cpu __traceiter_android_rvh_set_user_nice_locked + __traceiter_android_rvh_swap_readpage_bdev_sync __traceiter_android_rvh_tick_entry __traceiter_android_rvh_try_to_wake_up_success __traceiter_android_rvh_uclamp_eff_get @@ -2808,6 +2809,7 @@ __tracepoint_android_rvh_setscheduler_prio __tracepoint_android_rvh_set_task_cpu __tracepoint_android_rvh_set_user_nice_locked + __tracepoint_android_rvh_swap_readpage_bdev_sync __tracepoint_android_rvh_tick_entry __tracepoint_android_rvh_try_to_wake_up_success __tracepoint_android_rvh_uclamp_eff_get From 326b0bd6324844d4fa25ace1939a34e871aa2caf Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 5 Jun 2025 12:00:09 +0200 Subject: [PATCH 18/49] BACKPORT: FROMGIT: sched/core: Fix migrate_swap() vs. hotplug On Mon, Jun 02, 2025 at 03:22:13PM +0800, Kuyo Chang wrote: > So, the potential race scenario is: > > CPU0 CPU1 > // doing migrate_swap(cpu0/cpu1) > stop_two_cpus() > ... 
> // doing _cpu_down() > sched_cpu_deactivate() > set_cpu_active(cpu, false); > balance_push_set(cpu, true); > cpu_stop_queue_two_works > __cpu_stop_queue_work(stopper1,...); > __cpu_stop_queue_work(stopper2,..); > stop_cpus_in_progress -> true > preempt_enable(); > ... > 1st balance_push > stop_one_cpu_nowait > cpu_stop_queue_work > __cpu_stop_queue_work > list_add_tail -> 1st add push_work > wake_up_q(&wakeq); -> "wakeq is empty. > This implies that the stopper is at wakeq@migrate_swap." > preempt_disable > wake_up_q(&wakeq); > wake_up_process // wakeup migrate/0 > try_to_wake_up > ttwu_queue > ttwu_queue_cond ->meet below case > if (cpu == smp_processor_id()) > return false; > ttwu_do_activate > //migrate/0 wakeup done > wake_up_process // wakeup migrate/1 > try_to_wake_up > ttwu_queue > ttwu_queue_cond > ttwu_queue_wakelist > __ttwu_queue_wakelist > __smp_call_single_queue > preempt_enable(); > > 2nd balance_push > stop_one_cpu_nowait > cpu_stop_queue_work > __cpu_stop_queue_work > list_add_tail -> 2nd add push_work, so the double list add is detected > ... > ... > cpu1 get ipi, do sched_ttwu_pending, wakeup migrate/1 > So this balance_push() is part of schedule(), and schedule() is supposed to switch to stopper task, but because of this race condition, stopper task is stuck in WAKING state and not actually visible to be picked. Therefore CPU1 can do another schedule() and end up doing another balance_push() even though the last one hasn't been done yet. This is a confluence of fail, where both wake_q and ttwu_wakelist can cause crucial wakeups to be delayed, resulting in the malfunction of balance_push. Since there is only a single stopper thread to be woken, the wake_q doesn't really add anything here, and can be removed in favour of direct wakeups of the stopper thread. Then add a clause to ttwu_queue_cond() to ensure the stopper threads are never queued / delayed. Of all 3 moving parts, the last addition was the balance_push() machinery, so pick that as the point the bug was introduced. Fixes: 2558aacff858 ("sched/hotplug: Ensure only per-cpu kthreads run during hotplug") Reported-by: Kuyo Chang Signed-off-by: Peter Zijlstra (Intel) Tested-by: Kuyo Chang Link: https://lkml.kernel.org/r/20250605100009.GO39944@noisy.programming.kicks-ass.net (cherry picked from commit b18ad3387895ae22eb784f721d476094ad71899b git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/urgent) Bug: 419157029 Change-Id: Ia54be189b1ab08f2171c094e4182ebb99330565f [jstultz: Resolved trivial collision in cherry-pick] Signed-off-by: John Stultz --- kernel/sched/core.c | 5 +++++ kernel/stop_machine.c | 20 ++++++++++---------- 2 files changed, 15 insertions(+), 10 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e3e17a54c71f..41f11c0f834e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4073,6 +4073,11 @@ bool cpus_share_cache(int this_cpu, int that_cpu) static inline bool ttwu_queue_cond(struct task_struct *p, int cpu) { +#ifdef CONFIG_SMP + if (p->sched_class == &stop_sched_class) + return false; +#endif + /* * Do not complicate things with the async wake_list while the CPU is * in hotplug state. 
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c index 7b65bb0b4a66..0c3d387d3db7 100644 --- a/kernel/stop_machine.c +++ b/kernel/stop_machine.c @@ -82,18 +82,15 @@ static void cpu_stop_signal_done(struct cpu_stop_done *done) } static void __cpu_stop_queue_work(struct cpu_stopper *stopper, - struct cpu_stop_work *work, - struct wake_q_head *wakeq) + struct cpu_stop_work *work) { list_add_tail(&work->list, &stopper->works); - wake_q_add(wakeq, stopper->thread); } /* queue @work to @stopper. if offline, @work is completed immediately */ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work) { struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu); - DEFINE_WAKE_Q(wakeq); unsigned long flags; bool enabled; @@ -101,12 +98,13 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work) raw_spin_lock_irqsave(&stopper->lock, flags); enabled = stopper->enabled; if (enabled) - __cpu_stop_queue_work(stopper, work, &wakeq); + __cpu_stop_queue_work(stopper, work); else if (work->done) cpu_stop_signal_done(work->done); raw_spin_unlock_irqrestore(&stopper->lock, flags); - wake_up_q(&wakeq); + if (enabled) + wake_up_process(stopper->thread); preempt_enable(); return enabled; @@ -264,7 +262,6 @@ static int cpu_stop_queue_two_works(int cpu1, struct cpu_stop_work *work1, { struct cpu_stopper *stopper1 = per_cpu_ptr(&cpu_stopper, cpu1); struct cpu_stopper *stopper2 = per_cpu_ptr(&cpu_stopper, cpu2); - DEFINE_WAKE_Q(wakeq); int err; retry: @@ -300,8 +297,8 @@ retry: } err = 0; - __cpu_stop_queue_work(stopper1, work1, &wakeq); - __cpu_stop_queue_work(stopper2, work2, &wakeq); + __cpu_stop_queue_work(stopper1, work1); + __cpu_stop_queue_work(stopper2, work2); unlock: raw_spin_unlock(&stopper2->lock); @@ -316,7 +313,10 @@ unlock: goto retry; } - wake_up_q(&wakeq); + if (!err) { + wake_up_process(stopper1->thread); + wake_up_process(stopper2->thread); + } preempt_enable(); return err; From 6d27de405aaf6127f4b7184a8377813eb2a030a5 Mon Sep 17 00:00:00 2001 From: Nikita Ioffe Date: Tue, 1 Jul 2025 18:35:42 +0000 Subject: [PATCH 19/49] ANDROID: KVM: arm64: use hyp_trace_raw_fops for trace_pipe_raw The trace_pipe_raw interface is expected to return binary trace format, while hyp_trace_pipe_fops returns the text trace format. This patch change trace_pipe_raw fops hyp_trace_raw_fops which provides the binary output. Bug: 428904926 Test: presubmit Change-Id: Id72d2c7df366934f00b17674078c94c2b2d288be Signed-off-by: Nikita Ioffe --- arch/arm64/kvm/hyp_trace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/kvm/hyp_trace.c b/arch/arm64/kvm/hyp_trace.c index b4a5d117568a..b6ce4aa33eb0 100644 --- a/arch/arm64/kvm/hyp_trace.c +++ b/arch/arm64/kvm/hyp_trace.c @@ -861,7 +861,7 @@ int hyp_trace_init_tracefs(void) tracefs_create_file("trace_pipe", TRACEFS_MODE_READ, per_cpu_dir, (void *)cpu, &hyp_trace_pipe_fops); tracefs_create_file("trace_pipe_raw", TRACEFS_MODE_READ, per_cpu_dir, - (void *)cpu, &hyp_trace_pipe_fops); + (void *)cpu, &hyp_trace_raw_fops); } hyp_trace_init_event_tracefs(root); From 65f295739c930f94ecd495b993c7571b2c7f4e95 Mon Sep 17 00:00:00 2001 From: Nikita Ioffe Date: Thu, 3 Jul 2025 16:06:03 +0000 Subject: [PATCH 20/49] ANDROID: kvm: arm64: start hypervisor event IDs from 1 IDs of tracing events are expected to be positive integers, hence this patch. This is a quick fix to make sure that hypervisor tracing works, while the proper solution that avoids ID collision is being worked on. 
Bug: 428904926 Test: presubmit Change-Id: I95459cbf32466351b6a539ea2111e8d091291c2b Signed-off-by: Nikita Ioffe --- arch/arm64/kvm/hyp_events.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kvm/hyp_events.c b/arch/arm64/kvm/hyp_events.c index 424cd5189355..086931bec32c 100644 --- a/arch/arm64/kvm/hyp_events.c +++ b/arch/arm64/kvm/hyp_events.c @@ -250,7 +250,10 @@ bool hyp_trace_init_event_early(void) } static struct dentry *event_tracefs; -static unsigned int last_event_id; +// Event IDs should be positive integers, hence starting from 1 here. +// NOTE: this introduces ID clash between hypervisor events and kernel events. +// For now this doesn't seem to cause problems, but we should fix it... +static unsigned int last_event_id = 1; struct hyp_event_table { struct hyp_event *start; From 925ea90047178598e9bfa45b7e82505c069d3dec Mon Sep 17 00:00:00 2001 From: Nikita Ioffe Date: Fri, 4 Jul 2025 12:04:53 +0000 Subject: [PATCH 21/49] ANDROID: kvm: arm64: add per_cpu/cpuX/trace file The trace interface was present in android14-6.1 kernel, and is used by perfetto (although perfetto can work without it), so we should keep it. Bug: 428904926 Test: presubmit Change-Id: I51cc82324b3ef1ad8a801ae54f427eaf8790acd2 Signed-off-by: Nikita Ioffe --- arch/arm64/kvm/hyp_trace.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm64/kvm/hyp_trace.c b/arch/arm64/kvm/hyp_trace.c index b6ce4aa33eb0..4eb85aad055f 100644 --- a/arch/arm64/kvm/hyp_trace.c +++ b/arch/arm64/kvm/hyp_trace.c @@ -862,6 +862,8 @@ int hyp_trace_init_tracefs(void) (void *)cpu, &hyp_trace_pipe_fops); tracefs_create_file("trace_pipe_raw", TRACEFS_MODE_READ, per_cpu_dir, (void *)cpu, &hyp_trace_raw_fops); + tracefs_create_file("trace", TRACEFS_MODE_WRITE, per_cpu_dir, + (void *)cpu, &hyp_trace_fops); } hyp_trace_init_event_tracefs(root); From 84bb4ef6233cddaf10a656f21375d5e008d6751c Mon Sep 17 00:00:00 2001 From: Kuen-Han Tsai Date: Tue, 17 Jun 2025 13:07:12 +0800 Subject: [PATCH 22/49] UPSTREAM: usb: gadget: u_serial: Fix race condition in TTY wakeup A race condition occurs when gs_start_io() calls either gs_start_rx() or gs_start_tx(), as those functions briefly drop the port_lock for usb_ep_queue(). This allows gs_close() and gserial_disconnect() to clear port.tty and port_usb, respectively. Use the null-safe TTY Port helper function to wake up TTY. 
Example CPU1: CPU2: gserial_connect() // lock gs_close() // await lock gs_start_rx() // unlock usb_ep_queue() gs_close() // lock, reset port.tty and unlock gs_start_rx() // lock tty_wakeup() // NPE Fixes: 35f95fd7f234 ("TTY: usb/u_serial, use tty from tty_port") Cc: stable Signed-off-by: Kuen-Han Tsai Reviewed-by: Prashanth K Link: https://lore.kernel.org/linux-usb/20240116141801.396398-1-khtsai@google.com/ Link: https://lore.kernel.org/r/20250617050844.1848232-2-khtsai@google.com Signed-off-by: Greg Kroah-Hartman Bug: 417232809 (cherry picked from commit c529c3730bd09115684644e26bf01ecbd7e2c2c9) Change-Id: I0dfff41aeae526bc3c334266f2773e6636d8dd33 Signed-off-by: Kuen-Han Tsai --- drivers/usb/gadget/function/u_serial.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/usb/gadget/function/u_serial.c b/drivers/usb/gadget/function/u_serial.c index 729b0472bab0..7a306b11881f 100644 --- a/drivers/usb/gadget/function/u_serial.c +++ b/drivers/usb/gadget/function/u_serial.c @@ -291,8 +291,8 @@ __acquires(&port->port_lock) break; } - if (do_tty_wake && port->port.tty) - tty_wakeup(port->port.tty); + if (do_tty_wake) + tty_port_tty_wakeup(&port->port); return status; } @@ -573,7 +573,7 @@ static int gs_start_io(struct gs_port *port) gs_start_tx(port); /* Unblock any pending writes into our circular buffer, in case * we didn't in gs_start_tx() */ - tty_wakeup(port->port.tty); + tty_port_tty_wakeup(&port->port); } else { /* Free reqs only if we are still connected */ if (port->port_usb) { From 5b1c4cc0868731a80e0460cc81504d7a40130f02 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Yuan-Jen=20=28=E6=B7=B5=E4=BB=81=29=20Cheng?= Date: Thu, 3 Jul 2025 10:15:11 +0000 Subject: [PATCH 23/49] ANDROID: Add the dma header to aarch64 allowlist MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Export the header in all_headers_allowlist_aarch64, for dma driver to use. Bug: 343869732 Test: Verified the dma ddk modules are able to include the header. Change-Id: Ib4bc8dada58495dc25bb1b41e6b502c18fe59591 Signed-off-by: Yuan-Jen (淵仁) Cheng --- BUILD.bazel | 2 ++ 1 file changed, 2 insertions(+) diff --git a/BUILD.bazel b/BUILD.bazel index 4be135108507..d17808239217 100644 --- a/BUILD.bazel +++ b/BUILD.bazel @@ -1025,6 +1025,7 @@ ddk_headers( name = "all_headers_allowlist_aarch64", hdrs = [ "drivers/dma-buf/heaps/deferred-free-helper.h", + "drivers/dma/dmaengine.h", "drivers/extcon/extcon.h", "drivers/pci/controller/dwc/pcie-designware.h", "drivers/thermal/thermal_core.h", @@ -1046,6 +1047,7 @@ ddk_headers( "arch/arm64/include", "arch/arm64/include/uapi", "drivers/dma-buf", + "drivers/dma", "drivers/extcon", "drivers/pci/controller/dwc", "drivers/thermal", From 949ed5babab53bdde799b8ece3922047054f4beb Mon Sep 17 00:00:00 2001 From: yipeng xiang Date: Thu, 26 Jun 2025 11:19:45 +0800 Subject: [PATCH 24/49] ANDROID: mm: export vm_normal_folio_pmd to allow vendors to implement simplified smaps The current process smaps operation is time-consuming. Exporting the vm_normal_folio_pmd function enables vendors to provide a more efficient and simplified version of smaps. 
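For illustration only (not part of this patch), a vendor implementation might consume the export from a pagewalk callback along these lines. The stats structure and callback name are hypothetical, and a real version would also take the PMD lock via pmd_trans_huge_lock() the way fs/proc/task_mmu.c does.

#include <linux/mm.h>
#include <linux/pagewalk.h>

struct vendor_smaps_stats {             /* hypothetical accumulator */
        unsigned long thp_rss;
};

static int vendor_smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
                                  unsigned long next, struct mm_walk *walk)
{
        struct vendor_smaps_stats *stats = walk->private;
        struct vm_area_struct *vma = walk->vma;

        if (pmd_present(*pmd) && pmd_trans_huge(*pmd)) {
                struct folio *folio = vm_normal_folio_pmd(vma, addr, *pmd);

                if (folio)
                        stats->thp_rss += folio_nr_pages(folio) * PAGE_SIZE;
        }
        return 0;
}
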
Bug: 427633539 Change-Id: I7710f5d1656a9f7a4ae883aefc93135c93e637b5 Signed-off-by: yipeng xiang --- mm/memory.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/memory.c b/mm/memory.c index a04841dc9291..2646e93c9004 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -713,6 +713,7 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma, return page_folio(page); return NULL; } +EXPORT_SYMBOL_GPL(vm_normal_folio_pmd); #endif static void restore_exclusive_pte(struct vm_area_struct *vma, From 573a6732fcff2e3a5f529c8c7c401b6f46be18ff Mon Sep 17 00:00:00 2001 From: yipeng xiang Date: Fri, 4 Jul 2025 16:58:00 +0800 Subject: [PATCH 25/49] ANDROID: GKI: Update symbols list file for honor White list the vm_normal_folio_pmd 1 function symbol(s) added 'struct folio* vm_normal_folio_pmd(struct vm_area_struct*, unsigned long, pmd_t)' Bug: 427633539 Change-Id: Ib7e2f6d4a574871202cc10b32342f84a63811c0a Signed-off-by: yipeng xiang --- android/abi_gki_aarch64.stg | 17 +++++++++++++++++ android/abi_gki_aarch64_honor | 1 + 2 files changed, 18 insertions(+) diff --git a/android/abi_gki_aarch64.stg b/android/abi_gki_aarch64.stg index 41b693bf1454..baa12bbf4722 100644 --- a/android/abi_gki_aarch64.stg +++ b/android/abi_gki_aarch64.stg @@ -321953,6 +321953,13 @@ function { parameter_id: 0x27a7c613 parameter_id: 0x4585663f } +function { + id: 0x5e21336c + return_type_id: 0x2170d06d + parameter_id: 0x0a134144 + parameter_id: 0x33756485 + parameter_id: 0xae60496e +} function { id: 0x5e29431a return_type_id: 0x295c7202 @@ -434551,6 +434558,15 @@ elf_symbol { type_id: 0xfc37fa4b full_name: "vm_node_stat" } +elf_symbol { + id: 0x4e194253 + name: "vm_normal_folio_pmd" + is_defined: true + symbol_type: FUNCTION + crc: 0xa737dbaa + type_id: 0x5e21336c + full_name: "vm_normal_folio_pmd" +} elf_symbol { id: 0x2570ceae name: "vm_normal_page" @@ -445161,6 +445177,7 @@ interface { symbol_id: 0xdc09fb10 symbol_id: 0x5849ff8e symbol_id: 0xaf85c216 + symbol_id: 0x4e194253 symbol_id: 0x2570ceae symbol_id: 0xacc76406 symbol_id: 0xef2c49d1 diff --git a/android/abi_gki_aarch64_honor b/android/abi_gki_aarch64_honor index 48c49720e0ea..33decf01d449 100644 --- a/android/abi_gki_aarch64_honor +++ b/android/abi_gki_aarch64_honor @@ -94,6 +94,7 @@ bio_crypt_set_ctx zero_fill_bio_iter percpu_ref_is_zero + vm_normal_folio_pmd __trace_bputs __traceiter_android_vh_proactive_compact_wmark_high __tracepoint_android_vh_proactive_compact_wmark_high From 5f592a6260c334e0b195ef229eb11910fc5a5890 Mon Sep 17 00:00:00 2001 From: Aran Dalton Date: Mon, 7 Jul 2025 16:56:35 +0800 Subject: [PATCH 26/49] ANDROID: ABI: Update symbol list for sunxi 5 function symbol(s) added 'bool drm_is_panel_follower(struct device*)' 'int drm_panel_add_follower(struct device*, struct drm_panel_follower*)' 'void drm_panel_remove_follower(struct drm_panel_follower*)' 'int hid_driver_reset_resume(struct hid_device*)' 'int hid_driver_suspend(struct hid_device*, pm_message_t)' Bug: 429955708 Change-Id: Iaf02aef7b07559aafd283f496b3c7088d0b89669 Signed-off-by: Aran Dalton --- android/abi_gki_aarch64.stg | 129 ++++++++++++++++++++++++++++++++++ android/abi_gki_aarch64_sunxi | 5 ++ 2 files changed, 134 insertions(+) diff --git a/android/abi_gki_aarch64.stg b/android/abi_gki_aarch64.stg index baa12bbf4722..679394de3bae 100644 --- a/android/abi_gki_aarch64.stg +++ b/android/abi_gki_aarch64.stg @@ -8228,6 +8228,11 @@ pointer_reference { kind: POINTER pointee_type_id: 0x15e4d187 } +pointer_reference { + id: 0x0fe9f911 + kind: POINTER + pointee_type_id: 0x15e702d9 +} 
pointer_reference { id: 0x0fe9ffda kind: POINTER @@ -17573,6 +17578,11 @@ pointer_reference { kind: POINTER pointee_type_id: 0x9e7aaf3f } +pointer_reference { + id: 0x2d0e9efd + kind: POINTER + pointee_type_id: 0x9e7a9d6b +} pointer_reference { id: 0x2d0fdd7c kind: POINTER @@ -27928,6 +27938,11 @@ pointer_reference { kind: POINTER pointee_type_id: 0xca7029d8 } +pointer_reference { + id: 0x380eb497 + kind: POINTER + pointee_type_id: 0xca7a34c0 +} pointer_reference { id: 0x381020ff kind: POINTER @@ -35043,6 +35058,11 @@ qualified { qualifier: CONST qualified_type_id: 0x592e728c } +qualified { + id: 0xca7a34c0 + qualifier: CONST + qualified_type_id: 0x59af6589 +} qualified { id: 0xca8285c3 qualifier: CONST @@ -99904,6 +99924,11 @@ member { type_id: 0x37e7a473 offset: 768 } +member { + id: 0x36181e96 + name: "funcs" + type_id: 0x380eb497 +} member { id: 0x36184afd name: "funcs" @@ -152610,6 +152635,12 @@ member { type_id: 0x9bd401b6 offset: 16 } +member { + id: 0xd3327091 + name: "panel" + type_id: 0x10617cac + offset: 192 +} member { id: 0xd3a8d2cb name: "panel" @@ -152633,6 +152664,17 @@ member { type_id: 0x2a670b41 offset: 9024 } +member { + id: 0xf2e51365 + name: "panel_prepared" + type_id: 0x2d0e9efd +} +member { + id: 0x289370ad + name: "panel_unpreparing" + type_id: 0x2d0e9efd + offset: 64 +} member { id: 0x616a797d name: "panic" @@ -239344,6 +239386,27 @@ struct_union { member_id: 0x3a2d3750 } } +struct_union { + id: 0x15e702d9 + kind: STRUCT + name: "drm_panel_follower" + definition { + bytesize: 32 + member_id: 0x36181e96 + member_id: 0x7c00ebb3 + member_id: 0xd3327091 + } +} +struct_union { + id: 0x59af6589 + kind: STRUCT + name: "drm_panel_follower_funcs" + definition { + bytesize: 16 + member_id: 0xf2e51365 + member_id: 0x289370ad + } +} struct_union { id: 0x5c75f1b8 kind: STRUCT @@ -308489,6 +308552,11 @@ function { parameter_id: 0x0258f96e parameter_id: 0xd41e888f } +function { + id: 0x13622fd7 + return_type_id: 0x48b5725f + parameter_id: 0x0fe9f911 +} function { id: 0x1362a71c return_type_id: 0x48b5725f @@ -345553,6 +345621,12 @@ function { parameter_id: 0x0258f96e parameter_id: 0x0fa01494 } +function { + id: 0x9d297a90 + return_type_id: 0x6720d32f + parameter_id: 0x0258f96e + parameter_id: 0x0fe9f911 +} function { id: 0x9d2c14da return_type_id: 0x6720d32f @@ -348130,6 +348204,11 @@ function { parameter_id: 0x0c2e195c parameter_id: 0x3ca4f8de } +function { + id: 0x9e7a9d6b + return_type_id: 0x6720d32f + parameter_id: 0x0fe9f911 +} function { id: 0x9e7aaf3f return_type_id: 0x6720d32f @@ -389913,6 +389992,15 @@ elf_symbol { type_id: 0xfa1de4ef full_name: "drm_is_current_master" } +elf_symbol { + id: 0xa3983618 + name: "drm_is_panel_follower" + is_defined: true + symbol_type: FUNCTION + crc: 0xcfdfa487 + type_id: 0xfe32655f + full_name: "drm_is_panel_follower" +} elf_symbol { id: 0xc8af6225 name: "drm_kms_helper_connector_hotplug_event" @@ -390561,6 +390649,15 @@ elf_symbol { type_id: 0x14800eb8 full_name: "drm_panel_add" } +elf_symbol { + id: 0x2b742694 + name: "drm_panel_add_follower" + is_defined: true + symbol_type: FUNCTION + crc: 0x2db618bd + type_id: 0x9d297a90 + full_name: "drm_panel_add_follower" +} elf_symbol { id: 0xd67ad69f name: "drm_panel_bridge_add_typed" @@ -390651,6 +390748,15 @@ elf_symbol { type_id: 0x14800eb8 full_name: "drm_panel_remove" } +elf_symbol { + id: 0x6016204a + name: "drm_panel_remove_follower" + is_defined: true + symbol_type: FUNCTION + crc: 0x397cfaf5 + type_id: 0x13622fd7 + full_name: "drm_panel_remove_follower" +} elf_symbol { id: 0x046720ab 
name: "drm_panel_unprepare" @@ -396752,6 +396858,24 @@ elf_symbol { type_id: 0x13e1603f full_name: "hid_destroy_device" } +elf_symbol { + id: 0x1706be22 + name: "hid_driver_reset_resume" + is_defined: true + symbol_type: FUNCTION + crc: 0x371549c9 + type_id: 0x9ef9d283 + full_name: "hid_driver_reset_resume" +} +elf_symbol { + id: 0x4c3911f0 + name: "hid_driver_suspend" + is_defined: true + symbol_type: FUNCTION + crc: 0xe6a4222b + type_id: 0x9d398c85 + full_name: "hid_driver_suspend" +} elf_symbol { id: 0x8717f26f name: "hid_hw_close" @@ -440224,6 +440348,7 @@ interface { symbol_id: 0x3a6e27e9 symbol_id: 0xc9aa2ffd symbol_id: 0xec79cf1c + symbol_id: 0xa3983618 symbol_id: 0xc8af6225 symbol_id: 0x8a043efe symbol_id: 0x3c6b600d @@ -440296,6 +440421,7 @@ interface { symbol_id: 0xc73568f4 symbol_id: 0x124ae77d symbol_id: 0xdc6725cf + symbol_id: 0x2b742694 symbol_id: 0xd67ad69f symbol_id: 0x48cde8a9 symbol_id: 0x633d0644 @@ -440306,6 +440432,7 @@ interface { symbol_id: 0xad1d778f symbol_id: 0xcf81b673 symbol_id: 0x864914fa + symbol_id: 0x6016204a symbol_id: 0x046720ab symbol_id: 0x3c07bbff symbol_id: 0xbdb562b1 @@ -440982,6 +441109,8 @@ interface { symbol_id: 0xccc593d6 symbol_id: 0x97a02af0 symbol_id: 0x2ffc7c7e + symbol_id: 0x1706be22 + symbol_id: 0x4c3911f0 symbol_id: 0x8717f26f symbol_id: 0x361004c8 symbol_id: 0xcf5ea9a2 diff --git a/android/abi_gki_aarch64_sunxi b/android/abi_gki_aarch64_sunxi index 4b51d7f71b55..27b308dd7254 100644 --- a/android/abi_gki_aarch64_sunxi +++ b/android/abi_gki_aarch64_sunxi @@ -91,3 +91,8 @@ __tracepoint_dwc3_readl __tracepoint_dwc3_writel pinctrl_gpio_set_config + drm_is_panel_follower + drm_panel_add_follower + drm_panel_remove_follower + hid_driver_reset_resume + hid_driver_suspend From be36ded30366f3f52140daa24fc1ab18efd3f9a0 Mon Sep 17 00:00:00 2001 From: John Stultz Date: Sun, 6 Jul 2025 18:08:53 +0000 Subject: [PATCH 27/49] ANDROID: Revert "cpufreq: Avoid using inconsistent policy->min and policy->max" The combination of the cpufreq changes that came in with v6.12.28, commit 573b04722907 ("cpufreq: Avoid using inconsistent policy->min and policy->max") and commit 962d88304c3c ("cpufreq: Fix setting policy limits when frequency tables are used") unfortunately broke the KABI. The second of which was reverted in ad2b007ef43c ("Revert "cpufreq: Fix setting policy limits when frequency tables are used""). However, that change is actually a necessary fix to the first. As the refactoring to passing the max and min through the arguments couldn't be done without KABI impact, the changes to be more consistent with policy->min/max ends up introducing a subtle problem where the new max value being set ends up being clamped to the current max value - thus cpufreq max can be reduced but not increased (with the min increased but not decreased). A minimal fix of this effectively undoes the key point of commit 573b04722907, so it seems best to revert the whole thing for now. I think the small pre-existing risk of the policy->max/min values being read when shortly to an intermediate value before getting assigned the final value seems to be less problematic in practice. 
Fixes: ad2b007ef43c ("Revert "cpufreq: Fix setting policy limits when frequency tables are used"") Bug: 428984800 Signed-off-by: John Stultz Change-Id: I5a76cc2b0056071ffa26a682458df5fe0a4b83a3 --- drivers/cpufreq/cpufreq.c | 31 ++++++------------------------- 1 file changed, 6 insertions(+), 25 deletions(-) diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index d0aba74067c9..3a7bd62ef6b7 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -543,6 +543,7 @@ static unsigned int __resolve_freq(struct cpufreq_policy *policy, unsigned int idx; unsigned int old_target_freq = target_freq; + target_freq = clamp_val(target_freq, policy->min, policy->max); trace_android_vh_cpufreq_resolve_freq(policy, &target_freq, old_target_freq); if (!policy->freq_table) @@ -568,22 +569,7 @@ static unsigned int __resolve_freq(struct cpufreq_policy *policy, unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy, unsigned int target_freq) { - unsigned int min = READ_ONCE(policy->min); - unsigned int max = READ_ONCE(policy->max); - - /* - * If this function runs in parallel with cpufreq_set_policy(), it may - * read policy->min before the update and policy->max after the update - * or the other way around, so there is no ordering guarantee. - * - * Resolve this by always honoring the max (in case it comes from - * thermal throttling or similar). - */ - if (unlikely(min > max)) - min = max; - - return __resolve_freq(policy, clamp_val(target_freq, min, max), - CPUFREQ_RELATION_LE); + return __resolve_freq(policy, target_freq, CPUFREQ_RELATION_LE); } EXPORT_SYMBOL_GPL(cpufreq_driver_resolve_freq); @@ -2369,7 +2355,6 @@ int __cpufreq_driver_target(struct cpufreq_policy *policy, if (cpufreq_disabled()) return -ENODEV; - target_freq = clamp_val(target_freq, policy->min, policy->max); target_freq = __resolve_freq(policy, target_freq, relation); trace_android_vh_cpufreq_target(policy, &target_freq, old_target_freq); @@ -2662,15 +2647,11 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy, * Resolve policy min/max to available frequencies. It ensures * no frequency resolution will neither overshoot the requested maximum * nor undershoot the requested minimum. - * - * Avoid storing intermediate values in policy->max or policy->min and - * compiler optimizations around them because they may be accessed - * concurrently by cpufreq_driver_resolve_freq() during the update. */ - WRITE_ONCE(policy->max, __resolve_freq(policy, new_data.max, CPUFREQ_RELATION_H)); - new_data.min = __resolve_freq(policy, new_data.min, CPUFREQ_RELATION_L); - WRITE_ONCE(policy->min, new_data.min > policy->max ? policy->max : new_data.min); - + policy->min = new_data.min; + policy->max = new_data.max; + policy->min = __resolve_freq(policy, policy->min, CPUFREQ_RELATION_L); + policy->max = __resolve_freq(policy, policy->max, CPUFREQ_RELATION_H); trace_cpu_frequency_limits(policy); policy->cached_target_freq = UINT_MAX; From 96c29dad8f890f2f21f8f8a044eaa714cdfb92cb Mon Sep 17 00:00:00 2001 From: Yu Kuai Date: Fri, 19 Jul 2024 15:15:04 +0800 Subject: [PATCH 28/49] UPSTREAM: blk-cgroup: check for pd_(alloc|free)_fn in blkcg_activate_policy() Currently all policies implement pd_(alloc|free)_fn, however, this is not necessary for ioprio that only works for blkcg, not blkg. There are no functional changes, prepare to cleanup activating ioprio policy. 
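For illustration only (not part of this patch), a blkcg-level-only policy of the kind the new check is about would look roughly like the sketch below. The names are hypothetical, and the sketch assumes block-layer-internal code that can include block/blk-cgroup.h, as blk-ioprio.c does.

#include <linux/slab.h>
#include "blk-cgroup.h"

struct example_cpd {
        struct blkcg_policy_data cpd;
        int setting;
};

static struct blkcg_policy_data *example_cpd_alloc(gfp_t gfp)
{
        struct example_cpd *ecpd = kzalloc(sizeof(*ecpd), gfp);

        return ecpd ? &ecpd->cpd : NULL;
}

static void example_cpd_free(struct blkcg_policy_data *cpd)
{
        kfree(container_of(cpd, struct example_cpd, cpd));
}

static struct blkcg_policy example_policy = {
        .cpd_alloc_fn   = example_cpd_alloc,
        .cpd_free_fn    = example_cpd_free,
        /*
         * No pd_alloc_fn/pd_free_fn: the policy works per-blkcg, is registered
         * with blkcg_policy_register(), and is never passed to
         * blkcg_activate_policy() -- which now warns if that happens anyway.
         */
};
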
Change-Id: I47d38ac673419e9676de6f13838f55f45027d35e Signed-off-by: Yu Kuai Reviewed-by: Christoph Hellwig Acked-by: Tejun Heo Link: https://lore.kernel.org/r/20240719071506.158075-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe Bug:427107450 (cherry picked from commit ae8650b45d1837aae117fa147aeef69540bb3fe8) Reviewed-by: Zhengxu Zhang --- block/blk-cgroup.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 64551b0aa51e..91b788149381 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -1566,6 +1566,14 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol) if (blkcg_policy_enabled(q, pol)) return 0; + /* + * Policy is allowed to be registered without pd_alloc_fn/pd_free_fn, + * for example, ioprio. Such policy will work on blkcg level, not disk + * level, and don't need to be activated. + */ + if (WARN_ON_ONCE(!pol->pd_alloc_fn || !pol->pd_free_fn)) + return -EINVAL; + if (queue_is_mq(q)) blk_mq_freeze_queue(q); retry: @@ -1745,9 +1753,12 @@ int blkcg_policy_register(struct blkcg_policy *pol) goto err_unlock; } - /* Make sure cpd/pd_alloc_fn and cpd/pd_free_fn in pairs */ + /* + * Make sure cpd/pd_alloc_fn and cpd/pd_free_fn in pairs, and policy + * without pd_alloc_fn/pd_free_fn can't be activated. + */ if ((!pol->cpd_alloc_fn ^ !pol->cpd_free_fn) || - (!pol->pd_alloc_fn ^ !pol->pd_free_fn)) + (!pol->pd_alloc_fn ^ !pol->pd_free_fn)) goto err_unlock; /* register @pol */ From 250bbe1cbfafba17b58373bd17426340011bb28b Mon Sep 17 00:00:00 2001 From: "Isaac J. Manjarres" Date: Tue, 8 Jul 2025 12:14:04 -0700 Subject: [PATCH 29/49] ANDROID: GKI: Update symbol list for Pixel Watch Bug: 430323364 Bug: 411748239 Change-Id: I48a305eab7d7901a5642674d9a71bec813193470 Signed-off-by: Isaac J. Manjarres --- android/abi_gki_aarch64_pixel_watch | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/android/abi_gki_aarch64_pixel_watch b/android/abi_gki_aarch64_pixel_watch index db61c65ca0ff..a5621a839612 100644 --- a/android/abi_gki_aarch64_pixel_watch +++ b/android/abi_gki_aarch64_pixel_watch @@ -288,6 +288,7 @@ delayed_work_timer_fn destroy_workqueue dev_addr_mod + _dev_alert dev_alloc_name __dev_change_net_namespace dev_close @@ -869,6 +870,7 @@ gpiod_get_raw_value gpiod_get_raw_value_cansleep gpiod_get_value + gpiod_is_active_low gpiod_set_raw_value gpiod_set_value gpiod_set_value_cansleep @@ -2091,6 +2093,7 @@ tick_nohz_get_sleep_length timer_delete timer_delete_sync + timer_shutdown_sync topology_clear_scale_freq_source topology_update_done topology_update_thermal_pressure @@ -2171,6 +2174,10 @@ __traceiter_mmap_lock_acquire_returned __traceiter_mmap_lock_released __traceiter_mmap_lock_start_locking + __traceiter_rwmmio_post_read + __traceiter_rwmmio_post_write + __traceiter_rwmmio_read + __traceiter_rwmmio_write __traceiter_sched_overutilized_tp __traceiter_sched_switch __traceiter_sk_data_ready @@ -2246,6 +2253,10 @@ tracepoint_probe_register tracepoint_probe_register_prio tracepoint_probe_unregister + __tracepoint_rwmmio_post_read + __tracepoint_rwmmio_post_write + __tracepoint_rwmmio_read + __tracepoint_rwmmio_write __tracepoint_sched_overutilized_tp __tracepoint_sched_switch __tracepoint_sk_data_ready From f44d593749dcbd4e1013121fa615ecca412d1cb3 Mon Sep 17 00:00:00 2001 From: "T.J. 
Mercier" Date: Wed, 25 Jun 2025 20:06:55 +0000 Subject: [PATCH 30/49] ANDROID: Track per-process dmabuf RSS DMA buffers exist for sharing memory (between processes, drivers, and hardware) so they are not accounted the same way as user memory present on a MM's LRUs. Per-process attribution of dmabuf memory is not maintained by the kernel, so to obtain it from userspace, several files from procfs and sysfs must be read any time the information is desired. This process is slow, which can lead to dmabuf accounting information being out-of-date when it is desired during events like low memory, or bugreport generation, masking the cause of memory issues. This patch attributes dmabuf memory to any process that holds a reference to a buffer. A process can hold a reference to a dmabuf in two ways: 1) Through a file descriptor 2) Though a mapping A single buffer can be referenced more than once by a single process with multiple file descriptors for the same buffer, multiple mappings for the same buffer, or any combination of the two. The full size of a buffer is effectively pinned until no references exist from any process, or anywhere else in the kernel such as drivers that have imported the buffer. Even if a partial mapping of the buffer is the only reference that exists. Therefore buffer accounting is always performed in units of the full buffer size, and only once for each process, regardless of the number and type of references a process has for a single buffer. The /proc//dmabuf_rss file in procfs now reports the sum of all buffer sizes referenced by a process. The units are bytes. This allows userspace to obtain per-process dmabuf accounting information quickly compared to calculating it from multiple sources in procfs and sysfs. Note that a dmabuf can be backed by different types of memory such as system DRAM, GPU VRAM, or others. This patch makes no distinction between these different types of memory, so on systems with non-unified memory the reported values should be interpreted with this in mind. Bug: 424648392 Change-Id: I1de8e937f2971fe714008b459e410dde2a251b90 Signed-off-by: T.J. 
Mercier --- drivers/dma-buf/dma-buf.c | 141 +++++++++++++++++++++++++++++++++++++- fs/file.c | 4 ++ fs/proc/base.c | 22 ++++++ include/linux/dma-buf.h | 43 ++++++++++++ include/linux/sched.h | 4 ++ init/init_task.c | 1 + kernel/fork.c | 85 ++++++++++++++++++++++- mm/mmap.c | 14 +++- 8 files changed, 307 insertions(+), 7 deletions(-) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 0b02ced1eb33..c8c05d2e112a 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -162,9 +162,121 @@ static struct file_system_type dma_buf_fs_type = { .kill_sb = kill_anon_super, }; +static struct task_dma_buf_record *find_task_dmabuf_record( + struct task_struct *task, struct dma_buf *dmabuf) +{ + struct task_dma_buf_record *rec; + + lockdep_assert_held(&task->dmabuf_info->lock); + + list_for_each_entry(rec, &task->dmabuf_info->dmabufs, node) + if (dmabuf == rec->dmabuf) + return rec; + + return NULL; +} + +static int new_task_dmabuf_record(struct task_struct *task, struct dma_buf *dmabuf) +{ + struct task_dma_buf_record *rec; + + lockdep_assert_held(&task->dmabuf_info->lock); + + rec = kmalloc(sizeof(*rec), GFP_KERNEL); + if (!rec) + return -ENOMEM; + + task->dmabuf_info->rss += dmabuf->size; + rec->dmabuf = dmabuf; + rec->refcnt = 1; + list_add(&rec->node, &task->dmabuf_info->dmabufs); + + return 0; +} + +/** + * dma_buf_account_task - Account a dmabuf to a task + * @dmabuf: [in] pointer to dma_buf + * @task: [in] pointer to task_struct + * + * When a process obtains a dmabuf file descriptor, or maps a dmabuf, this + * function attributes the provided @dmabuf to the @task. The first time @dmabuf + * is attributed to @task, the buffer's size is added to the @task's dmabuf RSS. + * + * Return: + * * 0 on success + * * A negative error code upon error + */ +int dma_buf_account_task(struct dma_buf *dmabuf, struct task_struct *task) +{ + struct task_dma_buf_record *rec; + int ret = 0; + + if (!dmabuf || !task) + return -EINVAL; + + if (!task->dmabuf_info) { + pr_err("%s dmabuf accounting record was not allocated\n", __func__); + return -ENOMEM; + } + + spin_lock(&task->dmabuf_info->lock); + rec = find_task_dmabuf_record(task, dmabuf); + if (!rec) + ret = new_task_dmabuf_record(task, dmabuf); + else + ++rec->refcnt; + spin_unlock(&task->dmabuf_info->lock); + + return ret; +} + +/** + * dma_buf_unaccount_task - Unaccount a dmabuf from a task + * @dmabuf: [in] pointer to dma_buf + * @task: [in] pointer to task_struct + * + * When a process closes a dmabuf file descriptor, or unmaps a dmabuf, this + * function removes the provided @dmabuf attribution from the @task. When all + * references to @dmabuf are removed from @task, the buffer's size is removed + * from the task's dmabuf RSS. + * + * Return: + * * 0 on success + * * A negative error code upon error + */ +void dma_buf_unaccount_task(struct dma_buf *dmabuf, struct task_struct *task) +{ + struct task_dma_buf_record *rec; + + if (!dmabuf || !task) + return; + + if (!task->dmabuf_info) { + pr_err("%s dmabuf accounting record was not allocated\n", __func__); + return; + } + + spin_lock(&task->dmabuf_info->lock); + rec = find_task_dmabuf_record(task, dmabuf); + if (!rec) { /* Failed fd_install? 
*/ + pr_err("dmabuf not found in task list\n"); + goto err; + } + + if (--rec->refcnt == 0) { + list_del(&rec->node); + kfree(rec); + task->dmabuf_info->rss -= dmabuf->size; + } +err: + spin_unlock(&task->dmabuf_info->lock); +} + static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma) { struct dma_buf *dmabuf; + int ret; if (!is_dma_buf_file(file)) return -EINVAL; @@ -180,7 +292,15 @@ static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma) dmabuf->size >> PAGE_SHIFT) return -EINVAL; - return dmabuf->ops->mmap(dmabuf, vma); + ret = dma_buf_account_task(dmabuf, current); + if (ret) + return ret; + + ret = dmabuf->ops->mmap(dmabuf, vma); + if (ret) + dma_buf_unaccount_task(dmabuf, current); + + return ret; } static loff_t dma_buf_llseek(struct file *file, loff_t offset, int whence) @@ -557,6 +677,12 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file) spin_unlock(&dmabuf->name_lock); } +static int dma_buf_flush(struct file *file, fl_owner_t id) +{ + dma_buf_unaccount_task(file->private_data, current); + return 0; +} + static const struct file_operations dma_buf_fops = { .release = dma_buf_file_release, .mmap = dma_buf_mmap_internal, @@ -565,6 +691,7 @@ static const struct file_operations dma_buf_fops = { .unlocked_ioctl = dma_buf_ioctl, .compat_ioctl = compat_ptr_ioctl, .show_fdinfo = dma_buf_show_fdinfo, + .flush = dma_buf_flush, }; /* @@ -1555,6 +1682,8 @@ EXPORT_SYMBOL_GPL(dma_buf_end_cpu_access_partial); int dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma, unsigned long pgoff) { + int ret; + if (WARN_ON(!dmabuf || !vma)) return -EINVAL; @@ -1575,7 +1704,15 @@ int dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma, vma_set_file(vma, dmabuf->file); vma->vm_pgoff = pgoff; - return dmabuf->ops->mmap(dmabuf, vma); + ret = dma_buf_account_task(dmabuf, current); + if (ret) + return ret; + + ret = dmabuf->ops->mmap(dmabuf, vma); + if (ret) + dma_buf_unaccount_task(dmabuf, current); + + return ret; } EXPORT_SYMBOL_NS_GPL(dma_buf_mmap, DMA_BUF); diff --git a/fs/file.c b/fs/file.c index 1f1181b189bf..e924929ac366 100644 --- a/fs/file.c +++ b/fs/file.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include "internal.h" @@ -593,6 +594,9 @@ void fd_install(unsigned int fd, struct file *file) struct files_struct *files = current->files; struct fdtable *fdt; + if (is_dma_buf_file(file) && dma_buf_account_task(file->private_data, current)) + pr_err("FD dmabuf accounting failed\n"); + rcu_read_lock_sched(); if (unlikely(files->resize_in_progress)) { diff --git a/fs/proc/base.c b/fs/proc/base.c index 7cff02bc816e..f7d8188b0ccf 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -100,6 +100,7 @@ #include #include #include +#include #include #include #include "internal.h" @@ -3304,6 +3305,24 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns, } #endif /* CONFIG_STACKLEAK_METRICS */ +#ifdef CONFIG_DMA_SHARED_BUFFER +static int proc_dmabuf_rss_show(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task) +{ + if (!task->dmabuf_info) { + pr_err("%s dmabuf accounting record was not allocated\n", __func__); + return -ENOMEM; + } + + if (!(task->flags & PF_KTHREAD)) + seq_printf(m, "%lld\n", READ_ONCE(task->dmabuf_info->rss)); + else + seq_puts(m, "0\n"); + + return 0; +} +#endif + /* * Thread groups */ @@ -3427,6 +3446,9 @@ static const struct pid_entry tgid_base_stuff[] = { ONE("ksm_merging_pages", S_IRUSR, proc_pid_ksm_merging_pages), 
ONE("ksm_stat", S_IRUSR, proc_pid_ksm_stat), #endif +#ifdef CONFIG_DMA_SHARED_BUFFER + ONE("dmabuf_rss", S_IRUGO, proc_dmabuf_rss_show), +#endif }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 64d67293d76b..1647fb38fe80 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -24,6 +24,9 @@ #include #include #include +#ifndef __GENKSYMS__ +#include +#endif struct device; struct dma_buf; @@ -639,6 +642,43 @@ struct dma_buf_export_info { ANDROID_KABI_RESERVE(2); }; +/** + * struct task_dma_buf_record - Holds the number of (VMA and FD) references to a + * dmabuf by a collection of tasks that share both mm_struct and files_struct. + * This is the list entry type for @task_dma_buf_info dmabufs list. + * + * @node: Stores the list this record is on. + * @dmabuf: The dmabuf this record is for. + * @refcnt: The number of VMAs and FDs that reference @dmabuf by the tasks that + * share this record. + */ +struct task_dma_buf_record { + struct list_head node; + struct dma_buf *dmabuf; + unsigned long refcnt; +}; + +/** + * struct task_dma_buf_info - Holds a RSS counter, and a list of dmabufs for all + * tasks that share both mm_struct and files_struct. + * + * @rss: The sum of all dmabuf memory referenced by the tasks via memory + * mappings or file descriptors in bytes. Buffers referenced more than + * once by the process (multiple mmaps, multiple FDs, or any combination + * of both mmaps and FDs) only cause the buffer to be accounted to the + * process once. Partial mappings cause the full size of the buffer to be + * accounted, regardless of the size of the mapping. + * @refcnt: The number of tasks sharing this struct. + * @lock: Lock protecting writes for @rss, and reads/writes for @dmabufs. + * @dmabufs: List of all dmabufs referenced by the tasks. 
+ */ +struct task_dma_buf_info { + s64 rss; + refcount_t refcnt; + spinlock_t lock; + struct list_head dmabufs; +}; + /** * DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters * @name: export-info name @@ -741,4 +781,7 @@ int dma_buf_vmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map); void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map); long dma_buf_set_name(struct dma_buf *dmabuf, const char *name); int dma_buf_get_flags(struct dma_buf *dmabuf, unsigned long *flags); + +int dma_buf_account_task(struct dma_buf *dmabuf, struct task_struct *task); +void dma_buf_unaccount_task(struct dma_buf *dmabuf, struct task_struct *task); #endif /* __DMA_BUF_H__ */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 1299b4497d87..68ba96bde447 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -70,6 +70,7 @@ struct seq_file; struct sighand_struct; struct signal_struct; struct task_delay_info; +struct task_dma_buf_info; struct task_group; struct user_event_mm; @@ -1516,6 +1517,9 @@ struct task_struct { */ struct callback_head l1d_flush_kill; #endif + + struct task_dma_buf_info *dmabuf_info; + ANDROID_KABI_RESERVE(1); ANDROID_KABI_RESERVE(2); ANDROID_KABI_RESERVE(3); diff --git a/init/init_task.c b/init/init_task.c index 31ceb0e469f7..d80c007ab59b 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -214,6 +214,7 @@ struct task_struct init_task .android_vendor_data1 = {0, }, .android_oem_data1 = {0, }, #endif + .dmabuf_info = NULL, }; EXPORT_SYMBOL(init_task); diff --git a/kernel/fork.c b/kernel/fork.c index 75b1a4458a7e..66636a979911 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -101,6 +101,7 @@ #include #include #include +#include #include #include @@ -994,12 +995,32 @@ static inline void put_signal_struct(struct signal_struct *sig) free_signal_struct(sig); } +static void put_dmabuf_info(struct task_struct *tsk) +{ + if (!tsk->dmabuf_info) { + pr_err("%s dmabuf accounting record was not allocated\n", __func__); + return; + } + + if (!refcount_dec_and_test(&tsk->dmabuf_info->refcnt)) + return; + + if (READ_ONCE(tsk->dmabuf_info->rss)) + pr_err("%s destroying task with non-zero dmabuf rss\n", __func__); + + if (!list_empty(&tsk->dmabuf_info->dmabufs)) + pr_err("%s destroying task with non-empty dmabuf list\n", __func__); + + kfree(tsk->dmabuf_info); +} + void __put_task_struct(struct task_struct *tsk) { WARN_ON(!tsk->exit_state); WARN_ON(refcount_read(&tsk->usage)); WARN_ON(tsk == current); + put_dmabuf_info(tsk); io_uring_free(tsk); cgroup_free(tsk); task_numa_free(tsk, true); @@ -2268,6 +2289,58 @@ static void rv_task_fork(struct task_struct *p) #define rv_task_fork(p) do {} while (0) #endif +static int copy_dmabuf_info(u64 clone_flags, struct task_struct *p) +{ + struct task_dma_buf_record *rec, *copy; + + if (current->dmabuf_info && (clone_flags & (CLONE_VM | CLONE_FILES)) + == (CLONE_VM | CLONE_FILES)) { + /* + * Both MM and FD references to dmabufs are shared with the parent, so + * we can share a RSS counter with the parent. 
+ */ + refcount_inc(¤t->dmabuf_info->refcnt); + p->dmabuf_info = current->dmabuf_info; + return 0; + } + + p->dmabuf_info = kmalloc(sizeof(*p->dmabuf_info), GFP_KERNEL); + if (!p->dmabuf_info) + return -ENOMEM; + + refcount_set(&p->dmabuf_info->refcnt, 1); + spin_lock_init(&p->dmabuf_info->lock); + INIT_LIST_HEAD(&p->dmabuf_info->dmabufs); + if (current->dmabuf_info) { + spin_lock(¤t->dmabuf_info->lock); + p->dmabuf_info->rss = current->dmabuf_info->rss; + list_for_each_entry(rec, ¤t->dmabuf_info->dmabufs, node) { + copy = kmalloc(sizeof(*copy), GFP_KERNEL); + if (!copy) { + spin_unlock(¤t->dmabuf_info->lock); + goto err_list_copy; + } + + copy->dmabuf = rec->dmabuf; + copy->refcnt = rec->refcnt; + list_add(©->node, &p->dmabuf_info->dmabufs); + } + spin_unlock(¤t->dmabuf_info->lock); + } else { + p->dmabuf_info->rss = 0; + } + + return 0; + +err_list_copy: + list_for_each_entry_safe(rec, copy, &p->dmabuf_info->dmabufs, node) { + list_del(&rec->node); + kfree(rec); + } + kfree(p->dmabuf_info); + return -ENOMEM; +} + /* * This creates a new process as a copy of the old one, * but does not actually start it yet. @@ -2509,14 +2582,18 @@ __latent_entropy struct task_struct *copy_process( p->bpf_ctx = NULL; #endif - /* Perform scheduler related setup. Assign this task to a CPU. */ - retval = sched_fork(clone_flags, p); + retval = copy_dmabuf_info(clone_flags, p); if (retval) goto bad_fork_cleanup_policy; + /* Perform scheduler related setup. Assign this task to a CPU. */ + retval = sched_fork(clone_flags, p); + if (retval) + goto bad_fork_cleanup_dmabuf; + retval = perf_event_init_task(p, clone_flags); if (retval) - goto bad_fork_cleanup_policy; + goto bad_fork_cleanup_dmabuf; retval = audit_alloc(p); if (retval) goto bad_fork_cleanup_perf; @@ -2819,6 +2896,8 @@ bad_fork_cleanup_audit: audit_free(p); bad_fork_cleanup_perf: perf_event_free_task(p); +bad_fork_cleanup_dmabuf: + put_dmabuf_info(p); bad_fork_cleanup_policy: lockdep_free_task(p); #ifdef CONFIG_NUMA diff --git a/mm/mmap.c b/mm/mmap.c index 4c74fb3d7a94..6da684ab9f98 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -49,6 +49,7 @@ #include #include #include +#include #include #include @@ -144,8 +145,11 @@ static void remove_vma(struct vm_area_struct *vma, bool unreachable) { might_sleep(); vma_close(vma); - if (vma->vm_file) + if (vma->vm_file) { + if (is_dma_buf_file(vma->vm_file)) + dma_buf_unaccount_task(vma->vm_file->private_data, current); fput(vma->vm_file); + } mpol_put(vma_policy(vma)); if (unreachable) __vm_area_free(vma); @@ -2417,8 +2421,14 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, if (err) goto out_free_mpol; - if (new->vm_file) + if (new->vm_file) { get_file(new->vm_file); + if (is_dma_buf_file(new->vm_file)) { + /* Should never fail since this task already references the buffer */ + if (dma_buf_account_task(new->vm_file->private_data, current)) + pr_err("%s failed to account dmabuf\n", __func__); + } + } if (new->vm_ops && new->vm_ops->open) new->vm_ops->open(new); From bddab7cf5de4a43346bc8e6803b20738b6d9e1cb Mon Sep 17 00:00:00 2001 From: "T.J. Mercier" Date: Wed, 25 Jun 2025 21:15:34 +0000 Subject: [PATCH 31/49] ANDROID: Track per-process dmabuf RSS HWM A per-process high watermark counter for dmabuf memory is useful for detecting bursty / transient allocations causing memory pressure spikes that don't appear in the dmabuf RSS counter when userspace reacts to memory pressure and reads RSS after buffers have already been freed. 
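As a rough illustration of the problem above (not part of this patch): a userspace monitor that only samples dmabuf_rss when it reacts to memory pressure can miss a transient peak entirely, while the dmabuf_rss_hwm file described just below still preserves it. A minimal C sketch of such a reader, assuming a hypothetical target pid passed on the command line and only trivial error handling:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Read one of the per-process dmabuf counters added by this series. */
static long long read_dmabuf_counter(pid_t pid, const char *name)
{
	char path[64];
	long long val = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/%s", (int)pid, name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%lld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(int argc, char **argv)
{
	pid_t pid;

	if (argc < 2)
		return 1;
	pid = (pid_t)atoi(argv[1]);

	/*
	 * By the time a pressure event is handled, transient buffers may
	 * already have been freed, so dmabuf_rss can understate the spike;
	 * the high watermark still reflects the peak.
	 */
	printf("dmabuf_rss:     %lld bytes\n", read_dmabuf_counter(pid, "dmabuf_rss"));
	printf("dmabuf_rss_hwm: %lld bytes\n", read_dmabuf_counter(pid, "dmabuf_rss_hwm"));
	return 0;
}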
The /proc//dmabuf_rss_hwm file in procfs now reports the maximum value of /proc//dmabuf_rss during the lifetime of the process. The value of /proc//dmabuf_rss_hwm can be reset to the current value of /proc//dmabuf_rss by writing "0" to the file. Bug: 424648392 Change-Id: I184d83d48ec63b805b712f19e121199a63095965 Signed-off-by: T.J. Mercier --- drivers/dma-buf/dma-buf.c | 8 ++++ fs/proc/base.c | 77 +++++++++++++++++++++++++++++++++++++++ include/linux/dma-buf.h | 7 +++- kernel/fork.c | 2 + 4 files changed, 92 insertions(+), 2 deletions(-) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index c8c05d2e112a..7c9ac163d115 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -187,6 +187,14 @@ static int new_task_dmabuf_record(struct task_struct *task, struct dma_buf *dmab return -ENOMEM; task->dmabuf_info->rss += dmabuf->size; + /* + * task->dmabuf_info->lock protects against concurrent writers, so no + * worries about stale rss_hwm between the read and write, and we don't + * need to cmpxchg here. + */ + if (task->dmabuf_info->rss > task->dmabuf_info->rss_hwm) + task->dmabuf_info->rss_hwm = task->dmabuf_info->rss; + rec->dmabuf = dmabuf; rec->refcnt = 1; list_add(&rec->node, &task->dmabuf_info->dmabufs); diff --git a/fs/proc/base.c b/fs/proc/base.c index f7d8188b0ccf..6b91ddcab7e2 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3321,6 +3321,82 @@ static int proc_dmabuf_rss_show(struct seq_file *m, struct pid_namespace *ns, return 0; } + +static int proc_dmabuf_rss_hwm_show(struct seq_file *m, void *v) +{ + struct inode *inode = m->private; + struct task_struct *task; + int ret = 0; + + task = get_proc_task(inode); + if (!task) + return -ESRCH; + + if (!task->dmabuf_info) { + pr_err("%s dmabuf accounting record was not allocated\n", __func__); + ret = -ENOMEM; + goto out; + } + + if (!(task->flags & PF_KTHREAD)) + seq_printf(m, "%lld\n", READ_ONCE(task->dmabuf_info->rss_hwm)); + else + seq_puts(m, "0\n"); + +out: + put_task_struct(task); + + return ret; +} + +static int proc_dmabuf_rss_hwm_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, proc_dmabuf_rss_hwm_show, inode); +} + +static ssize_t +proc_dmabuf_rss_hwm_write(struct file *file, const char __user *buf, + size_t count, loff_t *offset) +{ + struct inode *inode = file_inode(file); + struct task_struct *task; + unsigned long long val; + int ret; + + ret = kstrtoull_from_user(buf, count, 10, &val); + if (ret) + return ret; + + if (val != 0) + return -EINVAL; + + task = get_proc_task(inode); + if (!task) + return -ESRCH; + + if (!task->dmabuf_info) { + pr_err("%s dmabuf accounting record was not allocated\n", __func__); + ret = -ENOMEM; + goto out; + } + + spin_lock(&task->dmabuf_info->lock); + task->dmabuf_info->rss_hwm = task->dmabuf_info->rss; + spin_unlock(&task->dmabuf_info->lock); + +out: + put_task_struct(task); + + return ret < 0 ? 
ret : count; +} + +static const struct file_operations proc_dmabuf_rss_hwm_operations = { + .open = proc_dmabuf_rss_hwm_open, + .write = proc_dmabuf_rss_hwm_write, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; #endif /* @@ -3448,6 +3524,7 @@ static const struct pid_entry tgid_base_stuff[] = { #endif #ifdef CONFIG_DMA_SHARED_BUFFER ONE("dmabuf_rss", S_IRUGO, proc_dmabuf_rss_show), + REG("dmabuf_rss_hwm", S_IRUGO|S_IWUSR, proc_dmabuf_rss_hwm_operations), #endif }; diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 1647fb38fe80..a362c8ba7a21 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -659,8 +659,8 @@ struct task_dma_buf_record { }; /** - * struct task_dma_buf_info - Holds a RSS counter, and a list of dmabufs for all - * tasks that share both mm_struct and files_struct. + * struct task_dma_buf_info - Holds RSS and RSS HWM counters, and a list of + * dmabufs for all tasks that share both mm_struct and files_struct. * * @rss: The sum of all dmabuf memory referenced by the tasks via memory * mappings or file descriptors in bytes. Buffers referenced more than @@ -668,12 +668,15 @@ struct task_dma_buf_record { * of both mmaps and FDs) only cause the buffer to be accounted to the * process once. Partial mappings cause the full size of the buffer to be * accounted, regardless of the size of the mapping. + * @rss_hwm: The maximum value of @rss over the lifetime of this struct. (Unless, + * reset by userspace.) * @refcnt: The number of tasks sharing this struct. * @lock: Lock protecting writes for @rss, and reads/writes for @dmabufs. * @dmabufs: List of all dmabufs referenced by the tasks. */ struct task_dma_buf_info { s64 rss; + s64 rss_hwm; refcount_t refcnt; spinlock_t lock; struct list_head dmabufs; diff --git a/kernel/fork.c b/kernel/fork.c index 66636a979911..e1d7d244d43a 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2314,6 +2314,7 @@ static int copy_dmabuf_info(u64 clone_flags, struct task_struct *p) if (current->dmabuf_info) { spin_lock(¤t->dmabuf_info->lock); p->dmabuf_info->rss = current->dmabuf_info->rss; + p->dmabuf_info->rss_hwm = current->dmabuf_info->rss; list_for_each_entry(rec, ¤t->dmabuf_info->dmabufs, node) { copy = kmalloc(sizeof(*copy), GFP_KERNEL); if (!copy) { @@ -2328,6 +2329,7 @@ static int copy_dmabuf_info(u64 clone_flags, struct task_struct *p) spin_unlock(¤t->dmabuf_info->lock); } else { p->dmabuf_info->rss = 0; + p->dmabuf_info->rss_hwm = 0; } return 0; From 0bf76c5311039778424e741617a0dd14b77f1763 Mon Sep 17 00:00:00 2001 From: "T.J. Mercier" Date: Tue, 8 Jul 2025 22:58:14 +0000 Subject: [PATCH 32/49] ANDROID: Track per-process dmabuf PSS DMA buffers exist for sharing memory, so dividing a buffer's size by the number of processes with references to it to obtain proportional set size is a useful metric for understanding an individual process's share of system-wide dmabuf memory. Dmabuf memory is not guaranteed to be representable by struct pages, and a process may hold only file descriptor references to a buffer. So PSS cannot be calculated on a per-page basis, and PSS accounting is always performed in units of the full buffer size, and only once for each process regardless of the number and type of references a process has for a single buffer. The /proc//dmabuf_pss file in procfs now reports the sum of all buffer PSS values referenced by a process. The units are bytes. 
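As a worked illustration of the arithmetic described above (with made-up numbers, not taken from the patch): a 12 MiB buffer referenced by three processes contributes 12 MiB / 3 = 4 MiB to each of those processes' dmabuf_pss, no matter how many FDs or mappings each process holds for it. A minimal C sketch of the same per-buffer math, mirroring the size / num_unique_refs division performed by the kernel side of this patch; the buffer sizes and reference counts here are hypothetical:

#include <stdio.h>

struct buf_example {
	unsigned long long size;	/* full buffer size in bytes */
	long long num_unique_refs;	/* processes referencing the buffer */
};

int main(void)
{
	/* Hypothetical buffers referenced by one process. */
	struct buf_example bufs[] = {
		{ 12ULL << 20, 3 },	/* 12 MiB shared by three processes */
		{  4ULL << 20, 1 },	/*  4 MiB referenced only by this one */
	};
	unsigned long long pss = 0;
	size_t i;

	for (i = 0; i < sizeof(bufs) / sizeof(bufs[0]); i++) {
		if (bufs[i].num_unique_refs <= 0)
			continue;	/* skip inconsistent counts, as the kernel code does */
		/* Full buffer size divided by the number of referencing processes. */
		pss += bufs[i].size / (unsigned long long)bufs[i].num_unique_refs;
	}
	printf("dmabuf_pss: %llu bytes\n", pss);	/* 4 MiB + 4 MiB = 8 MiB */
	return 0;
}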
This allows userspace to obtain per-process dmabuf accounting information quickly compared to calculating it from multiple sources in procfs and sysfs. Note that a dmabuf can be backed by different types of memory such as system DRAM, GPU VRAM, or others. This patch makes no distinction between these different types of memory, so on systems with non-unified memory the reported values should be interpreted with this in mind. Bug: 424648392 Change-Id: I8ec370b0d7fd37e69f677c6f580940c89cc03a42 Signed-off-by: T.J. Mercier --- drivers/dma-buf/dma-buf.c | 8 ++++++++ fs/proc/base.c | 34 ++++++++++++++++++++++++++++++++++ include/linux/dma-buf.h | 8 ++++++++ 3 files changed, 50 insertions(+) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 7c9ac163d115..cb91dadeb465 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -115,6 +115,9 @@ static void dma_buf_release(struct dentry *dentry) if (dmabuf->resv == (struct dma_resv *)&dmabuf[1]) dma_resv_fini(dmabuf->resv); + if (atomic64_read(&dmabuf->num_unique_refs)) + pr_err("destroying dmabuf with non-zero task refs\n"); + WARN_ON(!list_empty(&dmabuf->attachments)); module_put(dmabuf->owner); kfree(dmabuf->name); @@ -199,6 +202,8 @@ static int new_task_dmabuf_record(struct task_struct *task, struct dma_buf *dmab rec->refcnt = 1; list_add(&rec->node, &task->dmabuf_info->dmabufs); + atomic64_inc(&dmabuf->num_unique_refs); + return 0; } @@ -276,6 +281,7 @@ void dma_buf_unaccount_task(struct dma_buf *dmabuf, struct task_struct *task) list_del(&rec->node); kfree(rec); task->dmabuf_info->rss -= dmabuf->size; + atomic64_dec(&dmabuf->num_unique_refs); } err: spin_unlock(&task->dmabuf_info->lock); @@ -851,6 +857,8 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info) dmabuf->resv = resv; } + atomic64_set(&dmabuf->num_unique_refs, 0); + file->private_data = dmabuf; file->f_path.dentry->d_fsdata = dmabuf; dmabuf->file = file; diff --git a/fs/proc/base.c b/fs/proc/base.c index 6b91ddcab7e2..2eee67e06ffe 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3397,6 +3397,39 @@ static const struct file_operations proc_dmabuf_rss_hwm_operations = { .llseek = seq_lseek, .release = single_release, }; + +static int proc_dmabuf_pss_show(struct seq_file *m, struct pid_namespace *ns, + struct pid *pid, struct task_struct *task) +{ + struct task_dma_buf_record *rec; + u64 pss = 0; + + if (!task->dmabuf_info) { + pr_err("%s dmabuf accounting record was not allocated\n", __func__); + return -ENOMEM; + } + + if (!(task->flags & PF_KTHREAD)) { + spin_lock(&task->dmabuf_info->lock); + list_for_each_entry(rec, &task->dmabuf_info->dmabufs, node) { + s64 refs = atomic64_read(&rec->dmabuf->num_unique_refs); + + if (refs <= 0) { + pr_err("dmabuf has <= refs %lld\n", refs); + continue; + } + + pss += rec->dmabuf->size / (size_t)refs; + } + spin_unlock(&task->dmabuf_info->lock); + + seq_printf(m, "%llu\n", pss); + } else { + seq_puts(m, "0\n"); + } + + return 0; +} #endif /* @@ -3525,6 +3558,7 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_DMA_SHARED_BUFFER ONE("dmabuf_rss", S_IRUGO, proc_dmabuf_rss_show), REG("dmabuf_rss_hwm", S_IRUGO|S_IWUSR, proc_dmabuf_rss_hwm_operations), + ONE("dmabuf_pss", S_IRUGO, proc_dmabuf_pss_show), #endif }; diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index a362c8ba7a21..267bf322272f 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -25,6 +25,7 @@ #include #include #ifndef __GENKSYMS__ +#include #include #endif @@ -534,6 +535,13 @@ 
struct dma_buf { } *sysfs_entry; #endif + /** + * @num_unique_refs: + * + * The number of tasks that reference this buffer. For calculating PSS. + */ + atomic64_t num_unique_refs; + ANDROID_KABI_RESERVE(1); ANDROID_KABI_RESERVE(2); }; From 59af12872db84137ca14525d864249b32a0ceebb Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Thu, 3 Jul 2025 20:22:25 +0000 Subject: [PATCH 33/49] ANDROID: fixup task_struct to avoid ABI breakage Reuse task_struct.worker_private to store task_dma_buf_info pointer and avoid adding new task_struct members that would lead to ABI breakage. This aliasing works because task_struct.worker_private is used only for kthreads and io_workers which task_dma_buf_info is used for user tasks. Bug: 424648392 Change-Id: I2caa708d8a729095b308932c1b35c3157835639b Signed-off-by: Suren Baghdasaryan --- drivers/dma-buf/dma-buf.c | 69 ++++++++++++++++++------- fs/proc/base.c | 104 +++++++++++++++++++++++--------------- include/linux/dma-buf.h | 22 ++++++++ include/linux/sched.h | 4 +- init/init_task.c | 2 +- kernel/fork.c | 69 +++++++++++++++---------- 6 files changed, 180 insertions(+), 90 deletions(-) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index cb91dadeb465..5b3e3fdc1599 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -168,11 +168,21 @@ static struct file_system_type dma_buf_fs_type = { static struct task_dma_buf_record *find_task_dmabuf_record( struct task_struct *task, struct dma_buf *dmabuf) { + struct task_dma_buf_info *dmabuf_info = get_task_dma_buf_info(task); struct task_dma_buf_record *rec; - lockdep_assert_held(&task->dmabuf_info->lock); + if (!dmabuf_info) + return NULL; - list_for_each_entry(rec, &task->dmabuf_info->dmabufs, node) + if (IS_ERR(dmabuf_info)) { + pr_err("%s dmabuf accounting record is missing, error %ld\n", + __func__, PTR_ERR(dmabuf_info)); + return NULL; + } + + lockdep_assert_held(&dmabuf_info->lock); + + list_for_each_entry(rec, &dmabuf_info->dmabufs, node) if (dmabuf == rec->dmabuf) return rec; @@ -181,26 +191,36 @@ static struct task_dma_buf_record *find_task_dmabuf_record( static int new_task_dmabuf_record(struct task_struct *task, struct dma_buf *dmabuf) { + struct task_dma_buf_info *dmabuf_info = get_task_dma_buf_info(task); struct task_dma_buf_record *rec; - lockdep_assert_held(&task->dmabuf_info->lock); + if (!dmabuf_info) + return 0; + + if (IS_ERR(dmabuf_info)) { + pr_err("%s dmabuf accounting record is missing, error %ld\n", + __func__, PTR_ERR(dmabuf_info)); + return PTR_ERR(dmabuf_info); + } + + lockdep_assert_held(&dmabuf_info->lock); rec = kmalloc(sizeof(*rec), GFP_KERNEL); if (!rec) return -ENOMEM; - task->dmabuf_info->rss += dmabuf->size; + dmabuf_info->rss += dmabuf->size; /* - * task->dmabuf_info->lock protects against concurrent writers, so no + * dmabuf_info->lock protects against concurrent writers, so no * worries about stale rss_hwm between the read and write, and we don't * need to cmpxchg here. 
*/ - if (task->dmabuf_info->rss > task->dmabuf_info->rss_hwm) - task->dmabuf_info->rss_hwm = task->dmabuf_info->rss; + if (dmabuf_info->rss > dmabuf_info->rss_hwm) + dmabuf_info->rss_hwm = dmabuf_info->rss; rec->dmabuf = dmabuf; rec->refcnt = 1; - list_add(&rec->node, &task->dmabuf_info->dmabufs); + list_add(&rec->node, &dmabuf_info->dmabufs); atomic64_inc(&dmabuf->num_unique_refs); @@ -222,24 +242,30 @@ static int new_task_dmabuf_record(struct task_struct *task, struct dma_buf *dmab */ int dma_buf_account_task(struct dma_buf *dmabuf, struct task_struct *task) { + struct task_dma_buf_info *dmabuf_info; struct task_dma_buf_record *rec; int ret = 0; if (!dmabuf || !task) return -EINVAL; - if (!task->dmabuf_info) { - pr_err("%s dmabuf accounting record was not allocated\n", __func__); - return -ENOMEM; + dmabuf_info = get_task_dma_buf_info(task); + if (!dmabuf_info) + return 0; + + if (IS_ERR(dmabuf_info)) { + pr_err("%s dmabuf accounting record is missing, error %ld\n", + __func__, PTR_ERR(dmabuf_info)); + return PTR_ERR(dmabuf_info); } - spin_lock(&task->dmabuf_info->lock); + spin_lock(&dmabuf_info->lock); rec = find_task_dmabuf_record(task, dmabuf); if (!rec) ret = new_task_dmabuf_record(task, dmabuf); else ++rec->refcnt; - spin_unlock(&task->dmabuf_info->lock); + spin_unlock(&dmabuf_info->lock); return ret; } @@ -260,17 +286,22 @@ int dma_buf_account_task(struct dma_buf *dmabuf, struct task_struct *task) */ void dma_buf_unaccount_task(struct dma_buf *dmabuf, struct task_struct *task) { + struct task_dma_buf_info *dmabuf_info = get_task_dma_buf_info(task); struct task_dma_buf_record *rec; if (!dmabuf || !task) return; - if (!task->dmabuf_info) { - pr_err("%s dmabuf accounting record was not allocated\n", __func__); - return; + if (!dmabuf_info) + return; + + if (IS_ERR(dmabuf_info)) { + pr_err("%s dmabuf accounting record is missing, error %ld\n", + __func__, PTR_ERR(dmabuf_info)); + return; } - spin_lock(&task->dmabuf_info->lock); + spin_lock(&dmabuf_info->lock); rec = find_task_dmabuf_record(task, dmabuf); if (!rec) { /* Failed fd_install? 
*/ pr_err("dmabuf not found in task list\n"); @@ -280,11 +311,11 @@ void dma_buf_unaccount_task(struct dma_buf *dmabuf, struct task_struct *task) if (--rec->refcnt == 0) { list_del(&rec->node); kfree(rec); - task->dmabuf_info->rss -= dmabuf->size; + dmabuf_info->rss -= dmabuf->size; atomic64_dec(&dmabuf->num_unique_refs); } err: - spin_unlock(&task->dmabuf_info->lock); + spin_unlock(&dmabuf_info->lock); } static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma) diff --git a/fs/proc/base.c b/fs/proc/base.c index 2eee67e06ffe..0a3f28f7f1d9 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3309,21 +3309,27 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns, static int proc_dmabuf_rss_show(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task) { - if (!task->dmabuf_info) { - pr_err("%s dmabuf accounting record was not allocated\n", __func__); - return -ENOMEM; + struct task_dma_buf_info *dmabuf_info = get_task_dma_buf_info(task); + + if (!dmabuf_info) { + seq_puts(m, "0\n"); + return 0; } - if (!(task->flags & PF_KTHREAD)) - seq_printf(m, "%lld\n", READ_ONCE(task->dmabuf_info->rss)); - else - seq_puts(m, "0\n"); + if (IS_ERR(dmabuf_info)) { + pr_err("%s dmabuf accounting record is missing, error %ld\n", + __func__, PTR_ERR(dmabuf_info)); + return PTR_ERR(dmabuf_info); + } + + seq_printf(m, "%lld\n", READ_ONCE(dmabuf_info->rss)); return 0; } static int proc_dmabuf_rss_hwm_show(struct seq_file *m, void *v) { + struct task_dma_buf_info *dmabuf_info; struct inode *inode = m->private; struct task_struct *task; int ret = 0; @@ -3332,16 +3338,20 @@ static int proc_dmabuf_rss_hwm_show(struct seq_file *m, void *v) if (!task) return -ESRCH; - if (!task->dmabuf_info) { - pr_err("%s dmabuf accounting record was not allocated\n", __func__); - ret = -ENOMEM; + dmabuf_info = get_task_dma_buf_info(task); + if (!dmabuf_info) { + seq_puts(m, "0\n"); goto out; } - if (!(task->flags & PF_KTHREAD)) - seq_printf(m, "%lld\n", READ_ONCE(task->dmabuf_info->rss_hwm)); - else - seq_puts(m, "0\n"); + if (IS_ERR(dmabuf_info)) { + pr_err("%s dmabuf accounting record is missing, error %ld\n", + __func__, PTR_ERR(dmabuf_info)); + ret = PTR_ERR(dmabuf_info); + goto out; + } + + seq_printf(m, "%lld\n", READ_ONCE(dmabuf_info->rss_hwm)); out: put_task_struct(task); @@ -3358,6 +3368,7 @@ static ssize_t proc_dmabuf_rss_hwm_write(struct file *file, const char __user *buf, size_t count, loff_t *offset) { + struct task_dma_buf_info *dmabuf_info; struct inode *inode = file_inode(file); struct task_struct *task; unsigned long long val; @@ -3374,15 +3385,22 @@ proc_dmabuf_rss_hwm_write(struct file *file, const char __user *buf, if (!task) return -ESRCH; - if (!task->dmabuf_info) { - pr_err("%s dmabuf accounting record was not allocated\n", __func__); - ret = -ENOMEM; + dmabuf_info = get_task_dma_buf_info(task); + if (!dmabuf_info) { + ret = -EINVAL; goto out; } - spin_lock(&task->dmabuf_info->lock); - task->dmabuf_info->rss_hwm = task->dmabuf_info->rss; - spin_unlock(&task->dmabuf_info->lock); + if (IS_ERR(dmabuf_info)) { + pr_err("%s dmabuf accounting record is missing, error %ld\n", + __func__, PTR_ERR(dmabuf_info)); + ret = PTR_ERR(dmabuf_info); + goto out; + } + + spin_lock(&dmabuf_info->lock); + dmabuf_info->rss_hwm = dmabuf_info->rss; + spin_unlock(&dmabuf_info->lock); out: put_task_struct(task); @@ -3401,33 +3419,37 @@ static const struct file_operations proc_dmabuf_rss_hwm_operations = { static int proc_dmabuf_pss_show(struct seq_file *m, struct 
pid_namespace *ns, struct pid *pid, struct task_struct *task) { + struct task_dma_buf_info *dmabuf_info; struct task_dma_buf_record *rec; u64 pss = 0; - if (!task->dmabuf_info) { - pr_err("%s dmabuf accounting record was not allocated\n", __func__); - return -ENOMEM; - } - - if (!(task->flags & PF_KTHREAD)) { - spin_lock(&task->dmabuf_info->lock); - list_for_each_entry(rec, &task->dmabuf_info->dmabufs, node) { - s64 refs = atomic64_read(&rec->dmabuf->num_unique_refs); - - if (refs <= 0) { - pr_err("dmabuf has <= refs %lld\n", refs); - continue; - } - - pss += rec->dmabuf->size / (size_t)refs; - } - spin_unlock(&task->dmabuf_info->lock); - - seq_printf(m, "%llu\n", pss); - } else { + dmabuf_info = get_task_dma_buf_info(task); + if (!dmabuf_info) { seq_puts(m, "0\n"); + return 0; } + if (IS_ERR(dmabuf_info)) { + pr_err("%s dmabuf accounting record is missing, error %ld\n", + __func__, PTR_ERR(dmabuf_info)); + return PTR_ERR(dmabuf_info); + } + + spin_lock(&dmabuf_info->lock); + list_for_each_entry(rec, &dmabuf_info->dmabufs, node) { + s64 refs = atomic64_read(&rec->dmabuf->num_unique_refs); + + if (refs <= 0) { + pr_err("dmabuf has <= refs %lld\n", refs); + continue; + } + + pss += rec->dmabuf->size / (size_t)refs; + } + spin_unlock(&dmabuf_info->lock); + + seq_printf(m, "%llu\n", pss); + return 0; } #endif diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 267bf322272f..654085da8bc4 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -690,6 +690,28 @@ struct task_dma_buf_info { struct list_head dmabufs; }; +static inline bool task_has_dma_buf_info(struct task_struct *task) +{ + return (task->flags & (PF_KTHREAD | PF_IO_WORKER)) == 0; +} + +extern struct task_struct init_task; + +static inline +struct task_dma_buf_info *get_task_dma_buf_info(struct task_struct *task) +{ + if (!task) + return ERR_PTR(-EINVAL); + + if (!task_has_dma_buf_info(task)) + return NULL; + + if (!task->worker_private) + return ERR_PTR(-ENOMEM); + + return (struct task_dma_buf_info *)task->worker_private; +} + /** * DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters * @name: export-info name diff --git a/include/linux/sched.h b/include/linux/sched.h index 68ba96bde447..3cff2446536d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1003,6 +1003,7 @@ struct task_struct { int __user *clear_child_tid; /* PF_KTHREAD | PF_IO_WORKER */ + /* Otherwise used as task_dma_buf_info pointer */ void *worker_private; u64 utime; @@ -1517,9 +1518,6 @@ struct task_struct { */ struct callback_head l1d_flush_kill; #endif - - struct task_dma_buf_info *dmabuf_info; - ANDROID_KABI_RESERVE(1); ANDROID_KABI_RESERVE(2); ANDROID_KABI_RESERVE(3); diff --git a/init/init_task.c b/init/init_task.c index d80c007ab59b..1903a2abde55 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -214,7 +214,7 @@ struct task_struct init_task .android_vendor_data1 = {0, }, .android_oem_data1 = {0, }, #endif - .dmabuf_info = NULL, + .worker_private = NULL, }; EXPORT_SYMBOL(init_task); diff --git a/kernel/fork.c b/kernel/fork.c index e1d7d244d43a..9c71a69e0d17 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -997,21 +997,27 @@ static inline void put_signal_struct(struct signal_struct *sig) static void put_dmabuf_info(struct task_struct *tsk) { - if (!tsk->dmabuf_info) { - pr_err("%s dmabuf accounting record was not allocated\n", __func__); + struct task_dma_buf_info *dmabuf_info = get_task_dma_buf_info(tsk); + + if (!dmabuf_info) + return; + + if (IS_ERR(dmabuf_info)) { + pr_err("%s dmabuf accounting record 
is missing, error %ld\n", + __func__, PTR_ERR(dmabuf_info)); return; } - if (!refcount_dec_and_test(&tsk->dmabuf_info->refcnt)) + if (!refcount_dec_and_test(&dmabuf_info->refcnt)) return; - if (READ_ONCE(tsk->dmabuf_info->rss)) + if (READ_ONCE(dmabuf_info->rss)) pr_err("%s destroying task with non-zero dmabuf rss\n", __func__); - if (!list_empty(&tsk->dmabuf_info->dmabufs)) + if (!list_empty(&dmabuf_info->dmabufs)) pr_err("%s destroying task with non-empty dmabuf list\n", __func__); - kfree(tsk->dmabuf_info); + kfree(dmabuf_info); } void __put_task_struct(struct task_struct *tsk) @@ -2291,55 +2297,66 @@ static void rv_task_fork(struct task_struct *p) static int copy_dmabuf_info(u64 clone_flags, struct task_struct *p) { + struct task_dma_buf_info *new_dmabuf_info; + struct task_dma_buf_info *dmabuf_info; struct task_dma_buf_record *rec, *copy; - if (current->dmabuf_info && (clone_flags & (CLONE_VM | CLONE_FILES)) + if (!task_has_dma_buf_info(p)) + return 0; /* Task is not supposed to have dmabuf_info */ + + dmabuf_info = get_task_dma_buf_info(current); + /* Original might not have dmabuf_info and that's fine */ + if (IS_ERR(dmabuf_info)) + dmabuf_info = NULL; + + if (dmabuf_info && (clone_flags & (CLONE_VM | CLONE_FILES)) == (CLONE_VM | CLONE_FILES)) { /* * Both MM and FD references to dmabufs are shared with the parent, so * we can share a RSS counter with the parent. */ - refcount_inc(¤t->dmabuf_info->refcnt); - p->dmabuf_info = current->dmabuf_info; + refcount_inc(&dmabuf_info->refcnt); + p->worker_private = dmabuf_info; return 0; } - p->dmabuf_info = kmalloc(sizeof(*p->dmabuf_info), GFP_KERNEL); - if (!p->dmabuf_info) + new_dmabuf_info = kmalloc(sizeof(*new_dmabuf_info), GFP_KERNEL); + if (!new_dmabuf_info) return -ENOMEM; - refcount_set(&p->dmabuf_info->refcnt, 1); - spin_lock_init(&p->dmabuf_info->lock); - INIT_LIST_HEAD(&p->dmabuf_info->dmabufs); - if (current->dmabuf_info) { - spin_lock(¤t->dmabuf_info->lock); - p->dmabuf_info->rss = current->dmabuf_info->rss; - p->dmabuf_info->rss_hwm = current->dmabuf_info->rss; - list_for_each_entry(rec, ¤t->dmabuf_info->dmabufs, node) { + refcount_set(&new_dmabuf_info->refcnt, 1); + spin_lock_init(&new_dmabuf_info->lock); + INIT_LIST_HEAD(&new_dmabuf_info->dmabufs); + if (dmabuf_info) { + spin_lock(&dmabuf_info->lock); + new_dmabuf_info->rss = dmabuf_info->rss; + new_dmabuf_info->rss_hwm = dmabuf_info->rss; + list_for_each_entry(rec, &dmabuf_info->dmabufs, node) { copy = kmalloc(sizeof(*copy), GFP_KERNEL); if (!copy) { - spin_unlock(¤t->dmabuf_info->lock); + spin_unlock(&dmabuf_info->lock); goto err_list_copy; } copy->dmabuf = rec->dmabuf; copy->refcnt = rec->refcnt; - list_add(©->node, &p->dmabuf_info->dmabufs); + list_add(©->node, &new_dmabuf_info->dmabufs); } - spin_unlock(¤t->dmabuf_info->lock); + spin_unlock(&dmabuf_info->lock); } else { - p->dmabuf_info->rss = 0; - p->dmabuf_info->rss_hwm = 0; + new_dmabuf_info->rss = 0; + new_dmabuf_info->rss_hwm = 0; } + p->worker_private = new_dmabuf_info; return 0; err_list_copy: - list_for_each_entry_safe(rec, copy, &p->dmabuf_info->dmabufs, node) { + list_for_each_entry_safe(rec, copy, &new_dmabuf_info->dmabufs, node) { list_del(&rec->node); kfree(rec); } - kfree(p->dmabuf_info); + kfree(new_dmabuf_info); return -ENOMEM; } From e9f7ac1c2533c2a075cf5c6a9c550bb076110ff4 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Thu, 3 Jul 2025 22:25:54 +0000 Subject: [PATCH 34/49] ANDROID: fixup dma_buf struct to avoid ABI breakage Wrap dma_buf into dma_buf_ext object containing additional num_unique_refs 
field required for dmabuf PSS accounting. Bug: 424648392 Change-Id: I3929ec2cf7cda2626452b5c80949aecefec900e6 Signed-off-by: Suren Baghdasaryan --- drivers/dma-buf/dma-buf.c | 22 ++++++++++++---------- fs/proc/base.c | 2 +- include/linux/dma-buf.h | 17 +++++++++++++++-- 3 files changed, 28 insertions(+), 13 deletions(-) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 5b3e3fdc1599..71065b03012a 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -93,6 +93,7 @@ static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen) static void dma_buf_release(struct dentry *dentry) { + struct dma_buf_ext *dmabuf_ext; struct dma_buf *dmabuf; dmabuf = dentry->d_fsdata; @@ -115,13 +116,13 @@ static void dma_buf_release(struct dentry *dentry) if (dmabuf->resv == (struct dma_resv *)&dmabuf[1]) dma_resv_fini(dmabuf->resv); - if (atomic64_read(&dmabuf->num_unique_refs)) + dmabuf_ext = get_dmabuf_ext(dmabuf); + if (atomic64_read(&dmabuf_ext->num_unique_refs)) pr_err("destroying dmabuf with non-zero task refs\n"); - WARN_ON(!list_empty(&dmabuf->attachments)); module_put(dmabuf->owner); kfree(dmabuf->name); - kfree(dmabuf); + kfree(dmabuf_ext); } static int dma_buf_file_release(struct inode *inode, struct file *file) @@ -221,8 +222,7 @@ static int new_task_dmabuf_record(struct task_struct *task, struct dma_buf *dmab rec->dmabuf = dmabuf; rec->refcnt = 1; list_add(&rec->node, &dmabuf_info->dmabufs); - - atomic64_inc(&dmabuf->num_unique_refs); + atomic64_inc(&get_dmabuf_ext(dmabuf)->num_unique_refs); return 0; } @@ -312,7 +312,7 @@ void dma_buf_unaccount_task(struct dma_buf *dmabuf, struct task_struct *task) list_del(&rec->node); kfree(rec); dmabuf_info->rss -= dmabuf->size; - atomic64_dec(&dmabuf->num_unique_refs); + atomic64_dec(&get_dmabuf_ext(dmabuf)->num_unique_refs); } err: spin_unlock(&dmabuf_info->lock); @@ -831,10 +831,11 @@ err_alloc_file: */ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info) { + struct dma_buf_ext *dmabuf_ext; struct dma_buf *dmabuf; struct dma_resv *resv = exp_info->resv; struct file *file; - size_t alloc_size = sizeof(struct dma_buf); + size_t alloc_size = sizeof(struct dma_buf_ext); int ret; if (WARN_ON(!exp_info->priv || !exp_info->ops @@ -864,12 +865,13 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info) else /* prevent &dma_buf[1] == dma_buf->resv */ alloc_size += 1; - dmabuf = kzalloc(alloc_size, GFP_KERNEL); - if (!dmabuf) { + dmabuf_ext = kzalloc(alloc_size, GFP_KERNEL); + if (!dmabuf_ext) { ret = -ENOMEM; goto err_file; } + dmabuf = &dmabuf_ext->dmabuf; dmabuf->priv = exp_info->priv; dmabuf->ops = exp_info->ops; dmabuf->size = exp_info->size; @@ -888,7 +890,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info) dmabuf->resv = resv; } - atomic64_set(&dmabuf->num_unique_refs, 0); + atomic64_set(&dmabuf_ext->num_unique_refs, 0); file->private_data = dmabuf; file->f_path.dentry->d_fsdata = dmabuf; diff --git a/fs/proc/base.c b/fs/proc/base.c index 0a3f28f7f1d9..3d78cd1286a5 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3437,7 +3437,7 @@ static int proc_dmabuf_pss_show(struct seq_file *m, struct pid_namespace *ns, spin_lock(&dmabuf_info->lock); list_for_each_entry(rec, &dmabuf_info->dmabufs, node) { - s64 refs = atomic64_read(&rec->dmabuf->num_unique_refs); + s64 refs = atomic64_read(&get_dmabuf_ext(rec->dmabuf)->num_unique_refs); if (refs <= 0) { pr_err("dmabuf has <= refs %lld\n", refs); diff --git a/include/linux/dma-buf.h 
b/include/linux/dma-buf.h index 654085da8bc4..d9487fb2e549 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -535,6 +535,11 @@ struct dma_buf { } *sysfs_entry; #endif + ANDROID_KABI_RESERVE(1); + ANDROID_KABI_RESERVE(2); +}; + +struct dma_buf_ext { /** * @num_unique_refs: * @@ -542,10 +547,18 @@ struct dma_buf { */ atomic64_t num_unique_refs; - ANDROID_KABI_RESERVE(1); - ANDROID_KABI_RESERVE(2); + /* + * dma_buf can have a reservation object after it, so keep this member + * at the end of this structure. + */ + struct dma_buf dmabuf; }; +static inline struct dma_buf_ext *get_dmabuf_ext(struct dma_buf *dmabuf) +{ + return container_of(dmabuf, struct dma_buf_ext, dmabuf); +} + /** * struct dma_buf_attach_ops - importer operations for an attachment * From c8fdc081cfa165aaa5bd87979d33b419499574cf Mon Sep 17 00:00:00 2001 From: "T.J. Mercier" Date: Tue, 1 Jul 2025 00:51:35 +0000 Subject: [PATCH 35/49] ANDROID: Add dmabuf RSS trace event Dmabuf RSS is associated with a task, or group of tasks sharing the same mm_struct and files_struct. Any time the RSS counter is modified for a task, or group of tasks, emit a trace event with the current value of the dmabuf RSS counter. This allows for fast tracking of per-process dmabuf RSS by userspace analysis tools like Perfetto, compared to periodically obtaining per-process dmabuf RSS from procfs. Bug: 424646615 Change-Id: I74434dddacc342918cb52b1b9e2fa6679e332764 Signed-off-by: T.J. Mercier --- drivers/dma-buf/dma-buf.c | 5 +++++ include/trace/events/kmem.h | 25 +++++++++++++++++++++++++ 2 files changed, 30 insertions(+) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 71065b03012a..35bea250e08d 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -31,6 +31,9 @@ #include #include +#ifndef __GENKSYMS__ +#include +#endif #include #include "dma-buf-sysfs-stats.h" @@ -211,6 +214,7 @@ static int new_task_dmabuf_record(struct task_struct *task, struct dma_buf *dmab return -ENOMEM; dmabuf_info->rss += dmabuf->size; + trace_dmabuf_rss_stat(dmabuf_info->rss, dmabuf->size, dmabuf); /* * dmabuf_info->lock protects against concurrent writers, so no * worries about stale rss_hwm between the read and write, and we don't @@ -312,6 +316,7 @@ void dma_buf_unaccount_task(struct dma_buf *dmabuf, struct task_struct *task) list_del(&rec->node); kfree(rec); dmabuf_info->rss -= dmabuf->size; + trace_dmabuf_rss_stat(dmabuf_info->rss, -dmabuf->size, dmabuf); atomic64_dec(&get_dmabuf_ext(dmabuf)->num_unique_refs); } err: diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h index 68f5280a41a4..896f8de946d0 100644 --- a/include/trace/events/kmem.h +++ b/include/trace/events/kmem.h @@ -8,6 +8,7 @@ #include #include #include +#include TRACE_EVENT(kmem_cache_alloc, @@ -487,6 +488,30 @@ TRACE_EVENT(rss_stat, __print_symbolic(__entry->member, TRACE_MM_PAGES), __entry->size) ); + +TRACE_EVENT(dmabuf_rss_stat, + + TP_PROTO(size_t rss, ssize_t rss_delta, struct dma_buf *dmabuf), + + TP_ARGS(rss, rss_delta, dmabuf), + + TP_STRUCT__entry( + __field(size_t, rss) + __field(ssize_t, rss_delta) + __field(unsigned long, i_ino) + ), + + TP_fast_assign( + __entry->rss = rss; + __entry->rss_delta = rss_delta; + __entry->i_ino = file_inode(dmabuf->file)->i_ino; + ), + + TP_printk("rss=%zu delta=%zd i_ino=%lu", + __entry->rss, + __entry->rss_delta, + __entry->i_ino) + ); #endif /* _TRACE_KMEM_H */ /* This part must be outside protection */ From a9597c7b32ec09bdaf13909f77b203e77b4fbb69 Mon Sep 17 00:00:00 2001 From: 
"qinglin.li" Date: Wed, 9 Jul 2025 16:25:29 +0800 Subject: [PATCH 36/49] ANDROID: GKI: Update symbol list for Amlogic 1 function symbol(s) added 'int snd_soc_get_dai_name(const struct of_phandle_args*, const char**)' Bug: 430463604 Change-Id: I282e456c2b5ff44cb309fdb27faeb115ee9c2d9d Signed-off-by: Qinglin Li --- android/abi_gki_aarch64.stg | 16 +++++++++++ android/abi_gki_aarch64_amlogic | 49 +++++++++++++++++++++++++++++++++ 2 files changed, 65 insertions(+) diff --git a/android/abi_gki_aarch64.stg b/android/abi_gki_aarch64.stg index 679394de3bae..43a9176d2de6 100644 --- a/android/abi_gki_aarch64.stg +++ b/android/abi_gki_aarch64.stg @@ -329005,6 +329005,12 @@ function { parameter_id: 0x391f15ea parameter_id: 0xf435685e } +function { + id: 0x9294d8c1 + return_type_id: 0x6720d32f + parameter_id: 0x3c01aef6 + parameter_id: 0x051414e1 +} function { id: 0x92956fd0 return_type_id: 0x6720d32f @@ -422737,6 +422743,15 @@ elf_symbol { type_id: 0x909c23c2 full_name: "snd_soc_get_dai_id" } +elf_symbol { + id: 0x4086fab0 + name: "snd_soc_get_dai_name" + is_defined: true + symbol_type: FUNCTION + crc: 0x347721f4 + type_id: 0x9294d8c1 + full_name: "snd_soc_get_dai_name" +} elf_symbol { id: 0xa64c7fe5 name: "snd_soc_get_dai_via_args" @@ -443980,6 +443995,7 @@ interface { symbol_id: 0x7918ef41 symbol_id: 0x97843792 symbol_id: 0x54622a57 + symbol_id: 0x4086fab0 symbol_id: 0xa64c7fe5 symbol_id: 0x5eb2e502 symbol_id: 0x33a917a0 diff --git a/android/abi_gki_aarch64_amlogic b/android/abi_gki_aarch64_amlogic index ae6e6cb73da1..2a3c0510146a 100644 --- a/android/abi_gki_aarch64_amlogic +++ b/android/abi_gki_aarch64_amlogic @@ -1,3 +1,5 @@ + + [abi_symbol_list] add_cpu add_device_randomness @@ -209,10 +211,12 @@ consume_skb contig_page_data __contpte_try_unfold + _copy_from_iter copy_from_kernel_nofault __copy_overflow copy_page_from_iter_atomic copy_splice_read + _copy_to_iter cpu_all_bits cpu_bit_bitmap cpufreq_boost_enabled @@ -245,10 +249,13 @@ crypto_aead_setauthsize crypto_aead_setkey crypto_ahash_digest + crypto_ahash_final + crypto_ahash_finup crypto_ahash_setkey crypto_alloc_aead crypto_alloc_ahash crypto_alloc_base + crypto_alloc_rng crypto_alloc_shash crypto_alloc_skcipher crypto_cipher_encrypt_one @@ -258,13 +265,17 @@ crypto_dequeue_request crypto_destroy_tfm crypto_enqueue_request + crypto_get_default_null_skcipher crypto_has_alg crypto_init_queue __crypto_memneq + crypto_put_default_null_skcipher crypto_register_ahash crypto_register_alg crypto_register_shash crypto_register_skcipher + crypto_req_done + crypto_rng_reset crypto_sha1_finup crypto_sha1_update crypto_shash_digest @@ -623,6 +634,7 @@ drm_atomic_set_mode_prop_for_crtc drm_atomic_state_alloc drm_atomic_state_clear + drm_atomic_state_default_release __drm_atomic_state_free drm_compat_ioctl drm_connector_attach_content_type_property @@ -793,6 +805,7 @@ extcon_set_state extcon_set_state_sync extcon_unregister_notifier + extract_iter_to_sg fasync_helper fault_in_iov_iter_readable __fdget @@ -1102,8 +1115,10 @@ ioremap_prot io_schedule iounmap + iov_iter_advance iov_iter_alignment iov_iter_init + iov_iter_npages iov_iter_revert iov_iter_zero iput @@ -1269,12 +1284,14 @@ __local_bh_enable_ip __lock_buffer lockref_get + lock_sock_nested logfc log_post_read_mmio log_post_write_mmio log_read_mmio log_write_mmio lookup_bdev + lookup_user_key loops_per_jiffy LZ4_decompress_safe LZ4_decompress_safe_partial @@ -1726,6 +1743,8 @@ proc_mkdir proc_mkdir_data proc_remove + proto_register + proto_unregister __pskb_copy_fclone pskb_expand_head 
__pskb_pull_tail @@ -1845,6 +1864,8 @@ release_firmware __release_region release_resource + release_sock + release_sock remap_pfn_range remap_vmalloc_range remove_cpu @@ -1940,6 +1961,8 @@ sdio_writel sdio_writesb sdio_writew + security_sk_clone + security_sock_graft send_sig seq_list_next seq_list_start @@ -2000,6 +2023,7 @@ single_open_size single_release si_swapinfo + sk_alloc skb_add_rx_frag skb_checksum_help skb_clone @@ -2026,6 +2050,7 @@ skb_scrub_packet skb_trim skb_tstamp_tx + sk_free skip_spaces smpboot_register_percpu_thread smp_call_function @@ -2046,6 +2071,7 @@ snd_pcm_lib_preallocate_pages snd_pcm_period_elapsed snd_pcm_rate_to_rate_bit + snd_pcm_set_managed_buffer_all snd_pcm_stop snd_pcm_stop_xrun _snd_pcm_stream_lock_irqsave @@ -2068,6 +2094,7 @@ snd_soc_dai_set_tdm_slot snd_soc_dapm_get_enum_double snd_soc_dapm_put_enum_double + snd_soc_get_dai_name snd_soc_get_volsw snd_soc_get_volsw_range snd_soc_info_enum_double @@ -2082,6 +2109,7 @@ snd_soc_of_parse_audio_simple_widgets snd_soc_of_parse_card_name snd_soc_of_parse_tdm_slot + snd_soc_of_put_dai_link_codecs snd_soc_pm_ops snd_soc_put_volsw snd_soc_put_volsw_range @@ -2090,7 +2118,25 @@ snd_soc_unregister_component snprintf __sock_create + sock_init_data + sock_kfree_s + sock_kmalloc + sock_kzfree_s + sock_no_accept + sock_no_bind + sock_no_connect + sock_no_getname + sock_no_ioctl + sock_no_listen + sock_no_mmap + sock_no_recvmsg + sock_no_sendmsg + sock_no_shutdown + sock_no_socketpair + sock_register sock_release + sock_unregister + sock_wake_async sock_wfree sort spi_add_device @@ -2172,6 +2218,7 @@ sysfs_create_file_ns sysfs_create_files sysfs_create_group + sysfs_create_groups sysfs_create_link sysfs_emit __sysfs_match_string @@ -2574,10 +2621,12 @@ wakeup_source_register wakeup_source_unregister __wake_up_sync + __wake_up_sync_key __warn_flushing_systemwide_wq __warn_printk wireless_nlevent_flush wireless_send_event + woken_wake_function work_busy write_cache_pages write_inode_now From 1f02134847c8bed09b4cda0a53cecce9b1271844 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Wed, 9 Jul 2025 16:52:30 -0700 Subject: [PATCH 37/49] Revert "ANDROID: Add dmabuf RSS trace event" Revert submission 3680024 Reason for revert: replacing with a fixed version Reverted changes: /q/submissionid:3680024 Bug: 430499939 Change-Id: I5ba5cdbabd1e967889b3523abca289fa8e8a3ec9 Signed-off-by: Suren Baghdasaryan --- drivers/dma-buf/dma-buf.c | 5 ----- include/trace/events/kmem.h | 25 ------------------------- 2 files changed, 30 deletions(-) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 35bea250e08d..71065b03012a 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -31,9 +31,6 @@ #include #include -#ifndef __GENKSYMS__ -#include -#endif #include #include "dma-buf-sysfs-stats.h" @@ -214,7 +211,6 @@ static int new_task_dmabuf_record(struct task_struct *task, struct dma_buf *dmab return -ENOMEM; dmabuf_info->rss += dmabuf->size; - trace_dmabuf_rss_stat(dmabuf_info->rss, dmabuf->size, dmabuf); /* * dmabuf_info->lock protects against concurrent writers, so no * worries about stale rss_hwm between the read and write, and we don't @@ -316,7 +312,6 @@ void dma_buf_unaccount_task(struct dma_buf *dmabuf, struct task_struct *task) list_del(&rec->node); kfree(rec); dmabuf_info->rss -= dmabuf->size; - trace_dmabuf_rss_stat(dmabuf_info->rss, -dmabuf->size, dmabuf); atomic64_dec(&get_dmabuf_ext(dmabuf)->num_unique_refs); } err: diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h 
index 896f8de946d0..68f5280a41a4 100644 --- a/include/trace/events/kmem.h +++ b/include/trace/events/kmem.h @@ -8,7 +8,6 @@ #include #include #include -#include TRACE_EVENT(kmem_cache_alloc, @@ -488,30 +487,6 @@ TRACE_EVENT(rss_stat, __print_symbolic(__entry->member, TRACE_MM_PAGES), __entry->size) ); - -TRACE_EVENT(dmabuf_rss_stat, - - TP_PROTO(size_t rss, ssize_t rss_delta, struct dma_buf *dmabuf), - - TP_ARGS(rss, rss_delta, dmabuf), - - TP_STRUCT__entry( - __field(size_t, rss) - __field(ssize_t, rss_delta) - __field(unsigned long, i_ino) - ), - - TP_fast_assign( - __entry->rss = rss; - __entry->rss_delta = rss_delta; - __entry->i_ino = file_inode(dmabuf->file)->i_ino; - ), - - TP_printk("rss=%zu delta=%zd i_ino=%lu", - __entry->rss, - __entry->rss_delta, - __entry->i_ino) - ); #endif /* _TRACE_KMEM_H */ /* This part must be outside protection */ From b26826e8ffe6bac56af44ea05dc371120b3b80f3 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Wed, 9 Jul 2025 16:52:30 -0700 Subject: [PATCH 38/49] Revert "ANDROID: fixup dma_buf struct to avoid ABI breakage" Revert submission 3680024 Reason for revert: replacing with a fixed version Reverted changes: /q/submissionid:3680024 Bug: 430499939 Change-Id: I994b93b56b734441b6ecebf90b777a7a6f54f1ab Signed-off-by: Suren Baghdasaryan --- drivers/dma-buf/dma-buf.c | 22 ++++++++++------------ fs/proc/base.c | 2 +- include/linux/dma-buf.h | 17 ++--------------- 3 files changed, 13 insertions(+), 28 deletions(-) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 71065b03012a..5b3e3fdc1599 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -93,7 +93,6 @@ static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen) static void dma_buf_release(struct dentry *dentry) { - struct dma_buf_ext *dmabuf_ext; struct dma_buf *dmabuf; dmabuf = dentry->d_fsdata; @@ -116,13 +115,13 @@ static void dma_buf_release(struct dentry *dentry) if (dmabuf->resv == (struct dma_resv *)&dmabuf[1]) dma_resv_fini(dmabuf->resv); - dmabuf_ext = get_dmabuf_ext(dmabuf); - if (atomic64_read(&dmabuf_ext->num_unique_refs)) + if (atomic64_read(&dmabuf->num_unique_refs)) pr_err("destroying dmabuf with non-zero task refs\n"); + WARN_ON(!list_empty(&dmabuf->attachments)); module_put(dmabuf->owner); kfree(dmabuf->name); - kfree(dmabuf_ext); + kfree(dmabuf); } static int dma_buf_file_release(struct inode *inode, struct file *file) @@ -222,7 +221,8 @@ static int new_task_dmabuf_record(struct task_struct *task, struct dma_buf *dmab rec->dmabuf = dmabuf; rec->refcnt = 1; list_add(&rec->node, &dmabuf_info->dmabufs); - atomic64_inc(&get_dmabuf_ext(dmabuf)->num_unique_refs); + + atomic64_inc(&dmabuf->num_unique_refs); return 0; } @@ -312,7 +312,7 @@ void dma_buf_unaccount_task(struct dma_buf *dmabuf, struct task_struct *task) list_del(&rec->node); kfree(rec); dmabuf_info->rss -= dmabuf->size; - atomic64_dec(&get_dmabuf_ext(dmabuf)->num_unique_refs); + atomic64_dec(&dmabuf->num_unique_refs); } err: spin_unlock(&dmabuf_info->lock); @@ -831,11 +831,10 @@ err_alloc_file: */ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info) { - struct dma_buf_ext *dmabuf_ext; struct dma_buf *dmabuf; struct dma_resv *resv = exp_info->resv; struct file *file; - size_t alloc_size = sizeof(struct dma_buf_ext); + size_t alloc_size = sizeof(struct dma_buf); int ret; if (WARN_ON(!exp_info->priv || !exp_info->ops @@ -865,13 +864,12 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info) else /* prevent &dma_buf[1] == 
dma_buf->resv */ alloc_size += 1; - dmabuf_ext = kzalloc(alloc_size, GFP_KERNEL); - if (!dmabuf_ext) { + dmabuf = kzalloc(alloc_size, GFP_KERNEL); + if (!dmabuf) { ret = -ENOMEM; goto err_file; } - dmabuf = &dmabuf_ext->dmabuf; dmabuf->priv = exp_info->priv; dmabuf->ops = exp_info->ops; dmabuf->size = exp_info->size; @@ -890,7 +888,7 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info) dmabuf->resv = resv; } - atomic64_set(&dmabuf_ext->num_unique_refs, 0); + atomic64_set(&dmabuf->num_unique_refs, 0); file->private_data = dmabuf; file->f_path.dentry->d_fsdata = dmabuf; diff --git a/fs/proc/base.c b/fs/proc/base.c index 3d78cd1286a5..0a3f28f7f1d9 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3437,7 +3437,7 @@ static int proc_dmabuf_pss_show(struct seq_file *m, struct pid_namespace *ns, spin_lock(&dmabuf_info->lock); list_for_each_entry(rec, &dmabuf_info->dmabufs, node) { - s64 refs = atomic64_read(&get_dmabuf_ext(rec->dmabuf)->num_unique_refs); + s64 refs = atomic64_read(&rec->dmabuf->num_unique_refs); if (refs <= 0) { pr_err("dmabuf has <= refs %lld\n", refs); diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index d9487fb2e549..654085da8bc4 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -535,11 +535,6 @@ struct dma_buf { } *sysfs_entry; #endif - ANDROID_KABI_RESERVE(1); - ANDROID_KABI_RESERVE(2); -}; - -struct dma_buf_ext { /** * @num_unique_refs: * @@ -547,18 +542,10 @@ struct dma_buf_ext { */ atomic64_t num_unique_refs; - /* - * dma_buf can have a reservation object after it, so keep this member - * at the end of this structure. - */ - struct dma_buf dmabuf; + ANDROID_KABI_RESERVE(1); + ANDROID_KABI_RESERVE(2); }; -static inline struct dma_buf_ext *get_dmabuf_ext(struct dma_buf *dmabuf) -{ - return container_of(dmabuf, struct dma_buf_ext, dmabuf); -} - /** * struct dma_buf_attach_ops - importer operations for an attachment * From ad0b76e69fc80167b8b53463c97660c1a0395248 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Wed, 9 Jul 2025 16:52:30 -0700 Subject: [PATCH 39/49] Revert "ANDROID: fixup task_struct to avoid ABI breakage" Revert submission 3680024 Reason for revert: replacing with a fixed version Reverted changes: /q/submissionid:3680024 Bug: 430499939 Change-Id: Iddc2b99d2f80611d044d0015d62bb33936b4bcaf Signed-off-by: Suren Baghdasaryan --- drivers/dma-buf/dma-buf.c | 69 ++++++++------------------- fs/proc/base.c | 98 +++++++++++++++------------------------ include/linux/dma-buf.h | 22 --------- include/linux/sched.h | 4 +- init/init_task.c | 2 +- kernel/fork.c | 69 +++++++++++---------------- 6 files changed, 87 insertions(+), 177 deletions(-) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 5b3e3fdc1599..cb91dadeb465 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -168,21 +168,11 @@ static struct file_system_type dma_buf_fs_type = { static struct task_dma_buf_record *find_task_dmabuf_record( struct task_struct *task, struct dma_buf *dmabuf) { - struct task_dma_buf_info *dmabuf_info = get_task_dma_buf_info(task); struct task_dma_buf_record *rec; - if (!dmabuf_info) - return NULL; + lockdep_assert_held(&task->dmabuf_info->lock); - if (IS_ERR(dmabuf_info)) { - pr_err("%s dmabuf accounting record is missing, error %ld\n", - __func__, PTR_ERR(dmabuf_info)); - return NULL; - } - - lockdep_assert_held(&dmabuf_info->lock); - - list_for_each_entry(rec, &dmabuf_info->dmabufs, node) + list_for_each_entry(rec, &task->dmabuf_info->dmabufs, node) if (dmabuf == 
rec->dmabuf) return rec; @@ -191,36 +181,26 @@ static struct task_dma_buf_record *find_task_dmabuf_record( static int new_task_dmabuf_record(struct task_struct *task, struct dma_buf *dmabuf) { - struct task_dma_buf_info *dmabuf_info = get_task_dma_buf_info(task); struct task_dma_buf_record *rec; - if (!dmabuf_info) - return 0; - - if (IS_ERR(dmabuf_info)) { - pr_err("%s dmabuf accounting record is missing, error %ld\n", - __func__, PTR_ERR(dmabuf_info)); - return PTR_ERR(dmabuf_info); - } - - lockdep_assert_held(&dmabuf_info->lock); + lockdep_assert_held(&task->dmabuf_info->lock); rec = kmalloc(sizeof(*rec), GFP_KERNEL); if (!rec) return -ENOMEM; - dmabuf_info->rss += dmabuf->size; + task->dmabuf_info->rss += dmabuf->size; /* - * dmabuf_info->lock protects against concurrent writers, so no + * task->dmabuf_info->lock protects against concurrent writers, so no * worries about stale rss_hwm between the read and write, and we don't * need to cmpxchg here. */ - if (dmabuf_info->rss > dmabuf_info->rss_hwm) - dmabuf_info->rss_hwm = dmabuf_info->rss; + if (task->dmabuf_info->rss > task->dmabuf_info->rss_hwm) + task->dmabuf_info->rss_hwm = task->dmabuf_info->rss; rec->dmabuf = dmabuf; rec->refcnt = 1; - list_add(&rec->node, &dmabuf_info->dmabufs); + list_add(&rec->node, &task->dmabuf_info->dmabufs); atomic64_inc(&dmabuf->num_unique_refs); @@ -242,30 +222,24 @@ static int new_task_dmabuf_record(struct task_struct *task, struct dma_buf *dmab */ int dma_buf_account_task(struct dma_buf *dmabuf, struct task_struct *task) { - struct task_dma_buf_info *dmabuf_info; struct task_dma_buf_record *rec; int ret = 0; if (!dmabuf || !task) return -EINVAL; - dmabuf_info = get_task_dma_buf_info(task); - if (!dmabuf_info) - return 0; - - if (IS_ERR(dmabuf_info)) { - pr_err("%s dmabuf accounting record is missing, error %ld\n", - __func__, PTR_ERR(dmabuf_info)); - return PTR_ERR(dmabuf_info); + if (!task->dmabuf_info) { + pr_err("%s dmabuf accounting record was not allocated\n", __func__); + return -ENOMEM; } - spin_lock(&dmabuf_info->lock); + spin_lock(&task->dmabuf_info->lock); rec = find_task_dmabuf_record(task, dmabuf); if (!rec) ret = new_task_dmabuf_record(task, dmabuf); else ++rec->refcnt; - spin_unlock(&dmabuf_info->lock); + spin_unlock(&task->dmabuf_info->lock); return ret; } @@ -286,22 +260,17 @@ int dma_buf_account_task(struct dma_buf *dmabuf, struct task_struct *task) */ void dma_buf_unaccount_task(struct dma_buf *dmabuf, struct task_struct *task) { - struct task_dma_buf_info *dmabuf_info = get_task_dma_buf_info(task); struct task_dma_buf_record *rec; if (!dmabuf || !task) return; - if (!dmabuf_info) - return; - - if (IS_ERR(dmabuf_info)) { - pr_err("%s dmabuf accounting record is missing, error %ld\n", - __func__, PTR_ERR(dmabuf_info)); - return; + if (!task->dmabuf_info) { + pr_err("%s dmabuf accounting record was not allocated\n", __func__); + return; } - spin_lock(&dmabuf_info->lock); + spin_lock(&task->dmabuf_info->lock); rec = find_task_dmabuf_record(task, dmabuf); if (!rec) { /* Failed fd_install? 
*/ pr_err("dmabuf not found in task list\n"); @@ -311,11 +280,11 @@ void dma_buf_unaccount_task(struct dma_buf *dmabuf, struct task_struct *task) if (--rec->refcnt == 0) { list_del(&rec->node); kfree(rec); - dmabuf_info->rss -= dmabuf->size; + task->dmabuf_info->rss -= dmabuf->size; atomic64_dec(&dmabuf->num_unique_refs); } err: - spin_unlock(&dmabuf_info->lock); + spin_unlock(&task->dmabuf_info->lock); } static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma) diff --git a/fs/proc/base.c b/fs/proc/base.c index 0a3f28f7f1d9..2eee67e06ffe 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3309,27 +3309,21 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns, static int proc_dmabuf_rss_show(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task) { - struct task_dma_buf_info *dmabuf_info = get_task_dma_buf_info(task); + if (!task->dmabuf_info) { + pr_err("%s dmabuf accounting record was not allocated\n", __func__); + return -ENOMEM; + } - if (!dmabuf_info) { + if (!(task->flags & PF_KTHREAD)) + seq_printf(m, "%lld\n", READ_ONCE(task->dmabuf_info->rss)); + else seq_puts(m, "0\n"); - return 0; - } - - if (IS_ERR(dmabuf_info)) { - pr_err("%s dmabuf accounting record is missing, error %ld\n", - __func__, PTR_ERR(dmabuf_info)); - return PTR_ERR(dmabuf_info); - } - - seq_printf(m, "%lld\n", READ_ONCE(dmabuf_info->rss)); return 0; } static int proc_dmabuf_rss_hwm_show(struct seq_file *m, void *v) { - struct task_dma_buf_info *dmabuf_info; struct inode *inode = m->private; struct task_struct *task; int ret = 0; @@ -3338,20 +3332,16 @@ static int proc_dmabuf_rss_hwm_show(struct seq_file *m, void *v) if (!task) return -ESRCH; - dmabuf_info = get_task_dma_buf_info(task); - if (!dmabuf_info) { + if (!task->dmabuf_info) { + pr_err("%s dmabuf accounting record was not allocated\n", __func__); + ret = -ENOMEM; + goto out; + } + + if (!(task->flags & PF_KTHREAD)) + seq_printf(m, "%lld\n", READ_ONCE(task->dmabuf_info->rss_hwm)); + else seq_puts(m, "0\n"); - goto out; - } - - if (IS_ERR(dmabuf_info)) { - pr_err("%s dmabuf accounting record is missing, error %ld\n", - __func__, PTR_ERR(dmabuf_info)); - ret = PTR_ERR(dmabuf_info); - goto out; - } - - seq_printf(m, "%lld\n", READ_ONCE(dmabuf_info->rss_hwm)); out: put_task_struct(task); @@ -3368,7 +3358,6 @@ static ssize_t proc_dmabuf_rss_hwm_write(struct file *file, const char __user *buf, size_t count, loff_t *offset) { - struct task_dma_buf_info *dmabuf_info; struct inode *inode = file_inode(file); struct task_struct *task; unsigned long long val; @@ -3385,22 +3374,15 @@ proc_dmabuf_rss_hwm_write(struct file *file, const char __user *buf, if (!task) return -ESRCH; - dmabuf_info = get_task_dma_buf_info(task); - if (!dmabuf_info) { - ret = -EINVAL; + if (!task->dmabuf_info) { + pr_err("%s dmabuf accounting record was not allocated\n", __func__); + ret = -ENOMEM; goto out; } - if (IS_ERR(dmabuf_info)) { - pr_err("%s dmabuf accounting record is missing, error %ld\n", - __func__, PTR_ERR(dmabuf_info)); - ret = PTR_ERR(dmabuf_info); - goto out; - } - - spin_lock(&dmabuf_info->lock); - dmabuf_info->rss_hwm = dmabuf_info->rss; - spin_unlock(&dmabuf_info->lock); + spin_lock(&task->dmabuf_info->lock); + task->dmabuf_info->rss_hwm = task->dmabuf_info->rss; + spin_unlock(&task->dmabuf_info->lock); out: put_task_struct(task); @@ -3419,36 +3401,32 @@ static const struct file_operations proc_dmabuf_rss_hwm_operations = { static int proc_dmabuf_pss_show(struct seq_file *m, struct pid_namespace 
*ns, struct pid *pid, struct task_struct *task) { - struct task_dma_buf_info *dmabuf_info; struct task_dma_buf_record *rec; u64 pss = 0; - dmabuf_info = get_task_dma_buf_info(task); - if (!dmabuf_info) { - seq_puts(m, "0\n"); - return 0; + if (!task->dmabuf_info) { + pr_err("%s dmabuf accounting record was not allocated\n", __func__); + return -ENOMEM; } - if (IS_ERR(dmabuf_info)) { - pr_err("%s dmabuf accounting record is missing, error %ld\n", - __func__, PTR_ERR(dmabuf_info)); - return PTR_ERR(dmabuf_info); - } + if (!(task->flags & PF_KTHREAD)) { + spin_lock(&task->dmabuf_info->lock); + list_for_each_entry(rec, &task->dmabuf_info->dmabufs, node) { + s64 refs = atomic64_read(&rec->dmabuf->num_unique_refs); - spin_lock(&dmabuf_info->lock); - list_for_each_entry(rec, &dmabuf_info->dmabufs, node) { - s64 refs = atomic64_read(&rec->dmabuf->num_unique_refs); + if (refs <= 0) { + pr_err("dmabuf has <= refs %lld\n", refs); + continue; + } - if (refs <= 0) { - pr_err("dmabuf has <= refs %lld\n", refs); - continue; + pss += rec->dmabuf->size / (size_t)refs; } + spin_unlock(&task->dmabuf_info->lock); - pss += rec->dmabuf->size / (size_t)refs; + seq_printf(m, "%llu\n", pss); + } else { + seq_puts(m, "0\n"); } - spin_unlock(&dmabuf_info->lock); - - seq_printf(m, "%llu\n", pss); return 0; } diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 654085da8bc4..267bf322272f 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -690,28 +690,6 @@ struct task_dma_buf_info { struct list_head dmabufs; }; -static inline bool task_has_dma_buf_info(struct task_struct *task) -{ - return (task->flags & (PF_KTHREAD | PF_IO_WORKER)) == 0; -} - -extern struct task_struct init_task; - -static inline -struct task_dma_buf_info *get_task_dma_buf_info(struct task_struct *task) -{ - if (!task) - return ERR_PTR(-EINVAL); - - if (!task_has_dma_buf_info(task)) - return NULL; - - if (!task->worker_private) - return ERR_PTR(-ENOMEM); - - return (struct task_dma_buf_info *)task->worker_private; -} - /** * DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters * @name: export-info name diff --git a/include/linux/sched.h b/include/linux/sched.h index 3cff2446536d..68ba96bde447 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1003,7 +1003,6 @@ struct task_struct { int __user *clear_child_tid; /* PF_KTHREAD | PF_IO_WORKER */ - /* Otherwise used as task_dma_buf_info pointer */ void *worker_private; u64 utime; @@ -1518,6 +1517,9 @@ struct task_struct { */ struct callback_head l1d_flush_kill; #endif + + struct task_dma_buf_info *dmabuf_info; + ANDROID_KABI_RESERVE(1); ANDROID_KABI_RESERVE(2); ANDROID_KABI_RESERVE(3); diff --git a/init/init_task.c b/init/init_task.c index 1903a2abde55..d80c007ab59b 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -214,7 +214,7 @@ struct task_struct init_task .android_vendor_data1 = {0, }, .android_oem_data1 = {0, }, #endif - .worker_private = NULL, + .dmabuf_info = NULL, }; EXPORT_SYMBOL(init_task); diff --git a/kernel/fork.c b/kernel/fork.c index 9c71a69e0d17..e1d7d244d43a 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -997,27 +997,21 @@ static inline void put_signal_struct(struct signal_struct *sig) static void put_dmabuf_info(struct task_struct *tsk) { - struct task_dma_buf_info *dmabuf_info = get_task_dma_buf_info(tsk); - - if (!dmabuf_info) - return; - - if (IS_ERR(dmabuf_info)) { - pr_err("%s dmabuf accounting record is missing, error %ld\n", - __func__, PTR_ERR(dmabuf_info)); + if (!tsk->dmabuf_info) { + pr_err("%s dmabuf accounting 
record was not allocated\n", __func__); return; } - if (!refcount_dec_and_test(&dmabuf_info->refcnt)) + if (!refcount_dec_and_test(&tsk->dmabuf_info->refcnt)) return; - if (READ_ONCE(dmabuf_info->rss)) + if (READ_ONCE(tsk->dmabuf_info->rss)) pr_err("%s destroying task with non-zero dmabuf rss\n", __func__); - if (!list_empty(&dmabuf_info->dmabufs)) + if (!list_empty(&tsk->dmabuf_info->dmabufs)) pr_err("%s destroying task with non-empty dmabuf list\n", __func__); - kfree(dmabuf_info); + kfree(tsk->dmabuf_info); } void __put_task_struct(struct task_struct *tsk) @@ -2297,66 +2291,55 @@ static void rv_task_fork(struct task_struct *p) static int copy_dmabuf_info(u64 clone_flags, struct task_struct *p) { - struct task_dma_buf_info *new_dmabuf_info; - struct task_dma_buf_info *dmabuf_info; struct task_dma_buf_record *rec, *copy; - if (!task_has_dma_buf_info(p)) - return 0; /* Task is not supposed to have dmabuf_info */ - - dmabuf_info = get_task_dma_buf_info(current); - /* Original might not have dmabuf_info and that's fine */ - if (IS_ERR(dmabuf_info)) - dmabuf_info = NULL; - - if (dmabuf_info && (clone_flags & (CLONE_VM | CLONE_FILES)) + if (current->dmabuf_info && (clone_flags & (CLONE_VM | CLONE_FILES)) == (CLONE_VM | CLONE_FILES)) { /* * Both MM and FD references to dmabufs are shared with the parent, so * we can share a RSS counter with the parent. */ - refcount_inc(&dmabuf_info->refcnt); - p->worker_private = dmabuf_info; + refcount_inc(¤t->dmabuf_info->refcnt); + p->dmabuf_info = current->dmabuf_info; return 0; } - new_dmabuf_info = kmalloc(sizeof(*new_dmabuf_info), GFP_KERNEL); - if (!new_dmabuf_info) + p->dmabuf_info = kmalloc(sizeof(*p->dmabuf_info), GFP_KERNEL); + if (!p->dmabuf_info) return -ENOMEM; - refcount_set(&new_dmabuf_info->refcnt, 1); - spin_lock_init(&new_dmabuf_info->lock); - INIT_LIST_HEAD(&new_dmabuf_info->dmabufs); - if (dmabuf_info) { - spin_lock(&dmabuf_info->lock); - new_dmabuf_info->rss = dmabuf_info->rss; - new_dmabuf_info->rss_hwm = dmabuf_info->rss; - list_for_each_entry(rec, &dmabuf_info->dmabufs, node) { + refcount_set(&p->dmabuf_info->refcnt, 1); + spin_lock_init(&p->dmabuf_info->lock); + INIT_LIST_HEAD(&p->dmabuf_info->dmabufs); + if (current->dmabuf_info) { + spin_lock(¤t->dmabuf_info->lock); + p->dmabuf_info->rss = current->dmabuf_info->rss; + p->dmabuf_info->rss_hwm = current->dmabuf_info->rss; + list_for_each_entry(rec, ¤t->dmabuf_info->dmabufs, node) { copy = kmalloc(sizeof(*copy), GFP_KERNEL); if (!copy) { - spin_unlock(&dmabuf_info->lock); + spin_unlock(¤t->dmabuf_info->lock); goto err_list_copy; } copy->dmabuf = rec->dmabuf; copy->refcnt = rec->refcnt; - list_add(©->node, &new_dmabuf_info->dmabufs); + list_add(©->node, &p->dmabuf_info->dmabufs); } - spin_unlock(&dmabuf_info->lock); + spin_unlock(¤t->dmabuf_info->lock); } else { - new_dmabuf_info->rss = 0; - new_dmabuf_info->rss_hwm = 0; + p->dmabuf_info->rss = 0; + p->dmabuf_info->rss_hwm = 0; } - p->worker_private = new_dmabuf_info; return 0; err_list_copy: - list_for_each_entry_safe(rec, copy, &new_dmabuf_info->dmabufs, node) { + list_for_each_entry_safe(rec, copy, &p->dmabuf_info->dmabufs, node) { list_del(&rec->node); kfree(rec); } - kfree(new_dmabuf_info); + kfree(p->dmabuf_info); return -ENOMEM; } From 30cf816a506fa06e86ca1af81ea4ec6bee1d95c6 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Wed, 9 Jul 2025 16:52:30 -0700 Subject: [PATCH 40/49] Revert "ANDROID: Track per-process dmabuf PSS" Revert submission 3680024 Reason for revert: replacing with a fixed version Reverted changes: 
/q/submissionid:3680024 Bug: 430499939 Change-Id: Iabc974ee2bd75a88e8e3b4728dc0f1a58ecfe75c Signed-off-by: Suren Baghdasaryan --- drivers/dma-buf/dma-buf.c | 8 -------- fs/proc/base.c | 34 ---------------------------------- include/linux/dma-buf.h | 8 -------- 3 files changed, 50 deletions(-) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index cb91dadeb465..7c9ac163d115 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -115,9 +115,6 @@ static void dma_buf_release(struct dentry *dentry) if (dmabuf->resv == (struct dma_resv *)&dmabuf[1]) dma_resv_fini(dmabuf->resv); - if (atomic64_read(&dmabuf->num_unique_refs)) - pr_err("destroying dmabuf with non-zero task refs\n"); - WARN_ON(!list_empty(&dmabuf->attachments)); module_put(dmabuf->owner); kfree(dmabuf->name); @@ -202,8 +199,6 @@ static int new_task_dmabuf_record(struct task_struct *task, struct dma_buf *dmab rec->refcnt = 1; list_add(&rec->node, &task->dmabuf_info->dmabufs); - atomic64_inc(&dmabuf->num_unique_refs); - return 0; } @@ -281,7 +276,6 @@ void dma_buf_unaccount_task(struct dma_buf *dmabuf, struct task_struct *task) list_del(&rec->node); kfree(rec); task->dmabuf_info->rss -= dmabuf->size; - atomic64_dec(&dmabuf->num_unique_refs); } err: spin_unlock(&task->dmabuf_info->lock); @@ -857,8 +851,6 @@ struct dma_buf *dma_buf_export(const struct dma_buf_export_info *exp_info) dmabuf->resv = resv; } - atomic64_set(&dmabuf->num_unique_refs, 0); - file->private_data = dmabuf; file->f_path.dentry->d_fsdata = dmabuf; dmabuf->file = file; diff --git a/fs/proc/base.c b/fs/proc/base.c index 2eee67e06ffe..6b91ddcab7e2 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3397,39 +3397,6 @@ static const struct file_operations proc_dmabuf_rss_hwm_operations = { .llseek = seq_lseek, .release = single_release, }; - -static int proc_dmabuf_pss_show(struct seq_file *m, struct pid_namespace *ns, - struct pid *pid, struct task_struct *task) -{ - struct task_dma_buf_record *rec; - u64 pss = 0; - - if (!task->dmabuf_info) { - pr_err("%s dmabuf accounting record was not allocated\n", __func__); - return -ENOMEM; - } - - if (!(task->flags & PF_KTHREAD)) { - spin_lock(&task->dmabuf_info->lock); - list_for_each_entry(rec, &task->dmabuf_info->dmabufs, node) { - s64 refs = atomic64_read(&rec->dmabuf->num_unique_refs); - - if (refs <= 0) { - pr_err("dmabuf has <= refs %lld\n", refs); - continue; - } - - pss += rec->dmabuf->size / (size_t)refs; - } - spin_unlock(&task->dmabuf_info->lock); - - seq_printf(m, "%llu\n", pss); - } else { - seq_puts(m, "0\n"); - } - - return 0; -} #endif /* @@ -3558,7 +3525,6 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_DMA_SHARED_BUFFER ONE("dmabuf_rss", S_IRUGO, proc_dmabuf_rss_show), REG("dmabuf_rss_hwm", S_IRUGO|S_IWUSR, proc_dmabuf_rss_hwm_operations), - ONE("dmabuf_pss", S_IRUGO, proc_dmabuf_pss_show), #endif }; diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 267bf322272f..a362c8ba7a21 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -25,7 +25,6 @@ #include #include #ifndef __GENKSYMS__ -#include #include #endif @@ -535,13 +534,6 @@ struct dma_buf { } *sysfs_entry; #endif - /** - * @num_unique_refs: - * - * The number of tasks that reference this buffer. For calculating PSS. 
- */ - atomic64_t num_unique_refs; - ANDROID_KABI_RESERVE(1); ANDROID_KABI_RESERVE(2); }; From b1eeaed7fb5fbacdc5009d717c3c4a9952ae08a8 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Wed, 9 Jul 2025 16:52:30 -0700 Subject: [PATCH 41/49] Revert "ANDROID: Track per-process dmabuf RSS HWM" Revert submission 3680024 Reason for revert: replacing with a fixed version Reverted changes: /q/submissionid:3680024 Bug: 430499939 Change-Id: I57d3532def7a03d4e785ec64ade727a43503a2fb Signed-off-by: Suren Baghdasaryan --- drivers/dma-buf/dma-buf.c | 8 ---- fs/proc/base.c | 77 --------------------------------------- include/linux/dma-buf.h | 7 +--- kernel/fork.c | 2 - 4 files changed, 2 insertions(+), 92 deletions(-) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 7c9ac163d115..c8c05d2e112a 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -187,14 +187,6 @@ static int new_task_dmabuf_record(struct task_struct *task, struct dma_buf *dmab return -ENOMEM; task->dmabuf_info->rss += dmabuf->size; - /* - * task->dmabuf_info->lock protects against concurrent writers, so no - * worries about stale rss_hwm between the read and write, and we don't - * need to cmpxchg here. - */ - if (task->dmabuf_info->rss > task->dmabuf_info->rss_hwm) - task->dmabuf_info->rss_hwm = task->dmabuf_info->rss; - rec->dmabuf = dmabuf; rec->refcnt = 1; list_add(&rec->node, &task->dmabuf_info->dmabufs); diff --git a/fs/proc/base.c b/fs/proc/base.c index 6b91ddcab7e2..f7d8188b0ccf 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -3321,82 +3321,6 @@ static int proc_dmabuf_rss_show(struct seq_file *m, struct pid_namespace *ns, return 0; } - -static int proc_dmabuf_rss_hwm_show(struct seq_file *m, void *v) -{ - struct inode *inode = m->private; - struct task_struct *task; - int ret = 0; - - task = get_proc_task(inode); - if (!task) - return -ESRCH; - - if (!task->dmabuf_info) { - pr_err("%s dmabuf accounting record was not allocated\n", __func__); - ret = -ENOMEM; - goto out; - } - - if (!(task->flags & PF_KTHREAD)) - seq_printf(m, "%lld\n", READ_ONCE(task->dmabuf_info->rss_hwm)); - else - seq_puts(m, "0\n"); - -out: - put_task_struct(task); - - return ret; -} - -static int proc_dmabuf_rss_hwm_open(struct inode *inode, struct file *filp) -{ - return single_open(filp, proc_dmabuf_rss_hwm_show, inode); -} - -static ssize_t -proc_dmabuf_rss_hwm_write(struct file *file, const char __user *buf, - size_t count, loff_t *offset) -{ - struct inode *inode = file_inode(file); - struct task_struct *task; - unsigned long long val; - int ret; - - ret = kstrtoull_from_user(buf, count, 10, &val); - if (ret) - return ret; - - if (val != 0) - return -EINVAL; - - task = get_proc_task(inode); - if (!task) - return -ESRCH; - - if (!task->dmabuf_info) { - pr_err("%s dmabuf accounting record was not allocated\n", __func__); - ret = -ENOMEM; - goto out; - } - - spin_lock(&task->dmabuf_info->lock); - task->dmabuf_info->rss_hwm = task->dmabuf_info->rss; - spin_unlock(&task->dmabuf_info->lock); - -out: - put_task_struct(task); - - return ret < 0 ? 
ret : count; -} - -static const struct file_operations proc_dmabuf_rss_hwm_operations = { - .open = proc_dmabuf_rss_hwm_open, - .write = proc_dmabuf_rss_hwm_write, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; #endif /* @@ -3524,7 +3448,6 @@ static const struct pid_entry tgid_base_stuff[] = { #endif #ifdef CONFIG_DMA_SHARED_BUFFER ONE("dmabuf_rss", S_IRUGO, proc_dmabuf_rss_show), - REG("dmabuf_rss_hwm", S_IRUGO|S_IWUSR, proc_dmabuf_rss_hwm_operations), #endif }; diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index a362c8ba7a21..1647fb38fe80 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -659,8 +659,8 @@ struct task_dma_buf_record { }; /** - * struct task_dma_buf_info - Holds RSS and RSS HWM counters, and a list of - * dmabufs for all tasks that share both mm_struct and files_struct. + * struct task_dma_buf_info - Holds a RSS counter, and a list of dmabufs for all + * tasks that share both mm_struct and files_struct. * * @rss: The sum of all dmabuf memory referenced by the tasks via memory * mappings or file descriptors in bytes. Buffers referenced more than @@ -668,15 +668,12 @@ struct task_dma_buf_record { * of both mmaps and FDs) only cause the buffer to be accounted to the * process once. Partial mappings cause the full size of the buffer to be * accounted, regardless of the size of the mapping. - * @rss_hwm: The maximum value of @rss over the lifetime of this struct. (Unless, - * reset by userspace.) * @refcnt: The number of tasks sharing this struct. * @lock: Lock protecting writes for @rss, and reads/writes for @dmabufs. * @dmabufs: List of all dmabufs referenced by the tasks. */ struct task_dma_buf_info { s64 rss; - s64 rss_hwm; refcount_t refcnt; spinlock_t lock; struct list_head dmabufs; diff --git a/kernel/fork.c b/kernel/fork.c index e1d7d244d43a..66636a979911 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2314,7 +2314,6 @@ static int copy_dmabuf_info(u64 clone_flags, struct task_struct *p) if (current->dmabuf_info) { spin_lock(¤t->dmabuf_info->lock); p->dmabuf_info->rss = current->dmabuf_info->rss; - p->dmabuf_info->rss_hwm = current->dmabuf_info->rss; list_for_each_entry(rec, ¤t->dmabuf_info->dmabufs, node) { copy = kmalloc(sizeof(*copy), GFP_KERNEL); if (!copy) { @@ -2329,7 +2328,6 @@ static int copy_dmabuf_info(u64 clone_flags, struct task_struct *p) spin_unlock(¤t->dmabuf_info->lock); } else { p->dmabuf_info->rss = 0; - p->dmabuf_info->rss_hwm = 0; } return 0; From 9e89b97c13b4369ff627a1640c9714ec6788d1ec Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Wed, 9 Jul 2025 16:52:30 -0700 Subject: [PATCH 42/49] Revert "ANDROID: Track per-process dmabuf RSS" Revert submission 3680024 Reason for revert: replacing with a fixed version Reverted changes: /q/submissionid:3680024 Bug: 430499939 Change-Id: I93de8460bcabdc2b1b2a0e12069d89dcad3a870d Signed-off-by: Suren Baghdasaryan --- drivers/dma-buf/dma-buf.c | 141 +------------------------------------- fs/file.c | 4 -- fs/proc/base.c | 22 ------ include/linux/dma-buf.h | 43 ------------ include/linux/sched.h | 4 -- init/init_task.c | 1 - kernel/fork.c | 83 +--------------------- mm/mmap.c | 14 +--- 8 files changed, 6 insertions(+), 306 deletions(-) diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index c8c05d2e112a..0b02ced1eb33 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -162,121 +162,9 @@ static struct file_system_type dma_buf_fs_type = { .kill_sb = kill_anon_super, }; -static struct task_dma_buf_record 
*find_task_dmabuf_record( - struct task_struct *task, struct dma_buf *dmabuf) -{ - struct task_dma_buf_record *rec; - - lockdep_assert_held(&task->dmabuf_info->lock); - - list_for_each_entry(rec, &task->dmabuf_info->dmabufs, node) - if (dmabuf == rec->dmabuf) - return rec; - - return NULL; -} - -static int new_task_dmabuf_record(struct task_struct *task, struct dma_buf *dmabuf) -{ - struct task_dma_buf_record *rec; - - lockdep_assert_held(&task->dmabuf_info->lock); - - rec = kmalloc(sizeof(*rec), GFP_KERNEL); - if (!rec) - return -ENOMEM; - - task->dmabuf_info->rss += dmabuf->size; - rec->dmabuf = dmabuf; - rec->refcnt = 1; - list_add(&rec->node, &task->dmabuf_info->dmabufs); - - return 0; -} - -/** - * dma_buf_account_task - Account a dmabuf to a task - * @dmabuf: [in] pointer to dma_buf - * @task: [in] pointer to task_struct - * - * When a process obtains a dmabuf file descriptor, or maps a dmabuf, this - * function attributes the provided @dmabuf to the @task. The first time @dmabuf - * is attributed to @task, the buffer's size is added to the @task's dmabuf RSS. - * - * Return: - * * 0 on success - * * A negative error code upon error - */ -int dma_buf_account_task(struct dma_buf *dmabuf, struct task_struct *task) -{ - struct task_dma_buf_record *rec; - int ret = 0; - - if (!dmabuf || !task) - return -EINVAL; - - if (!task->dmabuf_info) { - pr_err("%s dmabuf accounting record was not allocated\n", __func__); - return -ENOMEM; - } - - spin_lock(&task->dmabuf_info->lock); - rec = find_task_dmabuf_record(task, dmabuf); - if (!rec) - ret = new_task_dmabuf_record(task, dmabuf); - else - ++rec->refcnt; - spin_unlock(&task->dmabuf_info->lock); - - return ret; -} - -/** - * dma_buf_unaccount_task - Unaccount a dmabuf from a task - * @dmabuf: [in] pointer to dma_buf - * @task: [in] pointer to task_struct - * - * When a process closes a dmabuf file descriptor, or unmaps a dmabuf, this - * function removes the provided @dmabuf attribution from the @task. When all - * references to @dmabuf are removed from @task, the buffer's size is removed - * from the task's dmabuf RSS. - * - * Return: - * * 0 on success - * * A negative error code upon error - */ -void dma_buf_unaccount_task(struct dma_buf *dmabuf, struct task_struct *task) -{ - struct task_dma_buf_record *rec; - - if (!dmabuf || !task) - return; - - if (!task->dmabuf_info) { - pr_err("%s dmabuf accounting record was not allocated\n", __func__); - return; - } - - spin_lock(&task->dmabuf_info->lock); - rec = find_task_dmabuf_record(task, dmabuf); - if (!rec) { /* Failed fd_install? 
*/ - pr_err("dmabuf not found in task list\n"); - goto err; - } - - if (--rec->refcnt == 0) { - list_del(&rec->node); - kfree(rec); - task->dmabuf_info->rss -= dmabuf->size; - } -err: - spin_unlock(&task->dmabuf_info->lock); -} - static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma) { struct dma_buf *dmabuf; - int ret; if (!is_dma_buf_file(file)) return -EINVAL; @@ -292,15 +180,7 @@ static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma) dmabuf->size >> PAGE_SHIFT) return -EINVAL; - ret = dma_buf_account_task(dmabuf, current); - if (ret) - return ret; - - ret = dmabuf->ops->mmap(dmabuf, vma); - if (ret) - dma_buf_unaccount_task(dmabuf, current); - - return ret; + return dmabuf->ops->mmap(dmabuf, vma); } static loff_t dma_buf_llseek(struct file *file, loff_t offset, int whence) @@ -677,12 +557,6 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file) spin_unlock(&dmabuf->name_lock); } -static int dma_buf_flush(struct file *file, fl_owner_t id) -{ - dma_buf_unaccount_task(file->private_data, current); - return 0; -} - static const struct file_operations dma_buf_fops = { .release = dma_buf_file_release, .mmap = dma_buf_mmap_internal, @@ -691,7 +565,6 @@ static const struct file_operations dma_buf_fops = { .unlocked_ioctl = dma_buf_ioctl, .compat_ioctl = compat_ptr_ioctl, .show_fdinfo = dma_buf_show_fdinfo, - .flush = dma_buf_flush, }; /* @@ -1682,8 +1555,6 @@ EXPORT_SYMBOL_GPL(dma_buf_end_cpu_access_partial); int dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma, unsigned long pgoff) { - int ret; - if (WARN_ON(!dmabuf || !vma)) return -EINVAL; @@ -1704,15 +1575,7 @@ int dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma, vma_set_file(vma, dmabuf->file); vma->vm_pgoff = pgoff; - ret = dma_buf_account_task(dmabuf, current); - if (ret) - return ret; - - ret = dmabuf->ops->mmap(dmabuf, vma); - if (ret) - dma_buf_unaccount_task(dmabuf, current); - - return ret; + return dmabuf->ops->mmap(dmabuf, vma); } EXPORT_SYMBOL_NS_GPL(dma_buf_mmap, DMA_BUF); diff --git a/fs/file.c b/fs/file.c index e924929ac366..1f1181b189bf 100644 --- a/fs/file.c +++ b/fs/file.c @@ -20,7 +20,6 @@ #include #include #include -#include #include #include "internal.h" @@ -594,9 +593,6 @@ void fd_install(unsigned int fd, struct file *file) struct files_struct *files = current->files; struct fdtable *fdt; - if (is_dma_buf_file(file) && dma_buf_account_task(file->private_data, current)) - pr_err("FD dmabuf accounting failed\n"); - rcu_read_lock_sched(); if (unlikely(files->resize_in_progress)) { diff --git a/fs/proc/base.c b/fs/proc/base.c index f7d8188b0ccf..7cff02bc816e 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -100,7 +100,6 @@ #include #include #include -#include #include #include #include "internal.h" @@ -3305,24 +3304,6 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns, } #endif /* CONFIG_STACKLEAK_METRICS */ -#ifdef CONFIG_DMA_SHARED_BUFFER -static int proc_dmabuf_rss_show(struct seq_file *m, struct pid_namespace *ns, - struct pid *pid, struct task_struct *task) -{ - if (!task->dmabuf_info) { - pr_err("%s dmabuf accounting record was not allocated\n", __func__); - return -ENOMEM; - } - - if (!(task->flags & PF_KTHREAD)) - seq_printf(m, "%lld\n", READ_ONCE(task->dmabuf_info->rss)); - else - seq_puts(m, "0\n"); - - return 0; -} -#endif - /* * Thread groups */ @@ -3446,9 +3427,6 @@ static const struct pid_entry tgid_base_stuff[] = { ONE("ksm_merging_pages", S_IRUSR, proc_pid_ksm_merging_pages), 
ONE("ksm_stat", S_IRUSR, proc_pid_ksm_stat), #endif -#ifdef CONFIG_DMA_SHARED_BUFFER - ONE("dmabuf_rss", S_IRUGO, proc_dmabuf_rss_show), -#endif }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 1647fb38fe80..64d67293d76b 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -24,9 +24,6 @@ #include #include #include -#ifndef __GENKSYMS__ -#include -#endif struct device; struct dma_buf; @@ -642,43 +639,6 @@ struct dma_buf_export_info { ANDROID_KABI_RESERVE(2); }; -/** - * struct task_dma_buf_record - Holds the number of (VMA and FD) references to a - * dmabuf by a collection of tasks that share both mm_struct and files_struct. - * This is the list entry type for @task_dma_buf_info dmabufs list. - * - * @node: Stores the list this record is on. - * @dmabuf: The dmabuf this record is for. - * @refcnt: The number of VMAs and FDs that reference @dmabuf by the tasks that - * share this record. - */ -struct task_dma_buf_record { - struct list_head node; - struct dma_buf *dmabuf; - unsigned long refcnt; -}; - -/** - * struct task_dma_buf_info - Holds a RSS counter, and a list of dmabufs for all - * tasks that share both mm_struct and files_struct. - * - * @rss: The sum of all dmabuf memory referenced by the tasks via memory - * mappings or file descriptors in bytes. Buffers referenced more than - * once by the process (multiple mmaps, multiple FDs, or any combination - * of both mmaps and FDs) only cause the buffer to be accounted to the - * process once. Partial mappings cause the full size of the buffer to be - * accounted, regardless of the size of the mapping. - * @refcnt: The number of tasks sharing this struct. - * @lock: Lock protecting writes for @rss, and reads/writes for @dmabufs. - * @dmabufs: List of all dmabufs referenced by the tasks. 
- */ -struct task_dma_buf_info { - s64 rss; - refcount_t refcnt; - spinlock_t lock; - struct list_head dmabufs; -}; - /** * DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters * @name: export-info name @@ -781,7 +741,4 @@ int dma_buf_vmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map); void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map); long dma_buf_set_name(struct dma_buf *dmabuf, const char *name); int dma_buf_get_flags(struct dma_buf *dmabuf, unsigned long *flags); - -int dma_buf_account_task(struct dma_buf *dmabuf, struct task_struct *task); -void dma_buf_unaccount_task(struct dma_buf *dmabuf, struct task_struct *task); #endif /* __DMA_BUF_H__ */ diff --git a/include/linux/sched.h b/include/linux/sched.h index 68ba96bde447..1299b4497d87 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -70,7 +70,6 @@ struct seq_file; struct sighand_struct; struct signal_struct; struct task_delay_info; -struct task_dma_buf_info; struct task_group; struct user_event_mm; @@ -1517,9 +1516,6 @@ struct task_struct { */ struct callback_head l1d_flush_kill; #endif - - struct task_dma_buf_info *dmabuf_info; - ANDROID_KABI_RESERVE(1); ANDROID_KABI_RESERVE(2); ANDROID_KABI_RESERVE(3); diff --git a/init/init_task.c b/init/init_task.c index d80c007ab59b..31ceb0e469f7 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -214,7 +214,6 @@ struct task_struct init_task .android_vendor_data1 = {0, }, .android_oem_data1 = {0, }, #endif - .dmabuf_info = NULL, }; EXPORT_SYMBOL(init_task); diff --git a/kernel/fork.c b/kernel/fork.c index 66636a979911..75b1a4458a7e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -101,7 +101,6 @@ #include #include #include -#include #include #include @@ -995,32 +994,12 @@ static inline void put_signal_struct(struct signal_struct *sig) free_signal_struct(sig); } -static void put_dmabuf_info(struct task_struct *tsk) -{ - if (!tsk->dmabuf_info) { - pr_err("%s dmabuf accounting record was not allocated\n", __func__); - return; - } - - if (!refcount_dec_and_test(&tsk->dmabuf_info->refcnt)) - return; - - if (READ_ONCE(tsk->dmabuf_info->rss)) - pr_err("%s destroying task with non-zero dmabuf rss\n", __func__); - - if (!list_empty(&tsk->dmabuf_info->dmabufs)) - pr_err("%s destroying task with non-empty dmabuf list\n", __func__); - - kfree(tsk->dmabuf_info); -} - void __put_task_struct(struct task_struct *tsk) { WARN_ON(!tsk->exit_state); WARN_ON(refcount_read(&tsk->usage)); WARN_ON(tsk == current); - put_dmabuf_info(tsk); io_uring_free(tsk); cgroup_free(tsk); task_numa_free(tsk, true); @@ -2289,58 +2268,6 @@ static void rv_task_fork(struct task_struct *p) #define rv_task_fork(p) do {} while (0) #endif -static int copy_dmabuf_info(u64 clone_flags, struct task_struct *p) -{ - struct task_dma_buf_record *rec, *copy; - - if (current->dmabuf_info && (clone_flags & (CLONE_VM | CLONE_FILES)) - == (CLONE_VM | CLONE_FILES)) { - /* - * Both MM and FD references to dmabufs are shared with the parent, so - * we can share a RSS counter with the parent. 
- */ - refcount_inc(¤t->dmabuf_info->refcnt); - p->dmabuf_info = current->dmabuf_info; - return 0; - } - - p->dmabuf_info = kmalloc(sizeof(*p->dmabuf_info), GFP_KERNEL); - if (!p->dmabuf_info) - return -ENOMEM; - - refcount_set(&p->dmabuf_info->refcnt, 1); - spin_lock_init(&p->dmabuf_info->lock); - INIT_LIST_HEAD(&p->dmabuf_info->dmabufs); - if (current->dmabuf_info) { - spin_lock(¤t->dmabuf_info->lock); - p->dmabuf_info->rss = current->dmabuf_info->rss; - list_for_each_entry(rec, ¤t->dmabuf_info->dmabufs, node) { - copy = kmalloc(sizeof(*copy), GFP_KERNEL); - if (!copy) { - spin_unlock(¤t->dmabuf_info->lock); - goto err_list_copy; - } - - copy->dmabuf = rec->dmabuf; - copy->refcnt = rec->refcnt; - list_add(©->node, &p->dmabuf_info->dmabufs); - } - spin_unlock(¤t->dmabuf_info->lock); - } else { - p->dmabuf_info->rss = 0; - } - - return 0; - -err_list_copy: - list_for_each_entry_safe(rec, copy, &p->dmabuf_info->dmabufs, node) { - list_del(&rec->node); - kfree(rec); - } - kfree(p->dmabuf_info); - return -ENOMEM; -} - /* * This creates a new process as a copy of the old one, * but does not actually start it yet. @@ -2582,18 +2509,14 @@ __latent_entropy struct task_struct *copy_process( p->bpf_ctx = NULL; #endif - retval = copy_dmabuf_info(clone_flags, p); - if (retval) - goto bad_fork_cleanup_policy; - /* Perform scheduler related setup. Assign this task to a CPU. */ retval = sched_fork(clone_flags, p); if (retval) - goto bad_fork_cleanup_dmabuf; + goto bad_fork_cleanup_policy; retval = perf_event_init_task(p, clone_flags); if (retval) - goto bad_fork_cleanup_dmabuf; + goto bad_fork_cleanup_policy; retval = audit_alloc(p); if (retval) goto bad_fork_cleanup_perf; @@ -2896,8 +2819,6 @@ bad_fork_cleanup_audit: audit_free(p); bad_fork_cleanup_perf: perf_event_free_task(p); -bad_fork_cleanup_dmabuf: - put_dmabuf_info(p); bad_fork_cleanup_policy: lockdep_free_task(p); #ifdef CONFIG_NUMA diff --git a/mm/mmap.c b/mm/mmap.c index 6da684ab9f98..4c74fb3d7a94 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -49,7 +49,6 @@ #include #include #include -#include #include #include @@ -145,11 +144,8 @@ static void remove_vma(struct vm_area_struct *vma, bool unreachable) { might_sleep(); vma_close(vma); - if (vma->vm_file) { - if (is_dma_buf_file(vma->vm_file)) - dma_buf_unaccount_task(vma->vm_file->private_data, current); + if (vma->vm_file) fput(vma->vm_file); - } mpol_put(vma_policy(vma)); if (unreachable) __vm_area_free(vma); @@ -2421,14 +2417,8 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, if (err) goto out_free_mpol; - if (new->vm_file) { + if (new->vm_file) get_file(new->vm_file); - if (is_dma_buf_file(new->vm_file)) { - /* Should never fail since this task already references the buffer */ - if (dma_buf_account_task(new->vm_file->private_data, current)) - pr_err("%s failed to account dmabuf\n", __func__); - } - } if (new->vm_ops && new->vm_ops->open) new->vm_ops->open(new); From f99b0f6dd206581de3941fce75a1e1d72bc92979 Mon Sep 17 00:00:00 2001 From: Snehal Koukuntla Date: Wed, 9 Jul 2025 11:35:14 +0000 Subject: [PATCH 43/49] ANDROID: KVM: arm64: Increase the pkvm reclaim buffer size Increase the internal reclaim buffer size of pkvm to accommodate Pixel use cases. 
Without this we are seeing a >50% failure rate. Bug: 426242992 Change-Id: I892cb1fe30fa97fea044187728d814dd832dd929 Signed-off-by: Snehal Koukuntla --- arch/arm64/include/asm/kvm_pkvm.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/kvm_pkvm.h b/arch/arm64/include/asm/kvm_pkvm.h index 4a16808c3ba8..80a1526684cb 100644 --- a/arch/arm64/include/asm/kvm_pkvm.h +++ b/arch/arm64/include/asm/kvm_pkvm.h @@ -593,7 +593,7 @@ static inline unsigned long host_s2_pgtable_pages(void) * Maximum number of consitutents allowed in a descriptor. This number is * arbitrary, see comment below on SG_MAX_SEGMENTS in hyp_ffa_proxy_pages(). */ -#define KVM_FFA_MAX_NR_CONSTITUENTS 4096 +#define KVM_FFA_MAX_NR_CONSTITUENTS 12288 static inline unsigned long hyp_ffa_proxy_pages(void) { From 789dd354a87c622bb04d7c7f0e861b7797a3c430 Mon Sep 17 00:00:00 2001 From: Mostafa Saleh Date: Wed, 2 Jul 2025 12:16:07 +0000 Subject: [PATCH 44/49] ANDROID: KVM: arm64: Don't update IOMMU under memory pressure host_stage2_unmap_unmoveable_regs() is called when the hypervisor pool is under pressure to map stage-2 entries, so it unmaps all entries that can't be donated and are still owned by the host, so they can be lazily faulted later. But that doesn't change the ownership of any pages, so they are still owned by the host and must remain mapped in the IOMMU. Bug: 428939924 Change-Id: Id91183619a316a67bda48d8e9adf9b6ef49c104f Signed-off-by: Mostafa Saleh --- arch/arm64/kvm/hyp/nvhe/mem_protect.c | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c index bc1f8cb3faf3..afdd36e4ae8a 100644 --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c @@ -491,17 +491,9 @@ int __pkvm_prot_finalize(void) int host_stage2_unmap_reg_locked(phys_addr_t start, u64 size) { - int ret; - hyp_assert_lock_held(&host_mmu.lock); - ret = kvm_pgtable_stage2_reclaim_leaves(&host_mmu.pgt, start, size); - if (ret) - return ret; - - kvm_iommu_host_stage2_idmap(start, start + size, 0); - - return 0; + return kvm_pgtable_stage2_reclaim_leaves(&host_mmu.pgt, start, size); } static int host_stage2_unmap_unmoveable_regs(void) From 46e269016e966c5d4c38a3fc8ebe7c65d4609ba2 Mon Sep 17 00:00:00 2001 From: Zhengxu Zhang Date: Thu, 12 Jun 2025 09:14:21 +0800 Subject: [PATCH 45/49] FROMGIT: exfat: fdatasync flag should be same like generic_write_sync() Test: androbench with default settings, using a 64GB sdcard. Random write speed: 3.5MB/s without this patch, 7MB/s with this patch. After commit 11a347fb6cef, the random write speed decreased significantly. That commit modified the .write_iter() implementation; comparing it with generic_file_write_iter(), which calls generic_write_sync(), exfat_file_write_iter() calls vfs_fsync_range() with the wrong fdatasync flag, so the fdatasync mode is never used and the random write speed drops. So use generic_write_sync() instead of vfs_fsync_range().
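For reference, generic_write_sync() in include/linux/fs.h resolves the datasync argument roughly as in the sketch below (a paraphrase, not a verbatim copy of the header):

	static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
	{
		if (iocb_is_dsync(iocb)) {
			/*
			 * datasync = 0 (full fsync) only when IOCB_SYNC is set;
			 * an O_DSYNC-only write gets datasync = 1, i.e. the
			 * cheaper fdatasync behaviour.
			 */
			int ret = vfs_fsync_range(iocb->ki_filp,
					iocb->ki_pos - count, iocb->ki_pos - 1,
					(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);

			if (ret)
				return ret;
		}

		return count;
	}

The old exfat call passed (iocb->ki_flags & IOCB_SYNC) directly as the datasync argument, which is inverted relative to the helper above, so O_DSYNC-only writes ended up doing a full fsync.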
Fixes: 11a347fb6cef ("exfat: change to get file size from DataLength") Bug: 427084532 (cherry picked from commit 309914e6602c9e17ff84b20db8c4f1da0d6a2a36 https://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat.git dev) Link: https://lore.kernel.org/all/20250619013331.664521-1-zhengxu.zhang@unisoc.com/ Change-Id: I68319a27cabedd9d4a7fa35948affd8c27d72160 Signed-off-by: Zhengxu Zhang Acked-by: Yuezhang Mo Signed-off-by: Namjae Jeon --- fs/exfat/file.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/exfat/file.c b/fs/exfat/file.c index efd24e29f119..272208708ffc 100644 --- a/fs/exfat/file.c +++ b/fs/exfat/file.c @@ -610,9 +610,8 @@ static ssize_t exfat_file_write_iter(struct kiocb *iocb, struct iov_iter *iter) if (pos > valid_size) pos = valid_size; - if (iocb_is_dsync(iocb) && iocb->ki_pos > pos) { - ssize_t err = vfs_fsync_range(file, pos, iocb->ki_pos - 1, - iocb->ki_flags & IOCB_SYNC); + if (iocb->ki_pos > pos) { + ssize_t err = generic_write_sync(iocb, iocb->ki_pos - pos); if (err < 0) return err; } From 7f4572a697fa838816437adc3e9645201dc9c817 Mon Sep 17 00:00:00 2001 From: daiyang5 Date: Mon, 7 Jul 2025 10:04:20 +0800 Subject: [PATCH 46/49] ANDROID: export folio_deactivate() for GKI purpose. Export the symbol folio_deactivate() to access LRU list in ko module for customizing activate and deactivate operations. This is a necessary component of our memory reclaim strategy. Bug: 429908837 Change-Id: Ied760489b2c1726dbfe52629f6d544aa607e5106 Signed-off-by: daiyang5 --- mm/swap.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/swap.c b/mm/swap.c index 174259a9a5f7..30b5eebce985 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -736,6 +736,7 @@ void folio_deactivate(struct folio *folio) local_unlock(&cpu_fbatches.lock); } } +EXPORT_SYMBOL_GPL(folio_deactivate); /** * folio_mark_lazyfree - make an anon folio lazyfree From 615449fbacaefd7198d5d1404c6a3ef1149aba28 Mon Sep 17 00:00:00 2001 From: daiyang5 Date: Mon, 7 Jul 2025 08:59:31 +0800 Subject: [PATCH 47/49] ANDROID: GKI: Update symbol list for xiaomi 2 function symbol(s) added 'void folio_deactivate(struct folio*)' 'void folio_mark_accessed(struct folio*)' Bug: 429908837 Change-Id: I575c450aa91ff4298f681203efd1debfb2c810c5 Signed-off-by: daiyang5 --- android/abi_gki_aarch64.stg | 20 ++++++++++++++++++++ android/abi_gki_aarch64_xiaomi | 4 ++++ 2 files changed, 24 insertions(+) diff --git a/android/abi_gki_aarch64.stg b/android/abi_gki_aarch64.stg index 43a9176d2de6..78f10f7d9bec 100644 --- a/android/abi_gki_aarch64.stg +++ b/android/abi_gki_aarch64.stg @@ -393188,6 +393188,15 @@ elf_symbol { type_id: 0xf6f86f1f full_name: "folio_clear_dirty_for_io" } +elf_symbol { + id: 0x1ac8aa52 + name: "folio_deactivate" + is_defined: true + symbol_type: FUNCTION + crc: 0x7abc9b3a + type_id: 0x18c46588 + full_name: "folio_deactivate" +} elf_symbol { id: 0xf83588d6 name: "folio_end_private_2" @@ -393215,6 +393224,15 @@ elf_symbol { type_id: 0x637004ab full_name: "folio_mapping" } +elf_symbol { + id: 0xd2e101fd + name: "folio_mark_accessed" + is_defined: true + symbol_type: FUNCTION + crc: 0x74311ee4 + type_id: 0x18c46588 + full_name: "folio_mark_accessed" +} elf_symbol { id: 0xcef0ca54 name: "folio_mark_dirty" @@ -440717,9 +440735,11 @@ interface { symbol_id: 0x3c7c2553 symbol_id: 0x06c58be7 symbol_id: 0xab55569c + symbol_id: 0x1ac8aa52 symbol_id: 0xf83588d6 symbol_id: 0xa1c5bd8d symbol_id: 0x159a69a3 + symbol_id: 0xd2e101fd symbol_id: 0xcef0ca54 symbol_id: 0x39840ab2 symbol_id: 0xc05a6c7d diff --git 
a/android/abi_gki_aarch64_xiaomi b/android/abi_gki_aarch64_xiaomi index a8531903d2a7..1cc056048b75 100644 --- a/android/abi_gki_aarch64_xiaomi +++ b/android/abi_gki_aarch64_xiaomi @@ -197,6 +197,10 @@ __tracepoint_android_rvh_dequeue_task_fair __tracepoint_android_rvh_entity_tick +# required by mi_damon.ko + folio_deactivate + folio_mark_accessed + #required by cpq.ko elv_rb_former_request elv_rb_latter_request From fe630a04152399fa0646fa16cabae8dee2901a20 Mon Sep 17 00:00:00 2001 From: Rui Chen Date: Tue, 1 Jul 2025 17:57:26 +0800 Subject: [PATCH 48/49] FROMGIT: f2fs: introduce reserved_pin_section sysfs entry This patch introduces /sys/fs/f2fs//reserved_pin_section for tuning @needed parameter of has_not_enough_free_secs(), if we configure it w/ zero, it can avoid f2fs_gc() as much as possible while fallocating on pinned file. Signed-off-by: Chao Yu Reviewed-by: wangzijie Signed-off-by: Jaegeuk Kim Bug: 428889879 Bug: 431132476 (cherry picked from commit 59c1c89e9ba8cefff05aa982dd9e6719f25e8ec5 https: //git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git dev) Link: https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git/commit/?id=59c1c89e9ba8 Change-Id: I07184caa6e5037d45258474dcca8adf1836b0f2d Signed-off-by: Rui Chen (cherry picked from commit 12727f8a4b65b2fb55a7fc88199ab5f854be52a4) --- Documentation/ABI/testing/sysfs-fs-f2fs | 9 +++++++++ fs/f2fs/f2fs.h | 3 +++ fs/f2fs/file.c | 5 ++--- fs/f2fs/super.c | 4 ++++ fs/f2fs/sysfs.c | 9 +++++++++ 5 files changed, 27 insertions(+), 3 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs index 7e7ffbe8167b..eec506c44d97 100644 --- a/Documentation/ABI/testing/sysfs-fs-f2fs +++ b/Documentation/ABI/testing/sysfs-fs-f2fs @@ -858,3 +858,12 @@ Description: This is a read-only entry to show the value of sb.s_encoding_flags, SB_ENC_STRICT_MODE_FL 0x00000001 SB_ENC_NO_COMPAT_FALLBACK_FL 0x00000002 ============================ ========== + +What: /sys/fs/f2fs//reserved_pin_section +Date: June 2025 +Contact: "Chao Yu" +Description: This threshold is used to control triggering garbage collection while + fallocating on pinned file, so, it can guarantee there is enough free + reserved section before preallocating on pinned file. + By default, the value is ovp_sections, especially, for zoned ufs, the + value is 1. diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index 654812a3acc7..f0932cb5a18a 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -1703,6 +1703,9 @@ struct f2fs_sb_info { /* for skip statistic */ unsigned long long skipped_gc_rwsem; /* FG_GC only */ + /* free sections reserved for pinned file */ + unsigned int reserved_pin_section; + /* threshold for gc trials on pinned files */ unsigned short gc_pin_file_threshold; struct f2fs_rwsem pin_sem; diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c index 479d49dd4ce5..f8832212ee37 100644 --- a/fs/f2fs/file.c +++ b/fs/f2fs/file.c @@ -1859,9 +1859,8 @@ next_alloc: } } - if (has_not_enough_free_secs(sbi, 0, f2fs_sb_has_blkzoned(sbi) ? 
- ZONED_PIN_SEC_REQUIRED_COUNT : - GET_SEC_FROM_SEG(sbi, overprovision_segments(sbi)))) { + if (has_not_enough_free_secs(sbi, 0, + sbi->reserved_pin_section)) { f2fs_down_write(&sbi->gc_lock); stat_inc_gc_call_count(sbi, FOREGROUND); err = f2fs_gc(sbi, &gc_control); diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c index 58d545d53aa6..b5d23377166d 100644 --- a/fs/f2fs/super.c +++ b/fs/f2fs/super.c @@ -4652,6 +4652,10 @@ try_onemore: /* get segno of first zoned block device */ sbi->first_zoned_segno = get_first_zoned_segno(sbi); + sbi->reserved_pin_section = f2fs_sb_has_blkzoned(sbi) ? + ZONED_PIN_SEC_REQUIRED_COUNT : + GET_SEC_FROM_SEG(sbi, overprovision_segments(sbi)); + /* Read accumulated write IO statistics if exists */ seg_i = CURSEG_I(sbi, CURSEG_HOT_NODE); if (__exist_node_summaries(sbi)) diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c index d4a63b0254b9..46216f0a203a 100644 --- a/fs/f2fs/sysfs.c +++ b/fs/f2fs/sysfs.c @@ -824,6 +824,13 @@ out: return count; } + if (!strcmp(a->attr.name, "reserved_pin_section")) { + if (t > GET_SEC_FROM_SEG(sbi, overprovision_segments(sbi))) + return -EINVAL; + *ui = (unsigned int)t; + return count; + } + *ui = (unsigned int)t; return count; @@ -1130,6 +1137,7 @@ F2FS_SBI_GENERAL_RO_ATTR(unusable_blocks_per_sec); F2FS_SBI_GENERAL_RW_ATTR(blkzone_alloc_policy); #endif F2FS_SBI_GENERAL_RW_ATTR(carve_out); +F2FS_SBI_GENERAL_RW_ATTR(reserved_pin_section); /* STAT_INFO ATTR */ #ifdef CONFIG_F2FS_STAT_FS @@ -1323,6 +1331,7 @@ static struct attribute *f2fs_attrs[] = { ATTR_LIST(last_age_weight), ATTR_LIST(max_read_extent_count), ATTR_LIST(carve_out), + ATTR_LIST(reserved_pin_section), NULL, }; ATTRIBUTE_GROUPS(f2fs); From 2dabc476cf95d9303aee3f6766878584ea3a246b Mon Sep 17 00:00:00 2001 From: Mukesh Ojha Date: Tue, 8 Jul 2025 13:28:38 +0530 Subject: [PATCH 49/49] FROMGIT: pinmux: fix race causing mux_owner NULL with active mux_usecount commit 5a3e85c3c397 ("pinmux: Use sequential access to access desc->pinmux data") tried to address the issue where two clients of the same gpio calling pinctrl_select_state() for the same functionality resulted in a NULL pointer dereference while accessing desc->mux_owner. However, the issue was not completely fixed due to the way it was handled, and it can still result in the same NULL pointer. The issue occurs due to the following interleaving between cpu0 (process A, in pin_request()) and cpu1 (process B, in pin_free()): cpu1 takes mutex_lock(desc->mux), decrements desc->mux_usecount (it becomes 0), and unlocks; cpu0 then takes mutex_lock(desc->mux), increments desc->mux_usecount (it becomes 1), sets desc->mux_owner = owner, and unlocks; cpu1 finally takes mutex_lock(desc->mux) again, sets desc->mux_owner = NULL, and unlocks. This sequence leads to a state where the pin appears to be in use (`mux_usecount == 1`) but has no owner (`mux_owner == NULL`), which can cause a NULL pointer dereference on the next pin_request() on the same pin. Ensure that updates to mux_usecount and mux_owner are performed atomically under the same lock. Only clear mux_owner when mux_usecount reaches zero and no new owner has been assigned.
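The locking rule the fix enforces can be summarized by the following standalone sketch; the names are invented for illustration and are not the pinctrl structures or helpers:

	#include <linux/mutex.h>

	struct example_pin {
		struct mutex lock;	/* protects usecount and owner together */
		unsigned int usecount;
		const char *owner;
	};

	static const char *example_pin_free(struct example_pin *pin)
	{
		const char *owner = NULL;

		mutex_lock(&pin->lock);
		/*
		 * The decrement and the owner-clearing happen in one critical
		 * section, so a concurrent request that re-takes the pin
		 * (usecount back to 1 with a new owner) can never be followed
		 * by a stale NULL assignment.
		 */
		if (pin->usecount && --pin->usecount == 0) {
			owner = pin->owner;
			pin->owner = NULL;
		}
		mutex_unlock(&pin->lock);

		return owner;
	}

Splitting the decrement and the owner-clearing into two critical sections reintroduces the window where usecount is 1 but owner is NULL.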
Bug: 430525600 Fixes: 5a3e85c3c397 ("pinmux: Use sequential access to access desc->pinmux data") Link: https://lore.kernel.org/lkml/20250708-pinmux-race-fix-v2-1-8ae9e8a0d1a1@oss.qualcomm.com/ Signed-off-by: Mukesh Ojha (cherry picked from commit 0b075c011032f88d1cfde3b45d6dcf08b44140eb git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl.git for-next) Change-Id: Iec29ea201ef0fc3d205bbc4f1a90cb5a56a62039 Signed-off-by: Mukesh Ojha --- drivers/pinctrl/pinmux.c | 20 +++++++++----------- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/drivers/pinctrl/pinmux.c b/drivers/pinctrl/pinmux.c index 97e8af88df85..ab853d6c586b 100644 --- a/drivers/pinctrl/pinmux.c +++ b/drivers/pinctrl/pinmux.c @@ -238,6 +238,15 @@ static const char *pin_free(struct pinctrl_dev *pctldev, int pin, if (desc->mux_usecount) return NULL; } + + if (gpio_range) { + owner = desc->gpio_owner; + desc->gpio_owner = NULL; + } else { + owner = desc->mux_owner; + desc->mux_owner = NULL; + desc->mux_setting = NULL; + } } /* @@ -249,17 +258,6 @@ static const char *pin_free(struct pinctrl_dev *pctldev, int pin, else if (ops->free) ops->free(pctldev, pin); - scoped_guard(mutex, &desc->mux_lock) { - if (gpio_range) { - owner = desc->gpio_owner; - desc->gpio_owner = NULL; - } else { - owner = desc->mux_owner; - desc->mux_owner = NULL; - desc->mux_setting = NULL; - } - } - module_put(pctldev->owner); return owner;