From 8930e5a3d3a284c328f8139c7ca60d1674c39088 Mon Sep 17 00:00:00 2001 From: Lukasz Raczylo Date: Fri, 24 Apr 2026 18:59:44 +0100 Subject: [PATCH] patches/linux: sync prose edits from talos-kernel-patches (diff bodies unchanged) --- ...CIe-posted-write-after-TSTART-doorbe.patch | 44 ++++++++------- ...k-ISR-after-IER-re-enable-in-macb_tx.patch | 55 ++++++++++--------- ...stall-watchdog-to-recover-from-lost-.patch | 49 ++++++++++------- 3 files changed, 83 insertions(+), 65 deletions(-) diff --git a/patches/linux/0001-net-macb-flush-PCIe-posted-write-after-TSTART-doorbe.patch b/patches/linux/0001-net-macb-flush-PCIe-posted-write-after-TSTART-doorbe.patch index 1ad061f..c3631fb 100644 --- a/patches/linux/0001-net-macb-flush-PCIe-posted-write-after-TSTART-doorbe.patch +++ b/patches/linux/0001-net-macb-flush-PCIe-posted-write-after-TSTART-doorbe.patch @@ -3,32 +3,38 @@ From: Lukasz Raczylo Date: Fri, 24 Apr 2026 00:00:00 +0000 Subject: [PATCH 1/3] net: macb: flush PCIe posted write after TSTART doorbell -macb_start_xmit() and macb_tx_restart() both kick transmission by -OR-ing MACB_BIT(TSTART) into NCR. On PCIe-attached macb instances -- -notably BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 -- the -doorbell write is a posted PCIe write that can sit in the fabric's -write queue until something drains it. A source-level comment at -the TSTART site already acknowledges the problem: +macb_start_xmit() and macb_tx_restart() kick transmission by +OR-ing MACB_BIT(TSTART) into NCR. On PCIe-attached macb instances +(BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is the setup we +have in front of us), writes to NCR are posted PCIe writes: they +are not guaranteed to reach the device before the issuing CPU +returns. An existing source-level comment at the TSTART site +acknowledges that such writes can be lost under some conditions: /* TSTART write might get dropped, so make the IRQ retrigger * a buffer read */ and arms a recovery handshake via queue->tx_pending / -queue->txubr_pending that is picked up on the next TCOMP interrupt. -That recovery path only runs if a TCOMP interrupt actually fires; -if the lost doorbell means no TX starts, there is no TCOMP, and the -ring stalls silently. +queue->txubr_pending that runs on the next TCOMP interrupt. That +recovery path depends on a subsequent TCOMP actually firing. If +the TSTART write never reaches the MAC, no TX begins, no TCOMP +completion arrives, and the ring remains quiescent without any +kernel-visible indication. -Add a read-back of NCR after the TSTART write. The read serialises -the PCIe posted-write queue and ensures the doorbell reaches the MAC -before macb_start_xmit() / macb_tx_restart() return. The existing -'TSTART might get dropped' handshake is preserved as a safety net -for cases where the fabric genuinely drops the write despite the -read barrier, but with this barrier it should rarely if ever be -needed on PCIe-attached parts. +Add a read-back of NCR after each TSTART write in macb_start_xmit() +and macb_tx_restart(). The read is an architected PCIe read +barrier for earlier posted writes on the same path; it ensures the +doorbell has reached the MAC before the functions return. -Observed to be the most common trigger for the silent TX stall -documented in the linked reports. +The existing tx_pending / txubr_pending handshake is left in place +unchanged -- it remains the correct recovery for any other reason +the MAC could silently fail to start TX. + +We do not have direct hardware evidence that TSTART is being lost +on the RP1 path. This patch is one of a three-patch series +("candidate fixes for silent TX stall on BCM2712/RP1"); see the +cover letter for context. We have verified it compiles and +applies cleanly; runtime verification is pending. Link: https://github.com/cilium/cilium/issues/43198 Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877 diff --git a/patches/linux/0002-net-macb-re-check-ISR-after-IER-re-enable-in-macb_tx.patch b/patches/linux/0002-net-macb-re-check-ISR-after-IER-re-enable-in-macb_tx.patch index 2a5f6dc..7665046 100644 --- a/patches/linux/0002-net-macb-re-check-ISR-after-IER-re-enable-in-macb_tx.patch +++ b/patches/linux/0002-net-macb-re-check-ISR-after-IER-re-enable-in-macb_tx.patch @@ -15,38 +15,41 @@ existing comment in the function notes: * interrupts are re-enabled. */ -and mitigates this by calling macb_tx_complete_pending() to look -for a completed descriptor whose TX_USED bit the hardware has -DMA'd but whose completion we processed without ever seeing an -interrupt for. +and mitigates this by calling macb_tx_complete_pending(), which +inspects driver-visible ring state (descriptor->ctrl, after rmb()) +and reschedules NAPI if a completion is observable in memory. -macb_tx_complete_pending() only inspects driver-visible ring state -(descriptor->ctrl, after rmb()). On PCIe-attached parts (BCM2712 + -RP1 on Raspberry Pi 5 in particular) the descriptor DMA write that -sets TX_USED can still be in flight in the PCIe fabric when we -check. The read-memory-barrier synchronises the CPU view of earlier -CPU writes, but does not force the peripheral's in-flight DMA to -retire. In that window the check returns false, napi exits, the -IER re-enable does not re-fire (the quirk above), and the queue -stalls silently. +On PCIe-attached parts (BCM2712 + RP1 on Raspberry Pi 5 is the +setup we have in front of us), the descriptor DMA write that sets +TX_USED may not have retired to system memory at the point +macb_tx_complete_pending() runs. The rmb() synchronises the CPU +view of earlier CPU writes; it is not sufficient to retire an +in-flight peripheral DMA write. Under that ordering the in-memory +descriptor can still read TX_USED=0 when the hardware has in fact +completed the frame; the check returns false; NAPI exits; the +quirk above prevents the re-enabled IER from re-firing; the ring +goes quiescent. -Re-check the hardware's own ISR state as well. Reading a MAC -register after IER re-enable serves two purposes: +Add an explicit ISR read after the IER write. The MMIO read +serves two independent purposes: - (1) It drains any in-flight PCIe DMA writes of descriptor state, - so a subsequent macb_tx_complete_pending() sees an accurate - view of TX_USED. + (1) It is an architected PCIe read barrier for earlier + peripheral-originated DMA writes on the same path, so a + subsequent macb_tx_complete_pending() observes any TX_USED + write that was in flight at the time of the barrier. - (2) It directly observes whether the hardware currently has a - pending TCOMP signal, catching the case the existing driver - comment describes (completions raised while masked, not - re-fired). + (2) It samples the hardware ISR directly, so a TCOMP bit that + the hardware set while TCOMP was masked is visible here, + independently of whether the descriptor DMA has retired. -If either path indicates pending work, schedule NAPI again. +If either signal indicates pending work, reschedule NAPI via the +same path as the existing check. -Combined with the PCIe posted-write flush in patch 1/3, this closes -the observed silent-TX-stall path on BCM2712/RP1 reported at the -links below. +This patch addresses one of three candidate races for the silent +TX stall described in the cover letter. Whether it is sufficient +by itself, or whether it requires the PCIe posted-write flush in +patch 1/3 to cover the observed behaviour, we have not yet +verified at runtime. Link: https://github.com/cilium/cilium/issues/43198 Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877 diff --git a/patches/linux/0003-net-macb-add-TX-stall-watchdog-to-recover-from-lost-.patch b/patches/linux/0003-net-macb-add-TX-stall-watchdog-to-recover-from-lost-.patch index 9d10ed2..c486941 100644 --- a/patches/linux/0003-net-macb-add-TX-stall-watchdog-to-recover-from-lost-.patch +++ b/patches/linux/0003-net-macb-add-TX-stall-watchdog-to-recover-from-lost-.patch @@ -1,30 +1,39 @@ From 0000000000000000000000000000000000000003 Mon Sep 17 00:00:00 2001 From: Lukasz Raczylo Date: Fri, 24 Apr 2026 00:00:00 +0000 -Subject: [PATCH 3/3] net: macb: add TX stall watchdog to recover from lost - TCOMP +Subject: [PATCH 3/3] net: macb: add TX stall watchdog as defence-in-depth + safety net -Patches 1/3 and 2/3 close two races by which a TCOMP interrupt can -be lost on PCIe-attached macb instances. This patch adds a -defence-in-depth safety net: a per-queue delayed_work that calls -macb_tx_restart() if queue->tx_tail has not advanced in one second -despite the ring being non-empty. +Patches 1/3 and 2/3 address two candidate races that could lead +to a TCOMP completion being missed on PCIe-attached macb +instances. This patch adds a defence-in-depth safety net, in +case a further race remains that we have not identified. -The watchdog introduces no new recovery logic. macb_tx_restart() -already exists, is correctly locked, and already checks the -hardware's TBQP against the driver's head index before writing -TSTART: on a healthy ring it is a no-op at the hardware level. All -the watchdog adds is the trigger. +The watchdog is a per-queue delayed_work that runs once per +second. It snapshots queue->tx_tail; if the ring is non-empty +(queue->tx_head != queue->tx_tail) and tx_tail has not advanced +since the previous tick, it calls macb_tx_restart(). -If patches 1/3 and 2/3 completely eliminate the stall, this code -never does anything beyond a spin_lock/unlock and a branch per -second per queue. If a further race remains -- hardware or -driver-level -- this turns a multi-minute silent hang into a -one-second bump. +No new recovery logic is introduced. macb_tx_restart() already +exists in this file, is correctly locked (tx_ptr_lock, bp->lock), +and verifies that the hardware's TBQP is behind the driver's +head index before re-asserting TSTART. On a healthy ring it is +a no-op at the hardware level; the watchdog only supplies the +missing trigger. -On our 24-node Raspberry Pi 5 fleet this was empirically needed: -before the patches in this series, multiple nodes per day hit the -stall and required external watchdog intervention to recover. +On a healthy queue the per-tick cost is one spin_lock_irqsave() +/ spin_unlock_irqrestore() and one branch. The delayed_work is +only scheduled between macb_open() and macb_close(), and is +cancelled synchronously on close. + +Context for submission: on our 24-node Raspberry Pi 5 fleet, +before this series, an out-of-band user-space watchdog +(monitoring tx_packets from /sys/class/net/.../statistics and +toggling the link down/up when it froze) was required to keep +nodes usable. We include this kernel-side watchdog as a cleaner +in-kernel equivalent for any residual stall that patches 1 and +2 do not cover. We are willing to drop this patch if the view +is that 1 and 2 should stand alone. Link: https://github.com/cilium/cilium/issues/43198 Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877