patches/linux: sync prose edits from talos-kernel-patches (diff bodies unchanged)

2026-06-11 00:09:29 +00:00 · 2026-04-24 18:59:44 +01:00
parent 2ed45176d6
commit 8930e5a3d3
3 changed files with 83 additions and 65 deletions
@@ -3,32 +3,38 @@ From: Lukasz Raczylo <lukasz@raczylo.com>
 Date: Fri, 24 Apr 2026 00:00:00 +0000
 Subject: [PATCH 1/3] net: macb: flush PCIe posted write after TSTART doorbell

-macb_start_xmit() and macb_tx_restart() both kick transmission by
-OR-ing MACB_BIT(TSTART) into NCR.  On PCIe-attached macb instances --
-notably BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 -- the
-doorbell write is a posted PCIe write that can sit in the fabric's
-write queue until something drains it.  A source-level comment at
-the TSTART site already acknowledges the problem:
+macb_start_xmit() and macb_tx_restart() kick transmission by
+OR-ing MACB_BIT(TSTART) into NCR.  On PCIe-attached macb instances
+(BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is the setup we
+have in front of us), writes to NCR are posted PCIe writes: they
+are not guaranteed to reach the device before the issuing CPU
+returns.  An existing source-level comment at the TSTART site
+acknowledges that such writes can be lost under some conditions:

 	/* TSTART write might get dropped, so make the IRQ retrigger
 	 * a buffer read */

 and arms a recovery handshake via queue->tx_pending /
-queue->txubr_pending that is picked up on the next TCOMP interrupt.
-That recovery path only runs if a TCOMP interrupt actually fires;
-if the lost doorbell means no TX starts, there is no TCOMP, and the
-ring stalls silently.
+queue->txubr_pending that runs on the next TCOMP interrupt.  That
+recovery path depends on a subsequent TCOMP actually firing.  If
+the TSTART write never reaches the MAC, no TX begins, no TCOMP
+completion arrives, and the ring remains quiescent without any
+kernel-visible indication.

-Add a read-back of NCR after the TSTART write.  The read serialises
-the PCIe posted-write queue and ensures the doorbell reaches the MAC
-before macb_start_xmit() / macb_tx_restart() return.  The existing
-'TSTART might get dropped' handshake is preserved as a safety net
-for cases where the fabric genuinely drops the write despite the
-read barrier, but with this barrier it should rarely if ever be
-needed on PCIe-attached parts.
+Add a read-back of NCR after each TSTART write in macb_start_xmit()
+and macb_tx_restart().  The read is an architected PCIe read
+barrier for earlier posted writes on the same path; it ensures the
+doorbell has reached the MAC before the functions return.

-Observed to be the most common trigger for the silent TX stall
-documented in the linked reports.
+The existing tx_pending / txubr_pending handshake is left in place
+unchanged -- it remains the correct recovery for any other reason
+the MAC could silently fail to start TX.
+
+We do not have direct hardware evidence that TSTART is being lost
+on the RP1 path.  This patch is one of a three-patch series
+("candidate fixes for silent TX stall on BCM2712/RP1"); see the
+cover letter for context.  We have verified it compiles and
+applies cleanly; runtime verification is pending.

 Link: https://github.com/cilium/cilium/issues/43198
 Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
@@ -15,38 +15,41 @@ existing comment in the function notes:
 	 * interrupts are re-enabled.
 	 */

-and mitigates this by calling macb_tx_complete_pending() to look
-for a completed descriptor whose TX_USED bit the hardware has
-DMA'd but whose completion we processed without ever seeing an
-interrupt for.
+and mitigates this by calling macb_tx_complete_pending(), which
+inspects driver-visible ring state (descriptor->ctrl, after rmb())
+and reschedules NAPI if a completion is observable in memory.

-macb_tx_complete_pending() only inspects driver-visible ring state
-(descriptor->ctrl, after rmb()).  On PCIe-attached parts (BCM2712 +
-RP1 on Raspberry Pi 5 in particular) the descriptor DMA write that
-sets TX_USED can still be in flight in the PCIe fabric when we
-check.  The read-memory-barrier synchronises the CPU view of earlier
-CPU writes, but does not force the peripheral's in-flight DMA to
-retire.  In that window the check returns false, napi exits, the
-IER re-enable does not re-fire (the quirk above), and the queue
-stalls silently.
+On PCIe-attached parts (BCM2712 + RP1 on Raspberry Pi 5 is the
+setup we have in front of us), the descriptor DMA write that sets
+TX_USED may not have retired to system memory at the point
+macb_tx_complete_pending() runs.  The rmb() only orders the CPU's
+own loads against each other; it cannot force an in-flight
+peripheral DMA write to retire.  Under that ordering the in-memory
+descriptor can still read TX_USED=0 when the hardware has in fact
+completed the frame; the check returns false; NAPI exits; the
+quirk above prevents the re-enabled IER from re-firing; the ring
+goes quiescent.

-Re-check the hardware's own ISR state as well.  Reading a MAC
-register after IER re-enable serves two purposes:
+Add an explicit ISR read after the IER write.  The MMIO read
+serves two independent purposes:

-  (1) It drains any in-flight PCIe DMA writes of descriptor state,
-      so a subsequent macb_tx_complete_pending() sees an accurate
-      view of TX_USED.
+  (1) It is an architected PCIe read barrier for earlier
+      peripheral-originated DMA writes on the same path, so a
+      subsequent macb_tx_complete_pending() observes any TX_USED
+      write that was in flight at the time of the barrier.

-  (2) It directly observes whether the hardware currently has a
-      pending TCOMP signal, catching the case the existing driver
-      comment describes (completions raised while masked, not
-      re-fired).
+  (2) It samples the hardware ISR directly, so a TCOMP bit that
+      the hardware set while TCOMP was masked is visible here,
+      independently of whether the descriptor DMA has retired.

-If either path indicates pending work, schedule NAPI again.
+If either signal indicates pending work, reschedule NAPI via the
+same path as the existing check.

-Combined with the PCIe posted-write flush in patch 1/3, this closes
-the observed silent-TX-stall path on BCM2712/RP1 reported at the
-links below.
+This patch addresses one of three candidate races for the silent
+TX stall described in the cover letter.  Whether it is sufficient
+by itself, or whether it requires the PCIe posted-write flush in
+patch 1/3 to cover the observed behaviour, we have not yet
+verified at runtime.

 Link: https://github.com/cilium/cilium/issues/43198
 Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
@@ -1,30 +1,39 @@
 From 0000000000000000000000000000000000000003 Mon Sep 17 00:00:00 2001
 From: Lukasz Raczylo <lukasz@raczylo.com>
 Date: Fri, 24 Apr 2026 00:00:00 +0000
-Subject: [PATCH 3/3] net: macb: add TX stall watchdog to recover from lost
- TCOMP
+Subject: [PATCH 3/3] net: macb: add TX stall watchdog as defence-in-depth
+ safety net

-Patches 1/3 and 2/3 close two races by which a TCOMP interrupt can
-be lost on PCIe-attached macb instances.  This patch adds a
-defence-in-depth safety net: a per-queue delayed_work that calls
-macb_tx_restart() if queue->tx_tail has not advanced in one second
-despite the ring being non-empty.
+Patches 1/3 and 2/3 address two candidate races that could lead
+to a TCOMP completion being missed on PCIe-attached macb
+instances.  This patch adds a defence-in-depth safety net, in
+case a further race remains that we have not identified.

-The watchdog introduces no new recovery logic.  macb_tx_restart()
-already exists, is correctly locked, and already checks the
-hardware's TBQP against the driver's head index before writing
-TSTART: on a healthy ring it is a no-op at the hardware level.  All
-the watchdog adds is the trigger.
+The watchdog is a per-queue delayed_work that runs once per
+second.  It snapshots queue->tx_tail; if the ring is non-empty
+(queue->tx_head != queue->tx_tail) and tx_tail has not advanced
+since the previous tick, it calls macb_tx_restart().

-If patches 1/3 and 2/3 completely eliminate the stall, this code
-never does anything beyond a spin_lock/unlock and a branch per
-second per queue.  If a further race remains -- hardware or
-driver-level -- this turns a multi-minute silent hang into a
-one-second bump.
+No new recovery logic is introduced.  macb_tx_restart() already
+exists in this file, is correctly locked (tx_ptr_lock, bp->lock),
+and verifies that the hardware's TBQP is behind the driver's
+head index before re-asserting TSTART.  On a healthy ring it is
+a no-op at the hardware level; the watchdog only supplies the
+missing trigger.

-On our 24-node Raspberry Pi 5 fleet this was empirically needed:
-before the patches in this series, multiple nodes per day hit the
-stall and required external watchdog intervention to recover.
+On a healthy queue the per-tick cost is one spin_lock_irqsave()
+/ spin_unlock_irqrestore() and one branch.  The delayed_work is
+only scheduled between macb_open() and macb_close(), and is
+cancelled synchronously on close.
+
+Context for submission: on our 24-node Raspberry Pi 5 fleet,
+before this series, an out-of-band user-space watchdog
+(monitoring tx_packets from /sys/class/net/.../statistics and
+toggling the link down/up when it froze) was required to keep
+nodes usable.  We include this kernel-side watchdog as a cleaner
+in-kernel equivalent for any residual stall that patches 1 and
+2 do not cover.  We are willing to drop this patch if the view
+is that 1 and 2 should stand alone.

 Link: https://github.com/cilium/cilium/issues/43198
 Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
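
The per-tick decision described in patch 3/3 can be sketched as a small
user-space model. This is only an illustration of the stall test, not the
driver code: struct txq_model, its last_tail and restarts fields, and
txq_watchdog_tick() are hypothetical names, and the restart counter merely
stands in for the call to macb_tx_restart(). Locking and the delayed_work
scheduling are left out.

```c
#include <stdbool.h>

/* Hypothetical model of one TX queue. tx_head and tx_tail mirror the
 * names used in the commit message; the other fields are assumptions
 * made for this sketch. */
struct txq_model {
	unsigned int tx_head;   /* next descriptor the driver will fill */
	unsigned int tx_tail;   /* next descriptor awaiting completion */
	unsigned int last_tail; /* tx_tail as snapshotted on the previous tick */
	unsigned int restarts;  /* how many times the watchdog has fired */
};

/* One watchdog tick. Returns true when the ring is non-empty and
 * tx_tail has not moved since the previous tick. */
static bool txq_watchdog_tick(struct txq_model *q)
{
	bool ring_busy = q->tx_head != q->tx_tail;
	bool stalled = ring_busy && q->tx_tail == q->last_tail;

	if (stalled)
		q->restarts++;  /* stands in for macb_tx_restart() */

	/* Snapshot for the next tick, whether or not we fired. */
	q->last_tail = q->tx_tail;
	return stalled;
}
```

In this model a tick can fire on a queue that was filled just before the
tick and is in fact healthy. Going by the commit message, that would be
harmless in the driver, since macb_tx_restart() is described as a no-op at
the hardware level on a healthy ring.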