patches/linux: sync prose edits from talos-kernel-patches (diff bodies unchanged)

This commit is contained in:
2026-04-24 18:59:44 +01:00
parent 2ed45176d6
commit 8930e5a3d3
3 changed files with 83 additions and 65 deletions
@@ -3,32 +3,38 @@ From: Lukasz Raczylo <lukasz@raczylo.com>
Date: Fri, 24 Apr 2026 00:00:00 +0000
Subject: [PATCH 1/3] net: macb: flush PCIe posted write after TSTART doorbell
macb_start_xmit() and macb_tx_restart() both kick transmission by
OR-ing MACB_BIT(TSTART) into NCR. On PCIe-attached macb instances --
notably BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 -- the
doorbell write is a posted PCIe write that can sit in the fabric's
write queue until something drains it. A source-level comment at
the TSTART site already acknowledges the problem:
macb_start_xmit() and macb_tx_restart() kick transmission by
OR-ing MACB_BIT(TSTART) into NCR. On PCIe-attached macb instances
(BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is the setup we
have in front of us), writes to NCR are posted PCIe writes: they
are not guaranteed to reach the device before the issuing CPU
returns. An existing source-level comment at the TSTART site
acknowledges that such writes can be lost under some conditions:
/* TSTART write might get dropped, so make the IRQ retrigger
* a buffer read */
and arms a recovery handshake via queue->tx_pending /
queue->txubr_pending that is picked up on the next TCOMP interrupt.
That recovery path only runs if a TCOMP interrupt actually fires;
if the lost doorbell means no TX starts, there is no TCOMP, and the
ring stalls silently.
queue->txubr_pending that runs on the next TCOMP interrupt. That
recovery path depends on a subsequent TCOMP actually firing. If
the TSTART write never reaches the MAC, no TX begins, no TCOMP
completion arrives, and the ring remains quiescent without any
kernel-visible indication.
Add a read-back of NCR after the TSTART write. The read serialises
the PCIe posted-write queue and ensures the doorbell reaches the MAC
before macb_start_xmit() / macb_tx_restart() return. The existing
'TSTART might get dropped' handshake is preserved as a safety net
for cases where the fabric genuinely drops the write despite the
read barrier, but with this barrier it should rarely if ever be
needed on PCIe-attached parts.
Add a read-back of NCR after each TSTART write in macb_start_xmit()
and macb_tx_restart(). The read is an architected PCIe read
barrier for earlier posted writes on the same path; it ensures the
doorbell has reached the MAC before the functions return.
Observed to be the most common trigger for the silent TX stall
documented in the linked reports.
The existing tx_pending / txubr_pending handshake is left in place
unchanged -- it remains the correct recovery for any other reason
the MAC could silently fail to start TX.
We do not have direct hardware evidence that TSTART is being lost
on the RP1 path. This patch is one of a three-patch series
("candidate fixes for silent TX stall on BCM2712/RP1"); see the
cover letter for context. We have verified it compiles and
applies cleanly; runtime verification is pending.
Link: https://github.com/cilium/cilium/issues/43198
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
@@ -15,38 +15,41 @@ existing comment in the function notes:
* interrupts are re-enabled.
*/
and mitigates this by calling macb_tx_complete_pending() to look
for a completed descriptor whose TX_USED bit the hardware has
DMA'd but whose completion we processed without ever seeing an
interrupt for.
and mitigates this by calling macb_tx_complete_pending(), which
inspects driver-visible ring state (descriptor->ctrl, after rmb())
and reschedules NAPI if a completion is observable in memory.
macb_tx_complete_pending() only inspects driver-visible ring state
(descriptor->ctrl, after rmb()). On PCIe-attached parts (BCM2712 +
RP1 on Raspberry Pi 5 in particular) the descriptor DMA write that
sets TX_USED can still be in flight in the PCIe fabric when we
check. The read-memory-barrier synchronises the CPU view of earlier
CPU writes, but does not force the peripheral's in-flight DMA to
retire. In that window the check returns false, napi exits, the
IER re-enable does not re-fire (the quirk above), and the queue
stalls silently.
On PCIe-attached parts (BCM2712 + RP1 on Raspberry Pi 5 is the
setup we have in front of us), the descriptor DMA write that sets
TX_USED may not have retired to system memory at the point
macb_tx_complete_pending() runs. The rmb() synchronises the CPU
view of earlier CPU writes; it is not sufficient to retire an
in-flight peripheral DMA write. Under that ordering the in-memory
descriptor can still read TX_USED=0 when the hardware has in fact
completed the frame; the check returns false; NAPI exits; the
quirk above prevents the re-enabled IER from re-firing; the ring
goes quiescent.
Re-check the hardware's own ISR state as well. Reading a MAC
register after IER re-enable serves two purposes:
Add an explicit ISR read after the IER write. The MMIO read
serves two independent purposes:
(1) It drains any in-flight PCIe DMA writes of descriptor state,
so a subsequent macb_tx_complete_pending() sees an accurate
view of TX_USED.
(1) It is an architected PCIe read barrier for earlier
peripheral-originated DMA writes on the same path, so a
subsequent macb_tx_complete_pending() observes any TX_USED
write that was in flight at the time of the barrier.
(2) It directly observes whether the hardware currently has a
pending TCOMP signal, catching the case the existing driver
comment describes (completions raised while masked, not
re-fired).
(2) It samples the hardware ISR directly, so a TCOMP bit that
the hardware set while TCOMP was masked is visible here,
independently of whether the descriptor DMA has retired.
If either path indicates pending work, schedule NAPI again.
If either signal indicates pending work, reschedule NAPI via the
same path as the existing check.
Combined with the PCIe posted-write flush in patch 1/3, this closes
the observed silent-TX-stall path on BCM2712/RP1 reported at the
links below.
This patch addresses one of three candidate races for the silent
TX stall described in the cover letter. Whether it is sufficient
by itself, or whether it requires the PCIe posted-write flush in
patch 1/3 to cover the observed behaviour, we have not yet
verified at runtime.
Link: https://github.com/cilium/cilium/issues/43198
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
@@ -1,30 +1,39 @@
From 0000000000000000000000000000000000000003 Mon Sep 17 00:00:00 2001
From: Lukasz Raczylo <lukasz@raczylo.com>
Date: Fri, 24 Apr 2026 00:00:00 +0000
Subject: [PATCH 3/3] net: macb: add TX stall watchdog to recover from lost
TCOMP
Subject: [PATCH 3/3] net: macb: add TX stall watchdog as defence-in-depth
safety net
Patches 1/3 and 2/3 close two races by which a TCOMP interrupt can
be lost on PCIe-attached macb instances. This patch adds a
defence-in-depth safety net: a per-queue delayed_work that calls
macb_tx_restart() if queue->tx_tail has not advanced in one second
despite the ring being non-empty.
Patches 1/3 and 2/3 address two candidate races that could lead
to a TCOMP completion being missed on PCIe-attached macb
instances. This patch adds a defence-in-depth safety net, in
case a further race remains that we have not identified.
The watchdog introduces no new recovery logic. macb_tx_restart()
already exists, is correctly locked, and already checks the
hardware's TBQP against the driver's head index before writing
TSTART: on a healthy ring it is a no-op at the hardware level. All
the watchdog adds is the trigger.
The watchdog is a per-queue delayed_work that runs once per
second. It snapshots queue->tx_tail; if the ring is non-empty
(queue->tx_head != queue->tx_tail) and tx_tail has not advanced
since the previous tick, it calls macb_tx_restart().
If patches 1/3 and 2/3 completely eliminate the stall, this code
never does anything beyond a spin_lock/unlock and a branch per
second per queue. If a further race remains -- hardware or
driver-level -- this turns a multi-minute silent hang into a
one-second bump.
No new recovery logic is introduced. macb_tx_restart() already
exists in this file, is correctly locked (tx_ptr_lock, bp->lock),
and verifies that the hardware's TBQP is behind the driver's
head index before re-asserting TSTART. On a healthy ring it is
a no-op at the hardware level; the watchdog only supplies the
missing trigger.
On our 24-node Raspberry Pi 5 fleet this was empirically needed:
before the patches in this series, multiple nodes per day hit the
stall and required external watchdog intervention to recover.
On a healthy queue the per-tick cost is one spin_lock_irqsave()
/ spin_unlock_irqrestore() and one branch. The delayed_work is
only scheduled between macb_open() and macb_close(), and is
cancelled synchronously on close.
Context for submission: on our 24-node Raspberry Pi 5 fleet,
before this series, an out-of-band user-space watchdog
(monitoring tx_packets from /sys/class/net/.../statistics and
toggling the link down/up when it froze) was required to keep
nodes usable. We include this kernel-side watchdog as a cleaner
in-kernel equivalent for any residual stall that patches 1 and
2 do not cover. We are willing to drop this patch if the view
is that 1 and 2 should stand alone.
Link: https://github.com/cilium/cilium/issues/43198
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877