mirror of
https://github.com/lukaszraczylo/talos-builder.git
synced 2026-06-11 00:09:29 +00:00
patches/linux: sync prose edits from talos-kernel-patches (diff bodies unchanged)
This commit is contained in:
@@ -3,32 +3,38 @@ From: Lukasz Raczylo <lukasz@raczylo.com>
|
||||
Date: Fri, 24 Apr 2026 00:00:00 +0000
|
||||
Subject: [PATCH 1/3] net: macb: flush PCIe posted write after TSTART doorbell
|
||||
|
||||
macb_start_xmit() and macb_tx_restart() both kick transmission by
|
||||
OR-ing MACB_BIT(TSTART) into NCR. On PCIe-attached macb instances --
|
||||
notably BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 -- the
|
||||
doorbell write is a posted PCIe write that can sit in the fabric's
|
||||
write queue until something drains it. A source-level comment at
|
||||
the TSTART site already acknowledges the problem:
|
||||
macb_start_xmit() and macb_tx_restart() kick transmission by
|
||||
OR-ing MACB_BIT(TSTART) into NCR. On PCIe-attached macb instances
|
||||
(BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is the setup we
|
||||
have in front of us), writes to NCR are posted PCIe writes: they
|
||||
are not guaranteed to reach the device before the issuing CPU
|
||||
returns. An existing source-level comment at the TSTART site
|
||||
acknowledges that such writes can be lost under some conditions:
|
||||
|
||||
/* TSTART write might get dropped, so make the IRQ retrigger
|
||||
* a buffer read */
|
||||
|
||||
and arms a recovery handshake via queue->tx_pending /
|
||||
queue->txubr_pending that is picked up on the next TCOMP interrupt.
|
||||
That recovery path only runs if a TCOMP interrupt actually fires;
|
||||
if the lost doorbell means no TX starts, there is no TCOMP, and the
|
||||
ring stalls silently.
|
||||
queue->txubr_pending that runs on the next TCOMP interrupt. That
|
||||
recovery path depends on a subsequent TCOMP actually firing. If
|
||||
the TSTART write never reaches the MAC, no TX begins, no TCOMP
|
||||
completion arrives, and the ring remains quiescent without any
|
||||
kernel-visible indication.
|
||||
|
||||
Add a read-back of NCR after the TSTART write. The read serialises
|
||||
the PCIe posted-write queue and ensures the doorbell reaches the MAC
|
||||
before macb_start_xmit() / macb_tx_restart() return. The existing
|
||||
'TSTART might get dropped' handshake is preserved as a safety net
|
||||
for cases where the fabric genuinely drops the write despite the
|
||||
read barrier, but with this barrier it should rarely if ever be
|
||||
needed on PCIe-attached parts.
|
||||
Add a read-back of NCR after each TSTART write in macb_start_xmit()
|
||||
and macb_tx_restart(). The read is an architected PCIe read
|
||||
barrier for earlier posted writes on the same path; it ensures the
|
||||
doorbell has reached the MAC before the functions return.
|
||||
|
||||
Observed to be the most common trigger for the silent TX stall
|
||||
documented in the linked reports.
|
||||
The existing tx_pending / txubr_pending handshake is left in place
|
||||
unchanged -- it remains the correct recovery for any other reason
|
||||
the MAC could silently fail to start TX.
|
||||
|
||||
We do not have direct hardware evidence that TSTART is being lost
|
||||
on the RP1 path. This patch is one of a three-patch series
|
||||
("candidate fixes for silent TX stall on BCM2712/RP1"); see the
|
||||
cover letter for context. We have verified it compiles and
|
||||
applies cleanly; runtime verification is pending.
|
||||
|
||||
Link: https://github.com/cilium/cilium/issues/43198
|
||||
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
|
||||
|
||||
@@ -15,38 +15,41 @@ existing comment in the function notes:
|
||||
* interrupts are re-enabled.
|
||||
*/
|
||||
|
||||
and mitigates this by calling macb_tx_complete_pending() to look
|
||||
for a completed descriptor whose TX_USED bit the hardware has
|
||||
DMA'd but whose completion we processed without ever seeing an
|
||||
interrupt for.
|
||||
and mitigates this by calling macb_tx_complete_pending(), which
|
||||
inspects driver-visible ring state (descriptor->ctrl, after rmb())
|
||||
and reschedules NAPI if a completion is observable in memory.
|
||||
|
||||
macb_tx_complete_pending() only inspects driver-visible ring state
|
||||
(descriptor->ctrl, after rmb()). On PCIe-attached parts (BCM2712 +
|
||||
RP1 on Raspberry Pi 5 in particular) the descriptor DMA write that
|
||||
sets TX_USED can still be in flight in the PCIe fabric when we
|
||||
check. The read-memory-barrier synchronises the CPU view of earlier
|
||||
CPU writes, but does not force the peripheral's in-flight DMA to
|
||||
retire. In that window the check returns false, napi exits, the
|
||||
IER re-enable does not re-fire (the quirk above), and the queue
|
||||
stalls silently.
|
||||
On PCIe-attached parts (BCM2712 + RP1 on Raspberry Pi 5 is the
|
||||
setup we have in front of us), the descriptor DMA write that sets
|
||||
TX_USED may not have retired to system memory at the point
|
||||
macb_tx_complete_pending() runs. The rmb() synchronises the CPU
|
||||
view of earlier CPU writes; it is not sufficient to retire an
|
||||
in-flight peripheral DMA write. Under that ordering the in-memory
|
||||
descriptor can still read TX_USED=0 when the hardware has in fact
|
||||
completed the frame; the check returns false; NAPI exits; the
|
||||
quirk above prevents the re-enabled IER from re-firing; the ring
|
||||
goes quiescent.
|
||||
|
||||
Re-check the hardware's own ISR state as well. Reading a MAC
|
||||
register after IER re-enable serves two purposes:
|
||||
Add an explicit ISR read after the IER write. The MMIO read
|
||||
serves two independent purposes:
|
||||
|
||||
(1) It drains any in-flight PCIe DMA writes of descriptor state,
|
||||
so a subsequent macb_tx_complete_pending() sees an accurate
|
||||
view of TX_USED.
|
||||
(1) It is an architected PCIe read barrier for earlier
|
||||
peripheral-originated DMA writes on the same path, so a
|
||||
subsequent macb_tx_complete_pending() observes any TX_USED
|
||||
write that was in flight at the time of the barrier.
|
||||
|
||||
(2) It directly observes whether the hardware currently has a
|
||||
pending TCOMP signal, catching the case the existing driver
|
||||
comment describes (completions raised while masked, not
|
||||
re-fired).
|
||||
(2) It samples the hardware ISR directly, so a TCOMP bit that
|
||||
the hardware set while TCOMP was masked is visible here,
|
||||
independently of whether the descriptor DMA has retired.
|
||||
|
||||
If either path indicates pending work, schedule NAPI again.
|
||||
If either signal indicates pending work, reschedule NAPI via the
|
||||
same path as the existing check.
|
||||
|
||||
Combined with the PCIe posted-write flush in patch 1/3, this closes
|
||||
the observed silent-TX-stall path on BCM2712/RP1 reported at the
|
||||
links below.
|
||||
This patch addresses one of three candidate races for the silent
|
||||
TX stall described in the cover letter. Whether it is sufficient
|
||||
by itself, or whether it requires the PCIe posted-write flush in
|
||||
patch 1/3 to cover the observed behaviour, we have not yet
|
||||
verified at runtime.
|
||||
|
||||
Link: https://github.com/cilium/cilium/issues/43198
|
||||
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
|
||||
|
||||
@@ -1,30 +1,39 @@
|
||||
From 0000000000000000000000000000000000000003 Mon Sep 17 00:00:00 2001
|
||||
From: Lukasz Raczylo <lukasz@raczylo.com>
|
||||
Date: Fri, 24 Apr 2026 00:00:00 +0000
|
||||
Subject: [PATCH 3/3] net: macb: add TX stall watchdog to recover from lost
|
||||
TCOMP
|
||||
Subject: [PATCH 3/3] net: macb: add TX stall watchdog as defence-in-depth
|
||||
safety net
|
||||
|
||||
Patches 1/3 and 2/3 close two races by which a TCOMP interrupt can
|
||||
be lost on PCIe-attached macb instances. This patch adds a
|
||||
defence-in-depth safety net: a per-queue delayed_work that calls
|
||||
macb_tx_restart() if queue->tx_tail has not advanced in one second
|
||||
despite the ring being non-empty.
|
||||
Patches 1/3 and 2/3 address two candidate races that could lead
|
||||
to a TCOMP completion being missed on PCIe-attached macb
|
||||
instances. This patch adds a defence-in-depth safety net, in
|
||||
case a further race remains that we have not identified.
|
||||
|
||||
The watchdog introduces no new recovery logic. macb_tx_restart()
|
||||
already exists, is correctly locked, and already checks the
|
||||
hardware's TBQP against the driver's head index before writing
|
||||
TSTART: on a healthy ring it is a no-op at the hardware level. All
|
||||
the watchdog adds is the trigger.
|
||||
The watchdog is a per-queue delayed_work that runs once per
|
||||
second. It snapshots queue->tx_tail; if the ring is non-empty
|
||||
(queue->tx_head != queue->tx_tail) and tx_tail has not advanced
|
||||
since the previous tick, it calls macb_tx_restart().
|
||||
|
||||
If patches 1/3 and 2/3 completely eliminate the stall, this code
|
||||
never does anything beyond a spin_lock/unlock and a branch per
|
||||
second per queue. If a further race remains -- hardware or
|
||||
driver-level -- this turns a multi-minute silent hang into a
|
||||
one-second bump.
|
||||
No new recovery logic is introduced. macb_tx_restart() already
|
||||
exists in this file, is correctly locked (tx_ptr_lock, bp->lock),
|
||||
and verifies that the hardware's TBQP is behind the driver's
|
||||
head index before re-asserting TSTART. On a healthy ring it is
|
||||
a no-op at the hardware level; the watchdog only supplies the
|
||||
missing trigger.
|
||||
|
||||
On our 24-node Raspberry Pi 5 fleet this was empirically needed:
|
||||
before the patches in this series, multiple nodes per day hit the
|
||||
stall and required external watchdog intervention to recover.
|
||||
On a healthy queue the per-tick cost is one spin_lock_irqsave()
|
||||
/ spin_unlock_irqrestore() and one branch. The delayed_work is
|
||||
only scheduled between macb_open() and macb_close(), and is
|
||||
cancelled synchronously on close.
|
||||
|
||||
Context for submission: on our 24-node Raspberry Pi 5 fleet,
|
||||
before this series, an out-of-band user-space watchdog
|
||||
(monitoring tx_packets from /sys/class/net/.../statistics and
|
||||
toggling the link down/up when it froze) was required to keep
|
||||
nodes usable. We include this kernel-side watchdog as a cleaner
|
||||
in-kernel equivalent for any residual stall that patches 1 and
|
||||
2 do not cover. We are willing to drop this patch if the view
|
||||
is that 1 and 2 should stand alone.
|
||||
|
||||
Link: https://github.com/cilium/cilium/issues/43198
|
||||
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
|
||||
|
||||
Reference in New Issue
Block a user