Three related changes addressing post-v1.0.15 code-review findings and
the user's observation that we have been "throwing maps around" — under
Yaegi every sync.Map / atomic / mutex dispatch costs ~1-5ms of
interpreter overhead, so the number of dispatches per request matters
as much as whether they are lock-free.
1. Remove cleanupTimers map + cleanupTimerMu sync.Mutex.
scheduleDelayedCleanup previously tracked every pending timer in a
map guarded by a mutex so a duplicate scheduling could cancel the
prior timer. That "shouldn't happen" path was the only consumer of
the map, but the mutex fired on every successful refresh
completion — another per-request Yaegi-dispatched lock.
performCleanup is already idempotent (LoadAndDelete on the sync.Map),
so a duplicate firing is at worst a no-op second call. Dropped the
map entirely; time.AfterFunc callback now calls performCleanup
directly.
Net: -1 sync.Mutex, -1 map field, -2 Lock/Unlock pairs per refresh
completion. Shutdown simplified — no need to enumerate-and-stop
timers since the callbacks no longer need teardown.
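The cleanup path described in item 1 can be sketched as follows. This is a minimal illustration, not the plugin's code: the coordinator and operation types, the inFlight field and the cleaned counter are hypothetical stand-ins; only performCleanup, scheduleDelayedCleanup, LoadAndDelete and time.AfterFunc are named in the text above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// operation and coordinator are hypothetical stand-ins for the real types.
type operation struct{ key string }

type coordinator struct {
	inFlight sync.Map // key -> *operation
	cleaned  int32    // counts cleanups that actually removed an entry
}

// performCleanup is idempotent: LoadAndDelete reports loaded=true to at most
// one caller per stored entry, so a duplicate firing does nothing.
func (c *coordinator) performCleanup(key string) bool {
	if _, loaded := c.inFlight.LoadAndDelete(key); !loaded {
		return false
	}
	atomic.AddInt32(&c.cleaned, 1)
	return true
}

// scheduleDelayedCleanup keeps no timer map and takes no mutex; the
// time.AfterFunc callback calls performCleanup directly.
func (c *coordinator) scheduleDelayedCleanup(key string, delay time.Duration) {
	time.AfterFunc(delay, func() { c.performCleanup(key) })
}

func main() {
	c := &coordinator{}
	c.inFlight.Store("session-1", &operation{key: "session-1"})
	c.scheduleDelayedCleanup("session-1", 10*time.Millisecond)
	c.scheduleDelayedCleanup("session-1", 10*time.Millisecond) // duplicate scheduling
	time.Sleep(100 * time.Millisecond)
	fmt.Println("effective cleanups:", atomic.LoadInt32(&c.cleaned))
}
```

The duplicate timer still fires, but its LoadAndDelete finds nothing to remove, which is why no cancellation bookkeeping is needed in this sketch.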
2. Reorder applyLeaderGates: cooldown check BEFORE recordRefreshAttempt.
Previously incremented the attempt counter and then checked cooldown.
Under burst load (many concurrent leaders with different token hashes
but the same session) every goroutine could increment past
MaxRefreshAttempts before any one of them observed the threshold,
so the gate fired too late — same thundering-herd shape that drove
v1.0.14 into the ground. Reordering makes the gate authoritative:
only attempts that pass the gate are recorded.
Semantic change: with MaxRefreshAttempts=N, exactly N attempts now
run to completion before the (N+1)th is denied. Previously the Nth
was denied as it tried to record (off-by-one stricter). Test
assertion updated to N (was N-1).
3. Fix getOrCreateOperation MaxConcurrentRefreshes overshoot.
The previous CAS-loop allowed a transient overshoot of up to N-1
leaders when several goroutines all observed `current < max` in the
same scheduling slice before any one of them completed its CAS,
visible to readers as currentInFlightRefreshes > MaxConcurrentRefreshes
for a brief window.
Replaced with the ticket-and-return pattern: increment optimistically,
decrement if we overshot. Strictly bounded: only the goroutine that
produces max+1 sees max+1 as committed; the rest decrement back
immediately. No CAS retry loop needed.
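The ticket-and-return pattern from item 3, as a minimal sketch. The limiter type, its fields and the admitConcurrently helper are hypothetical names used for illustration; the real getOrCreateOperation does more than this.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// limiter is a hypothetical stand-in for the in-flight refresh counter.
type limiter struct {
	inFlight int32
	max      int32
}

// tryAcquire takes a ticket optimistically and hands it back if the ticket
// number is above max. Callers that return true never exceed max at once.
func (l *limiter) tryAcquire() bool {
	if atomic.AddInt32(&l.inFlight, 1) > l.max {
		atomic.AddInt32(&l.inFlight, -1) // overshot: return the ticket
		return false
	}
	return true
}

// release gives a held ticket back once the refresh is finished.
func (l *limiter) release() { atomic.AddInt32(&l.inFlight, -1) }

// admitConcurrently runs n goroutines through tryAcquire without releasing
// and returns how many of them were admitted.
func admitConcurrently(l *limiter, n int) int32 {
	var admitted int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if l.tryAcquire() {
				atomic.AddInt32(&admitted, 1)
			}
		}()
	}
	wg.Wait()
	return admitted
}

func main() {
	l := &limiter{max: 4}
	fmt.Println("admitted:", admitConcurrently(l, 50))
	fmt.Println("in flight:", atomic.LoadInt32(&l.inFlight))
}
```

Each caller performs one atomic add, plus one more only on the rejected path, so there is no retry loop regardless of how many goroutines arrive together.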
What was NOT done in this commit, and why:
* metadataMu.RLock cached via atomic snapshot — code-reviewer flagged
this at severity 7 (3 RLocks per request: middleware.go:213,
token_manager.go:349, token_manager.go:408). The clean fix is an
atomic.Pointer[MetadataSnapshot], but generic atomic.Pointer[T] is
NOT exposed by yaegi v0.16.1's stdlib (only legacy unsafe.Pointer
primitives). atomic.Value would work but requires a snapshot-struct
refactor across ~15 call sites (helpers/logout/token_introspection/
token_manager/main/middleware). Deferred to a focused future PR.
* isInCooldown multi-field reset race — the cooldown-reset CAS wins
on cooldownEndNano, then separately stores attempts/consecutiveFailures/
windowStartNano. A concurrent isInCooldown can briefly see the
pre-reset attempts value and trigger a fresh cooldown. Semantic glitch
(double-cooldown), not a correctness disaster. Fix is a single atomic
pointer swap of an immutable snapshot — same atomic.Pointer constraint
as above. Deferred.
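For reference, the deferred snapshot approach could look roughly like the sketch below, using atomic.Value in place of the unavailable atomic.Pointer[T]. Everything here is an assumption for illustration: the metadataSnapshot fields, the metadataStore type and its methods are hypothetical, and this is not what the follow-up PR will necessarily contain.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// metadataSnapshot is a hypothetical immutable view; it is never mutated
// after being published, only replaced as a whole.
type metadataSnapshot struct {
	issuer   string
	tokenURL string
}

// metadataStore publishes snapshots through atomic.Value.
type metadataStore struct {
	current atomic.Value // always holds a *metadataSnapshot
}

func newMetadataStore(initial *metadataSnapshot) *metadataStore {
	s := &metadataStore{}
	s.current.Store(initial)
	return s
}

// load returns the current snapshot without taking any lock.
func (s *metadataStore) load() *metadataSnapshot {
	return s.current.Load().(*metadataSnapshot)
}

// replace swaps in a whole new snapshot; readers see either the old
// snapshot or the new one, never a mix of fields from both.
func (s *metadataStore) replace(next *metadataSnapshot) {
	s.current.Store(next)
}

func main() {
	s := newMetadataStore(&metadataSnapshot{issuer: "https://idp.example", tokenURL: "https://idp.example/token"})
	before := s.load()
	s.replace(&metadataSnapshot{issuer: "https://idp.example", tokenURL: "https://idp.example/v2/token"})
	fmt.Println(before.tokenURL, "->", s.load().tokenURL)
}
```

The same single-swap shape would apply to the cooldown reset: publishing attempts, consecutiveFailures and windowStartNano together in one immutable struct removes the window in which a reader sees a partially reset tracker.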
All tests pass with -race; golangci-lint clean.
The v1.0.14 fix replaced one contended sync.RWMutex (RefreshCoordinator.
refreshMutex) with sync.Map. Production showed the same death-spiral
signature recurring ~2 hours later — same shape, different mutex:
65 goroutines stuck on a sync.(*RWMutex).Lock at one address, pod
pinned at 1000m CPU, identical Yaegi runCfg/reflect.Value.Call stack
pattern. The mutex was RefreshCoordinator.attemptsMutex.
Generalising: under Yaegi (interpreted Go for Traefik plugins), any
per-request global mutex acquisition is a latent serialization point.
reflect.Value.Call dispatch on a held lock turns a microsecond
critical section into a multi-millisecond one, and on a GOMAXPROCS=1
pod the queue is unbounded.
This commit removes every per-request global mutex on the hot path:
1. RefreshCoordinator.attemptsMutex (sync.RWMutex)
sessionRefreshAttempts: map -> sync.Map.
refreshAttemptTracker: all fields atomic (int32, int64 UnixNano,
cooldownEndNano == 0 as the not-in-cooldown sentinel, replacing
the inCooldown bool).
isInCooldown / recordRefreshAttempt / recordRefreshSuccess /
recordRefreshFailure all become lock-free. Cooldown entry uses
CompareAndSwapInt64 so only one goroutine logs the transition.
2. RefreshCircuitBreaker.mutex (sync.RWMutex)
lastFailureTime / lastSuccessTime -> atomic.Int64 UnixNano.
state and failures already atomic.
AllowRequest / RecordSuccess / RecordFailure now pure atomic ops.
3. TraefikOidc.firstRequestMutex (sync.Mutex)
firstRequestReceived bool -> firstRequestStarted int32.
metadataRefreshStarted bool -> metadataRefreshStartedAtomic int32.
ServeHTTP bootstrap path uses CompareAndSwapInt32 — fires once,
zero steady-state cost. Previously the mutex was acquired on
every non-health request forever.
4. TraefikOidc.metadataRetryMutex (sync.Mutex)
lastMetadataRetryTime time.Time -> lastMetadataRetryNano int64.
The 30-second retry throttle is now a CAS on lastMetadataRetryNano.
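The CAS patterns used in items 1, 3 and 4 can be sketched together as below. The field names firstRequestStarted, lastMetadataRetryNano and cooldownEndNano come from the list above; the gates struct and its method names are hypothetical, and the real methods carry more logic than shown. Only the function-style sync/atomic calls are used here.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// gates groups the three flags purely for illustration.
type gates struct {
	firstRequestStarted   int32
	lastMetadataRetryNano int64
	cooldownEndNano       int64 // 0 is the not-in-cooldown sentinel
}

// claimFirstRequest returns true for exactly one caller, ever.
func (g *gates) claimFirstRequest() bool {
	return atomic.CompareAndSwapInt32(&g.firstRequestStarted, 0, 1)
}

// allowMetadataRetry admits a caller only if intervalNano has elapsed since
// the last admitted retry and this caller wins the CAS on the timestamp.
func (g *gates) allowMetadataRetry(nowNano, intervalNano int64) bool {
	last := atomic.LoadInt64(&g.lastMetadataRetryNano)
	if last != 0 && nowNano-last < intervalNano {
		return false
	}
	return atomic.CompareAndSwapInt64(&g.lastMetadataRetryNano, last, nowNano)
}

// enterCooldown returns true only for the goroutine whose CAS moves the
// sentinel from 0 to endNano, so only that goroutine logs the transition.
func (g *gates) enterCooldown(endNano int64) bool {
	return atomic.CompareAndSwapInt64(&g.cooldownEndNano, 0, endNano)
}

// inCooldown is a single atomic load and comparison.
func (g *gates) inCooldown(nowNano int64) bool {
	end := atomic.LoadInt64(&g.cooldownEndNano)
	return end != 0 && nowNano < end
}

func main() {
	g := &gates{}
	now := time.Now().UnixNano()
	interval := int64(30 * time.Second)
	fmt.Println(g.claimFirstRequest(), g.claimFirstRequest())
	fmt.Println(g.allowMetadataRetry(now, interval), g.allowMetadataRetry(now+1, interval))
	fmt.Println(g.enterCooldown(now+interval), g.enterCooldown(now+interval))
	fmt.Println(g.inCooldown(now))
}
```

In each case a losing caller returns immediately instead of waiting, which is the property that matters under Yaegi: the dispatch cost is still paid, but nothing queues behind a held lock.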
cleanupStaleEntries iterates via sync.Map.Range; eviction is a
CompareAndDelete by pointer identity so a tracker freshly re-used by
a concurrent caller is not lost.
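A minimal sketch of eviction by pointer identity, assuming a hypothetical attemptTracker type with a lastSeenNano field and a hypothetical evictStale helper; the real staleness criteria in cleanupStaleEntries may differ. sync.Map.CompareAndDelete requires Go 1.20 or later.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// attemptTracker is a hypothetical tracker; only lastSeenNano matters here.
type attemptTracker struct {
	lastSeenNano int64
}

// newTrackerMap returns an empty map of key -> *attemptTracker.
func newTrackerMap() *sync.Map { return &sync.Map{} }

// evictStale removes trackers last seen before cutoffNano and returns how
// many entries it removed.
func evictStale(m *sync.Map, cutoffNano int64) int {
	removed := 0
	m.Range(func(key, value any) bool {
		t := value.(*attemptTracker)
		if atomic.LoadInt64(&t.lastSeenNano) < cutoffNano {
			// Deletes only if the key still maps to this exact pointer; an
			// entry replaced by a concurrent caller is left in place.
			if m.CompareAndDelete(key, t) {
				removed++
			}
		}
		return true
	})
	return removed
}

func main() {
	m := newTrackerMap()
	m.Store("stale", &attemptTracker{lastSeenNano: 1})
	m.Store("fresh", &attemptTracker{lastSeenNano: 100})
	fmt.Println("removed:", evictStale(m, 50))

	old := &attemptTracker{lastSeenNano: 1}
	m.Store("session", old)
	m.Store("session", &attemptTracker{lastSeenNano: 100}) // replaced meanwhile
	fmt.Println("deleted replaced entry:", m.CompareAndDelete("session", old))
}
```

Comparing against the pointer observed during Range, rather than deleting by key alone, is what keeps a newer tracker stored under the same key from being removed.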
Empirical evidence (analysis of the v1.0.14 spike by 3 specialist agents,
profiles in /tmp/traefik-spike-1779511683/):
* mutex profile: 97% delay in sync.(*Mutex).Unlock via
HTTPHandlerSwitcher -> accesslog -> metrics -> backoff.RetryNotify
* 65 stuck goroutines at one RWMutex address (0x40022eb648),
identical Yaegi CFG pointer, all on rc.attemptsMutex via
recordRefreshAttempt + isInCooldown
* traffic driver: long-lived in-cluster Go-http-client doing
~5.4 req/s POST embeddings via OIDC cookie session → same
sessionID → contention all funnels to one tracker entry
Yaegi support for sync/atomic confirmed at
github.com/traefik/yaegi@v0.16.1/stdlib/go1_22_sync_atomic.go:
AddInt32/Int64, LoadInt32/Int64, StoreInt32/Int64,
CompareAndSwapInt32/Int64 all exposed via reflect.ValueOf. Yaegi
dispatches each call through reflect.Value.Call to the COMPILED
atomic.* function, which executes a single hardware CAS/LOCK-XADD
instruction. Each atomic op still pays Yaegi dispatch cost but
cannot block — no queueing, no death spiral.
Trade-off acknowledged: v1.0.15 issues ~6-8 atomic/sync.Map ops per
leader-path request vs the 4 mutex ops of v1.0.14. Under low
contention this is a modest CPU bump. Under high contention it's
an unbounded → bounded transformation. Net win.
All tests pass with -race; golangci-lint clean.
* Add sharded cache and prevention of CPU spikes / locks
* Add dynamic client registration with OIDC provider
* Fix race condition introduced during the sharded cache implementation.
* Add page for traefikoidc.
* Add ability to disable replay protection. This is useful when running multiple Traefik replicas, to avoid false positives and token re-creation.
* Enhance the CI/CD pipelines
* Increase test coverage.
* Update vendored dependencies.
* Update behaviour on forceHTTPS as per issue #82