mirror of
https://github.com/lukaszraczylo/traefikoidc.git
synced 2026-06-05 22:44:17 +00:00
8ffd80193aee9f1d7e4eb1d4f3e7c6faa4461eb5
11 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
f821b8829b |
fix: remove write-lock convoy in getLocal + fix mutateState CAS bug
UniversalCache.getLocal(): when a cached token expires, the RLock fast path (line 385-398) previously fell through to c.mu.Lock() (write lock). Under Yaegi, the write-lock holder takes 10-100ms for LRU manipulation, and Go's RWMutex writer-priority blocks ALL new RLock callers. A single expired-token event turned every concurrent request from read-parallel into write-serialized — the convoy that produced the 737-goroutine pileup at 0x400275a608 (pprof captured at /tmp/traefik-spike-1779663149). Fix: return (nil, false) immediately on expiry for Token/JWK/Session cache types. The periodic cleanup goroutine handles eviction. Write lock is never taken on the read path for these cache types. refreshAttemptTracker.mutateState(): the CAS loop used t.state.CompareAndSwap(t.state.Load(), next) — a second Load that can see a different value from a concurrent writer, silently overwriting their update. Fixed to CompareAndSwap(cur, next) using the snapshot we computed the mutation from. |
||
|
|
17e3f8ef62 |
fix: snapshot patterns for refresh-tracker and metadata URLs
Two related lock-free snapshot refactors addressing the remaining
post-v1.0.16 code-review findings.
1. refreshAttemptTracker: per-field atomic.Load/Store -> atomic.Value
snapshot of *attemptState (refresh_coordinator.go).
Previously each tracker held five independently-atomic fields. The
cooldown-exit reset wrote cooldownEndNano = 0 first, then separately
stored attempts = 1 and windowStartNano = now. A concurrent
isInCooldown call could observe cooldownEndNano = 0 (reset just
completed) with attempts still at MaxRefreshAttempts, immediately
triggering a fresh cooldown — a benign double-trigger race that
nonetheless meant the state machine had observable intermediate
states.
New design: state is a *attemptState (immutable) published via
atomic.Value. All transitions (record/success/failure/window-reset/
cooldown-enter/cooldown-exit) go through mutateState, which runs a
CAS loop: load current snapshot -> construct fresh snapshot ->
CompareAndSwap. Either the entire new state publishes or none of
it does — no intermediate visibility, no cross-field race.
Under Yaegi this collapses 3-5 per-field atomic dispatches into one
atomic.Value.Load on the read path. Write paths pay an extra
allocation for the new snapshot but avoid the cross-field hazard.
2. MetadataSnapshot: hot-path readers use atomic.Value instead of
metadataMu.RLock (middleware.go, types.go, main.go, utilities.go).
middleware.ServeHTTP previously took metadataMu.RLock on every
non-bypass request to read the single field issuerURL. Under Yaegi
each RLock acquisition costs 1-5ms of interpreter dispatch.
updateMetadataEndpoints now also publishes an immutable
*MetadataSnapshot via atomic.Value; the hot-path reader loads it
in one op via t.metadataSnap(). Falls back to the legacy
metadataMu.RLock pattern when the snapshot is unpublished (some
test setups initialize the struct fields directly without going
through updateMetadataEndpoints).
Less-frequent callers (helpers, logout, token_introspection) still
take metadataMu.RLock and are unchanged. The snapshot strictly
subsets the metadataMu-protected fields, so those readers see
identical data.
Note on atomic.Pointer[T]: this would have been the cleaner type but
yaegi v0.16.1's stdlib (used by traefik:v3.7.1) exposes only the
legacy unsafe.Pointer-based atomic primitives — no generic Pointer[T].
atomic.Value provides the same semantics via interface{} + type assert.
All tests pass with -race; golangci-lint clean.
|
||
|
|
827926bc3a |
fix(refresh-coordinator): trim per-request mutex/map ops
Three related changes addressing post-v1.0.15 code-review findings and the user's observation that we have been "throwing maps around" — under Yaegi every sync.Map / atomic / mutex dispatch costs ~1-5ms of interpreter overhead, so the number of dispatches per request matters as much as whether they are lock-free. 1. Remove cleanupTimers map + cleanupTimerMu sync.Mutex. scheduleDelayedCleanup previously tracked every pending timer in a map guarded by a mutex so a duplicate scheduling could cancel the prior timer. That "shouldn't happen" path was the only consumer of the map, but the mutex fired on every successful refresh completion — another per-request Yaegi-dispatched lock. performCleanup is already idempotent (LoadAndDelete on the sync.Map), so a duplicate firing is at worst a no-op second call. Dropped the map entirely; time.AfterFunc callback now calls performCleanup directly. Net: -1 sync.Mutex, -1 map field, -2 Lock/Unlock pairs per refresh completion. Shutdown simplified — no need to enumerate-and-stop timers since the callbacks no longer need teardown. 2. Reorder applyLeaderGates: cooldown check BEFORE recordRefreshAttempt. Previously incremented the attempt counter and then checked cooldown. Under burst load (many concurrent leaders with different token hashes but the same session) every goroutine could increment past MaxRefreshAttempts before any one of them observed the threshold, so the gate fired too late — same thundering-herd shape that drove v1.0.14 into the ground. Reordering makes the gate authoritative: only attempts that pass the gate are recorded. Semantic change: with MaxRefreshAttempts=N, exactly N attempts now run to completion before the (N+1)th is denied. Previously the Nth was denied as it tried to record (off-by-one stricter). Test assertion updated to N (was N-1). 3. Fix getOrCreateOperation MaxConcurrentRefreshes overshoot. The previous CAS-loop allowed a transient overshoot of up to N-1 leaders when several goroutines all observed `current < max` in the same scheduling slice before any one of them succeeded their CAS — visible to readers as currentInFlightRefreshes > MaxConcurrentRefreshes for a brief window. Replaced with the ticket-and-return pattern: increment optimistically, decrement if we overshot. Strictly bounded: only the goroutine that produces max+1 sees max+1 as committed; the rest decrement back immediately. No CAS retry loop needed. What was NOT done in this commit, and why: * metadataMu.RLock cached via atomic snapshot — code-reviewer flagged this at severity 7 (3 RLocks per request: middleware.go:213, token_manager.go:349, token_manager.go:408). The clean fix is an atomic.Pointer[*MetadataSnapshot], but generic atomic.Pointer[T] is NOT exposed by yaegi v0.16.1's stdlib (only legacy unsafe.Pointer primitives). atomic.Value would work but requires a snapshot-struct refactor across ~15 call sites (helpers/logout/token_introspection/ token_manager/main/middleware). Deferred to a focused future PR. * isInCooldown multi-field reset race — the cooldown-reset CAS wins on cooldownEndNano, then separately stores attempts/consecutiveFailures/ windowStartNano. A concurrent isInCooldown can briefly see the pre-reset attempts value and trigger a fresh cooldown. Semantic glitch (double-cooldown), not a correctness disaster. Fix is a single atomic pointer swap of an immutable snapshot — same atomic.Pointer constraint as above. Deferred. All tests pass with -race; golangci-lint clean. |
||
|
|
72e2b682bb |
fix: eliminate per-request global mutexes in Yaegi hot paths
The v1.0.14 fix replaced one contended sync.RWMutex (RefreshCoordinator.
refreshMutex) with sync.Map. Production showed the same death-spiral
signature recurring ~2 hours later — same shape, different mutex:
65 goroutines stuck on a sync.(*RWMutex).Lock at one address, pod
pinned at 1000m CPU, identical Yaegi runCfg/reflect.Value.Call stack
pattern. The mutex was RefreshCoordinator.attemptsMutex.
Generalising: under Yaegi (interpreted Go for traefik plugins), any
per-request global mutex acquisition is a latent serialization point.
reflect.Value.Call dispatch on a held lock turns a microsecond
critical section into a multi-millisecond one, and on a GOMAXPROCS=1
pod the queue is unbounded.
This commit removes every per-request global mutex on the hot path:
1. RefreshCoordinator.attemptsMutex (sync.RWMutex)
sessionRefreshAttempts: map -> sync.Map.
refreshAttemptTracker: all fields atomic (int32, int64 UnixNano,
cooldownEndNano == 0 as the not-in-cooldown sentinel, replacing
the inCooldown bool).
isInCooldown / recordRefreshAttempt / recordRefreshSuccess /
recordRefreshFailure all become lock-free. Cooldown entry uses
CompareAndSwapInt64 so only one goroutine logs the transition.
2. RefreshCircuitBreaker.mutex (sync.RWMutex)
lastFailureTime / lastSuccessTime -> atomic.Int64 UnixNano.
state and failures already atomic.
AllowRequest / RecordSuccess / RecordFailure now pure atomic ops.
3. TraefikOidc.firstRequestMutex (sync.Mutex)
firstRequestReceived bool -> firstRequestStarted int32.
metadataRefreshStarted bool -> metadataRefreshStartedAtomic int32.
ServeHTTP bootstrap path uses CompareAndSwapInt32 — fires once,
zero steady-state cost. Previously the mutex was acquired on
every non-health request forever.
4. TraefikOidc.metadataRetryMutex (sync.Mutex)
lastMetadataRetryTime time.Time -> lastMetadataRetryNano int64.
The 30-second retry throttle is now a CAS on lastMetadataRetryNano.
cleanupStaleEntries iterates via sync.Map.Range; eviction is a
CompareAndDelete by pointer identity so a tracker freshly re-used by
a concurrent caller is not lost.
Empirical evidence (3 specialist-agent analysis of the v1.0.14 spike,
profiles in /tmp/traefik-spike-1779511683/):
* mutex profile: 97% delay in sync.(*Mutex).Unlock via
HTTPHandlerSwitcher -> accesslog -> metrics -> backoff.RetryNotify
* 65 stuck goroutines at one RWMutex address (0x40022eb648),
identical Yaegi CFG pointer, all on rc.attemptsMutex via
recordRefreshAttempt + isInCooldown
* traffic driver: long-lived in-cluster Go-http-client doing
~5.4 req/s POST embeddings via OIDC cookie session → same
sessionID → contention all funnels to one tracker entry
Yaegi support for sync/atomic confirmed at
github.com/traefik/yaegi@v0.16.1/stdlib/go1_22_sync_atomic.go:
AddInt32/Int64, LoadInt32/Int64, StoreInt32/Int64,
CompareAndSwapInt32/Int64 all exposed via reflect.ValueOf. Yaegi
dispatches each call through reflect.Value.Call to the COMPILED
atomic.* function, which executes a single hardware CAS/LOCK-XADD
instruction. Each atomic op still pays Yaegi dispatch cost but
cannot block — no queueing, no death spiral.
Trade-off acknowledged: v1.0.15 issues ~6-8 atomic/sync.Map ops per
leader-path request vs the 4 mutex ops of v1.0.14. Under low
contention this is a modest CPU bump. Under high contention it's
an unbounded → bounded transformation. Net win.
All tests pass with -race; golangci-lint clean.
|
||
|
|
ae4ccaa89d |
fix(refresh-coordinator): replace global RWMutex with sync.Map
Under Yaegi, the RefreshCoordinator.refreshMutex was held for tens of
milliseconds per request because every operation inside the critical
section (map access, isInCooldown, recordRefreshAttempt,
isUnderMemoryPressure, atomic ops, struct allocation) is dispatched
through reflect.Value.Call with full arg boxing/unboxing.
Concurrent refreshes on the same coordinator serialized into a queue
that grew without bound. Live capture in production (3 Grafana
dashboards left open) showed:
* 63 goroutines stuck on rc.refreshMutex.Lock() for 1-11 minutes
* pod pinned at 1000m CPU (GOMAXPROCS=1)
* 5.15M allocs/sec, 0.45 RPS effective throughput
* yaegi.call.func9 accounting for 92.66% of cumulative allocs
* mutex profile dominated by sync.(*Mutex).Unlock via the request chain
Change inFlightRefreshes from map[string]*refreshOperation+RWMutex to
sync.Map and rewrite getOrCreateOperation to:
1. Speculatively allocate the candidate operation.
2. Atomically LoadOrStore by tokenHash. Joiners take the existing
operation; leader takes the new one. No global lock acquired.
3. Leader runs rate-limit / cooldown / memory-pressure gates AFTER
the atomic store. Joiners share the leader's outcome via op.done.
4. Reserve the concurrent-refresh slot via CompareAndSwap so the
count cannot overshoot in absence of the old serializing lock.
5. On any gate failure the leader calls failCandidate, which deletes
the entry from sync.Map, records the error on op.result and closes
op.done so any joiner that snuck in returns the same error.
performCleanup becomes a single sync.Map.LoadAndDelete, eliminating
the lock entirely on the cleanup path.
Net effect: critical section is no longer Yaegi-interpreted; it
collapses to atomic instructions on a sharded sync.Map. Refresh
contention disappears even under Yaegi.
All tests pass with -race; golangci-lint clean.
|
||
|
|
1b6c8616fd |
fix(refresh): coalesce refresh-token grants + bound goroutines + cache hot path (target v0.8.27) (#131)
* fix(refresh): wire RefreshCoordinator into the live refresh path
The RefreshCoordinator existed but was never instantiated. The actual
refresh path used only session.refreshMutex, which is per-SessionData
instance - and SessionData is pulled from a sync.Pool per request -
so concurrent requests sharing a refresh token had ZERO coordination.
Symptom: when access_token expired (e.g. 5min Zitadel default), every
in-flight request from a polling client (Grafana panels) entered the
refresh path simultaneously and POSTed the same refresh_token to the
IdP. With refresh-token rotation enabled (Zitadel/Authentik default),
only one grant succeeded; the rest got invalid_grant and each cleared
the entire session. Subsequent requests then thrashed in re-auth loops.
This commit:
- adds refreshCoordinator field on TraefikOidc
- instantiates it in NewWithContext with DefaultRefreshCoordinatorConfig
- shuts it down in Close() under shutdownOnce
- routes refreshToken() through the coordinator via coordinatedTokenRefresh,
which collapses concurrent grants to a single upstream call per
refresh_token hash
- exports refreshCoordinatorSessionID for both internal hashing and the
middleware-level wireup so dedup keys stay aligned
Behavioural notes:
- nil-coordinator fallback preserves existing tests that build TraefikOidc
literals without going through the constructor
- followers receive the same TokenResponse/error as the leader, so no
per-instance code paths change
- existing TestGetNewTokenWithRefreshToken_Concurrency still passes
because it hits GetNewTokenWithRefreshToken directly, below the
coordinator boundary
Tests:
- refresh_coordinator_wireup_test.go: 50 concurrent refreshes coalesce
to <=2 upstream calls; distinct tokens still run in parallel; nil
coordinator falls back cleanly
* perf(cache): bound L1 backfill goroutines in HybridBackend
Get() and GetMany() previously spawned a goroutine per L2 hit to write
the value through to L1. Under sustained polling traffic (e.g. a Grafana
dashboard refreshing every 30s with N panels) this minted thousands of
goroutines, each running in Yaegi - directly contributing to the
~1000% CPU spike that pairs with the refresh-token herd.
Replace the per-hit goroutines with a single l1BackfillWorker fed by
l1BackfillBuffer, mirroring the existing asyncWriteBuffer/asyncWriteWorker
pattern for L2 writes. Buffer overflow drops the backfill (counted via
l1BackfillDrops) - a dropped backfill just means the next L2 hit for
that key re-queues it, which is safe.
Tests:
- TestHybridBackend_L1BackfillBounded: 1000 distinct L2 hits keep
goroutine count within +20 of baseline (pre-fix it grew by ~1000)
- TestHybridBackend_L1BackfillFullDrops: drops are accounted for when
the buffer is saturated and the worker is stopped
* feat(refresh): implement isRefreshTokenExpired heuristic
Replace the placeholder `return false` with a real check based on the
issued_at timestamp that SetRefreshToken already stamps into the session.
Gated by a new MaxRefreshTokenAgeSeconds config field (default 21600 =
6h, matching the existing comment). 0 disables the check.
This wires the previously-dead refreshTokenExpired branch in middleware.go,
which short-circuits AJAX requests with a 401 instead of letting them
hammer the IdP for a refresh token that's almost certainly stale - the
classic Grafana-after-long-pause failure mode.
Behaviour:
- maxRefreshTokenAge=0 disables the check (preserves prior behaviour)
- legacy sessions without issued_at still attempt one refresh; the IdP
remains the source of truth on first try
- nil-receiver and nil-session guards keep test code that builds
TraefikOidc literals safe
Tests:
- TestIsRefreshTokenExpired_DisabledWhenAgeZero
- TestIsRefreshTokenExpired_LegacySessionWithoutTimestamp
- TestIsRefreshTokenExpired_WithinWindow
- TestIsRefreshTokenExpired_BeyondWindow
- TestIsRefreshTokenExpired_NilGuards
* perf(token): skip parseJWT on cache hit in VerifyToken
The token cache fast-return existed but ran AFTER parseJWT, so every
validation paid for base64 + JSON unmarshal even on a hit. Under bursty
traffic (e.g. 10+ concurrent panel requests on every Grafana dashboard
refresh, each calling validateStandardTokens which verifies BOTH the
access token and the ID token), this is two redundant parses per
request multiplied by the panel count.
Move the cache lookup ahead of parseJWT. On a hit the function returns
nil immediately. On a miss the original flow runs unchanged.
Also nil-guard t.tokenCache to keep partial-literal test instances safe
(matches the same pattern we already use for tokenBlacklist).
Tests:
- TestVerifyToken_CacheHitSkipsParse: cache pre-populated with claims
for a token whose body would fail parseJWT - returns nil iff the
fast-path bypasses the parse
- TestVerifyToken_CacheMissStillParses: a syntactically valid but
unsigned token still errors past parseJWT on cache miss
* feat(refresh): cross-replica refresh-grant dedup via shared cache
The in-process RefreshCoordinator added in
|
||
|
|
2d1b04c637 |
review fixes apr 2026 (#130)
* Multiple fixes - refresh coordinator dedup + memory pressure wire - middleware sse consolidation + timer leak + claim cache - universal cache sync backfill + isDebug gate - lazy background task race - memory monitor stw cached + refresh() api * fix(auth): suppress OIDC redirects on non-navigation requests - [x] Add isNonNavigationRequest using Sec-Fetch-Mode and Accept headers - [x] Add comprehensive TestIsNonNavigationRequest - [x] Update ServeHTTP to 401 non-navigation and AJAX requests Fixes #129 * feat(config): add custom CA and insecure skip verify for OIDC TLS - [x] Add CACertPath, CACertPEM, InsecureSkipVerify to Config - [x] Implement loadCACertPool for CA bundle loading - [x] Update HTTPClientConfig with RootCAs and InsecureSkipVerify - [x] Apply CA pool and skip verify to pooled HTTP clients - [x] Enhance configKey to distinguish TLS configs - [x] Add comprehensive ca_cert_test.go Fixes #125 * feat(oidc): add custom CA certificate support for private OIDC providers - [x] Add caCertPath, caCertPEM, insecureSkipVerify config options - [x] Update traefik.yml with new OIDC client config fields - [x] Add configuration schema descriptions for new options - [x] Update README table and add Custom CA Certificates section * Fix the documentation. * test(redis): add oversized argument rejection test - [x] Add TestRedisConn_RejectOversizedArgumentBytes - [x] Import strings package * Dependencies cleanup |
||
|
|
9d52f1b018 |
feat(core): refactor linters config and improve code quality (#119)
- [x] Reorganize golangci-lint configuration with documented disable reasons - [x] Simplify errcheck and revive linter rules with targeted exclusions - [x] Pre-compile regex patterns in input_validation.go for performance - [x] Fix type assertions in memory_shard.go and resp.go with safety checks - [x] Replace string comparison with EqualFold for case-insensitive matching - [x] Fix loop variable captures in jwk.go and logout.go - [x] Change high goroutine log level from Info to Debug in autocleanup.go - [x] Replace deprecated "cancelled" spelling with "canceled" throughout - [x] Add nolint annotations for intentional unused parameters - [x] Improve comment formatting for deprecated functions - [x] Fix comment spelling: "marshalling" → "marshaling" - [x] Refactor provider warnings formatting in internal/providers/warnings.go - [x] Simplify metrics summary building in internal/recovery/metrics.go - [x] Pre-allocate slice in error_recovery.go GetDegradedServices - [x] Refactor context cancellation checks in redis.go |
||
|
|
6efb78b7a8 |
Smarter approach to the cookies (#103)
* Smarter approach to the cookies - Single maxCookieSize = 1400 constant with clear documentation - Combined cookie storage for ~40-45% size reduction - Backward compatible migration from legacy cookies * Tuneup the code. |
||
|
|
5fcbd54955 |
Add sharded cache and prevention of CPU spikes / locks (#96)
* Add sharded cache and prevention of CPU spikes / locks * Add dynamic client registration with oidc provider * Fix race condition introduced during the sharded cache implementation. * Add page for traefikoidc. |
||
|
|
1e4142a7fb |
release 0.7.2 (#66)
* Remove trailing / from metadata provider. * Resolves issue #67 - Before: 100 concurrent requests → 300+ refresh attempts → OOM - After: 100 concurrent requests → 1 refresh attempt → Stable memory Added following changes: - Introduced a refresh coordinator to manage concurrent refresh requests - Implemented a test to simulate high concurrency and verify memory stability * Issue #67 fixed. |