traefikoidc

mirror of https://github.com/lukaszraczylo/traefikoidc.git synced 2026-06-05 22:44:17 +00:00

Author	SHA1	Message	Date
lukaszraczylo	f821b8829b	fix: remove write-lock convoy in getLocal + fix mutateState CAS bug
UniversalCache.getLocal(): when a cached token expires, the RLock fast path (lines 385-398) previously fell through to c.mu.Lock() (write lock). Under Yaegi, the write-lock holder takes 10-100ms for LRU manipulation, and Go's RWMutex writer-priority blocks ALL new RLock callers. A single expired-token event turned every concurrent request from read-parallel into write-serialized: the convoy that produced the 737-goroutine pileup at 0x400275a608 (pprof captured at /tmp/traefik-spike-1779663149).
Fix: return (nil, false) immediately on expiry for Token/JWK/Session cache types. The periodic cleanup goroutine handles eviction. Write lock is never taken on the read path for these cache types.
refreshAttemptTracker.mutateState(): the CAS loop used t.state.CompareAndSwap(t.state.Load(), next), a second Load that can see a different value from a concurrent writer, silently overwriting their update. Fixed to CompareAndSwap(cur, next) using the snapshot we computed the mutation from.	2026-05-25 00:06:47 +01:00
lukaszraczylo	17e3f8ef62	fix: snapshot patterns for refresh-tracker and metadata URLs
Two related lock-free snapshot refactors addressing the remaining post-v1.0.16 code-review findings.
1. refreshAttemptTracker: per-field atomic.Load/Store -> atomic.Value snapshot of attemptState (refresh_coordinator.go).
Previously each tracker held five independently-atomic fields. The cooldown-exit reset wrote cooldownEndNano = 0 first, then separately stored attempts = 1 and windowStartNano = now. A concurrent isInCooldown call could observe cooldownEndNano = 0 (reset just completed) with attempts still at MaxRefreshAttempts, immediately triggering a fresh cooldown: a benign double-trigger race that nonetheless meant the state machine had observable intermediate states.
New design: state is an attemptState (immutable) published via atomic.Value. All transitions (record/success/failure/window-reset/cooldown-enter/cooldown-exit) go through mutateState, which runs a CAS loop: load current snapshot -> construct fresh snapshot -> CompareAndSwap. Either the entire new state publishes or none of it does: no intermediate visibility, no cross-field race.
Under Yaegi this collapses 3-5 per-field atomic dispatches into one atomic.Value.Load on the read path. Write paths pay an extra allocation for the new snapshot but avoid the cross-field hazard.
2. MetadataSnapshot: hot-path readers use atomic.Value instead of metadataMu.RLock (middleware.go, types.go, main.go, utilities.go).
middleware.ServeHTTP previously took metadataMu.RLock on every non-bypass request to read the single field issuerURL. Under Yaegi each RLock acquisition costs 1-5ms of interpreter dispatch. updateMetadataEndpoints now also publishes an immutable *MetadataSnapshot via atomic.Value; the hot-path reader loads it in one op via t.metadataSnap().
Falls back to the legacy metadataMu.RLock pattern when the snapshot is unpublished (some test setups initialize the struct fields directly without going through updateMetadataEndpoints). Less-frequent callers (helpers, logout, token_introspection) still take metadataMu.RLock and are unchanged. The snapshot strictly subsets the metadataMu-protected fields, so those readers see identical data. Note on atomic.Pointer[T]: this would have been the cleaner type but yaegi v0.16.1's stdlib (used by traefik:v3.7.1) exposes only the legacy unsafe.Pointer-based atomic primitives — no generic Pointer[T]. atomic.Value provides the same semantics via interface{} + type assert. All tests pass with -race; golangci-lint clean.	2026-05-23 11:31:51 +01:00
lukaszraczylo	827926bc3a	fix(refresh-coordinator): trim per-request mutex/map ops
Three related changes addressing post-v1.0.15 code-review findings and the user's observation that we have been "throwing maps around": under Yaegi every sync.Map / atomic / mutex dispatch costs ~1-5ms of interpreter overhead, so the number of dispatches per request matters as much as whether they are lock-free.
1. Remove cleanupTimers map + cleanupTimerMu sync.Mutex.
scheduleDelayedCleanup previously tracked every pending timer in a map guarded by a mutex so a duplicate scheduling could cancel the prior timer. That "shouldn't happen" path was the only consumer of the map, but the mutex fired on every successful refresh completion, another per-request Yaegi-dispatched lock. performCleanup is already idempotent (LoadAndDelete on the sync.Map), so a duplicate firing is at worst a no-op second call. Dropped the map entirely; time.AfterFunc callback now calls performCleanup directly.
Net: -1 sync.Mutex, -1 map field, -2 Lock/Unlock pairs per refresh completion. Shutdown simplified: no need to enumerate-and-stop timers since the callbacks no longer need teardown.
2. Reorder applyLeaderGates: cooldown check BEFORE recordRefreshAttempt.
Previously incremented the attempt counter and then checked cooldown. Under burst load (many concurrent leaders with different token hashes but the same session) every goroutine could increment past MaxRefreshAttempts before any one of them observed the threshold, so the gate fired too late: the same thundering-herd shape that drove v1.0.14 into the ground. Reordering makes the gate authoritative: only attempts that pass the gate are recorded.
Semantic change: with MaxRefreshAttempts=N, exactly N attempts now run to completion before the (N+1)th is denied. Previously the Nth was denied as it tried to record (off-by-one stricter). Test assertion updated to N (was N-1).
3. Fix getOrCreateOperation MaxConcurrentRefreshes overshoot.
The previous CAS-loop allowed a transient overshoot of up to N-1 leaders when several goroutines all observed `current < max` in the same scheduling slice before any one of them succeeded their CAS, visible to readers as currentInFlightRefreshes > MaxConcurrentRefreshes for a brief window. Replaced with the ticket-and-return pattern: increment optimistically, decrement if we overshot. Strictly bounded: only the goroutine that produces max+1 sees max+1 as committed; the rest decrement back immediately. No CAS retry loop needed.
What was NOT done in this commit, and why:
* metadataMu.RLock cached via atomic snapshot: code-reviewer flagged this at severity 7 (3 RLocks per request: middleware.go:213, token_manager.go:349, token_manager.go:408). The clean fix is an atomic.Pointer[MetadataSnapshot], but generic atomic.Pointer[T] is NOT exposed by yaegi v0.16.1's stdlib (only legacy unsafe.Pointer primitives). atomic.Value would work but requires a snapshot-struct refactor across ~15 call sites (helpers/logout/token_introspection/token_manager/main/middleware). Deferred to a focused future PR.
* isInCooldown multi-field reset race: the cooldown-reset CAS wins on cooldownEndNano, then separately stores attempts/consecutiveFailures/windowStartNano. A concurrent isInCooldown can briefly see the pre-reset attempts value and trigger a fresh cooldown. Semantic glitch (double-cooldown), not a correctness disaster. Fix is a single atomic pointer swap of an immutable snapshot, with the same atomic.Pointer constraint as above. Deferred.
All tests pass with -race; golangci-lint clean.	2026-05-23 11:23:16 +01:00
lukaszraczylo	72e2b682bb	fix: eliminate per-request global mutexes in Yaegi hot paths
The v1.0.14 fix replaced one contended sync.RWMutex (RefreshCoordinator.refreshMutex) with sync.Map. Production showed the same death-spiral signature recurring ~2 hours later. Same shape, different mutex: 65 goroutines stuck on a sync.(*RWMutex).Lock at one address, pod pinned at 1000m CPU, identical Yaegi runCfg/reflect.Value.Call stack pattern. The mutex was RefreshCoordinator.attemptsMutex.
Generalising: under Yaegi (interpreted Go for traefik plugins), any per-request global mutex acquisition is a latent serialization point. reflect.Value.Call dispatch on a held lock turns a microsecond critical section into a multi-millisecond one, and on a GOMAXPROCS=1 pod the queue is unbounded.
This commit removes every per-request global mutex on the hot path:
1. RefreshCoordinator.attemptsMutex (sync.RWMutex)
sessionRefreshAttempts: map -> sync.Map. refreshAttemptTracker: all fields atomic (int32, int64 UnixNano, cooldownEndNano == 0 as the not-in-cooldown sentinel, replacing the inCooldown bool). isInCooldown / recordRefreshAttempt / recordRefreshSuccess / recordRefreshFailure all become lock-free. Cooldown entry uses CompareAndSwapInt64 so only one goroutine logs the transition.
2. RefreshCircuitBreaker.mutex (sync.RWMutex)
lastFailureTime / lastSuccessTime -> atomic.Int64 UnixNano. state and failures already atomic. AllowRequest / RecordSuccess / RecordFailure now pure atomic ops.
3. TraefikOidc.firstRequestMutex (sync.Mutex)
firstRequestReceived bool -> firstRequestStarted int32. metadataRefreshStarted bool -> metadataRefreshStartedAtomic int32. ServeHTTP bootstrap path uses CompareAndSwapInt32: fires once, zero steady-state cost. Previously the mutex was acquired on every non-health request forever.
4. TraefikOidc.metadataRetryMutex (sync.Mutex)
lastMetadataRetryTime time.Time -> lastMetadataRetryNano int64. The 30-second retry throttle is now a CAS on lastMetadataRetryNano.
cleanupStaleEntries iterates via sync.Map.Range; eviction is a CompareAndDelete by pointer identity so a tracker freshly re-used by a concurrent caller is not lost.
Empirical evidence (3 specialist-agent analysis of the v1.0.14 spike, profiles in /tmp/traefik-spike-1779511683/):
* mutex profile: 97% delay in sync.(*Mutex).Unlock via HTTPHandlerSwitcher -> accesslog -> metrics -> backoff.RetryNotify
* 65 stuck goroutines at one RWMutex address (0x40022eb648), identical Yaegi CFG pointer, all on rc.attemptsMutex via recordRefreshAttempt + isInCooldown
* traffic driver: long-lived in-cluster Go-http-client doing ~5.4 req/s POST embeddings via OIDC cookie session → same sessionID → contention all funnels to one tracker entry
Yaegi support for sync/atomic confirmed at github.com/traefik/yaegi@v0.16.1/stdlib/go1_22_sync_atomic.go: AddInt32/Int64, LoadInt32/Int64, StoreInt32/Int64, CompareAndSwapInt32/Int64 all exposed via reflect.ValueOf. Yaegi dispatches each call through reflect.Value.Call to the COMPILED atomic.* function, which executes a single hardware CAS/LOCK-XADD instruction. Each atomic op still pays Yaegi dispatch cost but cannot block: no queueing, no death spiral.
Trade-off acknowledged: v1.0.15 issues ~6-8 atomic/sync.Map ops per leader-path request vs the 4 mutex ops of v1.0.14. Under low contention this is a modest CPU bump. Under high contention it's an unbounded → bounded transformation. Net win.
All tests pass with -race; golangci-lint clean.	2026-05-23 10:47:21 +01:00
lukaszraczylo	ae4ccaa89d	fix(refresh-coordinator): replace global RWMutex with sync.Map
Under Yaegi, the RefreshCoordinator.refreshMutex was held for tens of milliseconds per request because every operation inside the critical section (map access, isInCooldown, recordRefreshAttempt, isUnderMemoryPressure, atomic ops, struct allocation) is dispatched through reflect.Value.Call with full arg boxing/unboxing. Concurrent refreshes on the same coordinator serialized into a queue that grew without bound.
Live capture in production (3 Grafana dashboards left open) showed:
* 63 goroutines stuck on rc.refreshMutex.Lock() for 1-11 minutes
* pod pinned at 1000m CPU (GOMAXPROCS=1)
* 5.15M allocs/sec, 0.45 RPS effective throughput
* yaegi.call.func9 accounting for 92.66% of cumulative allocs
* mutex profile dominated by sync.(*Mutex).Unlock via the request chain
Change inFlightRefreshes from map[string]refreshOperation+RWMutex to sync.Map and rewrite getOrCreateOperation to:
1. Speculatively allocate the candidate operation.
2. Atomically LoadOrStore by tokenHash. Joiners take the existing operation; leader takes the new one. No global lock acquired.
3. Leader runs rate-limit / cooldown / memory-pressure gates AFTER the atomic store. Joiners share the leader's outcome via op.done.
4. Reserve the concurrent-refresh slot via CompareAndSwap so the count cannot overshoot in absence of the old serializing lock.
5. On any gate failure the leader calls failCandidate, which deletes the entry from sync.Map, records the error on op.result and closes op.done so any joiner that snuck in returns the same error.
performCleanup becomes a single sync.Map.LoadAndDelete, eliminating the lock entirely on the cleanup path.
Net effect: critical section is no longer Yaegi-interpreted; it collapses to atomic instructions on a sharded sync.Map. Refresh contention disappears even under Yaegi.
All tests pass with -race; golangci-lint clean.	2026-05-23 02:34:49 +01:00
lukaszraczylo	1b6c8616fd	fix(refresh): coalesce refresh-token grants + bound goroutines + cache hot path (target v0.8.27) (#131)
* fix(refresh): wire RefreshCoordinator into the live refresh path
The RefreshCoordinator existed but was never instantiated. The actual refresh path used only session.refreshMutex, which is per-SessionData instance - and SessionData is pulled from a sync.Pool per request - so concurrent requests sharing a refresh token had ZERO coordination.
Symptom: when access_token expired (e.g. 5min Zitadel default), every in-flight request from a polling client (Grafana panels) entered the refresh path simultaneously and POSTed the same refresh_token to the IdP. With refresh-token rotation enabled (Zitadel/Authentik default), only one grant succeeded; the rest got invalid_grant and each cleared the entire session. Subsequent requests then thrashed in re-auth loops.
This commit:
- adds refreshCoordinator field on TraefikOidc
- instantiates it in NewWithContext with DefaultRefreshCoordinatorConfig
- shuts it down in Close() under shutdownOnce
- routes refreshToken() through the coordinator via coordinatedTokenRefresh, which collapses concurrent grants to a single upstream call per refresh_token hash
- exports refreshCoordinatorSessionID for both internal hashing and the middleware-level wireup so dedup keys stay aligned
Behavioural notes:
- nil-coordinator fallback preserves existing tests that build TraefikOidc literals without going through the constructor
- followers receive the same TokenResponse/error as the leader, so no per-instance code paths change
- existing TestGetNewTokenWithRefreshToken_Concurrency still passes because it hits GetNewTokenWithRefreshToken directly, below the coordinator boundary
Tests:
- refresh_coordinator_wireup_test.go: 50 concurrent refreshes coalesce to <=2 upstream calls; distinct tokens still run in parallel; nil coordinator falls back cleanly
* perf(cache): bound L1 backfill goroutines in HybridBackend
Get() and 
GetMany() previously spawned a goroutine per L2 hit to write the value through to L1. Under sustained polling traffic (e.g. a Grafana dashboard refreshing every 30s with N panels) this minted thousands of goroutines, each running in Yaegi - directly contributing to the ~1000% CPU spike that pairs with the refresh-token herd.
Replace the per-hit goroutines with a single l1BackfillWorker fed by l1BackfillBuffer, mirroring the existing asyncWriteBuffer/asyncWriteWorker pattern for L2 writes. Buffer overflow drops the backfill (counted via l1BackfillDrops) - a dropped backfill just means the next L2 hit for that key re-queues it, which is safe.
Tests:
- TestHybridBackend_L1BackfillBounded: 1000 distinct L2 hits keep goroutine count within +20 of baseline (pre-fix it grew by ~1000)
- TestHybridBackend_L1BackfillFullDrops: drops are accounted for when the buffer is saturated and the worker is stopped
* feat(refresh): implement isRefreshTokenExpired heuristic
Replace the placeholder `return false` with a real check based on the issued_at timestamp that SetRefreshToken already stamps into the session. Gated by a new MaxRefreshTokenAgeSeconds config field (default 21600 = 6h, matching the existing comment). 0 disables the check.
This wires the previously-dead refreshTokenExpired branch in middleware.go, which short-circuits AJAX requests with a 401 instead of letting them hammer the IdP for a refresh token that's almost certainly stale - the classic Grafana-after-long-pause failure mode.
Behaviour:
- maxRefreshTokenAge=0 disables the check (preserves prior behaviour)
- legacy sessions without issued_at still attempt one refresh; the IdP remains the source of truth on first try
- nil-receiver and nil-session guards keep test code that builds TraefikOidc literals safe
Tests:
- TestIsRefreshTokenExpired_DisabledWhenAgeZero
- TestIsRefreshTokenExpired_LegacySessionWithoutTimestamp
- TestIsRefreshTokenExpired_WithinWindow
- TestIsRefreshTokenExpired_BeyondWindow
- TestIsRefreshTokenExpired_NilGuards
* perf(token): skip parseJWT on cache hit in VerifyToken
The token cache fast-return existed but ran AFTER parseJWT, so every validation paid for base64 + JSON unmarshal even on a hit. Under bursty traffic (e.g. 10+ concurrent panel requests on every Grafana dashboard refresh, each calling validateStandardTokens which verifies BOTH the access token and the ID token), this is two redundant parses per request multiplied by the panel count.
Move the cache lookup ahead of parseJWT. On a hit the function returns nil immediately. On a miss the original flow runs unchanged.
Also nil-guard t.tokenCache to keep partial-literal test instances safe (matches the same pattern we already use for tokenBlacklist).
Tests:
- TestVerifyToken_CacheHitSkipsParse: cache pre-populated with claims for a token whose body would fail parseJWT - returns nil iff the fast-path bypasses the parse
- TestVerifyToken_CacheMissStillParses: a syntactically valid but unsigned token still errors past parseJWT on cache miss
* feat(refresh): cross-replica refresh-grant dedup via shared cache
The in-process RefreshCoordinator added in `9f96d8c` already collapses concurrent refresh-token grants on a single Traefik replica.
With the plugin's existing Redis (Dragonfly) cache infrastructure available, we can extend that dedup across replicas: if pod A refreshes a token at T+0 and pod B receives a request for the same session at T+1, pod B should reuse pod A's result rather than POSTing the now-rotated refresh token to the IdP.
Implementation:
- Add a refreshResultCache to UniversalCacheManager (memory-only when Redis is disabled, Redis-backed in production via the existing hybrid/Redis-only mode selection)
- Expose it through CacheManager.GetSharedRefreshResultCache and on the TraefikOidc struct as refreshResultCache (CacheInterface)
- Inside the closure passed to RefreshCoordinator.CoordinateRefresh, consult the cache first; on hit return immediately, on miss exchange with the IdP and populate the cache for peers
- 5s TTL: long enough for siblings to observe, short enough that a rotated refresh token cannot be re-supplied after the IdP has moved on
- Errors are intentionally NOT cached - peers must always be able to retry on their own
Pragmatic choice: optimistic cache rather than a hard distributed lock.
- A hard lock (SET NX + poll) doubles Redis RTT and risks dead-locks if a Traefik pod dies mid-grant.
- The user's BGP+Local externalTrafficPolicy already pins ingress for a session to one node in steady state, so cross-pod racing is rare.
- This optimistic path catches the rare failover case without adding failure modes.
Tests:
- TestCoordinatedTokenRefresh_CrossReplicaCacheHit: pre-populated cache short-circuits the upstream call entirely (0 IdP calls)
- TestCoordinatedTokenRefresh_PopulatesCrossReplicaCache: leader stores a successful result for peers to find
- TestCoordinatedTokenRefresh_ErrorIsNotCached: invalid_grant must not poison the dedup cache - peers must retry independently	2026-04-30 18:52:39 +01:00
lukaszraczylo	2d1b04c637	review fixes apr 2026 (#130)
* Multiple fixes
- refresh coordinator dedup + memory pressure wire
- middleware sse consolidation + timer leak + claim cache
- universal cache sync backfill + isDebug gate
- lazy background task race
- memory monitor stw cached + refresh() api
* fix(auth): suppress OIDC redirects on non-navigation requests
- [x] Add isNonNavigationRequest using Sec-Fetch-Mode and Accept headers
- [x] Add comprehensive TestIsNonNavigationRequest
- [x] Update ServeHTTP to 401 non-navigation and AJAX requests
Fixes #129
* feat(config): add custom CA and insecure skip verify for OIDC TLS
- [x] Add CACertPath, CACertPEM, InsecureSkipVerify to Config
- [x] Implement loadCACertPool for CA bundle loading
- [x] Update HTTPClientConfig with RootCAs and InsecureSkipVerify
- [x] Apply CA pool and skip verify to pooled HTTP clients
- [x] Enhance configKey to distinguish TLS configs
- [x] Add comprehensive ca_cert_test.go
Fixes #125
* feat(oidc): add custom CA certificate support for private OIDC providers
- [x] Add caCertPath, caCertPEM, insecureSkipVerify config options
- [x] Update traefik.yml with new OIDC client config fields
- [x] Add configuration schema descriptions for new options
- [x] Update README table and add Custom CA Certificates section
* Fix the documentation.
* test(redis): add oversized argument rejection test
- [x] Add TestRedisConn_RejectOversizedArgumentBytes
- [x] Import strings package
* Dependencies cleanup	2026-04-19 10:12:00 +01:00
lukaszraczylo	9d52f1b018	feat(core): refactor linters config and improve code quality (#119)
- [x] Reorganize golangci-lint configuration with documented disable reasons
- [x] Simplify errcheck and revive linter rules with targeted exclusions
- [x] Pre-compile regex patterns in input_validation.go for performance
- [x] Fix type assertions in memory_shard.go and resp.go with safety checks
- [x] Replace string comparison with EqualFold for case-insensitive matching
- [x] Fix loop variable captures in jwk.go and logout.go
- [x] Change high goroutine log level from Info to Debug in autocleanup.go
- [x] Replace deprecated "cancelled" spelling with "canceled" throughout
- [x] Add nolint annotations for intentional unused parameters
- [x] Improve comment formatting for deprecated functions
- [x] Fix comment spelling: "marshalling" → "marshaling"
- [x] Refactor provider warnings formatting in internal/providers/warnings.go
- [x] Simplify metrics summary building in internal/recovery/metrics.go
- [x] Pre-allocate slice in error_recovery.go GetDegradedServices
- [x] Refactor context cancellation checks in redis.go	2026-01-15 10:40:49 +00:00
lukaszraczylo	6efb78b7a8	Smarter approach to the cookies (#103)
* Smarter approach to the cookies
- Single maxCookieSize = 1400 constant with clear documentation
- Combined cookie storage for ~40-45% size reduction
- Backward compatible migration from legacy cookies
* Tuneup the code.	2025-12-12 18:35:06 +00:00
lukaszraczylo	5fcbd54955	Add sharded cache and prevention of CPU spikes / locks (#96)
* Add sharded cache and prevention of CPU spikes / locks
* Add dynamic client registration with oidc provider
* Fix race condition introduced during the sharded cache implementation.
* Add page for traefikoidc.	2025-11-30 01:41:12 +00:00
lukaszraczylo	1e4142a7fb	release 0.7.2 (#66)
* Remove trailing / from metadata provider.
* Resolves issue #67
- Before: 100 concurrent requests → 300+ refresh attempts → OOM
- After: 100 concurrent requests → 1 refresh attempt → Stable memory
Added following changes:
- Introduced a refresh coordinator to manage concurrent refresh requests
- Implemented a test to simulate high concurrency and verify memory stability
* Issue #67 fixed.	2025-09-25 12:52:53 +01:00

11 Commits