Commit Graph

48 Commits

Author SHA1 Message Date
lukaszraczylo da8ec5f21d Add LRU cache support. 2025-12-03 10:22:33 +00:00
lukaszraczylo 6a69694ab3 November improvements. (#29)
* Tackling the CPU / memory spikes after some time.

* Update admin dashboard, fix the circuit breaker and request coalescing.
2025-11-29 14:21:09 +00:00
lukaszraczylo 39dc7b49cf Improve caching by adding user ids and roles to hash. 2025-11-22 17:02:16 +00:00
lukaszraczylo e3e9f7d181 improvements mid apr 2025 (#27)
* General improvements and bug fixes.

* Improve tests coverage.

* fixup! Improve tests coverage.

* Update README.md with latest changes.

* Fix the uint32

* Resolve issue with race condition for logging.

* fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* Fix the test of the rate limiter

* Add default ratelimit.json file

* Update dependencies.

* Significant refactor.

* fixup! Significant refactor.

* fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! fixup! fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* Enhance admin dashboard with real-time WebSocket streaming and charts

Dashboard improvements:
- Added Chart.js for time-series visualization
- Created real-time graphs for RPS and cache hit rate
- Added new statistics displays:
  * System uptime with human-readable format
  * Total requests with success/failure breakdown
  * Current and average RPS
  * Success rate with progress bars
  * Cache hit rate and memory usage with visual indicators
  * Detailed cache statistics
- Implemented WebSocket streaming endpoint (/admin/ws/stats)
- Real-time updates every 2 seconds via WebSocket
- Automatic fallback to polling if WebSocket unavailable
- Connection status indicator
- Progress bars for success rate, cache hit rate, and memory usage

Backend enhancements:
- New WebSocket handler for streaming all statistics
- gatherAllStats() method to collect comprehensive metrics
- Streams data every 2 seconds to connected clients
- Automatic reconnection handling
- Maintains up to 60 data points per chart

The dashboard now provides comprehensive real-time monitoring with:
- Live metrics streaming
- Historical trend visualization
- Responsive design with visual indicators
- Graceful degradation to polling mode

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix WebSocket message handling for real-time stats streaming

Issues fixed:
- Removed blocking default case that prevented ticker from firing
- Separated read and write operations into proper goroutines
- Added proper ping/pong handlers with read deadlines
- Implemented done channel for clean disconnection signaling
- Send initial stats immediately on connection

The WebSocket now properly:
- Streams stats every 2 seconds via ticker
- Handles client disconnections gracefully
- Maintains connection with ping/pong
- Detects connection drops via read goroutine
- Non-blocking message handling
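
The loop shape described above can be sketched without any WebSocket library. This is a minimal illustration, not the project's handler: `streamStats`, the `send` callback, and the shortened demo intervals are all hypothetical, and the real code would write to a WebSocket connection instead.

```go
package main

import (
	"fmt"
	"time"
)

// streamStats sends one snapshot immediately, then one per tick, until done
// is closed. The select has no default case, so the loop blocks until either
// a tick or the disconnect signal arrives instead of spinning.
// It is generic over the tick type so it accepts time.Ticker.C or any other
// channel (an assumption made for this sketch only).
func streamStats[T any](ticks <-chan T, done <-chan struct{}, send func(string)) {
	send("initial")
	for {
		select {
		case <-done:
			return
		case <-ticks:
			send("tick")
		}
	}
}

func main() {
	// The commit message uses a 2-second ticker; shortened here for the demo.
	ticker := time.NewTicker(20 * time.Millisecond)
	defer ticker.Stop()

	done := make(chan struct{})
	// Stand-in for the read goroutine: it closes done when the client goes away.
	go func() {
		time.Sleep(70 * time.Millisecond)
		close(done)
	}()

	streamStats(ticker.C, done, func(s string) { fmt.Println(s) })
}
```

Keeping the read side in its own goroutine and signalling through `done` lets the write loop stay a plain blocking select.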

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Add Redis-based cluster mode for distributed metrics aggregation

When using Redis for caching, proxies now automatically form a cluster
and aggregate metrics across all instances for unified monitoring.

Features:
- Metrics Aggregator: Publishes instance metrics to Redis every 5s
- Cluster Mode API: /admin/api/cluster/stats and /admin/api/cluster/instances
- Dashboard Cluster View: Toggle between single instance and cluster view
- Auto-discovery: Detects cluster mode automatically via Redis
- Instance Management: Each instance gets unique ID (hostname + UUID)
- Graceful Cleanup: Removes metrics from Redis on shutdown
- TTL-based expiration: Stale instances auto-expire after 30s

Cluster metrics include:
- Aggregated requests (total, succeeded, failed, success rate)
- Combined RPS across all instances
- Total cache hits/misses with cluster-wide hit rate
- Per-instance health status and uptime
- Active connections and WebSocket stats
- Request coalescing backend savings

Dashboard improvements:
- Cluster Status section showing total/healthy instances
- Instance Details section with per-node metrics
- Cluster View toggle in header
- Automatic detection of cluster availability
- Individual instance cards with health indicators
- "Current" badge for the instance you're viewing

Architecture:
- Uses Redis SET to track active instances
- Each instance publishes to redis key: graphql-proxy:metrics:instances:{id}
- 30s TTL ensures stale instances are removed
- Aggregator started automatically when Redis cache enabled
- Registered with shutdown manager for graceful cleanup

Environment: Automatically enabled when ENABLE_REDIS_CACHE=true

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix cluster instance details using correct JSON field names

JavaScript was using Go struct field names (PascalCase) instead of
JSON field names (snake_case), causing all instance metrics to show
as 0 or undefined.

Fixed references:
- instance.InstanceID → instance.instance_id
- instance.Hostname → instance.hostname
- instance.UptimeSeconds → instance.uptime_seconds
- instance.Stats → instance.stats
- instance.Health → instance.health

Also added fallback to check both instance.cache_summary and
stats.cache_summary for better compatibility.
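
The mismatch comes from how `encoding/json` names fields: the struct tag, not the Go field name, decides what the browser receives. A minimal sketch, assuming a struct shaped like the one below (the `InstanceMetrics` type and `encodeInstance` helper are hypothetical; only the JSON field names come from the list above):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// InstanceMetrics is a hypothetical stand-in for the project's struct.
// The json tags define the snake_case names that JavaScript must use.
type InstanceMetrics struct {
	InstanceID    string                 `json:"instance_id"`
	Hostname      string                 `json:"hostname"`
	UptimeSeconds int64                  `json:"uptime_seconds"`
	Stats         map[string]interface{} `json:"stats"`
	Health        string                 `json:"health"`
}

// encodeInstance returns the JSON document as it would be sent to the dashboard.
func encodeInstance(m InstanceMetrics) (string, error) {
	b, err := json.Marshal(m)
	return string(b), err
}

func main() {
	out, err := encodeInstance(InstanceMetrics{
		InstanceID:    "node-1",
		Hostname:      "proxy-a",
		UptimeSeconds: 42,
		Health:        "healthy",
	})
	if err != nil {
		panic(err)
	}
	// The keys are instance_id, hostname, uptime_seconds, stats, health;
	// instance.InstanceID would be undefined on the JavaScript side.
	fmt.Println(out)
}
```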

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Enhance Cluster View toggle visibility and styling

Improvements:
- Replaced basic checkbox with custom toggle switch
- Added prominent card-style container with backdrop blur
- Positioned toggle in header next to dashboard title
- Toggle switch with smooth animation (slides left/right)
- Green color when enabled (#10b981)
- Hover effects with slight lift
- Better typography with font weights and spacing
- Info text positioned below toggle label
- Disabled state with reduced opacity
- Responsive layout with flexbox

The toggle is now much more visible and professional-looking,
making it clear when cluster mode is available and active.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Add comprehensive debugging for cluster mode metrics issues

Backend improvements:
- Fixed metrics structure: ensure Stats always has correct inner structure
- Added defensive nil checks for instance.Stats in aggregation
- Added debug logging in publishMetrics to verify data being sent
- Added warning logging in aggregateStats when data is missing
- Log actual keys present in Stats when 'requests' is missing
- Initialize empty maps instead of leaving fields nil

Frontend improvements:
- Added console.log statements to trace cluster data flow
- Log cluster data structure on receive
- Log stats keys and structure
- Log instances array and count
- Warn when expected data is missing
- Added fallback values (|| 0) for display fields

This will help diagnose why cluster view shows zeros by logging:
1. What data is being published to Redis
2. What data is being retrieved from Redis
3. What structure the data has at each step
4. What keys are present vs expected

Check browser console and server logs to see where the data flow breaks.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Add cluster debug endpoint for troubleshooting

New endpoint: GET /admin/api/cluster/debug

Returns comprehensive debug information:
- Whether metrics aggregator is initialized
- Redis cache enabled status
- Current instance ID
- Cluster mode status
- Total/healthy instance counts
- Sample instance structure with keys
- Sample requests data structure
- Error messages if any

This helps diagnose cluster mode issues by showing:
1. If Redis is actually enabled
2. If aggregator is initialized
3. What data structure is being stored
4. What keys are present in Stats
5. Sample of actual data being aggregated

Visit http://localhost:8181/admin/api/cluster/debug to see
what's happening under the hood.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix cluster mode initialization and improve Redis error visibility

Critical fixes:
1. Move metrics aggregator initialization BEFORE cache initialization
   - Runs independently even if CacheEnable=false
   - Only requires CacheRedisEnable=true
   - This was causing aggregator to not initialize when cache was disabled

2. Promote Redis errors from Warning to Error level
   - Changed "Failed to publish" from Warning to Error with CRITICAL prefix
   - Added detailed error context (instance_id, keys, error message)
   - Added success logging with ✓ confirmation
   - Log command count and data size on success

3. Enhanced startup logging
   - Log "Initializing metrics aggregator" with Redis URL/DB
   - Log "✓ Successfully initialized" with instance ID
   - Log "FAILED to initialize" as ERROR (was Warning)

Why this matters:
- If Redis cache is disabled but Redis is available, cluster mode should still work
- Previous code only initialized aggregator inside cache initialization block
- Redis publish errors were being silently logged as warnings
- No visibility into whether metrics were actually being stored

After this fix:
- Cluster mode works even with ENABLE_GLOBAL_CACHE=false + ENABLE_REDIS_CACHE=true
- Redis errors are immediately visible in logs
- Clear success/failure indicators
- Better troubleshooting information

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Add force-publish endpoint for cluster mode debugging

New endpoint: POST /admin/api/cluster/force-publish

Forces an immediate metrics publish to Redis and reports results.
This helps diagnose why instances aren't appearing:

Response includes:
- success: true/false
- publish_done: confirmation publish was attempted
- instances_found: count after publish
- error: if retrieval failed
- check_logs: reminder to look for log messages

Use this to test:
curl -X POST http://localhost:8181/admin/api/cluster/force-publish

Then check server logs for:
✓ Successfully published metrics to Redis
OR
CRITICAL: Failed to publish metrics to Redis

This bypasses the 5-second timer and publishes immediately,
making it easier to test without waiting.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix cluster metrics aggregation and dashboard display issues

- Fix metric names: use correct Prometheus metric names (requests_succesful, requests_failed, requests_skipped)
- Add automatic stale instance cleanup (>1 minute inactive)
- Implement 10-second moving average smoothing for RPS, success rate, and cache hit rate
- Add trend indicators (↑ ↗ → ↘ ↓) to show metric direction
- Add compact number formatting (1.2M, 3.4K) with full-value tooltips
- Add retry budget aggregation (allowed/denied retries, denial rate)
- Add circuit breaker aggregation (state counts, per-instance breakdown)
- Add coalescing stats aggregation (backend savings percentage)
- Fix memory display to show "N/A" for Redis cache (memory tracking not available)
- Fix JavaScript error: change hitRate to smoothedHitRate in chart update call
- Change Redis operations to use context.Background() instead of parent context
- Fix staticcheck warning: omit nil check for map len()

This resolves cluster view showing zeros and prevents metrics from disappearing.
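
The moving average smoothing mentioned in the list can be sketched as a fixed-size window over recent samples. This is an illustration under assumptions: the commit does not say where the smoothing lives or how the window is stored, and the `movingAverage` type is hypothetical. With one sample every 2 seconds, a 10-second window would hold 5 samples.

```go
package main

import "fmt"

// movingAverage keeps the most recent `size` samples and reports their mean.
type movingAverage struct {
	size    int
	samples []float64
	sum     float64
}

// newMovingAverage builds a window; sizes below 1 are clamped to 1.
func newMovingAverage(size int) *movingAverage {
	if size < 1 {
		size = 1
	}
	return &movingAverage{size: size}
}

// Add records a sample, drops the oldest one once the window is full,
// and returns the mean of the samples currently in the window.
func (m *movingAverage) Add(v float64) float64 {
	m.samples = append(m.samples, v)
	m.sum += v
	if len(m.samples) > m.size {
		m.sum -= m.samples[0]
		m.samples = m.samples[1:]
	}
	return m.sum / float64(len(m.samples))
}

func main() {
	// 5 samples at a 2-second interval approximate a 10-second window.
	rps := newMovingAverage(5)
	for _, sample := range []float64{100, 120, 80, 300, 110, 90} {
		fmt.Printf("raw=%.0f smoothed=%.1f\n", sample, rps.Add(sample))
	}
}
```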

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Fix cluster instance count not updating in upper right corner

The cluster-info element (showing instance count next to "Cluster View" toggle)
was only updated during initial page load. Now it updates in real-time whenever
cluster stats are received via WebSocket.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* Clean up verbose debug logging

Removed debug console.log statements from dashboard JavaScript:
- WebSocket connection/disconnection logs
- Cluster mode availability checks
- Cluster stats update debug logs

Removed verbose Info logs from Go code that ran frequently:
- "Publishing metrics to Redis" (every 5s)
- "Metrics gathered successfully" (every 5s)
- "Successfully published metrics to Redis" (every 5s)
- "Aggregating stats from instances" (frequently)
- "Successfully aggregated cluster metrics" (frequently)
- "Aggregation complete" (frequently)

Kept important logs:
- Error and warning logs
- Initialization and shutdown logs
- Conditional logs (stale instance cleanup, failures)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-01 00:25:32 +01:00
lukaszraczylo 0fc776228f improvements mid apr 2025 (#26)
* General improvements and bug fixes.
* Admin dashboard
2025-10-01 00:20:45 +01:00
lukaszraczylo cedee416a8 improvements mid may 2025 (#24)
* General improvements and bug fixes.

* Improve tests coverage.

* fixup! Improve tests coverage.

* Update README.md with latest changes.

* Fix the uint32

* Resolve issue with race condition for logging.

* fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* Fix the test of the rate limiter

* Add default ratelimit.json file

* Update dependencies.

* Significant refactor.

* fixup! Significant refactor.

* fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! fixup! fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025

* fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! fixup! Merge remote-tracking branch 'origin/main' into improvements-mid-apr-2025
2025-09-30 18:27:33 +01:00
lukaszraczylo 1b7890f322 Gofmt the codebase. 2025-02-26 00:47:41 +00:00
lukaszraczylo 66c8fef24d Improve the test coverage 2025-02-26 00:44:14 +00:00
lukaszraczylo 2ab78d35ce Configuration Management:
Optimized the getDetailsFromEnv function to reduce redundant lookups and improve type handling
Added direct environment variable access for better performance

Memory Cache Optimization:

Implemented a size-based compression threshold (1KB) to avoid compressing small payloads
Added cache size limits (10,000 entries) to prevent memory leaks
Implemented efficient eviction strategies for the oldest entries
Added atomic counter for thread-safe cache size tracking
Improved cleanup routines with GC triggering for large caches
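
A size-based compression threshold can be sketched as follows. This is a minimal illustration, not the project's cache code: `maybeCompress` and `compressionThreshold` are hypothetical names, gzip is assumed as the codec, and whether a payload of exactly 1KB is compressed is a guess.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// compressionThreshold is the 1KB limit named in the commit message.
const compressionThreshold = 1024

// maybeCompress gzips payloads larger than the threshold and returns small
// ones untouched. The boolean reports whether the result is compressed, so
// the reader knows whether to decompress on the way out.
func maybeCompress(payload []byte) ([]byte, bool, error) {
	if len(payload) <= compressionThreshold {
		return payload, false, nil
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(payload); err != nil {
		return nil, false, err
	}
	if err := zw.Close(); err != nil {
		return nil, false, err
	}
	return buf.Bytes(), true, nil
}

func main() {
	small := []byte(`{"data":{"id":1}}`)
	large := bytes.Repeat([]byte(`{"data":{"id":1}}`), 500)

	for _, p := range [][]byte{small, large} {
		stored, compressed, err := maybeCompress(p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("in=%d stored=%d compressed=%v\n", len(p), len(stored), compressed)
	}
}
```

Skipping small payloads avoids paying gzip's fixed header and CPU cost where the compressed form would save little or nothing.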

Proxy Implementation:

Refactored the proxy code into smaller, focused functions for better maintainability
Optimized gzip handling for better performance
Improved error handling and logging
Enhanced tracing integration

GraphQL Processing:

Optimized introspection query checking with fast-path returns
Improved object pool usage
Added detailed comments for better code understanding
Split complex functions into smaller, more focused ones
Fixed test compatibility issues with introspection checking

Request Processing:

Refactored the request processing logic into smaller, focused functions
Separated user extraction, caching, and request handling for better maintainability
Improved error handling and response generation

Tracing Enhancements:

Added better span context management
Implemented custom attributes for more detailed tracing
Added sampling configuration to reduce overhead
Improved resource attribution with host and process information
Added timeout handling for tracing operations

Application Lifecycle:

Implemented graceful shutdown with proper signal handling
Added goroutine management with wait groups
Improved startup sequence with better error handling
Added timeout handling for shutdown operations
2025-02-25 23:34:39 +00:00
lukaszraczylo 6af5aefe54 Add tracing and relevant tests (#21)
* Add tracing and relevant tests.

* fixup! Add tracing and relevant tests.

* gofmt the code 🤷

* fixup! gofmt the code 🤷
2025-01-08 18:29:25 +00:00
lukaszraczylo cbe2afe539 Gather cleaner event errors and display as a group rather than separately. 2024-12-06 09:39:26 +00:00
lukaszraczylo 6b31e5c4c0 Little code cleanup. (#19) 2024-10-10 10:34:23 +01:00
lukaszraczylo 68526ddfd4 Add delay, in case the container runs in deployment with hasura.
Hasura takes its sweet time to start up, which causes the client to error out
because the graphql proxy fires up practically instantly. This is a temporary workaround.
2024-08-20 13:14:24 +01:00
lukaszraczylo 9f9e36efa9 fixup! Enhance the retry logic for the proxied queries. 2024-08-20 13:07:37 +01:00
lukaszraczylo 5b171b2317 Add initial retry for end graphql server connection. 2024-08-20 12:40:04 +01:00
lukaszraczylo ae9a44033b New release.
Includes the fix for the panic when cache is completely disabled.
2024-08-19 15:43:42 +01:00
lukaszraczylo dc9e0906fd Resolve issue when proxy could panic.
Issue occurred when cache was disabled via environment variables but
graphql queries contained the cache directive.
2024-08-19 11:27:06 +01:00
lukaszraczylo dfd3b02014 Release 0.19.x 2024-06-29 09:57:52 +01:00
lukaszraczylo 16844e325e Disable caller as it's not necessary and generates slight delay. 2024-06-19 23:40:44 +01:00
lukaszraczylo 61d7a45d00 Update cache library, use miniredis for testing, add additional benchmarks. (#14)
Update cache library,
Update logging library,
use miniredis for testing, add additional benchmarks.
2024-06-19 23:10:36 +01:00
lukaszraczylo b2380c689b Add cleanup of the event and invocation logs on timer. 2024-06-12 11:47:21 +01:00
lukaszraczylo 12e0294945 Add distributed cache with Redis 2024-06-11 11:35:50 +01:00
lukaszraczylo 371d51f96f Update dependencies. 2024-06-11 11:35:49 +01:00
lukaszraczylo a9fd6b3d0a Release: Add cache operations via API + support distributed redis cache. 2024-06-11 11:35:46 +01:00
lukaszraczylo e495cf23d9 Read only endpoint support (#10)
* This change introduces the ability to set an additional endpoint pointing to the
instance of the graphql server connected to the read-only database.
If a regular query is detected and `HOST_GRAPHQL_READONLY` is set,
the query is proxied to it. Mutations and non-graphql requests are sent
to the `HOST_GRAPHQL` endpoint.
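
The routing rule can be sketched as a small function. This is an illustration under assumptions: `pickEndpoint` and `isPlainQuery` are hypothetical names, and the prefix check below is a deliberately naive stand-in for real GraphQL parsing, which the project presumably uses.

```go
package main

import (
	"fmt"
	"strings"
)

// isPlainQuery is a naive, hypothetical check: it treats a document that
// starts with "query" or the "{" shorthand as a read. Anything else,
// including mutations and non-GraphQL payloads, is not a plain query.
func isPlainQuery(body string) bool {
	b := strings.TrimSpace(body)
	return strings.HasPrefix(b, "query") || strings.HasPrefix(b, "{")
}

// pickEndpoint returns the read-only endpoint for plain queries when one is
// configured, and the primary endpoint for everything else.
// primary stands for HOST_GRAPHQL, readOnly for HOST_GRAPHQL_READONLY.
func pickEndpoint(body, primary, readOnly string) string {
	if readOnly != "" && isPlainQuery(body) {
		return readOnly
	}
	return primary
}

func main() {
	primary := "http://graphql-rw:8080"
	readOnly := "http://graphql-ro:8080"

	fmt.Println(pickEndpoint("query { users { id } }", primary, readOnly))
	fmt.Println(pickEndpoint("mutation { deleteUser(id: 1) { id } }", primary, readOnly))
	fmt.Println(pickEndpoint("query { users { id } }", primary, ""))
}
```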
2024-03-12 11:16:35 +00:00
lukaszraczylo 3a18e0e935 Improve stats gathering and tests improvements. (#8) 2024-03-05 22:40:06 +00:00
lukaszraczylo d3a8da1dcf Move location of the global proxy client from the per-req to main.
There's no need to re-create it every single time.
2024-02-05 14:35:33 +00:00
lukaszraczylo 794cb1ddf4 Add the prefixed environment variables to avoid potential conflicts. 2024-02-05 14:24:17 +00:00
lukaszraczylo 1390e7cdd1 Fix blocking the introspection + add unit tests. 2023-11-18 02:11:38 +00:00
lukaszraczylo 105c624426 Add purging metrics on timer. 2023-11-17 13:47:54 +00:00
lukaszraczylo 0b642f8be1 Add ability to reset metrics between crawl to limit payload absorbed (#5)
by the prometheus/victoria metric crawlers.
2023-11-16 16:45:48 +00:00
lukaszraczylo 3d70018179 Add configurable timeout for queries. 2023-10-24 10:40:17 +01:00
lukaszraczylo 35e6069f5e Add the healthcheck checks on the end server. 2023-10-19 15:43:49 +01:00
lukaszraczylo 92359c1114 Cleanup pt 1 (#4)
* Disable startup headers.

* Add banning / unbanning of specific user.
2023-10-19 14:36:16 +01:00
lukaszraczylo 2a0302ab75 Create allow list for the case when introspection is blocked but developers
really want to use certain subqueries.
2023-10-15 10:01:23 +01:00
lukaszraczylo bf9ec2c877 Reuse fasthttp client 2023-10-12 21:16:57 +01:00
lukaszraczylo 815a6841ed Add ability to set up allowed paths for proxying. 2023-10-12 14:12:03 +01:00
lukaszraczylo f41b2ae46f New: Proxy all the requests to the graphql server 2023-10-11 11:26:55 +01:00
lukaszraczylo 1a3628837f Extract helper libraries from private repo of telegram-bot.app 2023-10-10 22:16:50 +01:00
lukaszraczylo 51dfc8d9be Add ability to look for the role in header. 2023-10-10 19:48:56 +01:00
lukaszraczylo 7de1cf7cc7 Add read only mode to block all the queries with mutations. 2023-10-10 19:26:36 +01:00
lukaszraczylo ac44056a00 Add role ratelimit (#1)
* Add ratelimit configuration.
* Add rate limiting :party:
2023-10-09 17:46:50 +01:00
lukaszraczylo 743eed7f71 Add ability to enable / disable access log.
In high frequency environments it can be a little bit noisy.
2023-10-09 17:46:50 +01:00
lukaszraczylo b89053c015 Update README. 2023-10-09 17:46:50 +01:00
lukaszraczylo e7b2cc1deb Update readme and make it release ready. 2023-10-08 18:38:55 +01:00
lukaszraczylo 3ac7c115aa Blocking introspection queries. 2023-10-08 18:07:24 +01:00
lukaszraczylo 8673f1caf8 Remove println and replace it with our logging 2023-10-07 14:13:29 +01:00
lukaszraczylo 39d3afdd05 Initial commit. 2023-10-07 11:14:20 +01:00