
[TESTING][MULTI-INSTANCE]: Leader Election, Redis Coordination, and Horizontal Scaling #2488

Goal

Produce a comprehensive manual test plan for validating correct behavior with multiple gateway instances, including leader election, state synchronization, cache coherence, and horizontal scaling.

Why Now?

Multi-instance deployment is critical for production:

  1. High Availability: No single point of failure
  2. Scalability: Handle increased load
  3. Zero Downtime: Rolling deployments
  4. Data Consistency: State sync across instances
  5. Leader Tasks: Background jobs run once

User Stories

US-1: Operator - High Availability

As an operator
I want multiple gateway instances
So that the service survives instance failures

Acceptance Criteria:

Feature: High Availability

  Scenario: Instance failure
    Given 3 gateway instances running
    When one instance crashes
    Then the service should continue
    And requests should be handled by remaining instances

US-2: Operator - Leader Election

As an operator
I want only one leader instance
So that background jobs don't run multiple times

Acceptance Criteria:

Feature: Leader Election

  Scenario: Leader failover
    Given instance A is the leader
    When instance A crashes
    Then instance B or C should become leader
    And there should never be two leaders

Architecture

                    MULTI-INSTANCE ARCHITECTURE
+------------------------------------------------------------------------+
|                                                                        |
|   Load Balancer                                                        |
|   -------------                                                        |
|        |                                                               |
|        +------------------+------------------+                         |
|        |                  |                  |                         |
|        v                  v                  v                         |
|   +---------+        +---------+        +---------+                    |
|   | Gateway |        | Gateway |        | Gateway |                    |
|   | Inst 1  |        | Inst 2  |        | Inst 3  |                    |
|   | (Leader)|        +---------+        +---------+                    |
|   +---------+             |                  |                         |
|        |                  |                  |                         |
|        +------------------+------------------+                         |
|                           |                                            |
|                           v                                            |
|   +----------------------------------------------------------------+  |
|   |                             Redis                              |  |
|   |  - Leader key (SET NX pattern)                                 |  |
|   |  - Pub/Sub for cache invalidation                              |  |
|   |  - Session state sync                                          |  |
|   +----------------------------------------------------------------+  |
|                           |                                            |
|   +----------------------------------------------------------------+  |
|   |                           PostgreSQL                           |  |
|   |  - Shared persistent state                                     |  |
|   +----------------------------------------------------------------+  |
|                                                                        |
+------------------------------------------------------------------------+
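
The leader-election primitive in the diagram (Redis SET NX with the 15-second TTL and 5-second heartbeat used throughout this plan) can be sketched as shell helpers. This is an illustrative sketch only, not the gateway's actual code (which lives in mcpgateway/services/gateway_service.py); the function names are hypothetical:

```shell
# Hypothetical helpers sketching the SET NX leader pattern (not the
# gateway's real implementation). Key name and TTL match this plan.
LEADER_KEY="gateway_service_leader"
LEADER_TTL=15   # seconds; the heartbeat renews every 5s

acquire_leadership() {   # $1 = instance id, e.g. "host:8001"
  # SET ... NX EX returns OK only if no leader key currently exists
  [ "$(redis-cli SET "$LEADER_KEY" "$1" NX EX "$LEADER_TTL")" = "OK" ]
}

renew_leadership() {     # refresh the TTL only if we still hold the key
  local res
  res=$(redis-cli EVAL "
    if redis.call('GET', KEYS[1]) == ARGV[1] then
      return redis.call('EXPIRE', KEYS[1], ARGV[2])
    else
      return 0
    end" 1 "$LEADER_KEY" "$1" "$LEADER_TTL")
  [ "$res" = "1" ]
}

release_leadership() {   # delete only if we still hold the key (graceful shutdown)
  redis-cli EVAL "
    if redis.call('GET', KEYS[1]) == ARGV[1] then
      return redis.call('DEL', KEYS[1])
    else
      return 0
    end" 1 "$LEADER_KEY" "$1" > /dev/null
}
```

The conditional Lua scripts matter: an unconditional EXPIRE or DEL could renew or release a key that another instance has since acquired, which is exactly the split-brain scenario MI-04 tests for.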

Test Environment Setup

# Start Redis
docker run -d --name redis -p 6379:6379 redis:7

# Start PostgreSQL
docker run -d --name postgres -p 5432:5432 \
  -e POSTGRES_PASSWORD=postgres \
  postgres:15

# Common environment
export DATABASE_URL="postgresql://postgres:postgres@localhost/gateway"
export REDIS_URL="redis://localhost:6379"
export JWT_SECRET_KEY="multi-instance-test"

# Create database
docker exec postgres createdb -U postgres gateway

# Instance ports
export PORTS=(8001 8002 8003)

Manual Test Cases

| Case  | Scenario           | Configuration     | Expected Result      |
|-------|--------------------|-------------------|----------------------|
| MI-01 | Single leader      | 3 instances       | Only one leader      |
| MI-02 | Leader failover    | Kill leader       | New leader elected   |
| MI-03 | Heartbeat renewal  | Leader running    | TTL refreshed        |
| MI-04 | Split-brain        | Network partition | No dual leaders      |
| MI-05 | Round-robin        | Load balancer     | All handle requests  |
| MI-06 | Cache invalidation | Update on A       | B sees update        |
| MI-07 | Background job     | Health check      | Only leader executes |
| MI-08 | Scale up           | Add instance      | Joins smoothly       |
| MI-09 | Scale down         | Remove non-leader | No impact            |
| MI-10 | Graceful shutdown  | Stop leader       | Releases key         |

MI-01: Single Leader Election

Steps:

# Start 3 instances
for port in ${PORTS[@]}; do
  python -m mcpgateway.main --port $port &
  sleep 3
done

# Wait for leader election
sleep 15

# Check Redis for leader
LEADER=$(redis-cli GET gateway_service_leader)
echo "Leader: $LEADER"

# Verify only one leader
LEADER_COUNT=$(redis-cli KEYS "gateway_service_leader*" | wc -l)
[ "$LEADER_COUNT" -eq 1 ] && echo "PASS: Single leader" || echo "FAIL: Multiple leaders"

Verification via API:

# Each instance should report its leader status
for port in ${PORTS[@]}; do
  echo "Instance $port:"
  curl -s "http://localhost:$port/api/admin/status" \
    -H "Authorization: Bearer $TOKEN" | jq '.is_leader'
done
# Exactly one should return true

Expected Result:

  • Exactly one instance becomes leader
  • Other instances are followers
  • Leader key exists in Redis

MI-02: Leader Failover

Steps:

# Identify current leader
LEADER=$(redis-cli GET gateway_service_leader)
LEADER_PORT=$(echo $LEADER | cut -d: -f2)
echo "Current leader on port: $LEADER_PORT"

# Kill leader
pkill -f "port $LEADER_PORT"

# Wait for failover (TTL is 15 seconds)
sleep 20

# Check new leader
NEW_LEADER=$(redis-cli GET gateway_service_leader)
echo "New leader: $NEW_LEADER"

# Verify different from old
[ "$LEADER" != "$NEW_LEADER" ] && echo "PASS: New leader elected" || echo "FAIL"

Service Continuity:

# While failover happens, check service continues
for i in {1..30}; do
  # Hit load balancer (or any live instance)
  curl -s -o /dev/null -w "%{http_code}\n" "http://localhost:${PORTS[1]}/health"
  sleep 1
done
# Should all return 200

Expected Result:

  • New leader elected within TTL + heartbeat interval
  • Service continues during failover
  • No split-brain
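
To put a number on "within TTL + heartbeat interval", a small polling loop can time the failover. This is a hypothetical helper, assuming the leader value format used in the steps above:

```shell
# Hypothetical: poll Redis until a leader different from $1 appears,
# or give up after $2 seconds. Prints the failover time on success.
wait_for_new_leader() {   # $1 = old leader value, $2 = timeout in seconds
  local start now leader
  start=$(date +%s)
  while :; do
    leader=$(redis-cli GET gateway_service_leader)
    now=$(date +%s)
    if [ -n "$leader" ] && [ "$leader" != "$1" ]; then
      echo "New leader '$leader' elected after $((now - start))s"
      return 0
    fi
    if [ $((now - start)) -ge "$2" ]; then
      echo "No new leader within $2 s" >&2
      return 1
    fi
    sleep 1
  done
}

# Usage (immediately after killing the old leader):
#   wait_for_new_leader "$LEADER" 30
```
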

MI-03: Heartbeat Renewal

Steps:

# Monitor TTL over time
for i in {1..10}; do
  TTL=$(redis-cli TTL gateway_service_leader)
  echo "TTL: $TTL seconds"
  sleep 5
done
# TTL should reset every heartbeat interval (5s)

Expected Result:

  • TTL consistently between 10-15 seconds
  • Never drops to 0 while leader running
  • Heartbeat refreshes every 5 seconds
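
The 10-15 second band can be asserted mechanically. Note that redis-cli TTL returns -2 for a missing key and -1 for a key without expiry; this hypothetical check treats both as failures, since they fall outside any positive band:

```shell
# Hypothetical: assert the leader key's TTL is within [min, max].
# TTL of -2 (no key) or -1 (no expiry) fails any positive band.
leader_ttl_in_band() {   # $1 = min seconds, $2 = max seconds
  local ttl
  ttl=$(redis-cli TTL gateway_service_leader)
  [ "$ttl" -ge "$1" ] && [ "$ttl" -le "$2" ]
}

# Usage, alongside the monitoring loop above:
#   leader_ttl_in_band 10 15 && echo "PASS" || echo "FAIL"
```
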

MI-04: Split-Brain Prevention

Simulate Network Partition:

# This is harder to test without container networking
# In Kubernetes, use NetworkPolicy to isolate

# Alternative: pause Redis briefly
docker pause redis
sleep 20  # Longer than TTL
docker unpause redis
sleep 10

# Check for single leader after recovery
LEADER=$(redis-cli GET gateway_service_leader)
echo "Leader after partition: $LEADER"

Expected Result:

  • During partition, old leader loses leadership
  • After recovery, single leader exists
  • No period with two leaders

MI-05: Round-Robin Request Handling

Setup nginx Load Balancer:

upstream gateway {
    server localhost:8001;
    server localhost:8002;
    server localhost:8003;
}

server {
    listen 8000;
    location / {
        proxy_pass http://gateway;
    }
}

Test Distribution:

# Make many requests and track which instance handles
for i in {1..30}; do
  curl -s "http://localhost:8000/health" | jq -r '.instance_id' >> /tmp/instances.txt
done

# Count distribution
sort /tmp/instances.txt | uniq -c
# Should be roughly even (10 each)
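
"Roughly even" can be made concrete with a simple skew check over the collected file. The 2x threshold here is an arbitrary assumption; pick whatever tolerance suits your load balancer:

```shell
# Hypothetical: fail if the busiest instance handled more than twice as
# many requests as the least busy one (the 2x threshold is an assumption).
check_distribution() {   # $1 = file of instance ids, one per line
  sort "$1" | uniq -c | awk '
    NR == 1 { min = $1; max = $1; next }
    { if ($1 < min) min = $1; if ($1 > max) max = $1 }
    END { exit (max <= 2 * min) ? 0 : 1 }'
}

# Usage:
#   check_distribution /tmp/instances.txt \
#     && echo "PASS: even" || echo "FAIL: skewed"
```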

Expected Result:

  • All instances handle requests
  • Load distributed (approximately)
  • No instance overloaded

MI-06: Cache Invalidation Across Instances

Steps:

# Create entity via instance 1
GATEWAY_ID=$(curl -s -X POST "http://localhost:8001/gateways" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "cache-test", "url": "http://localhost:9000"}' | jq -r '.id')

# Read via instance 2 (should be cached)
curl -s "http://localhost:8002/gateways/$GATEWAY_ID" \
  -H "Authorization: Bearer $TOKEN" | jq '.name'

# Update via instance 1
curl -s -X PUT "http://localhost:8001/gateways/$GATEWAY_ID" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "cache-test-updated", "url": "http://localhost:9000"}'

# Read via instance 2 (should see update)
sleep 1  # Allow pub/sub propagation
curl -s "http://localhost:8002/gateways/$GATEWAY_ID" \
  -H "Authorization: Bearer $TOKEN" | jq '.name'
# Should return "cache-test-updated"

Expected Result:

  • Cache invalidation propagates via Redis pub/sub
  • All instances see updated data
  • No stale cache entries
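
The fixed sleep 1 can make this test flaky. A polling assertion is more robust; this hypothetical helper reuses the endpoint and $TOKEN from the steps above:

```shell
# Hypothetical: poll instance $1 until gateway $2 reports name $3,
# giving up after $4 attempts (1s apart). Replaces the fixed sleep.
wait_for_name() {   # $1 = port, $2 = gateway id, $3 = expected name, $4 = attempts
  local i name
  for i in $(seq 1 "$4"); do
    name=$(curl -s "http://localhost:$1/gateways/$2" \
      -H "Authorization: Bearer $TOKEN" | jq -r '.name')
    [ "$name" = "$3" ] && return 0
    sleep 1
  done
  return 1
}

# Usage:
#   wait_for_name 8002 "$GATEWAY_ID" "cache-test-updated" 10 \
#     && echo "PASS: update propagated" || echo "FAIL: stale cache"
```
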

MI-07: Background Job Single Execution

Steps:

# Check health-check refresh logs (only the leader should emit them).
# Adjust the log source to your deployment: `docker logs` for containers,
# the process's stdout/stderr otherwise.
for port in ${PORTS[@]}; do
  docker logs gateway-$port 2>&1 | grep "health check refresh" | wc -l
done
# Only the leader should have these log entries

# Or check via metrics
for port in ${PORTS[@]}; do
  curl -s "http://localhost:$port/metrics" | grep background_job_executions
done
# Only leader should have non-zero count

Expected Result:

  • Only leader executes background jobs
  • Followers skip background job execution
  • No duplicate job runs

MI-08: Scale Up

Steps:

# Start with 3 instances
# Add 4th instance
python -m mcpgateway.main --port 8004 &
sleep 10

# Verify it joins without disrupting leader
LEADER=$(redis-cli GET gateway_service_leader)
echo "Leader unchanged: $LEADER"

# New instance handles requests
curl -s "http://localhost:8004/health" | jq .

Expected Result:

  • New instance starts without errors
  • Existing leader not disrupted
  • New instance handles requests

MI-09: Scale Down Non-Leader

Steps:

# Identify non-leader
LEADER=$(redis-cli GET gateway_service_leader)
LEADER_PORT=$(echo $LEADER | cut -d: -f2)

# Find non-leader port
for port in ${PORTS[@]}; do
  [ "$port" != "$LEADER_PORT" ] && NON_LEADER_PORT=$port && break
done

# Stop non-leader
pkill -f "port $NON_LEADER_PORT"

# Verify no impact
LEADER_AFTER=$(redis-cli GET gateway_service_leader)
[ "$LEADER" = "$LEADER_AFTER" ] && echo "PASS: Leader unchanged" || echo "FAIL"

Expected Result:

  • Non-leader removed cleanly
  • Leader not affected
  • Service continues

MI-10: Graceful Leader Shutdown

Steps:

# Get leader port
LEADER=$(redis-cli GET gateway_service_leader)
LEADER_PORT=$(echo $LEADER | cut -d: -f2)

# Send SIGTERM for graceful shutdown
kill $(pgrep -f "port $LEADER_PORT")

# Leader should release key
sleep 2
LEADER_AFTER=$(redis-cli GET gateway_service_leader)
echo "Leader after graceful shutdown: $LEADER_AFTER"
# May be empty (if shutdown before new election) or new leader

Expected Result:

  • Leader key released on shutdown
  • New leader elected quickly
  • No orphan leader key

Test Matrix

| Scenario        | Instances | Redis       | Pass Criteria     |
|-----------------|-----------|-------------|-------------------|
| Leader election | 3         | Online      | Single leader     |
| Leader failover | 3         | Online      | New leader < 20s  |
| Heartbeat       | 1 leader  | Online      | TTL refreshed     |
| Split-brain     | 3         | Partitioned | No dual leaders   |
| Cache sync      | 3         | Online      | Updates propagate |
| Background jobs | 3         | Online      | Leader only       |
| Scale up        | 3 -> 4    | Online      | No disruption     |
| Scale down      | 3 -> 2    | Online      | No disruption     |

Success Criteria

  • Exactly one leader at any time
  • Leader failover within 20 seconds
  • Heartbeat keeps leader key alive
  • No split-brain after network partition
  • Cache invalidation propagates across instances
  • Background jobs run on leader only
  • Scale up/down doesn't disrupt service
  • Graceful shutdown releases leader key

Related Files

  • mcpgateway/services/gateway_service.py - Leader election
  • mcpgateway/cache/ - Cache with pub/sub
  • mcpgateway/config.py - Leader configuration
