[TESTING][MULTI-INSTANCE]: Leader Election, Redis Coordination, and Horizontal Scaling
Goal
Produce a comprehensive manual test plan for validating correct behavior with multiple gateway instances, including leader election, state synchronization, cache coherence, and horizontal scaling.
Why Now?
Multi-instance deployment is critical for production:
- High Availability: No single point of failure
- Scalability: Handle increased load
- Zero Downtime: Rolling deployments
- Data Consistency: State sync across instances
- Leader Tasks: Background jobs run once
User Stories
US-1: Operator - High Availability
As an operator
I want multiple gateway instances
So that the service survives instance failures
Acceptance Criteria:
Feature: High Availability
  Scenario: Instance failure
    Given 3 gateway instances running
    When one instance crashes
    Then the service should continue
    And requests should be handled by remaining instances
US-2: Operator - Leader Election
As an operator
I want only one leader instance
So that background jobs don't run multiple times
Acceptance Criteria:
Feature: Leader Election
  Scenario: Leader failover
    Given instance A is the leader
    When instance A crashes
    Then instance B or C should become leader
    And there should never be two leaders
Architecture
MULTI-INSTANCE ARCHITECTURE
+------------------------------------------------------------------------+
| |
| Load Balancer |
| ------------- |
| | |
| +------------------+------------------+ |
| | | | |
| v v v |
| +---------+ +---------+ +---------+ |
| | Gateway | | Gateway | | Gateway | |
| | Inst 1 | | Inst 2 | | Inst 3 | |
| | (Leader)| +---------+ +---------+ |
| +---------+ | | |
| | | | |
| +------------------+------------------+ |
| | |
| v |
| +----------------------------------------------------------------+ |
| | Redis | |
| | - Leader key (SET NX pattern) | |
| | - Pub/Sub for cache invalidation | |
| | - Session state sync | |
| +----------------------------------------------------------------+ |
| | |
| +----------------------------------------------------------------+ |
| | PostgreSQL | |
| | - Shared persistent state | |
| +----------------------------------------------------------------+ |
| |
+------------------------------------------------------------------------+
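The "SET NX pattern" named in the diagram can be sketched without a live Redis by modeling the key store in memory. This is an illustrative sketch, not the mcpgateway implementation: the `FakeRedis` class and `instance:<port>` id format are assumptions; the key name `gateway_service_leader` and the 15-second TTL follow the test cases below.

```python
import time

class FakeRedis:
    """Minimal in-memory stand-in for the Redis commands used here."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key, value, ttl):
        """Models SET key value NX EX ttl; returns True if acquired."""
        entry = self._data.get(key)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return False  # key is held and not yet expired
        self._data[key] = (value, now + ttl)
        return True

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= time.monotonic():
            return None
        return entry[0]

LEADER_KEY = "gateway_service_leader"
LEADER_TTL = 15  # seconds, matching the failover tests below

def try_acquire_leadership(redis, instance_id):
    """Every instance races for the key; only one SET NX can win."""
    return redis.set_nx_ex(LEADER_KEY, instance_id, LEADER_TTL)

r = FakeRedis()
results = [try_acquire_leadership(r, f"instance:{port}") for port in (8001, 8002, 8003)]
print(results)            # exactly one True
print(r.get(LEADER_KEY))  # the winner's id
```

Because the SET is atomic and conditional on the key being absent, two instances can never both see "acquired" for the same lease period.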
Test Environment Setup
# Start Redis
docker run -d --name redis -p 6379:6379 redis:7
# Start PostgreSQL
docker run -d --name postgres -p 5432:5432 \
-e POSTGRES_PASSWORD=postgres \
postgres:15
# Common environment
export DATABASE_URL="postgresql://postgres:postgres@localhost/gateway"
export REDIS_URL="redis://localhost:6379"
export JWT_SECRET_KEY="multi-instance-test"
# Create database
docker exec postgres createdb -U postgres gateway
# Instance ports
export PORTS=(8001 8002 8003)
Manual Test Cases
| Case  | Scenario           | Configuration     | Expected Result      |
|-------|--------------------|-------------------|----------------------|
| MI-01 | Single leader      | 3 instances       | Only one leader      |
| MI-02 | Leader failover    | Kill leader       | New leader elected   |
| MI-03 | Heartbeat renewal  | Leader running    | TTL refreshed        |
| MI-04 | Split-brain        | Network partition | No dual leaders      |
| MI-05 | Round-robin        | Load balancer     | All handle requests  |
| MI-06 | Cache invalidation | Update on A       | B sees update        |
| MI-07 | Background job     | Health check      | Only leader executes |
| MI-08 | Scale up           | Add instance      | Joins smoothly       |
| MI-09 | Scale down         | Remove non-leader | No impact            |
| MI-10 | Graceful shutdown  | Stop leader       | Releases key         |
MI-01: Single Leader Election
Steps:
# Start 3 instances
for port in ${PORTS[@]}; do
python -m mcpgateway.main --port $port &
sleep 3
done
# Wait for leader election
sleep 15
# Check Redis for leader
LEADER=$(redis-cli GET gateway_service_leader)
echo "Leader: $LEADER"
# Verify only one leader
LEADER_COUNT=$(redis-cli KEYS "gateway_service_leader*" | wc -l)
[ "$LEADER_COUNT" -eq 1 ] && echo "PASS: Single leader" || echo "FAIL: Multiple leaders"
Verification via API:
# Each instance should report its leader status
for port in ${PORTS[@]}; do
echo "Instance $port:"
curl -s "http://localhost:$port/api/admin/status" \
-H "Authorization: Bearer $TOKEN" | jq '.is_leader'
done
# Exactly one should return true
Expected Result:
- Exactly one instance becomes leader
- Other instances are followers
- Leader key exists in Redis
MI-02: Leader Failover
Steps:
# Identify current leader
LEADER=$(redis-cli GET gateway_service_leader)
LEADER_PORT=$(echo $LEADER | cut -d: -f2)
echo "Current leader on port: $LEADER_PORT"
# Kill leader
pkill -f "port $LEADER_PORT"
# Wait for failover (TTL is 15 seconds)
sleep 20
# Check new leader
NEW_LEADER=$(redis-cli GET gateway_service_leader)
echo "New leader: $NEW_LEADER"
# Verify different from old
[ "$LEADER" != "$NEW_LEADER" ] && echo "PASS: New leader elected" || echo "FAIL"
Service Continuity:
# While failover happens, check service continues
for i in {1..30}; do
# Hit load balancer (or any live instance)
curl -s -o /dev/null -w "%{http_code}\n" "http://localhost:${PORTS[1]}/health"
sleep 1
done
# Should all return 200
Expected Result:
- New leader elected within TTL + heartbeat interval
- Service continues during failover
- No split-brain
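The 20-second wait in the steps above follows from the timing bound stated in the expected result; a quick sanity check of that arithmetic, using the 15 s TTL and 5 s heartbeat interval given elsewhere in this plan:

```python
# Worst-case failover: the old lease must fully expire (TTL), then a
# follower must attempt acquisition on its next heartbeat tick.
LEASE_TTL = 15  # seconds, per MI-02
HEARTBEAT = 5   # seconds, per MI-03

worst_case_failover = LEASE_TTL + HEARTBEAT
print(worst_case_failover)  # 20 s, matching the `sleep 20` in the steps
```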
MI-03: Heartbeat Renewal
Steps:
# Monitor TTL over time
for i in {1..10}; do
TTL=$(redis-cli TTL gateway_service_leader)
echo "TTL: $TTL seconds"
sleep 5
done
# TTL should reset every heartbeat interval (5s)
Expected Result:
- TTL consistently between 10-15 seconds
- Never drops to 0 while leader running
- Heartbeat refreshes every 5 seconds
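The TTL band in the expected result follows directly from the lease and heartbeat values; a small simulation (the clock model is illustrative, the constants come from this plan) shows why the observed TTL should stay in the 10-15 s range:

```python
LEASE_TTL = 15  # seconds the leader key lives without renewal
HEARTBEAT = 5   # renewal interval

def observed_ttl(seconds_since_start):
    """TTL an observer would read, assuming every heartbeat succeeds."""
    since_last_renewal = seconds_since_start % HEARTBEAT
    return LEASE_TTL - since_last_renewal

samples = [observed_ttl(t) for t in range(0, 30)]
print(min(samples), max(samples))  # never drops below LEASE_TTL - HEARTBEAT
assert all(LEASE_TTL - HEARTBEAT < ttl <= LEASE_TTL for ttl in samples)
```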
MI-04: Split-Brain Prevention
Simulate Network Partition:
# This is harder to test without container networking
# In Kubernetes, use NetworkPolicy to isolate
# Alternative: pause Redis briefly
docker pause redis
sleep 20 # Longer than TTL
docker unpause redis
sleep 10
# Check for single leader after recovery
LEADER=$(redis-cli GET gateway_service_leader)
echo "Leader after partition: $LEADER"
Expected Result:
- During partition, old leader loses leadership
- After recovery, single leader exists
- No period with two leaders
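The "old leader loses leadership" behavior hinges on the leader tracking its own lease locally: if it cannot reach Redis to renew, it must step down once its last-known lease expires. A sketch of that local-lease check (the `Lease` class is illustrative, not the actual service code):

```python
class Lease:
    """Local view of leadership; lapses unless renewed (illustrative)."""
    def __init__(self, ttl):
        self.ttl = ttl
        self.expires_at = 0.0

    def renew(self, now):
        """Called after each successful heartbeat write to Redis."""
        self.expires_at = now + self.ttl

    def is_valid(self, now):
        return now < self.expires_at

lease = Lease(ttl=15)
lease.renew(now=0.0)

# While Redis is reachable, heartbeats keep the lease valid.
assert lease.is_valid(now=5.0)

# During a partition longer than the TTL (cf. the 20 s pause above),
# renewal fails, the local lease lapses, and the old leader stops
# running leader-only work instead of risking a second leader.
print("old leader stepped down:", not lease.is_valid(now=20.0))
```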
MI-05: Round-Robin Request Handling
Setup nginx Load Balancer:
upstream gateway {
server localhost:8001;
server localhost:8002;
server localhost:8003;
}
server {
listen 8000;
location / {
proxy_pass http://gateway;
}
}
Test Distribution:
# Make many requests and track which instance handles each
rm -f /tmp/instances.txt
for i in {1..30}; do
curl -s "http://localhost:8000/health" | jq -r '.instance_id' >> /tmp/instances.txt
done
# Count distribution
sort /tmp/instances.txt | uniq -c
# Should be roughly even (~10 each)
Expected Result:
- All instances handle requests
- Load distributed (approximately)
- No instance overloaded
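With nginx's default round-robin upstream balancing, the distribution from the test above should be exactly even; a sketch of the expected split (instance ids are illustrative):

```python
from collections import Counter
from itertools import cycle

# Round-robin: requests rotate through the three upstreams in order.
instances = cycle(["inst-8001", "inst-8002", "inst-8003"])
handled = Counter(next(instances) for _ in range(30))
print(handled)  # each instance handles 10 of the 30 requests
assert set(handled.values()) == {10}
```

In practice keep-alive connections and health checks skew this slightly, hence "roughly even" in the test.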
MI-06: Cache Invalidation Across Instances
Steps:
# Create entity via instance 1
GATEWAY_ID=$(curl -s -X POST "http://localhost:8001/gateways" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "cache-test", "url": "http://localhost:9000"}' | jq -r '.id')
# Read via instance 2 (should be cached)
curl -s "http://localhost:8002/gateways/$GATEWAY_ID" \
-H "Authorization: Bearer $TOKEN" | jq '.name'
# Update via instance 1
curl -s -X PUT "http://localhost:8001/gateways/$GATEWAY_ID" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "cache-test-updated", "url": "http://localhost:9000"}'
# Read via instance 2 (should see update)
sleep 1 # Allow pub/sub propagation
curl -s "http://localhost:8002/gateways/$GATEWAY_ID" \
-H "Authorization: Bearer $TOKEN" | jq '.name'
# Should return "cache-test-updated"
Expected Result:
- Cache invalidation propagates via Redis pub/sub
- All instances see updated data
- No stale cache entries
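The publish-on-write / evict-on-message flow this case exercises can be sketched with an in-memory bus standing in for Redis pub/sub. All class names and the `cache_invalidate` channel are illustrative; the pattern (writer publishes the key, every instance evicts it) is what the test verifies:

```python
from collections import defaultdict

class Bus:
    """Stand-in for Redis pub/sub: publish fans out to all subscribers."""
    def __init__(self):
        self.subs = defaultdict(list)
    def subscribe(self, channel, handler):
        self.subs[channel].append(handler)
    def publish(self, channel, message):
        for handler in self.subs[channel]:
            handler(message)

class Instance:
    """A gateway instance: local cache, evicts on invalidation messages."""
    def __init__(self, bus, channel="cache_invalidate"):
        self.cache = {}
        self.bus, self.channel = bus, channel
        bus.subscribe(channel, self._on_invalidate)
    def _on_invalidate(self, key):
        self.cache.pop(key, None)
    def write(self, key, value, db):
        db[key] = value
        self.bus.publish(self.channel, key)  # tell peers to drop the key
    def read(self, key, db):
        if key not in self.cache:
            self.cache[key] = db[key]  # cache miss: load from the database
        return self.cache[key]

db = {}  # stands in for PostgreSQL
bus = Bus()
a, b = Instance(bus), Instance(bus)

a.write("gw-1", "cache-test", db)
print(b.read("gw-1", db))  # "cache-test", now cached on B
a.write("gw-1", "cache-test-updated", db)
print(b.read("gw-1", db))  # B's stale entry was evicted, reads the update
```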
MI-07: Background Job Single Execution
Steps:
# Check health-refresh logs (only the leader should run the job);
# the container names below assume a containerized run, so adjust the
# log source if instances run as plain processes
for port in ${PORTS[@]}; do
docker logs gateway-$port 2>&1 | grep -c "health check refresh"
done
# Only the leader should have these log entries
# Or check via metrics
for port in ${PORTS[@]}; do
curl -s "http://localhost:$port/metrics" | grep background_job_executions
done
# Only leader should have non-zero count
Expected Result:
- Only leader executes background jobs
- Followers skip background job execution
- No duplicate job runs
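The follower-skip behavior this case checks amounts to a guard around each scheduled job. A minimal sketch, assuming an `is_leader` predicate (the wrapper and job names are illustrative):

```python
def run_if_leader(is_leader, job, skipped):
    """Execute leader-only work, or record a skip on followers."""
    if is_leader():
        job()
    else:
        skipped.append(job.__name__)

executions, skipped = [], []

def health_check_refresh():
    executions.append("health_check_refresh")

# Instance 1 holds the leader key; 2 and 3 are followers.
for _instance, leader in [(1, True), (2, False), (3, False)]:
    run_if_leader(lambda: leader, health_check_refresh, skipped)

print(executions)  # the job ran exactly once
print(skipped)     # both followers skipped it
```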
MI-08: Scale Up
Steps:
# Start with 3 instances
# Add 4th instance
python -m mcpgateway.main --port 8004 &
sleep 10
# Verify it joins without disrupting leader
LEADER=$(redis-cli GET gateway_service_leader)
echo "Leader unchanged: $LEADER"
# New instance handles requests
curl -s "http://localhost:8004/health" | jq .
Expected Result:
- New instance starts without errors
- Existing leader not disrupted
- New instance handles requests
MI-09: Scale Down Non-Leader
Steps:
# Identify non-leader
LEADER=$(redis-cli GET gateway_service_leader)
LEADER_PORT=$(echo $LEADER | cut -d: -f2)
# Find non-leader port
for port in ${PORTS[@]}; do
[ "$port" != "$LEADER_PORT" ] && NON_LEADER_PORT=$port && break
done
# Stop non-leader
pkill -f "port $NON_LEADER_PORT"
# Verify no impact
LEADER_AFTER=$(redis-cli GET gateway_service_leader)
[ "$LEADER" = "$LEADER_AFTER" ] && echo "PASS: Leader unchanged" || echo "FAIL"
Expected Result:
- Non-leader removed cleanly
- Leader not affected
- Service continues
MI-10: Graceful Leader Shutdown
Steps:
# Get leader port
LEADER=$(redis-cli GET gateway_service_leader)
LEADER_PORT=$(echo $LEADER | cut -d: -f2)
# Send SIGTERM for graceful shutdown
kill $(pgrep -f "port $LEADER_PORT")
# Leader should release key
sleep 2
LEADER_AFTER=$(redis-cli GET gateway_service_leader)
echo "Leader after graceful shutdown: $LEADER_AFTER"
# May be empty (if shutdown before new election) or new leader
Expected Result:
- Leader key released on shutdown
- New leader elected quickly
- No orphan leader key
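The release step must be conditional: an instance may only delete the leader key if it still owns it, or a slow shutdown could delete a successor's key. In real Redis this compare-and-delete is done atomically (typically via a small Lua script); the dict-based sketch below shows the logic only:

```python
store = {"gateway_service_leader": "instance:8001"}

def release_leadership(store, instance_id, key="gateway_service_leader"):
    """Delete the key only if this instance still owns it."""
    if store.get(key) == instance_id:
        del store[key]
        return True
    return False

# The leader releases its own key on SIGTERM...
print(release_leadership(store, "instance:8001"))  # True

# ...but a stale instance cannot delete a key it no longer owns.
store["gateway_service_leader"] = "instance:8002"
print(release_leadership(store, "instance:8001"))  # False
```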
Test Matrix
| Scenario        | Instances | Redis       | Pass Criteria     |
|-----------------|-----------|-------------|-------------------|
| Leader election | 3         | Online      | Single leader     |
| Leader failover | 3         | Online      | New leader < 20s  |
| Heartbeat       | 1 leader  | Online      | TTL refreshed     |
| Split-brain     | 3         | Partitioned | No dual leaders   |
| Cache sync      | 3         | Online      | Updates propagate |
| Background jobs | 3         | Online      | Leader only       |
| Scale up        | 3 -> 4    | Online      | No disruption     |
| Scale down      | 3 -> 2    | Online      | No disruption     |
Success Criteria
Related Files
mcpgateway/services/gateway_service.py - Leader election
mcpgateway/cache/ - Cache with pub/sub
mcpgateway/config.py - Leader configuration
Related Issues