
[TESTING][MULTI-INSTANCE]: Leader Election, Redis Coordination, and Horizontal Scaling #2488

Goal

Produce a comprehensive manual test plan for validating correct behavior with multiple gateway instances, including leader election, state synchronization, cache coherence, and horizontal scaling.

Why Now?

Multi-instance deployment is critical for production:

  1. High Availability: No single point of failure
  2. Scalability: Handle increased load
  3. Zero Downtime: Rolling deployments
  4. Data Consistency: State sync across instances
  5. Leader Tasks: Background jobs run once

User Stories

US-1: Operator - High Availability

As an operator
I want multiple gateway instances
So that the service survives instance failures

Acceptance Criteria:

Feature: High Availability

  Scenario: Instance failure
    Given 3 gateway instances running
    When one instance crashes
    Then the service should continue
    And requests should be handled by remaining instances

US-2: Operator - Leader Election

As an operator
I want only one leader instance
So that background jobs don't run multiple times

Acceptance Criteria:

Feature: Leader Election

  Scenario: Leader failover
    Given instance A is the leader
    When instance A crashes
    Then instance B or C should become leader
    And there should never be two leaders

Architecture

                    MULTI-INSTANCE ARCHITECTURE
+------------------------------------------------------------------------+
|                                                                        |
|   Load Balancer                                                        |
|   -------------                                                        |
|        |                                                               |
|        +------------------+------------------+                         |
|        |                  |                  |                         |
|        v                  v                  v                         |
|   +---------+        +---------+        +---------+                    |
|   | Gateway |        | Gateway |        | Gateway |                    |
|   | Inst 1  |        | Inst 2  |        | Inst 3  |                    |
|   | (Leader)|        +---------+        +---------+                    |
|   +---------+             |                  |                         |
|        |                  |                  |                         |
|        +------------------+------------------+                         |
|                           |                                            |
|                           v                                            |
|   +----------------------------------------------------------------+  |
|   |                             Redis                              |  |
|   |  - Leader key (SET NX pattern)                                 |  |
|   |  - Pub/Sub for cache invalidation                              |  |
|   |  - Session state sync                                          |  |
|   +----------------------------------------------------------------+  |
|                           |                                            |
|   +----------------------------------------------------------------+  |
|   |                           PostgreSQL                           |  |
|   |  - Shared persistent state                                     |  |
|   +----------------------------------------------------------------+  |
|                                                                        |
+------------------------------------------------------------------------+
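
The leader-election primitive in the diagram (Redis SET NX with the 15-second TTL and 5-second heartbeat used throughout this plan) can be sketched as shell helpers. This is an illustrative sketch only, not the gateway's actual code (which lives in mcpgateway/services/gateway_service.py); the function names are hypothetical:

```shell
# Hypothetical helpers sketching the SET NX leader pattern (not the
# gateway's real implementation). Key name and TTL match this plan.
LEADER_KEY="gateway_service_leader"
LEADER_TTL=15   # seconds; the heartbeat renews every 5s

acquire_leadership() {   # $1 = instance id, e.g. "host:8001"
  # SET ... NX EX returns OK only if no leader key currently exists
  [ "$(redis-cli SET "$LEADER_KEY" "$1" NX EX "$LEADER_TTL")" = "OK" ]
}

renew_leadership() {     # refresh the TTL only if we still hold the key
  local res
  res=$(redis-cli EVAL "
    if redis.call('GET', KEYS[1]) == ARGV[1] then
      return redis.call('EXPIRE', KEYS[1], ARGV[2])
    else
      return 0
    end" 1 "$LEADER_KEY" "$1" "$LEADER_TTL")
  [ "$res" = "1" ]
}

release_leadership() {   # delete only if we still hold the key (graceful shutdown)
  redis-cli EVAL "
    if redis.call('GET', KEYS[1]) == ARGV[1] then
      return redis.call('DEL', KEYS[1])
    else
      return 0
    end" 1 "$LEADER_KEY" "$1" > /dev/null
}
```

The conditional Lua scripts matter: an unconditional EXPIRE or DEL could renew or release a key that another instance has since acquired, which is exactly the split-brain scenario MI-04 tests for.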

Test Environment Setup

# Start Redis
docker run -d --name redis -p 6379:6379 redis:7

# Start PostgreSQL
docker run -d --name postgres -p 5432:5432 \
  -e POSTGRES_PASSWORD=postgres \
  postgres:15

# Common environment
export DATABASE_URL="postgresql://postgres:postgres@localhost/gateway"
export REDIS_URL="redis://localhost:6379"
export JWT_SECRET_KEY="multi-instance-test"

# Create database
docker exec postgres createdb -U postgres gateway

# Instance ports
export PORTS=(8001 8002 8003)

Manual Test Cases

| Case  | Scenario           | Configuration     | Expected Result      |
|-------|--------------------|-------------------|----------------------|
| MI-01 | Single leader      | 3 instances       | Only one leader      |
| MI-02 | Leader failover    | Kill leader       | New leader elected   |
| MI-03 | Heartbeat renewal  | Leader running    | TTL refreshed        |
| MI-04 | Split-brain        | Network partition | No dual leaders      |
| MI-05 | Round-robin        | Load balancer     | All handle requests  |
| MI-06 | Cache invalidation | Update on A       | B sees update        |
| MI-07 | Background job     | Health check      | Only leader executes |
| MI-08 | Scale up           | Add instance      | Joins smoothly       |
| MI-09 | Scale down         | Remove non-leader | No impact            |
| MI-10 | Graceful shutdown  | Stop leader       | Releases key         |

MI-01: Single Leader Election

Steps:

# Start 3 instances
for port in ${PORTS[@]}; do
  python -m mcpgateway.main --port $port &
  sleep 3
done

# Wait for leader election
sleep 15

# Check Redis for leader
LEADER=$(redis-cli GET gateway_service_leader)
echo "Leader: $LEADER"

# Verify only one leader
LEADER_COUNT=$(redis-cli KEYS "gateway_service_leader*" | wc -l)
[ "$LEADER_COUNT" -eq 1 ] && echo "PASS: Single leader" || echo "FAIL: Multiple leaders"

Verification via API:

# Each instance should report its leader status
for port in ${PORTS[@]}; do
  echo "Instance $port:"
  curl -s "http://localhost:$port/api/admin/status" \
    -H "Authorization: Bearer $TOKEN" | jq '.is_leader'
done
# Exactly one should return true

Expected Result:

  • Exactly one instance becomes leader
  • Other instances are followers
  • Leader key exists in Redis

MI-02: Leader Failover

Steps:

# Identify current leader
LEADER=$(redis-cli GET gateway_service_leader)
LEADER_PORT=$(echo $LEADER | cut -d: -f2)
echo "Current leader on port: $LEADER_PORT"

# Kill leader
pkill -f "port $LEADER_PORT"

# Wait for failover (TTL is 15 seconds)
sleep 20

# Check new leader
NEW_LEADER=$(redis-cli GET gateway_service_leader)
echo "New leader: $NEW_LEADER"

# Verify different from old
[ "$LEADER" != "$NEW_LEADER" ] && echo "PASS: New leader elected" || echo "FAIL"

Service Continuity:

# While failover happens, check service continues
for i in {1..30}; do
  # Hit load balancer (or any live instance)
  curl -s -o /dev/null -w "%{http_code}\n" "http://localhost:${PORTS[1]}/health"
  sleep 1
done
# Should all return 200

Expected Result:

  • New leader elected within TTL + heartbeat interval
  • Service continues during failover
  • No split-brain
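
To put a number on "within TTL + heartbeat interval", a small polling loop can time the failover. This is a hypothetical helper, assuming the leader value format used in the steps above:

```shell
# Hypothetical: poll Redis until a leader different from $1 appears,
# or give up after $2 seconds. Prints the failover time on success.
wait_for_new_leader() {   # $1 = old leader value, $2 = timeout in seconds
  local start now leader
  start=$(date +%s)
  while :; do
    leader=$(redis-cli GET gateway_service_leader)
    now=$(date +%s)
    if [ -n "$leader" ] && [ "$leader" != "$1" ]; then
      echo "New leader '$leader' elected after $((now - start))s"
      return 0
    fi
    if [ $((now - start)) -ge "$2" ]; then
      echo "No new leader within $2 s" >&2
      return 1
    fi
    sleep 1
  done
}

# Usage (immediately after killing the old leader):
#   wait_for_new_leader "$LEADER" 30
```
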

MI-03: Heartbeat Renewal

Steps:

# Monitor TTL over time
for i in {1..10}; do
  TTL=$(redis-cli TTL gateway_service_leader)
  echo "TTL: $TTL seconds"
  sleep 5
done
# TTL should reset every heartbeat interval (5s)

Expected Result:

  • TTL consistently between 10-15 seconds
  • Never drops to 0 while leader running
  • Heartbeat refreshes every 5 seconds
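
The 10-15 second band can be asserted mechanically. Note that redis-cli TTL returns -2 for a missing key and -1 for a key without expiry; this hypothetical check treats both as failures, since they fall outside any positive band:

```shell
# Hypothetical: assert the leader key's TTL is within [min, max].
# TTL of -2 (no key) or -1 (no expiry) fails any positive band.
leader_ttl_in_band() {   # $1 = min seconds, $2 = max seconds
  local ttl
  ttl=$(redis-cli TTL gateway_service_leader)
  [ "$ttl" -ge "$1" ] && [ "$ttl" -le "$2" ]
}

# Usage, alongside the monitoring loop above:
#   leader_ttl_in_band 10 15 && echo "PASS" || echo "FAIL"
```
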

MI-04: Split-Brain Prevention

Simulate Network Partition:

# This is harder to test without container networking
# In Kubernetes, use NetworkPolicy to isolate

# Alternative: pause Redis briefly
docker pause redis
sleep 20  # Longer than TTL
docker unpause redis
sleep 10

# Check for single leader after recovery
LEADER=$(redis-cli GET gateway_service_leader)
echo "Leader after partition: $LEADER"

Expected Result:

  • During partition, old leader loses leadership
  • After recovery, single leader exists
  • No period with two leaders

MI-05: Round-Robin Request Handling

Setup nginx Load Balancer:

upstream gateway {
    server localhost:8001;
    server localhost:8002;
    server localhost:8003;
}

server {
    listen 8000;
    location / {
        proxy_pass http://gateway;
    }
}

Test Distribution:

# Make many requests and track which instance handles
for i in {1..30}; do
  curl -s "http://localhost:8000/health" | jq -r '.instance_id' >> /tmp/instances.txt
done

# Count distribution
sort /tmp/instances.txt | uniq -c
# Should be roughly even (10 each)
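
"Roughly even" can be made concrete with a simple skew check over the collected file. The 2x threshold here is an arbitrary assumption; pick whatever tolerance suits your load balancer:

```shell
# Hypothetical: fail if the busiest instance handled more than twice as
# many requests as the least busy one (the 2x threshold is an assumption).
check_distribution() {   # $1 = file of instance ids, one per line
  sort "$1" | uniq -c | awk '
    NR == 1 { min = $1; max = $1; next }
    { if ($1 < min) min = $1; if ($1 > max) max = $1 }
    END { exit (max <= 2 * min) ? 0 : 1 }'
}

# Usage:
#   check_distribution /tmp/instances.txt \
#     && echo "PASS: even" || echo "FAIL: skewed"
```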

Expected Result:

  • All instances handle requests
  • Load distributed (approximately)
  • No instance overloaded

MI-06: Cache Invalidation Across Instances

Steps:

# Create entity via instance 1
GATEWAY_ID=$(curl -s -X POST "http://localhost:8001/gateways" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "cache-test", "url": "http://localhost:9000"}' | jq -r '.id')

# Read via instance 2 (should be cached)
curl -s "http://localhost:8002/gateways/$GATEWAY_ID" \
  -H "Authorization: Bearer $TOKEN" | jq '.name'

# Update via instance 1
curl -s -X PUT "http://localhost:8001/gateways/$GATEWAY_ID" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "cache-test-updated", "url": "http://localhost:9000"}'

# Read via instance 2 (should see update)
sleep 1  # Allow pub/sub propagation
curl -s "http://localhost:8002/gateways/$GATEWAY_ID" \
  -H "Authorization: Bearer $TOKEN" | jq '.name'
# Should return "cache-test-updated"

Expected Result:

  • Cache invalidation propagates via Redis pub/sub
  • All instances see updated data
  • No stale cache entries
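
The fixed sleep 1 can make this test flaky. A polling assertion is more robust; this hypothetical helper reuses the endpoint and $TOKEN from the steps above:

```shell
# Hypothetical: poll instance $1 until gateway $2 reports name $3,
# giving up after $4 attempts (1s apart). Replaces the fixed sleep.
wait_for_name() {   # $1 = port, $2 = gateway id, $3 = expected name, $4 = attempts
  local i name
  for i in $(seq 1 "$4"); do
    name=$(curl -s "http://localhost:$1/gateways/$2" \
      -H "Authorization: Bearer $TOKEN" | jq -r '.name')
    [ "$name" = "$3" ] && return 0
    sleep 1
  done
  return 1
}

# Usage:
#   wait_for_name 8002 "$GATEWAY_ID" "cache-test-updated" 10 \
#     && echo "PASS: update propagated" || echo "FAIL: stale cache"
```
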

MI-07: Background Job Single Execution

Steps:

# Check health-check refresh logs (only the leader should emit them).
# Adjust the log source to your deployment: `docker logs` for containers,
# the process's stdout/stderr otherwise.
for port in ${PORTS[@]}; do
  docker logs gateway-$port 2>&1 | grep "health check refresh" | wc -l
done
# Only the leader should have these log entries

# Or check via metrics
for port in ${PORTS[@]}; do
  curl -s "http://localhost:$port/metrics" | grep background_job_executions
done
# Only leader should have non-zero count

Expected Result:

  • Only leader executes background jobs
  • Followers skip background job execution
  • No duplicate job runs

MI-08: Scale Up

Steps:

# Start with 3 instances
# Add 4th instance
python -m mcpgateway.main --port 8004 &
sleep 10

# Verify it joins without disrupting leader
LEADER=$(redis-cli GET gateway_service_leader)
echo "Leader unchanged: $LEADER"

# New instance handles requests
curl -s "http://localhost:8004/health" | jq .

Expected Result:

  • New instance starts without errors
  • Existing leader not disrupted
  • New instance handles requests

MI-09: Scale Down Non-Leader

Steps:

# Identify non-leader
LEADER=$(redis-cli GET gateway_service_leader)
LEADER_PORT=$(echo $LEADER | cut -d: -f2)

# Find non-leader port
for port in ${PORTS[@]}; do
  [ "$port" != "$LEADER_PORT" ] && NON_LEADER_PORT=$port && break
done

# Stop non-leader
pkill -f "port $NON_LEADER_PORT"

# Verify no impact
LEADER_AFTER=$(redis-cli GET gateway_service_leader)
[ "$LEADER" = "$LEADER_AFTER" ] && echo "PASS: Leader unchanged" || echo "FAIL"

Expected Result:

  • Non-leader removed cleanly
  • Leader not affected
  • Service continues

MI-10: Graceful Leader Shutdown

Steps:

# Get leader port
LEADER=$(redis-cli GET gateway_service_leader)
LEADER_PORT=$(echo $LEADER | cut -d: -f2)

# Send SIGTERM for graceful shutdown
kill $(pgrep -f "port $LEADER_PORT")

# Leader should release key
sleep 2
LEADER_AFTER=$(redis-cli GET gateway_service_leader)
echo "Leader after graceful shutdown: $LEADER_AFTER"
# May be empty (if shutdown before new election) or new leader

Expected Result:

  • Leader key released on shutdown
  • New leader elected quickly
  • No orphan leader key

Test Matrix

| Scenario        | Instances | Redis       | Pass Criteria     |
|-----------------|-----------|-------------|-------------------|
| Leader election | 3         | Online      | Single leader     |
| Leader failover | 3         | Online      | New leader < 20s  |
| Heartbeat       | 1 leader  | Online      | TTL refreshed     |
| Split-brain     | 3         | Partitioned | No dual leaders   |
| Cache sync      | 3         | Online      | Updates propagate |
| Background jobs | 3         | Online      | Leader only       |
| Scale up        | 3 -> 4    | Online      | No disruption     |
| Scale down      | 3 -> 2    | Online      | No disruption     |

Success Criteria

  • Exactly one leader at any time
  • Leader failover within 20 seconds
  • Heartbeat keeps leader key alive
  • No split-brain after network partition
  • Cache invalidation propagates across instances
  • Background jobs run on leader only
  • Scale up/down doesn't disrupt service
  • Graceful shutdown releases leader key

Related Files

  • mcpgateway/services/gateway_service.py - Leader election
  • mcpgateway/cache/ - Cache with pub/sub
  • mcpgateway/config.py - Leader configuration
