|
| 1 | +# Load Balancer Architecture - Quick Reference Card |
| 2 | + |
| 3 | +## 🎯 One-Sentence Summary |
| 4 | +Split guilds across multiple pods, each with its own SQLite database, coordinated by a config service. |
| 5 | + |
| 6 | +## 📊 Current vs Proposed |
| 7 | + |
| 8 | +| Aspect | Current | Proposed | |
| 9 | +|--------|---------|----------| |
| 10 | +| **Pods** | 1 | 8-23 (3-10 gateway, 2-10 HTTP, 2 config, 1 PostgreSQL) | |
| 11 | +| **Scaling** | ❌ None | ✅ Horizontal | |
| 12 | +| **Cost** | $10/mo | $45-50/mo | |
| 13 | +| **HA** | ❌ No | ✅ Yes | |
| 14 | +| **SQLite** | 1 database | 3-10 databases (1 per gateway pod) | |
| 15 | +| **Load Balancer** | ❌ Not supported | ✅ Supported | |
| 16 | + |
| 17 | +## 🏗️ Architecture at a Glance |
| 18 | + |
| 19 | +``` |
| 20 | +Users → LB → HTTP Pods → Config Service → Gateway Pods → Discord |
| 21 | +                                ↓               ↓ |
| 22 | +                           PostgreSQL  SQLite + Litestream |
| 23 | +                           (guild→pod)    (guild data) |
| 24 | +``` |
| 25 | + |
| 26 | +## 📦 Components |
| 27 | + |
| 28 | +### HTTP Service |
| 29 | +- **Purpose**: Web portal + webhook routing |
| 30 | +- **Type**: Deployment (stateless) |
| 31 | +- **Replicas**: 2-10 (HPA) |
| 32 | +- **Scales**: Automatically on CPU/memory |
| 33 | + |
| 34 | +### Config Service |
| 35 | +- **Purpose**: Guild assignment management |
| 36 | +- **Type**: Deployment (stateless) |
| 37 | +- **Replicas**: 2 |
| 38 | +- **Database**: PostgreSQL |
| 39 | + |
| 40 | +### Gateway Service |
| 41 | +- **Purpose**: Discord gateway connection |
| 42 | +- **Type**: StatefulSet (stateful) |
| 43 | +- **Replicas**: 3-10 |
| 44 | +- **Database**: SQLite (1 per pod) |
| 45 | +- **Backup**: Litestream → S3 |
| 46 | + |
| 47 | +## 🔑 Key Decisions |
| 48 | + |
| 49 | +| Decision | Rationale | |
| 50 | +|----------|-----------| |
| 51 | +| Guild-based sharding | Natural fit with Discord architecture | |
| 52 | +| Keep SQLite | No migration, proven, fast | |
| 53 | +| Litestream backup | Low overhead, battle-tested | |
| 54 | +| PostgreSQL for config | Multi-writer, small dataset | |
| 55 | +| Separate HTTP/Gateway | Independent scaling | |
| 56 | + |
| 57 | +## 🚫 What We're NOT Doing |
| 58 | + |
| 59 | +- ❌ Migrating guild data to PostgreSQL (too much work) |
| 60 | +- ❌ Using rqlite (different API) |
| 61 | +- ❌ Using LiteFS (still single writer) |
| 62 | +- ❌ Using Turso (vendor lock-in) |
| 63 | +- ❌ Sharing one SQLite file across pods (unsafe: file locking is unreliable on network filesystems) |
| 64 | + |
| 65 | +## ⚡ How It Works |
| 66 | + |
| 67 | +### Discord Event |
| 68 | +``` |
| 69 | +Discord → Gateway Pod 0 → SQLite 0 → Litestream → S3 |
| 70 | +          (guild assigned to pod 0) |
| 71 | +``` |
| 72 | + |
| 73 | +### HTTP Request |
| 74 | +``` |
| 75 | +User → LB → HTTP Pod → Config: "Which pod has guild 42?" |
| 76 | +                     → Gateway Pod 0 → SQLite 0 → Response |
| 77 | +``` |
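The lookup step in the HTTP request flow can be sketched as follows. This is a minimal sketch, not the project's implementation: the `assignments` dict stands in for the config service's guild→pod table in PostgreSQL, and the StatefulSet name `gateway`, headless service name `gateway-headless`, and namespace `default` are hypothetical. The host name format is the stable DNS name Kubernetes gives StatefulSet pods behind a headless service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatewayTarget:
    ordinal: int  # StatefulSet pod index, e.g. 0 for "gateway-0"
    host: str     # in-cluster DNS name of that pod


def gateway_host(ordinal: int,
                 statefulset: str = "gateway",        # hypothetical name
                 service: str = "gateway-headless",   # hypothetical name
                 namespace: str = "default") -> str:
    """Build the stable per-pod DNS name of a StatefulSet pod."""
    return f"{statefulset}-{ordinal}.{service}.{namespace}.svc.cluster.local"


def resolve_guild(guild_id: int, assignments: dict[int, int]) -> GatewayTarget:
    """Answer "which pod has this guild?" from a guild -> pod ordinal mapping."""
    try:
        ordinal = assignments[guild_id]
    except KeyError:
        raise LookupError(f"guild {guild_id} has no gateway pod assigned") from None
    return GatewayTarget(ordinal, gateway_host(ordinal))


# In-memory stand-in for the PostgreSQL guild -> pod table.
assignments = {42: 0, 7: 1}
print(resolve_guild(42, assignments).host)
```

An unassigned guild raises `LookupError` rather than falling back to a default pod, since sending a request to the wrong pod would read the wrong SQLite database.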
| 78 | + |
| 79 | +### Guild Assignment |
| 80 | +``` |
| 81 | +New Guild → Config Service → Least loaded pod |
| 82 | +                           → Update PostgreSQL |
| 83 | +                           → Gateway pod starts handling |
| 84 | +``` |
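The "least loaded pod" step above could look roughly like this. It is a sketch under assumptions: load is measured only by guild count, ties go to the lowest pod ordinal, and the `assignments` dict stands in for the PostgreSQL table. With two config replicas, a real version would presumably need to do the read and the write in one database transaction so they cannot pick pods from stale counts.

```python
def pick_least_loaded(guild_counts: dict[int, int]) -> int:
    """Return the pod ordinal with the fewest guilds (lowest ordinal wins ties)."""
    if not guild_counts:
        raise ValueError("no gateway pods available")
    return min(guild_counts, key=lambda pod: (guild_counts[pod], pod))


def assign_guild(guild_id: int, assignments: dict[int, int], pod_count: int) -> int:
    """Assign a guild to the least loaded pod and record the choice.

    Idempotent: a guild that is already assigned keeps its pod.
    """
    if guild_id in assignments:
        return assignments[guild_id]
    counts = {pod: 0 for pod in range(pod_count)}
    for pod in assignments.values():
        if pod in counts:  # ignore assignments to pods that no longer exist
            counts[pod] += 1
    pod = pick_least_loaded(counts)
    assignments[guild_id] = pod
    return pod


assignments = {}
for guild_id in (101, 102, 103, 104):
    assign_guild(guild_id, assignments, pod_count=3)
print(assignments)
```

With three empty pods, the first three guilds land on pods 0, 1, and 2, and the fourth wraps back to pod 0.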
| 85 | + |
| 86 | +## 📈 Scaling Path |
| 87 | + |
| 88 | +``` |
| 89 | +Phase 1: 3 gateway pods (0-99 guilds each) |
| 90 | +Phase 2: 5 gateway pods (rebalance to ~60 each) |
| 91 | +Phase 3: 10 gateway pods (100+ guilds each) |
| 92 | +``` |
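The Phase 1 → Phase 2 rebalance can be worked through with a small planner. This sketch assumes roughly 300 guilds spread evenly over 3 pods (the phase list gives 0-99 per pod) and computes which guilds would move when growing to 5 pods. The function name and the `(guild, from_pod, to_pod)` move format are illustrative, not part of the proposal.

```python
def plan_rebalance(assignments: dict[int, int],
                   pod_count: int) -> list[tuple[int, int, int]]:
    """Return (guild, from_pod, to_pod) moves that even out guilds per pod.

    Only guilds on over-target pods move. Pods with an ordinal >= pod_count
    get a target of 0, so the same plan also drains pods on scale-down.
    """
    if pod_count < 1:
        raise ValueError("pod_count must be at least 1")

    by_pod: dict[int, list[int]] = {pod: [] for pod in range(pod_count)}
    for guild, pod in sorted(assignments.items()):
        by_pod.setdefault(pod, []).append(guild)

    base, extra = divmod(len(assignments), pod_count)
    target = {
        pod: (base + (1 if pod < extra else 0)) if pod < pod_count else 0
        for pod in by_pod
    }

    # Collect guilds from pods that hold more than their target.
    surplus: list[tuple[int, int]] = []
    for pod, guilds in by_pod.items():
        while len(guilds) > target[pod]:
            surplus.append((guilds.pop(), pod))

    # Hand them to pods that hold fewer than their target.
    moves: list[tuple[int, int, int]] = []
    for pod in range(pod_count):
        while len(by_pod[pod]) < target[pod]:
            guild, source = surplus.pop()
            by_pod[pod].append(guild)
            moves.append((guild, source, pod))
    return moves


# Assumed starting point: 300 guilds, 100 on each of 3 pods.
assignments = {guild: guild % 3 for guild in range(300)}
moves = plan_rebalance(assignments, 5)
print(len(moves))  # 120 guilds move, leaving 60 on each of 5 pods
```

Under these assumptions each of the 3 original pods gives up 40 guilds, so 120 guilds go through the stop/export/import/start cycle.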
| 93 | + |
| 94 | +## 💵 Cost Breakdown |
| 95 | + |
| 96 | +``` |
| 97 | +Gateway pods (3x): $15/mo |
| 98 | +HTTP pods (2-10x): $10/mo |
| 99 | +Config pods (2x): $5/mo |
| 100 | +PostgreSQL: $8/mo |
| 101 | +Volumes (3x): $3/mo |
| 102 | +S3 backup: $5/mo |
| 103 | +───────────────────────────── |
| 104 | +Total: $46/mo |
| 105 | +``` |
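As a quick check of the arithmetic, the line items can be summed and compared against the current $10/mo. The figures are copied from the breakdown above; the dictionary keys are just labels.

```python
# Monthly line items in USD, copied from the cost breakdown.
monthly_costs = {
    "gateway_pods": 15,
    "http_pods": 10,
    "config_pods": 5,
    "postgresql": 8,
    "volumes": 3,
    "s3_backup": 5,
}

current_cost = 10  # USD/month for the single-pod setup

total = sum(monthly_costs.values())
ratio = total / current_cost

print(f"total: ${total}/mo")                # total: $46/mo
print(f"increase: {ratio:.1f}x current")    # increase: 4.6x current
```

The breakdown shows a single $10/mo figure for the 2-10 HTTP pods, so the total may not cover the HPA scaling toward 10 replicas.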
| 106 | + |
| 107 | +## ⏱️ Timeline |
| 108 | + |
| 109 | +``` |
| 110 | +Week 1-2: Config service |
| 111 | +Week 3-4: Gateway changes |
| 112 | +Week 5-6: Production deploy |
| 113 | +Week 7+: Optimization |
| 114 | +``` |
| 115 | + |
| 116 | +## 🎯 Success Criteria |
| 117 | + |
| 118 | +- [ ] P95 latency < 100ms |
| 119 | +- [ ] 99.9% uptime |
| 120 | +- [ ] Zero-downtime deploys |
| 121 | +- [ ] < 30s pod recovery |
| 122 | +- [ ] 1000+ guilds/pod |
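To see how the uptime and recovery targets relate, the 99.9% figure can be turned into a monthly downtime budget. This sketch assumes a 30-day month and uses integer arithmetic to avoid floating-point rounding; the function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def downtime_budget_seconds(uptime_per_mille: int, days: int = 30) -> int:
    """Allowed downtime in whole seconds.

    Uptime is given in tenths of a percent, so 999 means 99.9%.
    """
    return days * SECONDS_PER_DAY * (1000 - uptime_per_mille) // 1000


budget = downtime_budget_seconds(999)
recoveries = budget // 30  # how many 30-second pod recoveries fit in the budget

print(budget)      # 2592 seconds, i.e. 43.2 minutes per 30 days
print(recoveries)  # 86
```

A pod failure takes only that pod's guilds offline, so per-guild availability would likely be better than this whole-system budget suggests.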
| 123 | + |
| 124 | +## 🔥 Quick Start |
| 125 | + |
| 126 | +```bash |
| 127 | +# 1. Deploy config service |
| 128 | +kubectl apply -f cluster/proposed/config-service.yaml |
| 129 | + |
| 130 | +# 2. Deploy gateway pods |
| 131 | +kubectl apply -f cluster/proposed/gateway-service.yaml |
| 132 | + |
| 133 | +# 3. Deploy HTTP service |
| 134 | +kubectl apply -f cluster/proposed/http-service.yaml |
| 135 | + |
| 136 | +# 4. Update ingress |
| 137 | +kubectl apply -f cluster/proposed/ingress.yaml |
| 138 | + |
| 139 | +# 5. Verify |
| 140 | +kubectl get pods -l app=mod-bot |
| 141 | +``` |
| 142 | + |
| 143 | +## 📚 Documentation Map |
| 144 | + |
| 145 | +| Need | Read | |
| 146 | +|------|------| |
| 147 | +| Exec summary | 2026-01-01_5_executive-summary.md | |
| 148 | +| Visual diagrams | 2026-01-01_6_ascii-diagrams.md | |
| 149 | +| Full analysis | 2026-01-01_1_load-balancer-architecture.md | |
| 150 | +| Implementation | 2026-01-01_4_implementation-guide.md | |
| 151 | +| Tool comparison | 2026-01-01_3_sqlite-sync-comparison.md | |
| 152 | +| Navigation | LOAD_BALANCER_INDEX.md | |
| 153 | + |
| 154 | +## ⚠️ Common Questions |
| 155 | + |
| 156 | +**Q: Why not just use PostgreSQL?** |
| 157 | +A: SQLite is simpler, faster for our use case, and already works. Migration would take months. |
| 158 | + |
| 159 | +**Q: Why not use [SQLite replication tool]?** |
| 160 | +A: They all have major limitations (see comparison doc). Guild sharding is simpler and proven. |
| 161 | + |
| 162 | +**Q: What if a pod fails?** |
| 163 | +A: Kubernetes restarts the pod and Litestream restores its database from S3. The target is guilds back online in < 30s. |
| 164 | + |
| 165 | +**Q: How do we rebalance guilds?** |
| 166 | +A: The config service reassigns guilds: stop the guild on the source pod, export its data, import it on the target pod, then start it there. Takes ~2 minutes. |
| 167 | + |
| 168 | +**Q: Can we scale down?** |
| 169 | +A: Yes, but requires guild reassignment. Not instant, but possible. |
| 170 | + |
| 171 | +**Q: What about cross-guild queries?** |
| 172 | +A: HTTP service can query multiple gateway pods and aggregate results. |
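The cross-guild fan-out described in the last answer could be sketched with `asyncio.gather`. Everything here is illustrative: `fetch` is a hypothetical coroutine that would call one gateway pod, and the per-guild integer result is an assumed payload. The sketch returns partial results plus a list of failed pods instead of failing the whole query.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical per-pod query: pod ordinal -> {guild_id: value}.
FetchFn = Callable[[int], Awaitable[dict[int, int]]]


async def aggregate(pods: list[int],
                    fetch: FetchFn) -> tuple[dict[int, int], list[int]]:
    """Query all gateway pods concurrently and merge their per-guild results.

    Returns (merged results, ordinals of pods that failed). Guilds are
    partitioned across pods, so result keys never collide.
    """
    results = await asyncio.gather(*(fetch(pod) for pod in pods),
                                   return_exceptions=True)
    merged: dict[int, int] = {}
    failed: list[int] = []
    for pod, result in zip(pods, results):
        if isinstance(result, BaseException):
            failed.append(pod)
        else:
            merged.update(result)
    return merged, failed


async def fake_fetch(pod: int) -> dict[int, int]:
    """Stand-in for an HTTP call to a gateway pod; pod 2 is unreachable."""
    data = {0: {42: 5}, 1: {7: 2}}
    if pod not in data:
        raise ConnectionError(f"gateway pod {pod} unreachable")
    return data[pod]


merged, failed = asyncio.run(aggregate([0, 1, 2], fake_fetch))
print(merged, failed)
```

Whether a partial answer is acceptable depends on the query; a caller could instead treat a non-empty `failed` list as an error.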
| 173 | + |
| 174 | +## 🎓 Key Insights |
| 175 | + |
| 176 | +1. **SQLite isn't the problem** - Single-writer is fine if you partition data |
| 177 | +2. **Discord's architecture helps** - Guilds are natural boundaries |
| 178 | +3. **Simple is better** - Standard tools beat fancy solutions |
| 179 | +4. **Cost is worth it** - ~5x cost ($10 → $45-50/mo) for production-grade scaling is reasonable |
| 180 | +5. **No silver bullet** - All SQLite replication tools have tradeoffs |
| 181 | + |
| 182 | +## 🚀 Bottom Line |
| 183 | + |
| 184 | +- **Status**: ✅ Ready to implement |
| 185 | +- **Confidence**: High (proven patterns) |
| 186 | +- **Risk**: Medium (new architecture) |
| 187 | +- **Effort**: 6-8 weeks |
| 188 | +- **Impact**: Enables horizontal scaling + HA |
| 189 | + |
| 190 | +**Recommendation**: ✅ Proceed with implementation |
| 191 | + |
| 192 | +--- |
| 193 | + |
| 194 | +**Version**: 1.0 |
| 195 | +**Updated**: 2026-01-01 |
| 196 | +**Next Step**: Team review & approval |