Commit b16f603

Add quick reference card for load balancer architecture

Committed by Copilot and vcarl
Co-authored-by: vcarl <[email protected]>
1 parent 4c8f864 commit b16f603

File tree

1 file changed (+196, -0 lines)

notes/LOAD_BALANCER_QUICK_REF.md

# Load Balancer Architecture - Quick Reference Card

## 🎯 One-Sentence Summary

Split guilds across multiple pods, each with its own SQLite database, coordinated by a config service.
## 📊 Current vs Proposed

| Aspect | Current | Proposed |
|--------|---------|----------|
| **Pods** | 1 | 8-23 (3-10 gateway, 2-10 HTTP, 2 config, 1 PostgreSQL) |
| **Scaling** | ❌ None | ✅ Horizontal |
| **Cost** | $10/mo | $45-50/mo |
| **HA** | ❌ No | ✅ Yes |
| **SQLite** | 1 database | 3-10 databases (1 per gateway pod) |
| **Load Balancer** | ❌ Not supported | ✅ Supported |
## 🏗️ Architecture at a Glance

```
Users → LB → HTTP Pods → Config Service → Gateway Pods → Discord
                               ↓               ↓
                          PostgreSQL    SQLite + Litestream
                          (guild→pod)      (guild data)
```
## 📦 Components

### HTTP Service
- **Purpose**: Web portal + webhook routing
- **Type**: Deployment (stateless)
- **Replicas**: 2-10 (HPA)
- **Scales**: Automatically on CPU/memory

### Config Service
- **Purpose**: Guild assignment management
- **Type**: Deployment (stateless)
- **Replicas**: 2
- **Database**: PostgreSQL

### Gateway Service
- **Purpose**: Discord gateway connection
- **Type**: StatefulSet (stateful)
- **Replicas**: 3-10
- **Database**: SQLite (1 per pod)
- **Backup**: Litestream → S3
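Because the gateway service is a StatefulSet, each pod gets a stable ordinal it can use to pick its own database file. This card doesn't specify the naming scheme, so the pod names (`gateway-N`) and path (`/data/guilds-N.db`) below are illustrative assumptions, not the real manifests:

```typescript
// Sketch: derive a per-pod SQLite path from the StatefulSet ordinal.
// Pod name format ("gateway-0", "gateway-1", ...) and the /data mount
// are assumptions; adjust to match the actual manifests.
export function sqlitePathForPod(podName: string): string {
  const ordinal = podName.split("-").pop();
  if (ordinal === undefined || !/^\d+$/.test(ordinal)) {
    throw new Error(`not a StatefulSet pod name: ${podName}`);
  }
  // One database file per gateway pod, replicated to S3 by Litestream.
  return `/data/guilds-${ordinal}.db`;
}

// At startup the pod would read its own name, e.g. from HOSTNAME:
//   sqlitePathForPod(process.env.HOSTNAME ?? "gateway-0")
```

Keeping the path a pure function of the pod name means a restarted pod reattaches to (or Litestream-restores) exactly the database it owned before.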
## 🔑 Key Decisions

| Decision | Rationale |
|----------|-----------|
| Guild-based sharding | Natural fit with Discord architecture |
| Keep SQLite | No migration, proven, fast |
| Litestream backup | Low overhead, battle-tested |
| PostgreSQL for config | Multi-writer, small dataset |
| Separate HTTP/Gateway | Independent scaling |
## 🚫 What We're NOT Doing

❌ Migrating to PostgreSQL (too much work)
❌ Using rqlite (different API)
❌ Using LiteFS (still single-writer)
❌ Using Turso (vendor lock-in)
❌ Sharing one SQLite file across pods (unsafe over network storage)
## ⚡ How It Works

### Discord Event
```
Discord → Gateway Pod 0 → SQLite 0 → Litestream → S3
          (guild assigned to pod 0)
```

### HTTP Request
```
User → LB → HTTP Pod → Config: "Which pod has guild 42?"
                     → Gateway Pod 0 → SQLite 0 → Response
```
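The HTTP-request path above boils down to one lookup plus a forward. A minimal sketch, assuming a config-service client that maps guild IDs to gateway pod addresses (the names `ConfigClient`, `podForGuild`, and port 3000 are all illustrative, not the real API):

```typescript
// Stand-in for a client of the config service's guild → pod mapping.
type ConfigClient = { podForGuild(guildId: string): string | undefined };

// Resolve which gateway pod owns a guild, then build the internal URL
// the HTTP pod would proxy the request to.
export function gatewayUrlForGuild(
  config: ConfigClient,
  guildId: string,
  path: string,
): string {
  const pod = config.podForGuild(guildId);
  if (pod === undefined) {
    throw new Error(`guild ${guildId} is not assigned to any gateway pod`);
  }
  return `http://${pod}:3000${path}`;
}
```

In-cluster, `pod` would be a stable StatefulSet DNS name, so the mapping stays valid across pod restarts.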
### Guild Assignment
```
New Guild → Config Service → Least-loaded pod
                           → Update PostgreSQL
                           → Gateway pod starts handling
```
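The "least-loaded pod" step is the only real decision the config service makes. A sketch of that rule (pod names and counts are illustrative; persistence to PostgreSQL is left to the caller):

```typescript
// Pick the gateway pod currently handling the fewest guilds.
export function pickLeastLoadedPod(guildCounts: Map<string, number>): string {
  let best: string | undefined;
  let bestCount = Infinity;
  for (const [pod, count] of guildCounts) {
    if (count < bestCount) {
      best = pod;
      bestCount = count;
    }
  }
  if (best === undefined) throw new Error("no gateway pods registered");
  // The caller would then write guild → pod to PostgreSQL and notify
  // the chosen pod to start handling the guild.
  return best;
}
```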
## 📈 Scaling Path

```
Phase 1: 3 gateway pods (0-99 guilds each)
Phase 2: 5 gateway pods (rebalance to ~60 each)
Phase 3: 10 gateway pods (100+ guilds each)
```
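The per-phase numbers assume an even spread of guilds over gateway pods; e.g. "~60 each" in Phase 2 corresponds to roughly 300 total guilds. A one-liner for that arithmetic:

```typescript
// Even-spread target: guilds per pod when rebalancing N guilds over P pods.
export function guildsPerPod(totalGuilds: number, pods: number): number {
  if (pods <= 0) throw new Error("need at least one gateway pod");
  return Math.ceil(totalGuilds / pods);
}
```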
## 💵 Cost Breakdown

```
Gateway pods (3x):   $15/mo
HTTP pods (2-10x):   $10/mo
Config pods (2x):     $5/mo
PostgreSQL:           $8/mo
Volumes (3x):         $3/mo
S3 backup:            $5/mo
─────────────────────────────
Total:               $46/mo
```
## ⏱️ Timeline

```
Week 1-2: Config service
Week 3-4: Gateway changes
Week 5-6: Production deploy
Week 7+:  Optimization
```
## 🎯 Success Criteria

- [ ] P95 latency < 100ms
- [ ] 99.9% uptime
- [ ] Zero-downtime deploys
- [ ] < 30s pod recovery
- [ ] 1000+ guilds/pod
## 🔥 Quick Start

```bash
# 1. Deploy config service
kubectl apply -f cluster/proposed/config-service.yaml

# 2. Deploy gateway pods
kubectl apply -f cluster/proposed/gateway-service.yaml

# 3. Deploy HTTP service
kubectl apply -f cluster/proposed/http-service.yaml

# 4. Update ingress
kubectl apply -f cluster/proposed/ingress.yaml

# 5. Verify
kubectl get pods -l app=mod-bot
```
## 📚 Documentation Map

| Need | Read |
|------|------|
| Exec summary | 2026-01-01_5_executive-summary.md |
| Visual diagrams | 2026-01-01_6_ascii-diagrams.md |
| Full analysis | 2026-01-01_1_load-balancer-architecture.md |
| Implementation | 2026-01-01_4_implementation-guide.md |
| Tool comparison | 2026-01-01_3_sqlite-sync-comparison.md |
| Navigation | LOAD_BALANCER_INDEX.md |
## ⚠️ Common Questions

**Q: Why not just use PostgreSQL?**
A: SQLite is simpler, faster for our use case, and already works. Migration would take months.

**Q: Why not use [SQLite replication tool]?**
A: They all have major limitations (see the comparison doc). Guild sharding is simpler and proven.

**Q: What if a pod fails?**
A: Kubernetes restarts it, Litestream restores from S3, and guilds are back online in < 30s.

**Q: How do we rebalance guilds?**
A: The config service can reassign guilds: stop → export → import → start. Takes ~2 minutes.

**Q: Can we scale down?**
A: Yes, but it requires guild reassignment. Not instant, but possible.

**Q: What about cross-guild queries?**
A: The HTTP service can query multiple gateway pods and aggregate results.
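The cross-guild answer is a fan-out/merge: each gateway pod only knows its own guilds, so the HTTP service asks all of them in parallel. A sketch, where `fetchFromPod` is a stand-in for an HTTP call to each pod's internal API (not a real function in the codebase):

```typescript
// Fan out a query to every gateway pod and merge the per-pod results.
export async function queryAllPods<T>(
  pods: string[],
  fetchFromPod: (pod: string) => Promise<T[]>,
): Promise<T[]> {
  // Each pod holds a disjoint set of guilds, so concatenation is a
  // complete, duplicate-free merge.
  const perPod = await Promise.all(pods.map(fetchFromPod));
  return perPod.flat();
}
```

Note `Promise.all` fails fast: one unreachable pod fails the whole query, so a production version might prefer `Promise.allSettled` and return partial results.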
## 🎓 Key Insights

1. **SQLite isn't the problem** - single-writer is fine if you partition the data
2. **Discord's architecture helps** - guilds are natural sharding boundaries
3. **Simple is better** - standard tools beat fancy solutions
4. **Cost is worth it** - 5x the cost for production-grade scaling is reasonable
5. **No silver bullet** - all SQLite replication tools have tradeoffs
## 🚀 Bottom Line

**Status**: ✅ Ready to implement
**Confidence**: High (proven patterns)
**Risk**: Medium (new architecture)
**Effort**: 6-8 weeks
**Impact**: Enables horizontal scaling + HA

**Recommendation**: ✅ Proceed with implementation
---

**Version**: 1.0
**Updated**: 2026-01-01
**Next Step**: Team review & approval
