An internal eval platform that continuously monitors the quality of the RivalReview AI pipeline. Built with FastAPI and SQLite, it runs 19 automated evals across 4 layers — from raw output schema checks to cost and latency tracking.
RivalReview uses multiple AI agents to process app store reviews and generate competitive analysis. This eval platform answers the question: did a pipeline change make things better or worse?
Every eval run produces a score per eval, a pass/fail against a threshold, and a delta vs the baseline version. Results are stored, versioned, and comparable over time through a web dashboard.
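Concretely, one stored result might look like the following (a minimal sketch; the exact field names in the eval store are assumptions):

```python
# Hypothetical shape of a single stored eval result (field names assumed)
result = {
    "eval_id": "2.3",
    "name": "Ranking correctness",
    "score": 0.92,               # fraction of checked items that passed
    "threshold": 0.85,           # pass/fail cut-off from config.py
    "passed": True,              # score >= threshold
    "delta": 0.04,               # change vs the baseline version's score
    "version": "v1.3",
    "baseline": "v1.0-baseline",
}
```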
```
RivalReview-Evals/
├── main.py              # FastAPI app — all routes
├── config.py            # Thresholds, DB paths, env config
├── requirements.txt
├── .env.example
│
├── evals/
│   ├── base.py          # EvalResult dataclass — contract for all evals
│   ├── runner.py        # Orchestrates runs, computes deltas, saves results
│   ├── layer1/          # Monthly Batch Agent evals (1.1–1.5)
│   ├── layer2/          # App Synthesis Agent evals (2.1–2.6)
│   ├── layer3/          # Cross-App Synthesis Agent evals (3.1–3.5)
│   └── layer4/          # Cost & Latency evals (4.1–4.4)
│
├── services/
│   ├── db.py            # SQLAlchemy models for both DBs
│   └── grok.py          # Grok API client (used by judge evals)
│
└── templates/           # Jinja2 HTML templates
    ├── dashboard.html
    ├── history.html
    ├── versions.html
    ├── run_detail.html
    └── compare.html
```
- Python 3.11+
- A running RivalReview instance with a populated `rivalreview.db`
- A Grok API key (used by LLM judge evals — 1.3, 2.4, 3.5)
1. Clone the repo

   ```bash
   git clone https://github.com/your-username/rivalreview-evals.git
   cd rivalreview-evals
   ```

2. Create and activate a virtual environment

   ```bash
   python -m venv .venv
   # Windows
   .venv\Scripts\activate
   # macOS / Linux
   source .venv/bin/activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Configure environment

   ```bash
   cp .env.example .env
   ```

   Edit `.env` with your values:

   ```env
   RIVALREVIEW_DB_PATH=C:\path\to\rivalreview.db
   EVALSTORE_DB_PATH=C:\path\to\evalstore.db
   GROK_API_KEY=your_grok_api_key_here
   GROK_MODEL=grok-3-mini
   ```

5. Run the server

   ```bash
   uvicorn main:app --reload --port 8001
   ```

   Open http://localhost:8001 in your browser.
- Go to the dashboard at http://localhost:8001
- Click a project card to select it
- Choose a version from the dropdown
- Click Run All Evals
- Every meaningful pipeline change should be a new version
- Go to Versions → create a new version with a description of what changed
- New versions automatically become the current version
- The baseline version (`v1.0-baseline`) is protected and cannot be deleted — it is the anchor all other versions compare against (see the delta sketch below)
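The delta math itself is simple subtraction per eval. A minimal sketch, assuming scores are keyed by eval ID (the helper name is hypothetical):

```python
def delta_vs_baseline(scores: dict[str, float],
                      baseline: dict[str, float]) -> dict[str, float]:
    """Per-eval score change vs the baseline version (hypothetical helper)."""
    return {
        eval_id: round(score - baseline[eval_id], 4)
        for eval_id, score in scores.items()
        if eval_id in baseline
    }

# Example: eval 2.3 improved by 0.04, eval 1.2 regressed by 0.10
delta_vs_baseline({"2.3": 0.92, "1.2": 0.75}, {"2.3": 0.88, "1.2": 0.85})
# -> {"2.3": 0.04, "1.2": -0.1}
```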
- Go to Compare and select two versions
- Scores are shown side by side for each eval, along with the delta between them
- Use this after making a pipeline change to confirm improvement
- Go to History to see all past runs
- Click any run to see full eval results, criteria, scores, and deltas
| Layer | Agent | Evals | What it checks |
|---|---|---|---|
| 1 | Monthly Batch Agent | 1.1–1.5 | Schema validity, theme specificity, excerpt relevance, sentiment accuracy, volume plausibility |
| 2 | App Synthesis Agent | 2.1–2.6 | Theme deduplication, volume consistency, ranking correctness, summary actionability, coverage, excerpt traceability |
| 3 | Cross-App Synthesis Agent | 3.1–3.5 | Sentiment trend math, sentiment completeness, differentiator accuracy, summary depth |
| 4 | All Agents | 4.1–4.4 | Token usage, cost per run, latency, retry rate |
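Layer 4 checks are plain metric thresholds rather than LLM judges. For illustration only (the metric field name and the $0.50 budget are assumptions), a cost-per-run check reduces to:

```python
# Hypothetical layer-4 check: what fraction of runs stayed within budget?
def cost_per_run_score(metrics: list[dict], budget_usd: float = 0.50) -> float:
    within_budget = sum(1 for m in metrics if m["cost_usd"] <= budget_usd)
    return within_budget / len(metrics)

score = cost_per_run_score([{"cost_usd": 0.31}, {"cost_usd": 0.62}])
# -> 0.5, which the runner would compare against the eval's threshold
```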
- Create a new file in the correct layer folder, e.g. `evals/layer2/eval_2_7_my_eval.py`
- Follow this structure:
```python
from evals.base import EvalResult
from config import THRESHOLDS

EVAL_ID = "2.7"
EVAL_NAME = "My New Eval"
LAYER = 2


def run(app_analyses: list[dict], analyses: list[dict]) -> dict:
    result = EvalResult(
        eval_id=EVAL_ID,
        name=EVAL_NAME,
        layer=LAYER,
        threshold=THRESHOLDS[EVAL_ID],
    )
    for item in app_analyses:
        passed = True  # your check here
        if passed:
            result.passed += 1
        else:
            result.failed += 1
        result.details.append({
            "item_id": item["app_id"],
            "passed": passed,
            "note": "...",
        })
    result.finalise()
    return result.to_dict()
```

- Add the threshold to `config.py` under `THRESHOLDS`
- Import and register it in `evals/runner.py` (see the sketch below):
  - Import at the top
  - Add to the `ALL_EVALS` list
  - Add to the `NEEDS_REVIEWS` or `NEEDS_METRICS` sets if needed
  - Add a criteria string to the `EVAL_CRITERIA` dict
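Put together, the registration steps above might look like this; a sketch only, since the exact layout of `config.py` and `evals/runner.py` may differ:

```python
# config.py: add the threshold (0.9 is an example value)
THRESHOLDS = {
    # ...existing thresholds...
    "2.7": 0.9,
}

# evals/runner.py: import and register the new eval
from evals.layer2 import eval_2_7_my_eval

ALL_EVALS = [
    # ...existing evals...
    eval_2_7_my_eval,
]

NEEDS_REVIEWS = {
    # ...existing eval IDs...
    "2.7",  # only if the eval needs raw review data
}

EVAL_CRITERIA = {
    # ...existing criteria...
    "2.7": "Every app analysis passes my new check.",
}
```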
That's it — the runner, dashboard, history, and compare pages all handle it automatically.
| Component | Technology |
|---|---|
| Web framework | FastAPI |
| Templates | Jinja2 |
| Database | SQLite via SQLAlchemy |
| LLM judge | Grok API (grok-3-mini) |
| Fuzzy matching | thefuzz |
| Frontend | Vanilla HTML + Tailwind CDN |
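Fuzzy matching via `thefuzz` backs checks like excerpt traceability: an excerpt counts as traceable if it approximately appears in some source review. A minimal sketch, assuming a 90-point cutoff:

```python
from thefuzz import fuzz

def excerpt_is_traceable(excerpt: str, reviews: list[str], cutoff: int = 90) -> bool:
    # partial_ratio scores the best matching substring, 0-100;
    # the 90-point cutoff is an assumption for illustration
    return any(fuzz.partial_ratio(excerpt, review) >= cutoff for review in reviews)

excerpt_is_traceable("crashes on launch", ["The app crashes on launch every time."])
# -> True (the excerpt is a near-verbatim substring of the review)
```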
| Variable | Description | Default |
|---|---|---|
| `RIVALREVIEW_DB_PATH` | Path to RivalReview's SQLite DB | — |
| `EVALSTORE_DB_PATH` | Path to eval platform's SQLite DB | `./evalstore.db` |
| `GROK_API_KEY` | Grok API key for LLM judge evals | — |
| `GROK_MODEL` | Grok model to use | `grok-3-mini` |
| `GROK_TEMPERATURE` | Temperature for judge calls | `0.1` |
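In `config.py`, these variables are typically read with the defaults shown above. A sketch, assuming `python-dotenv` is used to load `.env` (the project's actual loading mechanism may differ):

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv loads .env

load_dotenv()

RIVALREVIEW_DB_PATH = os.environ["RIVALREVIEW_DB_PATH"]  # required, no default
EVALSTORE_DB_PATH = os.getenv("EVALSTORE_DB_PATH", "./evalstore.db")
GROK_API_KEY = os.environ["GROK_API_KEY"]                # required, no default
GROK_MODEL = os.getenv("GROK_MODEL", "grok-3-mini")
GROK_TEMPERATURE = float(os.getenv("GROK_TEMPERATURE", "0.1"))
```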