---
title: SQL Review Env
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
---
An OpenEnv-compliant, enterprise-grade reinforcement learning environment for training and evaluating AI agents on real-world database engineering workflows.
This is not a toy. SQL query review is one of the most critical, high-volume engineering tasks performed daily by Senior Database Administrators (DBAs) at scale.
In enterprise production systems, millions of SQL queries are submitted to databases every day. Before any query reaches production, a DBA must manually verify that it:
- Parses correctly – syntax errors silently break entire application pipelines
- Returns the right data – an off-by-one `WHERE` clause can expose confidential records
- Performs efficiently – a single full-table scan on a billion-row table can bring down a production server
- Mutates data safely – an `UPDATE` without a `WHERE` clause corrupts entire datasets permanently
This environment simulates that high-stakes engineering workflow as an RL task. An AI agent acts as a virtual DBA: receiving a broken or inefficient query, reading the live database schema, executing iterative fixes, and receiving dense reward signals that mirror how a human engineer would evaluate quality.
> [!IMPORTANT]
> This environment is designed for agents that must reason over structured data, write syntactically precise code, and optimize for both correctness and efficiency simultaneously. It is directly applicable to training production-ready AI coding assistants.
See the Universal Grader in action in under 60 seconds, no API key required:

```shell
python demo.py
```

This runs a fully automated, color-formatted terminal showcase proving the grader works live across all three SQL evaluation domains:
| Scenario | Task | What It Demonstrates |
|---|---|---|
| 1 – Self-Correcting Agent | `syntax-fix` | Error bubbling, score clamping at 0.01, then full recovery to 0.99 |
| 2 – Query Optimizer | `performance-tune` | Partial credit for correctness, then `EXPLAIN QUERY PLAN` bonus awarded |
| 3 – Schema Architect | `schema-design` | DDL grading via `sqlite_master` inspection, zero string matching |
No mocks. No hardcoded strings. Every reward is computed live against a real in-memory SQLite database.
The SQL Review Environment places an agent in an interactive loop against a live, isolated, in-memory SQLite database. The agent must write raw SQL, observe results or errors, then iterate until the query is both correct and efficient.
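The interaction contract can be sketched as a plain reset/step loop. `FakeEnv` below is a hypothetical stand-in for the real HTTP endpoints (`/reset`, `/step`); its scoring is a toy approximation, not the actual grader.

```python
# Illustrative sketch of the reset → step → iterate loop described above.
# FakeEnv is a hypothetical stand-in for the environment's HTTP API.
class FakeEnv:
    MAX_STEPS = 8

    def reset(self, task_id: str) -> dict:
        self.steps = 0
        # A deliberately broken starting query, as in the syntax-fix task.
        return {"task_id": task_id, "query": "SELCET * FROM users",
                "error_message": None}

    def step(self, sql: str) -> dict:
        self.steps += 1
        solved = sql.upper().startswith("SELECT")  # toy grading rule
        return {"value": 0.99 if solved else 0.01,
                "done": solved or self.steps >= self.MAX_STEPS}

env = FakeEnv()
obs = env.reset("syntax-fix")
fix = obs["query"].replace("SELCET", "SELECT")  # the agent's proposed edit
reward = env.step(fix)
```

A real agent would replace the one-line string edit with an LLM call conditioned on `db_schema`, `query`, and `error_message`.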
Key Architectural Features:
| Feature | Implementation |
|---|---|
| Universal Intent-Based Grader | Detects SQL intent (DQL/DML/DDL) and applies state-comparison evaluation dynamically |
| Master Template Cache | sqlite3.Connection.backup() clones the master DB in < 2ms per reset() |
| Realistic Fixture Data | 10,000+ rows across 5 tables, generated via Faker(seed=42) for reproducibility |
| RAM Cap | PRAGMA max_page_count = 10000 prevents memory exhaustion attacks |
| Query Timeout | set_progress_handler aborts runaway queries after 1.5 seconds |
| Safety Penalties | Unguarded DROP/DELETE commands are penalized without crashing the server |
| Full OpenEnv Spec | Typed Pydantic models, all required HTTP endpoints, openenv validate passes |
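Two of the mechanisms above can be sketched with nothing but the stdlib `sqlite3` API: `backup()`-based template cloning and a progress-handler query timeout. The table contents and helper names here are illustrative, not the real fixtures.

```python
# Sketch: per-reset() cloning of a master DB plus runaway-query abort,
# assuming a toy schema (the real env uses 10,000+ Faker-generated rows).
import sqlite3
import time

# Master template: built once, then cloned per episode.
master = sqlite3.connect(":memory:")
master.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
""")

def clone_master() -> sqlite3.Connection:
    """Fast per-episode clone of the master database."""
    episode = sqlite3.connect(":memory:")
    episode.execute("PRAGMA max_page_count = 10000")  # RAM cap
    master.backup(episode)                            # binary page copy
    return episode

def run_with_timeout(conn: sqlite3.Connection, sql: str, seconds: float = 1.5):
    """Abort runaway queries via SQLite's progress-handler hook."""
    deadline = time.monotonic() + seconds
    # The handler runs every N VM instructions; returning a truthy value
    # makes SQLite abort the statement with OperationalError.
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 1000)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 0)
```

Because `backup()` copies raw pages rather than replaying SQL, cloning stays fast regardless of fixture size, and mutations in an episode never touch the master template.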
The environment exposes 6 tasks spanning the full spectrum of database engineering, progressing from syntax debugging to complex structural design.
| Task ID | Difficulty | Category | Objective | Why This Is Challenging |
|---|---|---|---|---|
| `syntax-fix` | Easy | Debugging | Fix deliberate parse errors (`SELCET`, `WHRE`) | Requires recognizing SQLite-specific syntax vs. other SQL dialects |
| `performance-tune` | Medium | Optimization | Rewrite a slow subquery to use indexed JOINs | Agent must understand `EXPLAIN QUERY PLAN` semantics, not just produce correct data |
| `schema-design` | Hard | DDL | Design a normalized relational schema from a plain-text requirement | No reference data exists to compare against; grader must inspect `sqlite_master` state |
| `aggregation-mastery` | Medium | Analytics | Write a `GROUP BY` + `HAVING` aggregate query | `HAVING` clause filtering is frequently confused with `WHERE` by LLMs |
| `data-mutation` | Medium | DML | Targeted `UPDATE`/`DELETE` with a precise `WHERE` clause | Unguarded mutations are penalized; agent must scope changes correctly |
| `advanced-joins` | Hard | Joins | `LEFT JOIN` preserving NULL rows for products with no reviews | NULL preservation in outer joins is a common failure point for even frontier LLMs |
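The `advanced-joins` failure mode is easy to reproduce with the stdlib `sqlite3` module. In this toy schema (illustrative only, not the env's real fixtures), a `LEFT JOIN` keeps products that have no reviews, returning `None` for the missing rating, while an `INNER JOIN` silently drops them:

```python
# Toy demonstration of NULL preservation in a LEFT JOIN.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE reviews  (product_id INTEGER, rating INTEGER);
    INSERT INTO products VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO reviews  VALUES (1, 5);   -- 'gadget' has no review
""")

left = conn.execute("""
    SELECT p.name, r.rating
    FROM products p
    LEFT JOIN reviews r ON r.product_id = p.id
    ORDER BY p.id
""").fetchall()
# 'gadget' survives with rating None; an INNER JOIN would drop the row.
```

Models that "simplify" the outer join into an inner join produce a correct-looking but incomplete result set, which is exactly what the state-comparison grader penalizes.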
The core innovation is a single deterministic grading engine (sql_env/graders.py) that handles all 6 task types without any per-task custom logic:
```mermaid
graph TD
    Sub([Agent SQL Submission]) --> S1["<b>1. SYNTAX CHECK</b><br/>EXPLAIN <sql> → parse error?"]
    S1 --> S2{"<b>2. INTENT DETECTION</b><br/>First keyword → DQL/DML/DDL"}
    S2 -->|"DQL<br/>(SELECT)"| N1["Run both queries,<br/>sort+hash results"]
    S2 -->|"DML<br/>(UPDATE/DELETE)"| N2["Clone DB twice,<br/>compare table state"]
    S2 -->|"DDL<br/>(CREATE/ALTER)"| N3["Inspect sqlite_master<br/>for table presence"]
    N1 --> S4["<b>4. PERFORMANCE CHECK (DQL only)</b><br/>EXPLAIN QUERY PLAN → index?"]
    S4 --> S5["<b>5. REWARD VECTOR COMPOSITION</b><br/>syntax + execution + correctness + performance<br/>− penalties → clamp [0,1]"]
    N2 --> S5
    N3 --> S5
```
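The first two stages of the diagram can be sketched in a few lines. `detect_intent()` and `result_hash()` are illustrative names, not the actual functions in `sql_env/graders.py`:

```python
# Sketch: keyword-based intent routing plus an order-insensitive
# fingerprint for comparing DQL result sets.
import hashlib
import sqlite3

def detect_intent(sql: str) -> str:
    """Route on the first keyword, as in step 2 of the diagram."""
    keyword = sql.lstrip().split(None, 1)[0].upper()
    if keyword in {"SELECT", "WITH"}:
        return "DQL"
    if keyword in {"INSERT", "UPDATE", "DELETE"}:
        return "DML"
    if keyword in {"CREATE", "ALTER", "DROP"}:
        return "DDL"
    return "UNKNOWN"

def result_hash(conn: sqlite3.Connection, sql: str) -> str:
    """Sort rows before hashing so row order never affects the score."""
    rows = sorted(repr(row) for row in conn.execute(sql).fetchall())
    return hashlib.sha256("\n".join(rows).encode()).hexdigest()
```

Sorting before hashing is what lets two semantically equivalent queries with different `ORDER BY` clauses grade identically.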
> [!NOTE]
> All types below are implemented as Pydantic models (`sql_env/models.py`) per the OpenEnv typed spec. This ensures schema validation on every API call.
The full state packet returned to the agent after every reset() or step():
| Property | Type | Description |
|---|---|---|
| `task_id` | `str` | Active task (e.g., `"performance-tune"`) |
| `db_schema` | `str` | Human-readable `CREATE TABLE` schema of the live database |
| `query` | `str` | Starting SQL query, often containing deliberate bugs or inefficiencies |
| `expected_hint` | `str` | Natural language description of the agent's goal |
| `error_message` | `str \| None` | Last SQLite exception string from the engine, or `null` if clean |
| `step` | `int` | Current episode step count (`max_steps = 8`) |
The single action type the agent submits to the environment:
| Property | Type | Description |
|---|---|---|
| `sql` | `str` | A raw SQL statement to execute against the live SQLite database |
The structured reward object returned after every step():
| Property | Type | Description |
|---|---|---|
| `value` | `float` | Scalar score clamped to [0.01, 0.99] |
| `breakdown` | `dict` | Per-component score vector (see reward table below) |
| `done` | `bool` | Whether the episode has terminated |
| `info` | `dict` | Debug info, including the raw error string and `EXPLAIN QUERY PLAN` output |
The reward function provides dense partial-progress signals at every step, not just binary success/failure. This allows RL agents to learn from incremental improvements.
> [!IMPORTANT]
> Scores are strictly clamped to [0.01, 0.99]. This avoids dead-zone gradient issues in policy-gradient methods: the agent always receives a non-zero learning signal.
| Component | Max Weight | Trigger |
|---|---|---|
| Syntax | +0.30 | `EXPLAIN <sql>` passes the SQLite parser without error |
| Execution | +0.25 | Query executes against the live DB without a runtime exception |
| Correctness | +0.35 | Sorted, hashed result set (or final table state for DML) matches ground truth |
| Performance | +0.10 | `EXPLAIN QUERY PLAN` confirms an index scan (DQL tasks only) |

| Penalty | Weight | Condition |
|---|---|---|
| Destructive Command | −0.05 | Unguarded `DROP` or `DELETE` (no `WHERE` clause) |
| Step Exhaustion | −0.10 | Agent exceeds `max_steps = 8` without solving the task |
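The composition step amounts to a weighted sum, penalty subtraction, and a clamp. The weights come from the tables above; the function name is illustrative:

```python
# Hedged sketch of reward composition with the documented weights and
# the [0.01, 0.99] clamp. Each component is a fraction in [0, 1].
def compose_reward(syntax: float, execution: float, correctness: float,
                   performance: float, penalty: float = 0.0) -> float:
    raw = (0.30 * syntax + 0.25 * execution +
           0.35 * correctness + 0.10 * performance - penalty)
    return min(0.99, max(0.01, raw))
```

For example, a query that parses and executes but returns wrong data with no index use scores 0.30 + 0.25 = 0.55, which matches the first `[STEP]` line in the sample transcript below.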
- Docker (for containerized deployment)
- Python 3.11+ (for local development)
```shell
# Build the container image
docker build -t sql-review-env .

# Run the environment on port 7860
docker run -p 7860:7860 sql-review-env
```

The Gradio UI and all OpenEnv API endpoints will be available at http://localhost:7860.
```shell
# Create and activate virtual environment
python -m venv .venv
.venv\Scripts\activate       # Windows
# source .venv/bin/activate  # Linux / macOS

# Install dependencies
pip install -r requirements.txt

# Start the Gradio UI (demo interface)
python app.py

# OR: Start the raw OpenEnv API server directly
python -m uvicorn server.app:app --host 0.0.0.0 --port 7860
```

```shell
# No API key needed – runs purely against local SQLite
python demo.py
```

```shell
# Set credentials (at least one is required)
export HF_TOKEN="hf_your_token_here"
# OR: export OPENAI_API_KEY="sk-your_key_here"

# Optional overrides
export API_BASE_URL="https://router.huggingface.co/v1"
export MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"

# Run the evaluation loop across all tasks
python inference.py
```

Structured stdout output format (required by OpenEnv spec):
```
[START] task=syntax-fix env=sql-review-env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=SELECT * FROM users reward=0.55 done=false error=null
[STEP] step=2 action=SELECT id, name FROM users WHERE ... reward=0.99 done=true error=null
[END] success=true steps=2 score=0.990 rewards=0.55,0.99
```
Results achieved using the zero-shot baseline agent (Qwen/Qwen2.5-72B-Instruct, max 8 steps):
| Task | Score | Observation |
|---|---|---|
| `syntax-fix` | 0.99 | Solved in 1 step consistently (grader clamps max reward to 0.99) |
| `performance-tune` | 0.90 | Correctness achieved; occasional index miss costs the +0.10 bonus |
| `schema-design` | 0.90 | Frontier models sometimes hallucinate MySQL syntax (`AUTO_INCREMENT` instead of SQLite's `AUTOINCREMENT`) |
| `aggregation-mastery` | 0.85 | `GROUP BY` solid; `HAVING` clause occasionally omitted |
| `data-mutation` | 0.85 | `UPDATE` correctness solid; scoping precision varies |
| `advanced-joins` | 0.75 | NULL preservation in `LEFT JOIN` is the primary failure mode |
> [!NOTE]
> These scores are fully reproducible. Run `python inference.py` with your API credentials against the included environment. The dataset is pinned via `Faker(seed=42)` and `random.seed(42)`: every run produces an identical database state.
```
sql-review-env/
├── app.py                   # Gradio UI entry point (interactive demo)
├── demo.py                  # Automated end-to-end showcase script (run this first!)
├── inference.py             # OpenEnv-compliant LLM evaluation agent
├── requirements.txt         # Python dependencies
├── pyproject.toml           # Project metadata + entry_point: server.app:main
├── uv.lock                  # Deterministic dependency lockfile (uv)
├── openenv.yaml             # OpenEnv environment metadata declaration
├── Dockerfile               # Container: python:3.11-slim + uvicorn + proxy headers
├── validate-submission.sh   # Local pre-submission validation runner
├── README.md                # This document
├── server/
│   └── app.py               # FastAPI server: /reset /step /state /health /metadata /schema /mcp
└── sql_env/
    ├── models.py            # Pydantic typed models: SQLObservation, SQLAction, SQLReward
    ├── env.py               # Core environment: reset(), step(), state() + safety constraints
    ├── graders.py           # Universal Intent-Based Grader (DQL / DML / DDL routing)
    └── tasks.py             # 6 task definitions + Master DB template cache
```
| Requirement | Status | Detail |
|---|---|---|
| `/reset` endpoint | ✅ | POST – accepts `task_id`, returns `SQLObservation` |
| `/step` endpoint | ✅ | POST – accepts `SQLAction`, returns `SQLReward` |
| `/state` endpoint | ✅ | GET – returns current episode state dict |
| `/health` endpoint | ✅ | GET – returns `{"status": "healthy"}` |
| `/metadata` endpoint | ✅ | GET – returns name, description, version, task list |
| `/schema` endpoint | ✅ | GET – returns full action/observation/state JSON schemas |
| `/mcp` endpoint | ✅ | POST – returns JSON-RPC 2.0 compliant payload |
| Typed Pydantic models | ✅ | `SQLObservation`, `SQLAction`, `SQLReward` |
| `openenv.yaml` present | ✅ | Declares `entry_point: server.app:main` with 6 tasks |
| `openenv validate` passes | ✅ | All automated checks pass |
| `docker build` succeeds | ✅ | `python:3.11-slim`, clean build, port 7860 |
| Minimum 3 tasks | ✅ | 6 tasks implemented (Easy → Hard) |
| Score range [0.0, 1.0] | ✅ | Clamped to [0.01, 0.99] |
| Inference script present | ✅ | `inference.py` in root, uses OpenAI client |
| Structured stdout logs | ✅ | `[START]` / `[STEP]` / `[END]` format strictly followed |
| Reproducible scores | ✅ | `Faker(seed=42)` + `random.seed(42)` pinned |