Add mcpg --demo: a curated demo dataset with a captured walkthrough #220

Merged: devopam merged 1 commit into main from claude/demo-dataset on Jul 3, 2026

@devopam devopam commented Jul 2, 2026

Owner

Summary

New-user onboarding (roadmap 17.1): previously, a new user's first five minutes with MCPg ran against whatever data they happened to have — often an empty scratch database that shows off none of the 250+ tool surface. This PR gives every user (and every future screenshot/recording) the same rich first experience:

MCPG_DATABASE_URL=postgresql://... mcpg --demo        # seed the mcpg_demo schema
MCPG_DATABASE_URL=postgresql://... mcpg --demo-drop   # remove it again

The dataset is curated, not random

A small, fully deterministic e-commerce schema (400 customers / 120 products / 3,000 orders / ~7,400 order items / 900 reviews) engineered so the pivotal tools all have something real to find:

  • orders.customer_id is an FK with no index: analyze_query_plan shows the sequential scan, and recommend_indexes genuinely flags the table (the sibling tables do carry FK indexes, so it reads as a finding, not a theme).
  • customers.email / customers.phone bait find_sensitive_columns; a camelCase reviews."reviewSource" column trips lint_naming_conventions.
  • Review prose is per-product-type plausible — feature phrases are keyed to the product type, so a robot vacuum gets complaints about battery life and a yoga mat about grip, never the reverse. "battery life" recurs across battery-having products on purpose: it's the walkthrough's canonical full_text_search query.
  • products.embedding (pgvector, 8-dim, deterministic) is added only when the vector extension is already installed — the seeder never creates extensions; everything else works without it.
  • Order dates skew recent (growth curve) and customer activity is heavy-tailed, so time-window and top-N questions return dashboard-shaped answers.
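As a minimal sketch of how the per-product-type review prose can stay both plausible and deterministic: the phrase table, the `make_review` helper, and the seeding scheme below are hypothetical illustrations, not the seeder's actual code.

```python
import random

# Hypothetical phrase table: the real product types and phrases are not
# listed in this PR, so these entries are assumptions for illustration.
FEATURES_BY_TYPE = {
    "robot_vacuum": ["battery life", "suction power"],
    "yoga_mat": ["grip", "cushioning"],
}


def make_review(product_type: str, review_id: int, seed: int = 42) -> str:
    """Build one review sentence using only features valid for the type."""
    # A private Random seeded from (seed, review_id) makes each row
    # reproducible regardless of the order in which rows are generated.
    rng = random.Random(seed * 1_000_003 + review_id)
    feature = rng.choice(FEATURES_BY_TYPE[product_type])
    verdict = rng.choice(["great", "disappointing"])
    return f"The {feature} is {verdict}."
```

Keying the phrase list on the product type is what rules out a yoga mat being reviewed for its battery life; seeding per row rather than sharing one global generator keeps the output stable even if generation order changes.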

Safety

  • The whole seed is one transaction — a mid-seed failure leaves nothing behind.
  • Re-seeding refuses rather than clobbers ("run mcpg --demo-drop first").
  • --demo-drop checks the ownership marker (schema comment) and refuses to drop a schema MCPg didn't create; dropping a non-existent schema is a no-op, not an error.
  • CLI-only surface — no new MCP tools; the tool-surface snapshot and outputSchema contract manifests are untouched.
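The refusal rules above can be sketched as pure decision functions. This is an illustration under stated assumptions: the marker text, `plan_seed`, and `plan_drop` are hypothetical names, and `DemoError` is redefined here only so the sketch is self-contained.

```python
from typing import Optional

# Hypothetical marker text: the PR only says the marker is a schema comment.
OWNERSHIP_MARKER = "created by mcpg --demo"


class DemoError(Exception):
    """Stand-in for the error the CLI maps to exit code 1."""


def plan_seed(schema_exists: bool) -> str:
    """Refuse to re-seed instead of clobbering an existing schema."""
    if schema_exists:
        raise DemoError("mcpg_demo already exists; run mcpg --demo-drop first")
    return "seed"


def plan_drop(schema_exists: bool, schema_comment: Optional[str]) -> str:
    """Decide what --demo-drop should do for the given schema state."""
    if not schema_exists:
        return "noop"  # dropping a missing schema is not an error
    if schema_comment is None or OWNERSHIP_MARKER not in schema_comment:
        raise DemoError("refusing to drop a schema MCPg did not create")
    return "drop"
```

Separating the decision from the SQL that executes it makes each refusal path easy to unit test without a database.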

The captured walkthrough (docs/demo.md)

Captured, not written: every output block is a real tool run against the seeded dataset, rendered by tools/generate_demo_walkthrough.py (7 sections: table summary → SQL analytics → slow-query diagnosis → index advisor → FTS → PII/naming audit → graph projection). Because the dataset is deterministic, the numbers in the doc are the numbers users get. tests/integration/test_demo_integration.py pins every planted finding, so the walkthrough can't silently rot when a helper changes.

One non-obvious bit worth flagging for review: the index-advisor section resets pg_stat before replaying the workload. Seeding itself generates ~7,400 FK-check index scans on orders' PK, which drowns the advisor's seq_scan > idx_scan signal. And since a backend's pending stats only flush at transaction end, the wait is a poll (which itself drives flushes), not a sleep. Both the generator and the integration test do this identically.
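The poll-instead-of-sleep wait might look roughly like the helper below. This is a generic sketch, not the generator's code: `poll_until` is a hypothetical name, and in the real flow the predicate would presumably run a short statistics query (for example against `pg_stat_user_tables`, which is an assumption here) and compare seq_scan with idx_scan.

```python
import time
from typing import Callable


def poll_until(
    predicate: Callable[[], bool],
    timeout: float = 5.0,
    interval: float = 0.05,
) -> bool:
    """Call predicate repeatedly until it returns True or the timeout expires.

    Returns True on success and False on timeout. The predicate is always
    evaluated at least once, even with a zero timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Unlike a fixed sleep, each poll does work of its own, and the loop returns as soon as the signal is visible rather than waiting out a guessed delay.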

Docs

README quick-start section, docs/index.md link, CHANGELOG [Unreleased], and roadmap section 17 (shipped).

Test plan

  • 8 new unit tests (tests/unit/test_demo.py): determinism, row counts, referential integrity, order totals = sum of items, unique emails, planted-flaw pinning (the missing index and camelCase column are asserted present in the DDL so nobody "fixes" them), per-type feature plausibility.
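For illustration, the integrity properties named above (unique emails, referential integrity, order totals equal to the sum of items) can be checked over in-memory rows as follows. The row shape and the `check_dataset` helper are hypothetical; the actual tests in tests/unit/test_demo.py may be structured differently.

```python
from decimal import Decimal


def check_dataset(customers, orders, items):
    """Return a list of integrity violations; an empty list means clean."""
    problems = []

    # Unique emails.
    emails = [c["email"] for c in customers]
    if len(emails) != len(set(emails)):
        problems.append("duplicate customer email")

    customer_ids = {c["id"] for c in customers}
    for order in orders:
        # Referential integrity: every order points at a real customer.
        if order["customer_id"] not in customer_ids:
            problems.append(f"order {order['id']} references missing customer")
        # Order total must equal the sum of its line items.
        item_sum = sum(
            (i["quantity"] * i["unit_price"] for i in items
             if i["order_id"] == order["id"]),
            Decimal("0"),
        )
        if item_sum != order["total"]:
            problems.append(f"order {order['id']} total mismatch")
    return problems
```

Using `Decimal` rather than floats keeps the total comparison exact, which matters when the assertion is strict equality.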
  • 3 new CLI tests (tests/unit/test_main.py): --demo seeds and prints the summary + suggested prompts, --demo-drop reports, DemoError → exit 1.
  • 2 integration tests: full lifecycle (seed → verify counts/marker/vector-column parity with pg_extension → re-seed refusal → all planted findings via the real tools → drop → double-drop no-op) and foreign-schema drop refusal. Skipped on the WarehousePG lane (demo targets stock PostgreSQL).
  • Verified end-to-end against a real PostgreSQL 16 in this environment: seeded via the actual mcpg --demo CLI, generated docs/demo.md from live runs, and ran the full suite: 2735 passed, coverage 90.08% (gate: 90%). ruff format --check / ruff check / mypy src/mcpg (strict) / bandit all clean.

Generated by Claude Code

New-user onboarding: `mcpg --demo` seeds a small, deterministic,
deliberately curated e-commerce dataset (400 customers, 120 products,
3,000 orders, 900 reviews) into an mcpg_demo schema in the configured
database, so the first five minutes with MCPg run against data the
tools can actually show off. `mcpg --demo-drop` removes it.

The dataset plants specific teaching moments:
- orders.customer_id is an FK with no index — analyze_query_plan shows
  the seq scan, recommend_indexes catches it
- customers.email/phone bait find_sensitive_columns; a camelCase
  reviews."reviewSource" column trips lint_naming_conventions
- review prose is per-product-type plausible (a yoga mat is never
  praised for its battery life) and FTS-searchable
- products.embedding (pgvector, 8-dim) is added only when the vector
  extension is already installed — never created by the seeder

Safety: single-transaction seed, refuses to touch an existing schema,
and --demo-drop only drops a schema carrying the MCPg ownership marker
comment. CLI-only surface — no new MCP tools, snapshot unchanged.

docs/demo.md is captured, not written: every output block is a real
tool run against the seeded dataset (tools/generate_demo_walkthrough.py
regenerates it), and tests/integration/test_demo_integration.py pins
every planted finding so the walkthrough can't silently rot. Roadmap
row 17.1.

@sourcery-ai sourcery-ai Bot left a comment

Sorry @devopam, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@gemini-code-assist-2 gemini-code-assist-2 Bot left a comment

Contributor

Code Review

This pull request introduces a new onboarding and demo feature for MCPg, allowing users to seed and drop a curated, deterministic e-commerce dataset using the mcpg --demo and mcpg --demo-drop CLI commands. The dataset is specifically engineered with planted flaws (such as an un-indexed foreign key, PII-shaped columns, and naming violations) to showcase the capabilities of MCPg's analysis, indexing, search, and auditing tools. The PR includes comprehensive unit and integration tests, updated documentation, and a script to automatically generate a walkthrough of the demo. No review comments were provided, so there is no additional feedback to address.

@devopam devopam merged commit 5ff68fe into main Jul 3, 2026
19 checks passed
@devopam devopam deleted the claude/demo-dataset branch July 3, 2026 03:05