
feat: add gallery examples registry mapping examples to datasets#724

Draft
dsmedia wants to merge 20 commits into vega:main from dsmedia:feat/generate-gallery-examples

Conversation


@dsmedia dsmedia commented Oct 26, 2025

This proposal adds a gallery examples registry (gallery_examples.json) to vega-datasets that maps ~470 visualization examples from the Vega, Vega-Lite, and Altair galleries to their underlying datasets and techniques. Combined with datapackage.json, this creates a queryable knowledge base covering the entire Vega visualization ecosystem for:

  • Learners discovering visualization patterns
  • AI coding assistants grounding responses in real examples
  • Tool builders creating dataset-aware recommendations

Note: The visualization taxonomy and schema are in draft form and will benefit from expert input. Please see the open questions at the end.


How It Works

The registry is auto-generated by scraping all three official galleries:

# Generate the example registry (fetches ~470 specs)
uv run scripts/generate_gallery_examples.py

# Rebuild datapackage.json to include the new resource
uv run scripts/build_datapackage.py
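The core of the generator is mapping each scraped spec back to the datasets it references. A minimal sketch of that detection step (`DATASET_RE` and `detect_datasets` are illustrative names, not the script's actual API; the real script handles more reference formats per framework):

```python
import re

# Matches vega-datasets references like ".../data/cars.json" or "data/stocks.csv"
# in a raw spec's source text. Illustrative pattern only.
DATASET_RE = re.compile(
    r"data/([A-Za-z0-9_-]+)\.(?:csv|json|tsv|topojson|parquet|arrow)"
)

def detect_datasets(spec_text: str) -> list[str]:
    """Return sorted, de-duplicated dataset names referenced in a spec."""
    return sorted(set(DATASET_RE.findall(spec_text)))

spec = '{"data": {"url": "https://cdn.jsdelivr.net/npm/vega-datasets@3/data/cars.json"}}'
print(detect_datasets(spec))  # ['cars']
```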

Current State: datapackage.json

The existing datapackage.json follows the Data Package Standard and describes 73 datasets in the /data/ directory:

{
  "name": "vega-datasets",
  "version": "3.2.1",
  "description": "Common repository for example datasets used by Vega related projects...",
  "resources": [
    {
      "name": "airports",
      "type": "table",
      "path": "airports.csv",
      "description": "Airports in the United States...",
      "schema": {
        "fields": [
          {"name": "iata", "type": "string"},
          {"name": "name", "type": "string"},
          {"name": "city", "type": "string"},
          {"name": "state", "type": "string"},
          {"name": "latitude", "type": "number"},
          {"name": "longitude", "type": "number"}
        ]
      },
      "sources": [{"title": "Federal Aviation Administration", "path": "..."}],
      "licenses": [{"name": "other-open", "title": "..."}]
    },
    // ... 72 more resources
  ]
}

What it provides: Schema, sources, licenses, descriptions for each dataset.

What's missing: How are these datasets actually used in visualizations?


Proposed Addition: gallery_examples.json

A new meta-resource in the repository root containing an array of 470 gallery examples:

[
  {
    "id": 1,
    "gallery_name": "altair",
    "example_name": "Atmospheric CO2 Concentration",
    "example_url": "https://altair-viz.github.io/gallery/co2_concentration.html",
    "spec_url": "https://raw.githubusercontent.com/.../co2_concentration.py",
    "categories": ["Case Studies"],
    "description": "A fully developed line chart using window transformation...",
    "datasets": ["co2_concentration"],
    "techniques": ["transform:window", "composition:layer"]
  },
  // ... 469 more examples
]

Field Reference

| Field | Type | Description |
| --- | --- | --- |
| `id` | integer | Unique sequential identifier |
| `gallery_name` | string | `"vega"`, `"vega-lite"`, or `"altair"` |
| `example_name` | string | Human-readable title |
| `example_url` | string | Link to rendered example |
| `spec_url` | string | Link to source spec/code |
| `categories` | array | Gallery categories (e.g., "Bar Charts") |
| `description` | string | What the example demonstrates |
| `datasets` | array | Dataset names (references `resource.name` in datapackage) |
| `techniques` | array | Detected techniques (e.g., `transform:filter`) |

Technique Taxonomy

| Category | Examples | Maps to Vega-Lite |
| --- | --- | --- |
| `transform:*` | filter, aggregate, window, calculate, fold | Transforms |
| `composition:*` | layer, facet, concat, repeat | View Composition |
| `interaction:*` | param, selection, binding, conditional | Parameters |
| `geo:*` | projection, graticule | Geographic |
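Because every tag is prefixed with its category, tags roll up to categories by splitting on the colon. A small sketch (the sample records are made up for illustration, not taken from the registry):

```python
from collections import Counter

# Hypothetical sample records shaped like gallery_examples.json entries.
examples = [
    {"techniques": ["transform:window", "composition:layer"]},
    {"techniques": ["transform:filter", "interaction:param"]},
    {"techniques": ["geo:projection", "transform:filter"]},
]

# Count tags per top-level category (the prefix before ":").
by_category = Counter(
    tag.split(":", 1)[0] for ex in examples for tag in ex["techniques"]
)
print(dict(by_category))  # {'transform': 3, 'composition': 1, 'interaction': 1, 'geo': 1}
```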

Integration with datapackage.json

The gallery examples registry is added as a resource in datapackage.json:

{
  "name": "gallery_examples",
  "type": "json",
  "path": "gallery_examples.json",
  "description": "Cross-reference catalog mapping gallery examples to vega-datasets resources...",
  "schema": {
    "fields": [
      {"name": "id", "type": "integer", "description": "Unique sequential identifier"},
      {"name": "gallery_name", "type": "string", "constraints": {"enum": ["vega", "vega-lite", "altair"]}},
      {"name": "example_name", "type": "string"},
      {"name": "datasets", "type": "array", "description": "References resource.name in this package"},
      {"name": "techniques", "type": "array"}
      // ... additional fields
    ]
  },
  "sources": [
    {"title": "Vega Gallery", "path": "https://vega.github.io/vega/examples/"},
    {"title": "Vega-Lite Gallery", "path": "https://vega.github.io/vega-lite/examples/"},
    {"title": "Altair Gallery", "path": "https://altair-viz.github.io/gallery/"}
  ]
}

Result: 73 datasets → 74 resources (datasets + gallery registry)
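Because each `datasets` entry references a `resource.name`, the cross-reference can be validated mechanically. A hedged sketch (inline stand-ins replace the real files; in practice you would `json.load` gallery_examples.json and datapackage.json, and the second example here is deliberately fabricated to show a failure):

```python
import json

# Inline stand-ins for the two files.
gallery_examples = json.loads("""[
  {"id": 1, "example_name": "Atmospheric CO2 Concentration",
   "datasets": ["co2_concentration"]},
  {"id": 2, "example_name": "Hypothetical Example",
   "datasets": ["not_a_dataset"]}
]""")
datapackage = {"resources": [{"name": "co2_concentration"}, {"name": "cars"}]}

known = {r["name"] for r in datapackage["resources"]}

# Every dataset an example cites must exist as a resource in datapackage.json.
dangling = {
    (ex["id"], ds)
    for ex in gallery_examples
    for ds in ex["datasets"]
    if ds not in known
}
print(dangling)  # {(2, 'not_a_dataset')}
```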


Use Cases: AI-Assisted Discovery

The following examples show how an AI coding assistant (Claude, Copilot, Cursor) can query these files to provide grounded, accurate responses.


Use Case 1: Learning Paths (Technique-First Discovery)

User Prompt:

"I want to learn how to use window transforms in Vega-Lite. Can you show me some examples?"

Agent Query:

jq '[.[] | select(.techniques | contains(["transform:window"]))] |
    map({name: .example_name, gallery: .gallery_name, datasets: .datasets}) |
    .[0:5]' gallery_examples.json

Result:

[
  {"name": "Atmospheric CO2 Concentration", "gallery": "altair", "datasets": ["co2_concentration"]},
  {"name": "Cumulative Wikipedia Donations", "gallery": "altair", "datasets": ["co2_concentration"]},
  {"name": "Layer Line Chart with Dual Axis", "gallery": "altair", "datasets": ["seattle_weather"]},
  {"name": "Layered Plot with Dual-Axis", "gallery": "altair", "datasets": ["seattle_weather"]},
  {"name": "Normalized Stacked Area Chart", "gallery": "altair", "datasets": ["iowa_electricity"]}
]

Takeaway: 44 examples use window transforms. The agent can link directly to working examples and explain that window transforms work best with temporal/sequential data (CO2 readings, weather records).


Use Case 2: Dataset Onramp (What Can I Build?)

User Prompt:

"I found the gapminder dataset. What visualizations can I make with it?"

Agent Query:

# Query 1: Find examples
jq '[.[] | select(.datasets | contains(["gapminder"]))] |
    map({name: .example_name, gallery: .gallery_name, techniques: .techniques})' \
    gallery_examples.json

# Query 2: Get schema context
jq '.resources[] | select(.name == "gapminder") |
    {fields: [.schema.fields[].name], description: .description[0:150]}' \
    datapackage.json

Result:

[
  {"name": "Gapminder Bubble Plot", "gallery": "altair", "techniques": ["interaction:param"]},
  {"name": "Scatter plot with point paths on hover", "gallery": "altair", "techniques": ["interaction:param", "interaction:conditional"]},
  {"name": "Global Development", "gallery": "vega", "techniques": ["interaction:param", "interaction:binding"]},
  {"name": "Bubble Plot (Gapminder)", "gallery": "vega-lite", "techniques": []},
  {"name": "Interactive scatter plot of global health statistics", "gallery": "vega-lite", "techniques": ["interaction:param", "interaction:binding"]}
]

Takeaway: The agent can explain: "Gapminder has temporal (year), categorical (country, cluster), and multiple quantitative fields (pop, life_expect, fertility)—perfect for the famous Hans Rosling animated bubble chart. Here are 5 working examples across all three libraries."


Use Case 3: Rosetta Stone (Same Visualization, Three Libraries)

User Prompt:

"I know Altair but need to write Vega-Lite. Show me the same visualization in both."

Agent Query:

jq '[.[] | select(.datasets | contains(["cars"]))] |
    group_by(.gallery_name) |
    map({gallery: .[0].gallery_name, count: length,
         examples: [.[0:3] | .[].example_name]})' \
    gallery_examples.json

Result:

[
  {"gallery": "altair", "count": 19, "examples": ["2D Histogram Heatmap", "Binned Scatterplot", "Boxplot with Min/Max Whiskers"]},
  {"gallery": "vega", "count": 6, "examples": ["Car Horsepower", "Connected Scatter Plot", "Interactive Legend"]},
  {"gallery": "vega-lite", "count": 28, "examples": ["2D Histogram Heatmap", "Aggregate Bar Chart (Sorted)", "Bar Chart Highlighting Values"]}
]

Takeaway: 53 examples use the cars dataset across all three abstraction levels. The agent can show the same scatter plot in Altair (~10 lines of Python), Vega-Lite (~20 lines of JSON), and Vega (~100 lines of JSON)—demonstrating the Grammar of Graphics abstraction ladder concretely.


Use Case 4: Contextual Learning (Just-In-Time Help)

User Prompt:

"I'm making a scatter plot with cars data. How do I add brushing/selection?"

Agent Query:

jq '[.[] |
    select(.datasets | contains(["cars"])) |
    select(.techniques | contains(["interaction:selection"]))] |
    map({name: .example_name, url: .example_url, gallery: .gallery_name})' \
    gallery_examples.json

Result:

[
  {"name": "Brushing Scatter Plot to Show Data on a Table", "url": "https://altair-viz.github.io/gallery/scatter_with_table.html", "gallery": "altair"},
  {"name": "Multi-panel Scatter Plot with Linked Brushing", "url": "https://altair-viz.github.io/gallery/scatter_linked_table.html", "gallery": "altair"},
  {"name": "Interactive Rectangular Brush", "url": "https://altair-viz.github.io/gallery/interactive_brush.html", "gallery": "altair"},
  {"name": "Rectangular Brush", "url": "https://vega.github.io/vega-lite/examples/selection_brush.html", "gallery": "vega-lite"},
  {"name": "Scatterplot Pan & Zoom", "url": "https://vega.github.io/vega-lite/examples/selection_pan_zoom.html", "gallery": "vega-lite"}
]

Takeaway: The agent provides working examples using the exact dataset the user is already working with. No hallucinated field names, no guessing—just grounded responses with direct links.


Use Case 5: Technique Popularity (Learning Prioritization)

User Prompt:

"I have limited time. Which Vega-Lite techniques should I learn first?"

Agent Query:

jq '[.[].techniques[]] | group_by(.) |
    map({technique: .[0], count: length}) |
    sort_by(-.count) | .[0:10]' \
    gallery_examples.json

Result:

[
  {"technique": "interaction:param", "count": 142},
  {"technique": "transform:filter", "count": 128},
  {"technique": "transform:aggregate", "count": 112},
  {"technique": "transform:calculate", "count": 108},
  {"technique": "composition:layer", "count": 82},
  {"technique": "interaction:binding", "count": 65},
  {"technique": "composition:facet", "count": 52},
  {"technique": "interaction:conditional", "count": 49},
  {"technique": "transform:window", "count": 44},
  {"technique": "interaction:selection", "count": 38}
]

Takeaway: Data-driven curriculum: Master these 6 techniques and you'll understand 70%+ of gallery examples:

  1. Parameters (interaction:param) - foundation of interactivity
  2. Filtering (transform:filter) - most common transform
  3. Aggregation (transform:aggregate) - grouping/summarizing
  4. Calculations (transform:calculate) - computed fields
  5. Layering (composition:layer) - multi-mark overlays
  6. Bindings (interaction:binding) - input widgets

Use Case 6: Coverage Gaps (Contribution Guide)

User Prompt:

"I want to contribute gallery examples. What datasets need more coverage?"

Agent Query:

jq '[.[].datasets[]] | group_by(.) |
    map({dataset: .[0], count: length}) |
    sort_by(.count) | .[0:10]' \
    gallery_examples.json

Result:

[
  {"dataset": "annual_precip", "count": 1},
  {"dataset": "earthquakes", "count": 1},
  {"dataset": "flights_3m", "count": 1},
  {"dataset": "football", "count": 1},
  {"dataset": "income", "count": 1},
  {"dataset": "miserables", "count": 1},
  {"dataset": "normal_2d", "count": 1},
  {"dataset": "ohlc", "count": 1},
  {"dataset": "political_contributions", "count": 1},
  {"dataset": "volcano", "count": 1}
]

Takeaway: 10 datasets have only 1 example each. Contributors can focus here to improve coverage. Compare to cars (53 examples) or stocks (15 examples) to see what good coverage looks like.


Why This Matters

For Learners

  • Discovery by technique: "Show me examples using window transforms"
  • Discovery by dataset: "What can I build with gapminder?"
  • Complexity gradients: Start simple, level up progressively

For AI Coding Assistants

  • Grounded responses: Real field names, working URLs
  • No hallucination: Recommendations based on actual tested examples
  • Context-aware: Suggest techniques proven to work with specific data shapes

For Tool Builders

  • IDE integrations: Dataset-aware autocomplete
  • Learning platforms: Smart recommendations based on data shape
  • Documentation generators: Auto-link examples to dataset docs

Feedback Requested: Technique Taxonomy and Schema

This PR auto-generates a registry of ~470 gallery examples with detected techniques. Before finalizing:

  1. Technique taxonomy — Does transform:*, composition:*, interaction:*, geo:* capture the right conceptual categories? Missing any (e.g., encoding:*, data:*)?

  2. Technique granularity — Is transform:window the right level, or should we go deeper (e.g., transform:window:cumulative) or stay flatter?

  3. Field schema — Are datasets, techniques, categories the right cross-reference points, or would additional fields (e.g., marks, encodings) be more useful for discovery?


Adds Pyright type checking to the project with initial coverage of select
scripts. Configuration uses 'basic' mode for gradual typing adoption.

Scripts included in type checking:
- scripts/generate_gallery_examples.py (new)
- scripts/build_datapackage.py
- scripts/species.py
- scripts/flights.py
- scripts/income.py
- scripts/us-state-capitals.py

Type safety improvements to scripts/species.py (required to pass checks):
- Add TypedDict definitions for configuration structures (FilterItem,
  GeographicFilter, ProcessingConfig, Config)
- Add semantic type aliases (ItemId, SpeciesCode, CountyId, FileExtension,
  ExactExtractOp) for domain clarity
- Add type guard function is_file_extension() for FileExtension validation
- Improve function signatures with complete type annotations
- Add TYPE_CHECKING block for type-only imports

These changes ensure the build passes with Pyright enabled while improving
code maintainability and IDE support.
Adds cross-ecosystem registry cataloging ~470 examples from Vega, Vega-Lite,
and Altair galleries, tracking which datasets each example uses.

New files:
- _data/gallery_examples.toml: Configuration (URLs, Altair name mappings)
- scripts/generate_gallery_examples.py: Generator (2,289 lines, fully typed)
- gallery_examples.json: Generated output (~470 examples)

When joined with datapackage.json, enables:
- Dataset-first learning (find all examples using specific dataset)
- Curation analytics (dataset coverage matrices, gap analysis)
- High-quality training data for visualization AI/ML systems

Examples are curated by the Vega community to demonstrate essential
visualization techniques and design patterns.

Implementation details:
- Handles different spec formats per framework (Vega, Vega-Lite, Altair)
- Normalizes all references to canonical datapackage.json names
- Altair deduplication: Uses method-based syntax (preferred as of Altair 5)
  when examples exist in both syntax directories (116 cases)
- Temporary name mappings for Altair API (3 mappings, will be removed after
  Altair PR #3859 lands)
- Comprehensive type safety with TypedDict, Protocols, semantic type aliases
- Protocol-based validation infrastructure for extensibility

Runtime: ~15 seconds to collect all examples
Quality: All checks pass (taplo, ruff, pyright, npm build)
@dsmedia dsmedia force-pushed the feat/generate-gallery-examples branch from 823c2b9 to 0815882 Compare October 26, 2025 14:00
Altair PR #3859 (merged 2025-10-26) migrated from vega_datasets package
to altair.datasets module with canonical vega-datasets naming. This
updates the gallery examples collection to track Altair v6+ main branch.

Changes:
- Empty [altair.name_mapping] section (was: londonBoroughs → london_boroughs)
- Comments now document legacy v5.x support instead of temporary workaround
- Add pattern for fully qualified altair.datasets.data.X.url syntax
- Refactor extract_altair_api_datasets() with explicit name_mapping parameter
- Regenerate gallery_examples.json (470 examples, all with canonical names)

Type safety improvements:
- extract_altair_api_datasets() now accepts name_mapping as parameter
  instead of accessing global _config directly
- Explicit None default for Altair v6+ (no mapping needed)
- Better testability and separation of concerns

Backward compatibility:
- Mapping section preserved (empty) with documentation for v5.x users
- Historical camelCase examples commented out for reference
- Function signature supports both v5 (with mapping) and v6 (without)

Configuration notes:
- Currently tracks Altair main branch (v6+ development)
- Git ref hardcoded in Python script (line 1135) - documented in TOML
- Stability note added: consider pinning to release tag when v6.0.0 available
- Testing procedure documented for v5.x regression testing

All three galleries (Vega, Vega-Lite, Altair) now use consistent
canonical dataset naming from datapackage.json.

Related: vega/altair#3859

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
domoritz pushed a commit to vega/vega-lite that referenced this pull request Nov 19, 2025
Fixes broken URL found during vega-datasets link checking (see
vega/vega-datasets#724)

The original Observable notebook by @manzt no longer exists at the
previous URL. Update to link to his Vega/Vega-Lite examples collection
instead, which provides proper attribution without broken links.

Co-authored-by: Claude <noreply@anthropic.com>
dsmedia and others added 5 commits February 3, 2026 18:58
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- gallery_examples.json now contains data-only array (no metadata wrapper)
- Metadata (description, schema, sources, licenses) moved to datapackage.json
- build_datapackage.py conditionally includes gallery_examples resource
- Added 9-field schema in datapackage_additions.toml

Per Data Package Standard v2: data files contain only data;
metadata belongs in datapackage.json as a resource entry.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
dsmedia and others added 8 commits February 5, 2026 13:11
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nsforms

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…geopath, geojson)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nest, treelinks)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pattern, cross)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n headers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ction

Regenerated with 54 unique technique tags (up from 33), including
new layout:*, geo:*, and transform:* patterns. Vega zero-technique
examples reduced from 6 to 2 (Connected Scatter Plot and Timelines).

Also adds ruff linting relaxation for test files (pyproject.toml).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dsmedia and others added 4 commits February 5, 2026 22:18
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 5 safe, unambiguous patterns for Altair's methods syntax API:
- .stack() method → transform:stack (+18 detections)
- mark_arc / "arc" → mark:arc (new tag, +23 detections)
- alt.Row() / alt.Column() → composition:facet (+44 detections)
- bin=True / .bin() → transform:bin (+27 detections)
- count() shorthand → transform:aggregate (+54 detections)

Altair zero-technique examples reduced from 64 to 40.
Total unique techniques: 55 (up from 54).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>