feat: add gallery examples registry mapping examples to datasets #724
Draft
Adds Pyright type checking to the project with initial coverage of select scripts. Configuration uses 'basic' mode for gradual typing adoption.

Scripts included in type checking:
- scripts/generate_gallery_examples.py (new)
- scripts/build_datapackage.py
- scripts/species.py
- scripts/flights.py
- scripts/income.py
- scripts/us-state-capitals.py

Type safety improvements to scripts/species.py (required to pass checks):
- Add TypedDict definitions for configuration structures (FilterItem, GeographicFilter, ProcessingConfig, Config)
- Add semantic type aliases (ItemId, SpeciesCode, CountyId, FileExtension, ExactExtractOp) for domain clarity
- Add type guard function is_file_extension() for FileExtension validation
- Improve function signatures with complete type annotations
- Add TYPE_CHECKING block for type-only imports

These changes ensure the build passes with Pyright enabled while improving code maintainability and IDE support.
Adds cross-ecosystem registry cataloging ~470 examples from Vega, Vega-Lite, and Altair galleries, tracking which datasets each example uses.

New files:
- _data/gallery_examples.toml: Configuration (URLs, Altair name mappings)
- scripts/generate_gallery_examples.py: Generator (2,289 lines, fully typed)
- gallery_examples.json: Generated output (~470 examples)

When joined with datapackage.json, enables:
- Dataset-first learning (find all examples using specific dataset)
- Curation analytics (dataset coverage matrices, gap analysis)
- High-quality training data for visualization AI/ML systems

Examples are curated by the Vega community to demonstrate essential visualization techniques and design patterns.

Implementation details:
- Handles different spec formats per framework (Vega, Vega-Lite, Altair)
- Normalizes all references to canonical datapackage.json names
- Altair deduplication: Uses method-based syntax (preferred as of Altair 5) when examples exist in both syntax directories (116 cases)
- Temporary name mappings for Altair API (3 mappings, will be removed after Altair PR #3859 lands)
- Comprehensive type safety with TypedDict, Protocols, semantic type aliases
- Protocol-based validation infrastructure for extensibility

Runtime: ~15 seconds to collect all examples
Quality: All checks pass (taplo, ruff, pyright, npm build)
Altair PR #3859 (merged 2025-10-26) migrated from vega_datasets package to altair.datasets module with canonical vega-datasets naming. This updates the gallery examples collection to track Altair v6+ main branch.

Changes:
- Empty [altair.name_mapping] section (was: londonBoroughs → london_boroughs)
- Comments now document legacy v5.x support instead of temporary workaround
- Add pattern for fully qualified altair.datasets.data.X.url syntax
- Refactor extract_altair_api_datasets() with explicit name_mapping parameter
- Regenerate gallery_examples.json (470 examples, all with canonical names)

Type safety improvements:
- extract_altair_api_datasets() now accepts name_mapping as parameter instead of accessing global _config directly
- Explicit None default for Altair v6+ (no mapping needed)
- Better testability and separation of concerns

Backward compatibility:
- Mapping section preserved (empty) with documentation for v5.x users
- Historical camelCase examples commented out for reference
- Function signature supports both v5 (with mapping) and v6 (without)

Configuration notes:
- Currently tracks Altair main branch (v6+ development)
- Git ref hardcoded in Python script (line 1135) - documented in TOML
- Stability note added: consider pinning to release tag when v6.0.0 available
- Testing procedure documented for v5.x regression testing

All three galleries (Vega, Vega-Lite, Altair) now use consistent canonical dataset naming from datapackage.json.

Related: vega/altair#3859

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This was referenced Oct 27, 2025
domoritz pushed a commit to vega/vega-lite that referenced this pull request on Nov 19, 2025
Fixes broken URL found during vega-datasets link checking (see vega/vega-datasets#724)

The original Observable notebook by @manzt no longer exists at the previous URL. Update to link to his Vega/Vega-Lite examples collection instead, which provides proper attribution without broken links.

Co-authored-by: Claude <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- gallery_examples.json now contains data-only array (no metadata wrapper)
- Metadata (description, schema, sources, licenses) moved to datapackage.json
- build_datapackage.py conditionally includes gallery_examples resource
- Added 9-field schema in datapackage_additions.toml

Per Data Package Standard v2: data files contain only data; metadata belongs in datapackage.json as a resource entry.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nsforms Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…geopath, geojson) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nest, treelinks) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pattern, cross) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n headers Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ction

Regenerated with 54 unique technique tags (up from 33), including new layout:*, geo:*, and transform:* patterns. Vega zero-technique examples reduced from 6 to 2 (Connected Scatter Plot and Timelines).

Also adds ruff linting relaxation for test files (pyproject.toml).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 5 safe, unambiguous patterns for Altair's methods syntax API:
- .stack() method → transform:stack (+18 detections)
- mark_arc / "arc" → mark:arc (new tag, +23 detections)
- alt.Row() / alt.Column() → composition:facet (+44 detections)
- bin=True / .bin() → transform:bin (+27 detections)
- count() shorthand → transform:aggregate (+54 detections)

Altair zero-technique examples reduced from 64 to 40. Total unique techniques: 55 (up from 54).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
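As a rough illustration of how pattern-based detection of this kind can work, the sketch below pairs each of the five tags from the commit message with a regular expression. The tag names come from the commit; the regular expressions and the `detect_techniques` helper are hypothetical and may not match what scripts/generate_gallery_examples.py does.

```python
import re

# Tag names are taken from the commit message above; every regex here
# is a guess written for illustration.
ALTAIR_PATTERNS: list[tuple[re.Pattern[str], str]] = [
    (re.compile(r"\.stack\("), "transform:stack"),
    (re.compile(r"mark_arc\b|\"arc\""), "mark:arc"),
    (re.compile(r"\balt\.(?:Row|Column)\("), "composition:facet"),
    (re.compile(r"\bbin=True\b|\.bin\("), "transform:bin"),
    (re.compile(r"\bcount\(\)"), "transform:aggregate"),
]


def detect_techniques(source: str) -> list[str]:
    """Return the sorted, de-duplicated tags whose pattern occurs in `source`."""
    return sorted({tag for pattern, tag in ALTAIR_PATTERNS if pattern.search(source)})


pie_source = 'alt.Chart(df).mark_arc().encode(theta="count()", row=alt.Row("origin"))'
pie_tags = detect_techniques(pie_source)
```

Text matching of this sort is cheap and works on Python source without executing it, at the cost of possible false positives, which is presumably why the commit restricts itself to patterns it calls safe and unambiguous.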
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This proposal adds a gallery examples registry (`gallery_examples.json`) to vega-datasets that maps ~470 visualization examples from the Vega, Vega-Lite, and Altair galleries to their underlying datasets and techniques. Combined with `datapackage.json`, this creates a queryable knowledge base covering the entire Vega visualization ecosystem.

Note: The visualization taxonomy and schema are in draft form and will benefit from expert input. Please see the open questions at the end.
How It Works
The registry is auto-generated by scraping all three official galleries:
Current State: `datapackage.json`

The existing `datapackage.json` follows the Data Package Standard and describes 73 datasets in the `/data/` directory:

```json
{
  "name": "vega-datasets",
  "version": "3.2.1",
  "description": "Common repository for example datasets used by Vega related projects...",
  "resources": [
    {
      "name": "airports",
      "type": "table",
      "path": "airports.csv",
      "description": "Airports in the United States...",
      "schema": {
        "fields": [
          {"name": "iata", "type": "string"},
          {"name": "name", "type": "string"},
          {"name": "city", "type": "string"},
          {"name": "state", "type": "string"},
          {"name": "latitude", "type": "number"},
          {"name": "longitude", "type": "number"}
        ]
      },
      "sources": [{"title": "Federal Aviation Administration", "path": "..."}],
      "licenses": [{"name": "other-open", "title": "..."}]
    },
    // ... 72 more resources
  ]
}
```

What it provides: Schema, sources, licenses, descriptions for each dataset.
What's missing: How are these datasets actually used in visualizations?
Proposed Addition: `gallery_examples.json`

A new meta-resource in the repository root containing an array of 470 gallery examples:

```json
[
  {
    "id": 1,
    "gallery_name": "altair",
    "example_name": "Atmospheric CO2 Concentration",
    "example_url": "https://altair-viz.github.io/gallery/co2_concentration.html",
    "spec_url": "https://raw.githubusercontent.com/.../co2_concentration.py",
    "categories": ["Case Studies"],
    "description": "A fully developed line chart using window transformation...",
    "datasets": ["co2_concentration"],
    "techniques": ["transform:window", "composition:layer"]
  },
  // ... 469 more examples
]
```

Field Reference

| Field | Notes |
|-------|-------|
| `id` | |
| `gallery_name` | `"vega"`, `"vega-lite"`, or `"altair"` |
| `example_name` | |
| `example_url` | |
| `spec_url` | |
| `categories` | |
| `description` | |
| `datasets` | references `resource.name` in datapackage |
| `techniques` | e.g., `transform:filter` |

Technique Taxonomy
| Category | Examples |
|----------|----------|
| `transform:*` | `filter`, `aggregate`, `window`, `calculate`, `fold` |
| `composition:*` | `layer`, `facet`, `concat`, `repeat` |
| `interaction:*` | `param`, `selection`, `binding`, `conditional` |
| `geo:*` | `projection`, `graticule` |

Integration with `datapackage.json`

The gallery examples registry is added as a resource in `datapackage.json`:

```json
{
  "name": "gallery_examples",
  "type": "json",
  "path": "gallery_examples.json",
  "description": "Cross-reference catalog mapping gallery examples to vega-datasets resources...",
  "schema": {
    "fields": [
      {"name": "id", "type": "integer", "description": "Unique sequential identifier"},
      {"name": "gallery_name", "type": "string", "constraints": {"enum": ["vega", "vega-lite", "altair"]}},
      {"name": "example_name", "type": "string"},
      {"name": "datasets", "type": "array", "description": "References resource.name in this package"},
      {"name": "techniques", "type": "array"}
      // ... additional fields
    ]
  },
  "sources": [
    {"title": "Vega Gallery", "path": "https://vega.github.io/vega/examples/"},
    {"title": "Vega-Lite Gallery", "path": "https://vega.github.io/vega-lite/examples/"},
    {"title": "Altair Gallery", "path": "https://altair-viz.github.io/gallery/"}
  ]
}
```

Result: 73 datasets → 74 resources (datasets + gallery registry)
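Since `datasets` references `resource.name`, the two files can be joined with a few lines of Python. The sketch below is one possible way to do it, not part of the PR; the helper names `examples_by_dataset` and `join_resources` are hypothetical, and the inline dictionaries are abbreviated stand-ins for the parsed JSON files.

```python
from collections import defaultdict

# Abbreviated stand-ins for the parsed files (only the keys used here).
datapackage = {
    "resources": [
        {"name": "airports"},
        {"name": "co2_concentration"},
        {"name": "gallery_examples"},
    ]
}
gallery_examples = [
    {
        "id": 1,
        "gallery_name": "altair",
        "example_name": "Atmospheric CO2 Concentration",
        "datasets": ["co2_concentration"],
        "techniques": ["transform:window", "composition:layer"],
    },
]


def examples_by_dataset(examples):
    """Map each dataset name to the names of the examples that use it."""
    index = defaultdict(list)
    for example in examples:
        for dataset in example["datasets"]:
            index[dataset].append(example["example_name"])
    return dict(index)


def join_resources(package, examples):
    """Attach example names to every dataset resource, skipping the registry itself."""
    index = examples_by_dataset(examples)
    return [
        {"name": res["name"], "examples": index.get(res["name"], [])}
        for res in package["resources"]
        if res["name"] != "gallery_examples"
    ]


joined = join_resources(datapackage, gallery_examples)
```

With the real files, the two dictionaries would come from `json.load()` on `datapackage.json` and `gallery_examples.json`; resources with an empty `examples` list are the datasets no gallery example uses.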
Use Cases: AI-Assisted Discovery
The following examples show how an AI coding assistant (Claude, Copilot, Cursor) can query these files to provide grounded, accurate responses.
Use Case 1: Learning Paths (Technique-First Discovery)
User Prompt:
Agent Query:
Result:
```json
[
  {"name": "Atmospheric CO2 Concentration", "gallery": "altair", "datasets": ["co2_concentration"]},
  {"name": "Cumulative Wikipedia Donations", "gallery": "altair", "datasets": ["co2_concentration"]},
  {"name": "Layer Line Chart with Dual Axis", "gallery": "altair", "datasets": ["seattle_weather"]},
  {"name": "Layered Plot with Dual-Axis", "gallery": "altair", "datasets": ["seattle_weather"]},
  {"name": "Normalized Stacked Area Chart", "gallery": "altair", "datasets": ["iowa_electricity"]}
]
```

Takeaway: 44 examples use window transforms. The agent can link directly to working examples and explain that window transforms work best with temporal/sequential data (CO2 readings, weather records).
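A technique-first lookup like this one could be expressed in Python roughly as follows. This is a sketch rather than the query the agent actually ran; `find_by_technique` is a hypothetical helper, and the `gapminder` dataset name in the second sample record is an assumption.

```python
# Two abbreviated records; the first mirrors the registry sample shown earlier.
examples = [
    {
        "example_name": "Atmospheric CO2 Concentration",
        "gallery_name": "altair",
        "datasets": ["co2_concentration"],
        "techniques": ["transform:window", "composition:layer"],
    },
    {
        "example_name": "Bubble Plot (Gapminder)",
        "gallery_name": "vega-lite",
        "datasets": ["gapminder"],  # assumed dataset name
        "techniques": [],
    },
]


def find_by_technique(records, technique, limit=None):
    """Return name/gallery/datasets for records tagged with `technique`."""
    hits = [
        {
            "name": rec["example_name"],
            "gallery": rec["gallery_name"],
            "datasets": rec["datasets"],
        }
        for rec in records
        if technique in rec["techniques"]
    ]
    return hits if limit is None else hits[:limit]


window_examples = find_by_technique(examples, "transform:window", limit=5)
```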
Use Case 2: Dataset Onramp (What Can I Build?)
User Prompt:
Agent Query:
Result:
```json
[
  {"name": "Gapminder Bubble Plot", "gallery": "altair", "techniques": ["interaction:param"]},
  {"name": "Scatter plot with point paths on hover", "gallery": "altair", "techniques": ["interaction:param", "interaction:conditional"]},
  {"name": "Global Development", "gallery": "vega", "techniques": ["interaction:param", "interaction:binding"]},
  {"name": "Bubble Plot (Gapminder)", "gallery": "vega-lite", "techniques": []},
  {"name": "Interactive scatter plot of global health statistics", "gallery": "vega-lite", "techniques": ["interaction:param", "interaction:binding"]}
]
```

Takeaway: The agent can explain: "Gapminder has temporal (year), categorical (country, cluster), and multiple quantitative fields (pop, life_expect, fertility), perfect for the famous Hans Rosling animated bubble chart. Here are 5 working examples across all three libraries."
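For the dataset-first direction, grouping matches by gallery makes the "across all three libraries" point easy to see. The sketch below is illustrative only: `examples_for_dataset` is a hypothetical helper, and the `gapminder` dataset name attached to the sample records is an assumption, since the result above does not show the `datasets` field.

```python
# Sample records named after the result above; dataset names are assumed.
examples = [
    {"example_name": "Gapminder Bubble Plot", "gallery_name": "altair", "datasets": ["gapminder"]},
    {"example_name": "Global Development", "gallery_name": "vega", "datasets": ["gapminder"]},
    {"example_name": "Bubble Plot (Gapminder)", "gallery_name": "vega-lite", "datasets": ["gapminder"]},
    {"example_name": "Atmospheric CO2 Concentration", "gallery_name": "altair", "datasets": ["co2_concentration"]},
]


def examples_for_dataset(records, dataset):
    """Group the names of examples that use `dataset` by gallery."""
    grouped = {}
    for rec in records:
        if dataset in rec["datasets"]:
            grouped.setdefault(rec["gallery_name"], []).append(rec["example_name"])
    return grouped


gapminder_by_gallery = examples_for_dataset(examples, "gapminder")
```

Taking `len()` of each list gives per-gallery counts for a dataset.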
Use Case 3: Rosetta Stone (Same Visualization, Three Libraries)
User Prompt:
Agent Query:
Result:
```json
[
  {"gallery": "altair", "count": 19, "examples": ["2D Histogram Heatmap", "Binned Scatterplot", "Boxplot with Min/Max Whiskers"]},
  {"gallery": "vega", "count": 6, "examples": ["Car Horsepower", "Connected Scatter Plot", "Interactive Legend"]},
  {"gallery": "vega-lite", "count": 28, "examples": ["2D Histogram Heatmap", "Aggregate Bar Chart (Sorted)", "Bar Chart Highlighting Values"]}
]
```

Takeaway: 53 examples use the `cars` dataset across all three abstraction levels. The agent can show the same scatter plot in Altair (~10 lines of Python), Vega-Lite (~20 lines of JSON), and Vega (~100 lines of JSON), demonstrating the Grammar of Graphics abstraction ladder concretely.

Use Case 4: Contextual Learning (Just-In-Time Help)
User Prompt:
Agent Query:
Result:
```json
[
  {"name": "Brushing Scatter Plot to Show Data on a Table", "url": "https://altair-viz.github.io/gallery/scatter_with_table.html", "gallery": "altair"},
  {"name": "Multi-panel Scatter Plot with Linked Brushing", "url": "https://altair-viz.github.io/gallery/scatter_linked_table.html", "gallery": "altair"},
  {"name": "Interactive Rectangular Brush", "url": "https://altair-viz.github.io/gallery/interactive_brush.html", "gallery": "altair"},
  {"name": "Rectangular Brush", "url": "https://vega.github.io/vega-lite/examples/selection_brush.html", "gallery": "vega-lite"},
  {"name": "Scatterplot Pan & Zoom", "url": "https://vega.github.io/vega-lite/examples/selection_pan_zoom.html", "gallery": "vega-lite"}
]
```

Takeaway: The agent provides working examples using the exact dataset the user is already working with. No hallucinated field names, no guessing, just grounded responses with direct links.
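A contextual lookup of this kind combines both filters: the dataset the user already has loaded and a family of techniques. The sketch below shows one way to write it; `find_examples` is a hypothetical helper, and the `cars` dataset and `interaction:param` tags on the two brush records are assumptions added to make the sample runnable (the names and URLs are copied from this page).

```python
# Names and URLs are copied from this page; the datasets/techniques on the
# two brush examples are assumptions.
examples = [
    {
        "example_name": "Interactive Rectangular Brush",
        "example_url": "https://altair-viz.github.io/gallery/interactive_brush.html",
        "gallery_name": "altair",
        "datasets": ["cars"],
        "techniques": ["interaction:param"],
    },
    {
        "example_name": "Rectangular Brush",
        "example_url": "https://vega.github.io/vega-lite/examples/selection_brush.html",
        "gallery_name": "vega-lite",
        "datasets": ["cars"],
        "techniques": ["interaction:param"],
    },
    {
        "example_name": "Atmospheric CO2 Concentration",
        "example_url": "https://altair-viz.github.io/gallery/co2_concentration.html",
        "gallery_name": "altair",
        "datasets": ["co2_concentration"],
        "techniques": ["transform:window", "composition:layer"],
    },
]


def find_examples(records, dataset, technique_prefix):
    """Return examples that use `dataset` and have a tag starting with `technique_prefix`."""
    return [
        {"name": rec["example_name"], "url": rec["example_url"], "gallery": rec["gallery_name"]}
        for rec in records
        if dataset in rec["datasets"]
        and any(tag.startswith(technique_prefix) for tag in rec["techniques"])
    ]


interactive_cars = find_examples(examples, "cars", "interaction:")
```

Matching on a prefix such as `interaction:` relies on the `category:name` shape of the technique tags, so it would need revisiting if the taxonomy changes.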
Use Case 5: Technique Popularity (Learning Prioritization)
User Prompt:
Agent Query:
Result:
```json
[
  {"technique": "interaction:param", "count": 142},
  {"technique": "transform:filter", "count": 128},
  {"technique": "transform:aggregate", "count": 112},
  {"technique": "transform:calculate", "count": 108},
  {"technique": "composition:layer", "count": 82},
  {"technique": "interaction:binding", "count": 65},
  {"technique": "composition:facet", "count": 52},
  {"technique": "interaction:conditional", "count": 49},
  {"technique": "transform:window", "count": 44},
  {"technique": "interaction:selection", "count": 38}
]
```

Takeaway: Data-driven curriculum: Master these 6 techniques and you'll understand 70%+ of gallery examples:

1. `interaction:param` - foundation of interactivity
2. `transform:filter` - most common transform
3. `transform:aggregate` - grouping/summarizing
4. `transform:calculate` - computed fields
5. `composition:layer` - multi-mark overlays
6. `interaction:binding` - input widgets

Use Case 6: Coverage Gaps (Contribution Guide)
User Prompt:
Agent Query:
Result:
```json
[
  {"dataset": "annual_precip", "count": 1},
  {"dataset": "earthquakes", "count": 1},
  {"dataset": "flights_3m", "count": 1},
  {"dataset": "football", "count": 1},
  {"dataset": "income", "count": 1},
  {"dataset": "miserables", "count": 1},
  {"dataset": "normal_2d", "count": 1},
  {"dataset": "ohlc", "count": 1},
  {"dataset": "political_contributions", "count": 1},
  {"dataset": "volcano", "count": 1}
]
```

Takeaway: 10 datasets have only 1 example each. Contributors can focus here to improve coverage. Compare to `cars` (53 examples) or `stocks` (15 examples) to see what good coverage looks like.

Why This Matters
For Learners
For AI Coding Assistants
For Tool Builders
Feedback Requested: Technique Taxonomy and Schema
This PR auto-generates a registry of ~470 gallery examples with detected techniques. Before finalizing:
1. Technique taxonomy: Do `transform:*`, `composition:*`, `interaction:*`, and `geo:*` capture the right conceptual categories? Are any missing (e.g., `encoding:*`, `data:*`)?
2. Technique granularity: Is `transform:window` the right level, or should we go deeper (e.g., `transform:window:cumulative`) or stay flatter?
3. Field schema: Are `datasets`, `techniques`, and `categories` the right cross-reference points, or would additional fields (e.g., `marks`, `encodings`) be more useful for discovery?

References