
feat: add gallery examples registry mapping examples to datasets#724

Draft
dsmedia wants to merge 20 commits into vega:main from dsmedia:feat/generate-gallery-examples

Conversation


@dsmedia dsmedia commented Oct 26, 2025

This proposal adds a gallery examples registry (gallery_examples.json) to vega-datasets that maps ~470 visualization examples from the Vega, Vega-Lite, and Altair galleries to their underlying datasets and techniques. Combined with datapackage.json, this creates a queryable knowledge base covering the entire Vega visualization ecosystem for:

  • Learners discovering visualization patterns
  • AI coding assistants grounding responses in real examples
  • Tool builders creating dataset-aware recommendations

Note: The visualization taxonomy and schema are in draft form and will benefit from expert input. Please see the open questions at the end.


How It Works

The registry is auto-generated by scraping all three official galleries:

# Generate the example registry (fetches ~470 specs)
uv run scripts/generate_gallery_examples.py

# Rebuild datapackage.json to include the new resource
uv run scripts/build_datapackage.py
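The core of the generator is mapping each scraped spec back to the datasets it references. A minimal sketch of that detection step (`DATASET_RE` and `detect_datasets` are illustrative names, not the script's actual API; the real script handles more reference formats per framework):

```python
import re

# Matches vega-datasets references like ".../data/cars.json" or "data/stocks.csv"
# in a raw spec's source text. Illustrative pattern only.
DATASET_RE = re.compile(
    r"data/([A-Za-z0-9_-]+)\.(?:csv|json|tsv|topojson|parquet|arrow)"
)

def detect_datasets(spec_text: str) -> list[str]:
    """Return sorted, de-duplicated dataset names referenced in a spec."""
    return sorted(set(DATASET_RE.findall(spec_text)))

spec = '{"data": {"url": "https://cdn.jsdelivr.net/npm/vega-datasets@3/data/cars.json"}}'
print(detect_datasets(spec))  # ['cars']
```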

Current State: datapackage.json

The existing datapackage.json follows the Data Package Standard and describes 73 datasets in the /data/ directory:

{
  "name": "vega-datasets",
  "version": "3.2.1",
  "description": "Common repository for example datasets used by Vega related projects...",
  "resources": [
    {
      "name": "airports",
      "type": "table",
      "path": "airports.csv",
      "description": "Airports in the United States...",
      "schema": {
        "fields": [
          {"name": "iata", "type": "string"},
          {"name": "name", "type": "string"},
          {"name": "city", "type": "string"},
          {"name": "state", "type": "string"},
          {"name": "latitude", "type": "number"},
          {"name": "longitude", "type": "number"}
        ]
      },
      "sources": [{"title": "Federal Aviation Administration", "path": "..."}],
      "licenses": [{"name": "other-open", "title": "..."}]
    },
    // ... 72 more resources
  ]
}

What it provides: Schema, sources, licenses, descriptions for each dataset.

What's missing: How are these datasets actually used in visualizations?


Proposed Addition: gallery_examples.json

A new meta-resource in the repository root containing an array of 470 gallery examples:

[
  {
    "id": 1,
    "gallery_name": "altair",
    "example_name": "Atmospheric CO2 Concentration",
    "example_url": "https://altair-viz.github.io/gallery/co2_concentration.html",
    "spec_url": "https://raw.githubusercontent.com/.../co2_concentration.py",
    "categories": ["Case Studies"],
    "description": "A fully developed line chart using window transformation...",
    "datasets": ["co2_concentration"],
    "techniques": ["transform:window", "composition:layer"]
  },
  // ... 469 more examples
]

Field Reference

| Field | Type | Description |
| --- | --- | --- |
| `id` | integer | Unique sequential identifier |
| `gallery_name` | string | `"vega"`, `"vega-lite"`, or `"altair"` |
| `example_name` | string | Human-readable title |
| `example_url` | string | Link to rendered example |
| `spec_url` | string | Link to source spec/code |
| `categories` | array | Gallery categories (e.g., "Bar Charts") |
| `description` | string | What the example demonstrates |
| `datasets` | array | Dataset names (references `resource.name` in datapackage) |
| `techniques` | array | Detected techniques (e.g., `transform:filter`) |

Technique Taxonomy

| Category | Examples | Maps to Vega-Lite |
| --- | --- | --- |
| `transform:*` | filter, aggregate, window, calculate, fold | Transforms |
| `composition:*` | layer, facet, concat, repeat | View Composition |
| `interaction:*` | param, selection, binding, conditional | Parameters |
| `geo:*` | projection, graticule | Geographic |
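Because every tag is prefixed with its category, tags roll up to categories by splitting on the colon. A small sketch (the sample records are made up for illustration, not taken from the registry):

```python
from collections import Counter

# Hypothetical sample records shaped like gallery_examples.json entries.
examples = [
    {"techniques": ["transform:window", "composition:layer"]},
    {"techniques": ["transform:filter", "interaction:param"]},
    {"techniques": ["geo:projection", "transform:filter"]},
]

# Count tags per top-level category (the prefix before ":").
by_category = Counter(
    tag.split(":", 1)[0] for ex in examples for tag in ex["techniques"]
)
print(dict(by_category))  # {'transform': 3, 'composition': 1, 'interaction': 1, 'geo': 1}
```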

Integration with datapackage.json

The gallery examples registry is added as a resource in datapackage.json:

{
  "name": "gallery_examples",
  "type": "json",
  "path": "gallery_examples.json",
  "description": "Cross-reference catalog mapping gallery examples to vega-datasets resources...",
  "schema": {
    "fields": [
      {"name": "id", "type": "integer", "description": "Unique sequential identifier"},
      {"name": "gallery_name", "type": "string", "constraints": {"enum": ["vega", "vega-lite", "altair"]}},
      {"name": "example_name", "type": "string"},
      {"name": "datasets", "type": "array", "description": "References resource.name in this package"},
      {"name": "techniques", "type": "array"}
      // ... additional fields
    ]
  },
  "sources": [
    {"title": "Vega Gallery", "path": "https://vega.github.io/vega/examples/"},
    {"title": "Vega-Lite Gallery", "path": "https://vega.github.io/vega-lite/examples/"},
    {"title": "Altair Gallery", "path": "https://altair-viz.github.io/gallery/"}
  ]
}

Result: 73 datasets → 74 resources (datasets + gallery registry)
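Because each `datasets` entry references a `resource.name`, the cross-reference can be validated mechanically. A hedged sketch (inline stand-ins replace the real files; in practice you would `json.load` gallery_examples.json and datapackage.json, and the second example here is deliberately fabricated to show a failure):

```python
import json

# Inline stand-ins for the two files.
gallery_examples = json.loads("""[
  {"id": 1, "example_name": "Atmospheric CO2 Concentration",
   "datasets": ["co2_concentration"]},
  {"id": 2, "example_name": "Hypothetical Example",
   "datasets": ["not_a_dataset"]}
]""")
datapackage = {"resources": [{"name": "co2_concentration"}, {"name": "cars"}]}

known = {r["name"] for r in datapackage["resources"]}

# Every dataset an example cites must exist as a resource in datapackage.json.
dangling = {
    (ex["id"], ds)
    for ex in gallery_examples
    for ds in ex["datasets"]
    if ds not in known
}
print(dangling)  # {(2, 'not_a_dataset')}
```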


Use Cases: AI-Assisted Discovery

The following examples show how an AI coding assistant (Claude, Copilot, Cursor) can query these files to provide grounded, accurate responses.


Use Case 1: Learning Paths (Technique-First Discovery)

User Prompt:

"I want to learn how to use window transforms in Vega-Lite. Can you show me some examples?"

Agent Query:

jq '[.[] | select(.techniques | contains(["transform:window"]))] |
    map({name: .example_name, gallery: .gallery_name, datasets: .datasets}) |
    .[0:5]' gallery_examples.json

Result:

[
  {"name": "Atmospheric CO2 Concentration", "gallery": "altair", "datasets": ["co2_concentration"]},
  {"name": "Cumulative Wikipedia Donations", "gallery": "altair", "datasets": ["co2_concentration"]},
  {"name": "Layer Line Chart with Dual Axis", "gallery": "altair", "datasets": ["seattle_weather"]},
  {"name": "Layered Plot with Dual-Axis", "gallery": "altair", "datasets": ["seattle_weather"]},
  {"name": "Normalized Stacked Area Chart", "gallery": "altair", "datasets": ["iowa_electricity"]}
]

Takeaway: 44 examples use window transforms. The agent can link directly to working examples and explain that window transforms work best with temporal/sequential data (CO2 readings, weather records).


Use Case 2: Dataset Onramp (What Can I Build?)

User Prompt:

"I found the gapminder dataset. What visualizations can I make with it?"

Agent Query:

# Query 1: Find examples
jq '[.[] | select(.datasets | contains(["gapminder"]))] |
    map({name: .example_name, gallery: .gallery_name, techniques: .techniques})' \
    gallery_examples.json

# Query 2: Get schema context
jq '.resources[] | select(.name == "gapminder") |
    {fields: [.schema.fields[].name], description: .description[0:150]}' \
    datapackage.json

Result:

[
  {"name": "Gapminder Bubble Plot", "gallery": "altair", "techniques": ["interaction:param"]},
  {"name": "Scatter plot with point paths on hover", "gallery": "altair", "techniques": ["interaction:param", "interaction:conditional"]},
  {"name": "Global Development", "gallery": "vega", "techniques": ["interaction:param", "interaction:binding"]},
  {"name": "Bubble Plot (Gapminder)", "gallery": "vega-lite", "techniques": []},
  {"name": "Interactive scatter plot of global health statistics", "gallery": "vega-lite", "techniques": ["interaction:param", "interaction:binding"]}
]

Takeaway: The agent can explain: "Gapminder has temporal (year), categorical (country, cluster), and multiple quantitative fields (pop, life_expect, fertility)—perfect for the famous Hans Rosling animated bubble chart. Here are 5 working examples across all three libraries."


Use Case 3: Rosetta Stone (Same Visualization, Three Libraries)

User Prompt:

"I know Altair but need to write Vega-Lite. Show me the same visualization in both."

Agent Query:

jq '[.[] | select(.datasets | contains(["cars"]))] |
    group_by(.gallery_name) |
    map({gallery: .[0].gallery_name, count: length,
         examples: [.[0:3] | .[].example_name]})' \
    gallery_examples.json

Result:

[
  {"gallery": "altair", "count": 19, "examples": ["2D Histogram Heatmap", "Binned Scatterplot", "Boxplot with Min/Max Whiskers"]},
  {"gallery": "vega", "count": 6, "examples": ["Car Horsepower", "Connected Scatter Plot", "Interactive Legend"]},
  {"gallery": "vega-lite", "count": 28, "examples": ["2D Histogram Heatmap", "Aggregate Bar Chart (Sorted)", "Bar Chart Highlighting Values"]}
]

Takeaway: 53 examples use the cars dataset across all three abstraction levels. The agent can show the same scatter plot in Altair (~10 lines of Python), Vega-Lite (~20 lines of JSON), and Vega (~100 lines of JSON)—demonstrating the Grammar of Graphics abstraction ladder concretely.


Use Case 4: Contextual Learning (Just-In-Time Help)

User Prompt:

"I'm making a scatter plot with cars data. How do I add brushing/selection?"

Agent Query:

jq '[.[] |
    select(.datasets | contains(["cars"])) |
    select(.techniques | contains(["interaction:selection"]))] |
    map({name: .example_name, url: .example_url, gallery: .gallery_name})' \
    gallery_examples.json

Result:

[
  {"name": "Brushing Scatter Plot to Show Data on a Table", "url": "https://altair-viz.github.io/gallery/scatter_with_table.html", "gallery": "altair"},
  {"name": "Multi-panel Scatter Plot with Linked Brushing", "url": "https://altair-viz.github.io/gallery/scatter_linked_table.html", "gallery": "altair"},
  {"name": "Interactive Rectangular Brush", "url": "https://altair-viz.github.io/gallery/interactive_brush.html", "gallery": "altair"},
  {"name": "Rectangular Brush", "url": "https://vega.github.io/vega-lite/examples/selection_brush.html", "gallery": "vega-lite"},
  {"name": "Scatterplot Pan & Zoom", "url": "https://vega.github.io/vega-lite/examples/selection_pan_zoom.html", "gallery": "vega-lite"}
]

Takeaway: The agent provides working examples using the exact dataset the user is already working with. No hallucinated field names, no guessing—just grounded responses with direct links.


Use Case 5: Technique Popularity (Learning Prioritization)

User Prompt:

"I have limited time. Which Vega-Lite techniques should I learn first?"

Agent Query:

jq '[.[].techniques[]] | group_by(.) |
    map({technique: .[0], count: length}) |
    sort_by(-.count) | .[0:10]' \
    gallery_examples.json

Result:

[
  {"technique": "interaction:param", "count": 142},
  {"technique": "transform:filter", "count": 128},
  {"technique": "transform:aggregate", "count": 112},
  {"technique": "transform:calculate", "count": 108},
  {"technique": "composition:layer", "count": 82},
  {"technique": "interaction:binding", "count": 65},
  {"technique": "composition:facet", "count": 52},
  {"technique": "interaction:conditional", "count": 49},
  {"technique": "transform:window", "count": 44},
  {"technique": "interaction:selection", "count": 38}
]

Takeaway: Data-driven curriculum: Master these 6 techniques and you'll understand 70%+ of gallery examples:

  1. Parameters (interaction:param) - foundation of interactivity
  2. Filtering (transform:filter) - most common transform
  3. Aggregation (transform:aggregate) - grouping/summarizing
  4. Calculations (transform:calculate) - computed fields
  5. Layering (composition:layer) - multi-mark overlays
  6. Bindings (interaction:binding) - input widgets

Use Case 6: Coverage Gaps (Contribution Guide)

User Prompt:

"I want to contribute gallery examples. What datasets need more coverage?"

Agent Query:

jq '[.[].datasets[]] | group_by(.) |
    map({dataset: .[0], count: length}) |
    sort_by(.count) | .[0:10]' \
    gallery_examples.json

Result:

[
  {"dataset": "annual_precip", "count": 1},
  {"dataset": "earthquakes", "count": 1},
  {"dataset": "flights_3m", "count": 1},
  {"dataset": "football", "count": 1},
  {"dataset": "income", "count": 1},
  {"dataset": "miserables", "count": 1},
  {"dataset": "normal_2d", "count": 1},
  {"dataset": "ohlc", "count": 1},
  {"dataset": "political_contributions", "count": 1},
  {"dataset": "volcano", "count": 1}
]

Takeaway: 10 datasets have only 1 example each. Contributors can focus here to improve coverage. Compare to cars (53 examples) or stocks (15 examples) to see what good coverage looks like.


Why This Matters

For Learners

  • Discovery by technique: "Show me examples using window transforms"
  • Discovery by dataset: "What can I build with gapminder?"
  • Complexity gradients: Start simple, level up progressively

For AI Coding Assistants

  • Grounded responses: Real field names, working URLs
  • No hallucination: Recommendations based on actual tested examples
  • Context-aware: Suggest techniques proven to work with specific data shapes

For Tool Builders

  • IDE integrations: Dataset-aware autocomplete
  • Learning platforms: Smart recommendations based on data shape
  • Documentation generators: Auto-link examples to dataset docs

Feedback Requested: Technique Taxonomy and Schema

This PR auto-generates a registry of ~470 gallery examples with detected techniques. Before finalizing:

  1. Technique taxonomy — Does transform:*, composition:*, interaction:*, geo:* capture the right conceptual categories? Missing any (e.g., encoding:*, data:*)?

  2. Technique granularity — Is transform:window the right level, or should we go deeper (e.g., transform:window:cumulative) or stay flatter?

  3. Field schema — Are datasets, techniques, categories the right cross-reference points, or would additional fields (e.g., marks, encodings) be more useful for discovery?


Adds Pyright type checking to the project with initial coverage of select
scripts. Configuration uses 'basic' mode for gradual typing adoption.

Scripts included in type checking:
- scripts/generate_gallery_examples.py (new)
- scripts/build_datapackage.py
- scripts/species.py
- scripts/flights.py
- scripts/income.py
- scripts/us-state-capitals.py

Type safety improvements to scripts/species.py (required to pass checks):
- Add TypedDict definitions for configuration structures (FilterItem,
  GeographicFilter, ProcessingConfig, Config)
- Add semantic type aliases (ItemId, SpeciesCode, CountyId, FileExtension,
  ExactExtractOp) for domain clarity
- Add type guard function is_file_extension() for FileExtension validation
- Improve function signatures with complete type annotations
- Add TYPE_CHECKING block for type-only imports

These changes ensure the build passes with Pyright enabled while improving
code maintainability and IDE support.
Adds cross-ecosystem registry cataloging ~470 examples from Vega, Vega-Lite,
and Altair galleries, tracking which datasets each example uses.

New files:
- _data/gallery_examples.toml: Configuration (URLs, Altair name mappings)
- scripts/generate_gallery_examples.py: Generator (2,289 lines, fully typed)
- gallery_examples.json: Generated output (~470 examples)

When joined with datapackage.json, enables:
- Dataset-first learning (find all examples using specific dataset)
- Curation analytics (dataset coverage matrices, gap analysis)
- High-quality training data for visualization AI/ML systems

Examples are curated by the Vega community to demonstrate essential
visualization techniques and design patterns.

Implementation details:
- Handles different spec formats per framework (Vega, Vega-Lite, Altair)
- Normalizes all references to canonical datapackage.json names
- Altair deduplication: Uses method-based syntax (preferred as of Altair 5)
  when examples exist in both syntax directories (116 cases)
- Temporary name mappings for Altair API (3 mappings, will be removed after
  Altair PR #3859 lands)
- Comprehensive type safety with TypedDict, Protocols, semantic type aliases
- Protocol-based validation infrastructure for extensibility

Runtime: ~15 seconds to collect all examples
Quality: All checks pass (taplo, ruff, pyright, npm build)
@dsmedia dsmedia force-pushed the feat/generate-gallery-examples branch from 823c2b9 to 0815882 Compare October 26, 2025 14:00
Altair PR #3859 (merged 2025-10-26) migrated from vega_datasets package
to altair.datasets module with canonical vega-datasets naming. This
updates the gallery examples collection to track Altair v6+ main branch.

Changes:
- Empty [altair.name_mapping] section (was: londonBoroughs → london_boroughs)
- Comments now document legacy v5.x support instead of temporary workaround
- Add pattern for fully qualified altair.datasets.data.X.url syntax
- Refactor extract_altair_api_datasets() with explicit name_mapping parameter
- Regenerate gallery_examples.json (470 examples, all with canonical names)

Type safety improvements:
- extract_altair_api_datasets() now accepts name_mapping as parameter
  instead of accessing global _config directly
- Explicit None default for Altair v6+ (no mapping needed)
- Better testability and separation of concerns

Backward compatibility:
- Mapping section preserved (empty) with documentation for v5.x users
- Historical camelCase examples commented out for reference
- Function signature supports both v5 (with mapping) and v6 (without)

Configuration notes:
- Currently tracks Altair main branch (v6+ development)
- Git ref hardcoded in Python script (line 1135) - documented in TOML
- Stability note added: consider pinning to release tag when v6.0.0 available
- Testing procedure documented for v5.x regression testing

All three galleries (Vega, Vega-Lite, Altair) now use consistent
canonical dataset naming from datapackage.json.

Related: vega/altair#3859

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
domoritz pushed a commit to vega/vega-lite that referenced this pull request Nov 19, 2025
Fixes broken URL found during vega-datasets link checking (see
vega/vega-datasets#724)

The original Observable notebook by @manzt no longer exists at the
previous URL. Update to link to his Vega/Vega-Lite examples collection
instead, which provides proper attribution without broken links.

Co-authored-by: Claude <noreply@anthropic.com>
dsmedia and others added 5 commits February 3, 2026 18:58
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- gallery_examples.json now contains data-only array (no metadata wrapper)
- Metadata (description, schema, sources, licenses) moved to datapackage.json
- build_datapackage.py conditionally includes gallery_examples resource
- Added 9-field schema in datapackage_additions.toml

Per Data Package Standard v2: data files contain only data;
metadata belongs in datapackage.json as a resource entry.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
dsmedia and others added 8 commits February 5, 2026 13:11
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nsforms

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…geopath, geojson)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nest, treelinks)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pattern, cross)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n headers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ction

Regenerated with 54 unique technique tags (up from 33), including
new layout:*, geo:*, and transform:* patterns. Vega zero-technique
examples reduced from 6 to 2 (Connected Scatter Plot and Timelines).

Also adds ruff linting relaxation for test files (pyproject.toml).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dsmedia and others added 4 commits February 5, 2026 22:18
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 5 safe, unambiguous patterns for Altair's methods syntax API:
- .stack() method → transform:stack (+18 detections)
- mark_arc / "arc" → mark:arc (new tag, +23 detections)
- alt.Row() / alt.Column() → composition:facet (+44 detections)
- bin=True / .bin() → transform:bin (+27 detections)
- count() shorthand → transform:aggregate (+54 detections)

Altair zero-technique examples reduced from 64 to 40.
Total unique techniques: 55 (up from 54).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>