Skip to content

feat(py): DataSourceReader bridge for ggsql database pushdown#221

Draft
cpsievert wants to merge 22 commits intomainfrom
feat/datasource-reader-bridge
Draft

feat(py): DataSourceReader bridge for ggsql database pushdown#221
cpsievert wants to merge 22 commits intomainfrom
feat/datasource-reader-bridge

Conversation

@cpsievert
Copy link
Copy Markdown
Contributor

Motivation

querychat's current ggsql integration splits execution into two phases: run the SQL on the real database, then replay the VISUALISE portion locally in an in-memory DuckDB. This has two problems:

  1. Scaling — the full SQL result must be pulled into Python memory, even when ggsql's stat transforms (histogram, density, boxplot) would reduce it to a small summary. A histogram of 10M rows pulls all 10M rows into memory only to bin them into ~30 buckets.

  2. Multi-source layers — ggsql supports per-layer data sources (e.g., a CTE fed to a different DRAW clause). The two-phase approach loses intermediate tables at the DataSource boundary, so querychat rejects these queries entirely.

Both problems stem from the same root cause: querychat splits the query at the SQL/VISUALISE boundary rather than letting ggsql run the full pipeline against the real database.

Approach

For SQLAlchemySource data sources, this PR implements a DataSourceReader — a Python object that satisfies ggsql's reader protocol (execute_sql(), register(), unregister()) by routing SQL to the real database via SQLAlchemy. ggsql runs its entire pipeline (parsing, CTEs, stat transforms, layer queries) against the real DB.

sqlglot transpiles ggsql's ANSI-generated SQL to the target database dialect. The dialect mapping covers 22 SQLAlchemy backends, verified by installing the actual driver packages and checking engine.dialect.name.

Falls back to the current two-phase approach when the bridge fails (e.g., temp table permission denied, unsupported dialect, transpilation error) or for non-SQLAlchemy data sources.

Why a separate PR

This is split off from feat/ggsql-integration because the DataSourceReader bridge is Python-specific — it depends on SQLAlchemy and sqlglot, neither of which has an R equivalent. The parent branch (feat/ggsql-integration) contains ggsql prompt/syntax updates and other changes that apply to both R and Python. Keeping this separate makes it clear that the R package doesn't need a corresponding change.

Changes

  • New: _datasource_reader.pyDataSourceReader class, SQLGLOT_DIALECTS mapping (22 dialects), transpile_sql(), register_sqlglot_dialect() for custom additions
  • Modified: _viz_ggsql.pyexecute_ggsql() tries bridge path first, falls back to execute_two_phase() (renamed current logic); logs a warning for unknown dialects
  • Modified: _viz_tools.py, _shiny_module.py — updated callers to pass original query string
  • New dep: sqlglot>=26.0 added to the viz extra (pure Python, zero transitive deps, 6.6 MB)
  • Tests: 19 new tests covering dialect mapping, transpilation, DataSourceReader lifecycle, and end-to-end ggsql integration against real SQLite

Test plan

  • All existing ggsql tests pass with updated 3-arg execute_ggsql() signature
  • New DataSourceReader unit tests pass (SQLite in-memory)
  • End-to-end: ggsql.execute(query, reader) works for scatter, filter, Form B, aggregation
  • Manual test with a real Snowflake/Postgres connection (TODO)

🤖 Generated with Claude Code

cpsievert and others added 22 commits April 8, 2026 17:36
Add visualization support to querychat via ggsql, a SQL extension for
declarative data visualization. Includes Altair/Vega-Lite rendering,
Shiny widget integration, ggsql syntax reference prompt, Playwright
tests, and documentation.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Fixes and improvements accumulated after the initial ggsql integration:
- Bookmark restore, Shiny tool wiring, and pre-commit fixes
- Remove legacy shiny client path
- Send PNG thumbnail to LLM in viz tool results
- Port visualization best practices from ggsqlbot (axis readability,
  labels with units, data-ink ratio, overplotting, chart type guide)
- Move behavioral guidance from tool description to system prompt
- Add interpretation skepticism to guidelines
- Add collapsed parameter to query tool for preparatory queries
- Add uv required-environments for CI

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…inction, trim redundant notes

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

fix(prompts): polish tool descriptions — collapsed default, error-mode reinforcement, tighter ggsql param

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

fix(prompts): tool name consistency, conditional fallback, standalone routing section

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

style: fix docstring formatting in new test class

Co-Authored-By: Claude Opus 4.6 <[email protected]>

fix(ggsql): guard unsupported layer sources
…deps lazy

Move pathlib, copy, importlib.util, and chevron to top-level imports.
Move internal module imports (_viz_altair_widget, _viz_ggsql) to top
level in _viz_tools.py. Keep ggsql, altair, and shinywidgets as runtime
imports since they are optional (viz extra).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…pabilities

- Move visualization into "Your Capabilities" as a peer of filtering/querying
- Add "Choosing Between Query and Visualization" decision preamble
- Clean up conditional branching (drop viz-only fallback, simplify nesting)
- Integrate viz suggestions into existing suggestion examples
- Replace "(if available)" hedging with Mustache conditionals
- Trim tool-query.md and tool-update-dashboard.md: remove routing guidance
  and cross-tool references now covered by the main prompt
- Add structural verification tests for prompt conditionals

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ueries

Guide the LLM to prefer visualization for comparisons/distributions/trends
and always collapse query results when a chart covers the same data. Also
remove stale _extract_column_names monkeypatch that was failing CI.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Include vl-convert-python in viz dependency error message
- Use querychat_tool_starts_open() for viz tool open state
- Fix docs: viz tool is opt-in, not default
- Fix docs: each viz call produces a new message, not a replacement

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Integrate main's deferred client initialization (#207) and narwhals
version constraint (#218) with the ggsql visualization feature branch.
Extended _create_session_client with visualize_query parameter and
updated stub session test to match eager client construction.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
chatlas 0.16.0 added stream_content as a new abstract method on Provider,
causing DummyProvider instantiation to fail in CI.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
shallow copy + nested attribute mutation was modifying the original
chart's sub-specs through shared references, violating the function's
documented contract.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
narwhals.stable.v1.from_native wraps ibis Tables as DataFrame (not
LazyFrame), so the isinstance(nw_df, nw.LazyFrame) check was False and
collect() was never called, causing 'IbisLazyFrame has no attribute
to_polars'. Delegate to as_narwhals() which already handles ibis
correctly.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…identifiers

Snowflake uppercases unquoted identifiers, causing case mismatches when
ggsql validates VISUALISE column references against DuckDB results.
Replace the generic casing note with explicit wrong/correct examples
instructing the LLM to alias uppercase columns to lowercase in SELECT.

Borrowed from ggsqlbot's proven prompt pattern.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Snowflake uppercases all unquoted identifiers, so columns come back as
AVG_INTEREST_RATE even when the LLM writes AS avg_interest_rate. Since
ggsql validates VISUALISE column references case-sensitively and DuckDB
is case-insensitive, lowercasing the DataFrame columns before registering
ensures the LLM's lowercase references always match.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Adds truncate_error() in _utils to cap error strings sent to the LLM,
stripping schema dumps and applying a hard character limit. Wired into
all tool error catch blocks in tools.py and _viz_tools.py.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Updates syntax guide for ggsql v0.2.4 breaking changes: rect→tile,
linear merged into rule, array syntax to parentheses, updated text
aesthetics. Fixes pre-existing errors (invalid position fill, nonexistent
density stacking). Bumps ggsql dependency to >=0.2.4.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Implements a DataSourceReader that routes ggsql's full pipeline through
the real database via SQLAlchemy, using sqlglot for dialect transpilation.
For SQLAlchemySource with a known dialect, ggsql runs CTEs, stat
transforms, and layer queries directly on the real DB — avoiding the
need to pull large result sets into Python memory. Falls back to the
existing two-phase approach (now `execute_two_phase`) for other
DataSource types or on bridge failure.

Includes verified dialect mappings for 22 SQLAlchemy backends and
register_sqlglot_dialect() for custom additions.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Base automatically changed from feat/ggsql-integration to main April 28, 2026 21:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant