feat(py): DataSourceReader bridge for ggsql database pushdown#221
Draft
feat(py): DataSourceReader bridge for ggsql database pushdown#221
Conversation
Add visualization support to querychat via ggsql, a SQL extension for declarative data visualization. Includes Altair/Vega-Lite rendering, Shiny widget integration, ggsql syntax reference prompt, Playwright tests, and documentation. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Fixes and improvements accumulated after the initial ggsql integration: - Bookmark restore, Shiny tool wiring, and pre-commit fixes - Remove legacy shiny client path - Send PNG thumbnail to LLM in viz tool results - Port visualization best practices from ggsqlbot (axis readability, labels with units, data-ink ratio, overplotting, chart type guide) - Move behavioral guidance from tool description to system prompt - Add interpretation skepticism to guidelines - Add collapsed parameter to query tool for preparatory queries - Add uv required-environments for CI Co-Authored-By: Claude Opus 4.6 <[email protected]>
…inction, trim redundant notes Co-Authored-By: Claude Sonnet 4.6 <[email protected]> fix(prompts): polish tool descriptions — collapsed default, error-mode reinforcement, tighter ggsql param Co-Authored-By: Claude Sonnet 4.6 <[email protected]> fix(prompts): tool name consistency, conditional fallback, standalone routing section Co-Authored-By: Claude Sonnet 4.6 <[email protected]> style: fix docstring formatting in new test class Co-Authored-By: Claude Opus 4.6 <[email protected]> fix(ggsql): guard unsupported layer sources
…deps lazy Move pathlib, copy, importlib.util, and chevron to top-level imports. Move internal module imports (_viz_altair_widget, _viz_ggsql) to top level in _viz_tools.py. Keep ggsql, altair, and shinywidgets as runtime imports since they are optional (viz extra). Co-Authored-By: Claude Opus 4.6 <[email protected]>
…pabilities - Move visualization into "Your Capabilities" as a peer of filtering/querying - Add "Choosing Between Query and Visualization" decision preamble - Clean up conditional branching (drop viz-only fallback, simplify nesting) - Integrate viz suggestions into existing suggestion examples - Replace "(if available)" hedging with Mustache conditionals - Trim tool-query.md and tool-update-dashboard.md: remove routing guidance and cross-tool references now covered by the main prompt - Add structural verification tests for prompt conditionals Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ueries Guide the LLM to prefer visualization for comparisons/distributions/trends and always collapse query results when a chart covers the same data. Also remove stale _extract_column_names monkeypatch that was failing CI. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Include vl-convert-python in viz dependency error message - Use querychat_tool_starts_open() for viz tool open state - Fix docs: viz tool is opt-in, not default - Fix docs: each viz call produces a new message, not a replacement Co-Authored-By: Claude Opus 4.6 <[email protected]>
Integrate main's deferred client initialization (#207) and narwhals version constraint (#218) with the ggsql visualization feature branch. Extended _create_session_client with visualize_query parameter and updated stub session test to match eager client construction. Co-Authored-By: Claude Opus 4.6 <[email protected]>
chatlas 0.16.0 added stream_content as a new abstract method on Provider, causing DummyProvider instantiation to fail in CI. Co-Authored-By: Claude Opus 4.6 <[email protected]>
shallow copy + nested attribute mutation was modifying the original chart's sub-specs through shared references, violating the function's documented contract. Co-Authored-By: Claude Opus 4.6 <[email protected]>
narwhals.stable.v1.from_native wraps ibis Tables as DataFrame (not LazyFrame), so the isinstance(nw_df, nw.LazyFrame) check was False and collect() was never called, causing 'IbisLazyFrame has no attribute to_polars'. Delegate to as_narwhals() which already handles ibis correctly. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…identifiers Snowflake uppercases unquoted identifiers, causing case mismatches when ggsql validates VISUALISE column references against DuckDB results. Replace the generic casing note with explicit wrong/correct examples instructing the LLM to alias uppercase columns to lowercase in SELECT. Borrowed from ggsqlbot's proven prompt pattern. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Snowflake uppercases all unquoted identifiers, so columns come back as AVG_INTEREST_RATE even when the LLM writes AS avg_interest_rate. Since ggsql validates VISUALISE column references case-sensitively and DuckDB is case-insensitive, lowercasing the DataFrame columns before registering ensures the LLM's lowercase references always match. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Adds truncate_error() in _utils to cap error strings sent to the LLM, stripping schema dumps and applying a hard character limit. Wired into all tool error catch blocks in tools.py and _viz_tools.py. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Updates syntax guide for ggsql v0.2.4 breaking changes: rect→tile, linear merged into rule, array syntax to parentheses, updated text aesthetics. Fixes pre-existing errors (invalid position fill, nonexistent density stacking). Bumps ggsql dependency to >=0.2.4. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Implements a DataSourceReader that routes ggsql's full pipeline through the real database via SQLAlchemy, using sqlglot for dialect transpilation. For SQLAlchemySource with a known dialect, ggsql runs CTEs, stat transforms, and layer queries directly on the real DB — avoiding the need to pull large result sets into Python memory. Falls back to the existing two-phase approach (now `execute_two_phase`) for other DataSource types or on bridge failure. Includes verified dialect mappings for 22 SQLAlchemy backends and register_sqlglot_dialect() for custom additions. Co-Authored-By: Claude Opus 4.6 <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
querychat's current ggsql integration splits execution into two phases: run the SQL on the real database, then replay the VISUALISE portion locally in an in-memory DuckDB. This has two problems:
Scaling — the full SQL result must be pulled into Python memory, even when ggsql's stat transforms (histogram, density, boxplot) would reduce it to a small summary. A histogram of 10M rows pulls all 10M rows into memory only to bin them into ~30 buckets.
Multi-source layers — ggsql supports per-layer data sources (e.g., a CTE fed to a different DRAW clause). The two-phase approach loses intermediate tables at the DataSource boundary, so querychat rejects these queries entirely.
Both problems stem from the same root cause: querychat splits the query at the SQL/VISUALISE boundary rather than letting ggsql run the full pipeline against the real database.
Approach
For
SQLAlchemySourcedata sources, this PR implements aDataSourceReader— a Python object that satisfies ggsql's reader protocol (execute_sql(),register(),unregister()) by routing SQL to the real database via SQLAlchemy. ggsql runs its entire pipeline (parsing, CTEs, stat transforms, layer queries) against the real DB.sqlglot transpiles ggsql's ANSI-generated SQL to the target database dialect. The dialect mapping covers 22 SQLAlchemy backends, verified by installing the actual driver packages and checking
engine.dialect.name.Falls back to the current two-phase approach when the bridge fails (e.g., temp table permission denied, unsupported dialect, transpilation error) or for non-SQLAlchemy data sources.
Why a separate PR
This is split off from
feat/ggsql-integrationbecause the DataSourceReader bridge is Python-specific — it depends on SQLAlchemy and sqlglot, neither of which has an R equivalent. The parent branch (feat/ggsql-integration) contains ggsql prompt/syntax updates and other changes that apply to both R and Python. Keeping this separate makes it clear that the R package doesn't need a corresponding change.Changes
_datasource_reader.py—DataSourceReaderclass,SQLGLOT_DIALECTSmapping (22 dialects),transpile_sql(),register_sqlglot_dialect()for custom additions_viz_ggsql.py—execute_ggsql()tries bridge path first, falls back toexecute_two_phase()(renamed current logic); logs a warning for unknown dialects_viz_tools.py,_shiny_module.py— updated callers to pass original query stringsqlglot>=26.0added to thevizextra (pure Python, zero transitive deps, 6.6 MB)Test plan
execute_ggsql()signatureggsql.execute(query, reader)works for scatter, filter, Form B, aggregation🤖 Generated with Claude Code