Conversation
…s and examples This PR implements all items from the checklist in issue apache#1398: ## Implementation Checklist - [x] Add example .ipynb notebooks to python/examples/ - getting_started.ipynb - Basic connection and queries - dataframe_api.ipynb - DataFrame transformations - distributed_queries.ipynb - Multi-stage query examples - [x] Document notebook support in Python README - Added comprehensive Jupyter section with examples - [x] Create ballista.jupyter module with magic commands - Full implementation with BallistaMagics class - [x] Add %ballista connect/status/tables/schema line magics - connect: Connect to Ballista cluster - status: Show connection status - tables: List registered tables - schema: Show table schema - disconnect: Disconnect from cluster - history: Show query history - [x] Add %%sql cell magic - Line magic for single-line queries - Cell magic for multi-line queries - Variable assignment support - --no-display and --limit options - [x] Add explain_visual() method for query plan rendering - Generates DOT/SVG visualization - Supports Jupyter _repr_html_ - Fallback when graphviz not installed - [x] Add progress indicator support for long-running queries - collect_with_progress() method - Callback support for custom progress handling - Jupyter-aware display - [x] Consider JupySQL integration - Documented as alternative in README ## Additional Features - ExecutionPlanVisualization class for plan rendering - tables() method on BallistaSessionContext - Optional jupyter dependency in pyproject.toml - Comprehensive test coverage (45 tests passing) Closes apache#1398
python/python/ballista/jupyter.py
Outdated
| if not tables: | ||
| return "No tables registered.\n\nUse ctx.register_parquet() or ctx.register_csv() to register tables." | ||
| schema_count = len(tables.keys()) | ||
| table_count = len(tables.values()) |
There was a problem hiding this comment.
| table_count = len(tables.values()) | |
| table_count = sum(len(v) for v in tables.values()) |
| for row in lines: | ||
| print(row) | ||
|
|
||
| def _show_help(self) -> Optional[str]: |
There was a problem hiding this comment.
Is this function supposed to return something ?
Currently it just prints and implicitly returns None but https://github.com/apache/datafusion-ballista/pull/1513/changes#diff-db55df3b39944095b30c0a940790394d25c009baf49be871acd19d9eab2f30b8R90-R93 seems to expect Some
There was a problem hiding this comment.
Instead of return of str in functions changed the return type to Optional[str]. Remade it to display() if the ipython is available and defaults to print otherwise.
Also changed the testcases to account for that.
| if IPYTHON_AVAILABLE: | ||
| display(HTML(html)) | ||
| else: | ||
| print("\n".join(status_lines)) |
There was a problem hiding this comment.
Is this function supposed to return Some ?
https://github.com/apache/datafusion-ballista/pull/1513/changes#diff-db55df3b39944095b30c0a940790394d25c009baf49be871acd19d9eab2f30b8R68 tries to call .lower() on the implicit None
|
|
||
| try: | ||
| # Query the table with LIMIT 0 to get schema without data | ||
| df = self.ctx.sql(f"SELECT * FROM {table_name} LIMIT 0") |
There was a problem hiding this comment.
Should table_name be sanitized ? Currently it could lead to SQL injection
There was a problem hiding this comment.
We are considering a local development, I guess?
It is the same as:
%sql some SQL code
so it might be potentially any SQL code, executed by the data-scientist/analyst.
| else: | ||
| def decorator(func): | ||
| return func | ||
| return decorator |
There was a problem hiding this comment.
No fallback for line_cell_magic. @line_cell_magic will fail if IPython is not available
| "# ctx = BallistaSessionContext(\"df://your-scheduler:50050\")\n", | ||
| "\n", | ||
| "# host, port = setup_test_cluster()\n", | ||
| "host, port = \"localhost\", \"39431\"\n", |
There was a problem hiding this comment.
Is this a debug leftover ? IMO line 79 should be used
python/python/ballista/extension.py
Outdated
|
|
||
| def collect_with_progress( | ||
| self, | ||
| callback: Optional[callable] = None, |
There was a problem hiding this comment.
| callback: Optional[callable] = None, | |
| callback: Optional[Callable] = None, |
python/python/ballista/jupyter.py
Outdated
| LIMIT 10 | ||
| """ | ||
| for row in help_info.split("\n"): | ||
| print(row) |
There was a problem hiding this comment.
| print(row) | |
| print(row) | |
| return help_info |
| ## Jupyter Notebook Support | ||
|
|
||
| PyBallista provides first-class Jupyter notebook support with SQL magic commands and rich HTML rendering. | ||
|
|
There was a problem hiding this comment.
| Install Jupyter extras first: | |
| ```bash | |
| pip install "ballista[jupyter]" |
python/python/ballista/extension.py
Outdated
| try: | ||
| catalog = self.catalog() | ||
| schema_names = list(catalog.schema_names()) | ||
| if schema_names is not None: |
There was a problem hiding this comment.
A list is never None.
| if schema_names is not None: | |
| if schema_names: |
|
My main question was regarding the SQL-sanitizing. It seems to me that it is not needed (adding another layer of logic) in the current implementation. But I'm opened to suggestions and thank you for your review. |
hey @sandugood thanks for the contribution, i would agree with you, I don't think this is security issue as user can do whatever it wants anyway. will have a look at the pr later in more details |
Which issue does this PR close?
Refactoring of PR #1430
Rationale for this change
Latest commit to the mentioned PR was in January. Refactoring felt like a nice thing.
What changes are included in this PR?
Are there any user-facing changes?
No.