Skip to content

Feature/jupyter notebook support#1513

Open
sandugood wants to merge 6 commits intoapache:mainfrom
sandugood:feature/jupyter-notebook-support
Open

Feature/jupyter notebook support#1513
sandugood wants to merge 6 commits intoapache:mainfrom
sandugood:feature/jupyter-notebook-support

Conversation

@sandugood
Copy link

Which issue does this PR close?

Refactoring of PR #1430

Rationale for this change

Latest commit to the mentioned PR was in January. Refactoring felt like a nice thing.

What changes are included in this PR?

  1. Fix of both @line_magic and @cell_magic decorators. Combined them under the @line_cell_magic decorator
  2. Nicely HTML-formatted output of cells. Using display() func and also printing in a loop where applicable

Are there any user-facing changes?

No.

littleKitchen and others added 4 commits January 31, 2026 00:45
…s and examples

This PR implements all items from the checklist in issue apache#1398:

## Implementation Checklist

- [x] Add example .ipynb notebooks to python/examples/
  - getting_started.ipynb - Basic connection and queries
  - dataframe_api.ipynb - DataFrame transformations
  - distributed_queries.ipynb - Multi-stage query examples

- [x] Document notebook support in Python README
  - Added comprehensive Jupyter section with examples

- [x] Create ballista.jupyter module with magic commands
  - Full implementation with BallistaMagics class

- [x] Add %ballista connect/status/tables/schema line magics
  - connect: Connect to Ballista cluster
  - status: Show connection status
  - tables: List registered tables
  - schema: Show table schema
  - disconnect: Disconnect from cluster
  - history: Show query history

- [x] Add %%sql cell magic
  - Line magic for single-line queries
  - Cell magic for multi-line queries
  - Variable assignment support
  - --no-display and --limit options

- [x] Add explain_visual() method for query plan rendering
  - Generates DOT/SVG visualization
  - Supports Jupyter _repr_html_
  - Fallback when graphviz not installed

- [x] Add progress indicator support for long-running queries
  - collect_with_progress() method
  - Callback support for custom progress handling
  - Jupyter-aware display

- [x] Consider JupySQL integration
  - Documented as alternative in README

## Additional Features

- ExecutionPlanVisualization class for plan rendering
- tables() method on BallistaSessionContext
- Optional jupyter dependency in pyproject.toml
- Comprehensive test coverage (45 tests passing)

Closes apache#1398
if not tables:
return "No tables registered.\n\nUse ctx.register_parquet() or ctx.register_csv() to register tables."
schema_count = len(tables.keys())
table_count = len(tables.values())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
table_count = len(tables.values())
table_count = sum(len(v) for v in tables.values())

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved.

for row in lines:
print(row)

def _show_help(self) -> Optional[str]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function supposed to return something ?
Currently it just prints and implicitly returns None but https://github.com/apache/datafusion-ballista/pull/1513/changes#diff-db55df3b39944095b30c0a940790394d25c009baf49be871acd19d9eab2f30b8R90-R93 seems to expect Some

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of return of str in functions changed the return type to Optional[str]. Remade it to display() if the ipython is available and defaults to print otherwise.
Also changed the testcases to account for that.

if IPYTHON_AVAILABLE:
display(HTML(html))
else:
print("\n".join(status_lines))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as previous comment


try:
# Query the table with LIMIT 0 to get schema without data
df = self.ctx.sql(f"SELECT * FROM {table_name} LIMIT 0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should table_name be sanitized ? Currently it could lead to SQL injection

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are considering a local development, I guess?
It is the same as:
%sql some SQL code

so it might be potentially any SQL code, executed by the data-scientist/analyst.

else:
def decorator(func):
return func
return decorator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No fallback for line_cell_magic. @line_cell_magic will fail if IPython is not available

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

"# ctx = BallistaSessionContext(\"df://your-scheduler:50050\")\n",
"\n",
"# host, port = setup_test_cluster()\n",
"host, port = \"localhost\", \"39431\"\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a debug leftover ? IMO line 79 should be used

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, debug leftover. Fixed


def collect_with_progress(
self,
callback: Optional[callable] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
callback: Optional[callable] = None,
callback: Optional[Callable] = None,

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thank you

LIMIT 10
"""
for row in help_info.split("\n"):
print(row)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
print(row)
print(row)
return help_info

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

## Jupyter Notebook Support

PyBallista provides first-class Jupyter notebook support with SQL magic commands and rich HTML rendering.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Install Jupyter extras first:
```bash
pip install "ballista[jupyter]"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

try:
catalog = self.catalog()
schema_names = list(catalog.schema_names())
if schema_names is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A list is never None.

Suggested change
if schema_names is not None:
if schema_names:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thank you

@sandugood
Copy link
Author

My main question was regarding the SQL-sanitizing. It seems to me that it is not needed (adding another layer of logic) in the current implementation. But I'm opened to suggestions and thank you for your review.

@milenkovicm
Copy link
Contributor

My main question was regarding the SQL-sanitizing. It seems to me that it is not needed (adding another layer of logic) in the current implementation. But I'm opened to suggestions and thank you for your review.

hey @sandugood thanks for the contribution, i would agree with you, I don't think this is security issue as user can do whatever it wants anyway.

will have a look at the pr later in more details

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants