Description
Summary
Improve the Jupyter notebook experience for Ballista by adding SQL magic commands, example notebooks, and notebook-specific features. While basic notebook support already works via _repr_html_, there's an opportunity to provide a richer, more integrated experience.
Current State
PyBallista already supports basic Jupyter usage:
from ballista import BallistaSessionContext
ctx = BallistaSessionContext("df://localhost:50050")
df = ctx.sql("SELECT * FROM my_table LIMIT 10")
df  # Renders as HTML table via _repr_html_()
What works today:
- _repr_html_() - DataFrames render as styled HTML tables
- to_pandas() / to_arrow_table() / to_polars() - Data conversion
- show() - Terminal-style output
- Example .py files with # %% cell markers
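As background for the rendering behaviour described above: Jupyter looks for a `_repr_html_()` method on the last expression of a cell and, if present, displays its return value as HTML. The sketch below illustrates that protocol with a hypothetical `TinyTable` class; it is a stand-in for illustration only and does not reflect how PyBallista builds its tables.

```python
from html import escape


class TinyTable:
    """Hypothetical stand-in for a DataFrame; not part of PyBallista."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(r) for r in rows]

    def _repr_html_(self):
        # Jupyter calls this method, when present, to render the last
        # expression of a cell as HTML instead of plain text.
        head = "".join(f"<th>{escape(str(c))}</th>" for c in self.columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
            for row in self.rows
        )
        return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"


# Cell values are escaped so that data containing "<" cannot break the markup.
table = TinyTable(["id", "name"], [[1, "a<b"], [2, "c"]])
html = table._repr_html_()
```

In a notebook, leaving `table` as the last line of a cell would render the HTML table directly.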
Proposed Improvements
Phase 1: Documentation & Examples (Low Effort)
- Add example Jupyter notebooks to python/examples/:
  - getting_started.ipynb - Basic connection and queries
  - dataframe_api.ipynb - DataFrame transformations
  - distributed_queries.ipynb - Multi-stage query examples
- Document notebook support in python/README.md
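One possible way to bootstrap the notebooks is to convert the existing `# %%` example scripts rather than writing them from scratch. The sketch below, using only the standard library, splits such a script into code cells and builds a minimal nbformat 4 document. The function name `percent_script_to_notebook` is hypothetical, and `# %% [markdown]` cells are not handled specially (their contents would end up in a code cell), so treat this as a starting point only.

```python
import json


def percent_script_to_notebook(source: str) -> dict:
    """Split a `# %%` script into code cells of a minimal nbformat-4 notebook."""
    cells, current = [], []
    for line in source.splitlines():
        if line.startswith("# %%"):
            # A marker closes the previous cell, if that cell has any content.
            if any(s.strip() for s in current):
                cells.append("\n".join(current).strip("\n"))
            current = []
        else:
            current.append(line)
    if any(s.strip() for s in current):
        cells.append("\n".join(current).strip("\n"))
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [],
                "source": text,
            }
            for text in cells
        ],
    }


script = """# %%
from ballista import BallistaSessionContext
ctx = BallistaSessionContext("df://localhost:50050")

# %%
df = ctx.sql("SELECT * FROM my_table LIMIT 10")
df
"""
notebook = percent_script_to_notebook(script)
notebook_json = json.dumps(notebook, indent=1)  # ready to write to an .ipynb file
```

Existing tools such as Jupytext cover this conversion more completely; the sketch only shows that the mapping from cell markers to notebook cells is mechanical.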
Phase 2: SQL Magic Commands (Medium Effort)
Add IPython magic commands for a more interactive SQL experience:
%load_ext ballista.jupyter
# Connect to cluster
%ballista connect df://localhost:50050
# Line magic for simple queries
%sql SELECT COUNT(*) FROM orders
# Cell magic for complex queries
%%sql
SELECT
customer_id,
SUM(amount) as total
FROM orders
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10
Implementation sketch:
# ballista/jupyter.py
from IPython.core.magic import Magics, magics_class, line_magic, line_cell_magic

from ballista import BallistaSessionContext


@magics_class
class BallistaMagics(Magics):
    def __init__(self, shell):
        super().__init__(shell)
        self.ctx = None  # Set by `%ballista connect`

    @line_magic
    def ballista(self, line):
        """Ballista commands: connect, status, disconnect"""
        cmd, *args = line.split() or [""]
        if cmd == "connect":
            if not args:
                raise ValueError("Usage: %ballista connect df://host:port")
            self.ctx = BallistaSessionContext(args[0])
            return f"Connected to {args[0]}"
        elif cmd == "status":
            # Show cluster status
            pass

    @line_cell_magic
    def sql(self, line, cell=None):
        """Execute SQL query (works as both %sql and %%sql)"""
        if self.ctx is None:
            raise ValueError("Not connected. Use: %ballista connect df://host:port")
        # %sql passes the query in `line`; %%sql passes it in `cell`
        return self.ctx.sql(cell if cell is not None else line)


def load_ipython_extension(ipython):
    ipython.register_magics(BallistaMagics)

Alternative: Integrate with JupySQL, which provides a mature %%sql magic with features like:
- Query composition
- Result caching
- Plotting integration
- Multiple connection management
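If the hand-written magic route is chosen instead of JupySQL, the argument handling for `%ballista` could be factored into a small pure-Python helper that is testable without IPython or a running cluster. The following is a minimal sketch under that assumption; the helper name `parse_ballista_line` and the `BALLISTA_COMMANDS` table are hypothetical, with subcommand names taken from this proposal.

```python
import shlex

# Subcommand -> number of required arguments (names taken from this proposal).
BALLISTA_COMMANDS = {
    "connect": 1,
    "status": 0,
    "disconnect": 0,
    "tables": 0,
    "schema": 1,
}


def parse_ballista_line(line: str):
    """Return (command, args) or raise ValueError with a usage hint."""
    parts = shlex.split(line)
    if not parts:
        raise ValueError("Usage: %ballista <" + "|".join(BALLISTA_COMMANDS) + ">")
    cmd, args = parts[0], parts[1:]
    if cmd not in BALLISTA_COMMANDS:
        raise ValueError(f"Unknown command: {cmd!r}")
    expected = BALLISTA_COMMANDS[cmd]
    if len(args) != expected:
        raise ValueError(f"%ballista {cmd} expects {expected} argument(s)")
    return cmd, args
```

The line magic would then call `parse_ballista_line(line)` and dispatch on the returned command, so an empty or malformed line produces a usage message rather than an unpacking error.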
Phase 3: Enhanced Notebook Features (Medium Effort)
- Query plan visualization
  df.explain_visual()  # Render SVG of execution plan in notebook
  Leverage existing /api/job/{job_id}/dot_svg endpoint.
- Progress indicators for long queries
  # Show progress bar during distributed query execution
  from ipywidgets import FloatProgress
- Result size warnings
  # Warn before collecting large results
  df.collect()
  # Warning: Query will return ~1M rows. Use .limit() or proceed? [y/N]
- Schema exploration
  %ballista tables         # List registered tables
  %ballista schema orders  # Show schema for table
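The result size warning above could be prototyped as a small guard that runs before `collect()`. The sketch below assumes a row estimate is available from somewhere (for example plan statistics); how Ballista would supply it is left open. The name `check_result_size` and the threshold value are hypothetical, not existing Ballista settings, and the sketch emits a standard Python warning rather than the interactive `[y/N]` prompt.

```python
import warnings

ROW_WARNING_THRESHOLD = 1_000_000  # hypothetical default, not a Ballista setting


def check_result_size(estimated_rows, threshold=ROW_WARNING_THRESHOLD):
    """Warn when the estimated row count reaches the threshold.

    Returns True when collecting looks safe, False when a warning was issued.
    """
    if estimated_rows is None:
        return True  # no estimate available, nothing to warn about
    if estimated_rows >= threshold:
        warnings.warn(
            f"Query will return ~{estimated_rows:,} rows. "
            "Consider .limit() before collecting.",
            UserWarning,
            stacklevel=2,
        )
        return False
    return True
```

A non-interactive warning keeps the guard usable in scripts and automated notebook runs, where a blocking prompt would hang.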
Benefits
- Lower barrier to entry - SQL magic is familiar to data scientists
- Interactive exploration - Faster iteration in notebooks
- Discoverability - Example notebooks show what's possible
- Ecosystem alignment - Follows patterns from ipython-sql, JupySQL, DuckDB
Prior Art
- JupySQL - Modern SQL magic for Jupyter
- ipython-sql - Original %%sql magic
- DuckDB Jupyter - DuckDB's notebook integration
- Spark magic - SQL magic for Spark
Implementation Checklist
- Add example .ipynb notebooks to python/examples/
- Document notebook support in Python README
- Create ballista.jupyter module with magic commands
- Add %ballista connect/status/tables/schema line magics
- Add %%sql cell magic
- Add explain_visual() method for query plan rendering
- Consider JupySQL integration as alternative/complement
- Add progress indicator support for long-running queries