Skip to content

Improve Jupyter notebook support with SQL magic commands and examples #1398

@andygrove

Description

@andygrove

Summary

Improve the Jupyter notebook experience for Ballista by adding SQL magic commands, example notebooks, and notebook-specific features. While basic notebook support already works via _repr_html_, there's an opportunity to provide a richer, more integrated experience.

Current State

PyBallista already supports basic Jupyter usage:

from ballista import BallistaSessionContext

ctx = BallistaSessionContext("df://localhost:50050")
df = ctx.sql("SELECT * FROM my_table LIMIT 10")
df  # Renders as HTML table via _repr_html_()

What works today:

  • _repr_html_() - DataFrames render as styled HTML tables
  • to_pandas() / to_arrow_table() / to_polars() - Data conversion
  • show() - Terminal-style output
  • Example .py files with # %% cell markers

Proposed Improvements

Phase 1: Documentation & Examples (Low Effort)

  1. Add example Jupyter notebooks to python/examples/:

    • getting_started.ipynb - Basic connection and queries
    • dataframe_api.ipynb - DataFrame transformations
    • distributed_queries.ipynb - Multi-stage query examples
  2. Document notebook support in python/README.md

Phase 2: SQL Magic Commands (Medium Effort)

Add IPython magic commands for a more interactive SQL experience:

%load_ext ballista.jupyter

# Connect to cluster
%ballista connect df://localhost:50050

# Line magic for simple queries
%sql SELECT COUNT(*) FROM orders

# Cell magic for complex queries
%%sql
SELECT 
    customer_id,
    SUM(amount) as total
FROM orders
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10

Implementation sketch:

# ballista/jupyter.py
from IPython.core.magic import Magics, magics_class, line_magic, cell_magic

@magics_class
class BallistaMagics(Magics):
    def __init__(self, shell):
        super().__init__(shell)
        self.ctx = None
    
    @line_magic
    def ballista(self, line):
        """Ballista commands: connect, status, disconnect"""
        cmd, *args = line.split()
        if cmd == "connect":
            self.ctx = BallistaSessionContext(args[0])
            return f"Connected to {args[0]}"
        elif cmd == "status":
            # Show cluster status
            pass
    
    @cell_magic
    def sql(self, line, cell):
        """Execute SQL query"""
        if self.ctx is None:
            raise ValueError("Not connected. Use: %ballista connect df://host:port")
        return self.ctx.sql(cell)

def load_ipython_extension(ipython):
    ipython.register_magics(BallistaMagics)

Alternative: Integrate with JupySQL which provides a mature %%sql magic with features like:

  • Query composition
  • Result caching
  • Plotting integration
  • Multiple connection management

Phase 3: Enhanced Notebook Features (Medium Effort)

  1. Query plan visualization

    df.explain_visual()  # Render SVG of execution plan in notebook

    Leverage existing /api/job/{job_id}/dot_svg endpoint.

  2. Progress indicators for long queries

    # Show progress bar during distributed query execution
    from ipywidgets import FloatProgress
  3. Result size warnings

    # Warn before collecting large results
    df.collect()  # Warning: Query will return ~1M rows. Use .limit() or proceed? [y/N]
  4. Schema exploration

    %ballista tables           # List registered tables
    %ballista schema orders    # Show schema for table

Benefits

  1. Lower barrier to entry - SQL magic is familiar to data scientists
  2. Interactive exploration - Faster iteration in notebooks
  3. Discoverability - Example notebooks show what's possible
  4. Ecosystem alignment - Follows patterns from ipython-sql, JupySQL, DuckDB

Prior Art

Implementation Checklist

  • Add example .ipynb notebooks to python/examples/
  • Document notebook support in Python README
  • Create ballista.jupyter module with magic commands
  • Add %ballista connect/status/tables/schema line magics
  • Add %%sql cell magic
  • Add explain_visual() method for query plan rendering
  • Consider JupySQL integration as alternative/complement
  • Add progress indicator support for long-running queries

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions