Skip to content

perf: Consider using async I/O (tokio::fs) in shuffle writer #1387

@andygrove

Description

@andygrove

Description

The shuffle writer currently uses synchronous file I/O operations (std::fs) inside async functions, which blocks the tokio runtime. This could impact performance, especially when multiple tasks are executing concurrently on the same executor.

Current Implementation

In ballista/core/src/execution_plans/shuffle_writer.rs and ballista/core/src/utils.rs, the following synchronous operations are used within async contexts:

  • std::fs::create_dir_all() - creating output directories
  • std::fs::File::create() - creating shuffle output files
  • std::fs::metadata() - reading file size after write
  • StreamWriter::write() / StreamWriter::finish() - writing data (synchronous write to file)

Proposed Optimization

Consider using tokio::fs equivalents or wrapping synchronous I/O in spawn_blocking:

Option 1: Use tokio::fs directly

use tokio::fs::{self, File};
use tokio::io::BufWriter;

// Directory creation
tokio::fs::create_dir_all(&path).await?;

// File creation
let file = BufWriter::new(File::create(path).await?);

// File metadata
let num_bytes = tokio::fs::metadata(&w.path).await?.len();

Option 2: Use spawn_blocking for StreamWriter

Since Arrow's StreamWriter is synchronous, the actual writes would need spawn_blocking:

let batch_clone = batch.clone();
tokio::task::spawn_blocking(move || {
    writer.write(&batch_clone)
}).await??;

Trade-offs to Consider

  1. Complexity: Async file I/O adds complexity, especially for the StreamWriter which is inherently synchronous
  2. Performance: The benefit depends on workload - may be more significant with many concurrent tasks and small batches
  3. Memory: spawn_blocking requires cloning data or careful lifetime management
  4. Compatibility: Arrow's IPC writer is synchronous; full async would require buffering or a different approach

Benchmarking Needed

Before implementing, it would be valuable to benchmark:

  • Current performance with synchronous I/O
  • Impact of blocking on concurrent task execution
  • Whether spawn_blocking overhead is worth the async benefit

Related

This was identified during a code review of the shuffle writer. See also PR #1386 which adds BufWriter for buffered I/O.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions