Open
Description
The shuffle writer currently uses synchronous file I/O operations (std::fs) inside async functions, which blocks the tokio runtime. This could impact performance, especially when multiple tasks are executing concurrently on the same executor.
Current Implementation
In ballista/core/src/execution_plans/shuffle_writer.rs and ballista/core/src/utils.rs, the following synchronous operations are used within async contexts:
- std::fs::create_dir_all() - creating output directories
- std::fs::File::create() - creating shuffle output files
- std::fs::metadata() - reading file size after write
- StreamWriter::write() / StreamWriter::finish() - writing data (synchronous write to file)
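For illustration, the sequence above can be reduced to a std-only sketch. The function name `write_partition_sync` and its parameters are hypothetical, and plain byte writes stand in for the Arrow `StreamWriter` calls; this is not the actual Ballista code, only the shape of the blocking calls it makes.

```rust
use std::fs::{self, File};
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Hypothetical stand-in for the shuffle write path. Every call below blocks
/// the calling thread until the operating system returns.
fn write_partition_sync(dir: &Path, file_name: &str, payload: &[u8]) -> io::Result<u64> {
    // Blocking: create the output directory.
    fs::create_dir_all(dir)?;
    let path = dir.join(file_name);
    // Blocking: create the output file.
    let mut writer = BufWriter::new(File::create(&path)?);
    // Blocking: stands in for StreamWriter::write().
    writer.write_all(payload)?;
    // Blocking: stands in for StreamWriter::finish().
    writer.flush()?;
    drop(writer);
    // Blocking: read the file size after the write.
    Ok(fs::metadata(&path)?.len())
}
```

When a function like this is called directly from an async task, the runtime worker thread running that task cannot poll any other task until the function returns.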
Proposed Optimization
Consider using tokio::fs equivalents or wrapping synchronous I/O in spawn_blocking:
Option 1: Use tokio::fs directly
use tokio::fs::{self, File};
use tokio::io::BufWriter;
// Directory creation
tokio::fs::create_dir_all(&path).await?;
// File creation
let file = BufWriter::new(File::create(path).await?);
// File metadata
let num_bytes = tokio::fs::metadata(&w.path).await?.len();
Option 2: Use spawn_blocking for StreamWriter
Since Arrow's StreamWriter is synchronous, the actual writes would need spawn_blocking:
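// spawn_blocking takes a 'static closure, so the writer is moved in and
// handed back; otherwise it could not be used for the next batch.
// ArrowError here is arrow::error::ArrowError.
let batch_clone = batch.clone();
let writer = tokio::task::spawn_blocking(move || {
    writer.write(&batch_clone)?;
    Ok::<_, ArrowError>(writer)
}).await??;
Trade-offs to Consider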
- Complexity: Async file I/O adds complexity, especially for the StreamWriter, which is inherently synchronous
- Performance: The benefit depends on workload - may be more significant with many concurrent tasks and small batches
- Memory: spawn_blocking requires cloning data or careful lifetime management
- Compatibility: Arrow's IPC writer is synchronous; full async would require buffering or a different approach
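One possible shape for the "different approach" mentioned above is a dedicated writer thread that owns the file and receives buffers over a bounded channel. The sketch below uses only the standard library; `spawn_writer` is a hypothetical name, raw byte buffers stand in for record batches, and in an async runtime the std channel would have to be replaced with an async-aware one, since `SyncSender::send` itself blocks when the channel is full.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::PathBuf;
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread::{self, JoinHandle};

/// Hypothetical helper: spawns a thread that owns the output file and writes
/// every buffer it receives. Returns the sender and a handle that yields the
/// total number of bytes written.
fn spawn_writer(
    path: PathBuf,
    capacity: usize,
) -> (SyncSender<Vec<u8>>, JoinHandle<io::Result<u64>>) {
    // Bounded channel: at most `capacity` buffers are queued at once.
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);
    let handle = thread::spawn(move || -> io::Result<u64> {
        let mut writer = BufWriter::new(File::create(&path)?);
        let mut total = 0u64;
        // The loop ends once every sender has been dropped.
        for buf in rx {
            writer.write_all(&buf)?;
            total += buf.len() as u64;
        }
        writer.flush()?;
        Ok(total)
    });
    (tx, handle)
}
```

Ownership of each buffer moves through the channel, so nothing is cloned and no lifetimes cross the thread boundary. The cost is one extra thread per open writer and a queue whose capacity bounds the buffered memory.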
Benchmarking Needed
Before implementing, it would be valuable to benchmark:
- Current performance with synchronous I/O
- Impact of blocking on concurrent task execution
- Whether spawn_blocking overhead is worth the async benefit
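As a starting point for the first measurement, a minimal std-only timing harness might look like the following. The name `time_sync_writes` and its parameters are hypothetical, the payload is zero-filled bytes rather than Arrow IPC data, and the timing covers only the buffered write and flush (no fsync), so the numbers would be a rough baseline at best.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::Path;
use std::time::{Duration, Instant};

/// Writes `batches` buffers of `batch_len` bytes synchronously and reports
/// the bytes written and the wall-clock time the calling thread was blocked.
fn time_sync_writes(
    path: &Path,
    batches: usize,
    batch_len: usize,
) -> io::Result<(u64, Duration)> {
    let payload = vec![0u8; batch_len];
    let start = Instant::now();
    let mut writer = BufWriter::new(File::create(path)?);
    for _ in 0..batches {
        writer.write_all(&payload)?;
    }
    writer.flush()?;
    let elapsed = start.elapsed();
    Ok((batches as u64 * batch_len as u64, elapsed))
}
```

Running it with the same total size split into many small batches and into a few large ones gives a first indication of how long a runtime worker thread would be held per write under each workload.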
Related
This was identified during a code review of the shuffle writer. See also PR #1386 which adds BufWriter for buffered I/O.