perf: Improve benchmarks for native row-to-columnar used by JVM shuffle #3290
base: main
Conversation
Add jvm_shuffle.rs benchmark that covers the full range of data types processed by `process_sorted_row_partition()` in JVM shuffle:

- Primitive columns (100 Int64 columns)
- Struct (flat with 5/10/20 fields)
- Nested struct (2 levels deep)
- Deeply nested struct (3 levels deep)
- List<Int64>
- Map<Int64, Int64>

This replaces the old row_columnar.rs which only tested primitive columns. These benchmarks help measure the performance of the row-to-columnar conversion used by CometColumnarShuffle when writing shuffle data.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
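As a rough picture of the benchmark's shape, here is a minimal criterion skeleton with one case per data type; `convert` and the case names are hypothetical stand-ins for the setup and the call into `process_sorted_row_partition()`, not the PR's actual code:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Hypothetical stand-in for the row-to-columnar call under test;
// the real benchmark invokes process_sorted_row_partition().
fn convert(rows: &[Vec<u8>]) -> usize {
    rows.iter().map(|r| r.len()).sum()
}

fn bench_jvm_shuffle(c: &mut Criterion) {
    // Placeholder input rows; the real benchmark builds UnsafeRow buffers.
    let rows: Vec<Vec<u8>> = (0..8192_i64).map(|i| i.to_le_bytes().to_vec()).collect();
    let mut group = c.benchmark_group("row_to_columnar");
    // One benchmark per data-type case (primitives, structs, list, map, ...)
    for case in ["primitive_int64", "struct_flat_5", "list_int64", "map_int64"] {
        group.bench_function(case, |b| b.iter(|| convert(std::hint::black_box(&rows))));
    }
    group.finish();
}

criterion_group!(benches, bench_jvm_shuffle);
criterion_main!(benches);
```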
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##               main    #3290      +/-   ##
============================================
+ Coverage     56.12%   60.13%    +4.00%
- Complexity      976     1468      +492
============================================
  Files           119      175       +56
  Lines         11743    16085     +4342
  Branches       2251     2665      +414
============================================
+ Hits           6591     9672     +3081
- Misses         4012     5066     +1054
- Partials       1140     1347      +207
Use div_ceil() instead of manual ceiling division and replace needless range loop with iterator pattern. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
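For context, `div_ceil()` is the std integer method this refers to (stable since Rust 1.73); a quick illustration of both replacements, on made-up values:

```rust
fn main() {
    // Manual ceiling division vs the std method
    let (n, d): (usize, usize) = (17, 8);
    assert_eq!((n + d - 1) / d, n.div_ceil(d));

    // Needless range loop vs iterator pattern
    let values = [10, 20, 30];
    let mut doubled = Vec::new();
    for i in 0..values.len() {
        doubled.push(values[i] * 2);
    }
    let doubled_iter: Vec<i32> = values.iter().map(|v| v * 2).collect();
    assert_eq!(doubled, doubled_iter);
}
```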
fn get_row_size(num_struct_fields: usize) -> usize {
    // Top-level row has 1 column (the struct)
    let top_level_bitset_width = SparkUnsafeRow::get_row_bitset_width(1);
    // Struct pointer (offset + size) is 8 bytes
Suggested change:
- // Struct pointer (offset + size) is 8 bytes
+ // Struct pointer (offset + size) is 8 bytes on 64bit architectures
    let top_level_bitset_width = SparkUnsafeRow::get_row_bitset_width(1);

    // Nested struct starts after top-level row header + pointer
    let nested_offset = top_level_bitset_width + 8;
Just a thought: it's too many eights, it would probably be easy to name them? Which one is the pointer size or the int64 size, etc.?
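A minimal sketch of that suggestion, with hypothetical crate-level names for the repeated eights (the names are illustrative, not from the PR):

```rust
// Hypothetical crate-level constants naming the repeated eights. In the
// UnsafeRow layout both a struct pointer (offset + size packed into one
// word) and an Int64 field occupy a single 8-byte word.
const WORD_SIZE: usize = 8;
const STRUCT_POINTER_SIZE: usize = WORD_SIZE;
const INT64_FIELD_SIZE: usize = WORD_SIZE;

// Illustrative rewrite of the offset arithmetic above using the constants.
fn value_offset(top_level_bitset_width: usize, nested_bitset_width: usize, i: usize) -> usize {
    // Nested struct starts after top-level row header + pointer
    let nested_offset = top_level_bitset_width + STRUCT_POINTER_SIZE;
    // i-th Int64 field within the nested struct
    nested_offset + nested_bitset_width + i * INT64_FIELD_SIZE
}
```

The same constants would also cover the `struct_pointer_size = 8` local further down.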
    // Fill nested struct with some data
    for i in 0..num_struct_fields {
        let value_offset = nested_offset + nested_bitset_width + i * 8;
        let value = (i as i64) * 100;
What is 100 here? Is it for alignment?
        false,
        0,
        None,
        &CompressionCodec::Zstd(1),
should we also check other codecs? 🤔
I might be wrong, but Spark uses LZ4 for IO_COMPRESSION_CODEC, which is used for shuffle?
private[spark] val IO_COMPRESSION_CODEC =
ConfigBuilder("spark.io.compression.codec")
.doc("The codec used to compress internal data such as RDD partitions, event log, " +
"broadcast variables and shuffle outputs. By default, Spark provides four codecs: " +
"lz4, lzf, snappy, and zstd. You can also use fully qualified class names to specify " +
"the codec")
.version("0.8.0")
.stringConf
.createWithDefaultString("lz4")
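One way to cover multiple codecs would be to parameterize the benchmark over them. A minimal self-contained sketch; the local `Codec` enum and `run_conversion` are stand-ins for the crate's `CompressionCodec` and the call into `process_sorted_row_partition()`, and the exact variant names (e.g. an LZ4 variant) are assumptions:

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Local stand-in for the crate's CompressionCodec enum.
enum Codec {
    Zstd(i32),
    Lz4,
    None,
}

// Hypothetical helper standing in for the conversion call with a codec.
fn run_conversion(codec: &Codec) -> usize {
    match codec {
        Codec::Zstd(level) => *level as usize,
        Codec::Lz4 => 1,
        Codec::None => 0,
    }
}

fn bench_codecs(c: &mut Criterion) {
    let mut group = c.benchmark_group("codecs");
    // Same benchmark body, run once per codec.
    for (name, codec) in [
        ("zstd_1", Codec::Zstd(1)),
        ("lz4", Codec::Lz4),
        ("none", Codec::None),
    ] {
        group.bench_function(name, |b| b.iter(|| run_conversion(&codec)));
    }
    group.finish();
}

criterion_group!(benches, bench_codecs);
criterion_main!(benches);
```

If lz4 is indeed Spark's shuffle default, benchmarking it alongside zstd would make the comparison more representative.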
}

/// Create a schema with nested structs: Struct<Struct<int64 fields>>
fn make_nested_struct_schema(num_fields: usize) -> DataType {
I have a feeling make_nested_struct_schema and make_deeply_nested_struct_schema could be generalized?
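A minimal sketch of one possible generalization: a single recursive builder parameterized by nesting depth (the name and shape are a suggestion, not the PR's code):

```rust
use arrow::datatypes::{DataType, Field, Fields};

/// Sketch of a generalized builder: `num_fields` Int64 fields at the
/// innermost level, wrapped in `depth - 1` extra struct layers.
/// depth = 2 would match make_nested_struct_schema, depth = 3 the
/// deeply nested variant.
fn make_struct_schema(num_fields: usize, depth: usize) -> DataType {
    let mut dt = DataType::Struct(
        (0..num_fields)
            .map(|i| Field::new(format!("f{i}"), DataType::Int64, true))
            .collect::<Fields>(),
    );
    for level in 1..depth {
        dt = DataType::Struct(Fields::from(vec![Field::new(
            format!("level{level}"),
            dt,
            true,
        )]));
    }
    dt
}
```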
fn get_nested_row_size(num_inner_fields: usize) -> usize {
    // Top-level row has 1 column (the outer struct)
    let top_level_bitset_width = SparkUnsafeRow::get_row_bitset_width(1);
    let struct_pointer_size = 8;
This value should probably be a const at the crate level?
comphead left a comment
Thanks @andygrove. WDYT, would it be nice to get a follow-up PR covering nested list/map/struct type combinations?
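Such combinations are straightforward to express in Arrow's type system; a minimal sketch of two candidate schemas for a follow-up (illustrative, not from this PR):

```rust
use std::sync::Arc;
use arrow::datatypes::{DataType, Field, Fields};

// Illustrative nested combinations for a follow-up benchmark:
// List<Struct<a: Int64, b: Int64>> and Map<Int64, Struct<a: Int64, b: Int64>>.
fn nested_combination_schemas() -> (DataType, DataType) {
    let inner_struct = DataType::Struct(Fields::from(vec![
        Field::new("a", DataType::Int64, true),
        Field::new("b", DataType::Int64, true),
    ]));

    // List of structs
    let list_of_struct = DataType::List(Arc::new(Field::new(
        "item",
        inner_struct.clone(),
        true,
    )));

    // Map<Int64, Struct>: Arrow encodes a map as a list of key/value
    // struct entries; the second parameter is the sorted-keys flag.
    let entries = Field::new(
        "entries",
        DataType::Struct(Fields::from(vec![
            Field::new("key", DataType::Int64, false),
            Field::new("value", inner_struct, true),
        ])),
        false,
    );
    let map_of_struct = DataType::Map(Arc::new(entries), false);

    (list_of_struct, map_of_struct)
}
```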
Improve the row_columnar.rs benchmark to cover the full range of data types processed by `process_sorted_row_partition()` in JVM shuffle (primitives, flat and nested structs, List<Int64>, Map<Int64, Int64>). These benchmarks help measure the performance of the row-to-columnar conversion used by CometColumnarShuffle when writing shuffle data.