[AURON #1891] Implement randn() function #1938

robreeves · 2026-01-21T05:35:05Z

Which issue does this PR close?

Closes #1891

Rationale for this change

This improves Spark function coverage in Auron.

What changes are included in this PR?

Adds support for the Spark randn function.

Are there any user-facing changes?

Yes, it adds the randn function.

How was this patch tested?

Unit tests and [TODO] manual testing in spark-shell.

Copilot

Pull request overview

This PR implements the randn() function to improve Spark function coverage in Auron. The function generates random values from a standard normal distribution with optional seed support.

Changes:

- Added Rust implementation of spark_randn function with seed handling
- Registered the new function in the Scala converter and Rust function registry
- Added rand_distr dependency for normal distribution sampling

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| spark-extension/src/main/scala/org/apache/spark/sql/auron/NativeConverters.scala | Added case handler for Randn expression to route to native implementation |
| native-engine/datafusion-ext-functions/src/spark_randn.rs | New implementation of randn function with seed handling and unit tests |
| native-engine/datafusion-ext-functions/src/lib.rs | Registered Spark_Randn function in the extension function factory |
| native-engine/datafusion-ext-functions/Cargo.toml | Added rand and rand_distr dependencies |
| Cargo.toml | Added rand_distr workspace dependency |
| Cargo.lock | Updated lock file with rand_distr package metadata |

Copilot · 2026-01-21T05:52:27Z

native-engine/datafusion-ext-functions/src/spark_randn.rs

+#[cfg(test)]
+mod test {
+    use std::error::Error;
+
+    use datafusion::{common::ScalarValue, logical_expr::ColumnarValue};
+
+    use crate::spark_randn::spark_randn;
+
+    #[test]
+    fn test_randn_with_seed_reproducibility() -> Result<(), Box<dyn Error>> {
+        // Same seed should produce same result
+        let seed = ColumnarValue::Scalar(ScalarValue::Int64(Some(42)));
+
+        let result1 = spark_randn(&vec![seed.clone()])?;
+        let result2 = spark_randn(&vec![seed])?;
+
+        match (result1, result2) {
+            (
+                ColumnarValue::Scalar(ScalarValue::Float64(Some(v1))),
+                ColumnarValue::Scalar(ScalarValue::Float64(Some(v2))),
+            ) => {
+                assert_eq!(v1, v2, "Same seed should produce same result");
+            }
+            _ => panic!("Expected Float64 scalar results"),
+        }
+        Ok(())
+    }
+
+    #[test]
+    fn test_randn_different_seeds() -> Result<(), Box<dyn Error>> {
+        // Different seeds should produce different results (with very high probability)
+        let seed1 = ColumnarValue::Scalar(ScalarValue::Int64(Some(42)));
+        let seed2 = ColumnarValue::Scalar(ScalarValue::Int64(Some(123)));
+
+        let result1 = spark_randn(&vec![seed1])?;
+        let result2 = spark_randn(&vec![seed2])?;
+
+        match (result1, result2) {
+            (
+                ColumnarValue::Scalar(ScalarValue::Float64(Some(v1))),
+                ColumnarValue::Scalar(ScalarValue::Float64(Some(v2))),
+            ) => {
+                assert_ne!(v1, v2, "Different seeds should produce different results");
+            }
+            _ => panic!("Expected Float64 scalar results"),
+        }
+        Ok(())
+    }
+
+    #[test]
+    fn test_randn_no_seed() -> Result<(), Box<dyn Error>> {
+        // Without seed, should still produce a valid float
+        let result = spark_randn(&vec![])?;
+
+        match result {
+            ColumnarValue::Scalar(ScalarValue::Float64(Some(v))) => {
+                assert!(v.is_finite(), "Result should be a finite number");
+            }
+            _ => panic!("Expected Float64 scalar result"),
+        }
+        Ok(())
+    }
+
+    #[test]
+    fn test_randn_with_int32_seed() -> Result<(), Box<dyn Error>> {
+        // Int32 seed should work
+        let seed = ColumnarValue::Scalar(ScalarValue::Int32(Some(42)));
+
+        let result = spark_randn(&vec![seed])?;
+
+        match result {
+            ColumnarValue::Scalar(ScalarValue::Float64(Some(v))) => {
+                assert!(v.is_finite(), "Result should be a finite number");
+            }
+            _ => panic!("Expected Float64 scalar result"),
+        }
+        Ok(())
+    }
+
+    #[test]
+    fn test_randn_with_null_seed() -> Result<(), Box<dyn Error>> {
+        // Null seed should be treated as no seed (random)
+        let seed = ColumnarValue::Scalar(ScalarValue::Null);
+
+        let result = spark_randn(&vec![seed])?;
+
+        match result {
+            ColumnarValue::Scalar(ScalarValue::Float64(Some(v))) => {
+                assert!(v.is_finite(), "Result should be a finite number");
+            }
+            _ => panic!("Expected Float64 scalar result"),
+        }
+        Ok(())
+    }
+
+}


The test coverage is incomplete as it only tests scalar seed values. Tests should be added for columnar (array) inputs to verify that the function correctly generates different random values for each row in a batch. This is the primary use case when randn() is used in DataFrame queries.

Copilot · 2026-01-21T05:52:28Z

native-engine/datafusion-ext-functions/src/spark_randn.rs

+pub fn spark_randn(args: &[ColumnarValue]) -> Result<ColumnarValue> {
+    // Parse seed argument, or generate random seed if not provided
+    let seed: u64 = if args.is_empty() {
+        rand::random()
+    } else {
+        match &args[0] {
+            ColumnarValue::Scalar(ScalarValue::Int64(Some(s))) => *s as u64,
+            ColumnarValue::Scalar(ScalarValue::Int32(Some(s))) => *s as u64,
+            _ => rand::random(),
+        }
+    };
+
+    let mut rng = StdRng::seed_from_u64(seed);
+    let value: f64 = StandardNormal.sample(&mut rng);
+
+    Ok(ColumnarValue::Scalar(ScalarValue::Float64(Some(value))))
+}


This implementation only handles scalar seed inputs but does not handle columnar (array) data properly. When randn() is used in a DataFrame context (e.g., df.select(randn(seed))), it should generate a different random value for each row in the dataset, not just return a single value.

The function needs to be updated to handle ColumnarValue::Array inputs where the seed could be an array, and generate an array of random values - one per row. This is critical for correct Spark compatibility when randn is used in SELECT queries over datasets with multiple rows.
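
A minimal sketch of the per-row behavior described above, written against the standard library only so that it stands alone. `SplitMix64`, `next_gaussian`, and `randn_batch` are hypothetical names, and the SplitMix64 plus Box-Muller combination is a stand-in for the `StdRng` and `StandardNormal` used in the PR, so the values will not match the PR's or Spark's output. The point is only the shape: seed the generator once, then draw one value per row.

```rust
/// Stand-in for a seedable RNG (SplitMix64). The PR itself uses `StdRng`.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    /// Advances the state and returns the next 64-bit output.
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform value in (0, 1]; never zero, so `ln()` below stays finite.
    fn next_open01(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }

    /// One standard-normal draw via the Box-Muller transform.
    fn next_gaussian(&mut self) -> f64 {
        let u1 = self.next_open01();
        let u2 = self.next_open01();
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
    }
}

/// Hypothetical per-row variant: seed once, then draw one value per row.
fn randn_batch(seed: i64, num_rows: usize) -> Vec<f64> {
    let mut rng = SplitMix64::new(seed as u64);
    (0..num_rows).map(|_| rng.next_gaussian()).collect()
}
```

Calling `randn_batch(42, 4)` twice yields the same four values, while the rows within one batch differ from each other. In the real function the resulting values would presumably be returned as a Float64 array rather than a single scalar.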

Copilot · 2026-01-21T05:52:28Z

native-engine/datafusion-ext-functions/src/spark_randn.rs

+///from the standard normal distribution
+///
+/// - Takes an optional seed (i64) for reproducibility
+/// - If no seed is provided, uses a random seed


The documentation states "If no seed is provided, uses a random seed" which is misleading. In Spark's randn() function, when no seed is provided, each invocation should generate different random values (using a randomized seed per partition and row). The current implementation with a single random seed would produce the same value for all rows in a batch when the seed argument is absent, which does not match Spark's behavior.

Suggested change

-/// - If no seed is provided, uses a random seed
+/// - If no seed is provided, a random seed is chosen once per invocation, so all rows in the
+/// batch share the same value (unlike Spark's randn, which yields different values per row)
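
One way to reason about the no-seed case is to separate seed resolution from value generation. The sketch below assumes a scheme in which a missing seed is replaced by a random one exactly once, and each partition then offsets that seed by its partition index. This is an assumption about how a Spark-compatible implementation might behave, not something confirmed in this thread. `resolve_seed` and `partition_seed` are hypothetical helpers, and `RandomState` is used only as a std-only source of randomness in place of `rand::random()`.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Hypothetical helper: returns the caller's seed, or a fresh random one when absent.
/// `RandomState` carries randomly chosen keys, so an empty hasher's output varies per call.
fn resolve_seed(seed: Option<i64>) -> i64 {
    seed.unwrap_or_else(|| RandomState::new().build_hasher().finish() as i64)
}

/// Hypothetical helper: offsets the resolved seed by the partition index, so that
/// partitions sharing one seed still start from different generator states.
fn partition_seed(seed: i64, partition_index: i32) -> i64 {
    seed.wrapping_add(i64::from(partition_index))
}
```

A generator seeded with `partition_seed(resolve_seed(seed), partition_index)` and advanced once per row would give reproducible output for an explicit seed while still producing a different value on every row.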

robreeves added 4 commits January 18, 2026 10:37

randn

175fa00

reduce arg parsing complexity

5bb5ca5

remove extra method

3d0fa7d

use rand_dist

0f793c4

github-actions bot added spark native build labels Jan 21, 2026

robreeves added 2 commits January 20, 2026 21:40

revert auto format change

0d041f3

revert auto format changes

0b93080

cxzl25 requested a review from Copilot January 21, 2026 05:48

Copilot started reviewing on behalf of cxzl25 January 21, 2026 05:49

Copilot AI reviewed Jan 21, 2026

autoformat

3b78e73
