perf: optimize array_remove for scalar needle by lyne7-sc · Pull Request #22390 · apache/datafusion

lyne7-sc · 2026-05-20T09:29:00Z

Which issue does this PR close?

Closes #.

Rationale for this change

Similar to #22387 (array_replace scalar optimization)

array_remove / array_remove_n / array_remove_all perform element-wise comparison by invoking compare_element_to_list against each row's sub-array individually. When the needle is a scalar, this can be optimized by performing a single vectorized distinct comparison over the entire flattened values buffer.

What changes are included in this PR?

Add a specialized removal kernel (general_remove_with_scalar) that uses arrow_ord::cmp::distinct with a Scalar wrapper to perform a single bulk comparison pass over the flat values buffer.
Extend SLT tests with multi-row scalar-argument coverage, NULL-containing arrays, empty-array edge cases, boundary n values, and LargeList type coverage.
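
To illustrate the idea behind the specialized kernel, here is a minimal std-only sketch. It is not the code in this PR: `ListColumn` and `remove_all_scalar` are hypothetical names, plain vectors stand in for Arrow buffers, and NULLs are not modeled. It only shows the shape of the optimization: compare the flat values against the scalar needle once, then use the row offsets solely to rebuild the output.

```rust
/// Std-only model of a list column: row `i` owns `values[offsets[i]..offsets[i + 1]]`.
/// (Hypothetical type for illustration; not an Arrow or DataFusion API.)
pub struct ListColumn {
    pub offsets: Vec<usize>,
    pub values: Vec<i64>,
}

/// Remove every occurrence of `needle` from each row using one pass over the
/// flat values buffer, instead of one comparison call per row.
pub fn remove_all_scalar(col: &ListColumn, needle: i64) -> ListColumn {
    // Single bulk comparison: keep[i] is true when values[i] differs from the needle.
    let keep: Vec<bool> = col.values.iter().map(|v| *v != needle).collect();

    let mut offsets = Vec::with_capacity(col.offsets.len());
    let mut values = Vec::with_capacity(col.values.len());
    offsets.push(0);
    // Row boundaries are only needed to rebuild the output offsets.
    for row in col.offsets.windows(2) {
        for i in row[0]..row[1] {
            if keep[i] {
                values.push(col.values[i]);
            }
        }
        offsets.push(values.len());
    }
    ListColumn { offsets, values }
}
```

For rows `[1, 2, 1]`, `[]`, `[2, 2]` and needle `2`, this yields rows `[1, 1]`, `[]`, `[]`. The per-row work shrinks to copying, which is consistent with the benchmarks showing the largest gains for many short lists.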

Benchmarks

group                                                                    baseline                               optimized
-----                                                                    --------                               ---------
array_remove_all_int64/remove/list size: 10, num_rows: 4000              4.10   853.8±25.65µs        ? ?/sec    1.00    208.4±2.94µs        ? ?/sec
array_remove_all_int64/remove/list size: 100, num_rows: 10000            1.41      5.6±0.26ms        ? ?/sec    1.00      4.0±0.64ms        ? ?/sec
array_remove_all_int64/remove/list size: 500, num_rows: 10000            1.06     18.9±0.43ms        ? ?/sec    1.00     17.8±0.53ms        ? ?/sec
array_remove_all_int64_nested/remove/list size: 10, num_rows: 4000       1.00      7.0±0.26ms        ? ?/sec    1.03      7.2±0.59ms        ? ?/sec
array_remove_all_int64_nested/remove/list size: 100, num_rows: 3000      1.00     36.4±0.54ms        ? ?/sec    1.01     36.6±0.29ms        ? ?/sec
array_remove_all_int64_nested/remove/list size: 300, num_rows: 1500      1.00     52.6±0.95ms        ? ?/sec    1.01     53.0±1.17ms        ? ?/sec
array_remove_boolean/remove/list size: 10, num_rows: 4000                3.76  846.9±108.47µs        ? ?/sec    1.00    225.3±2.84µs        ? ?/sec
array_remove_boolean/remove/list size: 100, num_rows: 10000              2.06      4.1±0.79ms        ? ?/sec    1.00  1983.4±48.23µs        ? ?/sec
array_remove_boolean/remove/list size: 500, num_rows: 10000              1.62     11.0±1.50ms        ? ?/sec    1.00      6.8±0.08ms        ? ?/sec
array_remove_fixed_size_binary/remove/list size: 10, num_rows: 4000      3.12   933.9±76.49µs        ? ?/sec    1.00    299.2±4.95µs        ? ?/sec
array_remove_fixed_size_binary/remove/list size: 100, num_rows: 10000    1.51      7.5±0.41ms        ? ?/sec    1.00      5.0±0.10ms        ? ?/sec
array_remove_fixed_size_binary/remove/list size: 500, num_rows: 10000    1.19     30.1±3.18ms        ? ?/sec    1.00     25.4±0.89ms        ? ?/sec
array_remove_int64/remove/list size: 10, num_rows: 4000                  4.35   837.8±42.38µs        ? ?/sec    1.00    192.6±3.69µs        ? ?/sec
array_remove_int64/remove/list size: 100, num_rows: 10000                2.09      4.1±0.63ms        ? ?/sec    1.00  1947.7±341.88µs        ? ?/sec
array_remove_int64/remove/list size: 500, num_rows: 10000                1.15     10.9±0.83ms        ? ?/sec    1.00      9.5±3.14ms        ? ?/sec
array_remove_int64_nested/remove/list size: 10, num_rows: 4000           1.00      7.0±0.20ms        ? ?/sec    1.01      7.1±0.18ms        ? ?/sec
array_remove_int64_nested/remove/list size: 100, num_rows: 3000          1.00     36.0±0.92ms        ? ?/sec    1.00     36.0±0.38ms        ? ?/sec
array_remove_int64_nested/remove/list size: 300, num_rows: 1500          1.01     52.3±1.28ms        ? ?/sec    1.00     51.9±0.73ms        ? ?/sec
array_remove_n_int64/remove/list size: 10, num_rows: 4000                4.11   854.5±26.44µs        ? ?/sec    1.00    207.7±4.18µs        ? ?/sec
array_remove_n_int64/remove/list size: 100, num_rows: 10000              1.73      5.2±0.82ms        ? ?/sec    1.00      3.0±0.60ms        ? ?/sec
array_remove_n_int64/remove/list size: 500, num_rows: 10000              1.11     15.7±2.01ms        ? ?/sec    1.00     14.2±1.97ms        ? ?/sec
array_remove_n_int64_nested/remove/list size: 10, num_rows: 4000         1.03      7.2±0.56ms        ? ?/sec    1.00      7.0±0.08ms        ? ?/sec
array_remove_n_int64_nested/remove/list size: 100, num_rows: 3000        1.00     36.3±1.28ms        ? ?/sec    1.01     36.5±0.39ms        ? ?/sec
array_remove_n_int64_nested/remove/list size: 300, num_rows: 1500        1.00     51.4±0.59ms        ? ?/sec    1.01     51.8±0.54ms        ? ?/sec
array_remove_strings/remove/list size: 10, num_rows: 4000                2.48  1137.7±23.31µs        ? ?/sec    1.00   458.2±12.41µs        ? ?/sec
array_remove_strings/remove/list size: 100, num_rows: 10000              1.31     10.3±0.61ms        ? ?/sec    1.00      7.9±0.17ms        ? ?/sec
array_remove_strings/remove/list size: 500, num_rows: 10000              1.14     40.1±4.03ms        ? ?/sec    1.00     35.2±0.98ms        ? ?/sec

Are these changes tested?

Yes, existing and new SLT edge-case tests in array_remove.slt.

Are there any user-facing changes?

No.

neilconway

Nice performance win!

neilconway · 2026-05-20T16:09:11Z

+            );
+        }
+    };
+    let original_data = list_array.values().to_data();


This will be inefficient for sliced arrays.

I now slice the values to the range actually referenced by the offsets.

That said, I wanted to understand your concern better: when a GenericListArray is sliced, values() returns the full underlying array, and to_data() on it wraps the existing buffer references into ArrayData without copying. So the main downside I could identify is that Capacities::Array(original_data.len()) over-estimates the pre-allocation for sliced inputs. Were you thinking of a different inefficiency, or is the over-allocation what you had in mind?

The over-allocation was one part, but the bigger concern is calling the distinct kernel on the entire values buffer (see other comment).
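
As a rough illustration of the sliced-array concern, the sketch below models a sliced list as a window of offsets over a shared values buffer. The names `visible_values` and `keep_mask_visible` are hypothetical and plain slices stand in for Arrow arrays. The point is that only the range between the first and last offset needs to be compared and pre-allocated for.

```rust
/// Return the sub-range of a shared values buffer that a (possibly sliced)
/// list actually references, given that list's offsets.
/// (Hypothetical helper; plain slices stand in for Arrow arrays.)
pub fn visible_values<'a>(offsets: &[usize], values: &'a [i64]) -> &'a [i64] {
    match (offsets.first(), offsets.last()) {
        (Some(&start), Some(&end)) => &values[start..end],
        // No offsets at all: nothing is referenced.
        _ => &values[0..0],
    }
}

/// Compare only the visible range against the needle. Index `i` of the
/// returned mask corresponds to `values[offsets[0] + i]`.
pub fn keep_mask_visible(offsets: &[usize], values: &[i64], needle: i64) -> Vec<bool> {
    visible_values(offsets, values)
        .iter()
        .map(|v| *v != needle)
        .collect()
}
```

With a seven-element buffer and offsets `[2, 4, 5]`, only three elements are compared rather than seven. Because the mask is relative to the first offset, later copying has to add that offset back.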

neilconway · 2026-05-20T16:12:41Z

+            let list_array = array.as_list::<i64>();
+            general_remove_with_scalar::<i64>(list_array, needle, arr_n)
+        }
+        array_type => exec_err!("array_remove does not support type '{array_type}'."),


This is called by more than just array_remove; can we improve the error message?

Sure, updated.

neilconway · 2026-05-20T16:23:46Z

+        for (i, keep) in eq_array.iter().enumerate() {
+            if keep == Some(false) && removed < max_removals {
+                if let Some(bs) = pending_batch_to_retain {
+                    mutable.extend(0, start + bs, start + i);
+                    copied += i - bs;
+                    pending_batch_to_retain = None;
+                }
+                removed += 1;
+            } else if pending_batch_to_retain.is_none() {
+                pending_batch_to_retain = Some(i);
+            }
+        }


I wonder if it would be possible to iterate only over the "false" bits, e.g., by negating the buffer and looking at BooleanBuffer::set_indices.

Great suggestion. Benchmarks show a ~20–40% improvement with this optimization.
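
A minimal sketch of that approach, under stated assumptions: `copy_without` is a hypothetical helper, and its `remove_at` argument stands in for the positions that iterating the set bits of the negated mask would yield (sorted ascending). Rather than testing every element, it copies the retained runs between consecutive removal positions.

```rust
/// Copy `values` into a new vector, skipping the positions listed in
/// `remove_at` (assumed sorted ascending, unique, and in bounds). Retained
/// elements are copied as contiguous runs between removal positions.
pub fn copy_without(values: &[i64], remove_at: &[usize]) -> Vec<i64> {
    let mut out = Vec::with_capacity(values.len().saturating_sub(remove_at.len()));
    let mut run_start = 0;
    for &idx in remove_at {
        // Copy the retained run that ends just before this removal position.
        out.extend_from_slice(&values[run_start..idx]);
        run_start = idx + 1;
    }
    // Copy whatever follows the last removal position.
    out.extend_from_slice(&values[run_start..]);
    out
}
```

The loop runs once per removed element, so its cost scales with the number of removals rather than with the total number of values.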

neilconway · 2026-05-20T16:24:25Z

+        let mut copied = 0usize;
+        let mut pending_batch_to_retain: Option<usize> = None;
+        for (i, keep) in eq_array.iter().enumerate() {
+            if keep == Some(false) && removed < max_removals {


Can we break from the loop once we hit max_removals?

Good point. The loop now breaks early once max_removals is reached.
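
For illustration only, a std-only sketch of the early exit on a single row. `remove_n` is a hypothetical function, not the PR's kernel. Once the removal cap is hit, the loop stops and the rest of the row is copied in one step.

```rust
/// Remove at most `max_removals` occurrences of `needle` from one row,
/// scanning left to right and stopping comparisons once the cap is reached.
/// (Hypothetical helper for illustration.)
pub fn remove_n(row: &[i64], needle: i64, max_removals: usize) -> Vec<i64> {
    let mut out = Vec::with_capacity(row.len());
    let mut removed = 0;
    let mut run_start = 0;
    for (i, v) in row.iter().enumerate() {
        if removed == max_removals {
            break; // nothing further can be removed; the tail is copied below
        }
        if *v == needle {
            // Flush the retained run before this match, then skip the match.
            out.extend_from_slice(&row[run_start..i]);
            run_start = i + 1;
            removed += 1;
        }
    }
    out.extend_from_slice(&row[run_start..]);
    out
}
```

With row `[1, 2, 2, 3, 2]`, needle `2`, and a cap of 2, the scan stops at the fourth element and the trailing `[3, 2]` is copied without further comparisons.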

neilconway · 2026-05-21T15:46:00Z

+        // Iterate only over the positions that need removal using set_indices,
+        // which is more efficient than scanning every bit.


Might be worth elaborating that the win here comes mostly from the expectation that the number of values to remove is much smaller than the total array size, which is usually (but not always) the case.

neilconway · 2026-05-21T15:47:43Z

+    let keep_mask =
+        arrow_ord::cmp::distinct(list_array.values(), &Scalar::new(Arc::clone(needle)))?;


This will call the distinct kernel on all the elements in the value buffer, not just the ones that are visible in a sliced array.

neilconway · 2026-05-21T15:50:24Z

+            );
+        }
+    };
+    let original_data = list_array.values().to_data();


The over-allocation was one part, but the bigger concern is calling the distinct kernel on the entire values buffer (see other comment).

lyne7-sc and others added 4 commits May 20, 2026 11:16

Refactor array remove invocation

9b3f32f

Merge branch 'apache:main' into perf/remove

0c787f4

enhance array_remove slt

9697c3a

Merge branch 'main' into perf/remove

a6e6ad1

github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and functions (Changes to functions implementation) labels May 20, 2026

neilconway reviewed May 20, 2026

lyne7-sc added 2 commits May 21, 2026 22:41

apply suggestions

b6023a0

apply suggestions

2e7cd40

neilconway reviewed May 21, 2026
