Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: binary_mut should work if only one input array has null buffer #6396

Merged
merged 4 commits into from
Sep 18, 2024

Conversation

viirya
Copy link
Member

@viirya viirya commented Sep 15, 2024

Which issue does this PR close?

Closes #6374.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 15, 2024
@@ -313,7 +313,7 @@ where
))));
}

let nulls = NullBuffer::union(a.logical_nulls().as_ref(), b.logical_nulls().as_ref());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If only a has null buffer, NullBuffer.union will clone it which shares the underlying buffer that fails into_mutable call on the null buffer later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have a by value, perhaps we could add a PrimitiveArray::into_parts that would deconstruct the PrimitiveArray into its underlying NullBuffer and values 🤔

Following the model of https://docs.rs/arrow/latest/arrow/array/struct.GenericByteArray.html#method.into_parts

That way we could directly use the NullBuffer and avoid the copy in create_union_null_buffer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, let me see if it works.

Copy link
Member Author

@viirya viirya Sep 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, for binary_mut, it is okay to defer the computation of union of two null buffers. It is because that we don't need to own the null buffers before invoking op on the two arrays.

Since the null buffer is not changed on the builder, we can compute the union after finishing the builder. So for binary_mut, we can avoid copying null buffer.

But for try_binary_mut, because we need to get the union of two null buffers before invoking op on the two arrays, we still need to copy it as this PR does.

PrimitiveArray::into_parts (it already exists) also cannot help on it because the builder needs to own the null buffer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I only avoid copying null buffer in binary_mut.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still feels to me that we could remove the copy on try_binary_mut, but this seems like an improvement to me 🚀

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viirya -- great find.

I have a suggestion that might save a copy

cc @MaxenceMaire

@@ -313,7 +313,7 @@ where
))));
}

let nulls = NullBuffer::union(a.logical_nulls().as_ref(), b.logical_nulls().as_ref());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have a by value, perhaps we could add a PrimitiveArray::into_parts that would deconstruct the PrimitiveArray into its underlying NullBuffer and values 🤔

Following the model of https://docs.rs/arrow/latest/arrow/array/struct.GenericByteArray.html#method.into_parts

That way we could directly use the NullBuffer and avoid the copy in create_union_null_buffer

) -> Option<NullBuffer> {
match (lhs, rhs) {
(Some(lhs), Some(rhs)) => Some(NullBuffer::new(lhs.inner() & rhs.inner())),
(Some(n), None) | (None, Some(n)) => Some(NullBuffer::new(n.inner() & n.inner())),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this correctly, it forces a copy of the null buffer's contents?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Previously it did a clone that shares the underlying buffer.

@MaxenceMaire
Copy link

Nice work, thanks @viirya and @alamb for looking into it!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @viirya -- this looks good to me

@@ -313,7 +313,7 @@ where
))));
}

let nulls = NullBuffer::union(a.logical_nulls().as_ref(), b.logical_nulls().as_ref());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still feels to me that we could remove the copy on try_binary_mut, but this seems like an improvement to me 🚀

arrow-arith/src/arity.rs Show resolved Hide resolved
arrow-arith/src/arity.rs Show resolved Hide resolved
viirya and others added 2 commits September 18, 2024 10:26
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@viirya
Copy link
Member Author

viirya commented Sep 18, 2024

Thanks @alamb @MaxenceMaire

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A copy of the NullBuffer in some cases seems a lot better than having to copy the entire array as was needed previously

Let's go with this and iterate as a follow on

@alamb alamb merged commit f5a6382 into apache:master Sep 18, 2024
13 checks passed
@viirya
Copy link
Member Author

viirya commented Sep 18, 2024

Thank you @alamb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

compute::binary_mut returns Err(PrimitiveArray<T>) only with certain arrays
3 participants