Support zero column RecordBatches in pyarrow integration (use RecordBatchOptions when converting a pyarrow RecordBatch) #6320

Merged: 6 commits on Aug 31, 2024
Changes from 3 commits
29 changes: 29 additions & 0 deletions arrow-pyarrow-integration-testing/tests/test_sql.py
@@ -476,6 +476,35 @@ def test_tensor_array():

    del b


def test_empty_recordbatch_with_row_count():
Contributor:

I suppose CI is likely always testing with the most recent version of pyarrow, and thus we only really test with the PyCapsule Interface, not with the pyarrow-specific FFI. If you want to ensure you're testing the PyCapsule Interface, you can create a wrapper class around a pa.RecordBatch that only exposes the PyCapsule dunder method:

https://github.com/pola-rs/polars/blob/b2550a092e34aa40f8786f45ff67cab96c93695d/py-polars/tests/unit/constructors/test_constructors.py#L1661-L1676

Then you can be assured that

rust.round_trip_record_batch(PyCapsuleArrayHolder(batch))

is testing the PyCapsule Interface.
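
A minimal sketch of such a wrapper, modeled on the linked polars test (the class name PyCapsuleArrayHolder is assumed here to match the usage above):

class PyCapsuleArrayHolder:
    """Wrap an Arrow-compatible object, exposing only the PyCapsule dunder."""

    def __init__(self, obj):
        self._obj = obj

    def __arrow_c_array__(self, requested_schema=None):
        # Delegate to the wrapped object's PyCapsule export; pyarrow-specific
        # attributes such as num_rows and _export_to_c stay hidden.
        return self._obj.__arrow_c_array__(requested_schema=requested_schema)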

Contributor Author:

It looks like CI runs with at least pyarrow 13 (the last release before capsules) and pyarrow 14:

https://github.com/apache/arrow-rs/actions/runs/10603372118?pr=6320

"""
The result of a `count` on a dataset is a RecordBatch with no columns but with `num_rows` set
"""

# If you know how to create an empty RecordBatch with a specific number of rows, please share
# Create an empty schema with no fields
schema = pa.schema([])

# Create an empty RecordBatch with 0 columns
record_batch = pa.RecordBatch.from_arrays([], schema=schema)

# Set the desired number of rows by creating a table and slicing
num_rows = 5 # Replace with your desired number of rows
empty_table = pa.Table.from_batches([record_batch]).slice(0, num_rows)

# Get the first batch from the table which will have the desired number of rows
batch = empty_table.to_batches()[0]

b = rust.round_trip_record_batch(batch)
assert b == batch
assert b.schema == batch.schema
assert b.schema.metadata == batch.schema.metadata

assert b.num_rows == num_rows

del b

def test_record_batch_reader():
"""
Python -> Rust -> Python
Expand Down
16 changes: 13 additions & 3 deletions arrow/src/pyarrow.rs
@@ -59,7 +59,7 @@ use std::convert::{From, TryFrom};
use std::ptr::{addr_of, addr_of_mut};
use std::sync::Arc;

-use arrow_array::{RecordBatchIterator, RecordBatchReader, StructArray};
+use arrow_array::{RecordBatchIterator, RecordBatchOptions, RecordBatchReader, StructArray};
use pyo3::exceptions::{PyTypeError, PyValueError};
use pyo3::ffi::Py_uintptr_t;
use pyo3::import_exception;
@@ -333,6 +333,15 @@ impl<T: ToPyArrow> ToPyArrow for Vec<T> {

impl FromPyArrow for RecordBatch {
    fn from_pyarrow_bound(value: &Bound<PyAny>) -> PyResult<Self> {
        // Technically `num_rows` is an attribute of `pyarrow.RecordBatch`.
        // Other Python classes can implement the PyCapsule interface without
        // exposing it; for those, the lookup fails quietly and no explicit
        // row count is set.
        let row_count = value
            .getattr("num_rows")
            .ok()
            .and_then(|x| x.extract().ok());
        let options = RecordBatchOptions::default().with_row_count(row_count);

Contributor:

My initial thought is that the PyCapsule interface should handle this, and so this should not be before checking for the pycapsule dunder. If this breaks via the C data interface, I'd like to look for a fix to that.

Contributor:

I'd strongly prefer a non-pyarrow-specific solution to this, or else we'll get the same failure from other Arrow producers.

In kylebarron/arro3#177 I added some tests to arro3 to make sure my (arrow-rs derived) FFI can handle this. It's a bit annoying: the ArrayData will have positive length but then once you import that with makeData, you'll have a StructArray with length 0. I think your most recent commit fixes this.
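
For reference, RecordBatchOptions is the arrow-rs API this fix relies on; a minimal sketch (assuming only the public arrow-array/arrow-schema APIs) of building a zero-column RecordBatch with an explicit row count:

use std::sync::Arc;

use arrow_array::{RecordBatch, RecordBatchOptions};
use arrow_schema::Schema;

fn zero_column_batch(num_rows: usize) -> RecordBatch {
    // With no columns the row count cannot be inferred, so it must be
    // supplied explicitly through RecordBatchOptions.
    let options = RecordBatchOptions::new().with_row_count(Some(num_rows));
    RecordBatch::try_new_with_options(Arc::new(Schema::empty()), vec![], &options)
        .expect("a zero-column batch with an explicit row count is valid")
}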

        // Newer versions of PyArrow as well as other libraries with Arrow data implement this
        // method, so prefer it over _export_to_c.
        // See https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html
@@ -371,7 +380,7 @@ impl FromPyArrow for RecordBatch {
                0,
                "Cannot convert nullable StructArray to RecordBatch, see StructArray documentation"
            );
-            return RecordBatch::try_new(schema, columns).map_err(to_py_err);
+            return RecordBatch::try_new_with_options(schema, columns, &options).map_err(to_py_err);
        }

        validate_class("RecordBatch", value)?;
@@ -386,7 +395,8 @@
            .map(|a| Ok(make_array(ArrayData::from_pyarrow_bound(&a)?)))
            .collect::<PyResult<_>>()?;

-        let batch = RecordBatch::try_new(schema, arrays).map_err(to_py_err)?;
+        let batch =
+            RecordBatch::try_new_with_options(schema, arrays, &options).map_err(to_py_err)?;
        Ok(batch)
    }
}