Support zero column RecordBatches in pyarrow integration (use RecordBatchOptions when converting a pyarrow RecordBatch)
#6320
Changes from 3 commits
f8d417f
5969fc5
25832e3
7f430b5
83aa49e
5829e7e
@@ -59,7 +59,7 @@ use std::convert::{From, TryFrom};
 use std::ptr::{addr_of, addr_of_mut};
 use std::sync::Arc;
 
-use arrow_array::{RecordBatchIterator, RecordBatchReader, StructArray};
+use arrow_array::{RecordBatchIterator, RecordBatchOptions, RecordBatchReader, StructArray};
 use pyo3::exceptions::{PyTypeError, PyValueError};
 use pyo3::ffi::Py_uintptr_t;
 use pyo3::import_exception;

@@ -333,6 +333,15 @@ impl<T: ToPyArrow> ToPyArrow for Vec<T> {
 
 impl FromPyArrow for RecordBatch {
     fn from_pyarrow_bound(value: &Bound<PyAny>) -> PyResult<Self> {
+        // Technically `num_rows` is an attribute on `pyarrow.RecordBatch`
+        // If other python classes can use the PyCapsule interface and do not have this attribute,
+        // then this will have no effect.
+        let row_count = value
+            .getattr("num_rows")
+            .ok()
+            .and_then(|x| x.extract().ok());
+        let options = RecordBatchOptions::default().with_row_count(row_count);
+

My initial thought is that the PyCapsule interface should handle this, and so this should not come before the check for the PyCapsule dunder. If this breaks via the C data interface, I'd like to look for a fix to that.

I'd strongly prefer a non-pyarrow-specific solution to this, or else we'll get the same failure from other Arrow producers. In kylebarron/arro3#177 I added some tests to arro3 to make sure my (arrow-rs derived) FFI can handle this. It's a bit annoying: the

         // Newer versions of PyArrow as well as other libraries with Arrow data implement this
         // method, so prefer it over _export_to_c.
         // See https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html

@@ -371,7 +380,7 @@ impl FromPyArrow for RecordBatch {
                 0,
                 "Cannot convert nullable StructArray to RecordBatch, see StructArray documentation"
             );
-            return RecordBatch::try_new(schema, columns).map_err(to_py_err);
+            return RecordBatch::try_new_with_options(schema, columns, &options).map_err(to_py_err);
         }
 
         validate_class("RecordBatch", value)?;

@@ -386,7 +395,8 @@ impl FromPyArrow for RecordBatch {
             .map(|a| Ok(make_array(ArrayData::from_pyarrow_bound(&a)?)))
             .collect::<PyResult<_>>()?;
 
-        let batch = RecordBatch::try_new(schema, arrays).map_err(to_py_err)?;
+        let batch =
+            RecordBatch::try_new_with_options(schema, arrays, &options).map_err(to_py_err)?;
         Ok(batch)
     }
 }

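To make the intent of the change easier to follow, here is a minimal std-only sketch of the row-count logic the diff relies on. It is a simplified model, not the arrow-rs implementation: `resolve_row_count` and `row_count_hint` are hypothetical helpers invented for illustration, with plain column lengths standing in for arrays and an optional string standing in for the Python `num_rows` attribute. The assumption being modeled is that a batch's row count is normally taken from its columns, so a zero-column batch needs an explicit count such as the one passed via `RecordBatchOptions::with_row_count`.

```rust
// Hypothetical, simplified model of row-count resolution; not the arrow-rs code.

/// Resolve a batch's row count from an optional explicit value and the column lengths.
fn resolve_row_count(explicit: Option<usize>, column_lens: &[usize]) -> Result<usize, String> {
    // Prefer the explicit count; otherwise fall back to the first column's length.
    let rows = match explicit.or_else(|| column_lens.first().copied()) {
        Some(rows) => rows,
        None => return Err("must specify a row count or at least one column".to_string()),
    };
    // Every column must agree with the resolved row count.
    if column_lens.iter().any(|&len| len != rows) {
        return Err(format!("all columns must have {rows} rows"));
    }
    Ok(rows)
}

/// Mimic the shape of `getattr("num_rows").ok().and_then(|x| x.extract().ok())`:
/// a missing or unparsable attribute yields `None` rather than an error.
fn row_count_hint(num_rows_attr: Option<&str>) -> Option<usize> {
    num_rows_attr.and_then(|raw| raw.parse().ok())
}
```

In this model, `resolve_row_count(row_count_hint(Some("3")), &[])` succeeds with 3 rows, while `resolve_row_count(row_count_hint(None), &[])` returns an error. The second case corresponds to the review concern: a producer that exposes only the PyCapsule interface and has no `num_rows` attribute would still fail for zero-column batches.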
I suppose CI is likely always testing with the most recent version of pyarrow, and thus we only really test with the PyCapsule Interface, not with the pyarrow-specific FFI. If you wanted to ensure you're testing the PyCapsule Interface, you can create a wrapper class around a pa.RecordBatch that only exposes the PyCapsule dunder method: https://github.com/pola-rs/polars/blob/b2550a092e34aa40f8786f45ff67cab96c93695d/py-polars/tests/unit/constructors/test_constructors.py#L1661-L1676
Then you can be assured that the test is exercising the PyCapsule Interface.

It looks like CI runs with at least both pyarrow 13 (last release before capsules) and 14
https://github.com/apache/arrow-rs/actions/runs/10603372118?pr=6320