Skip to content

Commit

Permalink
ARROW-82: Initial IPC support for ListArray
Browse files Browse the repository at this point in the history
This is a work in progress because I can't get clang-tidy to shut-up on parameterized test files (see last commit which I need to revert before merge).   I'd like to make sure this is a clean build and make sure people are ok with these change.  This PR also has a lot of collateral damage for small/large things I cleaned up my way to make this work.  I tried to split the commits up logically but if people would prefer separate pull requests I can try to do that as well.

Open questions:
1.  For supporting strings, binary, etc.  I was thinking of changing thei4 type definitions to inherit from ListType, and to hard-code the child type.  This would allow for simpler IPC code (all of the instantiation of types would happen in construct.h/.cc?) vs handling each of there types separately for IPC.
2.  There are some TODOs I left sprinkled in the code and would like peoples thoughts on if they are urgent/worthwhile for following up on.

Open issues:
1.  Supporting the rest of the List backed logical types
2.  More unit tests for added functionality.

As part of this commit I also refactored the Builder interfaces a little bit for the following reasons:
1.  It seems that if ArrayBuilder owns the bitmap it should be responsible for having methods to manipulate it.
2.  This allows ListBuilder to use the parent class + a BufferBuilder instead of inheriting Int32Builder, which means it doesn't need to do strange length/capacity hacks.

Other misc things here:
1.  Native popcount in test-util.h
2.  Ability to build a new list on top of an existing by incrementally add offsets/sizes
3.  Added missing types primitive types in construct.h for primitive.

Author: Micah Kornfield <emkornfield@gmail.com>

Closes #59 from emkornfield/emk_list_ipc_PR and squashes the following commits:

0c5162d [Micah Kornfield] another format fix
0af558b [Micah Kornfield] remove a now unnecessary NOLINT, but mostly to trigger another travis-ci job that failed due to apt get issue
7789205 [Micah Kornfield] make clang-format-3.7 happy
6e57728 [Micah Kornfield] make format fixes
5e15815 [Micah Kornfield] fix make lint
8982723 [Micah Kornfield] remaining style cleanup
be04b3e [Micah Kornfield] add unit tests for zero length row batches and non-null batches.  fix bugs
10e6651 [Micah Kornfield] add in maximum recursion depth, surfaced possible recursion issue with flatbuffers
3b219a1 [Micah Kornfield] Make append is_null parameter is_valid for api consistency
2e6c477 [Micah Kornfield] add missing RETURN_NOT_OK
e71810b [Micah Kornfield] make Resize and Init virtual on builder
8ab5315 [Micah Kornfield] make clang tidy ignore a little bit less hacky
53d37bc [Micah Kornfield] filter out ipc-adapter-test from tidy
8e464b5 [Micah Kornfield] Fixes per tidy and lint
aa0602c [Micah Kornfield] add potentially useful pool factories to test utils
39c57ed [Micah Kornfield] add potentially useful methods for generative arrays to ipc test-common
a2e1e52 [Micah Kornfield] native popcount
61b0481 [Micah Kornfield] small fixes to naming/style for c++ and potential bugs
5f87aef [Micah Kornfield] Refactor ipc-adapter-test to make it paramaterizable.  add unit test for lists.  make unit test pass and and construction method for list arrays
45e41c0 [Micah Kornfield] Make BufferBuilder more useable for appending primitives
1374485 [Micah Kornfield] augment python unittest to have null element in list
20f984b [Micah Kornfield] refactor primitive builders to use parent builders bitmap
3895d34 [Micah Kornfield] Refactor list builder to use ArrayBuilders bitmap methods and a separate buffer builder
01c50be [Micah Kornfield] Add utility methods for managing null bitmap directly to ArrayBuilder
cc7f851 [Micah Kornfield] add Validate method to array and implementation for ListArray
  • Loading branch information
emkornfield authored and wesm committed Apr 18, 2016
1 parent 5843e68 commit 0b472d8
Show file tree
Hide file tree
Showing 32 changed files with 839 additions and 250 deletions.
2 changes: 1 addition & 1 deletion cpp/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -565,7 +565,7 @@ if (${CLANG_TIDY_FOUND})
`find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc | sed -e '/_generated/g'`)
# runs clang-tidy and exits with a non-zero exit code if any errors are found.
add_custom_target(check-clang-tidy ${BUILD_SUPPORT_DIR}/run-clang-tidy.sh ${CLANG_TIDY_BIN} ${CMAKE_BINARY_DIR}/compile_commands.json
0 `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc | sed -e '/_generated/g'`)
0 `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc |grep -v -F -f ${CMAKE_CURRENT_SOURCE_DIR}/src/.clang-tidy-ignore | sed -e '/_generated/g'`)

endif()

Expand Down
9 changes: 8 additions & 1 deletion cpp/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -76,4 +76,11 @@ build failures by running the following checks before submitting your pull reque

Note that the clang-tidy target may take a while to run. You might consider
running clang-tidy separately on the files you have added/changed before
invoking the make target to reduce iteration time.
invoking the make target to reduce iteration time. Also, it might generate warnings
that aren't valid. To avoid these you can use add a line comment `// NOLINT`. If
NOLINT doesn't suppress the warnings, you add the file in question to
the .clang-tidy-ignore file. This will allow `make check-clang-tidy` to pass in
travis-CI (but still surface the potential warnings in `make clang-tidy`). Ideally,
both of these options would be used rarely. Current known uses-cases whent hey are required:

* Parameterized tests in google test.
1 change: 1 addition & 0 deletions cpp/src/.clang-tidy-ignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
ipc-adapter-test.cc
5 changes: 5 additions & 0 deletions cpp/src/arrow/array.cc
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
#include <cstdint>

#include "arrow/util/buffer.h"
#include "arrow/util/status.h"

namespace arrow {

Expand Down Expand Up @@ -47,6 +48,10 @@ bool Array::EqualsExact(const Array& other) const {
return true;
}

Status Array::Validate() const {
return Status::OK();
}

bool NullArray::Equals(const std::shared_ptr<Array>& arr) const {
if (this == arr.get()) { return true; }
if (Type::NA != arr->type_enum()) { return false; }
Expand Down
6 changes: 5 additions & 1 deletion cpp/src/arrow/array.h
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,7 @@
namespace arrow {

class Buffer;
class Status;

// Immutable data array with some logical type and some length. Any memory is
// owned by the respective Buffer instance (or its parents).
Expand All @@ -39,7 +40,7 @@ class Array {
Array(const std::shared_ptr<DataType>& type, int32_t length, int32_t null_count = 0,
const std::shared_ptr<Buffer>& null_bitmap = nullptr);

virtual ~Array() {}
virtual ~Array() = default;

// Determine if a slot is null. For inner loops. Does *not* boundscheck
bool IsNull(int i) const {
Expand All @@ -58,6 +59,9 @@ class Array {

bool EqualsExact(const Array& arr) const;
virtual bool Equals(const std::shared_ptr<Array>& arr) const = 0;
// Determines if the array is internally consistent. Defaults to always
// returning Status::OK. This can be an expensive check.
virtual Status Validate() const;

protected:
std::shared_ptr<DataType> type_;
Expand Down
56 changes: 56 additions & 0 deletions cpp/src/arrow/builder.cc
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,25 @@

namespace arrow {

Status ArrayBuilder::AppendToBitmap(bool is_valid) {
if (length_ == capacity_) {
// If the capacity was not already a multiple of 2, do so here
// TODO(emkornfield) doubling isn't great default allocation practice
// see https://github.com/facebook/folly/blob/master/folly/docs/FBVector.md
// fo discussion
RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1)));
}
UnsafeAppendToBitmap(is_valid);
return Status::OK();
}

Status ArrayBuilder::AppendToBitmap(const uint8_t* valid_bytes, int32_t length) {
RETURN_NOT_OK(Reserve(length));

UnsafeAppendToBitmap(valid_bytes, length);
return Status::OK();
}

Status ArrayBuilder::Init(int32_t capacity) {
capacity_ = capacity;
int32_t to_alloc = util::ceil_byte(capacity) / 8;
Expand All @@ -36,6 +55,7 @@ Status ArrayBuilder::Init(int32_t capacity) {
}

Status ArrayBuilder::Resize(int32_t new_bits) {
if (!null_bitmap_) { return Init(new_bits); }
int32_t new_bytes = util::ceil_byte(new_bits) / 8;
int32_t old_bytes = null_bitmap_->size();
RETURN_NOT_OK(null_bitmap_->Resize(new_bytes));
Expand All @@ -56,10 +76,46 @@ Status ArrayBuilder::Advance(int32_t elements) {

Status ArrayBuilder::Reserve(int32_t elements) {
if (length_ + elements > capacity_) {
// TODO(emkornfield) power of 2 growth is potentially suboptimal
int32_t new_capacity = util::next_power2(length_ + elements);
return Resize(new_capacity);
}
return Status::OK();
}

Status ArrayBuilder::SetNotNull(int32_t length) {
RETURN_NOT_OK(Reserve(length));
UnsafeSetNotNull(length);
return Status::OK();
}

void ArrayBuilder::UnsafeAppendToBitmap(bool is_valid) {
if (is_valid) {
util::set_bit(null_bitmap_data_, length_);
} else {
++null_count_;
}
++length_;
}

void ArrayBuilder::UnsafeAppendToBitmap(const uint8_t* valid_bytes, int32_t length) {
if (valid_bytes == nullptr) {
UnsafeSetNotNull(length);
return;
}
for (int32_t i = 0; i < length; ++i) {
// TODO(emkornfield) Optimize for large values of length?
UnsafeAppendToBitmap(valid_bytes[i] > 0);
}
}

void ArrayBuilder::UnsafeSetNotNull(int32_t length) {
const int32_t new_length = length + length_;
// TODO(emkornfield) Optimize for large values of length?
for (int32_t i = length_; i < new_length; ++i) {
util::set_bit(null_bitmap_data_, i);
}
length_ = new_length;
}

} // namespace arrow
46 changes: 37 additions & 9 deletions cpp/src/arrow/builder.h
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,10 @@ class PoolBuffer;

static constexpr int32_t MIN_BUILDER_CAPACITY = 1 << 5;

// Base class for all data array builders
// Base class for all data array builders.
// This class provides a facilities for incrementally building the null bitmap
// (see Append methods) and as a side effect the current number of slots and
// the null count.
class ArrayBuilder {
public:
explicit ArrayBuilder(MemoryPool* pool, const TypePtr& type)
Expand All @@ -46,7 +49,7 @@ class ArrayBuilder {
length_(0),
capacity_(0) {}

virtual ~ArrayBuilder() {}
virtual ~ArrayBuilder() = default;

// For nested types. Since the objects are owned by this class instance, we
// skip shared pointers and just return a raw pointer
Expand All @@ -58,14 +61,27 @@ class ArrayBuilder {
int32_t null_count() const { return null_count_; }
int32_t capacity() const { return capacity_; }

// Allocates requires memory at this level, but children need to be
// initialized independently
Status Init(int32_t capacity);
// Append to null bitmap
Status AppendToBitmap(bool is_valid);
// Vector append. Treat each zero byte as a null. If valid_bytes is null
// assume all of length bits are valid.
Status AppendToBitmap(const uint8_t* valid_bytes, int32_t length);
// Set the next length bits to not null (i.e. valid).
Status SetNotNull(int32_t length);

// Resizes the null_bitmap array
Status Resize(int32_t new_bits);
// Allocates initial capacity requirements for the builder. In most
// cases subclasses should override and call there parent classes
// method as well.
virtual Status Init(int32_t capacity);

Status Reserve(int32_t extra_bits);
// Resizes the null_bitmap array. In most
// cases subclasses should override and call there parent classes
// method as well.
virtual Status Resize(int32_t new_bits);

// Ensures there is enough space for adding the number of elements by checking
// capacity and calling Resize if necessary.
Status Reserve(int32_t elements);

// For cases where raw data was memcpy'd into the internal buffers, allows us
// to advance the length of the builder. It is your responsibility to use
Expand All @@ -75,7 +91,7 @@ class ArrayBuilder {
const std::shared_ptr<PoolBuffer>& null_bitmap() const { return null_bitmap_; }

// Creates new array object to hold the contents of the builder and transfers
// ownership of the data
// ownership of the data. This resets all variables on the builder.
virtual std::shared_ptr<Array> Finish() = 0;

const std::shared_ptr<DataType>& type() const { return type_; }
Expand All @@ -97,6 +113,18 @@ class ArrayBuilder {
// Child value array builders. These are owned by this class
std::vector<std::unique_ptr<ArrayBuilder>> children_;

//
// Unsafe operations (don't check capacity/don't resize)
//

// Append to null bitmap.
void UnsafeAppendToBitmap(bool is_valid);
// Vector append. Treat each zero byte as a nullzero. If valid_bytes is null
// assume all of length bits are valid.
void UnsafeAppendToBitmap(const uint8_t* valid_bytes, int32_t length);
// Set the next length bits to not null (i.e. valid).
void UnsafeSetNotNull(int32_t length);

private:
DISALLOW_COPY_AND_ASSIGN(ArrayBuilder);
};
Expand Down
Loading

0 comments on commit 0b472d8

Please sign in to comment.