ARROW-100: [C++] Computing RowBatch size #61
Implement RowBatchWriter::DataHeaderSize and arrow::ipc::GetRowBatchSize. To achieve this, the Flatbuffer metadata is written to a temporary buffer and its size is determined. This commit also adds MockMemorySource, a new MemorySource that tracks the amount of memory written.
Author: Philipp Moritz <pcmoritz@gmail.com>
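The size-by-dry-run idea in the description can be illustrated with a small sketch: run the same write routine once against a sink that only counts, then again against a buffer of exactly that size. Everything here (CountingSink, BufferSink, WriteMessage, GetMessageSize, and the 8-byte length header) is hypothetical and made up for illustration; it is not the Arrow API or the code in this PR.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sink: discards the data and records only the furthest offset written.
struct CountingSink {
  int64_t extent = 0;
  void Write(int64_t position, const uint8_t* /*data*/, int64_t nbytes) {
    extent = std::max(extent, position + nbytes);
  }
};

// Hypothetical sink backed by a preallocated buffer of a known size.
struct BufferSink {
  std::vector<uint8_t> bytes;
  explicit BufferSink(int64_t size) : bytes(static_cast<size_t>(size)) {}
  void Write(int64_t position, const uint8_t* data, int64_t nbytes) {
    if (nbytes > 0) {
      std::memcpy(bytes.data() + position, data, static_cast<size_t>(nbytes));
    }
  }
};

// Hypothetical serializer: an 8-byte length header followed by the payload.
template <typename Sink>
void WriteMessage(Sink* sink, const std::vector<uint8_t>& payload) {
  const int64_t length = static_cast<int64_t>(payload.size());
  const int64_t header_size = static_cast<int64_t>(sizeof(length));
  sink->Write(0, reinterpret_cast<const uint8_t*>(&length), header_size);
  sink->Write(header_size, payload.data(), length);
}

// Dry run: serialize into the counting sink and report the extent it saw.
int64_t GetMessageSize(const std::vector<uint8_t>& payload) {
  CountingSink sink;
  WriteMessage(&sink, payload);
  return sink.extent;
}
```

A caller would first call GetMessageSize, allocate a BufferSink of that size, and then call WriteMessage on it. Because both passes share one write routine, the computed size cannot drift from what the real write needs.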
@@ -121,6 +121,26 @@ class MemoryMappedSource : public MemorySource {
  std::unique_ptr<Impl> impl_;
};

// A MemorySource that tracks the size of allocations from a memory source
this probably belongs in test-common.h (along with the implementation, I'm not sure if it is worth creating a new .cc file or just inlining)
…tRowBatchSize, unify DataHeaderSize and TotalBytes into GetTotalSize
I'm sorry, it looks like my change did have some conflicts with yours (and it got merged first). Do you mind rebasing?
Sorry about that. I'll review/merge once this is rebased.
Thanks, please hold off a little longer on that, I'd like to properly test it with all the other new IPC code that was added. I expect to finish this tonight.
The PR should be ready now!
}

Status MockMemorySource::Write(int64_t position, const uint8_t* data, int64_t nbytes) {
  pos_ = std::max(pos_, position + nbytes);
Why only keep the max here?
The goal here is to determine how many bytes lie between the beginning of the buffer and the position just past the last byte written. GetRowBatchSize will mostly be used to determine how much shared memory to allocate for IPC, and in that case this extent is the quantity we care about. If the written memory is noncontiguous, it is not clear what the desired behaviour would be.
See this comment at the beginning of GetRowBatchSize:
// Compute the precise number of bytes needed in a contiguous memory segment to
// write the row batch.
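To make the "keep the max" behaviour concrete, here is a minimal sketch of an extent-tracking sink. ExtentTrackingSource and GetExtentBytesWritten are hypothetical names, and Write returns void here rather than a Status; this is a simplified stand-in, not the MockMemorySource from this PR.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical, simplified stand-in for a mock memory source.
class ExtentTrackingSource {
 public:
  // Pretend to write nbytes at position. Nothing is stored; only the
  // furthest offset touched so far is remembered.
  void Write(int64_t position, const uint8_t* data, int64_t nbytes) {
    (void)data;  // the contents are irrelevant when only measuring size
    extent_bytes_written_ = std::max(extent_bytes_written_, position + nbytes);
  }

  // Smallest number of bytes, counted from offset 0, that contains
  // every region written so far.
  int64_t GetExtentBytesWritten() const { return extent_bytes_written_; }

 private:
  int64_t extent_bytes_written_ = 0;
};
```

With this sketch, writing 16 bytes at offset 64 and then 8 bytes at offset 0 reports an extent of 80, not 24: a contiguous segment must span the gap between the two writes, and a later write at a lower offset never shrinks the result.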
I think this is a variable naming and documentation problem. Can you change the variable name to extent_bytes_written_ or something similar and add a comment to Position (or rename Position) to indicate that it returns the smallest number of bytes containing the modified region of the MockMemorySource? Thanks
fixed
Thanks, it makes sense now.
+1, thank you
@pcmoritz I've made you a Contributor on JIRA so you'll be able to assign yourself JIRAs going forward
- added java bindings for varlen types/literals - minor cleanups in llvm generator and engine (reported by clang-tidy)
Implements close on completion
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
* Support casting boolean to bigint (apache#60)
* remove log4j as it's not used (apache#61)
  Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
* Add stripe iteration support for batch_size reading in the ORC Scanner (apache#63)
* Install re2 headers (apache#66)
Co-authored-by: PHILO-HE <feilong.he@intel.com>
Co-authored-by: zhixingheyi-tian <xiangxiang.shen@intel.com>
…o UnionVector (apache#61) When a DecimalVector is promoted to a UnionVector via a PromotableWriter, the UnionVector will have the decimal vector in its internal struct vector, but the decimalVector field will not be set. If UnionReader.read is then used to read from the UnionVector, it will fail when it tries to read one of the promoted decimal values, because decimalVector is null and the exact decimal type has not been provided. This failure is unnecessary, though, because a decimal vector already exists; the caller just does not know the exact type, and should not be required to. The change here is to check for a pre-existing decimal vector in the internal struct when getDecimalVector() is called. If one exists, set the decimalVector field and return it. Otherwise, throw the exception.