Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-100: [C++] Computing RowBatch size #61

Closed
wants to merge 7 commits into from

Conversation

pcmoritz
Copy link
Contributor

Implement RowBatchWriter::DataHeaderSize and arrow::ipc::GetRowBatchSize. To achieve this, the Flatbuffer metadata is written to a temporary buffer and its size is determined. This commit also adds MockMemorySource, a new MemorySource that tracks the amount of memory written.

Author: Philipp Moritz pcmoritz@gmail.com

Implement RowBatchWriter::DataHeaderSize and arrow::ipc::GetRowBatchSize. To achieve this, the Flatbuffer metadata is written to a temporary buffer and its size is determined. This commit also adds MockMemorySource, a new MemorySource that tracks the amount of memory written.

Author: Philipp Moritz <pcmoritz@gmail.com>
@@ -121,6 +121,26 @@ class MemoryMappedSource : public MemorySource {
std::unique_ptr<Impl> impl_;
};

// A MemorySource that tracks the size of allocations from a memory source
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this probably belongs in test-common.h (along with the implementation, I'm not sure if it is worth creating a new .cc file or just inlining)

…tRowBatchSize, unify DataHeaderSize and TotalBytes into GetTotalSize
@emkornfield
Copy link
Contributor

I'm sorry, it looks like my change did have some conflicts with yours (and it got merged first). Do you mind rebasing?

@wesm
Copy link
Member

wesm commented Apr 19, 2016

Sorry about that. I'll review/merge once this is rebased.

@pcmoritz
Copy link
Contributor Author

Thanks, please hold off a little longer on that, I'd like to properly test it with all the other new IPC code that was added. I expect to finish this tonight.

@pcmoritz
Copy link
Contributor Author

The PR should be ready now!

}

Status MockMemorySource::Write(int64_t position, const uint8_t* data, int64_t nbytes) {
pos_ = std::max(pos_, position + nbytes);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only keep the max here?

Copy link
Contributor Author

@pcmoritz pcmoritz Apr 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal here is to determine how many bytes there are between the beginning of the buffer and the location where the last byte is being written; the function GetRowBatchSize will most of the time be used to determine how much shared memory should be allocated for IPC and then this is the quantity we care about; if memory is noncontiguous, it is not clear what the desired behaviour is.

See this comment at the beginning of GetRowBatchSize:
// Compute the precise number of bytes needed in a contiguous memory segment to
// write the row batch.

Copy link
Member

@wesm wesm Apr 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a variable naming and documentation problem. Can you change the variable name to extent_bytes_written_ or something similar and add a comment to Position (or rename Position) to indicate that it returns the smallest number of bytes containing the modified region of the MockMemorySource? Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks it makes sense now.

@wesm
Copy link
Member

wesm commented Apr 23, 2016

+1, thank you

@asfgit asfgit closed this in a541644 Apr 23, 2016
@wesm
Copy link
Member

wesm commented Apr 23, 2016

@pcmoritz I've made you a Contributor on JIRA so you'll be able to assign yourself JIRAs going forward

praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Aug 30, 2018
- added java bindings for varlen types/literals
- minor cleanups in llvm generator and engine
  (reported by clang-tidy)
praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Aug 30, 2018
- added java bindings for varlen types/literals
- minor cleanups in llvm generator and engine
  (reported by clang-tidy)
praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Aug 30, 2018
- added java bindings for varlen types/literals
- minor cleanups in llvm generator and engine
  (reported by clang-tidy)
praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Aug 30, 2018
- added java bindings for varlen types/literals
- minor cleanups in llvm generator and engine
  (reported by clang-tidy)
praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Sep 4, 2018
- added java bindings for varlen types/literals
- minor cleanups in llvm generator and engine
  (reported by clang-tidy)
praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Sep 10, 2018
- added java bindings for varlen types/literals
- minor cleanups in llvm generator and engine
  (reported by clang-tidy)
praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Sep 10, 2018
- added java bindings for varlen types/literals
- minor cleanups in llvm generator and engine
  (reported by clang-tidy)
vfraga pushed a commit to rafael-telles/arrow that referenced this pull request Dec 14, 2021
zhouyuan added a commit to zhouyuan/arrow that referenced this pull request Dec 23, 2021
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
zhouyuan added a commit to zhouyuan/arrow that referenced this pull request Jan 6, 2022
* Support casting boolean to bigint (apache#60)

* remove log4j as it's not used (apache#61)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* Add stripe iteration support for batch_size reading in the ORC Scanner (apache#63)

* Install re2 headers (apache#66)

Co-authored-by: PHILO-HE <feilong.he@intel.com>
Co-authored-by: zhixingheyi-tian <xiangxiang.shen@intel.com>
rafael-telles pushed a commit to rafael-telles/arrow that referenced this pull request Jan 20, 2022
vfraga pushed a commit to rafael-telles/arrow that referenced this pull request Mar 29, 2022
GerHobbelt pushed a commit to GerHobbelt/arrow that referenced this pull request Oct 18, 2024
…o UnionVector (apache#61)

When a DecimalVector is promoted to a UnionVector via a
PromotableWriter, the UnionVector will have the decimal vector in it's
internal struct vector, but the decimalVector field will not be set.
If UnionReader.read is then used to read from the UnionVector, it will
fail when it tries to read one of the promoted decimal values, due
to decimalVector being null, and the exact decimal type not being
provided.  This failure is unnecessary though as we have a pre-existing
decimal vector, the caller just does not know the exact type - and
it shouldn't be required to.

The change here is to check for a pre-existing decimal vector in the
internal struct when getDecimalVector() is called.  If one exists,
set the decimalVector field and return.  Otherwise, if none exists,
throw the exception.
GerHobbelt pushed a commit to GerHobbelt/arrow that referenced this pull request Dec 14, 2024
…o UnionVector (apache#61)

When a DecimalVector is promoted to a UnionVector via a
PromotableWriter, the UnionVector will have the decimal vector in it's
internal struct vector, but the decimalVector field will not be set.
If UnionReader.read is then used to read from the UnionVector, it will
fail when it tries to read one of the promoted decimal values, due
to decimalVector being null, and the exact decimal type not being
provided.  This failure is unnecessary though as we have a pre-existing
decimal vector, the caller just does not know the exact type - and
it shouldn't be required to.

The change here is to check for a pre-existing decimal vector in the
internal struct when getDecimalVector() is called.  If one exists,
set the decimalVector field and return.  Otherwise, if none exists,
throw the exception.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants