Portability issues with int128 type #2388
Comments
@mbasmanova @Yuhta @xiaoxmeng Since you are seeing most of these issues in your internal tests, can you share your thoughts?
So far I don't see these 2 limitations as showstoppers.
My preference would be to switch to 2 64-bit integers, but I'm fine continuing with int128 for some more time to see if things get better.
FYI, I probably found the root cause of 1. It's the Meta internal system that does not link clang
Bad news: due to some conflicts with Rust code, we probably still need to keep the annotations.
Just an update on this: I got an internal point of contact who will be fixing the Rust build system so that we can link compiler builtins properly. I will put in some placeholder/slow implementation this week. Once everything is sorted out, no annotations will be needed and we will use
Issue 1 is solved. There is no longer a need to exclude the sanitizer. For 2 we should experiment with annotating
We are still seeing segfaults on the CentOS Stream 8 release build, which seem to be due to unaligned access.
I will investigate this further.
@majetideepak Try to reproduce it within a loop. Usually the alignment problem only shows up when the code is vectorized.
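A minimal repro along those lines might look like the sketch below (the buffer, sizes, and the deliberate 8-byte offset are hypothetical; the point is that the compiler may assume 16-byte alignment for an `__int128*` and emit aligned 16-byte moves):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical repro: copy int128_t values through a deliberately
// misaligned source pointer. The compiler is allowed to assume an
// __int128* is 16-byte aligned, so it may emit aligned 16-byte moves
// (e.g. movdqa/movaps) that fault on the misaligned address.
void copyDecimals(const char* src, __int128* dst, int n) {
  const __int128* values = reinterpret_cast<const __int128*>(src);
  for (int i = 0; i < n; ++i) {
    dst[i] = values[i]; // aligned access assumed by the compiler
  }
}

int main() {
  std::vector<char> raw(16 * 1024 + 8);
  std::vector<__int128> out(1024);
  // Offset by 8 bytes so the source is only 8-byte aligned.
  copyDecimals(raw.data() + 8, out.data(), 1024);
  return 0;
}
```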
Will you consider using two int64 values to represent UnscaledLongDecimal in the future?
That's a good question. @majetideepak Deepak, looks like we continue to see issues with int128_t. Should we switch to a struct with 2 64-bit integers?
@mbasmanova My recommendation would be to continue to use the int128_t type to get the best possible platform-specific implementations. int128_t types have been supported for a while now by GCC and Clang, so they should be available in most user settings, given that Velox also requires C++17 support. @liujiayi771 is there any reason to consider using two int64 values for UnscaledLongDecimal?
PR #3755 suggests that we are not able to reinterpret_cast char* to int128_t* even when memory is properly aligned. I guess we need to dig in more to understand the root cause.
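For context, a common way to read an int128_t from a char* without dereferencing a cast pointer is to memcpy into a properly aligned local. This is only a sketch of the general technique, not necessarily what PR #3755 does:

```cpp
#include <cstring>

// Read an int128_t from a possibly misaligned char* without invoking
// undefined behavior: memcpy lets the compiler emit an unaligned load.
inline __int128 loadInt128(const char* src) {
  __int128 value;
  std::memcpy(&value, src, sizeof(value));
  return value;
}

// Write counterpart.
inline void storeInt128(char* dst, __int128 value) {
  std::memcpy(dst, &value, sizeof(value));
}
```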
I tested the average aggregation of decimals and got a core dump in DecimalAggregate::initializeNewGroups. I think the reason for the core dump is the construction of LongDecimalWithOverflowState. I referred to this PR and modified the code, and after that there was no error in constructing LongDecimalWithOverflowState.
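To illustrate why constructing such a state can fault, here is a hedged sketch; LongDecimalAccumulator and the offset are hypothetical stand-ins, not the actual LongDecimalWithOverflowState layout or the real row-container code:

```cpp
#include <cstdint>
#include <new>

// Hypothetical accumulator resembling LongDecimalWithOverflowState: the
// __int128 member raises the required alignment of the struct to 16.
struct LongDecimalAccumulator {
  __int128 sum{0};
  int64_t overflow{0};
  int64_t count{0};
};
static_assert(alignof(LongDecimalAccumulator) == 16, "over-aligned");

void initializeGroup(char* groupRow, int32_t accumulatorOffset) {
  // If groupRow + accumulatorOffset is only 8-byte aligned, this placement
  // new constructs an over-aligned object at an under-aligned address, and
  // later aligned 16-byte stores to `sum` can fault.
  new (groupRow + accumulatorOffset) LongDecimalAccumulator();
}
```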
The assembly instruction here is
The fix would be one of the following:
(2) can be achieved by either of the following approaches wherever int128_t values are being loaded.
I feel (1) is hard to achieve. (2) seems to be simpler and specific to int128_t types.
@majetideepak We are likely to take a performance hit for (2). This not only disables
@Yuhta The trade-off here would be space wasted due to alignment with (1) vs. less vectorization with (2). So (1) would not necessarily be performant.
When
Unaligned data access is only a problem if the compiler assumes the data is aligned.
I have seen it when we use something like
@majetideepak Deepak, thank you for looking into this. Just for my own understanding, are you thinking of switching the physical representation of long decimal from int128_t to a struct of two int64_t?
@majetideepak The fix for
@mbasmanova, in my understanding, GCC assumes that int128_t addresses are always 16-byte aligned and generates instructions that require this alignment. The other option is to make GCC not generate instructions that require this alignment; we would have to look at some GCC attributes to do this with minimum disruption. We have to tell GCC that the int128_t addresses are not 16-byte aligned. The last, and in my opinion least preferred, option is to use
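A sketch of the attribute-based option (assuming GCC/Clang; the typedef name is illustrative and the exact attribute spelling would need validation against the compilers Velox supports):

```cpp
// Tell GCC/Clang via a typedef that values of this type may be only
// 8-byte aligned; accesses through it then use unaligned loads/stores
// (movdqu-style) instead of aligned ones. GCC documents that the aligned
// attribute on a typedef may decrease alignment.
typedef __int128 int128_unaligned __attribute__((aligned(8)));

__int128 readValue(const char* base, int index) {
  // Access through the reduced-alignment type instead of __int128*.
  const int128_unaligned* values =
      reinterpret_cast<const int128_unaligned*>(base);
  return values[index];
}
```

This still assumes the data is at least 8-byte aligned, which is typically true for Velox buffers, while keeping the change local to the access sites.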
@majetideepak If you can make the compiler generate
Also, even if we have to give up
@majetideepak Deepak, thank you for clarifying.
In a hypothetical scenario of int128_t serialisation not working, I would recommend going with int64_t[2] - we already have all of the code and infra ready to work with it. I think if int64_t alignment does not work on a particular system, then one would have much bigger problems than just decimal support; probably half of the code would not work.
@isadikov 8-byte alignment is probably working in the current places where we do in-memory serialization, so it is probably not a concern. There is no essential difference at runtime though; it's a memory region of 16 bytes. We need to load it into a single register for any further processing. What we don't want is the temptation to load it into 2 registers and treat them as 2 int64_t values.
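For illustration, an explicit unaligned 16-byte load into a single register on x86 could look like the following sketch (assumes SSE2; not actual Velox code):

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cstring>

// Load 16 bytes from a possibly misaligned address into one xmm register.
// _mm_loadu_si128 has no alignment requirement, unlike _mm_load_si128.
inline __m128i loadDecimal128(const void* src) {
  return _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
}

// For scalar arithmetic, the same 16 bytes can be materialized as __int128
// without an alignment assumption.
inline __int128 toInt128(const void* src) {
  __int128 v;
  std::memcpy(&v, src, sizeof(v));
  return v;
}
```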
@isadikov If you are referring to the approach in #4129, then we will incur the buildInt128(), UPPER(), and LOWER() overheads.
We still hit this issue now when we convert Arrow Decimal to Velox. The buffer may not be 16-byte aligned. Arrow uses 2 int64_t values. Is there any reason we can't use that?
We need to fix the issue; it's really hard to debug when a core dump is caused by this.
@FelixYBW The load is not a problem (we may need explicit intrinsic calls though). It's when you do arithmetic on it that you lose the SIMD (or some shortcut in hardware that's not really SIMD). We can fix the alignment in the Arrow conversion. Do you have some sample data?
Do you mean that if we use 2x int64_t, GCC can't generate SIMD instructions for the arithmetic, so we may have to use explicit intrinsics? If so, let's keep the current 128-bit type. Yes, we hit the issue in Gluten. There are two ways to fix the alignment during Arrow conversion: one is to copy the value to an aligned buffer, the other is to make sure the buffer from Arrow is always 16-byte aligned. We are trying the second solution now.
@FelixYBW The second way would be optimal. All Arrow buffers are aligned at 64 bytes, so this happens only if you are storing other types of data in the same buffer. In that case you need some padding. We can also add a check in the conversion to make sure the alignment is enforced.
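A hedged sketch of such a check-and-copy during conversion (the helper name and the use of std::vector as scratch space are illustrative; a real implementation would likely allocate from the memory pool):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: if the incoming decimal buffer is not 16-byte
// aligned, copy it into an aligned region before handing it to code that
// treats the contents as int128_t.
const uint8_t* ensure16ByteAligned(
    const uint8_t* data,
    size_t size,
    std::vector<uint8_t>& scratch) {
  if (reinterpret_cast<uintptr_t>(data) % 16 == 0) {
    return data; // already aligned, no copy needed
  }
  scratch.resize(size + 16);
  // Find a 16-byte aligned offset inside the scratch buffer.
  uint8_t* aligned = scratch.data() +
      (16 - reinterpret_cast<uintptr_t>(scratch.data()) % 16) % 16;
  std::memcpy(aligned, data, size);
  return aligned;
}
```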
… netty For a small ColumnarBatch, the batch will not be compressed. The buffer that originates from netty is aligned, but the actual buffer used in RecordBatch is SliceBuffer(buffer, offset, size), which cannot guarantee alignment. The SIMD instruction movdqa requires the address to be 16-byte aligned, so it core dumps in the Velox function copyValuesAndNulls and leaves potential for further core dumps. This copy is expensive but essential. BufferReader::DoReadAt(int64_t position, int64_t nbytes) { return SliceBuffer(buffer_, position, nbytes); // buffer_ is a netty buffer } For most batches that are not tiny and use the default lz4 compression codec, shuffle read will decompress the buffer to an aligned address that meets the SIMD instruction requirement. Relevant issue: facebookincubator/velox#2388
Support months_between function
Summary: When integrating the Spark query runner with the Spark expression fuzzer test, we found it core dumps at the below point when copying a decimal vector whose memory is allocated by 'arrow::ipc::RecordBatchReader'. https://github.com/facebookincubator/velox/blob/7b2bb7f672b435d38d5a83f9cd8441bf17b564e6/velox/vector/FlatVector-inl.h#L198 The reason is Arrow uses two uint64_t values to represent a 128-bit decimal value, and the allocated memory might not be 16-byte aligned. This PR adds a copy process for long decimals in 'importFromArrowImpl' to ensure the alignment. #2388 Pull Request resolved: #11404 Reviewed By: pedroerp Differential Revision: D66307435 Pulled By: kagamiori fbshipit-source-id: 86081c041169cfc196e68f36629a246f8626d3d9
The current Velox Long Decimal type uses the int128_t type. However, we are seeing a couple of portability issues around int128 types. Some of them are:
I want to discuss these issues further and would like to conclude whether it is meaningful to continue to use this type for development or to use 2 int64_t values.
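For reference, a minimal sketch of what a two-int64_t representation could look like (the struct name and helpers are hypothetical, not an existing Velox type):

```cpp
#include <cstdint>

// Hypothetical two-word representation of a 128-bit decimal. Alignment is
// only 8 bytes, so it can live at any int64_t-aligned address, at the cost
// of converting to __int128 (or spelling out 128-bit arithmetic by hand).
struct Int128Parts {
  int64_t upper;   // high 64 bits (signed)
  uint64_t lower;  // low 64 bits

  __int128 toInt128() const {
    // Assemble in the unsigned domain to avoid shifting a negative value.
    unsigned __int128 bits =
        (static_cast<unsigned __int128>(static_cast<uint64_t>(upper)) << 64) |
        static_cast<unsigned __int128>(lower);
    return static_cast<__int128>(bits);
  }

  static Int128Parts fromInt128(__int128 value) {
    return {
        static_cast<int64_t>(value >> 64),
        static_cast<uint64_t>(static_cast<unsigned __int128>(value))};
  }
};

static_assert(sizeof(Int128Parts) == 16, "must stay 16 bytes");
static_assert(alignof(Int128Parts) == 8, "only 8-byte alignment required");
```

The trade-off discussed in the comments above is visible here: the relaxed alignment removes the crash risk, but arithmetic has to go through conversions instead of operating on a native 128-bit value.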