
Portability issues with int128 type #2388

Open
majetideepak opened this issue Aug 25, 2022 · 36 comments

@majetideepak
Collaborator

majetideepak commented Aug 25, 2022

The current Velox Long Decimal type uses the int128_t type. However, we are seeing a couple of portability issues around int128 types, including:

  1. Address sanitizer does not support int128_t well.
  2. int128_t requires alignment on x86_64.

I want to discuss these issues further and would like to decide whether it is worthwhile to continue using this type for development or to switch to two int64_t values.

@majetideepak
Collaborator Author

@mbasmanova @Yuhta @xiaoxmeng Since you are seeing most of these issues in your internal tests, can you share your thoughts?

@Yuhta
Contributor

Yuhta commented Aug 25, 2022

So far I don't see these two limitations as showstoppers.
For 1 we just need to use the new mul defined in DecimalUtil.h.
For 2 we need to be careful and annotate the alignment (__attribute__ ((aligned (16))) or alignas(16)) wherever needed.

@mbasmanova
Contributor

My preference would be to switch to 2 64-bit integers, but I'm fine continuing with int128 for some more time to see if things get better.

@Yuhta
Contributor

Yuhta commented Aug 26, 2022

FYI, I probably found the root cause of 1: the Meta internal build system does not link clang's compiler-rt by default. Once I add it, no more annotations will be needed for this purpose.

@Yuhta
Contributor

Yuhta commented Aug 26, 2022

Bad news: due to some conflicts with Rust code, we probably still need to keep the annotations.
I am going to give it a final try to fix our clang next week; if that fails, we will still need the annotations.

@Yuhta
Contributor

Yuhta commented Aug 29, 2022

Just an update on this: I have an internal point of contact who will be fixing the Rust build system so that we can link the compiler builtins properly. I will put a placeholder (slow) implementation in CheckedArithmetic.h this week to unblock the PRs for now (the annotations are still needed for now).

Once everything is sorted out, no annotations will be needed and we will use __builtin_mul_overflow with __int128 in CheckedArithmetic.h.

@Yuhta
Contributor

Yuhta commented Sep 1, 2022

Issue 1 is solved; there is no longer a need to exclude the sanitizer. For 2 we should experiment with annotating UnscaledLongDecimal and see if the alignment can be specified transitively.

@majetideepak
Collaborator Author

majetideepak commented Mar 3, 2023

We are still seeing segfaults in the CentOS Stream 8 release build which seem to be due to unaligned access; see PRs #3755 and #4129.
However, a simple reproducer with unaligned access seems to work fine on this machine:

  #include <cstdlib>
  #include <iostream>

  int main() {
    // calloc zero-initializes, so only the two bytes set below contribute.
    char* buf1 = reinterpret_cast<char*>(calloc(160, 1));
    buf1[1] = 1;
    buf1[19] = 2;
    __int128_t* val1 = reinterpret_cast<__int128_t*>(buf1 + 1);
    __int128_t* val2 = reinterpret_cast<__int128_t*>(buf1 + 19);
    __int128_t result = val1[0] + val2[0];
    if (result == 3) {
      std::cout << "Success" << std::endl;
    }
    free(buf1);
    return 0;
  }

I will investigate this further.

@majetideepak majetideepak self-assigned this Mar 3, 2023
@Yuhta
Contributor

Yuhta commented Mar 3, 2023

@majetideepak Try to reproduce it within a loop. Usually the alignment problem only shows up when the code is vectorized.

@liujiayi771
Contributor

Will you consider using two int64 to express UnscaledLongDecimal in the future?

@mbasmanova
Contributor

Will you consider using two int64 to express UnscaledLongDecimal in the future?

That's a good question. @majetideepak Deepak, it looks like we continue seeing issues with int128_t. Should we switch to a struct with two 64-bit integers?

@majetideepak
Collaborator Author

majetideepak commented Mar 6, 2023

@mbasmanova My recommendation would be to continue using the int128_t type to get the best possible platform-specific implementations. int128_t types have been supported by GCC and Clang for a while now, so the type should be available in most user settings, given that Velox also requires C++17 support.
The only issue we see so far with int128_t is the alignment requirement of the corresponding SIMD instructions generated by GCC. If we can figure out a way for the compiler to generate unaligned instructions, we should be good. I am looking into this further now.

@liujiayi771 is there any reason to consider using two int64 values for UnscaledLongDecimal?

@mbasmanova
Contributor

The only issue we see so far with 128_t is the requirement for alignment for the corresponding SIMD instructions.

PR #3755 suggests that we are not able to reinterpret_cast char* to int128_t* even when memory is properly aligned. I guess we need to dig in more to understand the root cause.

@majetideepak
Collaborator Author

reinterpret_cast is only a compiler directive, so there is no instruction associated with it. We also see these crashes only with GCC.
At O3, the generated code is highly optimized, and the only way to understand the crash is to look at the generated code. I plan to do this this week and will post updates here. A simple reproducer would help a lot too, but I don't have one yet.

@liujiayi771
Contributor

@mbasmanova My recommendation would be to continue using the int128_t type to get the best possible platform-specific implementations. int128_t types have been supported by GCC and Clang for a while now, so the type should be available in most user settings, given that Velox also requires C++17 support. The only issue we see so far with int128_t is the alignment requirement of the corresponding SIMD instructions generated by GCC. If we can figure out a way for the compiler to generate unaligned SIMD instructions, we should be good. I am looking into this further now.

@liujiayi771 is there any reason to consider using two int64 values for UnscaledLongDecimal?

I tested the average aggregation for decimals and got a core dump in DecimalAggregate::initializeNewGroups. I think the cause of the core dump is the construction of LongDecimalWithOverflowState. I referred to this PR and modified the code, and after that there is no error constructing LongDecimalWithOverflowState.

@majetideepak
Collaborator Author

majetideepak commented Mar 10, 2023

The assembly instruction here is vmovaps, and the Stack Overflow answer below describes why its memory operand must be aligned to 16 bytes.
https://stackoverflow.com/questions/67243284/why-movaps-causes-segmentation-fault
0x21f9d02 <_ZN8facebook5velox9aggregate16DecimalAggregateINS0_20UnscaledShortDecimalES3_E19initializeNewGroupsEPPcN5folly5RangeIPKiEE+98> vmovaps %xmm0,(%r11)

@majetideepak
Collaborator Author

majetideepak commented Mar 10, 2023

The fix would be one of the following:

  1. Ensure we can write code such that all loads are aligned to Aggregate::accumulatorAlignmentSize().
  2. Prevent the GCC compiler from generating vectorized loads that require alignment for int128_t types.

(2) can be achieved by any of the following approaches wherever int128_t values are being loaded:

  a. Disable SSE for the function:

     #if defined (__GNUC__) && (defined (__x86_64__) || defined (__i386__))
     __attribute__ ((target ("no-sse")))
     #endif

  b. Disable optimization around the code:

     #pragma GCC push_options
     #pragma GCC optimize ("O0")
     /* code */
     #pragma GCC pop_options

  c. Use volatile.

I feel (1) is hard to achieve. (2) seems simpler and is specific to int128_t types.
I will work towards adding support for (2) if there are no objections, and will look for better compiler options to avoid vmovaps.

@Yuhta
Contributor

Yuhta commented Mar 10, 2023

@majetideepak We are likely to take a performance hit with (2). It not only disables vmovaps but essentially all SIMD instructions (the O0 variant is even worse than that). I would suggest trying (1) first for performance reasons. If (1) turns out not to be possible, I think @isadikov's approach in #4129 is still better than (2).

@majetideepak
Collaborator Author

@Yuhta The trade-off here is space wasted due to alignment (1) vs. less vectorization (2), so (1) would not necessarily be more performant. O0 is definitely not desirable; I quickly checked whether it helps to avoid vmovaps, and it does.
The approach in #4129 solves this issue by using two int64_t instead of int128_t inside LongDecimalWithOverflowState. That approach has the buildInt128(), UPPER(), and LOWER() overheads, and it will also not help other uses of int128_t (UnscaledLongDecimal).
Since we cannot predict where the compiler may introduce 16-byte-aligned instructions for int128_t types, from an API standpoint I want the implementation to pick either int128_t or two int64_t for all long decimal types.
As mentioned above, I will check for other options to avoid 16-byte-aligned load instructions.
Maybe you can check with Orri or others internally at Meta if there is a less intrusive way.

@Yuhta
Contributor

Yuhta commented Mar 12, 2023

When an int128_t is allocated by the compiler, it is guaranteed to be aligned properly. Alignment is a problem only when we do deserialization, and it's not specific to int128_t; any unaligned data access can potentially have this problem. That's why I prefer to fix accumulatorAlignmentSize instead of finding a workaround just for int128_t.

@majetideepak
Collaborator Author

majetideepak commented Mar 12, 2023

Unaligned data access is only a problem if the compiler assumes unaligned memory to be aligned and then generates instructions that require aligned memory.
So far, we do not see these segfaults for other data types, so I believe this is specific to the int128_t type.

@Yuhta
Contributor

Yuhta commented Mar 12, 2023

I have seen it when we use something like sfmt19937. Actually, I have an idea of how to fix the alignment problem; let me try something quickly and see if it works.

@mbasmanova
Contributor

@majetideepak Deepak, thank you for looking into this. Just for my own understanding, are you thinking of switching physical representation of long decimal from int128_t to a struct of two int64_t?

@Yuhta
Contributor

Yuhta commented Mar 13, 2023

@majetideepak The fix for Aggregate::accumulatorAlignmentSize() is at #4268 , can you check if this fixes the crash with decimal accumulator?

@majetideepak
Collaborator Author

are you thinking of switching physical representation of long decimal from int128_t to a struct of two int64_t?

@mbasmanova, in my understanding, GCC assumes that int128_t addresses are always 16-byte aligned and generates instructions that require this alignment, e.g. vmovaps as described above.
Because of this, any pointer arithmetic on int128_t types in the Velox code must align the addresses to 16 bytes. @Yuhta has a fix for the accumulator code in #4268, but we have to ensure future features also honor this.

The other option is to make GCC not generate instructions that require this alignment; we would have to look at some GCC attributes to do this with minimal disruption, telling GCC that int128_t addresses are not 16-byte aligned. I looked into some options, but could not find a good one yet.

The last, and in my opinion least preferred, option is to use two int64_t values and essentially create an int128_t library.

@majetideepak
Collaborator Author

The fix for Aggregate::accumulatorAlignmentSize() is at #4268, can you check if this fixes the crash with decimal accumulator?

@Yuhta, I commented in that PR.

@Yuhta
Contributor

Yuhta commented Mar 13, 2023

@majetideepak If you can make the compiler generate vmovups instead of vmovaps, that would also be nice.

@Yuhta
Contributor

Yuhta commented Mar 13, 2023

Also, even if we have to give up int128_t in serialization, I would suggest using char[16] rather than int64_t[2] to replace it. This will work even if it is not aligned at 8 bytes, makes it clearer that it is a serialization format, and we can still load it into a register using something like vmovups.

@mbasmanova
Contributor

@majetideepak Deepak, thank you for clarifying.

@isadikov
Contributor

isadikov commented Mar 14, 2023

In a hypothetical scenario of int128_t serialization not working, I would recommend going with int64_t[2]; we already have all of the code and infra ready to work with it. I think if int64_t alignment does not work on a particular system, then one would have much bigger problems than just decimal support; probably half of the code would not work.

@Yuhta
Contributor

Yuhta commented Mar 14, 2023

@isadikov 8-byte alignment is probably working in the current places where we do in-memory serialization, so that is probably not a concern. There is no essential difference at runtime, though; it is a memory region of 16 bytes, and we need to load it into one single register for any further processing. What we don't want is the temptation to load it into 2 registers and treat them as 2 int64_t; that's the main rationale behind my suggestion.

@majetideepak
Collaborator Author

majetideepak commented Mar 14, 2023

we already have all of the code and infra ready to work with it.

@isadikov If you are referring to the approach in #4129, then we will incur the buildInt128(), UPPER(), and LOWER() overheads. A memcpy into an int128_t variable and a memcpy back is probably more efficient; I think that is what @Yuhta is referring to.

@FelixYBW

We still hit this issue now when we convert Arrow decimals to Velox. The buffer may not be 16-byte aligned. Arrow uses 2 int64_t; is there any reason we can't use that?
If we define

  struct Decimal {
    uint64_t a;
    uint64_t b;
  };
  Decimal A, B;
  A = B;

GCC will still use SIMD instructions but switch to movdqu. On current processors, movdqu and movdqa have the same performance if the address is 16-byte aligned.

We need to fix this issue; it's really hard to debug when a core dump is caused by it.

@Yuhta
Contributor

Yuhta commented Jun 22, 2023

@FelixYBW The load is not a problem (though we may need explicit intrinsic calls). It's when you do arithmetic on it that you lose the SIMD (or some shortcut in hardware that's not really SIMD). We can fix the alignment in the Arrow conversion. Do you have some sample data?

@FelixYBW

Do you mean that if we use 2x int64_t, GCC can't generate SIMD instructions for the arithmetic, so we may have to use explicit intrinsics? If so, let's keep the current 128-bit type.

Yes, we hit the issue in Gluten. There are two ways to fix the alignment during the Arrow conversion: one is to copy the value to an aligned buffer; the other is to make sure the buffer from Arrow is always 16-byte aligned. We are trying the second solution now.

@Yuhta
Contributor

Yuhta commented Jun 22, 2023

@FelixYBW The second way would be optimal. All Arrow buffers are aligned at 64 bytes, so this happens only if you are storing other types of data in the same buffer; in that case you need some padding. We can also add a check in the conversion to make sure the alignment is enforced.

zhouyuan pushed a commit to apache/incubator-gluten that referenced this issue Jul 3, 2023
… netty

For a small ColumnarBatch, the batch will not be compressed; the buffer, which originates from netty, is aligned, but the actual buffer used in the RecordBatch is SliceBuffer(buffer, offset, size), which cannot guarantee alignment. The SIMD instruction movdqa requires the address to be 16-byte aligned, so it core dumps in the Velox function copyValuesAndNulls and leaves potential core dumps elsewhere. This copy is expensive but essential.

  BufferReader::DoReadAt(int64_t position, int64_t nbytes) {
    return SliceBuffer(buffer_, position, nbytes); // buffer_ is a netty buffer
  }

For most batches, which are not tiny and use the default lz4 compression codec, shuffle read decompresses the buffer to an aligned address that meets the SIMD instruction requirements.

Relevant issue: facebookincubator/velox#2388
marin-ma pushed a commit to marin-ma/velox-oap that referenced this issue Dec 15, 2023
facebook-github-bot pushed a commit that referenced this issue Nov 21, 2024
Summary:
When integrating the Spark query runner with the Spark expression fuzzer test, we found it core dumps at the point below when copying a decimal vector whose memory is allocated by 'arrow::ipc::RecordBatchReader'.

https://github.com/facebookincubator/velox/blob/7b2bb7f672b435d38d5a83f9cd8441bf17b564e6/velox/vector/FlatVector-inl.h#L198

The reason is that Arrow uses two uint64_t values to represent a 128-bit decimal value, and the allocated memory might not be 16-byte aligned. This PR adds a copy step for long decimals in 'importFromArrowImpl' to ensure the alignment.

#2388

Pull Request resolved: #11404

Reviewed By: pedroerp

Differential Revision: D66307435

Pulled By: kagamiori

fbshipit-source-id: 86081c041169cfc196e68f36629a246f8626d3d9
athmaja-n pushed a commit to athmaja-n/velox that referenced this issue Jan 10, 2025