
Portability issues with int128 type #2388

Open
majetideepak opened this issue Aug 25, 2022 · 36 comments

@majetideepak
Collaborator

majetideepak commented Aug 25, 2022

The current Velox Long Decimal type uses the int128_t type. However, we are seeing a couple of portability issues around int128 types, including:

  1. Address sanitizer does not support int128_t well.
  2. int128_t requires alignment on x86_64.

I want to discuss these issues further and would like to decide whether it is worthwhile to continue using this type for development or to switch to two int64_t values.

@majetideepak
Collaborator Author

@mbasmanova @Yuhta @xiaoxmeng Since you are seeing most of these issues in your internal tests, can you share your thoughts?

@Yuhta
Contributor

Yuhta commented Aug 25, 2022

So far I don't see these two limitations as showstoppers.
For 1 we just need to use the new mul defined in DecimalUtil.h.
For 2 we need to be careful and annotate the alignment (__attribute__ ((aligned (16))) or alignas(16)) wherever needed.

@mbasmanova
Contributor

My preference would be to switch to 2 64-bit integers, but I'm fine continuing with int128 for some more time to see if things get better.

@Yuhta
Contributor

Yuhta commented Aug 26, 2022

FYI, I probably found the root cause of 1: the Meta internal build system does not link clang's compiler-rt by default. Once I add it, no more annotations will be needed for this purpose.

@Yuhta
Contributor

Yuhta commented Aug 26, 2022

Bad news: due to some conflicts with Rust code, we probably still need to keep the annotations.
I am going to give it a final try to fix our clang next week; if that fails, we will still need the annotations.

@Yuhta
Contributor

Yuhta commented Aug 29, 2022

Just an update on this: I have an internal point of contact who will be fixing the Rust build system so that we can link the compiler builtins properly. I will put a placeholder (slow) implementation in CheckedArithmetic.h this week to unblock the PRs for now (the annotations are still needed for now).

Once everything is sorted out, no annotations will be needed and we will use __builtin_mul_overflow with __int128 in CheckedArithmetic.h.

@Yuhta
Contributor

Yuhta commented Sep 1, 2022

Issue 1 is solved; there is no longer a need to exclude the sanitizer. For 2 we should experiment with annotating UnscaledLongDecimal and see if the alignment can be specified transitively.

@majetideepak
Collaborator Author

majetideepak commented Mar 3, 2023

We are still seeing segfaults in the CentOS Stream 8 release build which seem to be due to unaligned access; see PRs #3755 and #4129.
However, a simple reproducer with unaligned access seems to work fine on this machine:

  #include <cstdlib>
  #include <iostream>

  int main() {
    // calloc zero-initializes, so only the two bytes set below contribute.
    char* buf1 = reinterpret_cast<char*>(calloc(160, 1));
    buf1[1] = 1;
    buf1[19] = 2;
    __int128_t* val1 = reinterpret_cast<__int128_t*>(buf1 + 1);
    __int128_t* val2 = reinterpret_cast<__int128_t*>(buf1 + 19);
    __int128_t result = val1[0] + val2[0];
    if (result == 3) {
      std::cout << "Success" << std::endl;
    }
    free(buf1);
    return 0;
  }

I will investigate this further.

@majetideepak majetideepak self-assigned this Mar 3, 2023
@Yuhta
Contributor

Yuhta commented Mar 3, 2023

@majetideepak Try to reproduce it within a loop. Usually the alignment problem only shows up when the code is vectorized.

@liujiayi771
Contributor

Will you consider using two int64 to express UnscaledLongDecimal in the future?

@mbasmanova
Contributor

Will you consider using two int64 to express UnscaledLongDecimal in the future?

That's a good question. @majetideepak Deepak, it looks like we continue seeing issues with int128_t. Should we switch to a struct with two 64-bit integers?

@majetideepak
Collaborator Author

majetideepak commented Mar 6, 2023

@mbasmanova My recommendation would be to continue using the int128_t type to get the best possible platform-specific implementations. int128_t types have been supported by GCC and Clang for a while now, so the type should be available in most user settings, given that Velox also requires C++17 support.
The only issue we see so far with int128_t is the alignment requirement of the corresponding SIMD instructions generated by GCC. If we can figure out a way for the compiler to generate unaligned instructions, we should be good. I am looking into this further now.

@liujiayi771 is there any reason to consider using two int64 values for UnscaledLongDecimal?

@mbasmanova
Contributor

The only issue we see so far with 128_t is the requirement for alignment for the corresponding SIMD instructions.

PR #3755 suggests that we are not able to reinterpret_cast char* to int128_t* even when memory is properly aligned. I guess we need to dig in more to understand the root cause.

@majetideepak
Collaborator Author

reinterpret_cast is only a compiler directive, so there is no instruction associated with it. We also see these crashes only with GCC.
At O3, the generated code is highly optimized, and the only way to understand the crash is to look at the generated code. I plan to do this this week and will post updates here. A simple reproducer would help a lot too, but I don't have one yet.

@liujiayi771
Contributor

@mbasmanova My recommendation would be to continue using the int128_t type to get the best possible platform-specific implementations. int128_t types have been supported by GCC and Clang for a while now, so the type should be available in most user settings, given that Velox also requires C++17 support. The only issue we see so far with int128_t is the alignment requirement of the corresponding SIMD instructions generated by GCC. If we can figure out a way for the compiler to generate unaligned SIMD instructions, we should be good. I am looking into this further now.

@liujiayi771 is there any reason to consider using two int64 values for UnscaledLongDecimal?

I tested the average aggregation for decimals and got a core dump in DecimalAggregate::initializeNewGroups. I think the cause of the core dump is the construction of LongDecimalWithOverflowState. I referred to this PR and modified the code, and after that there is no error constructing LongDecimalWithOverflowState.

@majetideepak
Collaborator Author

majetideepak commented Mar 10, 2023

The assembly instruction here is vmovaps, and the Stack Overflow answer below describes why its memory operand must be aligned to 16 bytes.
https://stackoverflow.com/questions/67243284/why-movaps-causes-segmentation-fault
0x21f9d02 <_ZN8facebook5velox9aggregate16DecimalAggregateINS0_20UnscaledShortDecimalES3_E19initializeNewGroupsEPPcN5folly5RangeIPKiEE+98> vmovaps %xmm0,(%r11)

@majetideepak
Collaborator Author

majetideepak commented Mar 10, 2023

The fix would be one of the following:

  1. Ensure we can write code such that all loads are aligned to Aggregate::accumulatorAlignmentSize().
  2. Prevent the GCC compiler from generating vectorized loads that require alignment for int128_t types.

(2) can be achieved by any of the following approaches wherever int128_t values are being loaded:

  a. Disable SSE for the function:

     #if defined (__GNUC__) && (defined (__x86_64__) || defined (__i386__))
     __attribute__ ((target ("no-sse")))
     #endif

  b. Disable optimization around the code:

     #pragma GCC push_options
     #pragma GCC optimize ("O0")
     /* code */
     #pragma GCC pop_options

  c. Use volatile.

I feel (1) is hard to achieve. (2) seems simpler and is specific to int128_t types.
I will work towards adding support for (2) if there are no objections, and will look for better compiler options to avoid vmovaps.

@Yuhta
Contributor

Yuhta commented Mar 10, 2023

@majetideepak We are likely to take a performance hit with (2). It not only disables vmovaps but essentially all SIMD instructions (the O0 variant is even worse than that). I would suggest trying (1) first for performance reasons. If (1) turns out not to be possible, I think @isadikov's approach in #4129 is still better than (2).

@majetideepak
Collaborator Author

@Yuhta The trade-off here is space wasted due to alignment (1) vs. less vectorization (2), so (1) would not necessarily be more performant. O0 is definitely not desirable; I quickly checked whether it helps to avoid vmovaps, and it does.
The approach in #4129 solves this issue by using two int64_t instead of int128_t inside LongDecimalWithOverflowState. That approach has the buildInt128(), UPPER(), and LOWER() overheads, and it will also not help other uses of int128_t (UnscaledLongDecimal).
Since we cannot predict where the compiler may introduce 16-byte-aligned instructions for int128_t types, from an API standpoint I want the implementation to pick either int128_t or two int64_t for all long decimal types.
As mentioned above, I will check for other options to avoid 16-byte-aligned load instructions.
Maybe you can check with Orri or others internally at Meta if there is a less intrusive way.

@Yuhta
Contributor

Yuhta commented Mar 12, 2023

When an int128_t is allocated by the compiler, it is guaranteed to be aligned properly. Alignment is a problem only when we do deserialization, and it's not specific to int128_t; any unaligned data access can potentially have this problem. That's why I prefer to fix accumulatorAlignmentSize instead of finding a workaround just for int128_t.

@majetideepak
Collaborator Author

majetideepak commented Mar 12, 2023

Unaligned data access is only a problem if the compiler assumes unaligned memory to be aligned and then generates instructions that require aligned memory.
So far, we do not see these segfaults for other data types, so I believe this is specific to the int128_t type.

@Yuhta
Contributor

Yuhta commented Mar 12, 2023

I have seen it when we use something like sfmt19937. Actually, I have an idea of how to fix the alignment problem; let me try something quickly and see if it works.

@mbasmanova
Contributor

@majetideepak Deepak, thank you for looking into this. Just for my own understanding, are you thinking of switching physical representation of long decimal from int128_t to a struct of two int64_t?

@Yuhta
Contributor

Yuhta commented Mar 13, 2023

@majetideepak The fix for Aggregate::accumulatorAlignmentSize() is at #4268 , can you check if this fixes the crash with decimal accumulator?

@majetideepak
Collaborator Author

are you thinking of switching physical representation of long decimal from int128_t to a struct of two int64_t?

@mbasmanova, in my understanding, GCC assumes that int128_t addresses are always 16-byte aligned and generates instructions that require this alignment, e.g. vmovaps as described above.
Because of this, any pointer arithmetic on int128_t types in the Velox code must align the addresses to 16 bytes. @Yuhta has a fix for the accumulator code in #4268, but we have to ensure future features also honor this.

The other option is to make GCC not generate instructions that require this alignment; we would have to look at some GCC attributes to do this with minimal disruption, telling GCC that int128_t addresses are not 16-byte aligned. I looked into some options, but could not find a good one yet.

The last, and in my opinion least preferred, option is to use two int64_t values and essentially create an int128_t library.

@majetideepak
Collaborator Author

The fix for Aggregate::accumulatorAlignmentSize() is at #4268, can you check if this fixes the crash with decimal accumulator?

@Yuhta, I commented in that PR.

@Yuhta
Contributor

Yuhta commented Mar 13, 2023

@majetideepak If you can make the compiler generate vmovups instead of vmovaps, that would also be nice.

@Yuhta
Contributor

Yuhta commented Mar 13, 2023

Also, even if we have to give up int128_t in serialization, I would suggest using char[16] rather than int64_t[2] to replace it. This will work even if it is not aligned at 8 bytes, makes it clearer that it is a serialization format, and we can still load it into a register using something like vmovups.

@mbasmanova
Contributor

@majetideepak Deepak, thank you for clarifying.

@isadikov
Contributor

isadikov commented Mar 14, 2023

In a hypothetical scenario of int128_t serialization not working, I would recommend going with int64_t[2]; we already have all of the code and infra ready to work with it. I think if int64_t alignment does not work on a particular system, then one would have much bigger problems than just decimal support; probably half of the code would not work.

@Yuhta
Contributor

Yuhta commented Mar 14, 2023

@isadikov 8-byte alignment is probably working in the current places where we do in-memory serialization, so that is probably not a concern. There is no essential difference at runtime, though; it is a memory region of 16 bytes, and we need to load it into one single register for any further processing. What we don't want is the temptation to load it into 2 registers and treat them as 2 int64_t; that's the main rationale behind my suggestion.

@majetideepak
Collaborator Author

majetideepak commented Mar 14, 2023

we already have all of the code and infra ready to work with it.

@isadikov If you are referring to the approach in #4129, then we will incur the buildInt128(), UPPER(), and LOWER() overheads. A memcpy into an int128_t variable and a memcpy back is probably more efficient; I think that is what @Yuhta is referring to.

@FelixYBW

We still hit this issue now when we convert Arrow decimals to Velox. The buffer may not be 16-byte aligned. Arrow uses 2 int64_t; is there any reason we can't use that?
If we define

  struct Decimal {
    uint64_t a;
    uint64_t b;
  };
  Decimal A, B;
  A = B;

GCC will still use SIMD instructions but switch to movdqu. On current processors, movdqu and movdqa have the same performance if the address is 16-byte aligned.

We need to fix this issue; it's really hard to debug when a core dump is caused by it.

@Yuhta
Contributor

Yuhta commented Jun 22, 2023

@FelixYBW The load is not a problem (though we may need explicit intrinsic calls). It's when you do arithmetic on it that you lose the SIMD (or some shortcut in hardware that's not really SIMD). We can fix the alignment in the Arrow conversion. Do you have some sample data?

@FelixYBW

Do you mean that if we use 2x int64_t, GCC can't generate SIMD instructions for the arithmetic, so we may have to use explicit intrinsics? If so, let's keep the current 128-bit type.

Yes, we hit the issue in Gluten. There are two ways to fix the alignment during the Arrow conversion: one is to copy the value to an aligned buffer; the other is to make sure the buffer from Arrow is always 16-byte aligned. We are trying the second solution now.

@Yuhta
Contributor

Yuhta commented Jun 22, 2023

@FelixYBW The second way would be optimal. All Arrow buffers are aligned at 64 bytes, so this happens only if you are storing other types of data in the same buffer; in that case you need some padding. We can also add a check in the conversion to make sure the alignment is enforced.

zhouyuan pushed a commit to apache/incubator-gluten that referenced this issue Jul 3, 2023
… netty

For a small ColumnarBatch, the batch will not be compressed; the buffer, which originates from netty, is aligned, but the actual buffer used in the RecordBatch is SliceBuffer(buffer, offset, size), which cannot guarantee alignment. The SIMD instruction movdqa requires the address to be 16-byte aligned, so it core dumps in the Velox function copyValuesAndNulls and leaves potential core dumps elsewhere. This copy is expensive but essential.

  BufferReader::DoReadAt(int64_t position, int64_t nbytes) {
    return SliceBuffer(buffer_, position, nbytes); // buffer_ is a netty buffer
  }

For most batches, which are not tiny and use the default lz4 compression codec, shuffle read decompresses the buffer to an aligned address that meets the SIMD instruction requirements.

Relevant issue: facebookincubator/velox#2388
marin-ma pushed a commit to marin-ma/velox-oap that referenced this issue Dec 15, 2023
facebook-github-bot pushed a commit that referenced this issue Nov 21, 2024
Summary:
When integrating the Spark query runner with the Spark expression fuzzer test, we found it core dumps at the point below when copying a decimal vector whose memory is allocated by 'arrow::ipc::RecordBatchReader'.

https://github.com/facebookincubator/velox/blob/7b2bb7f672b435d38d5a83f9cd8441bf17b564e6/velox/vector/FlatVector-inl.h#L198

The reason is that Arrow uses two uint64_t values to represent a 128-bit decimal value, and the allocated memory might not be 16-byte aligned. This PR adds a copy step for long decimals in 'importFromArrowImpl' to ensure the alignment.

#2388

Pull Request resolved: #11404

Reviewed By: pedroerp

Differential Revision: D66307435

Pulled By: kagamiori

fbshipit-source-id: 86081c041169cfc196e68f36629a246f8626d3d9
athmaja-n pushed a commit to athmaja-n/velox that referenced this issue Jan 10, 2025