GH-13: Set up JNI build (dataset, etc.) #449

Merged 7 commits on Jan 2, 2025

Conversation

lidavidm
Member

@lidavidm lidavidm commented Dec 5, 2024

Fixes #13.

@lidavidm lidavidm force-pushed the gh-13 branch 2 times, most recently from ac6bbb9 to d98309e Compare December 5, 2024 05:12
@lidavidm
Member Author

lidavidm commented Dec 5, 2024

The Docker image takes so long to build that we really need to cache it.

@lidavidm lidavidm force-pushed the gh-13 branch 6 times, most recently from 0598a43 to c8eb06b Compare December 5, 2024 08:29
@lidavidm
Member Author

lidavidm commented Dec 9, 2024

It's going to be a while before I can get back to this.

@lidavidm lidavidm force-pushed the gh-13 branch 8 times, most recently from fc17d5e to dcef949 Compare December 27, 2024 01:22
@lidavidm
Member Author

Hmm, this problem doesn't happen on a local build if I change to a debug build...

@lidavidm
Member Author

Ah, interesting: preloading ASan via LD_PRELOAD (even if the binaries are built without it) "fixes" the issue. That's rather unfortunate, since ASan was how I was trying to track down the corruption issue...
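For reference, preloading the ASan runtime into an uninstrumented process might look like the sketch below. It assumes GCC's libasan is installed; `find_asan_runtime` and `run_with_asan_preloaded` are hypothetical helper names, not part of this PR or of any Arrow script, and the Maven command in the usage comment is only a placeholder.

```shell
# Print the absolute path of GCC's ASan runtime, or nothing if it cannot be found.
find_asan_runtime() {
  command -v gcc >/dev/null 2>&1 || return 0
  path=$(gcc -print-file-name=libasan.so)
  # gcc echoes the bare file name back when it cannot resolve the library,
  # so only accept an absolute path that actually exists.
  case "$path" in
    /*) [ -e "$path" ] && printf '%s\n' "$path" ;;
  esac
  return 0
}

# Run the given command with the ASan runtime preloaded (if available).
run_with_asan_preloaded() {
  asan=$(find_asan_runtime)
  if [ -z "$asan" ]; then
    echo "ASan runtime not found; running without it" >&2
    "$@"
  else
    # detect_leaks=0 turns off LeakSanitizer, which is often noisy under a JVM.
    LD_PRELOAD="$asan" ASAN_OPTIONS=detect_leaks=0 "$@"
  fi
}

# Usage (placeholder command, adjust to the real test invocation):
#   run_with_asan_preloaded mvn test
```

Because the preloaded runtime replaces the process's allocator, a crash that depends on glibc's heap layout may stop reproducing under this setup, which matches what is described above.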

@lidavidm
Member Author

Valgrind also seems to "fix" the issue :/

@lidavidm
Member Author

I may disable the ORC test for now to get CI passing...we'll have to revisit a lot of the JNI work.

@wgtmac
Member

wgtmac commented Dec 30, 2024

What is the error message for the ORC test?

@lidavidm
Member Author

It is crashing in CI. When I debug locally it appears that a malloc assertion fails. However it no longer reproduces for me after fiddling around.

@lidavidm
Member Author

I tried to use ASan and Valgrind (separately) to identify possible memory corruption but it turns out under these tools, the crash no longer reproduces. Also after using the tools, even with them turned off now, I can't reproduce the crash locally anymore.

@wgtmac
Member

wgtmac commented Dec 30, 2024

I suspect the ORC crash is due to a missing timezone database: apache/arrow#36026.

@lidavidm
Member Author

It crashes in malloc, though. I do have the core dump:

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007f6268d37f1f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007f6268ce8fb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f6268cd3472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x00007f6268d2c430 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f6268e48b80 "Fatal glibc error: malloc assertion failure in %s: %s\n")
    at ../sysdeps/posix/libc_fatal.c:155
#5  0x00007f6268d442bc in __malloc_assert (function=0x7f6268e49ba0 <__PRETTY_FUNCTION__.8> "sysmalloc", line=2611, file=<synthetic pointer>, 
    assertion=0x7f6268e49330 "(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)") at ./malloc/malloc.c:299
#6  sysmalloc (nb=nb@entry=2624, av=av@entry=0x7f6260000030) at ./malloc/malloc.c:2611
#7  0x00007f6268d4508e in _int_malloc (av=av@entry=0x7f6260000030, bytes=bytes@entry=2614) at ./malloc/malloc.c:4403
#8  0x00007f6268d45989 in __GI___libc_malloc (bytes=2614) at ./malloc/malloc.c:3323
#9  0x00007f61321523bf in orc::DataBuffer<char>::reserve(unsigned long) () from /tmp/target7371369939985176724arrow_orc_jni
#10 0x00007f6132152426 in orc::DataBuffer<char>::DataBuffer(orc::MemoryPool&, unsigned long) () from /tmp/target7371369939985176724arrow_orc_jni
#11 0x00007f613214c9dc in orc::createReader(std::unique_ptr<orc::InputStream, std::default_delete<orc::InputStream> >, orc::ReaderOptions const&) ()
   from /tmp/target7371369939985176724arrow_orc_jni
#12 0x00007f613160f6cd in arrow::adapters::orc::ORCFileReader::Impl::Open(std::shared_ptr<arrow::io::RandomAccessFile> const&, arrow::MemoryPool*) ()
   from /tmp/target7371369939985176724arrow_orc_jni
#13 0x00007f6131609bd2 in arrow::adapters::orc::ORCFileReader::Open(std::shared_ptr<arrow::io::RandomAccessFile> const&, arrow::MemoryPool*) ()
   from /tmp/target7371369939985176724arrow_orc_jni
#14 0x00007f6131170270 in Java_org_apache_arrow_adapter_orc_OrcReaderJniWrapper_open () from /tmp/target7371369939985176724arrow_orc_jni
#15 0x00007f624fc6b9c0 in ?? ()
#16 0x0000000000000000 in ?? ()

@lidavidm
Member Author

And interestingly now ORC passes in CI.

@lidavidm
Member Author

Anyways, for now I've disabled the test so that the JNI build can pass. It turns out one of the fixes here is needed to fix the Arrow CI (apache/arrow#45128).

Member

@kou kou left a comment

+1

Member

Can we remove the java_ prefix, since this is the apache/arrow-java repository?

Comment on lines 55 to 56
ARCH_ALIAS: ${{ matrix.platform.archery_arch_alias }}
ARCH_SHORT: ${{ matrix.platform.archery_arch_short }}
Member

We can remove them.

password: ${{ secrets.GITHUB_TOKEN }}
- name: Build C++ libraries
env:
VCPKG_BINARY_SOURCES: "clear;nuget,GitHub,readwrite"
Member

This may not work as is because it needs more code.
For example, changes like https://github.com/apache/arrow/pull/44644/files#diff-e45e45baeda1c1e73482975a664062aa56f20c03dd9d64a827aba57775bed0d3R2135 are also needed.

But we can work on this as a follow-up task.

@wgtmac
Member

wgtmac commented Dec 31, 2024

I found that the JNI libraries built on Ubuntu are linked against both jemalloc and mimalloc. The core dump indicates an invalid initial state in sysmalloc. I'm not sure whether enabling both jemalloc and mimalloc leads to undefined behavior. Should we consider disabling jemalloc by default?

: "${ARROW_GANDIVA:=ON}"
export ARROW_GANDIVA
: "${ARROW_GCS:=ON}"
: "${ARROW_JEMALLOC:=ON}"
Member

Suggested change
: "${ARROW_JEMALLOC:=ON}"
: "${ARROW_JEMALLOC:=OFF}"
: "${ARROW_MIMALLOC:=ON}"
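
As a side note on the idiom in this suggestion: `: "${VAR:=default}"` assigns the default only when the variable is unset or empty, so a caller can still override it from the environment. A minimal sketch follows; the override of `ARROW_MIMALLOC` is made up purely for illustration and is not something the PR does.

```shell
# Start from a clean slate so the example is deterministic.
unset ARROW_JEMALLOC ARROW_MIMALLOC

# Simulate a caller-provided override (illustrative only).
ARROW_MIMALLOC=OFF

# ':' is a no-op builtin; the expansion performs the assignment as a side effect.
: "${ARROW_JEMALLOC:=OFF}"   # unset, so the default OFF is assigned
: "${ARROW_MIMALLOC:=ON}"    # already set, so the caller's value is kept

echo "ARROW_JEMALLOC=${ARROW_JEMALLOC} ARROW_MIMALLOC=${ARROW_MIMALLOC}"
```

With the suggested defaults, an environment that sets nothing ends up with jemalloc off and mimalloc on, while `ARROW_JEMALLOC=ON` in the environment would still take precedence.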

@lidavidm
Member Author

I can try that. But I thought we've shipped multiple allocators in one binary before. (Arrow doesn't use jemalloc or mimalloc to replace system malloc.)

@lidavidm
Member Author

Huh, that passed @wgtmac. Let's hope it stays that way 😅

I addressed Kou's feedback too.

Member

@kou kou left a comment

+1

Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
@kou
Member

kou commented Jan 2, 2025

Hmm. The ORC crash reproduced again...?
https://github.com/apache/arrow-java/actions/runs/12557315120/job/35010522318?pr=449#step:7:19745

Fatal glibc error: malloc.c:2599 (sysmalloc): assertion failed: (old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)
Aborted (core dumped)

@lidavidm
Member Author

lidavidm commented Jan 2, 2025

I think it's flaky then. Are we ok with disabling it for now?

@wgtmac
Member

wgtmac commented Jan 2, 2025

Yes, we need to disable the ORC test in this PR. Would it help to build the JNI libraries with ASan in a separate PR and run the ORC test against them?

@lidavidm
Member Author

lidavidm commented Jan 2, 2025

I can try again in another PR. I think I found that ASAN hid the problem (Valgrind too). (Also both tools are finicky with the JVM.)

@lidavidm lidavidm merged commit ad9bad9 into apache:main Jan 2, 2025
16 checks passed
@lidavidm lidavidm deleted the gh-13 branch January 2, 2025 04:31
@lidavidm
Member Author

lidavidm commented Jan 2, 2025

Filed #473

@lidavidm
Member Author

lidavidm commented Jan 2, 2025

Now that we have some CI I'm going to start merging Dependabot updates again

@wgtmac
Member

wgtmac commented Jan 2, 2025

I tried enabling ASan in Apache ORC and it found no issues: https://github.com/apache/orc/actions/runs/12577822275/job/35055736586?pr=2097

@lidavidm
Member Author

lidavidm commented Jan 2, 2025

ASan replaces the malloc implementation, so the error may get masked. (Though if it is indeed memory corruption presumably ASan would find that instead.)

@lidavidm
Member Author

lidavidm commented Jan 2, 2025

One other thing we could do is try various combinations of MALLOC_CHECK_ and MALLOC_PERTURB_.
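
A rough sketch of what such a sweep could look like is below. glibc spells these environment variables with a trailing underscore (`MALLOC_CHECK_`, `MALLOC_PERTURB_`). `TEST_CMD` and `run_matrix` are hypothetical names, and the default command is just `true` so the loop can be run as is; the real ORC test invocation would go there. On newer glibc versions `MALLOC_CHECK_` may additionally need the malloc debug library preloaded, which this sketch does not attempt.

```shell
# Placeholder test command; replace with the actual ORC test invocation.
TEST_CMD=${TEST_CMD:-true}

# Run the test once per (check, perturb) pair and report which ones fail.
run_matrix() {
  for check in 0 1 2 3; do
    for perturb in 0 85 170 255; do
      # $TEST_CMD is intentionally unquoted so a command with arguments splits into words.
      if MALLOC_CHECK_="$check" MALLOC_PERTURB_="$perturb" $TEST_CMD; then
        echo "check=$check perturb=$perturb: ok"
      else
        echo "check=$check perturb=$perturb: FAILED"
      fi
    done
  done
}

run_matrix
```

Since the crash is flaky, each combination would probably need several repetitions before a passing result means much.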

@wgtmac
Member

wgtmac commented Jan 2, 2025

I still don't understand why no mimalloc symbols appear in the core dump backtrace if we linked mimalloc.

@lidavidm
Member Author

lidavidm commented Jan 2, 2025

We don't use mimalloc to replace malloc. It's only used by the Arrow memory pool. So we are still using glibc malloc for regular allocations.
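
One way to double-check this on a built library is to see whether it defines a global `malloc` symbol of its own. The sketch below is only an illustration: `defines_malloc` is a hypothetical helper that parses `nm -D --defined-only` output, and the `LIB` path is a placeholder, not the actual artifact location.

```shell
# Reads `nm -D --defined-only` output on stdin and succeeds only if a global
# `malloc` symbol is defined (type T or W). Prefixed names such as mi_malloc
# do not count, since they do not replace the system allocator.
defines_malloc() {
  awk '($2 == "T" || $2 == "W") && $3 ~ /^malloc(@|$)/ { found = 1 }
       END { exit !found }'
}

# Placeholder path; point this at the real JNI shared library.
LIB=${LIB:-./libarrow_orc_jni.so}

if [ -f "$LIB" ]; then
  if nm -D --defined-only "$LIB" | defines_malloc; then
    echo "$LIB defines malloc and could interpose the system allocator"
  else
    echo "$LIB does not define malloc; regular allocations go to glibc"
  fi
else
  echo "$LIB not found; skipping check"
fi
```

If the library does not define `malloc`, frames like `__GI___libc_malloc` in the backtrace are expected even when mimalloc is linked in for the Arrow memory pool.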

kou added a commit to apache/arrow that referenced this pull request Jan 5, 2025
…ipts (#45165)

### Rationale for this change

apache/arrow-java removed the `java_` prefix from scripts in apache/arrow-java#449.

### What changes are included in this PR?

Follow the script name change.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: #45164

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
@lidavidm lidavidm added this to the 18.2.0 milestone Jan 30, 2025
Development

Successfully merging this pull request may close these issues.

Add test CI: JNI
3 participants