Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

assertion failure on Android AVD on Apple M1 #30

Closed
Znerole opened this issue May 13, 2021 · 20 comments · Fixed by #31 or #47
Closed

assertion failure on Android AVD on Apple M1 #30

Znerole opened this issue May 13, 2021 · 20 comments · Fixed by #31 or #47
Labels
bug Something isn't working

Comments

@Znerole
Copy link

Znerole commented May 13, 2021

When using Djinni in the Android Emulator for Apple M1 (arm64), the assertion in djinni_support.hpp:323 fails.

This is a false positive (no pun intended). In this environment handle can be negative. I'm not quite sure where the 4096 comes from, but I would suggest to either remove the assertion, cast the (signed) jlong value to unsigned for this assertion or allow the value to be either negative or greater than 4096.

I can confirm that my application runs perfectly fine when uncommenting this assertion.

@jothepro jothepro added the bug Something isn't working label May 13, 2021
@a4z
Copy link
Contributor

a4z commented May 13, 2021

Hi Znerole , thanks for the report!

Can you please update me about the current state of Android Studio / SDK / NDK and emulator on Apple M1 ?

@artwyman , do you know or remember what the magic number stands for, in the mentioned code line?

@Znerole
Copy link
Author

Znerole commented May 13, 2021

Sure! M1 is supported only with a canary build of Android Studio for now: https://androidstudio.googleblog.com/2021/04/android-studio-arctic-fox-canary-15.html

The accompanying emulator supports running arm64 system images, though it seems like only the Android 11 and 12 preview images are supported. The emulator is not very stable, the little time I've spent with it so far, I had frequent black screens and stuff like that, but it is possible to run and debug an application from within Android Studio. Except for the djinni assertion, my application (which is mostly C++) seemed to run fine, though I didn't test it heavily.

I'm not sure how the emulator works. It is qemu based and it appears to be quite fast.

The assertion left me a bit concerned if there are reserved values for the handle which might resolve to real memory addresses.

@artwyman
Copy link

I have no recollection of where the 4096 came from. It's unlike me to put magic numbers in code without explanation, but blame suggests this originated in a commit I authored 6 years ago: dropbox/djinni@d155107

My best guess is it was meant to catch uses of the wrong jlong value, such as a reference count, instead of the pointer intended here. Pointers are generally not close to zero, but the test of course totally omits the possibility that they could be interpreted as negative.

I think it's probably fine to simply remove the assert at this point.

Also, an Android emulator for an Apple chip is something I wouldn't have expected to see. :)

a4z added a commit to a4z/djinni-support-lib that referenced this issue May 18, 2021
This should close cross-language-cpp#30 but keep the exisiting check in place.
@a4z a4z closed this as completed in #31 May 20, 2021
a4z added a commit that referenced this issue May 20, 2021
This should close #30 but keep the exisiting check in place.
@jothepro
Copy link
Contributor

@Znerole can you approve that this solved the issue for you on M1?

@Znerole
Copy link
Author

Znerole commented May 25, 2021

Yes, works for me! Thank you very much!

@A-Mendi
Copy link

A-Mendi commented Jul 19, 2021

FWIW, having this fix was crashing the app on Galaxy A11 with Android 10 device. Emulating this device's memory/heap on Android Studio did not cause the crash, it was however, sporadically (but often) crashing on physical device. Reverting this change resolved the crash. We ended up removing this assert from our fork altogether.

@a4z
Copy link
Contributor

a4z commented Jul 19, 2021

Thanks for your report @A-Mendi !

I think we also should remove that assert.
It is not clear why this magic number was added in a first place anyway, and it causes problems,
The case it protect against never ever has hit anyway, and it is also clear that since it was introduced 6 or more years ago, something has changed on devices so the assumption made for that assert is not valid anymore.

@a4z a4z reopened this Jul 19, 2021
@a4z
Copy link
Contributor

a4z commented Jul 19, 2021

One question, @A-Mendi , since you mention your own fork,
did you apply the change we have made in #31 ?

@A-Mendi
Copy link

A-Mendi commented Jul 20, 2021

@a4z - Yes we cherry-picked that change in our fork and that caused app to crash in a different codepath. I didn't investigate further into the cause of the crash as just removing this line seemed to have resolved the issue

@Znerole
Copy link
Author

Znerole commented Jul 20, 2021

@A-Mendi, that seems to point a more profound problem than just this assertion.

I also can not see how there could be a valid pointer in the range of [0..4095], assuming 4k is the memory page size for the given device. On iOS the page size is between 4k and 16k (https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/AboutMemory.html), for Android I think it can be assumed that it's 4k or bigger.

@a4z
Copy link
Contributor

a4z commented Jul 20, 2021

That's a somehow a multifaceted topic.

It would of course be interesting to debug that problem of A-Mendi, but that is obviously not possible.
And I can not trigger the assert self.

Since this is an assert, it can be tuned of by defining NDEBUG in your build, something a release build should do anyway.
So this, and all other asserts, are not in the binary, and this check is just a developer check/ debug build one.
And they should not be on a shipped app. (except you ship a non release build)

On the other hand, we have the situation that a pointer is casted to a number, passed via the API, and then back casted.
That is problematic, and there is no good way to guarantee the integral holds a valid pointer address and type, never mind if it's below 4k or not.
The only thing is, it might protect agains something that is similar and takes an index, and it tells you that you are in the right function.
But, since this code is only reached via generated code, chances something is wrong are little to zero.

I also think, if there would be a problem in A-Mendi's code, and the integral would not be a valid pointer, I guess an unpleasant exit would be the consequence anyway. So maybe there is an other problem (very likely) and this crash is just some indicator.

However, we have here some kind of code smell, and the only solution I see is having a smarter type erasure that holds the type info internally to the integral and makes use of that.
I think this is possible to implement, if anyone is interested, someone created a whole playlist to the topic (and one talk in this list is even from me ;-) )

That will take some time until I find enough focus time to implement that, so if anyone wants doing that, please do. All the talks should give enough know how on the topic.

But we might define that task in an extra ticket, will see, and think a bit more about that. So I leave this ticket a little bit more open as a reminder for that.

@artwyman
Copy link

The original hand-written bridging code I wrote before Djinni used a type-tagging approach like you describe, though it was in the first word of a struct not embedded in the jlong. It caught a lot of mistakes in my initial experiments. It wasn't included in the first generation of generated code since the risk of mistakes was so much lower.

Personally I'd keep the assert(handle); (which was removed in #31) but get rid of the magic number entirely, since I can't think of a valid reason for it. Any handle which goes through this code is about to be dereferenced, so if there were a mistake, it's likely to crash later anyway. The magic number check only gives a nicer debug message in that case which should never happen, and it would certainly cause unnecessary crashes if some architecture did allocate real data within the first 4096 bytes of the address space. That's not impossible, though certainly less likely than the issue before #31 which would be with any "negative" pointer (with high bit set).

@a4z
Copy link
Contributor

a4z commented Jul 21, 2021

Thanks Andrew!
Changing to an assertion that the pointer is not a null pointer sounds like a very good alternative for that place.

a4z added a commit to a4z/djinni-support-lib that referenced this issue Jul 29, 2021
As discussed in cross-language-cpp#30, the magic number brought since some time some
problems for some people.
Please find the details in the discussion of cross-language-cpp#30, and why we switch to
this assert now.

Fixes cross-language-cpp#30, second version.
@a4z a4z closed this as completed in #47 Aug 1, 2021
a4z added a commit that referenced this issue Aug 1, 2021
As discussed in #30, the magic number brought since some time some
problems for some people.
Please find the details in the discussion of #30, and why we switch to
this assert now.

Fixes #30, second version.
@davidschreiber
Copy link

As it was raised in cross-language-cpp/djinni-generator#110, this issue might be the M1-based manifestation of missing support for ARM64 tagged pointers, which was introduced with Android 11: https://source.android.com/devices/tech/debug/tagged-pointers

@a4z
Copy link
Contributor

a4z commented Oct 13, 2021

what do you mean with missing support for ARM64 tagged pointer,, @davidschreiber ?
to me it seams we support them since we fixed that issue, or do you mean we miss something else?

@jothepro
Copy link
Contributor

jothepro commented Oct 13, 2021

@a4z I think what @davidschreiber is trying to say is that pointer tagging could be the reason why pointer values on M1 exceeded the magic number limit in the first place. If this is indeed causing crashes on Android, I wonder if we are risking crashes on M1 as well? This is just me guessing, but maybe Apple is not (yet) enforcing top-byte ignore which is why the underlying problem was not yet discovered?

@a4z
Copy link
Contributor

a4z commented Oct 13, 2021

The magic number does not exist anymore, so we should be save. We also do not have any crash on Android since we fixed this issue, afaik, Or do I still miss something?

@jothepro
Copy link
Contributor

Hmm... @davidschreiber are the crashes that you reported in cross-language-cpp/djinni-generator#110 caused by the assert discussed in this issue or because Android detected a top byte modification and therefore terminated the process?

@davidschreiber
Copy link

davidschreiber commented Oct 14, 2021

Yes, the crashes are exactly the same, as we saw the ARM64 failures on the same assert. This failure is most likely due to the top bit being marked and interpreted as negative number, so performing a null check instead in this place seems like a good decision going forward.

However, I don't know about the extend of pointer tagging, and whether there are other issues that might come from it. To reproduce this on Android, you need an Arm64 device running API 30, and and set the targetSdkVersion to 30 as well, to enable tagged pointer support in the OS.

I'm going to see how fast we can make the transition from our current Djinni fork to this project, and we'll see if any other issues arise after doing so (we're developing https://pspdfkit.com for Android, and make extensive use of Djinni).

@a4z
Copy link
Contributor

a4z commented Oct 14, 2021

If you need any help with the transition, let us know, we are happy to help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
6 participants