Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New caching logic may inadvertently release objects that are still needed #257

Closed
samschott opened this issue Jan 23, 2023 · 5 comments · Fixed by #258
Closed

New caching logic may inadvertently release objects that are still needed #257

samschott opened this issue Jan 23, 2023 · 5 comments · Fixed by #258
Labels
bug A crash or error in behavior.

Comments

@samschott
Copy link
Member

samschott commented Jan 23, 2023

Describe the bug

The current logic to synchronise object cache access, introduced in #246, may lead to premature releases, because the return values from alloc() and init() calls will point to the same memory address for the allocated and the initialed object but the class name may have changed between those calls.

The code below will then result in the ObjCInstance returned by alloc() to be a different Python object than the one returned from the following init().

if cached_class_name != current_class_name:
# There has been a cache hit, but the object is a different class.
# Purge the cache for this address, and treat it as a cache miss.
del cls._cached_objects[object_ptr.value]
raise KeyError(object_ptr.value)

Because the object returned by alloc() is typically not stored in any variable in a chained alloc().init() call, it will be garbage collected, leading to a release() call from our own __del__ override.

def __del__(self):
"""Release the corresponding objc instance if we own it, i.e., if it
was returned by by a method starting with 'alloc', 'new', 'copy', or
'mutableCopy' and it wasn't already explicitly released by calling
:meth:`release` or :meth:`autorelease`."""
if self._needs_release:
send_message(self, "release", restype=objc_id, argtypes=[])

Steps to reproduce

Run the following script:

from toga_cocoa.libs import NSWindow
from rubicon.objc.runtime import libobjc

w_pre_init = NSWindow.alloc()

ptr_pre_init = w_pre_init.ptr.value
name_pre_init = libobjc.class_getName(libobjc.object_getClass(w_pre_init.ptr))

w_post_init = w_pre_init.init()

ptr_post_init = w_post_init.ptr.value
name_post_init = libobjc.class_getName(libobjc.object_getClass(w_post_init.ptr))

print(f"name_pre_init = {name_pre_init}")
print(f"name_post_init = {name_post_init}")
print(f"pointers equal: {ptr_pre_init == ptr_post_init}")

This can in practice lead to segfaults when creating a window early but showing it only later, e.g., on button press.

Expected behavior

We should not send release() immediately after an init() call, because it will be sent to the same memory address of an object that we still want.

Screenshots

No response

Environment

  • Operating System: macOS 13.1
  • Python version: 3.11.1
  • Software versions:
    • rubicon-objc: 0.4.4

Logs

No response

Additional context

I've observed such class-name changes in particular with NSWindow, NSView and NSApplication but no other classes so far.

@samschott samschott added the bug A crash or error in behavior. label Jan 23, 2023
@samschott
Copy link
Member Author

samschott commented Jan 23, 2023

@freakboy3742, I see that you even have left a comment about class-changes happening during initialisation:

# 1. Objective-C runtime provides a function object_setClass that
# can change an object's class after creation, and some code
# manipulates objects' isa pointers directly (although the latter
# is no longer officially supported by Apple). This is not
# commonly done in practice, and even then it is usually only
# done during object creation/initialization, so it's basically
# safe to assume that an object's class will never change after
# it's been wrapped in an ObjCInstance.

I guess you did not heed your own warning ;)

@freakboy3742
Copy link
Member

Thanks for the report - this one's a doozy.

I can reproduce the results you've described with the test code; I haven't been able to reproduce a segfault, but that's going to be unpredictable by definition. Interestingly, in the process of writing #246, I saw cache evictions for NSWindow/NSKVONotifying_NSWindow, but didn't catch the implications on memory allocation (and I didn't see a segfault either).

Three possible fixes I can think of:

Widen the 'stale class name' criteria

Instead of using a strict classname match, we broaden the criteria to identify common renames that don't constitute a an actual stale class. For example, we could use current_class_name.endswith(f'_{cached_class_name}') as a class name match, rather than ==. That would be sufficient for the NSWindow case; I haven't seen what NSView or NSApplication name changes can occur, so I'm not sure if this approach would be sufficient in the general case.

Preserve _needs_release

When evicting a "stale" object from the instance cache, check the _needs_release value on the stale instance. If it's True, set it to False on the stale instance, but ensure that the replacement cache value has _needs_release=True.

In the NSWindow.alloc().init() case from your example, this means the instance created around the alloc() will be set _needs_release=True; the updated instance generated for the init() call will inherit the _needs_release=True value, but the alloc() wrapper will be set back to _needs_release=False, so when it is disposed of, it won't trigger a release on the still-valid object.

The problem I can see here is that I'm not 100% sure what will happen for the stale object case at the core of #246 - if the stale object was set "_needs_release = True", that means the stale object won't be released, but the new object will think it needs to be released, even though that may not be warranted.

Something akin to #256

Instead of Rubicon trying to directly mirror the ownership of ObjC objects, we move to a model where we try to control memory allocation on the ObjC side by waiting for Python to garbage collect. Essentially every Python-side ObjCInstance instance gets a retain on creation, and a release on __del__. At the very least, we'd know that if an ObjCInstance exists, the memory address is valid; but obviously there's a big risk of a memory leak here unless we make sure Python is disposing of objects.


@samschott @dgelessus Any thoughts on any of these?

@mhsmith
Copy link
Member

mhsmith commented Jan 24, 2023

Essentially every Python-side ObjCInstance instance gets a retain on creation, and a release on __del__. At the very least, we'd know that if an ObjCInstance exists, the memory address is valid; but obviously there's a big risk of a memory leak here unless we make sure Python is disposing of objects.

This is the approach that Chaquopy has always taken, with the intention of making it simply impossible for Python code using the library to cause an invalid pointer access. In practice it's always worked very well, and I'm not aware of it ever causing a memory leak.

As I understand it, CPython always calls __del__ immediately when an object's reference count drops to zero, and augments this with a periodic garbage collection of unreachable objects contained in reference cycles. In most cases, there will be no cycle, and the object will be released deterministically.

As mentioned in #256, the main concern would be "cyclic references between Python objects and Objective C objects". I'll have to leave it to those who understand the code better than me to judge how common that would be.

@samschott
Copy link
Member Author

samschott commented Jan 24, 2023

I can reproduce the results you've described with the test code; I haven't been able to reproduce a segfault, but that's going to be unpredictable by definition.

I've also struggled to reproduce the segfault in a minimal example, but I'm seeing it consistently in full apps. This includes my own, but also toga's window example app: click "Create Window" and three windows pop up. Try to close the top one and you consistently get a segfault.

Regarding the options which you presented:

Widen the 'stale class name' criteria

This approach would certainly work for the class-name changes that I have observed, but it is hacky because it relies on deducing a rule from a pattern of three. IMO, it makes an already hacky workaround even worse. Gripes that I already have with the current approach:

  1. Just because the class name and memory address are the same, it doesn't mean that we are still looking at the same instance. As you have noted in the in-line comment, this is a fair compromise.
  2. The lock guards against race conditions on the Python side, but we still might have something on the Obj-C side deallocating the object and replacing with another after our checks pass but before we return the cache-hit.

Preserve _needs_release

That seems to set us up for more chaos and segfaults. As you write:

if the stale object was set "_needs_release = True", that means the stale object won't be released, but the new object will think it needs to be released, even though that may not be warranted.

Indeed, there is no guarantee that the new object needs to be released. In the alloc().init() case, this is certainly true, but in the more general case which prompted those changes, it won't be.

Something akin to #256

The more I think about it, the more I like this option. We initially introduced the _needs_release logic to ensure that objects returned by alloc() and related calls, which are automatically owned by us, do get released eventually without requiring explicit release() calls by the user. The latter would be very challenging to get right in Python.

This was the minimally-invasive solution. The alternative would be to manually issue retain() calls whenever an ObjCInstance is created, except for alloc() and related calls, and always release objects on __del__(). This should work just as well.

Cyclic references between Python objects and Obj-C objects, or on the Obj-C, are already a problem which the user must work around by using objc_property(weak=True) whenever those may be introduced.

That being said, we might run to a whole number of corner cases and memory leaks which we did not anticipate. We will need to carefully think through different code paths. The problem with memory leaks: they are harder to notice and debug than segfaults.


One path could be to implement option 1 for the next release, because the current thread-safety issues do need to be fixed, while preparing option 3 in a release candidate and carefully testing it with a number of toga apps.

@freakboy3742
Copy link
Member

Completely agreed that option (1) is a hack stacked on a hack; however, it's also the fastest path forward right now. I've pushed #258 as an implementation of that approach, and a solution to this immediate issue (it seems to fix the window example); a deeper investigation of #256 is likely the "real" fix here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A crash or error in behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants