Independent SubInterpreters are still not concurrent with python 3.12+ #593

Open
novos40 opened this issue Jan 23, 2025 · 7 comments
novos40 commented Jan 23, 2025

Describe the bug
I'm running the latest Python 3.13.1 on an 8-core VM. I start 8 Java threads; each thread creates its own SubInterpreter and runs a CPU-only function like:

def cpu_bound(number):
    # print(f">>> cpu_bound({number = })")                   # debug print
    # res = sum(i * i for i in range(number))                # CPU load
    print(f">>> cpu_bound({number = }): while loop")
    res = 0
    i = 0
    while i < number:
        res += i * i
        i += 1
    print(f"<<< cpu_bound({number = }): while loop")
    return res

I've removed all function calls, even built-ins like sum(). It seems like SubInterpreter creation and even the function starts are concurrent, but overall execution is still synchronized on something (despite the fact that Python 3.12+ should support a per-interpreter GIL). The output looks like this (threads were started with consecutive numbers so we can match starts and stops):

>>> cpu_bound(number = 100000001): while loop
>>> cpu_bound(number = 100000003): while loop
>>> cpu_bound(number = 100000004): while loop
>>> cpu_bound(number = 100000005): while loop
>>> cpu_bound(number = 100000007): while loop
>>> cpu_bound(number = 100000002): while loop
>>> cpu_bound(number = 100000006): while loop
>>> cpu_bound(number = 100000000): while loop
<<< cpu_bound(number = 100000003): while loop
<<< cpu_bound(number = 100000001): while loop
<<< cpu_bound(number = 100000007): while loop
<<< cpu_bound(number = 100000005): while loop
<<< cpu_bound(number = 100000004): while loop
<<< cpu_bound(number = 100000006): while loop
<<< cpu_bound(number = 100000002): while loop
<<< cpu_bound(number = 100000000): while loop

which tells me that all functions do start concurrently, but then they wait for each other in some random order. Across multiple runs the order of starts and stops varies, but the first thread never finishes until after the last thread has started. CPU utilization never exceeds 12% (except for a very short startup period, for code compilation I guess), i.e. exactly like single-threaded execution. For that matter, you can start Python threads and get exactly the same performance, or rather the lack thereof.
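For comparison, the same experiment can be approximated in pure Python with threading (a sketch of mine, not the reporter's actual Java harness); with a single shared GIL the threads cannot execute the loop in parallel, which matches the ~12% CPU ceiling described above:

```python
import threading

def cpu_bound(number):
    # Same CPU-only loop as in the report
    res = 0
    i = 0
    while i < number:
        res += i * i
        i += 1
    return res

results = {}

def worker(n):
    results[n] = cpu_bound(n)

# The issue uses ~100_000_000 iterations per thread; smaller numbers here
threads = [threading.Thread(target=worker, args=(n,)) for n in (1000, 1001, 1002)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # all three workloads completed
```

Timing this against sequential calls shows essentially no speedup for plain threads, which is the baseline the sub-interpreter setup was supposed to beat.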

What am I doing wrong here? How do I make sub-interpreters execute concurrently?

To Reproduce

  1. Create and start a Java platform thread
  2. Create an independent SubInterpreter in each thread
  3. Call a Python function

Expected behavior
Fully independent SubInterpreters should run concurrently allowing for 100% CPU utilization.

Environment (please complete the following information):

  • Windows 10, Linux
  • Python v3.12.1, v3.13.1
  • Java 21
  • Jep v4.2.2
  • Python packages used (e.g. numpy, pandas, tensorflow): pure python CPU-only code


@ndjensen
Member

Just to be clear, what SubInterpreterOptions did you set on your JepConfig when creating the SubInterpreters? Jep just passes the options through to the CPython interpreter, so if there really is a problem I don't know if there's much we can do about it.

@novos40
Author

novos40 commented Jan 23, 2025

Oh, sorry, I totally missed the sub-interpreter options.
However, using SubInterpreterOptions.isolated() just kills the JVM process without any message or memory dump. It looks like somebody just called System.exit().
Using legacy options and manually setting

    final SubInterpreterOptions sio = SubInterpreterOptions.legacy();
    sio.setCheckMultiInterpExtensions(true);
    sio.setUseMainObmalloc(false);
    sio.setOwnGIL(true);
    jepConfig.setSubInterpreterOptions(sio);

produces the same result: the JVM just dies on the first Interpreter.set(name, value) call that sets a non-primitive Java value. That is, in the following sequence (cntx is an instance of Interpreter):

    cntx.set("pythonEngine", "JEP");     // set String value
    cntx.set("undefined", noValue);      // set java object instance value
    cntx.set("appContext", appContext);  // set java lambda value
    cntx.set("ML", ML.class);            // set java class value

the first two sets work fine and the JVM dies on the third (the Java lambda value). It will also die on setting the ML.class value if I flip the last two lines.
It looks like it dies in the java.lang.Class.getDeclaredFields() method while reflecting the value, or at least that's the last place I can see in the Java debugger.
FYI:
appContext is just a reference to a static method of some Java class. Nothing from me, all pure Java.
The ML class is actually quite large, with 1500+ methods (it's our universal data structure; most of the methods are for JIT optimizations), but that should not matter unless there is some limited space allocated somewhere.

Everything works fine with legacy options, so I doubt that it's a buffer overflow issue.

Any ideas how to trace the reason?

@bsteffensmeier
Member

Thank you so much for the code examples you included. I initially tried to replicate the problem by adding some calls to set() in the isolated interpreter test, but I could not get it to crash. When I took exactly what you have and pasted it into a Java main, it crashed immediately. It took some digging, but eventually I found that setting PYTHONMALLOC=pymalloc_debug allows Java to create an hs_err_pid file containing a native stack trace with some helpful clues in it.

The problem is that PyJObject (the Python class that represents java.lang.Object) is a statically allocated type which is shared between sub-interpreters. I had thought sharing would be safe because immutable objects are allowed to be shared, but I failed to take into account that creating a subclass mutates the superclass. The failure occurs when a subclass of PyJObject is created in an isolated sub-interpreter, and we create a subclass every time a new Java class is used in Python. Not every subclass causes a crash; it seems to only happen when the dict holding subclasses reaches a capacity threshold where it needs to grow. I assume the reason I couldn't crash it in our test case is that the tests do a lot of testing in SharedInterpreters, so the subclass map has already grown large enough that further growth is infrequent. With an isolated main program it crashes after only a handful of subclasses. PEP-684 specifically discusses how subclasses can cause problems, but until you found this crash I didn't realize it was going to impact us.
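The "creating a subclass mutates the superclass" point is visible even from pure Python: every type keeps a registry of its subclasses (the tp_subclasses dict at the C level), so merely defining a subclass writes into the base type's state. A minimal illustration:

```python
class Base:
    pass

before = len(Base.__subclasses__())

# Defining a subclass writes into Base's internal subclass registry,
# i.e. it mutates the (supposedly shared, immutable) base type.
class Sub(Base):
    pass

after = len(Base.__subclasses__())
print(after - before)  # 1
```

When Base is a statically allocated type shared across isolated sub-interpreters, that registry write happens without the protection of a common GIL, which is the unsafe mutation described above.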

Unfortunately I don't think we can fix this in Jep 4.2, mostly because it is going to be a pretty big change, but also because we don't want to drop support for older Python versions in a point release. We use the Python buffer protocol in subclasses of PyJObject, and before Python 3.8 classes using the buffer protocol had to be statically defined. On the dev_4.3 branch I've already made changes so the buffer type is allocated on the heap, but I was not planning on changing the way PyJObject and PyJClass are allocated. Your discovery definitely puts changing the way those are allocated on the agenda for the 4.3 release.

If you want to experiment with isolated interpreters, I found that setting PYTHONMALLOC to a different allocator prevented the crash for me. I think there is still a possibility of problems with other allocators, so I do not recommend doing this in production, but if you want to test how isolated interpreters perform it might give you some idea.
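For example, something along these lines (the java invocation is illustrative, not from this issue; `malloc` is one of the allocator values documented for PYTHONMALLOC):

```shell
# Switch Python's allocator away from pymalloc before launching the JVM
# that embeds Jep ("malloc" selects the raw libc allocator).
export PYTHONMALLOC=malloc
# Hypothetical launch command -- substitute your real classpath/main class:
# java -cp app.jar:jep.jar com.example.Main
python3 -c 'print("ok")'   # sanity check that Python still starts
```

The environment variable must be set before the Python runtime is initialized inside the JVM, so it has to come from the process environment rather than from Python code.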

@novos40
Author

novos40 commented Jan 24, 2025

Thanks a lot for a quick response on this!
I'm certainly willing to try whatever you can give me. One of the main goals of bringing python into java ecosystem for us is real multithreading. I understand all this is very new and unstable, but this is too important for us. Could you please provide instructions for me how I can do this? Can I just take dev_4.3 and try or do I need to tweak something? I'm a bit rusty on C/C++ but I can figure things out provided with some guidance. We already have a potential client lined up for this capability so I'd like to make it work ASAP even if it's not a production quality yet. In any case it will take some time for them to evaluate and setup their dev process for that. By that time, hopefully, we can have a permanent fix.

From your comment about the dict crashing when it needs to grow, here are some quick-fix ideas:

  1. Can you just pre-allocate the dict to some large enough capacity so it does not need to grow? The initial size could be a parameter provided via a variable or something. Clearly temporary and not pretty, but it allows us to move forward before a permanent fix is in place.
  2. Can you just move the registry to a [concurrent] Java map? This is what I usually do: whenever I have problems with Python code limitations, I move the functionality to Java and simply provide a Python wrapper. Works like a charm. Practically all shared objects in our system are Java objects wrapped in Python interfaces, including the dict API. I find Python implementations to be very capricious most of the time, so I just get rid of them. Nobody has to know the implementation details of some internal data structure :-)

Please let me know if I can help in any way.

P.S. It just occurred to me: can I fix the problem by first creating a shared interpreter that does all the initialization (i.e. lets the dict grow to the correct size) and then working with sub-interpreters?

@novos40
Author

novos40 commented Jan 24, 2025

I just tried creating a shared interpreter first and then working with sub-interpreters. It works! :-) Kinda :-(

  • Good: I got the expected 100% CPU utilization and a corresponding 8x reduction in test run time
  • Bad: After a few consecutive runs of the same test the JVM still died from an access violation, but this time with a log file (see attached)

hs_err_pid2160.log

Can you please take a look and tell me what else I can do to make it work?
Thanks

P.S. The first log might not be totally useful because I was using VisualVM, which instruments the code and potentially changes classes.
Here are a couple more logs:

hs_err_pid8900.log
hs_err_pid15656.log

I'm not an expert, but it seems both of them are trying to write to a null pointer (freed memory?). The last one took about 10 test runs before dying. A few seconds' pause between runs seems to help keep it running, so it might be related to GC, I guess.

@bsteffensmeier
Member

Could you please provide instructions for me how I can do this? Can I just take dev_4.3 and try or do I need to tweak something? I'm a bit rusty on C/C++ but I can figure things out provided with some guidance.

The changes currently on the dev_4.3 branch are only a small step in the right direction. In Jep 4.2 I think there are 4 statically allocated types that need to be moved to heap-allocated types to be safe in sub-interpreters: PyJObject, PyJClass, PyJArray, and PyJBuffer. The existing changes handle PyJBuffer, leaving the other 3. Unfortunately PyJBuffer is by far the easiest of the 4, since that type is only referenced during initialization. Since the other 3 are referenced more often, we will have to save off the types in an interpreter-specific data structure. I am still trying to understand what needs to change myself, but from what I have seen I think PEP-630 describes all the things we need to do.

  1. Can you just pre-allocate dict size to some large enough capacity so it does not need to grow? Initial size can be a parameter provided via a variable or something. Clearly a temporary and not pretty, but allows to move forward before permanent fix is in place

I am not aware of any API for pre-sizing dicts. Also, the crash while resizing is just a symptom of a larger problem: it is not safe to concurrently access a dict at all, and even if pre-sizing prevented the crashes, it would not prevent interpreters from concurrently modifying the dict and potentially creating invalid state. I suspect this is why you still see crashes in your tests that use a shared interpreter.
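As background on the crash-on-resize behavior: CPython dicts reallocate their backing table when they cross a fill threshold, which you can observe indirectly from pure Python via sys.getsizeof (a sketch; the exact growth points are an implementation detail that varies between versions):

```python
import sys

d = {}
size = sys.getsizeof(d)
resizes = 0
for i in range(1000):
    d[i] = i
    new_size = sys.getsizeof(d)
    if new_size != size:  # the backing table was reallocated (grown)
        resizes += 1
        size = new_size
print(resizes > 0)  # True: the dict resized several times on the way to 1000 keys
```

A resize rewrites the whole table, so a reader in another interpreter can observe a half-built table; that is the window in which the crash occurs, and pre-sizing would only shrink the window, not close it.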

  2. Can you just move the registry to [concurrent] java map? This is what I usually do. Whenever I have problems with python code limitations, I move the functionality to java and simply provide python wrapper. Works like a charm. Practically all shared object in our system are java objects wrapped in python interfaces including dict API. I find python implementations to be very capricious most of the time so I just get rid of them. Nobody has to know implementation details of some internal data structure :-)

The problem is occurring in the tp_subclasses dict, which is in the CPython code itself. It is marked as internal to CPython; Jep does not allocate, modify, or even access it. You would have to change the CPython code to use anything other than a dict.

P.S. It's just occurred to me: Can I fix the problem by first creating a shared interpreter with all initializations (i.e. allow dict to grow to correct size) and then work with sub-interpreters?

Every sub-interpreter creates a hierarchy of Python types that mirrors the Java class hierarchy of any Java class used from Python. Right now the java.lang.Object type is shared by all sub-interpreters, and when sub-interpreters create their own types for subclasses of java.lang.Object, everything breaks. If you could create the type hierarchy beforehand for every Java class that will ever be used from Python, and ensure all of those types are immutable, then those types could be shared between sub-interpreters and the problems should go away. I don't think that is a very practical solution for most use cases, but it might be something you could get working if you happen to know every Java class needed in Python beforehand.

@bsteffensmeier
Member

@novos40 Since you have mentioned on other issues that you are using shared modules, and also that compatibility with other Python modules is important to you, I want to make sure you are aware that there are no plans to support shared modules in isolated sub-interpreters. While I can't speak authoritatively for the entire Python ecosystem, I suspect that a vast majority of Python modules that include native code will not work with isolated sub-interpreters. I know that numpy is a popular extension module that has been hesitant to support sub-interpreters in the past, so I was curious whether they are moving to support isolated sub-interpreters. I found this issue saying they do not currently support it and are not actively working on it, and also this post, which shares my opinion that most extension modules do not work with isolated sub-interpreters.
