[SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode #28968
Just a question: do we need |
I think that's what people usually do, particularly on the ML side. I think it's better to classify it more explicitly.
Oh, got it. It seems that I misunderstood the standard. Thanks. :)
Retest this please.
@dongjoon-hyun no problem :-). I think this is more a matter of preference. If it causes any problem or confusion, I will change my approach.
retest this please
Test build #124839 has finished for PR 28968 at commit
gentle ping for a review :-).
…ssue in pinned thread mode
Test build #125802 has finished for PR 28968 at commit
retest this please
Test build #126222 has finished for PR 28968 at commit
retest this please
Test build #126231 has finished for PR 28968 at commit
retest this please
Test build #126528 has finished for PR 28968 at commit
retest this please
Test build #126640 has finished for PR 28968 at commit
Synced with @HyukjinKwon offline. LGTM except for one concern: But fixing it in py4j seems difficult, because py4j does not know which thread is about to be GCed unless the thread proactively notifies py4j.
Thanks @WeichenXu123. I will leave it open for a few more days before merging it.
Merged to master.
each thread with its own local properties. To work around this, you should manually copy and set the
local properties from the parent thread to the child thread when you create another thread in PVM.
to `true`. This pinned thread mode allows one PVM thread has one corresponding JVM thread. With this mode,
`pyspark.InheritableThread` is recommanded to use together for a PVM thread to inherit the interitable attributes |
typo: interitable -> inheritable
if isinstance(sc._gateway, ClientServer):
    # Here's when the pinned-thread mode (PYSPARK_PIN_THREAD) is on.
    properties = sc._jsc.sc().getLocalProperties().clone()
Why do we need to clone? Doesn't sc.localProperties get cloned in childValue already?
Actually, we're mimicking that behaviour here because the thread in the JVM does not respect the inheritance: the thread is always created separately via the JVM gateway, whereas on the Scala/Java side we can keep the inheritance by creating a thread within a thread.
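To illustrate the idea of mimicking the inheritance by hand, here is a minimal pure-Python sketch. It does not use Spark or Py4J, and it is not the code in this PR: `set_local_property`, `get_local_property` and `InheritingThread` are hypothetical names, and a `threading.local` dictionary stands in for the JVM-side local properties. The point it shows is that the snapshot is cloned in the parent thread at construction time and installed in the child thread before the target runs.

```python
import copy
import threading

# Hypothetical stand-in for per-thread JVM local properties (not a Spark API).
_local = threading.local()


def set_local_property(key, value):
    """Set a property that is visible only to the calling thread."""
    props = getattr(_local, "props", None)
    if props is None:
        props = _local.props = {}
    props[key] = value


def get_local_property(key):
    """Return the calling thread's value for key, or None if it is unset."""
    return getattr(_local, "props", {}).get(key)


class InheritingThread(threading.Thread):
    """Toy thread that inherits a copy of its creator's properties."""

    def __init__(self, target, args=(), kwargs=None):
        # Runs in the parent thread: clone the properties so that later
        # changes in the parent are not visible to the child.
        inherited = copy.deepcopy(getattr(_local, "props", {}))

        def wrapped(*a, **kw):
            # Runs in the child thread: install the snapshot first.
            _local.props = inherited
            return target(*a, **kw)

        super().__init__(target=wrapped, args=args, kwargs=kwargs or {})
```

With this sketch, a plain `threading.Thread` started from the same parent sees no properties at all, while an `InheritingThread` sees the values the parent had when the thread object was created.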
I found I missed this and looked at it now. LGTM. I'm just wondering if we should use
>>> from threading import Thread
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> spark._jvm._gateway_client.deque
deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at 0x11a015358>, <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>])
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> spark._jvm._gateway_client.deque
deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at 0x11a015358>, <py4j.clientserver.ClientServerConnection object at 0x119fc08d0>, <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>])
Oh yeah we should use
Thank you for taking a look @viirya.
What changes were proposed in this pull request?
This PR proposes:
To introduce the InheritableThread class, which works identically to threading.Thread but can inherit the inheritable attributes of a JVM thread, such as InheritableThreadLocal.
This was a problem in pinned thread mode; see also [SPARK-22340][PYTHON] Add a mode to pin Python thread into JVM's #24898. Now it works as below:
Also, it adds the resource leak fix into InheritableThread. Py4J leaks the thread and does not close the connection from Python to JVM. InheritableThread manually closes the connections when PVM garbage collection happens, so JVM threads finish safely. I manually verified by profiling, but there's also another easy way to verify:
This issue is fixed now.
Because now we have a fix for the issue here, it also proposes to deprecate collectWithJobGroup, which was a temporary workaround added to avoid this leak issue.
Why are the changes needed?
To support pinned thread mode properly, without a resource leak and with properly inheritable local properties.
Does this PR introduce any user-facing change?
Yes, it adds an API, the InheritableThread class, for pinned thread mode.
How was this patch tested?
Manually tested as described above, and a unit test was added as well.