[SPARK-22340][PYTHON] Add a mode to pin Python thread into JVM's #24898

HyukjinKwon · 2019-06-18T04:21:16Z

What changes were proposed in this pull request?

This PR proposes adding a pinned thread mode (Py4J's single threading model), an experimental mode that syncs threads between the PVM and the JVM. See https://www.py4j.org/advanced_topics.html#using-single-threading-model-pinned-thread

Multi threading model

Currently, PySpark uses this model. Threads on the PVM and the JVM are independent. For instance, callbacks are received and the relevant Python code is executed in a different Python thread. JVM threads are reused when possible.

Py4J will create a new thread every time a command is received and there is no thread available. See the current model we're using - https://www.py4j.org/advanced_topics.html#the-multi-threading-model

One problem with this model is that we can't sync threads between the PVM and the JVM out of the box. This causes problems, in particular in thread-related code on the JVM side. See:

spark/core/src/main/scala/org/apache/spark/SparkContext.scala

Line 334 in 7056e00

protected[spark] val localProperties = new InheritableThreadLocal[Properties] {

Because JVM threads are reused, it seems job groups cannot be set per Python thread, as described in the JIRA.
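The effect of thread reuse on thread-local state can be illustrated in plain Python. This is only an analogy, not PySpark or Py4J code: `_local` stands in for `localProperties`, a single-worker `ThreadPoolExecutor` stands in for a reused JVM thread, and the helper names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Stands in for the JVM-side thread-local properties (assumption for illustration).
_local = threading.local()


def set_job_group(name):
    """Store a job group in the state of whichever thread runs this call."""
    _local.job_group = name


def get_job_group():
    """Read the job group from the state of whichever thread runs this call."""
    return getattr(_local, "job_group", None)


def demo_thread_reuse():
    # One worker thread is reused for every submitted call, similar to a
    # reused JVM thread serving commands from different Python threads.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pool.submit(set_job_group, "group-A").result()   # "caller A" sets a group
        seen_by_b = pool.submit(get_job_group).result()  # "caller B" only reads
    return seen_by_b
```

Here `demo_thread_reuse()` returns `"group-A"`: the second caller observes state set by the first, because the state belongs to the reused worker thread rather than to the caller.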

Single threading model design (pinned thread model)

This mode pins and syncs the threads on the PVM and the JVM to work around the problem above. For instance, callbacks are received and the relevant Python code is executed in the same Python thread. See https://www.py4j.org/advanced_topics.html#the-single-threading-model

Although this mode syncs threads between the PVM and the JVM for other thread-related code paths, it might cause another problem: local properties do not seem to be inherited, as shown below. (Assuming multi-thread mode still creates new threads when existing threads are busy, I suspect this issue already exists when multiple jobs are submitted in multi-thread mode; in single threading mode, however, it can always be seen.)

$ PYSPARK_PIN_THREAD=true ./bin/pyspark

import threading

spark.sparkContext.setLocalProperty("a", "hi")
def print_prop():
    print(spark.sparkContext.getLocalProperty("a"))

threading.Thread(target=print_prop).start()

None

Unlike on the Scala side:

spark.sparkContext.setLocalProperty("a", "hi")
new Thread(new Runnable {
  def run() = println(spark.sparkContext.getLocalProperty("a"))
}).start()

hi

This behaviour could potentially cause weird issues, but this PR does not target fixing it for now since this mode is experimental.
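Until inheritance works, one possible workaround is to copy the needed properties into the child thread explicitly. The sketch below is not part of this PR: `start_thread_with_properties` and `FakeContext` are hypothetical names, and `FakeContext` only imitates the non-inheriting behaviour shown above so the example runs without Spark. With a real `SparkContext`, only the `setLocalProperty` and `getLocalProperty` calls would be used.

```python
import threading


def start_thread_with_properties(sc, props, target):
    """Start `target` in a new thread after setting `props` inside that thread."""
    def runner():
        # Re-apply the properties in the child thread, since they are
        # not inherited from the parent thread.
        for key, value in props.items():
            sc.setLocalProperty(key, value)
        target()

    thread = threading.Thread(target=runner)
    thread.start()
    return thread


class FakeContext:
    """Hypothetical stand-in for SparkContext with per-thread properties."""

    def __init__(self):
        self._local = threading.local()

    def setLocalProperty(self, key, value):
        if not hasattr(self._local, "props"):
            self._local.props = {}
        self._local.props[key] = value

    def getLocalProperty(self, key):
        return getattr(self._local, "props", {}).get(key)
```

The caller passes the properties it wants the child to see, for example `start_thread_with_properties(sc, {"a": "hi"}, print_prop)`.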

How does this PR fix?

Basically, there are two types of Py4J servers: GatewayServer and ClientServer. The former is for multi threading and the latter is for single threading. This PR adds a switch to use the latter.

On the Scala side:
The logic to select a server is encapsulated in Py4JServer, which is used by PythonRunner for Spark submit and by PythonGatewayServer for the Spark shell. Each uses ClientServer when PYSPARK_PIN_THREAD is true and GatewayServer otherwise.

On the Python side:
Simply do an if-else to switch the server to talk to. It uses ClientServer when PYSPARK_PIN_THREAD is true and GatewayServer otherwise.

This is disabled by default for now.
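The switch described above can be sketched roughly as follows. This is an illustration rather than the actual code in `java_gateway.py`: the function names are hypothetical, and the Py4J class names are returned as strings so the example runs without Py4J installed. The environment check mirrors the one used in the tests of this PR.

```python
import os


def pinned_thread_enabled(environ=None):
    """Return True when PYSPARK_PIN_THREAD is set to "true" (case-insensitive)."""
    env = os.environ if environ is None else environ
    return env.get("PYSPARK_PIN_THREAD", "false").lower() == "true"


def select_py4j_server(environ=None):
    """Pick the Py4J server type by name; pinned thread mode is off by default."""
    if pinned_thread_enabled(environ):
        return "ClientServer"   # single threading (pinned thread) model
    return "GatewayServer"      # multi threading model


print(select_py4j_server({"PYSPARK_PIN_THREAD": "true"}))
```

Passing a dictionary instead of reading `os.environ` directly is only a convenience for trying both branches.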

How was this patch tested?

Manually tested. This can be tested via:

PYSPARK_PIN_THREAD=true ./bin/pyspark

and/or

cd python
./run-tests --python-executables=python --testnames "pyspark.tests.test_pin_thread"

Also, ran the Jenkins tests with PYSPARK_PIN_THREAD enabled.

core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala

SparkQA · 2019-11-01T05:44:02Z

Test build #113066 has finished for PR 24898 at commit f72a38d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-11-04T05:39:46Z

retest this please

SparkQA · 2019-11-04T08:05:02Z

Test build #113191 has finished for PR 24898 at commit f72a38d.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-11-04T09:33:55Z

retest this please

SparkQA · 2019-11-04T12:16:03Z

Test build #113203 has finished for PR 24898 at commit f72a38d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

squito

lgtm other than some very minor things

docs/job-scheduling.md

python/pyspark/context.py

python/pyspark/java_gateway.py

squito · 2019-11-06T19:58:23Z

python/pyspark/tests/test_context.py

+                # When thread is pinned, job group should be set for each thread for now.
+                # Local properties seem not being inherited like Scala side does.
+                if os.environ.get("PYSPARK_PIN_THREAD", "false").lower() == "true":
+                    sc.setJobGroup('test_progress_api', '', True)


actually, this test probably isn't reliable outside of pinned mode, right? the java side could arbitrarily decide to switch threads at any point.

anyway, just something to keep in mind if we notice flakiness in this test in the future.

yeah .. I think so, though, at least this test hasn't been detected as a flaky test yet. I was actually thinking of removing this test out even but .. let me leave this out of this PR scope for now.

squito · 2019-11-06T20:11:51Z

python/pyspark/tests/test_pin_thread.py

+                is_job_cancelled[index] = False
+            except Exception:
+                # Assume that exception means job cancellation.
+                is_job_cancelled[index] = True


I have always been confused about the guarantees of python around mutating a variable like this from multiple threads -- I can't find anything which makes it clear that this mutation is visible to other threads. The section on the GIL says they'll be atomic (https://docs.python.org/3/faq/library.html#what-kinds-of-global-value-mutation-are-thread-safe) but that isn't quite the same.

I guess this is OK? Again, something to be aware of if we see flakiness.

Ah, yeah, in my experience such a pattern is considered safe. I think D[x] = y covers this case .. ? I think it's fine anyway.

BTW, there's the dis package to check Python's opcodes (e.g., import dis; func = lambda: 1 + 1; dis.dis(func)). It seems assignment is a single atomic instruction in Python, so it looks fine.

that link above says it'll be atomic, but that's not exactly the same as knowing the change is visible -- there will be some per-core cache which isn't always flushed. Or, at least, that's how it is in lower-level languages, but maybe it really is visible in python? I guess it must be (or flushed every time the GIL changes threads); otherwise this would have to be discussed somewhere in python docs

Ah, I see. Yeah, we were talking about visibility. I think it must be ...
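A minimal sketch of the pattern under discussion, assuming CPython: each thread writes only its own slot of a shared list, and the main thread reads the list only after `join()` has returned for every worker, at which point each worker has finished running. The function name and the simulated failure are hypothetical and not taken from the test file.

```python
import threading


def run_and_record(num_threads=4):
    """Run workers that each record an outcome in their own list slot."""
    is_job_cancelled = [None] * num_threads

    def work(index):
        try:
            if index % 2 == 1:
                # Simulate a cancelled job for odd indices.
                raise RuntimeError("simulated cancellation")
            is_job_cancelled[index] = False
        except Exception:
            # Assume that an exception means job cancellation.
            is_job_cancelled[index] = True

    threads = [threading.Thread(target=work, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        # Wait for every worker before reading the shared list.
        t.join()
    return is_job_cancelled
```

With the default of four threads this returns `[False, True, False, True]`. Reading the list before joining would be the case where the visibility question above actually matters.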

HyukjinKwon · 2019-11-07T10:46:22Z

Thanks @squito for the thorough review. @WeichenXu123 do you have any comments on this? Otherwise, it looks like we're good to go.

SparkQA · 2019-11-07T12:56:35Z

Test build #113380 has finished for PR 24898 at commit 9e2d832.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-11-07T21:43:09Z

I think we will go in this direction .. I am merging this given the sign-off and i'm pretty confident of this change.

But still let me know guys here if you have any concern or issue. We can still consider reverting this and going to another direction if we find that's better.

HyukjinKwon · 2019-11-07T21:44:10Z

This will actually fix many potential issues.

cc @brkyvz FYI since we talked about threads in PySpark before.

HyukjinKwon · 2019-11-07T21:44:15Z

Merged to master.

…s and fixing a thread leak issue in pinned thread mode

### What changes were proposed in this pull request?

This PR proposes:

1. To introduce `InheritableThread` class, that works identically with `threading.Thread` but it can inherit the inheritable attributes of a JVM thread such as `InheritableThreadLocal`. This was a problem from the pinned thread mode, see also #24898. Now it works as below:

```python
import pyspark

spark.sparkContext.setLocalProperty("a", "hi")
def print_prop():
    print(spark.sparkContext.getLocalProperty("a"))

pyspark.InheritableThread(target=print_prop).start()
```

```
hi
```

2. Also, it adds the resource leak fix into `InheritableThread`. Py4J leaks the thread and does not close the connection from Python to JVM. In `InheritableThread`, it manually closes the connections when PVM garbage collection happens. So, JVM threads finish safely. I manually verified by profiling but there's also another easy way to verify:

```bash
PYSPARK_PIN_THREAD=true ./bin/pyspark
```

```python
>>> from threading import Thread
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> spark._jvm._gateway_client.deque
deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>,
       <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>,
       <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>,
       <py4j.clientserver.ClientServerConnection object at 0x11a015358>,
       <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>])
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> spark._jvm._gateway_client.deque
deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>,
       <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>,
       <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>,
       <py4j.clientserver.ClientServerConnection object at 0x11a015358>,
       <py4j.clientserver.ClientServerConnection object at 0x119fc08d0>,
       <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>])
```

This issue is fixed now.

3. Because now we have a fix for the issue here, it also proposes to deprecate `collectWithJobGroup` which was a temporary workaround added to avoid this leak issue.

### Why are the changes needed?

To support pinned thread mode properly without a resource leak, and a proper inheritable local properties.

### Does this PR introduce _any_ user-facing change?

Yes, it adds an API `InheritableThread` class for pinned thread mode.

### How was this patch tested?

Manually tested as described above, and unit test was added as well.

Closes #28968 from HyukjinKwon/SPARK-32010.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

### What changes were proposed in this pull request?

PySpark added pinned thread mode at #24898 to sync Python thread to JVM thread. Previously, one JVM thread could be reused which ends up with messed inheritance hierarchy such as thread local especially when multiple jobs run in parallel. To completely fix this, we should enable this mode by default.

### Why are the changes needed?

To correctly support parallel job submission and management.

### Does this PR introduce _any_ user-facing change?

Yes, now Python thread is mapped to JVM thread one to one.

### How was this patch tested?

Existing tests should cover it.

Closes #32429 from HyukjinKwon/SPARK-35303.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

…d mode (#471)

## What is the problem?

Currently, there is a resource leak when using the pinned thread mode (see also apache/spark#24898). For example, if you repeat the code below multiple times to create Py4J connections in multiple threads,

```python
# PySpark application
import threading

def print_prop():
    # Py4J connection is used under the hood.
    print(spark.sparkContext.getLocalProperty("a"))

threading.Thread(target=print_prop).start()
```

the number of leftover connections grows:

```python
spark._jvm._gateway_client.deque
deque([<py4j.clientserver.ClientServerConnection object at 0x7fdc60170940>,
       <py4j.clientserver.ClientServerConnection object at 0x7fdca011e760>,
       <py4j.clientserver.ClientServerConnection object at 0x7fdcb01acdc0>,
       <py4j.clientserver.ClientServerConnection object at 0x7fdc60170100>,
       <py4j.clientserver.ClientServerConnection object at 0x7fdcb0232d30>])
```

In an environment where multiple threads are used without a pool, it easily causes "Too many files open" due to the lack of file descriptors (as they are all occupied by unclosed sockets).

## How do you fix?

This PR adds another variable to thread local that cleans up the connection right before the thread is finished. We need it as a separate thread local because `connection` is NOT cleaned up, since the reference is being held at `JavaClient.deque`. See also 50fe45e for more details.

HyukjinKwon force-pushed the pinned-thread branch from ed56a7f to 0948598 Compare June 18, 2019 04:26

HyukjinKwon force-pushed the pinned-thread branch from 0948598 to a546fc9 Compare June 18, 2019 04:35

dongjoon-hyun added the PYSPARK label Jun 18, 2019

HyukjinKwon force-pushed the pinned-thread branch from a546fc9 to 44a1f10 Compare June 18, 2019 06:54

HyukjinKwon force-pushed the pinned-thread branch 2 times, most recently from 863eb58 to 201eb4a Compare June 19, 2019 06:09

HyukjinKwon changed the title ~~[WIP][SPARK-22340][PYTHON] Add a mode to pin Python thread into JVM's~~ [SPARK-22340][PYTHON] Add a mode to pin Python thread into JVM's Jun 19, 2019

HyukjinKwon force-pushed the pinned-thread branch from 201eb4a to 8ac95ab Compare June 19, 2019 06:20

HyukjinKwon force-pushed the pinned-thread branch from fb6802a to f72a38d Compare November 1, 2019 03:25

WeichenXu123 mentioned this pull request Nov 4, 2019

[WIP] Provide spark-based parallel backend for joblib joblib/joblib#956

Closed

squito approved these changes Nov 6, 2019

HyukjinKwon added 6 commits November 7, 2019 19:09

Add a mode to pin Python thread into JVM's

4253ddb

Add warnings

f9e4f22

nit

14ee98e

Address comments

9b7bb0d

Address comments

97fa953

Address comments

9e2d832

HyukjinKwon force-pushed the pinned-thread branch from f72a38d to 9e2d832 Compare November 7, 2019 10:32

HyukjinKwon closed this in 4ec04e5 Nov 7, 2019

S-C-H mentioned this pull request Jan 20, 2020

Python REPL connection issues Ibotta/sk-dist#35

Closed

HyukjinKwon deleted the pinned-thread branch March 3, 2020 01:17

HyukjinKwon mentioned this pull request Jul 1, 2020

[SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode #28968

Closed

HyukjinKwon mentioned this pull request May 4, 2021

[SPARK-35303][PYTHON] Enable pinned thread mode by default #32429

Closed

HyukjinKwon mentioned this pull request Mar 7, 2022

Clean up the leftover connection for finished threads in pinned thread mode py4j/py4j#471

Merged
