-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-28153][PYTHON] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF) #24958
Conversation
cc @cloud-fan, @viirya, @brkyvz |
BTW, this seems a long standing issue. I could reproduce this at 2.3, 2.4. |
Test build #106862 has finished for PR 24958 at commit
|
Test build #106878 has finished for PR 24958 at commit
|
Test build #106884 has finished for PR 24958 at commit
|
Am I right that the fix here isn't so much the 'atomic'-ness but just the fact that there's a container object whose referent can be updated? like you could use a 1-element array too? that's all fine either way, just for my understanding. |
Ah, virtually yes. But using atomic at least will guard the safety between child and the parent thread. For instance, the case an UDF launches another thread |
While this fix seems fine, I'm worried it could actually lead to correctness issues. Py4j may reuse the same threads for different tasks, therefore I'm worried that we could return the wrong reference. A more robust solution would be:
What do you think? |
Test build #106897 has finished for PR 24958 at commit
|
I think Py4J is only used at driver side and we're safe about this concern.
and
So they are set and get at executor's side. Even if there are some spots I missed, Py4j reuses the same threads for different tasks but the job execution call happens one at one time due to GIL and Py4J launches another thread if one thread is busy on JVM. So, it won't happen that one JVM thread somehow launches multiple jobs at the same time and same thread. Moreover, I opened a PR to pin thread between PVM and JVM - #24898 which might be more correct behaviour (?). If we could switch the mode, it can permanently get rid of this concern. |
Cc @jose-torres and @jiangxb1987 as well. I found you guys worked in a related issue before. |
gentle ping .. :-) .. |
Retest this please. |
Test build #107269 has finished for PR 24958 at commit
|
While I'm reading the discussion here, I can hardly understand how we implement |
Thing is, it might be easy to understand but not neccessary from my understanding. It's not only |
retest this please |
Test build #107287 has finished for PR 24958 at commit
|
@HyukjinKwon I can guarantee you, with this solution we may hit a separate corner case. We should avoid using thread locals (especially inheritable thread locals) as much as possible. |
@brkyvz and @cloud-fan, this PR targets to fix SPARK-28153 with a minimised change. If the suggestions can fix other corner cases I am not aware of commonly, we should fix everywhere else that uses thread local like that, for instance, spark/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochTracker.scala Lines 26 to 31 in 5264164
Currently I am not sure which case the new suggestion will fix and cover. I would like to avoid to refactor it here to fix a bug due to a vague concern but without knowing which other corner cases we fix. Can we target it in a different ticket if this is correct, and if you guys think new suggestion should be done? |
Since Spark already uses thread local in many places, I'm OK with this surgical fix. But I do like the proposal from @brkyvz . Can we create a new JIRA ticket for it, and add a TODO somewhere? Then I think this PR is ready to go. |
We didn't get it right in SPARK-27711, which is the reason I would prefer a complete and robust solution this time. I'm just afraid more things would prop up. If you would like to merge this, and delete it in a month, that's fine with me, but I don't see the value in rushing this fix. |
This PR targets master/2.4/2.3, and I think it's safer to only implement the new idea in master. What do you think @brkyvz ? |
@brkyvz, we're not rushing - we're not ignoring any issue or holes actually found or merging it without discussion. Also, using thread local isn't a horrible way although it might be less preferred case by case - we can avoid to have one place that multiple tasks access but run them in parallel separately. I get the suggestion makes sense too but adding new way isn't necessarily safe. There are always new holes that can pop up. One conservative way is usually to keep the codes with less changes (you know for instance bug compatibility). In addition, it's rather a general design issue not specific only to this code path. For instance, the codes below: spark/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/EpochTracker.scala Lines 26 to 31 in 5264164
have the almost similar issue as SPARK-28153 - the parent thread updates the current epoch but the child thread (Python write thread) cannot read it. Ideally we should identify where to fix as well. |
Test build #108027 has finished for PR 24958 at commit
|
retest this please |
cc @zsxwing as well. Okie, then shall we proceed in this PR and discuss about the alternative later separately? |
Test build #108424 has finished for PR 24958 at commit
|
After reading the PR description closely, seems we only need to initialize |
We don't need actually. I just used that to match it to SS side - #24946. It won't affect the perf I guess and it's safer anyway because there are multiple threads that access to the variable. |
FYI @srowen pointed it out too #24958 (comment) . We can use 1-length array too, for instance. |
makes sense, merging to master! |
Actually I failed to do the backport due to bad network. @HyukjinKwon can you do the backport yourself? thanks! |
I'll do that, @cloud-fan . |
Ur, |
Hi, @HyukjinKwon , @cloud-fan , @brkyvz . It's irrelevant to this PR, but PR build seems to skip many Python UTs. Is it safe to proceed the python PRs like this?
|
Oh, never mind. It seems to be covered by
|
Most of skipped tests are because Python 2 doesn't have the required pandas and pyarrow installed in our Jenkins.I talked with Shane before but seems he's pretty busy. For this test case, it's safe to backport. Let me open a PR to backport today. Due to minimal version difference, we cannot simply backport pandas harrow related PRs anyway. |
… support input_file_name with Python UDF) This PR proposes to use `AtomicReference` so that parent and child threads can access to the same file block holder. Python UDF expressions are turned to a plan and then it launches a separate thread to consume the input iterator. In the separate child thread, the iterator sets `InputFileBlockHolder.set` before the parent does which the parent thread is unable to read later. 1. In this separate child thread, if it happens to call `InputFileBlockHolder.set` first without initialization of the parent's thread local (which is done when the `ThreadLocal.get()` is first called), the child thread seems calling its own `initialValue` to initialize. 2. After that, the parent calls its own `initialValue` to initializes at the first call of `ThreadLocal.get()`. 3. Both now have two different references. Updating at child isn't reflected to parent. This PR fixes it via initializing parent's thread local with `AtomicReference` for file status so that they can be used in each task, and children thread's update is reflected. I also tried to explain this a bit more at apache#24958 (comment). Manually tested and unittest was added. Closes apache#24958 from HyukjinKwon/SPARK-28153. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
opened #25321 |
…ckHolder (to support input_file_name with Python UDF) ## What changes were proposed in this pull request? This PR backports #24958 to branch-2.4. This PR proposes to use `AtomicReference` so that parent and child threads can access to the same file block holder. Python UDF expressions are turned to a plan and then it launches a separate thread to consume the input iterator. In the separate child thread, the iterator sets `InputFileBlockHolder.set` before the parent does which the parent thread is unable to read later. 1. In this separate child thread, if it happens to call `InputFileBlockHolder.set` first without initialization of the parent's thread local (which is done when the `ThreadLocal.get()` is first called), the child thread seems calling its own `initialValue` to initialize. 2. After that, the parent calls its own `initialValue` to initializes at the first call of `ThreadLocal.get()`. 3. Both now have two different references. Updating at child isn't reflected to parent. This PR fixes it via initializing parent's thread local with `AtomicReference` for file status so that they can be used in each task, and children thread's update is reflected. I also tried to explain this a bit more at #24958 (comment). ## How was this patch tested? Manually tested and unittest was added. Closes #25321 from HyukjinKwon/backport-SPARK-28153. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…ckHolder (to support input_file_name with Python UDF) ## What changes were proposed in this pull request? This PR backports apache#24958 to branch-2.4. This PR proposes to use `AtomicReference` so that parent and child threads can access to the same file block holder. Python UDF expressions are turned to a plan and then it launches a separate thread to consume the input iterator. In the separate child thread, the iterator sets `InputFileBlockHolder.set` before the parent does which the parent thread is unable to read later. 1. In this separate child thread, if it happens to call `InputFileBlockHolder.set` first without initialization of the parent's thread local (which is done when the `ThreadLocal.get()` is first called), the child thread seems calling its own `initialValue` to initialize. 2. After that, the parent calls its own `initialValue` to initializes at the first call of `ThreadLocal.get()`. 3. Both now have two different references. Updating at child isn't reflected to parent. This PR fixes it via initializing parent's thread local with `AtomicReference` for file status so that they can be used in each task, and children thread's update is reflected. I also tried to explain this a bit more at apache#24958 (comment). ## How was this patch tested? Manually tested and unittest was added. Closes apache#25321 from HyukjinKwon/backport-SPARK-28153. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…ckHolder (to support input_file_name with Python UDF) ## What changes were proposed in this pull request? This PR backports apache#24958 to branch-2.4. This PR proposes to use `AtomicReference` so that parent and child threads can access to the same file block holder. Python UDF expressions are turned to a plan and then it launches a separate thread to consume the input iterator. In the separate child thread, the iterator sets `InputFileBlockHolder.set` before the parent does which the parent thread is unable to read later. 1. In this separate child thread, if it happens to call `InputFileBlockHolder.set` first without initialization of the parent's thread local (which is done when the `ThreadLocal.get()` is first called), the child thread seems calling its own `initialValue` to initialize. 2. After that, the parent calls its own `initialValue` to initializes at the first call of `ThreadLocal.get()`. 3. Both now have two different references. Updating at child isn't reflected to parent. This PR fixes it via initializing parent's thread local with `AtomicReference` for file status so that they can be used in each task, and children thread's update is reflected. I also tried to explain this a bit more at apache#24958 (comment). ## How was this patch tested? Manually tested and unittest was added. Closes apache#25321 from HyukjinKwon/backport-SPARK-28153. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
What changes were proposed in this pull request?
This PR proposes to use
AtomicReference
so that parent and child threads can access to the same file block holder.Python UDF expressions are turned to a plan and then it launches a separate thread to consume the input iterator. In the separate child thread, the iterator sets
InputFileBlockHolder.set
before the parent does which the parent thread is unable to read later.In this separate child thread, if it happens to call
InputFileBlockHolder.set
first without initialization of the parent's thread local (which is done when theThreadLocal.get()
is first called), the child thread seems calling its owninitialValue
to initialize.After that, the parent calls its own
initialValue
to initializes at the first call ofThreadLocal.get()
.Both now have two different references. Updating at child isn't reflected to parent.
This PR fixes it via initializing parent's thread local with
AtomicReference
for file status so that they can be used in each task, and children thread's update is reflected.I also tried to explain this a bit more at #24958 (comment).
How was this patch tested?
Manually tested and unittest was added.