
[SPARK-27812][K8S][2.4] Bump K8S client version to 4.6.1 #26152

Closed


igorcalabria
Contributor

What changes were proposed in this pull request?

Backport of #26093 to branch-2.4

Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

We need the fix fabric8io/kubernetes-client#1768, which was released in version 4.6 of the client. The root cause of the problem is explained in more detail in #25785.

Does this PR introduce any user-facing change?

No

How was this patch tested?

This patch was tested manually using a simple PySpark job:

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```

The expected behaviour of this "job" is that both the Python and JVM processes exit automatically after the main function runs. This is the case for Spark versions <= 2.4.0. On version 2.4.3, the JVM process hangs because there is a non-daemon thread still running:

"OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]

This is caused by a bug in the `kubernetes-client` library, which is fixed in the version that we are upgrading to.

When the mentioned job is run with this patch applied, the behaviour from Spark <= 2.4.0 is restored and both processes terminate successfully.
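For anyone reproducing this, one way to see which non-daemon threads are left alive in the driver JVM is to query it through the py4j gateway before the script ends. This is only a diagnostic sketch added for illustration, not part of the patch, and it relies on the internal `spark.sparkContext._jvm` attribute:

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()

    # Walk the driver JVM's thread table via py4j and report non-daemon threads.
    # With the buggy kubernetes-client, an "OkHttp WebSocket ..." entry shows up
    # here and keeps the JVM from exiting after main finishes.
    jvm = spark.sparkContext._jvm
    stack_traces = jvm.java.lang.Thread.getAllStackTraces()
    for thread in stack_traces.keySet():
        if not thread.isDaemon():
            print(thread.getName())
```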

@igorcalabria
Contributor Author

cc @dongjoon-hyun

@dongjoon-hyun
Member

Thank you, @igorcalabria .

@dongjoon-hyun
Member

ok to test

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 18, 2019

BTW, @igorcalabria.
FYI, in the GitHub email settings, you can add your additional email, igor.calabria@ubee.in, as a secondary email.

  • Then, from the GitHub commit log, your account will be linked and your GitHub icon will show up.
  • Also, it will remove the First-time contributor label from your PR. (At the top of this PR, you can still see First-time contributor.)

I highly recommend you do that.

@SparkQA

SparkQA commented Oct 18, 2019

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17233/

@SparkQA

SparkQA commented Oct 18, 2019

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/17233/

@SparkQA

SparkQA commented Oct 18, 2019

Test build #112247 has finished for PR 26152 at commit 0a358f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

+1, LGTM. Thank you, @igorcalabria and @srowen .
Merged to branch-2.4

dongjoon-hyun pushed a commit that referenced this pull request Oct 18, 2019
Closes #26152 from igorcalabria/k8s-client-update-2.4.

Authored-by: igor.calabria <igor.calabria@ubee.in>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

@sethhorrigan

Attempting to start Spark 2.4.0 in Azure using AKS with Kubernetes version 1.14.6, we ran into the error referenced here: kubernetes/kubernetes#82131

I attempted to build an image from tag v2.4.0-rc5 with #25641 applied, and encountered the same error when trying to start Spark in AKS with K8s 1.14.6. I also tried the same with the changes from this PR (see https://github.com/sethhorrigan/spark/commit/8f96a5ea3d078a205ceb5924bf7aa2af04e6ced1), and got the exact same error:

```
java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden'
	at okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216)
```

Is it possible to apply this as a patch to 2.4.0, or will it absolutely require switching to the branch-2.4 code with this fix merged in?

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 25, 2019

Hi, @sethhorrigan. What do you mean by "Is it possible to apply this as a patch to 2.4.0"?
The next version of Apache Spark is 2.4.5. There is no such thing as 2.4.0.1.

@dongjoon-hyun
Member

If AKS breaks something in your production environment, please file an issue with them. That's the best way to get the commercial support you paid for.

@srowen
Member

srowen commented Oct 25, 2019

Put another way, use 2.4.4 at least. 2.4.0-rc5 is some release candidate of an old release, not even a final release.

@sethhorrigan

sethhorrigan commented Oct 25, 2019

@dongjoon-hyun in the commit referenced above (https://github.com/sethhorrigan/spark/commit/8f96a5ea3d078a205ceb5924bf7aa2af04e6ced1), you can see what I mean by "Is it possible to apply this as a patch to 2.4.0": I forked the Spark repo, applied the changes to the code as of v2.4.0, then built the Spark image and tried to start it.

I am aware that v2.4.5 will be the next release, but your comments on #25641 suggest it will likely not be available until January 2020. So if we are going to have to build the Spark image from source until then, it makes as much sense to me to patch the code we have been using (v2.4.0) as to patch another version (v2.4.4), or to build from the current branch-2.4, which will have other flaws to work out before v2.4.5 is ready for release.

@srowen the tags v2.4.0 and v2.4.0-rc5 both refer to the same commit, which I believe was the v2.4.0 release.

Has the change in this pull request been verified to fix kubernetes/kubernetes#82131 or is that just a hopeful guess?

Edit: reading through the comments on https://issues.jenkins-ci.org/browse/JENKINS-59000 (referenced from kubernetes/kubernetes#82131), I see that using https://kubernetes.default:443 as the endpoint fixed the issue for some people, and I found that this workaround fixes the issue for us as well. From the comments on that Jira issue, I would guess that using version 1.18.3 of the Kubernetes plugin may also address it.
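For anyone who wants to try the same thing, the workaround amounts to pointing the Spark master at the in-cluster DNS endpoint. A rough sketch only; the container image and namespace below are placeholders, and depending on your setup the master URL may instead be passed via --master on spark-submit:

```python
from pyspark.sql import SparkSession

# Point the Kubernetes master URL at the in-cluster service DNS name rather than
# the raw API server IP. "my-registry/spark-py:2.4.4" and "default" are placeholders.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default:443")
    .config("spark.kubernetes.container.image", "my-registry/spark-py:2.4.4")
    .config("spark.kubernetes.namespace", "default")
    .getOrCreate()
)
```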

Hope this solution helps anyone else who stumbles on this thread as well.

@igorcalabria
Contributor Author

@sethhorrigan This PR also fixes the Kubernetes issue you mentioned (kubernetes-client was upgraded to a higher version). If you are building from source, I recommend that you start with the latest released minor version (2.4.4) and apply this patch. Or you could simply use branch-2.4, where this patch was already merged.
