
Add SparkKubernetesOperator crd implementation #22253

Merged
merged 74 commits into apache:main from the spark_k8s_operator branch
Jan 12, 2024

Conversation

hamedhsn
Contributor

@hamedhsn hamedhsn commented Mar 14, 2022

Currently, the SparkKubernetesOperator is based on the spark-on-k8s-operator project.

While that is a good first step, it only accepts a YAML file to create a Spark job. Moreover, the built-in functionality of other operators is not available, e.g. error handling, log management, retries, job deletion, timeouts, etc. It is also not designed to be used by data scientists and other users who are not familiar with the Kubernetes YAML specification.

This PR reimplements the SparkKubernetesOperator to simplify the interface and accept its parameters in Python. Similar to the KubernetesPodOperator, it adds logic to wait for the job after submission, handle errors, retrieve logs from the driver pod, and delete a Spark job. It also supports out-of-the-box Kubernetes functionality such as volumes, config maps, secrets, etc.
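
To make the intended interface concrete, here is a rough usage sketch. The import path is the provider's real module; the keyword arguments shown (namespace, image, code_path, get_logs) are illustrative assumptions about the Python-level interface described above, not code taken from this PR's diff.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
        SparkKubernetesOperator,
    )

    with DAG("spark_pi_example", start_date=datetime(2024, 1, 1), schedule=None) as dag:
        # Submit a SparkApplication custom resource and wait for the driver to finish.
        spark_pi = SparkKubernetesOperator(
            task_id="spark_pi",
            namespace="spark-jobs",                         # assumed parameter
            image="my-registry/spark:3.4",                  # assumed parameter
            code_path="local:///opt/spark/examples/pi.py",  # assumed parameter
            get_logs=True,                                  # assumed parameter: stream driver pod logs
        )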

@boring-cyborg boring-cyborg bot added provider:cncf-kubernetes Kubernetes provider related issues area:providers labels Mar 14, 2022
@hamedhsn hamedhsn force-pushed the spark_k8s_operator branch 5 times, most recently from 007830a to 10543ab on April 1, 2022 14:52
@hamedhsn hamedhsn changed the title WIP - Add SparkKubernetesOperator crd implementation Add SparkKubernetesOperator crd implementation Apr 1, 2022
@hamedhsn hamedhsn changed the title Add SparkKubernetesOperator crd implementation wIP - Add SparkKubernetesOperator crd implementation Apr 1, 2022
@hamedhsn hamedhsn changed the title wIP - Add SparkKubernetesOperator crd implementation WIP - Add SparkKubernetesOperator crd implementation Apr 1, 2022
@hamedhsn hamedhsn force-pushed the spark_k8s_operator branch from 10543ab to 0635ca3 on April 4, 2022 18:10
@hamedhsn hamedhsn changed the title WIP - Add SparkKubernetesOperator crd implementation Add SparkKubernetesOperator crd implementation Apr 4, 2022
@potiuk
Member

potiuk commented Apr 12, 2022

@dstandish - this is yet another example and contribution to the discussion we have here: https://lists.apache.org/thread/m40ljkgcqsnvbxz6vlgnwflygvr6vznz

The change here spans across both "airflow/kubernetes" (part of the core) and the "cncf.kubernetes" provider, and at a quick glance I have no idea how the two interact.

Will the new provider work without upgrading Airflow after the change? Will it require the change to be released? Will the core continue running with the old cncf.kubernetes?

No idea.

This is a far more important issue to solve than just deprecating config parameters. The root cause is that, unlike all other providers, we have functionality split between "core" and "provider"; one uses the other and there is an implicit dependency.

There are two options we could take to solve this problem:

  1. decouple cncf.kubernetes from the core completely and remove all the implicit binding between the two
  2. move all the cncf.kubernetes stuff into the core and deprecate the provider

The first one might be possible - but we need to know whether it really is, and whether it brings any backwards-compatibility issues. Ultimately, your proposal about moving the core Kubernetes settings is in line with this change.

The second case is "easier" (it requires no real changes to the core except moving the provider code into it), but it does not fit nicely with our "modular" approach of a core plus separate providers (Kubernetes is essentially an optional dependency).

I think that, in order to decide on deprecating the Kubernetes settings in the core, we need to be sure that 1) is possible - and for that we need a plan for how to achieve it.

@dstandish
Contributor

This is a far more important issue to solve than just deprecating config parameters. The root cause is that, unlike all other providers, we have functionality split between "core" and "provider"; one uses the other and there is an implicit dependency.

@potiuk what you are talking about is code sharing: the provider depending on code in core and having its deps intermixed with core deps. But just because the operator depends on core doesn't mean that it has to take its configuration in a non-standard way that no other operator does, namely from airflow.cfg. KPO can depend on core code without taking its configuration from core settings. Indeed, every provider depends on core -- that's why we have a "min airflow version". What's different about KPO is that it doesn't use a hook, and my goal is to fix that and have it be configured like every other operator -- with a connection or params.

@potiuk
Member

potiuk commented Apr 25, 2022

@potiuk what you are talking about is code sharing: the provider depending on code in core and having its deps intermixed with core deps. But just because the operator depends on core doesn't mean that it has to take its configuration in a non-standard way that no other operator does, namely from airflow.cfg. KPO can depend on core code without taking its configuration from core settings. Indeed, every provider depends on core -- that's why we have a "min airflow version". What's different about KPO is that it doesn't use a hook, and my goal is to fix that and have it be configured like every other operator -- with a connection or params.

Yeah. I know that aligning it to be "the same as the others" is a good goal, as long as we know that it is going to remain a separate provider in the future. To be honest, I do not know - see the continuation in https://lists.apache.org/thread/m40ljkgcqsnvbxz6vlgnwflygvr6vznz

One possible path for cncf.kubernetes - after seeing how disruptive any changes might be and how unforeseen the compatibility issues with the provider are - is to incorporate it BACK into the core. I (and I think we all) have not had enough discussion and exploration to understand whether this is a possible or better path, but I think we should take a look and make a deliberate decision on its future before we attempt any changes to the settings, because we might actually end up incorporating it into the core if we do not decouple the K8s executor from the core.

@hamedhsn
Contributor Author

@potiuk @dstandish can you let me know what action should be taken on my side?
With the current code structure in Airflow, this PR is ready to be reviewed.

@potiuk
Member

potiuk commented May 4, 2022

@hamedhsn - first of all: rebase (it has conflicts now). Secondly, make sure that all checks pass. Then ping us.

@hamedhsn hamedhsn force-pushed the spark_k8s_operator branch 3 times, most recently from 3936fb4 to 4b49663 on May 5, 2022 10:27
@potiuk
Member

potiuk commented Jan 5, 2024

If I could run that locally it would have been much faster to test.

You can.

See the output of the failed build. It contains detailed instructions right after the error message. And our contribution documentation https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#id7 contains a chapter conveniently named "documentation", which has links to the docs that explain how you can run the docs build locally.

@hamedhsn
Contributor Author

hamedhsn commented Jan 8, 2024

If I could run that locally it would have been much faster to test.

You can.

See the output of the failed build. It contains detailed instructions right after the error message. And our contribution documentation https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#id7 contains a chapter conveniently named "documentation", which has links to the docs that explain how you can run the docs build locally.

@potiuk it's not that straightforward; after installing breeze, when I run breeze build-docs cncf.kubernetes it fails with this kind of error:

.
.
.
------------------------------ Error  44 --------------------
 WARNING: toctree contains reference to nonexisting document '_api/tests/system/providers/cncf/kubernetes/index'

File path: apache-airflow-providers-cncf-kubernetes/index.rst (49)

  44 |
  45 |     Configuration <configurations-ref>
  46 |     CLI <cli-ref>
  47 |     Python API <_api/airflow/providers/cncf/kubernetes/index>
  48 |
> 49 | .. toctree::
  50 |     :hidden:
  51 |     :maxdepth: 1
  52 |     :caption: System tests
  53 |
  54 |     System Tests <_api/tests/system/providers/cncf/kubernetes/index>

############################## End docs build errors summary ##############################


The documentation has errors.
####################  Packages errors summary  ####################
Package name                                         Count of doc build errors    Count of spelling errors
-------------------------------------------------  ---------------------------  --------------------------
apache-airflow-providers-cncf-kubernetes                           44                           0
##################################################

@bolkedebruin
Contributor

@hamedhsn documentation generation is always a bit tough, especially in the beginning - I speak from experience. I have added some comments. Make sure to check the error messages carefully, because they do say what's wrong. When the build errors out on whole blocks, you probably need to check for proper indentation (spaces, tabs) and line spacing.

@hamedhsn hamedhsn force-pushed the spark_k8s_operator branch from 6cbf3f3 to 2022a89 on January 8, 2024 15:45
@hamedhsn hamedhsn force-pushed the spark_k8s_operator branch from 2022a89 to 4213cd5 on January 9, 2024 11:27
@dstandish
Contributor

Well, docs are green - just a few other issues to resolve, it seems 👏

@bolkedebruin
Contributor

bolkedebruin commented Jan 10, 2024

Almost there, @hamedhsn!

airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py:106: error:
Incompatible default for argument "driver" (default has type "None", argument
has type "Dict[Any, Any]")  [assignment]
            driver: dict = None,
                           ^~~~
airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py:106: note: PEP 484 prohibits implicit Optional. Accordingly, mypy has changed its default to no_implicit_optional=True
airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py:106: note: Use https://github.com/hauntsaninja/no_implicit_optional to automatically upgrade your codebase
airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py:107: error:
Incompatible default for argument "executor" (default has type "None", argument
has type "Dict[Any, Any]")  [assignment]
            executor: dict = None,
                             ^~~~
airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py:107: note: PEP 484 prohibits implicit Optional. Accordingly, mypy has changed its default to no_implicit_optional=True
airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py:107: note: Use https://github.com/hauntsaninja/no_implicit_optional to automatically upgrade your codebase

And

=========================== short test summary info ============================
FAILED tests/providers/cncf/kubernetes/operators/test_spark_kubernetes.py::TestSparkKubernetesOperator::test_new_template_from_yaml - AttributeError: 'KubernetesSpec' object has no attribute 'node_selector'
FAILED tests/providers/cncf/kubernetes/operators/test_spark_kubernetes.py::TestSparkKubernetesOperator::test_template_spec - AttributeError: 'KubernetesSpec' object has no attribute 'node_selector'
FAILED tests/providers/cncf/kubernetes/operators/test_spark_kubernetes.py::TestSparkKubernetesOperator::test_env - AttributeError: 'KubernetesSpec' object has no attribute 'node_selector'
FAILED tests/providers/cncf/kubernetes/operators/test_spark_kubernetes.py::TestSparkKubernetesOperator::test_volume - AttributeError: 'KubernetesSpec' object has no attribute 'node_selector'
FAILED tests/providers/cncf/kubernetes/operators/test_spark_kubernetes.py::TestSparkKubernetesOperator::test_pull_secret - AttributeError: 'KubernetesSpec' object has no attribute 'node_selector'
FAILED tests/providers/cncf/kubernetes/operators/test_spark_kubernetes.py::TestSparkKubernetesOperator::test_affinity - AttributeError: 'KubernetesSpec' object has no attribute 'node_selector'
FAILED tests/providers/cncf/kubernetes/operators/test_spark_kubernetes.py::TestSparkKubernetesOperator::test_toleration - AttributeError: 'KubernetesSpec' object has no attribute 'node_selector'
= 7 failed, 9398 passed, 8189 skipped, 2 xfailed, 251 warnings in 434.41s (0:07:14) =
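
For reference, the usual remedy for the mypy "implicit Optional" complaint above is to make the Optional explicit in the flagged signatures. A minimal sketch follows; the class name and the attribute handling are illustrative assumptions, not the PR's actual code.

    from __future__ import annotations


    class SparkResources:  # hypothetical container, shown only to illustrate the signature fix
        def __init__(
            self,
            driver: dict | None = None,    # explicit Optional satisfies no_implicit_optional
            executor: dict | None = None,  # same fix for the second flagged argument
        ) -> None:
            self.driver = driver or {}
            self.executor = executor or {}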

@hamedhsn
Contributor Author

@bolkedebruin can you approve the build pls?

@hamedhsn
Contributor Author

@bolkedebruin sounds like all tests are green now

@bolkedebruin
Contributor

Jihaa! Thank you for your patience and persistence @hamedhsn

@bolkedebruin
Contributor

Side note @potiuk: this is where I do not like "unresolved conversations". They are hiding in outdated changes :/

@bolkedebruin bolkedebruin merged commit aa25aff into apache:main Jan 12, 2024
@hamedhsn
Contributor Author

@bolkedebruin Thanks for the review, glad it finally merged 😌 🎉

@potiuk
Member

potiuk commented Jan 12, 2024

🎉

@potiuk
Member

potiuk commented Jan 13, 2024

Side note @potiuk: this is where I do not like "unresolved conversations". They are hiding in outdated changes :/

@bolkedebruin - you just gave the most important reason why enabling "require resolving conversations" is useful. Those unresolved conversations are STILL there - it does not matter whether we have "require resolving conversations" enabled. And you might have MISSED that there are unresolved conversations. By requiring them to be resolved, yes - you have to do a few more clicks to find them, read them and resolve them.

Which is PRECISELY the point. It forces SOMEONE to actually check and take some action to resolve the conversations before merging the PR.

You might not like having to do it yourself, but it is good for the project as a whole (and for you as well), because some important points from the conversation could otherwise be missed.

I will actually add this comment to the list of positive effects that I am gathering to present at the end of the trial.

@bolkedebruin
Contributor

I'm of the opposite opinion here @potiuk. I understand why you would like to have it, but I do not support it, as I think it is administrative waste, tbh. Trust the authors and contributors to do the right thing rather than encode this in technology. I have not seen your argument backed up by facts on the number of PRs merged vs. the errors we have made while merging them. That is my biggest concern with adding this kind of administrative overhead. It seems to be a solution in search of a problem.

This PR started over one and a half years ago. Its remarks are tied to that timeline. The world has progressed. The remarks were not surfaced by GitHub, yet they still needed to be resolved even though they were no longer relevant. The risk this generates is that committers will start looking at the conversations and just click "resolve" without reading, to get past the administrative part.

Which is PRECISELY the point. It forces SOMEONE to actually check and take some action to resolve the conversations before merging the PR.

No, it wasn't. The point is to prevent errors while merging. The means being experimented with now is to accomplish that by forcing resolution through clicking "resolve".

So you're welcome to list it as a positive; I will list it as a negative :-).

@potiuk
Member

potiuk commented Jan 13, 2024

I simply think that an individual's "little" time spent reviewing the comments and resolving them is a very small price to pay for the "project"-level positive effect of making sure we have not forgotten anything. The value for the project trumps the (small) inconvenience for the individual.

I personally review and merge maybe 80 PRs a week, and I still do not see resolving such unresolved comments as an inconvenience. I just find it super strange that someone finds it inconvenient to do it for the few PRs they are seeing and iterating over during the week. Somehow I cannot believe it's that big of an overhead.

And yes, this case is a BIT of an outlier, because it accumulated some historical conversations that were not resolved "on the spot" - but hopefully, if we agree to let the conversation requirement go ahead, this will become less and less of a problem as people get used to resolving conversations when they are, well, resolved.

Yes, that asks for a change in individual habits and behaviours. But I strongly believe the small inconvenience is nothing compared with the benefit, for the project, of "pause and think before merging, and make sure what you are merging has not missed an important comment".

@vsoch
Contributor

vsoch commented Jan 14, 2024

Going to agree with you @bolkedebruin. I had similar issues about 5 years ago, and while I think the intention is best, when it results in a PR being open on the order of years and not a more reasonable few months, it gives me pause. I can't comment on the details of the community here because it's been too long since I contributed, but it seems to me that something is off when the PR time is on the order of years. My heart goes out to you @potiuk, I hope you find the right balance, and seek feedback if/when you determine that things might not be working.

@potiuk
Member

potiuk commented Jan 14, 2024

@vsoch :) -> yep, I know what you are talking about, but I think in this case cause and effect are reversed.

What we are discussing here (the need to resolve comments to merge a PR) just causes problems when a PR runs long, but I think it has little chance of causing the long run in the first place, IMHO. The experiment here is to see the results of enabling "require conversation resolving" to merge a PR. We enabled it 2 weeks ago, and the problem is that this PR was opened a long time ago, so what @bolkedebruin is complaining about here is that he had to go back and resolve a number of old and stale conversations - but the root cause was really that the PR had some outstanding changes and tests that were only sorted out a few days ago (and I guess it took @bolkedebruin a few minutes to resolve all the conversations that were outstanding, even if old, after the tests passed and the comments were addressed).

So I would not really connect your past experience back then with the discussion we are having here :D

@vsoch
Contributor

vsoch commented Jan 14, 2024

I understood that. The high level pattern is adding developer churn to the process of review, regardless of the specific details.

@potiuk
Member

potiuk commented Jan 14, 2024

I understood that. The high level pattern is adding developer churn to the process of review, regardless of the specific details.

I believe it adds more to the process for the reviewer and the one who merges the PR, not the developer. And if you look at the stats, about 30%-40% of all PRs are actually merged by ... me ... So what I am proposing here is to add more "pauses" to my own workflow - way more than for anyone else, to be honest (and this is also good, I think, because it will prevent me from merging PRs where I missed that there is an unresolved conversation).

Labels
area:providers provider:cncf-kubernetes Kubernetes provider related issues
9 participants