
Add SparkKubernetesOperator crd implementation #22253

Merged
merged 74 commits into apache:main from the spark_k8s_operator branch
Jan 12, 2024

Conversation

hamedhsn
Contributor

@hamedhsn hamedhsn commented Mar 14, 2022

Currently, the SparkKubernetesOperator is based on the spark-on-k8s-operator project.

While that is a good first step, it only accepts a YAML file to create a Spark job. Moreover, the built-in functionality of other operators is not available, e.g. error handling, log management, retries, job deletion, timeouts, etc. It is also not designed to be used by data scientists and other users who are not familiar with the Kubernetes YAML specification.

This PR reimplements the SparkKubernetesOperator to simplify the interface and accept its parameters in Python. Similar to the KubernetesPodOperator, it adds logic to wait for the job after submission, handle errors, retrieve logs from the driver pod, and delete a Spark job. It also supports out-of-the-box Kubernetes functionality such as volumes, config maps, secrets, etc.
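
To make the intended interface concrete, here is a rough usage sketch. The import path is the provider's real module; the keyword arguments shown (namespace, image, code_path, get_logs) are illustrative assumptions about the Python-level interface described above, not code taken from this PR's diff.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
        SparkKubernetesOperator,
    )

    with DAG("spark_pi_example", start_date=datetime(2024, 1, 1), schedule=None) as dag:
        # Submit a SparkApplication custom resource and wait for the driver to finish.
        spark_pi = SparkKubernetesOperator(
            task_id="spark_pi",
            namespace="spark-jobs",                         # assumed parameter
            image="my-registry/spark:3.4",                  # assumed parameter
            code_path="local:///opt/spark/examples/pi.py",  # assumed parameter
            get_logs=True,                                  # assumed parameter: stream driver pod logs
        )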

@boring-cyborg boring-cyborg bot added provider:cncf-kubernetes Kubernetes provider related issues area:providers labels Mar 14, 2022
@hamedhsn hamedhsn force-pushed the spark_k8s_operator branch 5 times, most recently from 007830a to 10543ab on April 1, 2022 14:52
@hamedhsn hamedhsn changed the title WIP - Add SparkKubernetesOperator crd implementation Add SparkKubernetesOperator crd implementation Apr 1, 2022
@hamedhsn hamedhsn changed the title Add SparkKubernetesOperator crd implementation wIP - Add SparkKubernetesOperator crd implementation Apr 1, 2022
@hamedhsn hamedhsn changed the title wIP - Add SparkKubernetesOperator crd implementation WIP - Add SparkKubernetesOperator crd implementation Apr 1, 2022
@hamedhsn hamedhsn force-pushed the spark_k8s_operator branch from 10543ab to 0635ca3 on April 4, 2022 18:10
@hamedhsn hamedhsn changed the title WIP - Add SparkKubernetesOperator crd implementation Add SparkKubernetesOperator crd implementation Apr 4, 2022
@potiuk
Member

potiuk commented Apr 12, 2022

@dstandish - this is yet another example and contribution to the discussion we have here: https://lists.apache.org/thread/m40ljkgcqsnvbxz6vlgnwflygvr6vznz

The change here spans across both "airflow/kubernetes" (part of the core) and the "cncf.kubernetes" provider, and at a quick glance I have no idea how the two interact.

Will the new provider work without upgrading Airflow after the change? Will it require the change to be released? Will the core continue running with the old cncf.kubernetes?

No idea.

This is a far more important issue to solve than just deprecating config parameters. The root cause is that, unlike all other providers, we have functionality split between "core" and "provider"; one uses the other and there is an implicit dependency.

There are two options we could take to solve this problem:

  1. decouple cncf.kubernetes from the core completely and remove all the implicit binding between the two
  2. move all the cncf.kubernetes stuff into the core and deprecate the provider

The first one might be possible - but we need to know whether it really is, and whether it brings any backwards-compatibility issues. Ultimately, your proposal about moving the core Kubernetes settings is in line with this change.

The second case is "easier" (it requires no real changes to the core except moving the provider code into it), but it does not fit nicely with our "modular" approach of a core plus separate providers (Kubernetes is essentially an optional dependency).

I think that, in order to decide on deprecating the Kubernetes settings in the core, we need to be sure that 1) is possible - and for that we need a plan for how to achieve it.

@dstandish
Contributor

This is a far more important issue to solve than just deprecating config parameters. The root cause is that, unlike all other providers, we have functionality split between "core" and "provider"; one uses the other and there is an implicit dependency.

@potiuk what you are talking about is code sharing: the provider depending on code in core and having its deps intermixed with core deps. But just because the operator depends on core doesn't mean that it has to take its configuration in a non-standard way that no other operator does, namely from airflow.cfg. KPO can depend on core code without taking its configuration from core settings. Indeed, every provider depends on core -- that's why we have a "min airflow version". What's different about KPO is that it doesn't use a hook, and my goal is to fix that and have it be configured like every other operator -- with a connection or params.

@potiuk
Member

potiuk commented Apr 25, 2022

@potiuk what you are talking about is code sharing: the provider depending on code in core and having its deps intermixed with core deps. But just because the operator depends on core doesn't mean that it has to take its configuration in a non-standard way that no other operator does, namely from airflow.cfg. KPO can depend on core code without taking its configuration from core settings. Indeed, every provider depends on core -- that's why we have a "min airflow version". What's different about KPO is that it doesn't use a hook, and my goal is to fix that and have it be configured like every other operator -- with a connection or params.

Yeah. I know that aligning it to be "the same as the others" is a good goal, as long as we know that it is going to remain a separate provider in the future. To be honest, I do not know - see the continuation in https://lists.apache.org/thread/m40ljkgcqsnvbxz6vlgnwflygvr6vznz

One possible path for cncf.kubernetes - after seeing how disruptive any changes might be and how unforeseen the compatibility issues with the provider are - is to incorporate it BACK into the core. I (and I think we all) have not had enough discussion and exploration to understand whether this is a possible or better path, but I think we should take a look and make a deliberate decision on its future before we attempt any changes to the settings, because we might actually end up incorporating it into the core if we do not decouple the K8s executor from the core.

@hamedhsn
Contributor Author

@potiuk @dstandish can you let me know what action should be taken on my side?
With the current code structure in Airflow, this PR is ready to be reviewed.

@potiuk
Member

potiuk commented May 4, 2022

@hamedhsn - first of all: rebase (it has conflicts now). Secondly, make sure that all checks pass. Then ping us.

@hamedhsn hamedhsn force-pushed the spark_k8s_operator branch 3 times, most recently from 3936fb4 to 4b49663 on May 5, 2022 10:27
@potiuk
Member

potiuk commented Jan 5, 2024

If I could run that locally it would have been much faster to test.

You can.

See the output of the failed build. It contains detailed instructions right after the error message. And our contribution documentation https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#id7 contains a chapter conveniently named "documentation", which has links to the docs that explain how you can run the docs build locally.

@hamedhsn
Contributor Author

hamedhsn commented Jan 8, 2024

If I could run that locally it would have been much faster to test.

You can.

See the output of the failed build. It contains detailed instructions right after the error message. And our contribution documentation https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst#id7 contains a chapter conveniently named "documentation", which has links to the docs that explain how you can run the docs build locally.

@potiuk it's not that straightforward; after installing breeze, when I run breeze build-docs cncf.kubernetes it fails with this kind of error:

.
.
.
------------------------------ Error  44 --------------------
 WARNING: toctree contains reference to nonexisting document '_api/tests/system/providers/cncf/kubernetes/index'

File path: apache-airflow-providers-cncf-kubernetes/index.rst (49)

  44 |
  45 |     Configuration <configurations-ref>
  46 |     CLI <cli-ref>
  47 |     Python API <_api/airflow/providers/cncf/kubernetes/index>
  48 |
> 49 | .. toctree::
  50 |     :hidden:
  51 |     :maxdepth: 1
  52 |     :caption: System tests
  53 |
  54 |     System Tests <_api/tests/system/providers/cncf/kubernetes/index>

############################## End docs build errors summary ##############################


The documentation has errors.
####################  Packages errors summary  ####################
Package name                                         Count of doc build errors    Count of spelling errors
-------------------------------------------------  ---------------------------  --------------------------
apache-airflow-providers-cncf-kubernetes                           44                           0
##################################################

@bolkedebruin
Contributor

@hamedhsn documentation generation is always a bit tough, especially in the beginning - I speak from experience. I have added some comments. Make sure to check the error messages carefully, because they do say what's wrong. When the build errors out on whole blocks, you probably need to check for proper indentation (spaces, tabs) and line spacing.

@hamedhsn hamedhsn force-pushed the spark_k8s_operator branch from 6cbf3f3 to 2022a89 on January 8, 2024 15:45
@hamedhsn hamedhsn force-pushed the spark_k8s_operator branch from 2022a89 to 4213cd5 on January 9, 2024 11:27
@dstandish
Contributor

Well, docs are green - just a few other issues to resolve, it seems 👏

@bolkedebruin
Contributor

bolkedebruin commented Jan 10, 2024

Almost there, @hamedhsn!

airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py:106: error:
Incompatible default for argument "driver" (default has type "None", argument
has type "Dict[Any, Any]")  [assignment]
            driver: dict = None,
                           ^~~~
airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py:106: note: PEP 484 prohibits implicit Optional. Accordingly, mypy has changed its default to no_implicit_optional=True
airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py:106: note: Use https://github.com/hauntsaninja/no_implicit_optional to automatically upgrade your codebase
airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py:107: error:
Incompatible default for argument "executor" (default has type "None", argument
has type "Dict[Any, Any]")  [assignment]
            executor: dict = None,
                             ^~~~
airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py:107: note: PEP 484 prohibits implicit Optional. Accordingly, mypy has changed its default to no_implicit_optional=True
airflow/providers/cncf/kubernetes/operators/custom_object_launcher.py:107: note: Use https://github.com/hauntsaninja/no_implicit_optional to automatically upgrade your codebase

And

=========================== short test summary info ============================
FAILED tests/providers/cncf/kubernetes/operators/test_spark_kubernetes.py::TestSparkKubernetesOperator::test_new_template_from_yaml - AttributeError: 'KubernetesSpec' object has no attribute 'node_selector'
FAILED tests/providers/cncf/kubernetes/operators/test_spark_kubernetes.py::TestSparkKubernetesOperator::test_template_spec - AttributeError: 'KubernetesSpec' object has no attribute 'node_selector'
FAILED tests/providers/cncf/kubernetes/operators/test_spark_kubernetes.py::TestSparkKubernetesOperator::test_env - AttributeError: 'KubernetesSpec' object has no attribute 'node_selector'
FAILED tests/providers/cncf/kubernetes/operators/test_spark_kubernetes.py::TestSparkKubernetesOperator::test_volume - AttributeError: 'KubernetesSpec' object has no attribute 'node_selector'
FAILED tests/providers/cncf/kubernetes/operators/test_spark_kubernetes.py::TestSparkKubernetesOperator::test_pull_secret - AttributeError: 'KubernetesSpec' object has no attribute 'node_selector'
FAILED tests/providers/cncf/kubernetes/operators/test_spark_kubernetes.py::TestSparkKubernetesOperator::test_affinity - AttributeError: 'KubernetesSpec' object has no attribute 'node_selector'
FAILED tests/providers/cncf/kubernetes/operators/test_spark_kubernetes.py::TestSparkKubernetesOperator::test_toleration - AttributeError: 'KubernetesSpec' object has no attribute 'node_selector'
= 7 failed, 9398 passed, 8189 skipped, 2 xfailed, 251 warnings in 434.41s (0:07:14) =
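
For reference, the usual remedy for the mypy "implicit Optional" complaint above is to make the Optional explicit in the flagged signatures. A minimal sketch follows; the class name and the attribute handling are illustrative assumptions, not the PR's actual code.

    from __future__ import annotations


    class SparkResources:  # hypothetical container, shown only to illustrate the signature fix
        def __init__(
            self,
            driver: dict | None = None,    # explicit Optional satisfies no_implicit_optional
            executor: dict | None = None,  # same fix for the second flagged argument
        ) -> None:
            self.driver = driver or {}
            self.executor = executor or {}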

@hamedhsn
Contributor Author

@bolkedebruin can you approve the build pls?

@hamedhsn
Contributor Author

@bolkedebruin sounds like all tests are green now

@bolkedebruin
Contributor

Jihaa! Thank you for your patience and persistence @hamedhsn

@bolkedebruin
Contributor

Side note @potiuk: this is where I do not like "unresolved conversations". They are hiding in outdated changes :/

@bolkedebruin bolkedebruin merged commit aa25aff into apache:main Jan 12, 2024
@hamedhsn
Contributor Author

@bolkedebruin Thanks for the review, glad it finally merged 😌 🎉

@potiuk
Member

potiuk commented Jan 12, 2024

🎉

@potiuk
Member

potiuk commented Jan 13, 2024

Side note @potiuk: this is where I do not like "unresolved conversations". They are hiding in outdated changes :/

@bolkedebruin - you just gave the most important reason why enabling "require resolving conversations" is useful. Those unresolved conversations are STILL there - it does not matter whether we have "require resolving conversations" enabled. And you might have MISSED that there are unresolved conversations. By requiring them to be resolved, yes - you have to do a few more clicks to find them, read them and resolve them.

Which is PRECISELY the point. It forces SOMEONE to actually check and take some action to resolve the conversations before merging the PR.

You might not like having to do it yourself, but it is good for the project as a whole (and for you as well), because some important points from the conversation could otherwise be missed.

I will actually add this comment to the list of positive effects that I am gathering to present at the end of the trial.

@bolkedebruin
Contributor

I'm of the opposite opinion here @potiuk. I understand why you would like to have it, but I do not support it, as I think it is administrative waste, tbh. Trust the authors and contributors to do the right thing rather than encode this in technology. I have not seen your argument backed up by facts on the number of PRs merged vs. the errors we have made while merging them. That is my biggest concern with adding this kind of administrative overhead. It seems to be a solution in search of a problem.

This PR started over one and a half years ago. Its remarks are tied to that timeline. The world has progressed. The remarks were not surfaced by GitHub, yet they still needed to be resolved even though they were no longer relevant. The risk this generates is that committers will start looking at the conversations and just click "resolve" without reading, to get past the administrative part.

Which is PRECISELY the point. It forces SOMEONE to actually check and take some action to resolve the conversations before merging the PR.

No, it wasn't. The point is to prevent errors while merging. The means being experimented with now is to accomplish that by forcing resolution through clicking "resolve".

So you're welcome to list it as a positive; I will list it as a negative :-).

@potiuk
Member

potiuk commented Jan 13, 2024

I simply think that an individual's "little" time spent reviewing the comments and resolving them is a very small price to pay for the "project"-level positive effect of making sure we have not forgotten anything. The value for the project trumps the (small) inconvenience for the individual.

I personally review and merge maybe 80 PRs a week, and I still do not see resolving such unresolved comments as an inconvenience. I just find it super strange that someone finds it inconvenient to do it for the few PRs they are seeing and iterating over during the week. Somehow I cannot believe it's that big of an overhead.

And yes, this case is a BIT of an outlier, because it accumulated some historical conversations that were not resolved "on the spot" - but hopefully, if we agree to let the conversation requirement go ahead, this will become less and less of a problem as people get used to resolving conversations when they are, well, resolved.

Yes, that asks for a change in individual habits and behaviours. But I strongly believe the small inconvenience is nothing compared with the benefit, for the project, of "pause and think before merging, and make sure what you are merging has not missed an important comment".

@vsoch
Contributor

vsoch commented Jan 14, 2024

Going to agree with you @bolkedebruin. I had similar issues about 5 years ago, and while I think the intention is best, when it results in a PR being open on the order of years and not a more reasonable few months, it gives me pause. I can't comment on the details of the community here because it's been too long since I contributed, but it seems to me that something is off when the PR time is on the order of years. My heart goes out to you @potiuk, I hope you find the right balance, and seek feedback if/when you determine that things might not be working.

@potiuk
Member

potiuk commented Jan 14, 2024

@vsoch :) -> yep, I know what you are talking about, but I think in this case cause and effect are reversed.

What we are discussing here (the need to resolve comments to merge a PR) just causes problems when a PR runs long, but I think it has little chance of causing the long run in the first place, IMHO. The experiment here is to see the results of enabling "require conversation resolving" to merge a PR. We enabled it 2 weeks ago, and the problem is that this PR was opened a long time ago, so what @bolkedebruin is complaining about here is that he had to go back and resolve a number of old and stale conversations - but the root cause was really that the PR had some outstanding changes and tests that were only sorted out a few days ago (and I guess it took @bolkedebruin a few minutes to resolve all the conversations that were outstanding, even if old, after the tests passed and the comments were addressed).

So I would not really connect your past experience back then with the discussion we are having here :D

@vsoch
Contributor

vsoch commented Jan 14, 2024

I understood that. The high level pattern is adding developer churn to the process of review, regardless of the specific details.

@potiuk
Member

potiuk commented Jan 14, 2024

I understood that. The high level pattern is adding developer churn to the process of review, regardless of the specific details.

I believe it adds more to the process for the reviewer and the one who merges the PR, not the developer. And if you look at the stats, about 30%-40% of all PRs are actually merged by ... me ... So what I am proposing here is to add more "pauses" to my own workflow - way more than for anyone else, to be honest (and this is also good, I think, because it will prevent me from merging PRs where I missed that there is an unresolved conversation).

Labels
area:providers provider:cncf-kubernetes Kubernetes provider related issues
9 participants