-
Notifications
You must be signed in to change notification settings - Fork 14.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider splitting Google Provider #15933
Comments
Hey,
To start the discussion.
The split that would make logical sense to me would be per Cloud Service
(API), like shown be `gcloud services list`, with transfer operators
belonging to appropriate Google side package, like Google Cloud Storage for
gcs transfer operators.
For the common part maybe an apache-airflow-providers-google-core package,
similarly to https://pypi.org/project/google-cloud-core/ ?
The most problematic part would be keeping common python libraries (but
that could be solved by apache-airflow-providers-google-core package) and
other cloud provider libraries in sync, especially between transfer
operators, so
maybe apache-airflow-providers-google-transfer-othercloudprovider for
transfer operators in the end?
That is what comes to my mind straight away but I am probably not seeing
some potential problems. WDYT
Anita
śr., 19 maj 2021 o 07:59 Jarek Potiuk ***@***.***> napisał(a):
… cc: @dstandish <https://github.com/dstandish> @kaxil
<https://github.com/kaxil> @ashb <https://github.com/ashb> @mik-laj
<https://github.com/mik-laj> @turbaszek <https://github.com/turbaszek>
@anitakar <https://github.com/anitakar>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#15933 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABE6AZEVD5PX22PR6PB7YULTONHTDANCNFSM45D4T6YA>
.
|
Having looked a little closer (
https://github.com/apache/airflow/tree/master/airflow/providers/google):
1. all the non cloud operators, sensors, hooks (ads, marketing_platform,
firebase) could become their own modules
(apache-airflow-providers-google-ads etc.)
2. all the files in cloud operators, sensors, hooks (
https://github.com/apache/airflow/tree/master/airflow/providers/google/cloud/operators)
could
become their own modules (apache-airflow-providers-google-dataflow etc.)
3. all transfer operators would have to move per service
With modifying __init__.py in a smart way all operators could preserve the
same import.
The problem would be only with transfer operators as they would have to be
moved not to have the same packages in different pypi modules and that
would be a breaking change for the users.
Seems like a lot of work. What would be the benefit? Being able to release
for example firebase operators separately?
śr., 19 maj 2021 o 22:56 Anita Fronczak ***@***.***> napisał(a):
… Hey,
To start the discussion.
The split that would make logical sense to me would be per Cloud Service
(API), like shown be `gcloud services list`, with transfer operators
belonging to appropriate Google side package, like Google Cloud Storage for
gcs transfer operators.
For the common part maybe an apache-airflow-providers-google-core package,
similarly to https://pypi.org/project/google-cloud-core/ ?
The most problematic part would be keeping common python libraries (but
that could be solved by apache-airflow-providers-google-core package) and
other cloud provider libraries in sync, especially between transfer
operators, so
maybe apache-airflow-providers-google-transfer-othercloudprovider for
transfer operators in the end?
That is what comes to my mind straight away but I am probably not seeing
some potential problems. WDYT
Anita
śr., 19 maj 2021 o 07:59 Jarek Potiuk ***@***.***>
napisał(a):
> cc: @dstandish <https://github.com/dstandish> @kaxil
> <https://github.com/kaxil> @ashb <https://github.com/ashb> @mik-laj
> <https://github.com/mik-laj> @turbaszek <https://github.com/turbaszek>
> @anitakar <https://github.com/anitakar>
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#15933 (comment)>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/ABE6AZEVD5PX22PR6PB7YULTONHTDANCNFSM45D4T6YA>
> .
>
|
This came up in part because leveldb hooks depend on plyvel library which requires that the machine has leveldb installed. So, prior to #15770, on mac if you want to install google provider (e.g. because you want to use some GCP hook) then you also have to install leveldb (which has nothing to do with GCP), and if you do not have leveldb installed, then the google provider install will fail. To resolve this, what we ended up doing was making plyvel an optional dependency of google provider (#15770) so you have to install (Come to think of it perhaps CI could be amended to do Rather than making leveldb a google provider extra, I think a cleaner solution would be if leveldb was made a distinct provider, either google-leveldb or just leveldb. In the case of leveldb, there's no interaction with GCP hooks -- AFAIK it's completely independent. And this way plyvel would not need to be an extra, and would not need to be a airflow core dep. The issue with leveldb was my main concern because being forced to |
+1 to that, I think we had similar discussion about salesforce, tableau and slack (Salesforce have bought Tableau and Slack): Also with companies like google it would be nice if they would be able to take care of the providers (even if that would mean extracting them from core). In that way they would be able to split it in whatever seems best. |
Agree with @turbaszek -- The best judge of the split is the company itself -- so according to me that would best. However, if Google does not want to do that, we should separate leveldb to a distinct provider like @dstandish suggested |
Hey,
So maybe without big upfront design and trying to solve the problem of
backwards compatibility for all cases, let's create another package for
leveldb. I could start during the weekend but I would need some help.
The way I see it you can not preserve same imports if you move the operator
file outside the current folder. You can only nest it deeper and create
__init__ .py that will fix the imports. Am I correct?
Thanks,
Anita
…On Fri, 21 May 2021, 21:02 Kaxil Naik, ***@***.***> wrote:
Agree with @turbaszek <https://github.com/turbaszek> -- The best judge of
the split is the company itself -- so according to me that would best.
However, if Google does not want to do that, we should separate leveldb to
a distinct provider like @dstandish <https://github.com/dstandish>
suggested
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#15933 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABE6AZD4HMCWUPXMEZXONP3TO2U2ZANCNFSM45D4T6YA>
.
|
Thats true but there may be a different case here as leveldb is under Google github account. So the users of leveldb are very aware that this is a google service unlike Tableau/Slack & Salesforce relationship. |
Yeah. I see the case @eladkal. But I kind of see how Plyvel is very different from all other Google "core" components. Which cloud/ads/marketing_platform might be really "core" google business, where Plyvel is more of a "side" thing and could be fully separated out. I think it is much easier to separate out plyvel as a provider than splitting google "core" one. I am quite torn on this one. I see the benefits of keeping all those as single "provider" - on the other hand the benefits of splitting it are "potentially good" but it makes them also difficult to work together when installed with all the dependencies - it will be easy for example to run in a situation where "cloud" dependencies are different (and conflicting) with "ads" dependencies. I think until we figure out (possibly) some way of separating the dependencies out between tasks, this might be a difficult one to tackle and rather than "origin" (google), the "independence" of a package might be more important. I think the "extras" approach where plyvel is "optional" google extra dependency is maybe not perfect, but it kind of join both worlds. On one hand we have them together by "ownership", on the other hand we have just a problem of "environment" set properly for plyvel to work (so plyvel db libs installed). I think the current approach is kinda sustainable and does not introduce too much of complexity. As usually with "compromise" solutions, it's not perfect but maybe it's "good enough"? As long as it works, it could be "OK". I am also quite OK to separate just plyvel as @anitakar suggested. |
@potiuk Thank you for pointing out that leveldb package is outside of cloud
package (
https://github.com/apache/airflow/tree/master/airflow/providers/google/leveldb)
and also that it is an extra (
https://github.com/apache/airflow/blob/master/airflow/providers/google/provider.yaml#L751
).
I am a little confused though. I see plyvel on the list of requirements on
https://pypi.org/project/apache-airflow-providers-google/ and I do not see
any extra name there. Ok, the change has been merged 4 days ago (
#15812). Yep, it seems to solve the
problem of a problematic dependency on Mac. So, nothing more to be done
there, at least in my opinion.
Going back to splitting discussion.
I can see that having to download and install lots of libraries when you
just want to use for example firebase, can lead to dependency conflicts
with your code, even though you do not use those other dependencies when
you are using Google providers. Also, why download all those binaries and
for example make your docker image bigger.
But managing dependencies is a complicated and in my opinion quite
unsolvable problem. Actually, the way it is now and as Jarek pointed out,
when you install Google providers, the package has taken care of solving
those conflicts at least for the Google providers package code.
If I could I would like to make a poll here if people are unhappy about
having to download and solve dependency conflicts against all libraries on
Google provider package? Or maybe, they are happy that somebody has fixed
those dependencies so that they match each other?
When it comes to backwards compatibility. It is quite easy to do a split on
the top level of google providers package (
https://github.com/apache/airflow/tree/master/airflow/providers/google).
But it seems impossible inside cloud package (
https://github.com/apache/airflow/tree/master/airflow/providers/google/cloud)
as packages there are split technically (hooks, sensors, operators ...) not
on a domain/service level (automl, bigquery etc.). Operators could keep
their imports (
https://github.com/apache/airflow/tree/master/airflow/providers/google/cloud/operators)
but matching hooks and sensors couldn't, so a change would be breaking by
definition.
Personally, in software development I am all for splitting packages by
domain not technical function. But I am also against forcing everyone to
modify their code if there is no strong reason.
I would also like to say those are my personal opinions not my employer's.
sob., 22 maj 2021 o 18:41 Jarek Potiuk ***@***.***>
napisał(a):
… Yeah. I see the case @eladkal <https://github.com/eladkal>. But I kind of
see how Plyvel is very different from all other Google "core" components.
Which cloud/ads/marketing_platform might be really "core" google business,
where Plyvel is more of a "side" thing and could be fully separated out. I
think it is much easier to separate out plyvel as a provider than splitting
google "core" one.
I am quite torn on this one. I see the benefits of keeping all those as
single "provider" - on the other hand the benefits of splitting it are
"potentially good" but it makes them also difficult to work together when
installed with all the dependencies - it will be easy for example to run in
a situation where "cloud" dependencies are different (and conflicting) with
"ads" dependencies. I think until we figure out (possibly) some way of
separating the dependencies out between tasks, this might be a difficult
one to tackle and rather than "origin" (google), the "independence" of a
package might be more important.
I think the "extras" approach where plyvel is "optional" google extra
dependency is maybe not perfect, but it kind of join both worlds. On one
hand we have them together by "ownership", on the other hand we have just a
problem of "environment" set properly for plyvel to work (so plyvel db libs
installed). I think the current approach is kinda sustainable and does not
introduce too much of complexity.
As usually with "compromise" solutions, it's not perfect but maybe it's
"good enough"? As long as it works, it could be "OK". I am also quite OK to
separate just plyvel as @anitakar <https://github.com/anitakar> suggested.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#15933 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABE6AZBUE4CM5Z3JL4JJZUTTO7NEZANCNFSM45D4T6YA>
.
|
Yeah. It will be released in the next wave of providers. And I also think the [plyvel] extra is a good option especially that this is really just a dependency. The plyvel provider is still there it will just fail with missing import if it's not installed with extra (same as a number of other cases, where we have cross-provider dependency.
Yeah. I am a bit torn myself. I see a value in having the "google" provider to get "everything-google". It it 'biggish" but on the other hand, it's also a kind of "good" to promote the whole platform (no need to install anything else to use "another" Google operator - it's the same with AWS (though there are far less number of of operators) and a bit different in Microsoft as there are really different "domains" there (winrm, mssql, azure). Actually I like the idea of the poll from @anitakar ! Just a thought... Maybe we can use Airflow Summit to run a number of mini-pols where we could prepare a list of such questions and make people more engaged? That could be one of a number of questions that we could ask. I will raise it to Airflow Summit organizers :) |
+1 to the polls on Airflow Summit. |
I think the google provider became more trouble than worth. |
Did the poll actually happen? (I wasn’t following closely and may have missed it.) |
Nope. The poll did not happen.
But I agree we might want to split it. It's not a MUST, but It might be
wise to do it. However if we are to split it we need to discuss and figure
out what should be the best strategy (probably an "umbrella" provider that
brings all sup-providers as dependencies to keep backwards compatibility.
It also requires - I think - moving the sub-packages to their own
sub-modules and keeping deprecation imports from sub-packages.
Kurtqq - would you be willing to help with that if we start the effort?
A number of things will have to be changed and it's a lot of
semi-automated work to be done (we can likely automate some parts but a lot
of it will be to manually move things).
Who else would like to help with this if we decide to do it ?
J.
…On Fri, Jul 30, 2021 at 10:28 AM Tzu-ping Chung ***@***.***> wrote:
Did the poll actually happen? (I wasn’t following closely and may have
missed it.)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#15933 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAERMI4RW3EVFPPR54XOJQTT2JPEXANCNFSM45D4T6YA>
.
--
+48 660 796 129
|
yes I might be able to help but what exactly are the tasks? |
hi @potiuk @dstandish is this planned? |
No - not planned maybe you would like to take a lead on it and itntroduce it simply (though it will likely mean quite some thinking on how to do backwards compatibility). |
To be more specific, backward compatibility should be conceptually simple—we just need to keep publishing an aggregated Google provider that depends on all the Google providers—but building an infrastructure to build this aggregated provider can be more challenging. |
In general yes, but there are a few watch outs there - first of all most Google Providers depends on the common code - Base Hooks, common decorators, utils. And this means that there should also be a a 'google-common' package most likely and then there is a complexity of version dependencies between all those. When you take into account all the scenarios With various versions o those and potential cross dependencies between the split providers, this is its own small dependency-hell. The more I think of it, the more i see this one will introduce far more problems than it solves. |
@potiuk I would say that common should not change that often and if it does it probably won't add additional dependencies- it's mostly code that you already need for the service it's just being used by another service so it's in a higher hierarchy. Why can't the provider always work with the latest common version ? |
I think should is the key problem here. The initial split is quite easy, but what would happen after few releases is scary like hell. "Keep backwards compatibiliity of commons" in this case is quite a bit of wishful thinking and tt's far easier said than done and it requires constant vigiliance and fixes of accidental incompatibilities that will eventually creep-in leading to uncontrollable growth in complexity. The problem is when people start refactoring and add code that will break the compatibility accidentally or when you want to do a refactor that will improve the common code but introduces compatibilty isses.. We already saw examples of that with db_api: see the comment here: https://github.com/apache/airflow/blob/main/airflow/hooks/dbapi.py#L45 People making changes to such common code (and often even reviewwers/maintainers) might not realise the consequences of their changes on already released packages. Some changes will accidentally break compatibilities even if you are careful. It basically requires that all the relased google-providers are FUTURE compatible with alll the released version of the "common" package. Unless you have full test suite that can handle various cases, and make sure that the "common package" will work with ALL already released and compatible providers that people have, thiere is no way to "make sure" it's the case. And even if we can add such a test suite (which is possible to some extent just very costly on maintenance and running), this prevents you from doing more "bold" refactorings - which is generally very bad side-effect of such approach. I think ability to refector code is crucial to maintainability. For example now, we have such a test suite for all providers - we make sure that they import without warnings on Airflow 2.1 in our CI. But this is not a full guarantee they will work with Airflow 2.1 - this test is just a "smoke" test - but it already caught at least 2 cases of seemingly "innocent" change that would make all such providers stop working on 2.1. Of course you can also introduce "breaking" changes in the common code (and release 2.0 package), but then this inevitably leads to one of two things:
The question here is what is more costly (and for whom): a) complexity of maintenance of compatibilities between different versions of common packages and released "google" packages, with potential ability for the user to upgrade only some parts or b) complexity for the users who have to adapt to potentially more frequently handling breaking changes with new "full gogle provider" release I tnink the a) one is something that will grow more and more complex for maintainers over time, where b) is kinda "stable" - it requires some regular effort from the users but in a long run it is esiear to handle by them. Also a) has one very uncomfortable - for open-source project at least - property. I immediately imagine many issues opened by the users "i want to install google-ads-6 and google-gcs-3 because this and that and they do not work together because they require different "commons". Just conversations about that and explaining what can and cannot be done will take a good chunk of time for maintainers who will know how it works. Explaining that in docs will be next to impossible I am afraid. Right now we avoid all those conversations by releasing a single google provider. The conversation is very simple: "if you want to use google ads which were added in provider 5.0 you need to also adapt all other google properties to the version that is there". End of story. By having multiple providers/versions, the amount of user stories here grows exponentially large. Also I think in a long term you will not avoid the "breaking all" releases anyway. Users will have to do it anyway - only it will cost them more because they will have to do it much less frequently (counter-intuitively). There is an old saying that if migration is painful - just do it more often. The "split providers" approach leads to potentially less frequent, but more painful upgrades to our users + adds effort on maintenance of it on committers. The "single provider" means that you potentially must do more frequent but less paindul upgrades (say every month when you upgrade google provider). But there is no solution with "no pain". I am not saying all this is impossible - technically splitting the google provider is possible. I just think that the person (or rather grouop) who commits doing it realises the consequences and takes the burden of organising it in the way that we do not land in "dependency" hell. I personally currently have not enough courage to commit to it to be honest. |
And just one more comment here: It also depends on the "Extent" of the split of course. If we are talking for example about extracting the "plyvel" out of Google provider - by all means let's do it. Adding Plyvel to google provider was simply a mistake - it does not use any common code and conceptually it is very different than other integrations. But for all the "cloud" and "marketing_platform" I'd say there are far too many things in common that separating them out is worth the cost. |
To my perspective it's not the same case as compatibility issue with Airflow core (like the dbApi hook) I checked the google common code: I'm very OK with the alternative suggested. For me the goal is just to reduce the amount of services that installed directly with the google provider. The more services we can set as optional the better. After we do that we can revisit further splitting if needed. So the action items to resolve this issue are: |
I hope I am not exaggerating, but my experience tells me different story (but maybe, of course when handled differently it can be made simple). It's just I've seen what happened with provider split and how much time after the split it took to sort out all the various cases/incompatibilities and cross-version problems (and build processes around maintaining those). But if there is someone who would like to take the lead on the split and work out the protections and make sure the cost of maintenance is low - I will certainly not be opposing it, I am super happy to help!. And happy to answer any questions and help to build and test all the infrastructure. It will not be "split and done". My point here is that this is a long-haul and requires to work out the process of maintenance, managing to respond to user's questions, document it properly, make frequent decisions on whether to accept and release changes that are "possibly" incompatible and how to version those separate providers and core. Raealistically speaking - once you split you should expect half a year of monthly releases and get various user stories in that will be addressed, then you will be past the initiaal problems, and within the year or two you might expect breaking changes in Google API that wil require some backwards-incompatible changes of all such providers. And it needs committment to lead not only the split but also what happens after, including communication with the users, sorting out potential incompatibilities, and handling/reviewing multiple PRs with care and focus on compatibility and potentiallybreaking changes. That's it. For me it's scary to make such committment, but maybe someone would like to take that on? |
Just to comment on that part - those cross-provider dependencies are (I believe) way simpler. They use very stable (by definition) APIs that are shared between providers now (Hooks). That's about it. And we already had 1 case where there was a breaking change which required to document it in the changelog, add cross-provider-versioned-dependency with quite some communication overhead. What we are talking about in Google is a deep integration with a number of common functionalities revolving mostly about authentication but also some "common" functionality that is shared and expectation is (and current state) that if we implement it for one service, it is implemented for all of them at once. This is far closer to the DBAPI issues than cross-provider dependencies IMHO. Just to give a realistic example and where it would create problems. An example of "breaking" change I am talking here was adding "impersonation chain" to GoogleBaseHook and a number of "services" #9915. This was a functionality added acrross pretty much all google services (it was done before we released Airflow 2 and split providers) and if we did not have all those changes in one release, the risk of introducing problems when implementing it in fully backwards compatible way in the common part was high. You'd either:
or
Case 2) is an absolute nightmare from the code/maintenance point of view. You cannot rely on having new "common" code (it has to work with already released commons-1.0). So basically what you have to do is to properly handle the "if impersonation available" code in all the providers in similar way. The terrible part of it is tha you have to copy the if-ed code to all split providers that use impersonation - because you can only rely on the common code that is out there in commons 1.0. There are yet a number of authentication features we do not have and at some point we might want to (or need to) add them. Surely - we could make it a "breaking" change and release all providers only depending on the new commons then. Those changes are not "often" but IMHO those are super painful. This is because in this case our users completely loose the benefits of using "separately released" providers. Users who will be used to just upgrading one "split" provider at a time will complain with exactly this thing: "I am using gcs 2.0" but in order to use this new feature from ads 4.0 I have to upgrade to gcs 5.0 because I have to upgrade "commons" to 2.0 and it is not compatible with my version. I still want to continue using gcs 2.0 and get the feature from ads 4.0 - how can I do it?" . IMHO it's much better to have our users get "used" to upgrading all the google stuff together rather than risk this kind of problems. But once again - if someone woudl like to take a lead on that, I am happy to help to set it up :). |
Hello guys
Airflow 2.0.2 Thanks! |
@rodrigo-fss -> Plyvel is not a requirement for Google Provider any more - it's optional dependency - you do not have to install it. But in any case you need to upgrade to 2.1+ as google providers are already not compatible with 2.0.* line so you are using a really old version of those and even if we split providers, you will have to upgrade anyway to use that. So upgreding to Airflow 2.1+ (Ideallly 2.3+) is the RIGHT solution to your problem, not waiting for splitting the provider. |
The
Google
provider is huge. It's by far the biggest provider out there and it has a number of sometimes more closely coupled and sometimes not coupled at all (leveldb) operators.We should consider splitting the Google operator into smaller pieces, but that would involve some more decisions:
This is more of opening the discussion than making decisions yet.
The text was updated successfully, but these errors were encountered: