Add pre-commits preventing accidental API changes in common.sql #27946
Conversation
cc: @vincbeck @alexott @oliviervg1 @kazanzhy - these are the pre-commit tests I described in our previous discussions. They should flag any changes to the common.sql API when they happen, so that it is not easy to change the API accidentally, and so that reviewers can immediately see that the API has been changed. When there are changes in the API, errors are printed pointing out exactly what changed.
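A minimal sketch of the snapshot-and-diff idea behind such a check (not the actual hook code from this PR - the snapshot file name is a placeholder, and only plain positional arguments are collected):

```python
# Sketch: compare the current public API of a module against a committed,
# reviewed snapshot, and fail if anything was removed.
import ast
import sys
from pathlib import Path


def public_api(source: str) -> set[str]:
    """Collect names and argument lists of public (non-underscore) functions."""
    api = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and not node.name.startswith("_"):
            args = ", ".join(a.arg for a in node.args.args)  # simplified: positional args only
            api.add(f"{node.name}({args})")
    return api


if __name__ == "__main__":
    snapshot = Path("common_sql_api.txt")  # committed, reviewed snapshot (placeholder name)
    current = public_api(Path(sys.argv[1]).read_text())
    removed = set(snapshot.read_text().splitlines()) - current
    if removed:
        print("Public API items removed - a deliberate snapshot update is required:")
        print("\n".join(sorted(removed)))
        sys.exit(1)
```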
BTW. Once we see this works, we should be able to use a very similar approach (and reuse the code) for the Airflow core API. That should be a prerequisite for splitting providers.
I reviewed and clarified the description of the API verification process in README_API.md (it's clearer and simpler, I hope). I've also added a "Possible future work" chapter explaining how the results of experimenting with the API in the common.sql provider could be extended. This might be useful not only for cross-provider dependencies, but also for the core API that Airflow exposes to providers. We've discussed it several times before, and from those discussions I think having such comprehensive API "snapshot" and "verification" tooling is a prerequisite for the future splitting of providers into separate repos (if we decide to do that, of course). Eventually we might give external provider developers a tool to verify whether the providers they develop are API-level compatible with certain versions (+) of Apache Airflow. With an approach similar to what I described, we might finally be able to say "these are the official Airflow classes/methods you can use and rely on", and also give provider developers a mechanism to rely on them - we will be able to test this for our own providers (once/if we split them into separate repositories).
That's a really nice feature!
Indeed. I found this kind of approach works really well for Android OS (I've been developing a custom Android OS and migrating between different API versions). It's quite a bit more of a complex beast for Android, but it seems we are getting to the point where implementing something similar makes sense, as it might prevent some problems we have already started to experience.
Is this really a "pre-commit" check or should this just be in the unit tests?
(We seem to be adding more and more checks to pre-commit over time. Is there any rhyme or reason to what goes where?)
I love this idea; however, I think using https://mypy.readthedocs.io/en/stable/stubgen.html might be an even better fit -- it captures (I think) everything we want, and we have far less code to maintain as a result. Oh: it might not let us capture only the specific "private" things we want. Humbug.
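For reference, a minimal stubgen invocation along these lines (the module path is the real common.sql hooks module; the output directory is arbitrary):

```python
# Sketch: generate a draft .pyi stub for the common.sql hooks module using
# mypy's stubgen (shipped with mypy).
import subprocess

subprocess.run(
    ["stubgen", "-m", "airflow.providers.common.sql.hooks.sql", "-o", "out"],
    check=True,
)
# Note: by default stubgen omits most _private names. It does have an
# --include-private flag, but that is all-or-nothing - which is exactly the
# "specific private things" caveat raised above.
```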
I think for me the rule of thumb is what we are testing.

The checks above verify that: a) the API has not changed accidentally. IMHO, if we need to run AST parsing on Airflow code, that is a clear indication we are analysing the code of Airflow, not the functionality of Airflow. So both of these checks seem to fit in the second camp.

There are other criteria - such as how fast things are and whether they are feasible to run in pre-commit, i.e. before each commit (if you have pre-commit installed). Which also boils down to whether we can run them super-selectively - only for changed files in some very selected part of the code - and still be pretty sure about the result. This is what pre-commit gives us. In this case - yes - we run the checks for common.sql in the first case and for all providers in the other, and they give good results even if we only change a few files and run pre-commit - it should fail the same as when we run it for all applicable files. So yeah - this one very firmly goes into pre-commits.

Side comment: as discussed previously with @vincbeck, we could add "regular" unit tests (or rather system tests following AIP-47) to add similar checks on the "whether it works" level #27854 (comment) - but this seems to be quite an impossible feat, because we would have to run tests with literally every released version of the other providers.

Another side comment: there is an upcoming example which will go into unit tests, and it explains where the difference lies. For AIP-44 I plan to run ALL APPLICABLE unit tests of ours in a "db-less" mode - i.e. with only the internal API server running - making sure our code does not execute any SQLAlchemy/DB operation, simply to make sure that the Airflow worker, DAG file processor and triggerer are truly DB-less. That case falls precisely into the other camp - where we want to actually "run" the real Airflow code and see that it behaves as expected.
Interesting idea. I will see if I can bend stubgen to our case. I would also prefer not to have our own code doing this AST parsing (I've learned a bit and it's simple, but not having to do it would be a plus). The "diff" part will work very similarly, and a .pyi file seems to fulfil the criteria I have for the API definition very well. The not-really-private-even-if-it-should-be part might indeed be a blocker, but maybe I can bend it to our will a bit. One difficulty with the .pyi file is that we would have to parse it anyway to check whether our code actually uses only the "exposed" API functions - I am not sure we can run any check that tells us "am I only using the stuff from that .pyi file and nothing else?". But the idea made me think - maybe we can do something a bit different: instead of generating the API file, we could generate the .pyi files and let MyPy do the verification for us. I will explore this.
Hmm. I think we should be able to add some post-processing of the stub files to include the _private stuff. MyPy actually does an even better job than our rudimentary check in this case - it detected another case that we should handle (which is enough of a justification for me to really bend it to our will). I just have to make sure to surface the information to the user.
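A rough illustration of the kind of post-processing meant here (PRIVATE_KEEP is a hypothetical allow-list of signatures; the eventual solution in #27962 handles this differently):

```python
# Sketch: re-append selected private signatures that stubgen drops.
from pathlib import Path

PRIVATE_KEEP = [
    "def _helper_used_by_other_providers(sql): ...",  # hypothetical name
]


def postprocess(stub_path: Path) -> None:
    """Append allow-listed private signatures missing from a generated stub."""
    text = stub_path.read_text()
    missing = [sig for sig in PRIVATE_KEEP if sig not in text]
    if missing:
        stub_path.write_text(text.rstrip() + "\n" + "\n".join(missing) + "\n")
```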
Oh nice.
Yeah, perhaps we have them generated/stored in the normal location (next to the source files).
Same criteria but different view: to me, ensuring that we maintain backwards compatibility of the provider is part of the real functionality. Ultimately it doesn't matter, so long as it's tested somewhere. (If I were writing it I would have put it in unit tests, but I'm not, so it doesn't really matter where it is 😄)
But we do not actually "run" those methods - we just statically verify that they are correct.
Yep. I am sending a PR with that shortly.
The mypy change is in #27962 - overall it's even more code eventually (multiple stub files are generated, and I also had to handle black formatting of the resulting code, as well as nicely handling cases where whole Python files are deleted or renamed). I think it is totally worth it though - also because the .pyi files will be part of the provider package, so not only will we check the API during our pre-commits, but any user of the common.sql provider who uses mypy will get errors when using internal methods. So ultimately I think this was a great suggestion @ashb
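A condensed sketch of those two housekeeping chores (formatting freshly generated stubs and pruning stubs whose source module is gone); the stub root path is illustrative:

```python
# Sketch: black-format generated .pyi stubs and delete orphaned ones.
import subprocess
from pathlib import Path

STUB_ROOT = Path("airflow/providers/common/sql")  # illustrative path

for stub in STUB_ROOT.rglob("*.pyi"):
    if stub.with_suffix(".py").exists():
        subprocess.run(["black", str(stub)], check=True)
    else:
        stub.unlink()  # the corresponding .py module was deleted or renamed
```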
closing in favour of #27962
BTW. There are projects (we are not one of them) with real, super-fast unit tests (taking a fraction of a second to run), and people often run at least those fast tests in pre-commits. This saves a lot of iteration time on CI with very limited impact on developer productivity - whether a commit takes 100ms longer does not really matter, but avoiding a CI round-trip just to find out that things do not work saves a lot of time and focus. So ultimately for me the choice is:
Getting feedback as soon as possible, without unnecessary steps, is very important. Maybe one day we could even move some of the "fast" tests to run as pre-commit hooks (especially once we separate out providers and separate DB-requiring tests from those that do not require a DB). Then we should be able not only to run those tests in a local venv (like most projects do) and use tox (like a lot of projects do), but also to run the unit tests as part of pre-commit. That would be my dream setup after we separate providers out. On CI we would then only have to run the "heavy" tests that are left - and of course the fast ones as well. And if we apply the technique from this PR to generate a "true" API interface of Airflow for providers, we might not even have to run provider compatibility tests for Airflow. We are not that far from it, I believe. If we apply a similar approach to define and separate the Airflow core API that Airflow exposes to providers, and add a mechanism to control changes to those APIs similar to this PR, we will have the final piece needed to attempt the split of providers (if we decide we want it, of course).
We recently had some mishaps where changes to the common.sql provider accidentally changed its API, and they passed both tests and reviews.
Common.sql is a pretty special provider because it is used by other providers, and we have to make sure the API it exposes stays backwards compatible - otherwise upgrading the common.sql provider might break those providers.
This had already happened:
* Common sql bugfixes and improvements #26761 refactored the common.sql provider and removed some methods that were actually used by the Google provider
* This caused the problem described in apache-airflow-providers-common-sql==1.3.0 breaks BigQuery operators #27838 and forced us to yank common.sql 1.3.0 and release 1.3.1 with the methods added back in Restore removed (but used) methods in common.sql #27843
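To make the coupling concrete, downstream providers import directly from common.sql, so a removed method only surfaces as a failure when those providers run against the newer common.sql release (the hook subclass below is hypothetical; the import path is the real one):

```python
# Downstream providers build on common.sql classes, so the methods those
# classes expose must stay backwards compatible.
from airflow.providers.common.sql.hooks.sql import DbApiHook


class MyVendorHook(DbApiHook):  # hypothetical downstream hook
    """Relies on inherited DbApiHook methods staying available."""
```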
This change introduces tools and a process to flag common.sql API changes that are likely backwards incompatible.
They require a deliberate action from the contributor who introduces a breaking API change to common.sql, and they also flag to the reviewer when such a change is made.
The idea is based on how Android OS handles flagging API changes for its public APIs.