Add pre-commits preventing accidental API changes in common.sql
We recently had some mishaps where changes to the
common.sql provider accidentally changed the API of the common.sql
provider, yet passed both tests and reviews.

Common.sql is a pretty special provider because it is used by other
providers, and we have to make sure the API it exposes is backwards
compatible; otherwise upgrading the common.sql provider might break
those providers.

This had already happened:

* apache#26761 refactored the common.sql provider and removed some
  methods that were actually used by the Google provider

* This caused the problem described in apache#27838 and forced us to
  yank common.sql 1.3.0 and release 1.3.1 with the methods
  added back in apache#27843

This change introduces tools and a process to flag common.sql API
changes that are likely not backwards compatible.

They require a deliberate action from the contributor who
introduces a breaking API change to common.sql, and they also
flag such changes to reviewers.

The idea is based on how Android OS handles flagging API changes
for their public APIs.
potiuk committed Nov 27, 2022
1 parent 15e842d commit 2fdf489
Showing 10 changed files with 743 additions and 164 deletions.
14 changes: 14 additions & 0 deletions .pre-commit-config.yaml
@@ -357,6 +357,20 @@ repos:
files: ^dev/breeze/src/airflow_breeze/utils/docker_command_utils\.py$|^scripts/ci/docker_compose/local\.yml$
pass_filenames: false
additional_dependencies: ['rich>=12.4.4']
- id: update-common-sql-api
name: Check and update common.sql API
entry: ./scripts/ci/pre_commit/pre_commit_update_common_sql_api.py
language: python
language_version: "3.7"
files: ^scripts/ci/pre_commit/pre_commit_update_common_sql_api\.py$
pass_filenames: false
additional_dependencies: ['rich>=12.4.4', 'astor']
- id: check-common-sql-api-usage
name: Check usages of the common.sql API in other providers
entry: ./scripts/ci/pre_commit/pre_commit_check_common_sql_api_usage.py
language: python
files: ^airflow/providers/.*\.py$
additional_dependencies: ['rich>=12.4.4']
- id: update-providers-dependencies
name: Update cross-dependencies for providers packages
entry: ./scripts/ci/pre_commit/pre_commit_update_providers_dependencies.py
4 changes: 4 additions & 0 deletions STATIC_CODE_CHECKS.rst
@@ -155,6 +155,8 @@ require Breeze Docker image to be build locally.
+--------------------------------------------------------+------------------------------------------------------------------+---------+
| check-changelog-has-no-duplicates | Check changelogs for duplicate entries | |
+--------------------------------------------------------+------------------------------------------------------------------+---------+
| check-common-sql-api-usage | Check usages of the common.sql API in other providers | |
+--------------------------------------------------------+------------------------------------------------------------------+---------+
| check-core-deprecation-classes | Verify using of dedicated Airflow deprecation classes in core | |
+--------------------------------------------------------+------------------------------------------------------------------+---------+
| check-daysago-import-from-utils | Make sure days_ago is imported from airflow.utils.dates | |
@@ -305,6 +307,8 @@ require Breeze Docker image to be build locally.
+--------------------------------------------------------+------------------------------------------------------------------+---------+
| update-breeze-readme-config-hash | Update Breeze README.md with config files hash | |
+--------------------------------------------------------+------------------------------------------------------------------+---------+
| update-common-sql-api | Check and update common.sql API | |
+--------------------------------------------------------+------------------------------------------------------------------+---------+
| update-er-diagram | Update ER diagram | * |
+--------------------------------------------------------+------------------------------------------------------------------+---------+
| update-extras | Update extras in documentation | |
100 changes: 100 additions & 0 deletions airflow/providers/common/sql/README_API.md
@@ -0,0 +1,100 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Keeping the API of common.sql provider backwards compatible

The API of the `common.sql` provider should be kept backwards compatible with previously released versions.
The reason is that there are already released providers that use the `common.sql` provider and rely on its
API and behaviour, and any update to either might break those released providers.

Therefore, we should take extra care when changing the APIs.

The approach we take is similar to the one applied by the Android OS team to keep their API in check,
and it is based on storing the current version of the API and flagging changes that are potentially breaking.
This is done by comparing the previous API (stored in [api.txt](api.txt)) with the upcoming API from the PR.
The upcoming API is automatically extracted from the `common.sql` Python files by the `update-common-sql-api`
pre-commit. If the comparison determines that the change is potentially breaking, the contributor is asked
to review the changes and manually regenerate [api.txt](api.txt).
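The extraction step can be illustrated with a simplified sketch based on Python's `ast` module. This is a hypothetical approximation, not the actual pre-commit script (which also records signatures and default values); here only public top-level names are collected:

```python
import ast


def dump_public_api(source: str, module: str) -> list[str]:
    """Emit api.txt-style lines for public classes, methods, and functions in `source`."""
    tree = ast.parse(source)
    lines: list[str] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and not node.name.startswith("_"):
            lines.append(f"{module}.{node.name}()")
        elif isinstance(node, ast.ClassDef) and not node.name.startswith("_"):
            lines.append(f"{module}.{node.name}")
            for item in node.body:
                # Keep public methods; __init__ is part of the public surface too
                if isinstance(item, ast.FunctionDef) and (
                    not item.name.startswith("_") or item.name == "__init__"
                ):
                    lines.append(f"{module}.{node.name}.{item.name}()")
    return sorted(lines)
```

Names starting with `_` are skipped, mirroring the "protected members are excluded" rule described below.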

The details of the workflow are as follows:

1) The previous API is stored in the [api.txt](api.txt) file in this folder (committed to the repository).
2) Every time the `common.sql` Python files are modified, the `update-common-sql-api` pre-commit regenerates
the API file content and checks whether potentially breaking changes (removals or updates of
existing classes/methods) are detected.
3) If the check reveals there are no changes to the API, nothing happens.
4) If there are only additions, the pre-commit automatically updates the [api.txt](api.txt) file,
asks the contributor to commit the resulting update to [api.txt](api.txt), and fails the pre-commit. This
is very similar to other static checks that automatically modify/fix source code.
5) If the pre-commit detects potentially breaking changes, the process is a bit more involved for the
contributor. The pre-commit flags such changes by failing and asks the contributor to review the
change, looking specifically for breaks in compatibility with previously released providers (and to
fix any backwards incompatibility). Once this is completed, the contributor is asked to manually and
explicitly regenerate and commit the new version of the API by running the pre-commit with a manually
added environment variable: `UPDATE_COMMON_SQL_API=1 pre-commit run update-common-sql-api`.
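The comparison in steps 3-5 boils down to a set difference between the stored and regenerated dumps. A minimal sketch (the function name is illustrative, not the actual script's API):

```python
def classify_api_change(stored: str, regenerated: str) -> tuple[set[str], set[str]]:
    """Return (removals, additions) between two API dumps.

    Removals are potentially breaking; pure additions are safe to auto-commit.
    """
    def api_lines(text: str) -> set[str]:
        # Ignore blank lines and the generated comment header
        return {line for line in text.splitlines() if line.strip() and not line.startswith("#")}

    old, new = api_lines(stored), api_lines(regenerated)
    return old - new, new - old


removed, added = classify_api_change("pkg.Hook\npkg.Hook.run()", "pkg.Hook\npkg.Hook.run()\npkg.helper()")
# An empty `removed` set means only additions, so api.txt can be updated automatically.
```

Note that because each API line includes the full signature, a changed signature shows up as one removal plus one addition, so it is correctly flagged as potentially breaking.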

# Verifying that other providers use only the public API of the `common.sql` provider

The file holding the current API is used by a second pre-commit: `check-common-sql-api-usage`.

This pre-commit does not perform any updates - it only checks that the other providers use only the
public API stored in [api.txt](api.txt) when accessing the `common.sql` provider. The reason is that some
methods and classes in common.sql are not supposed to be part of the API - they are helper methods
and classes. Those usually start with `_` (protected members) and are excluded.

However - historically - some protected methods have been used by other providers, so until we determine
it is safe to remove them, the pre-commit deliberately keeps them as part of the API and they are
protected by this mechanism. This way we can be more certain that the community providers that use the
`common.sql` provider only use the APIs that are intended to be public.

Of course, this is never 100% fool-proof, due to Python's dynamic behaviour and possible internal
changes in the behaviour of the common.sql APIs, but at the very least any "interface" change will be
flagged to the contributor making it, and reviewers will be notified whenever any API change happens.
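The usage check can be approximated with a short sketch: load the allowed names from `api.txt` (stripping signatures) and flag `from ... import` statements in provider files that reference anything else. Function names here are illustrative, not the actual script's:

```python
import ast


def load_allowed_names(api_txt: str) -> set[str]:
    """Parse api.txt-style content into fully-qualified names, stripping signatures."""
    names = set()
    for line in api_txt.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        names.add(line.split("(")[0])
    return names


def find_disallowed_imports(source: str, allowed: set[str]) -> list[str]:
    """Flag imports from airflow.providers.common.sql that are not in the allowed API."""
    bad = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.ImportFrom)
            and node.module
            and node.module.startswith("airflow.providers.common.sql")
        ):
            for alias in node.names:
                fqn = f"{node.module}.{alias.name}"
                if fqn not in allowed:
                    bad.append(fqn)
    return bad
```

A real check would also need to handle attribute access on imported modules; this sketch covers the common `from ... import X` pattern only.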

# External usage, tracking API changes

The `api.txt` file might be used by any external provider developer to make sure that only official
APIs are used, which will also help those external providers use the `common.sql` provider
functionality without the fear of relying on methods and classes that are deemed non-public.

Since this file is stored in the repository, we can automatically generate the API differences between
future released versions of the `common.sql` provider, and we will be able to easily reason about API
changes between versions.

# Possible future work

This mechanism providing the "official" API of the `common.sql` provider is just the first attempt to
standardise the APIs provided by various Airflow components. After experimenting with and testing its
behaviour for `common.sql`, we might decide to extend it: keeping a formal, automatically generated and
maintained list of official classes and methods that providers can use. Once we get used to it and
confirm that it is a good mechanism, we will be able to extend this approach to more cases:

* for other cross-provider dependencies, we will be able to flag potentially breaking changes between them
* we should be able to use a very similar mechanism for exposing and flagging the public API of Airflow's
  core and flag any misuse of its methods in the community providers
* as a follow-up of those two, we should be able to produce a more "formal" API that externally written
  providers could use, and write a "compatibility tool" able to verify whether an externally written
  provider is compatible (on the API level) with a certain version of Airflow or, in the case of
  cross-provider dependencies, with certain versions of the released providers
69 changes: 69 additions & 0 deletions airflow/providers/common/sql/api.txt
@@ -0,0 +1,69 @@
# This is a common.sql provider API dump
#
# This file is generated automatically by the update-common-sql-api pre-commit
# and it shows the current API of the common.sql provider
#
# Any potentially breaking change in the API will require a deliberate manual action from the contributor
# making a change to the common.sql provider. This file is also used to check that only the public API of
# the common.sql provider is used by all the other providers.
#
# You can read more in the README_API.md file
#
airflow.providers.common.sql.hooks.sql.ConnectorProtocol
airflow.providers.common.sql.hooks.sql.ConnectorProtocol.connect() -> Any
airflow.providers.common.sql.hooks.sql.DbApiHook
airflow.providers.common.sql.hooks.sql.DbApiHook.__init__(schema<KW-ONLY>: str | None = None, log_sql<KW-ONLY>: bool = True, *args, **kwargs)
airflow.providers.common.sql.hooks.sql.DbApiHook.bulk_dump()
airflow.providers.common.sql.hooks.sql.DbApiHook.bulk_load()
airflow.providers.common.sql.hooks.sql.DbApiHook.get_autocommit() -> bool
airflow.providers.common.sql.hooks.sql.DbApiHook.get_conn()
airflow.providers.common.sql.hooks.sql.DbApiHook.get_cursor()
airflow.providers.common.sql.hooks.sql.DbApiHook.get_first(self = None) -> Any
airflow.providers.common.sql.hooks.sql.DbApiHook.get_pandas_df(self = None, **kwargs)
airflow.providers.common.sql.hooks.sql.DbApiHook.get_pandas_df_by_chunks(self = None, chunksize<KW-ONLY>, **kwargs)
airflow.providers.common.sql.hooks.sql.DbApiHook.get_records(self = None) -> Any
airflow.providers.common.sql.hooks.sql.DbApiHook.get_sqlalchemy_engine(self = None)
airflow.providers.common.sql.hooks.sql.DbApiHook.get_uri() -> str
airflow.providers.common.sql.hooks.sql.DbApiHook.insert_rows(self = None, table = 1000, rows = False, **kwargs)
airflow.providers.common.sql.hooks.sql.DbApiHook.last_description() -> Sequence | None
airflow.providers.common.sql.hooks.sql.DbApiHook.run(self = False, sql: str | Iterable = None, autocommit: bool = None, parameters: Iterable | Mapping | None = False, handler: Callable | None = True) -> Any | list | None
airflow.providers.common.sql.hooks.sql.DbApiHook.set_autocommit()
airflow.providers.common.sql.hooks.sql.DbApiHook.split_sql_string() -> list
airflow.providers.common.sql.hooks.sql.DbApiHook.strip_sql_string() -> str
airflow.providers.common.sql.hooks.sql.DbApiHook.test_connection()
airflow.providers.common.sql.hooks.sql.fetch_all_handler() -> list | None
airflow.providers.common.sql.hooks.sql.fetch_one_handler() -> list | None
airflow.providers.common.sql.hooks.sql.return_single_query_results()
airflow.providers.common.sql.operators.sql.BaseSQLOperator
airflow.providers.common.sql.operators.sql.BaseSQLOperator.__init__(conn_id<KW-ONLY>: str | None = None, database<KW-ONLY>: str | None = None, hook_params<KW-ONLY>: dict | None = None, retry_on_failure<KW-ONLY>: bool = True, **kwargs)
airflow.providers.common.sql.operators.sql.BaseSQLOperator.get_db_hook() -> DbApiHook
airflow.providers.common.sql.operators.sql.BranchSQLOperator
airflow.providers.common.sql.operators.sql.BranchSQLOperator.__init__(sql<KW-ONLY>: str, follow_task_ids_if_true<KW-ONLY>: list, follow_task_ids_if_false<KW-ONLY>: list, conn_id<KW-ONLY>: str = default_conn_id, database<KW-ONLY>: str | None = None, parameters<KW-ONLY>: Iterable | Mapping | None = None, **kwargs) -> None
airflow.providers.common.sql.operators.sql.BranchSQLOperator.execute()
airflow.providers.common.sql.operators.sql.SQLCheckOperator
airflow.providers.common.sql.operators.sql.SQLCheckOperator.__init__(sql<KW-ONLY>: str, conn_id<KW-ONLY>: str | None = None, database<KW-ONLY>: str | None = None, parameters<KW-ONLY>: Iterable | Mapping | None = None, **kwargs) -> None
airflow.providers.common.sql.operators.sql.SQLCheckOperator.execute()
airflow.providers.common.sql.operators.sql.SQLColumnCheckOperator
airflow.providers.common.sql.operators.sql.SQLColumnCheckOperator.__init__(table<KW-ONLY>: str, column_mapping<KW-ONLY>: dict, partition_clause<KW-ONLY>: str | None = None, conn_id<KW-ONLY>: str | None = None, database<KW-ONLY>: str | None = None, accept_none<KW-ONLY>: bool = True, **kwargs)
airflow.providers.common.sql.operators.sql.SQLColumnCheckOperator.execute()
airflow.providers.common.sql.operators.sql.SQLExecuteQueryOperator
airflow.providers.common.sql.operators.sql.SQLExecuteQueryOperator.__init__(sql<KW-ONLY>: str | list, autocommit<KW-ONLY>: bool = False, parameters<KW-ONLY>: Mapping | Iterable | None = None, handler<KW-ONLY>: Callable = fetch_all_handler, split_statements<KW-ONLY>: bool = False, return_last<KW-ONLY>: bool = True, **kwargs) -> None
airflow.providers.common.sql.operators.sql.SQLExecuteQueryOperator.execute()
airflow.providers.common.sql.operators.sql.SQLExecuteQueryOperator.prepare_template() -> None
airflow.providers.common.sql.operators.sql.SQLIntervalCheckOperator
airflow.providers.common.sql.operators.sql.SQLIntervalCheckOperator.__init__(table<KW-ONLY>: str, metrics_thresholds<KW-ONLY>: dict, date_filter_column<KW-ONLY>: str | None = ds, days_back<KW-ONLY>: SupportsAbs = -7, ratio_formula<KW-ONLY>: str | None = max_over_min, ignore_zero<KW-ONLY>: bool = True, conn_id<KW-ONLY>: str | None = None, database<KW-ONLY>: str | None = None, **kwargs)
airflow.providers.common.sql.operators.sql.SQLIntervalCheckOperator.execute()
airflow.providers.common.sql.operators.sql.SQLTableCheckOperator
airflow.providers.common.sql.operators.sql.SQLTableCheckOperator.__init__(table<KW-ONLY>: str, checks<KW-ONLY>: dict, partition_clause<KW-ONLY>: str | None = None, conn_id<KW-ONLY>: str | None = None, database<KW-ONLY>: str | None = None, **kwargs)
airflow.providers.common.sql.operators.sql.SQLTableCheckOperator.execute()
airflow.providers.common.sql.operators.sql.SQLThresholdCheckOperator
airflow.providers.common.sql.operators.sql.SQLThresholdCheckOperator.__init__(sql<KW-ONLY>: str, min_threshold<KW-ONLY>: Any, max_threshold<KW-ONLY>: Any, conn_id<KW-ONLY>: str | None = None, database<KW-ONLY>: str | None = None, **kwargs)
airflow.providers.common.sql.operators.sql.SQLThresholdCheckOperator.execute()
airflow.providers.common.sql.operators.sql.SQLThresholdCheckOperator.push()
airflow.providers.common.sql.operators.sql.SQLValueCheckOperator
airflow.providers.common.sql.operators.sql.SQLValueCheckOperator.__init__(sql<KW-ONLY>: str, pass_value<KW-ONLY>: Any, tolerance<KW-ONLY>: Any = None, conn_id<KW-ONLY>: str | None = None, database<KW-ONLY>: str | None = None, **kwargs)
airflow.providers.common.sql.operators.sql.SQLValueCheckOperator.execute()
airflow.providers.common.sql.operators.sql._parse_boolean() -> str | bool
airflow.providers.common.sql.sensors.sql.SqlSensor
airflow.providers.common.sql.sensors.sql.SqlSensor.__init__(conn_id<KW-ONLY>, sql<KW-ONLY>, parameters<KW-ONLY> = None, success<KW-ONLY> = None, failure<KW-ONLY> = None, fail_on_empty<KW-ONLY> = False, hook_params<KW-ONLY> = None, **kwargs)
airflow.providers.common.sql.sensors.sql.SqlSensor.poke()
2 changes: 2 additions & 0 deletions dev/breeze/src/airflow_breeze/pre_commit_ids.py
@@ -34,6 +34,7 @@
"check-breeze-top-dependencies-limited",
"check-builtin-literals",
"check-changelog-has-no-duplicates",
"check-common-sql-api-usage",
"check-core-deprecation-classes",
"check-daysago-import-from-utils",
"check-decorated-operator-implements-custom-name",
@@ -100,6 +101,7 @@
"ts-compile-and-lint-javascript",
"update-breeze-cmd-output",
"update-breeze-readme-config-hash",
"update-common-sql-api",
"update-er-diagram",
"update-extras",
"update-in-the-wild-to-be-sorted",
2 changes: 1 addition & 1 deletion images/breeze/output-commands-hash.txt
@@ -50,7 +50,7 @@ setup:version:123b462a421884dc2320ffc5e54b2478
setup:a3bd246c3a425f3e586d11bbdc8937cb
shell:51149096ed3fb28d2a01e0a5ad0146ef
start-airflow:4bec5d26386dc268ec2de2f637c85272
static-checks:61c8d96ffe3008bff89a9df39b423be2
static-checks:4f1fc165bae62536a2f91de96799ba11
stop:8969537ccdd799f692ccb8600a7bbed6
testing:docker-compose-tests:b86c044b24138af0659a05ed6331576c
testing:helm-tests:94a442e7f3f63b34c4831a84d165690a
