
support dag_ids arguments in db cleanup utility #24987

Closed
wants to merge 335 commits

Conversation

davidkl97

closes: #24828
This PR adds an optional dag_id_column to each table configuration, and an optional dag_ids list to the runtime logic that appends an IN condition in the _build_query function for tables that support this column.

Tests were added for the dag_ids argument, and the CLI parser and commands were updated to support the new argument.
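The idea above can be sketched roughly as follows. This is an illustrative sketch only: the names TableConfig and build_where are hypothetical stand-ins, and Airflow's real implementation builds a SQLAlchemy query in _build_query rather than a raw SQL string.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TableConfig:
    """Per-table cleanup configuration (hypothetical stand-in)."""
    table_name: str
    recency_column: str
    dag_id_column: Optional[str] = None  # new optional field from this PR


def build_where(config: TableConfig, before: str,
                dag_ids: Optional[List[str]] = None) -> str:
    """Build a WHERE clause, filtering by dag_ids when the table supports it."""
    clauses = [f"{config.recency_column} < '{before}'"]
    # Only tables that declare a dag_id_column get the extra IN condition.
    if dag_ids and config.dag_id_column:
        in_list = ", ".join(f"'{d}'" for d in dag_ids)
        clauses.append(f"{config.dag_id_column} IN ({in_list})")
    return " AND ".join(clauses)
```

Tables without a dag_id_column (e.g. tables with no DAG association) simply ignore the dag_ids argument.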



potiuk and others added 30 commits May 21, 2022 22:01
* Add typing for airflow/configuration.py

The configuration.py module did not have typing information, which made it rather difficult to reason about, especially since it went through a few changes in the past that made it rather complex to understand.

This PR adds typing information all over the configuration file.

(cherry picked from commit 71e4deb)
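As a small illustration of the kind of annotation being added (the function below is a hypothetical sketch, not the actual airflow/configuration.py code), typed signatures make the fallback behaviour of a config getter explicit:

```python
from typing import Dict, Optional


def get_option(config: Dict[str, Dict[str, str]], section: str, key: str,
               fallback: Optional[str] = None) -> Optional[str]:
    """Return a config value, or the fallback when the section/key is absent."""
    return config.get(section, {}).get(key, fallback)
```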
* Handle invalid date from query parameters in views.

* Add tests.

* Use common parsing helper.

* Add type hint.

* Remove unwanted error check.

* Fix extra_links endpoint.

(cherry picked from commit 9e25bc2)
* UnicodeDecodeError: 'utf-8' codec can't decode byte 0xXX in position X: invalid start byte

  File "/opt/work/python395/lib/python3.9/site-packages/airflow/hooks/subprocess.py", line 89, in run_command
    line = raw_line.decode(output_encoding).rstrip()            # raw_line ==  b'\x00\x00\x00\x11\xa9\x01\n'
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 4: invalid start byte

* Update subprocess.py

* Update subprocess.py

* Fix:  Exception when parsing log apache#20966

* Fix:  Exception when parsing log apache#20966

Another alternative is to wrap the decode in a try/except, e.g.:

```
line = ''
for raw_line in iter(self.sub_process.stdout.readline, b''):
    try:
        line = raw_line.decode(output_encoding).rstrip()
    except UnicodeDecodeError as err:
        # keep the previous value of `line`; just report the bad bytes
        print(err, output_encoding, raw_line)
    self.log.info("%s", line)
```
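A further possibility (a sketch only, separate from the try/except above) is to decode with a built-in error handler so no exception is raised at all; undecodable bytes stay visible in the log instead of being dropped:

```python
def decode_line(raw_line: bytes, output_encoding: str = "utf-8") -> str:
    # "backslashreplace" renders undecodable bytes as escapes (b'\xa9' -> '\\xa9')
    # instead of raising UnicodeDecodeError, so the loop never dies mid-stream.
    return raw_line.decode(output_encoding, errors="backslashreplace").rstrip()
```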

* Create test_subprocess.sh

* Update test_subprocess.py

* Added shell directive and license to test_subprocess.sh

* Distinguish between raw and decoded lines as suggested by @uranusjr

* simplify test

Co-authored-by: muhua <microhuang@live.com>
(cherry picked from commit 863b257)
…#23462)

* Load requested dagrun even when there are many dagruns at (almost) the same time

* Fix code formatting issues

(cherry picked from commit 8280167)
apache#23486)

When a task is expanded from a mapped task that returned no value, it crashes the scheduler. This PR fixes it by first checking whether there is a return value from the mapped task; if there is none, the error is raised in the task itself instead of crashing the scheduler.

(cherry picked from commit 7813f99)
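The guard described above can be sketched as follows (hypothetical names; the real check lives inside Airflow's mapped-task expansion logic):

```python
def expand_length(upstream_return_value):
    """Fail the task itself, not the scheduler, when upstream pushed no value."""
    if upstream_return_value is None:
        # Raising here surfaces the problem as a task failure the user can see.
        raise ValueError("upstream task returned no value to map over")
    return len(upstream_return_value)
```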
…e#23521)

* Prevent KubernetesJobWatcher getting stuck on resource too old

If the watch fails because the resource is too old, the
KubernetesJobWatcher should not retry with the same resource version,
as that ends up in a loop where no progress is made.

* Reset ResourceVersion().resource_version to 0

(cherry picked from commit dee05b2)
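The retry rule described above can be sketched like this (a hedged stand-in; the real logic lives in the KubernetesJobWatcher, and 410 is the HTTP status the Kubernetes API returns for an expired resource version):

```python
from typing import Optional


def next_resource_version(current: str, error_status: Optional[int]) -> str:
    """Pick the resource version for the next watch attempt."""
    if error_status == 410:  # HTTP 410 Gone: "resource version too old"
        return "0"           # restart the watch from scratch instead of looping
    return current           # no expiry error: resume from where we were
```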
In certain databases there is a need to set the collation for ID fields
like dag_id or task_id to something different than the database default.
This is because in MySQL with utf8mb4 the index size becomes too big for
the MySQL limits. In past pull requests this was handled
[apache#7570](apache#7570),
[apache#17729](apache#17729), but the
root_dag_id field on the dag model was missed. Since this field is used
to join with the dag_id in various other models ([and
self-referentially](https://github.com/apache/airflow/blob/451c7cbc42a83a180c4362693508ed33dd1d1dab/airflow/models/dag.py#L2766)),
it also needs to have the same collation as other ID fields.

This can be seen by running `airflow db reset` before and after applying
this change while also specifying `sql_engine_collation_for_ids` in the
configuration.
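To illustrate the uniformity argument (illustrative only; the PR itself changes the SQLAlchemy column definition of root_dag_id, and id_column_ddl below is a hypothetical helper):

```python
from typing import Optional


def id_column_ddl(name: str, collation: Optional[str] = None) -> str:
    """Render an ID column definition, applying the configured collation if any."""
    # Every ID field used in joins must share the same collation, or the
    # self-referential join on dag_id/root_dag_id can fail or skip indexes.
    suffix = f" COLLATE {collation}" if collation else ""
    return f"{name} VARCHAR(250){suffix}"
```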

Other related PRs
[apache#19408](apache#19408)

(cherry picked from commit b7f8627)
* Fix `PythonVirtualenvOperator` templated_fields
The `PythonVirtualenvOperator` templated_fields overrode the `PythonOperator` templated_fields, which caused templating not to work as expected.
fixes: apache#23557

(cherry picked from commit 1657bd2)
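The bug pattern is the classic one where a subclass redefines a class attribute and silently drops the parent's entries. A minimal sketch (the classes below are stand-ins, not the actual Airflow operators, whose attribute is spelled template_fields):

```python
class PythonOperatorSketch:
    template_fields = ("templates_dict", "op_args", "op_kwargs")


class BrokenVirtualenv(PythonOperatorSketch):
    # Replaces the parent's tuple entirely: op_args/op_kwargs are no longer templated.
    template_fields = ("requirements",)


class FixedVirtualenv(PythonOperatorSketch):
    # Extend the parent's tuple instead of replacing it.
    template_fields = PythonOperatorSketch.template_fields + ("requirements",)
```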
Update missing bracket

(cherry picked from commit 827bfda)
(cherry picked from commit f313e14)
These checks only make sense for upgrades. Generally they exist to resolve referential integrity issues etc. before adding constraints. In the downgrade context, we generally only remove constraints, so it's a non-issue.

(cherry picked from commit 9ab9cd4)
Move top margin to each breadcrumb component to make sure that there is no overlap when the header wraps with long names.

(cherry picked from commit f77a691)
when StandardTaskRunner runs tasks with exec

Issue: apache#23540
(cherry picked from commit e453e68)
If you tried to expand via xcom into a non-templated field without
explicitly setting the upstream task dependency, the scheduler would
crash because the upstream task dependency wasn't being set
automatically. It was being set only for templated fields, but now we do
it for both.

(cherry picked from commit 3849ebb)
The rename from apache#23562 missed a few shell_parms usages where it
should also be replaced.

(cherry picked from commit 4afa8e3)
…pache#23687)

Several Breeze commands depend on docker and docker compose
being available, as well as on the Breeze image. They will work
fine if you "just" built the image, but they might benefit
from the image being rebuilt (to make sure all the latest
dependencies are installed in the image). The common checks
done in the "shell" command for that are now extracted to common
utils and run as the first thing in the commands that need them.

(cherry picked from commit 3f4ab6c)
(cherry picked from commit 4a85370)
The "wait for image" step lacked --tag-as-latest, which made the
subsequent "fix-ownership" step sometimes run far longer than
needed, because it rebuilt the image for the fix-ownership case.

Also, the "fix-ownership" command has been changed to just pull
the image if one is missing locally, rather than build it. This
command might be run in an environment where the image is missing
or where a different image was built (for example in jobs where an
image was built for a different Python version). In that case the
command will simply use whatever Python version is available (it
does not matter), or, if no image is available at all, it will pull
the image as a last resort.
(cherry picked from commit 5e3f652)
After apache#23775 I noticed that there is yet another small improvement
area in the CI build speed. Currently build-ci-image builds and pushes
only "commit-tagged" images, but "fix-ownership" requires
the "latest" image to run.

This PR adds the --tag-as-latest option to the build-image and
build-prod-image commands as well, similarly to pull-image and
pull-prod-image. This will retag the "commit" images as latest in the
build-ci-images step and save about a minute on pulling the latest image
before fix-ownership (bringing it back to a 1s overhead).

(cherry picked from commit 252ef66)
ephraimbuddy and others added 12 commits July 1, 2022 19:13
There were errors with retrieving the constraints branch, caused by
using different conventions for output names (sometimes dashes,
sometimes camelCase, as suggested by most GitHub documents).

The "dash-name" form looks much better and is far more readable, so
we should unify all internal outputs to follow it.

During that rename some old, unused outputs were removed; it also
turned out that the new selective-check can replace the previous
"dynamic outputs" written in Bash.

Additionally, the "defaults" are now retrieved via a Python script
rather than a bash script, which makes it much more readable. Both
build_images and ci.yaml use it in the right place: before replacing
the scripts and dev with the version coming in from the PR, in the
case of build_images.yaml.

(cherry picked from commit 017507be1e1dbf39abcc94a44fab8869037893ea)
@boring-cyborg

boring-cyborg bot commented Jul 12, 2022

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@davidkl97 davidkl97 closed this Jul 12, 2022
@davidkl97 davidkl97 deleted the add-dbcleanup-dag-ids-arg branch July 12, 2022 06:43
Labels
area:dev-tools area:production-image Production image improvements and fixes

Successfully merging this pull request may close these issues.

support dag level db cleanup on airflow cli