Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

airbyte-ci: refactor GradleTask to run in subprocess #52033

Conversation

alafanechere
Copy link
Contributor

@alafanechere alafanechere commented Jan 20, 2025

What

We want to move away from containerized gradle execution in CI.
This PR makes airbyte-ci run gradle as a subprocess on the system, while keeping the airbyte-ci orchestration benefits.

Example

I got a 🟢 run on destination-s3 from this branch.

Running a longer job on source-postgres, destination-bigquery and destination-snowflake.

Question for reviewers ❓

  • Which gradle options should we keep compared to the original containerized implementation : --no-watch-fs --build-cache and --scan
  • Do we want live Gradle logging when executing the gradle task in airbyte-ci, I found it really noisy and piping gradle output to python then to console has some performance overhead. I'll try to provide live gradle output in a follow up PR but local testing has shown strange overhead when streaming gradle output to the console.

How

New GitHub Action for Java Environment Setup:

Updates to Build and Test Steps:

Refactoring Gradle Task Handling:

  • airbyte-ci/connectors/pipelines/pipelines/airbyte_ci/steps/gradle.py:
    • Removed the use of container-based Gradle task execution in favor of running Gradle commands in subprocesses.
    • Introduced a context manager to handle Gradle environment setup and teardown.
    • Simplified the Gradle task options and removed unnecessary properties and methods. [1] [2] [3] [4] [5]

DX benefits

  • Faster runs - less overhead related to containerization
  • More stable caching behavior - we get the usual S3 cache implementation
  • Easier debugging, no need to dive into DockerInDocker complexity to investigate CI issues.
Screenshot 2025-01-21 at 18 36 35

Copy link

vercel bot commented Jan 20, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jan 22, 2025 10:30am

Copy link
Contributor Author

alafanechere commented Jan 20, 2025

@alafanechere alafanechere force-pushed the augustin/01-20-airbyte-ci_refactor_GradleTask_to_run_in_subprocess branch 2 times, most recently from 8399d12 to b0000a6 Compare January 21, 2025 14:07
@alafanechere alafanechere force-pushed the augustin/01-20-airbyte-ci_refactor_GradleTask_to_run_in_subprocess branch 5 times, most recently from 5c9d7e2 to 68d0f82 Compare January 21, 2025 16:54
@alafanechere alafanechere force-pushed the augustin/01-20-airbyte-ci_refactor_GradleTask_to_run_in_subprocess branch from 68d0f82 to def2396 Compare January 21, 2025 17:22
@alafanechere alafanechere marked this pull request as ready for review January 21, 2025 17:43
@alafanechere alafanechere requested a review from a team as a code owner January 21, 2025 17:43
CDK_MAVEN_METADATA_URL = (
"https://airbyte.mycloudrepo.io/public/repositories/airbyte-public-jars/io/airbyte/cdk/airbyte-cdk-core/maven-metadata.xml"
)
STATIC_GRADLE_OPTIONS = ("--build-cache", "--scan")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which options shall we keep compared to the containerized implementation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should keep --no-daemon and --no-watch-fs since this is a single build so we don't need to incur the overhead of start the daemon/watching the fs that is mostly useful for systems that are long running.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know about --no-daemon, aren't we going to call gradle multiple times in the same VM now?

Copy link
Contributor

@davinchia davinchia Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question - I'm assuming we only run 1 connector test per CI runner and most devs will not invoke airbyte-ci locally, and will instead directly run gradle, so we won't invoke the gradle command multiple times in a single VM using the airbyte-ci command. Augustine, does that sound right? Or is there a possibility we run different gradle commands within a single CI runner?

Copy link
Contributor Author

@alafanechere alafanechere Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming we only run 1 connector test per CI runner

Nope, the default airbyte-ci implementation is to run all modified connectors tests in the same runner, concurrently for python connectors, sequentially for java connectors as concurrent access to the gradle cache is not possible. I can try to revisit this logic later. Colocating execution on the same runner has benefits for python connectors for dagger caching. Following this change colocation for java connectors makes less sense and hurts DX.

In any case we do run multiple gradle command in a single connector pipeline execution: distTar, unitTests, integrationTests.

TLDR;: I don't think we should add --no-daemon back. I'll re-add --no-watch-fs. Thanks for your inputs.

@@ -115,7 +115,9 @@ runs:
ls -la
ls -la airbyte-python-cdk || echo "No airbyte-python-cdk directory"
ls -laL ../airbyte-python-cdk || echo "No airbyte-python-cdk symlink"

- name: Install Java Environment
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a followup: that would be cool to spare python connector only PR from installing the java env. We could rely on the changes output to check if any java connector was modified

@@ -60,7 +60,8 @@ async def run_connector_build(context: ConnectorContext) -> StepResult:
build_connector_tar_result = await BuildConnectorDistributionTar(context).run()
if build_connector_tar_result.status is not StepStatus.SUCCESS:
return build_connector_tar_result
dist_dir = await build_connector_tar_result.output.directory(dist_tar_directory_path(context))

dist_dir = await build_connector_tar_result.output.directory("build/distributions")
Copy link
Contributor

@davinchia davinchia Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for understanding: why are we using "build/distributions" directly? feels unrelated to the other changes, maybe I'm missing something.

Copy link
Contributor Author

@alafanechere alafanechere Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It changes because the original path related to an "absolute path airbyte repo path". the build_connector_tar step now returns a directory which is the connector directory.

# When running locally, this dependency update is slower and less useful than within a CI runner. Skip it.
warm_dependency_cache_args = ["--dry-run"]
return stdout, stderr, process.returncode
finally:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the started process hangs, how does this finally block trigger?

this is probably my lack of python knowledge

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it hangs it's never reached. Good point. I can add a timeout.

Copy link
Contributor

@davinchia davinchia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alafanechere this looks sane to me. I'm not really qualified to review the Dagger code or Python best practices, so I mostly focused on understanding those pieces. Everything else seems good.

One comment on the python process handling, but non-blocking.

I see CI works. Did you manage to test this locally?

@alafanechere
Copy link
Contributor Author

@alafanechere this looks sane to me. I'm not really qualified to review the Dagger code or Python best practices, so I mostly focused on understanding those pieces. Everything else seems good.

👍

One comment on the python process handling, but non-blocking.

I'll address that tomorrow.

I see CI works. Did you manage to test this locally?

I did, but it was pretty slow - my machine was dying. I'm going to try again tomorrow with a less busy env.

@davinchia
Copy link
Contributor

Do we want live Gradle logging when executing the gradle task in airbyte-ci, I found it really noisy and piping gradle output to python then to console has some performance overhead.

I realised I didn't answer this. What is the overhead of doing so? I think the use case for this is being able to see logs as tests run in the CI. I imagine it's useful if trying to debug in CI, however I'm less sure of the value since I'm not sure how the tests outputs will look like once this change goes in.

If the dx is the same as the current dagger tests, I think value is low, since the output is only at the end when everything is run. Value is high if we are able to see things live in the GHA UI, though lower if perf overhead is large. I think we can poll various Java devs and fast follow in another PR since this PR by itself has lots of value.

@alafanechere
Copy link
Contributor Author

Do we want live Gradle logging when executing the gradle task in airbyte-ci, I found it really noisy and piping gradle output to python then to console has some performance overhead.

I realised I didn't answer this. What is the overhead of doing so? I think the use case for this is being able to see logs as tests run in the CI. I imagine it's useful if trying to debug in CI, however I'm less sure of the value since I'm not sure how the tests outputs will look like once this change goes in.

If the dx is the same as the current dagger tests, I think value is low, since the output is only at the end when everything is run. Value is high if we are able to see things live in the GHA UI, though lower if perf overhead is large. I think we can poll various Java devs and fast follow in another PR since this PR by itself has lots of value.

We can show the output live in the GHA console, but when I tried locally I had the impression that the output streaming had overhead. I can definitely experiment on this on a follow up PR.

process.terminate()
process.wait(timeout=5) # Give it 5 seconds to terminate gracefully
try:
process.kill() # Force kill if still running
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this specific scenario be logged somehow? We might want to know that it happened?

Copy link
Contributor

@postamar postamar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like something worth doing! I'll note that there's more followup work in the form of ripping out all the stuff we did to support dind (wasn't /tmp mounted as a volume at some point?)

cache-write-only: ${{ inputs.gradle_cache_write_only }}
- name: Install system tools (jq, xargs)
shell: bash
run: sudo apt-get -y install jq findutils
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this step really required? Have you tried without? jq is no longer needed (IIRC) and xargs should be present on the system (which is now a full fledged ubuntu and not some bare bones alpine distro)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it and it still works. Thanks for the suggestion

CDK_MAVEN_METADATA_URL = (
"https://airbyte.mycloudrepo.io/public/repositories/airbyte-public-jars/io/airbyte/cdk/airbyte-cdk-core/maven-metadata.xml"
)
STATIC_GRADLE_OPTIONS = ("--build-cache", "--scan")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know about --no-daemon, aren't we going to call gradle multiple times in the same VM now?

@alafanechere alafanechere force-pushed the augustin/01-20-airbyte-ci_refactor_GradleTask_to_run_in_subprocess branch from def2396 to 2176226 Compare January 22, 2025 08:24
@alafanechere alafanechere force-pushed the augustin/01-20-airbyte-ci_refactor_GradleTask_to_run_in_subprocess branch from 2176226 to 5709810 Compare January 22, 2025 08:57
@alafanechere
Copy link
Contributor Author

This seems like something worth doing! I'll note that there's more followup work in the form of ripping out all the stuff we did to support dind (wasn't /tmp mounted as a volume at some point?)

dind is still used for some python connector testing use cases which rely on docker-compose files.

@@ -61,24 +61,3 @@ def context_with_setup(dagger_client, python_connector_with_setup_not_latest_cdk
@pytest.fixture(scope="module")
def python_connector_base_image_address(python_connector_with_setup_not_latest_cdk):
return python_connector_with_setup_not_latest_cdk.metadata["connectorBuildOptions"]["baseImage"]


async def test_with_python_connector_installed_from_setup(context_with_setup, python_connector_base_image_address, latest_cdk_version):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was failing - removing it as installing from setup.py is now legacy

@alafanechere
Copy link
Contributor Author

alafanechere commented Jan 22, 2025

@postamar @davinchia @frifriSF59 thanks for your comments. Here are my latest changes following your suggestions.

  • Implement a timeout to avoid hanging gradle tasks. I've set it to 2 hours by default. This can be customized for distTar / unit / integration tests by tweaking the value on the GradleTask subclasses.
  • I added the --no-watch-fs option. I decided to not re-add the --no-daemon option because a single pipeline can run multiple gradle tasks.
  • I added extra logging + made some readability improvement changes in GradleTask.
  • Fixed airbyte-ci tests to get 🟢 CI.

Here's my latest CI test run on source-postgres and destination-s3

I opened a followup PR for CI orchestration optimization: to leverage matrix strategies to run 1 java connector per runner and only install the java environment when needed (when testing java connectors). It's still a wip.

@alafanechere alafanechere force-pushed the augustin/01-20-airbyte-ci_refactor_GradleTask_to_run_in_subprocess branch from a49800e to 466ee65 Compare January 22, 2025 10:24
Copy link
Contributor

@davinchia davinchia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

I'd be curious what are the pieces @postamar as no longer needed and can now be removed. However definitely not blocking.

@alafanechere alafanechere merged commit 0b6d37d into master Jan 22, 2025
30 checks passed
@alafanechere alafanechere deleted the augustin/01-20-airbyte-ci_refactor_GradleTask_to_run_in_subprocess branch January 22, 2025 16:04
@postamar
Copy link
Contributor

Apparently docker-in-docker is required for some python connectors so it can't be removed. This was by far the biggest piece. I'll take a look and see what odds and ends I can remove but they're fairly inconsequential in comparison.

Copy link

sentry-io bot commented Jan 22, 2025

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ QueryError: resolve: lstat /build/distributions: no such file or directory pipelines.airbyte_ci.connectors.build_image.ste... View Issue

Did you find this useful? React with a 👍 or 👎

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants