Caching yarn and pnpm dependencies in Docker #43329

Closed
wants to merge 11 commits into from

bugraoz93
Collaborator

closes: #43167



@boring-cyborg boring-cyborg bot added area:dev-tools area:production-image Production image improvements and fixes labels Oct 23, 2024
@bugraoz93
Collaborator Author

Hey @potiuk,

I have started adding the yarn dependency installation in a way very similar to the existing dependency caching. Most probably, I am still missing a lot of parts. I wanted to align with you on the approach before going too deep and including every detail :)

GitHub Actions also offers a couple of generic caching actions, some of them specific to yarn, npm, etc. I don't think we could fully carry that over to the local environment, though, since they only work in GitHub CI. Maybe we can treat caching in CI and caching locally as separate tasks; using GitHub Actions for the CI part could be easier to maintain in the future.

I couldn't properly run inline_scripts_in_docker.py to include the script in the Dockerfile. That's why CI is failing. I can take care of that part later, after we have agreed on the approach and implementation phases.

Thanks for your time!

@potiuk
Member

potiuk commented Oct 29, 2024

> I couldn't properly run inline_scripts_in_docker.py to include the script in the Dockerfile. That's why CI is failing. I can take care of that part later, after we have agreed on the approach and implementation phases.

This should happen automatically with pre-commit, so you do not have to run it manually. Just git add your changes and commit (after running pre-commit install once) and it will happen automatically for you.

@potiuk
Member

potiuk commented Oct 29, 2024

> GitHub Actions also offers a couple of generic caching actions, some of them specific to yarn, npm, etc. I don't think we could fully carry that over to the local environment, though, since they only work in GitHub CI. Maybe we can treat caching in CI and caching locally as separate tasks; using GitHub Actions for the CI part could be easier to maintain in the future.

Yeah. That's something that we could do for regular installing of Yarn/NPM dependencies as part of CI jobs, but that will not work well for installing Yarn/NPM deps during the Dockerfile build. And yes - also we have a few places where we install it for local development as you wrote. But I think it's better to use specific caching in each case:

  • Dockerfile -> there the "tip" caching is really, really efficient. It works great because a single build on the server saves all the rebuilds locally - we just pull a single container layer with the installed dependencies, which is WAY faster than pulling half of the internet and 1000 node modules individually.

  • pre-commit -> it installs node automatically without us doing anything (when language: node is used). There we can also easily control the node version via default_language_version (python: python3, node: 22.2.0). Caching there is achieved by caching the whole pre-commit cache. The nice thing is that pre-commit uses a single node environment for all the pre-commit hooks that have the same specification. We have quite a few of them - some for the old UI with yarn, some for the new one with pnpm, and some for a number of node tools. Here I think it already works well with the current caching in CI and locally, without any change from our side.

I think both of them would be difficult to implement with any of the "standard" actions, and the specific approaches are generally "better" - and they work the same in CI and locally.

So yeah I think your current approach is right.

@bugraoz93
Collaborator Author

> I couldn't properly run inline_scripts_in_docker.py to include the script in the Dockerfile. That's why CI is failing. I can take care of that part later, after we have agreed on the approach and implementation phases.

> This should happen automatically with pre-commit, so you do not have to run it manually. Just git add your changes and commit (after running pre-commit install once) and it will happen automatically for you.

Yeah, I found the script itself from pre-commit and ran it multiple times as pre-commit and as a raw Python call. I think this is something in my local setup then. I can sort that out. Thanks!

@bugraoz93
Collaborator Author

Yeah, it will be difficult with them. I have checked a couple of them, and they are a bit limited. Even though they seem like they are properly caching, it's still not flexible enough to fit into multiple use cases.
Awesome to hear the approach looks right! Hope I will ping you with a good solution in a couple of days :)
Thanks for the detailed answer!

@potiuk
Member

potiuk commented Oct 29, 2024

> Yeah, I found the script itself from pre-commit and ran it multiple times as pre-commit and as a raw Python call. I think this is something in my local setup then. I can sort that out. Thanks!

One thing to watch out for: for the inlining to work with a NEW script, make sure to add a comment with the right script path just before the script that is "inlined" (and an ending comment after it) - this is how the inlining script finds where to inline it.
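
As a rough illustration of the marker-comment mechanism described above, the sketch below replaces whatever sits between a start comment and an end comment with a script's contents. The marker format (`# INLINE-START <path>` / `# INLINE-END <path>`) and the function name `inline_script` are assumptions made up for this example; the actual format used by inline_scripts_in_docker.py may well differ.

```python
import re


def inline_script(dockerfile_text: str, script_path: str, script_body: str) -> str:
    """Replace the text between the start/end marker comments for script_path.

    The marker format used here is hypothetical, chosen only for illustration.
    """
    start = f"# INLINE-START {script_path}"
    end = f"# INLINE-END {script_path}"
    pattern = re.compile(
        rf"^{re.escape(start)}\n.*?^{re.escape(end)}$",
        flags=re.DOTALL | re.MULTILINE,
    )
    if not pattern.search(dockerfile_text):
        # Without both markers there is nowhere to inline a new script.
        raise ValueError(f"markers for {script_path} not found")
    replacement = f"{start}\n{script_body.rstrip()}\n{end}"
    # Use a function so backslashes in the script are not treated as escapes.
    return pattern.sub(lambda _match: replacement, dockerfile_text, count=1)
```

The point of the sketch is only that a new script gets inlined nowhere until both of its markers exist in the target file, which matches the failure mode discussed here.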

@bugraoz93 bugraoz93 force-pushed the feat/43167/yarn-install-cache branch from 674a600 to b836fda Compare November 16, 2024 19:11
@bugraoz93 bugraoz93 changed the title Caching yarn dependencies in Docker initial implementation [WIP] - Caching yarn dependencies in Docker initial implementation Nov 16, 2024
@potiuk
Member

potiuk commented Nov 16, 2024

Looks good, but we also have to handle UI dependencies (with pnpm)

@bugraoz93 bugraoz93 force-pushed the feat/43167/yarn-install-cache branch from c081542 to 38c7f3e Compare November 16, 2024 23:37
@bugraoz93
Collaborator Author

While reviewing the logs to ensure everything was running smoothly, I noticed an issue: I had accidentally reversed the -+e option 🤦‍♂️.

However, I encountered the following error:

#50 2.616 /scripts/docker/install_yarn_dependencies_from_branch_tip.sh: line 21: yarn: command not found

This suggests that yarn was either not installed or unavailable in the environment when the script was executed.

This made me realize that npm and yarn should have been installed at the point when the script is executed. I identified the best place to install them using apt and to cache them just before switching to the airflow user. Since yarn installs node_modules to a local directory, I copied the installed dependencies to a cache directory for future use. Once the cache was utilized in the appropriate location, it was deleted during the execution of compile_www_assets.py to ensure it was not included in the final package.
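
A minimal sketch of the copy-then-clean-up flow described above, assuming plain directory copies. The function names `save_to_cache` and `restore_and_drop_cache` and all paths are hypothetical and are not taken from the PR.

```python
import shutil
from pathlib import Path


def save_to_cache(node_modules: Path, cache_dir: Path) -> None:
    """Copy an installed node_modules tree into a cache directory."""
    shutil.copytree(node_modules, cache_dir, dirs_exist_ok=True)


def restore_and_drop_cache(cache_dir: Path, node_modules: Path) -> bool:
    """Seed node_modules from the cache, then delete the cache.

    Returns True if a cache was found and used, False otherwise.
    """
    if not cache_dir.is_dir():
        return False
    shutil.copytree(cache_dir, node_modules, dirs_exist_ok=True)
    # Remove the cache so it does not end up in the final package.
    shutil.rmtree(cache_dir)
    return True
```

In this sketch the first function would run at image build time and the second at asset compilation time, so the cache only exists between those two steps.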

> Looks good, but we also have to handle UI dependencies (with pnpm)

Indeed, thanks for pointing that out! I wanted to make sure everything was functioning as expected before implementing a similar solution with pnpm. That’s the next step I’ll tackle. Many thanks for the quick review!

@bugraoz93 bugraoz93 force-pushed the feat/43167/yarn-install-cache branch from 648075b to 6aed976 Compare November 18, 2024 01:37
@bugraoz93 bugraoz93 changed the title [WIP] - Caching yarn dependencies in Docker initial implementation Caching yarn dependencies in Docker initial implementation Nov 18, 2024
@bugraoz93 bugraoz93 added area:CI Airflow's tests and continious integration and removed area:production-image Production image improvements and fixes labels Nov 18, 2024
@bugraoz93 bugraoz93 changed the title Caching yarn dependencies in Docker initial implementation Caching yarn and pnpm dependencies in Docker Nov 18, 2024
@bugraoz93 bugraoz93 force-pushed the feat/43167/yarn-install-cache branch from d93f577 to b4fb977 Compare November 22, 2024 01:55
@bugraoz93
Collaborator Author

There were too many commits, so I rebased rather than merged :) Could you please check again when you have time?

@potiuk potiuk force-pushed the feat/43167/yarn-install-cache branch from b4fb977 to 3a7d9e0 Compare November 28, 2024 00:31
@potiuk
Member

potiuk commented Nov 28, 2024

Sorry for missing that one. Rebased it. It looks really good.

if os.getenv("AIRFLOW_PRE_CACHED_PNPM_PACKAGES", "false") == "true":
    # Copy pnpm-cache to node_modules from pnpm-cache
    shutil.copytree(pnpm_node_modules_cache_directory, node_modules_directory, dirs_exist_ok=True)
else:
Member

One small thing here. I think we should ALWAYS run the install step below. This covers the case where the .lock file changed in the meantime, so that we only need to update on top of the pre-cached node_modules.
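
The suggestion above could look roughly like the sketch below: the cache only seeds node_modules, and the install command runs unconditionally afterwards. The function name `install_dependencies` and its parameters are hypothetical; only the environment-variable check and the `shutil.copytree` call mirror the snippet under review.

```python
from __future__ import annotations

import os
import shutil
import subprocess
from pathlib import Path


def install_dependencies(
    project_dir: Path,
    cache_dir: Path,
    install_cmd: list[str],
    env_flag: str,
) -> None:
    """Seed node_modules from a cache if enabled, then always run install."""
    node_modules = project_dir / "node_modules"
    if os.getenv(env_flag, "false") == "true" and cache_dir.is_dir():
        # The cache is only a starting point, not a replacement for install.
        shutil.copytree(cache_dir, node_modules, dirs_exist_ok=True)
    # Always run the install step so a changed lock file is still honoured.
    subprocess.check_call(install_cmd, cwd=os.fspath(project_dir))
```

For the www assets this might be called with `["yarn", "install", "--frozen-lockfile"]` and `"AIRFLOW_PRE_CACHED_YARN_PACKAGES"`, both of which appear in the diff; with a warm cache the install should then have little left to download.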

@@ -68,7 +69,11 @@ def get_directory_hash(directory: Path, skip_path_regexp: str | None = None) ->
shutil.rmtree(dist_directory, ignore_errors=True)
env = os.environ.copy()
env["FORCE_COLOR"] = "true"
subprocess.check_call(["yarn", "install", "--frozen-lockfile"], cwd=os.fspath(www_directory))
if os.getenv("AIRFLOW_PRE_CACHED_YARN_PACKAGES", "false") == "true":
Member

Same here.

Member

@potiuk potiuk left a comment

Just one small comment

@potiuk
Member

potiuk commented Nov 28, 2024

Hmm. Actually I looked closer (something was wrong here)

I think this is a slightly different issue (and one I had completely forgotten about). Currently we do not have node/yarn/pnpm in Breeze at all. We rely on pre-commit to manage the node/pnpm/yarn environment for us:

      - id: compile-ui-assets
        name: Compile ui assets (manual)
        language: node
        stages: ['manual']
        types_or: [javascript, ts, tsx]
        files: ^airflow/ui/
        entry: ./scripts/ci/pre_commit/compile_ui_assets.py
        pass_filenames: false
        additional_dependencies: ['pnpm@9.7.1']
      - id: compile-ui-assets-dev
        name: Compile ui assets in dev mode (manual)
        language: node
        stages: ['manual']
        types_or: [javascript, ts, tsx]
        files: ^airflow/ui/
        entry: ./scripts/ci/pre_commit/compile_ui_assets_dev.py
        pass_filenames: false
        additional_dependencies: ['pnpm@9.7.1']
      - id: compile-www-assets
        name: Compile www assets (manual)
        language: node
        stages: ['manual']
        'types_or': [javascript, ts, tsx]
        files: ^airflow/www/
        entry: ./scripts/ci/pre_commit/compile_www_assets.py
        pass_filenames: false
        additional_dependencies: ['yarn@1.22.21']
      - id: compile-www-assets-dev
        name: Compile www assets in dev mode (manual)
        language: node
        stages: ['manual']
        'types_or': [javascript, ts, tsx]
        files: ^airflow/www/
        entry: ./scripts/ci/pre_commit/compile_www_assets_dev.py
        pass_filenames: false
        additional_dependencies: ['yarn@1.22.21']

So we do not even have anything in the image.

I think the idea of doing similar caching in image (my idea) was really wrong. We could move node + all things to the image but it's not really needed.

I am not really sure if we should do much now - the intermittent issues with node did not appear recently - so maybe just abandoning it for now is a better idea.

@bugraoz93
Collaborator Author

Thanks for the review!

That's exactly the case, which is why I installed node/yarn/pnpm in the image. I tested with Breeze, too, since Breeze builds Dockerfile.ci in local development. Ah, this only covers the local development case, though, so it may only speed things up and not solve the entire problem.

I agree, I haven't seen the problem with node for a while now. Let's abandon this for now. Also, managing these dependencies in even more places would be an additional burden. Even if we managed the dependencies in the image and removed them from pre-commit, it would increase the image size and bring in a lot of vulnerabilities. Keeping this in the pre-commit environment still makes more sense. I missed this one. Awesome catch!

@bugraoz93
Collaborator Author

Closing this PR. If the problem appears again multiple times and bothers us, we can follow up again and discuss possible solutions. :)

@bugraoz93 bugraoz93 closed this Dec 1, 2024
Labels
area:CI Airflow's tests and continious integration area:dev-tools
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement caching of NPM in CI / local dev
2 participants