Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update optimizations #496

Merged

Conversation

mbolivar-nordic
Copy link
Contributor

@mbolivar-nordic mbolivar-nordic commented Apr 6, 2021

This implements new west update options that can be used to optimize performance:

  • --name-cache, --path-cache: allows cloning project repositories from a known place on the file system before hitting the network to fetch the current revisions (improves performance by not fetching if possible)
  • --narrow: fetches the project revision directly even if it is a SHA, and does not fetch tags (improves performance by fetching fewer objects than the default west update)
  • --fetch-opt: allows specifying additional options to each git fetch used by west update, such as --depth=1 (which may improve performance if many objects in history are not needed)

It also adds some new configuration options:

  • update.narrow: boolean, default false. If true, west update uses --narrow always.
  • update.path-cache: string, default no value. If nonempty, west update uses this as its --path-cache if not otherwise specified.
  • update.name-cache: same as update.path-cache, but for the --name-cache option.

@mbolivar-nordic
Copy link
Contributor Author

@marc-hb how do you think these options should play together with submodules?

@marc-hb
Copy link
Collaborator

marc-hb commented Apr 7, 2021

First I would like to say the new options and command line interface seem to capture perfectly all the sometimes contradicting requirements that were expressed in old, long, convoluted and confusing discussions scattered across multiple github issues (and I plead guilty for some of that). Of course the devil is always in the details and time will tell but the first impression of the user interface is: wow, what still stands between west and world domination?

Now back to your question, the long story short is: I'm not sure. After months (years?) discussing git optimizations in west, you just made me realize I never asked myself these same questions about submodules. Maybe because I try to avoid them as much as possible?

I'm not sure how caching will work because I haven't looked at this PR yet and will probably not have any time this week. Note git submodules have for quite some time been using their parent's .git/ directory instead of their own, not sure whether that will help. Worst case west caching should cache nothing submodule related.

I just had a look git help submodule and the update command supports a --depth argument. So I guess you could carry this one forward to submodules. However it would require filtering out other, unsupported --fetch-opt arguments. I would recommend leaving this as potential future improvement, waiting a fair amount of time after the dust of these new features has settled.

Submodules don't seem to have any --narrow equivalent. There is a branch attribute in .gitmodules that defaults to the remote default branch but it does not seem to have any --narrow or revision-in effect, all remote branches are fetched anyway. I think it's only meant for git submodule update --remote and for initial creation convenience. Again, I would recommend to leave anything here as a potential future improvement.

Sorry for not having much more than only superficial compliments, I'll try to get back to my usual negative self ASAP!

@carlescufi
Copy link
Member

@czeslawmakarski and @KaSroka please test and review

@tejlmand
Copy link
Collaborator

tejlmand commented Apr 7, 2021

Looks good, but I think it would be valuable if the fetch / narrow / cache options can be specified with west config.

@mbolivar-nordic
Copy link
Contributor Author

mbolivar-nordic commented Apr 7, 2021

Looks good, but I think it would be valuable if the fetch / narrow / cache options can be specified with west config.

OK, let's bikeshed that before I implement:

  • update.narrow: boolean, default false. If true, west update uses --narrow always.
  • update.path-cache: string, default empty. If nonempty, west update uses this as its --path-cache if not otherwise specified.
  • update.name-cache: same as update.path-cache, but for the --name-cache option.

@mbolivar-nordic
Copy link
Contributor Author

fetch [...] options can be specified with west config.

I prefer to leave --fetch-opt out of configuration for now if that's OK.

@mbolivar-nordic
Copy link
Contributor Author

mbolivar-nordic commented Apr 7, 2021

Again, I would recommend to leave anything [about submodules] here as a potential future improvement.

I think so too. Thanks.

@tejlmand
Copy link
Collaborator

tejlmand commented Apr 7, 2021

fetch [...] options can be specified with west config.

I prefer to leave --fetch-opt out of configuration for now if that's OK.

sure, let's start simple so feel free to leave out --fetch-opt for now.
Can always be added later, but in long run I guess any human user running west update with --fetch-opt in a workspace would probably want to use that particular option on each invocation in the workspace.

@mbolivar-nordic
Copy link
Contributor Author

Can always be added later, but in long run I guess any human user running west update with --fetch-opt in a workspace would probably want to use that particular option on each invocation in the workspace.

I kind of disagree, especially for the --depth 1 case. So let's keep this out for now as agreed, then, thanks.

@marc-hb
Copy link
Collaborator

marc-hb commented Apr 7, 2021

I kind of disagree, especially for the --depth 1 case

I firmly disagree for the --depth 1 case. From git help fetch:

        --depth=<depth>
           Limit fetching to the specified number of commits from
           the tip of each remote branch history. If fetching to a
           shallow repository created by git clone with
           --depth=<depth> option (see git-clone(1)), deepen or
           __SHORTEN__ the history to the specified number of commits.
           Tags for the deepened commits are not fetched.

Emphasis mine.

BTW this highlights an important mismatch between west and git interfaces:

  • git has both clone and fetch. If you want to use --depth, you most likely want to use it only when cloning, not when fetching.
  • west has only update for both. In fact it does not even invoke git clone.

Besides poor performance[*] and breaking basic git features unexpectedly, misunderstandings is yet another reason to avoid shallow clones. They're really a hack.

As a coincidence I just experienced this error today after long minutes of git silence. No idea what it means:

git fetch --deepen=500 myremote
fatal: shallow file has changed since we read it

[*] Try --unshallow just for fun. It feels slower than cloning from scratch.

@mbolivar-nordic
Copy link
Contributor Author

  • update.narrow: boolean, default false. If true, west update uses --narrow always.

  • update.path-cache: string, default empty. If nonempty, west update uses this as its --path-cache if not otherwise specified.

  • update.name-cache: same as update.path-cache, but for the --name-cache option.

Now added.

This allows the user to clone projects for the first time from
existing local repositories instead of fetching from the network.

For an example, consider this west.yml:

  manifest:
    defaults:
      remote: myorg
    remotes:
    - name: myorg
      url-base: https://github.com/myorg
    projects:
    - name: foo
      path: modules/foo
      revision: deadbeef
    - name: bar
      path: modules/bar
      revision: abcd1234

Suppose the 'foo' and 'bar' repositories are already available in a
/var/my-name-cache directory, like this (.git locations shown for clarity):

  /var/my-name-cache
  ├── bar
  │   └── .git
  └── foo
      └── .git

You can then run:

  west update --name-cache /var/my-name-cache

And the 'foo' and 'bar' projects will be cloned from '/var/my-name-cache/foo'
and '/var/my-name-cache/bar', respectively, before being updated.

NOTE:

  It's /var/my-name-cache/foo, NOT /var/my-name-cache/modules/foo.

  Project names are unique in a west manifest, so putting them in the
  top level directory is safe. Doing it this way lets you use
  --name-cache without duplicating cache repositories if project
  paths might move around.

By contrast, if you had this /var/my-path-cache directory:

  my-path-cache
  └── modules
      ├── bar
      │   └── .git
      └── foo
          └── .git

You could run:

  west update --path-cache /var/my-path-cache

And get a similar result: the 'foo' and 'bar' projects will be cloned
from '/var/my-path-cache/modules/foo' and
'/var/my-path-cache/modules/bar', respectively.

Using either option, west update won't hit the network at all if the
'deadbeef' and 'abcd1234' SHAs are already available locally and
the (default) '--fetch=smart' strategy for west update is used.

If both options are given, both caches are checked; --name-cache is
checked first.

Signed-off-by: Martí Bolívar <marti.bolivar@nordicsemi.no>
This moves the definition closer to the point of use, allows us to
move the --stats bookkeeping into the method itself, and makes it more
obvious that 'update' is the only command that ought to be hitting the
network to fetch revisions.

For now, we don't need any instance data, but that will change in a
subsequent patch.

Signed-off-by: Martí Bolívar <marti.bolivar@nordicsemi.no>
GitHub now does allow fetching SHAs.

Signed-off-by: Martí Bolívar <marti.bolivar@nordicsemi.no>
Update.do_run() stashes the command line arguments in self.args
almost before doing anything else. There's no need to pass it around
or repeat that work.

Signed-off-by: Martí Bolívar <marti.bolivar@nordicsemi.no>
Replace some unused things with _.

Signed-off-by: Martí Bolívar <marti.bolivar@nordicsemi.no>
This can be used to influence the way west fetches.

The motivating use case is to fetch with --depth=1. Since that's the
case, the new option can be used to override any 'clone-depth'
specified in the manifest for a project. Remove the extra 'with
--depth=foo' in the output when the project has a clone depth
accordingly, as this will be confusing if the user overrides it.

However, this is only part of the story to make this optimization
truly useful. We also need to allow the user to fetch something
that might be a SHA directly.

Without that, running 'west update -o --depth=1' on a project with
a SHA as the revision will sync the entire remote ref space, including
all tags, with --depth=1. Not quite what we're hoping for in terms of
limiting network bandwidth in that case.

Signed-off-by: Martí Bolívar <marti.bolivar@nordicsemi.no>
This option tells west to fetch project.revision exactly, and nothing
else.

This skips fetching tags, and tries this even if the revision looks
like a SHA, which does not work depending on the git host.

Signed-off-by: Martí Bolívar <marti.bolivar@nordicsemi.no>
Updating manifest-rev is a separate operation from fetching, but the
two are tracked and printed together in the --stats output.

Now that this code is in an instance method that has access to the
stats dict, it is easy to track manifest-rev and fetching separately,
so make it happen. Also resolve manifest-rev to a SHA before calling
_update_manifest_rev(), so the reflog entry says:

    west update: moving to <SHA>

instead of something like:

    west update: moving to FETCH_HEAD^{commit}

Signed-off-by: Martí Bolívar <marti.bolivar@nordicsemi.no>
Exercise --narrow alone and with --fetch-opt.

Signed-off-by: Martí Bolívar <marti.bolivar@nordicsemi.no>
If true, 'west update' is always '--narrow'.

Signed-off-by: Martí Bolívar <marti.bolivar@nordicsemi.no>
We are setting up various bits and pieces of instance state before
getting to the meat of the work. This is getting to be enough lines
that a dedicated helper feels right.

Signed-off-by: Martí Bolívar <marti.bolivar@nordicsemi.no>
@mbolivar-nordic
Copy link
Contributor Author

Rebased to fix merge conflicts.

@mbolivar-nordic mbolivar-nordic force-pushed the update-optimizations branch 2 times, most recently from f13a76d to 30782b3 Compare April 7, 2021 23:08
These set default values of the --name-cache and --path-cache command
line options, respectively.

Signed-off-by: Martí Bolívar <marti.bolivar@nordicsemi.no>
@marc-hb
Copy link
Collaborator

marc-hb commented Apr 8, 2021

Note git submodules have for quite some time been using their parent's .git/modules/ directory instead of their own, not sure whether that will help.

This could have really helped. Unfortunately it does not. git clone does not copy .git/modules/. In other words the new way git submodules are implemented offers a cache extremely easy to use but git clone does not take advantage of it! :-(

I disconnected the network, ran git clone locally and copied .git/modules manually. Then ta-da: I could initialize and update submodules without a network connection and recursively! That's a hack though, I would consider .git/modules/ as a private implementation detail that should not be accessed. Especially not in the very first version of these large new features.

Copy link
Member

@carlescufi carlescufi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me already.

@czeslawmakarski
Copy link

I'm not on the approvers list, but I checked and it works great by us. So I do approve.

Copy link

@KaSroka KaSroka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. As I have changed project I'm in I have no much room to test this but it should greatly speed up CI that only cares about given version and doesn't need history - this is what we needed in my old project.

@mbolivar-nordic mbolivar-nordic merged commit aba4164 into zephyrproject-rtos:master Apr 13, 2021
@mbolivar-nordic
Copy link
Contributor Author

Thanks for the reviews and testing!

@lucsegers
Copy link

@mbolivar-nordic could you tell me how this should be used in a CI environment?
Should I just use "west update --narrow"?
Not sure how I should use the path-cache and name-cache in a ci environment. Do we have some examples for this?

@mbolivar-nordic
Copy link
Contributor Author

@lucsegers I don't have any general advice; the results are always going to depend on the details.

Example: west update --narrow should generally be a win, but it depends on your manifest. Not all git servers support fetching a SHA directly, which means --narrow won't always work in all CI environments.

In terms of using a cache, you need to set it up so your CI environment has a prepopulated cache available and mounted on the file system somewhere. The specifics for this will depend on your CI environment. Are you running CI in a bare metal server somewhere? Set up the cache in the file system and update it periodically. Are you using docker? Try to figure out a way to do a volume mount from the host machine, if you have control over the file system of the host machine. Etc.

@marc-hb marc-hb added the performance How long things take label May 5, 2023
@marc-hb
Copy link
Collaborator

marc-hb commented May 5, 2023

I just added a brand new "performance" label to about 20 issues and pull requests:
https://github.com/zephyrproject-rtos/west/issues?q=label%3Aperformance+

AFAIK the most recent discussion happened in #638, have a look to at least this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance How long things take
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants