Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc: consider switching from GitPython #2215

Closed
Suor opened this issue Jul 2, 2019 · 31 comments · Fixed by iterative/scmrepo#29
Closed

dvc: consider switching from GitPython #2215

Suor opened this issue Jul 2, 2019 · 31 comments · Fixed by iterative/scmrepo#29
Assignees
Labels
p3-nice-to-have It should be done this or next sprint refactoring Factoring and re-factoring

Comments

@Suor
Copy link
Contributor

Suor commented Jul 2, 2019

GitPython causes constant headache and considerable time loss investigating and fixing issues on windows with file not closed and processes not terminated soon enough. We have several places in our code where we were forced to use retries to handle that.

The notable alternatives are pygit2 and Dulwich.

Does anyone know of any downsides of any of these? Or other alternatives? Or maybe there is a good reason we are using GitPython, which I am unaware of?

@shcheklein
Copy link
Member

Also, looks like gitpython depends on git being installed:

GitPython needs the git executable to be installed on the system and available in your PATH for most operations. If it is not in your PATH, you can help GitPython find it by setting the GIT_PYTHON_GIT_EXECUTABLE=<path/to/git> environment variable.

Dulwich is pure-python. I wonder if we can do git clone with it as well? It would be great if dvc does not require git at all in certain cases like get/import-url.

We had some troubles with Dulwich trying to compile its C extension or something. @efiop has better idea and even prepared a patch for Dulwich. I'm not sure why we are not just specifying the global option to compile it w/o C-extension.

@efiop
Copy link
Contributor

efiop commented Jul 2, 2019

Dulwich doesn't have wheels yet and to install it in pure-python form, we need to explicitly specify special flag to pip install, which can not be included into install_requires. That is why I've sent a patch that would fallbackt to pure-python if it is unable to complie C extensions, but that patch got declined, because maintainer didn't want people accidentally installing slower dulwich version. So I've helped a bit with automatic wheel building, but there is still stuff to do (e.g. need to fix tests on mac). Other than that, dulwich seems pretty promising.

pygit2 requires libgit2, but it has a pretty good wheel selection already https://pypi.org/project/pygit2/#files , so also might be promising.

We chose to use gitpython because it just worked and is the most popular project among these :) With more advanced use it does has a lot of drawbacks indeed.

@ghost
Copy link

ghost commented Jul 2, 2019

I really enjoy using the pygit2 interface.

@Suor
Copy link
Contributor Author

Suor commented Jul 3, 2019

I am also in favor of pygit2. It's maintained by the same people maintaining libgit2, so it should be future proof and not do anything weird.

@shcheklein shcheklein added p1-important Important, aka current backlog of things to do refactoring Factoring and re-factoring labels Jul 14, 2019
@efiop efiop added p2-medium Medium priority, should be done, but less important p3-nice-to-have It should be done this or next sprint and removed p1-important Important, aka current backlog of things to do p2-medium Medium priority, should be done, but less important labels Jul 16, 2019
@shcheklein
Copy link
Member

It looks like it's getting more and more important. A few people mentioned partial checkout problems with gitpython. And it's very important to solve get/dvc API dependency on command line and local filesystem. @efiop can we prioritize the initial research around pygit2?

@efiop efiop added p1-important Important, aka current backlog of things to do and removed p3-nice-to-have It should be done this or next sprint labels Jul 24, 2019
@efiop
Copy link
Contributor

efiop commented Jul 24, 2019

Sure @shcheklein , done.

@Suor
Copy link
Contributor Author

Suor commented Jul 24, 2019

We might reconsider using pygit2, there are lots of issues with it:

  • it's a beta
  • not actively developed
  • issues get answers in a week or several weeks, many times they don't
  • it's lower level than gitpython, see this for example
  • it unlike gitpython does not match git calls, I mean if someone is able do git ... do something then it might not work when dvc does that via pygit. I.e. cloning with https or ssh url might work cmd-line, but not via pygit2 depending on libraries installed or some configuration

GitPython for comparison:

  • not actively developed
  • issues get triaged fast, not fixed - "help wanted" label is assigned
  • PRs get merged regularly

@Suor
Copy link
Contributor Author

Suor commented Jul 24, 2019

P.S. Dulwich is somewhere inbetween.

@shcheklein
Copy link
Member

@Suor Dulwich - is there active development, at least? can you confirm that it does not depend on command line? If we decide to be involved in one those projects we need to pick one that is more or less active and we can be sure that we'll be able to release new versions, build wheels, etc. And it should be aligned with our long term goals - things like no dependency on CLI is important, for example.

@Suor
Copy link
Contributor Author

Suor commented Jul 24, 2019

@shcheklein Dulwich is not actively developed, none of the three is. Support is also so so, I would not call it reliable.

Why do you want to get away from CLI? I see using that as an advantage, this makes things more reliable and easier to set up for users, since they already did that. The only downsides I see is speed (presumably, and we need benching to really say this is an issue) and processes on windows, which we worked around so far.

@shcheklein
Copy link
Member

Because it's extremely strange to have git installed as a dependency for an API to work. I've seen already some push back on this.

@Suor
Copy link
Contributor Author

Suor commented Jul 24, 2019

I see. But for everything else it's an advantage.

If we go pygit2 way we will need to own/develop/release it, I see zero chances it will work other way. We will also need to provide custom wheels to include libssh and OpenSSL and their counterparts on other systems to make it work with https and ssh urls, otherwise users will need to install those manually. We will need add lots of code to dvc to support many different use cases and we will get lots of bugs also.

If we go Dulwich way we most probably will also need to support it. If it turns out to be slow - it is mostly pure python - then we will need to rewrite its parts in C or Cython, there are some already.

@shcheklein
Copy link
Member

k, how about we use both then? For api we need some limited functionality - only fetch the bunlde of files that belong to a revision. Can we do this somehow with some other library?

@Suor
Copy link
Contributor Author

Suor commented Jul 24, 2019

I guess with some hack we may use pygit2 or Dulwich just for API.

@shcheklein
Copy link
Member

@Suor what kind of hacks do you see? What else do we need from them to start using? How stable the "clone" part for them?

@Suor
Copy link
Contributor Author

Suor commented Jul 24, 2019

@shcheklein looked through the code and it looks like simply changing calls in external_repo() implementation to clone and checkout with different lib would do.

@shcheklein
Copy link
Member

@Suor is there a function/command in Git to only checkout, I wonder? Anyway, sounds like a good workaround then. But it looks like we will need to find a way to support new stuff in GitPython. So, let's create a ticket for fixing GriPython + a ticket for fixing Dulwich and use it in API.

@ghost
Copy link

ghost commented Jul 27, 2019

@shcheklein what do you mean by "only checkout", isn't git checkout ?

@shcheklein
Copy link
Member

@MrOutis just download the workspace that corresponds to a revision as a tar bundle for example. Don't download .git, etc. Use simple http protocol, etc. I'm not sure if git server supports stuff like this, but it would be handy for our API.

@ghost
Copy link

ghost commented Jul 27, 2019

@shcheklein looks like it does, but GitHub doesn't allow that transaction https://twitter.com/GitHubHelp/status/322818593748303873

https://www.gilesorr.com/blog/git-archive-github.html

@pmrowla pmrowla added this to DVC Oct 19, 2021
@pmrowla pmrowla moved this to Backlog in DVC Oct 19, 2021
@pmrowla pmrowla moved this from Backlog to In Progress in DVC Oct 19, 2021
@efiop efiop moved this from In Progress to Review In Progress in DVC Dec 14, 2021
Repository owner moved this from Review In Progress to Done in DVC Mar 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
p3-nice-to-have It should be done this or next sprint refactoring Factoring and re-factoring
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants