Accelerated Alignment Code + Adapter for ProjectCommitData #32
Conversation
My plan for testing was to pick two real commits in real projects and compare the alignment to what I expect it to be, but reading the
Sorry, just noticed your message. The script at https://github.com/a-gardner1/coq-pearls/blob/main/scripts/data/extract_cache.py can be invoked with a subset of projects ( Instead, I would follow the recipe of
Totally untested draft function!
Don't merge this!
"""
# only attempt to align files present in both roots.
I assume given the comment above that this is just a placeholder for testing.
One reason that I prescribed treating the entire commit as one big file constructed from dependency order is that it makes the alignment largely invariant to renamed files (if a rule for always selecting the same topological sort could be established, which `coqdep` may or may not follow, then the alignment would be completely invariant to renamed files) and more robust to moved definitions.
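For concreteness, here is a minimal sketch of the "one big file" construction, assuming a per-file dependency map (e.g., derived from `coqdep` output) is already available. The lexicographic tie-breaking rule is only one possible way to make the sort deterministic, and, as the reply below notes, it is still sensitive to renames among independent files.

```python
from graphlib import TopologicalSorter


def concat_in_dependency_order(file_contents: dict, deps: dict) -> str:
    """Concatenate a commit's files into one blob in a reproducible order.

    ``file_contents`` maps file path -> source text; ``deps`` maps file
    path -> set of paths it depends on (e.g., as reported by coqdep).
    Every path appearing in ``deps`` is assumed to have an entry in
    ``file_contents``. Ties between independent files are broken
    lexicographically so the same topological sort is always selected.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    ordered = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # deterministic tie-breaking
        ordered.extend(ready)
        sorter.done(*ready)
    return "\n".join(file_contents[path] for path in ordered)
```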
If you have two files whose relative order of compilation doesn't matter, a renaming could change their order in the topological sort.
There might be a compromise method here-- if files have matching paths, then align those files. For any files that do not have a counterpart in the other commit, concat them all in dependency order and attempt a last-ditch alignment.
This preserves the performance of aligning smaller chunks and can still handle aligning moved files.
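Roughly, and with all names as hypothetical stand-ins rather than the PR's actual API, that compromise could look like:

```python
def align_commit_pair(files_a: dict, files_b: dict, align_file_pair, align_blob):
    """Align matching paths directly; make a last-ditch pass over the rest.

    ``files_a``/``files_b`` map path -> source text for the two commits.
    ``align_file_pair`` aligns two versions of a single file; ``align_blob``
    aligns two concatenated blobs. Both are placeholders.
    """
    alignments = {}
    shared = files_a.keys() & files_b.keys()
    for path in shared:
        # Matching paths: align the two versions of the file directly.
        alignments[path] = align_file_pair(files_a[path], files_b[path])
    # Files with no counterpart (added, deleted, or moved): concatenate each
    # side (ideally in dependency order; plain sorting is a placeholder here)
    # and attempt one big alignment.
    only_a = "\n".join(files_a[p] for p in sorted(files_a.keys() - shared))
    only_b = "\n".join(files_b[p] for p in sorted(files_b.keys() - shared))
    if only_a or only_b:
        alignments["<unmatched>"] = align_blob(only_a, only_b)
    return alignments
```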
~~We could also use diffs to realign only files in chunks that were modified. This reuses alignment.~~ I see that the original issue already lines this part out.
Create an ordered dict of files, where each key corresponds to a list of files that presumably have changeable order based on the dependency sort. Do alignment per chunk. Do different chunks in parallel?
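A rough sketch of that idea, with the chunking assumed to be done already and `align_chunk` as a hypothetical per-chunk alignment function:

```python
from collections import OrderedDict
from multiprocessing import Pool


def align_chunks(chunks_a, chunks_b, align_chunk, processes=4):
    """Align corresponding dependency-sort chunks of two commits in parallel.

    ``chunks_a``/``chunks_b`` are ordered dicts mapping a chunk key to the
    list of files whose relative order within that chunk may vary.
    Chunks are independent, so they can be farmed out to a process pool
    (``align_chunk`` must be defined at module level to be picklable).
    """
    keys = [k for k in chunks_a if k in chunks_b]
    with Pool(processes) as pool:
        results = pool.starmap(align_chunk,
                               [(chunks_a[k], chunks_b[k]) for k in keys])
    return OrderedDict(zip(keys, results))
```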
Yes, I suggested something to that effect in my outline of the corresponding GitHub issue (I'm on my phone, not sure how to link right now).
The fact that one can switch two independent files in the topological sort is just a coarser example of being able to switch two independent sentences within a file. Order-preserving alignment cannot adequately address this situation, though we do not know how often it occurs. I think alignment is a fine place to start, but one of the main outcomes I am looking for with this PR is for it to set the groundwork to allow for alternatives in a plug-and-play fashion.
Indeed, I do not think order is an important property worth preserving as an invariant. I think it is a proxy for the more difficult-to-obtain dependency graph between definitions (which we will actually have enough information cached to compute). More formally, I'd posit that our objective is to assign sentences (or at least named definitions/lemmas/constants/etc.) from one set
Yeah, I didn't have a good way to parse git diffs, and the lazy algorithm operates in linear-ish time on things that are unchanged, so I didn't end up getting around to it. Though I can see, given the intent to move to an all-pairs method, the desire to immediately discharge as many definitions as possible to avoid a large quadratic runtime.
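To make the quadratic-runtime concern concrete, the all-pairs idea can be phrased as an assignment problem. The sketch below uses `leven` (which appears later in this thread) and SciPy's `linear_sum_assignment` purely for illustration; it is an assumption about the approach, not what the PR implements.

```python
import numpy as np
from leven import levenshtein
from scipy.optimize import linear_sum_assignment


def assign_definitions(defs_a, defs_b):
    """Match definitions across two commits by minimizing total edit distance.

    Builds the full pairwise edit-distance matrix, which is quadratic in
    both time and memory (hence the appeal of discharging unchanged
    definitions up front via git diffs), then solves the assignment problem.
    """
    cost = np.array([[levenshtein(a, b) for b in defs_b] for a in defs_a],
                    dtype=float)
    rows, cols = linear_sum_assignment(cost)
    return [(int(i), int(j)) for i, j in zip(rows, cols)]
```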
We could also use 'git -M' to detect a moved file.
Or more generally look for specific change types. https://gitpython.readthedocs.io/en/stable/reference.html#git.diff.DiffIndex.change_type
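For example, with GitPython (the repository path here is hypothetical):

```python
from git import Repo

repo = Repo("path/to/project")  # hypothetical local checkout
old, new = repo.commit("HEAD~1"), repo.commit("HEAD")
# find_renames=True is forwarded to git as --find-renames (the -M behavior);
# recent GitPython versions may already enable rename detection by default.
diff_index = old.diff(new, find_renames=True)
for change_type in ("A", "D", "M", "R"):
    for d in diff_index.iter_change_type(change_type):
        if change_type == "R":
            print(f"renamed: {d.rename_from} -> {d.rename_to}")
        else:
            print(f"{change_type}: {d.a_path or d.b_path}")
```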
I got this error after trying to run the extraction script invoked as such: where I set up the directory format so that the dataset directory in the repo aligns with the default path.
There's a warning about not being able to find float, but then what was it extracting?
The warning just means that nothing was extracted. That is a new error... at first glance it implies that the entire AST for some command was just a location. We will investigate it later today.
I've got an MR waiting to merge that fixed a couple of bugs, though I was unable to reproduce your error. One bug was pretty significant and might have caused yours: the switch was not being passed to
I think the size of the cache can be reduced (e.g., by storing diffs of proof goals versus complete sets for each sentence), but not sure to what extent.
The changes have been merged. Can you try again? And if you hit the same error, see if you can determine which file and commit caused it. I recommend adding
That looks like it fixed it. It's making progress now.
Though, it used up all 32 GB of RAM on my computer. Then I gave it 10 GB of swap and it still got OOM-killed a while later, just as it was finishing up the first commit. I probably have to do as
Oof, looks like some more profiling is in order. I'm sure we can reduce the memory usage.
I have pushed a fix for a memory leak due to some missing
I also have a couple of changes in mind to
Here's another fun one. Use a default-commit of
I thought this was the 2019 filter, but I edited that to 2010 and re-pip-installed, and this still happened. I can manually clone the verdi repo and check out that hash without issues, but this still happens. This error didn't occur before I changed the default-commit. This could be this upstream issue or related to this upstream issue: gitpython-developers/GitPython#1016 In the meantime I might just copy-paste some definitions from some files of interest for a synthetic test.
Force-pushed from 394511d to 95d7026.
OK, added a test by picking an interesting pair of commits from a project. I did some hacky stuff to include definitions in the alignment because those were deleted/added between the commits I picked. Made sure to include the license of the repository I took the examples from. Still needs a test of multiple files.
With this new test, documentation, and accompanying bugfix for the new test, I think this is now in a reasonable state for someone to take a look at.
[Clipped]
@tom-p-reichel I'm not seeing a commit with that hash in the verdi-chord repository. DistributedComponents/verdi-chord@e5f4b46 gives me a 404. I thought it might be a blob hash in that repo, but it doesn't seem to be that either. Where did that hash come from?
Summary of offline conversation: we think the commit iterators might need to be modified to not follow chains of parents or children, as this hash likely arose as the parent of some commit that no longer exists after a rebase or other history-altering modification to the repository. Instead, we will just have to get the list of all commits in the repository up front, sort them by date, and iterate over the list.
Oh no! Don't do that! No, I think what happened is that there are actually two popular Coq repos named "verdi": one by DistributedComponents and one by uwplse. I had been resolving names to GitHub repos using Google with no problem up until this one, which was ambiguous; my fault. So that hash literally didn't exist in the repo; it existed in another Coq project named "verdi", which I didn't expect to exist.
Force-pushed from d5e6980 to 0594d20.
Implemented requested stub-- also found out that I was wrong about which edit distance I was using. Hamming appears to be for fixed-length strings; I was using Levenshtein.
Got some non-trivial performance improvements by running alignment on numpy arrays of characters as ints instead of strings. Kind of thought numba would do that automatically.
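The conversion could be as simple as the following (a guess at the approach; the actual change may differ):

```python
import numpy as np


def as_codepoints(s: str) -> np.ndarray:
    """Encode a string as a numpy array of Unicode code points.

    Numba-jitted kernels are typically much faster over numeric arrays
    than over Python strings, so converting once up front before calling
    the alignment routine can pay off.
    """
    return np.frombuffer(s.encode("utf-32-le"), dtype=np.uint32)


# Hypothetical usage with an array-accepting alignment kernel:
# fast_edit_distance(as_codepoints("flaw"), as_codepoints("lawn"))
```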
FYI, I am going to replace usages of

```python
>>> from timeit import timeit
>>> import leven
>>> from prism.util.alignment import fast_edit_distance
>>> fast_edit_distance("flaw", "lawn", True)[0]  # eliminate jit compilation from timeit
2.0
>>> timeit(lambda: leven.levenshtein("flaw", "lawn"))
0.2536887312307954
>>> timeit(lambda: fast_edit_distance("flaw", "lawn", True)[0])
31.69242916069925
```
The changes in this PR have been rebased and merged with the main branch.
Numba alignment stuff works, shimmed into existing tests for alignment.
I did some testing by hand on `file_alignment`, but I need to add some rigorous testing on `align_commits` before this should be merged-- PR sent now for visibility.