
Accelerated Alignment Code + Adapter for ProjectCommitData #32

Closed

Conversation

tom-p-reichel (Contributor)

The Numba alignment code works and is shimmed into the existing alignment tests.

I did some testing by hand on file_alignment, but I need to add rigorous tests for align_commits before this should be merged -- PR sent now for visibility.

@tom-p-reichel (Contributor, Author) commented Oct 20, 2022

My plan for testing was to pick two real commits in real projects and compare the alignment to what I expect it to be, but after reading test_build_cache I still don't know how to populate a CoqProjectBuildCache with a project so that I can get two ProjectCommitData entries. Is it worth moving a project through the whole pipeline, or should I just build the ProjectCommitData objects directly?

@a-gardner1 (Collaborator)

Sorry, just noticed your message. The script at https://github.com/a-gardner1/coq-pearls/blob/main/scripts/data/extract_cache.py can be invoked with a subset of projects (--project-names) and a limited number of commits (--max-num-commits). A good candidate project is probably lambda or float, which we use in some of our unit tests. However, I do not think a time-efficient unit test for extraction could be made for alignment over two whole commits without having some ProjectCommitData precomputed.

Instead, I would follow the recipe of extract_cache_profile.py using the files_to_use argument and limit the extraction to a small subset of files. For an actual unit test, I would also have the ProjectCommitData precomputed and serialized in a file alongside the test module. That way the test is independent of any quirks in the actual extraction logic and doesn't incur the associated overhead.
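For illustration, a unit test following that recipe might look roughly like this; the pickle usage, data file names, and align_commits import path are hypothetical stand-ins, not the project's actual serialization helpers:

```python
# Hypothetical test sketch: load two precomputed ProjectCommitData objects
# serialized next to the test module and align them.  The pickle usage, file
# names, and align_commits import path are assumptions for illustration only.
import pickle
from pathlib import Path

from prism.data.repair.align import align_commits  # assumed location

_DATA_DIR = Path(__file__).parent


def test_align_two_commits():
    with open(_DATA_DIR / "float_commit_a.pkl", "rb") as f:
        commit_a = pickle.load(f)
    with open(_DATA_DIR / "float_commit_b.pkl", "rb") as f:
        commit_b = pickle.load(f)
    alignment = align_commits(commit_a, commit_b)
    # Spot-check a handful of expected (old, new) sentence pairs chosen by hand.
    assert len(alignment) > 0
```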

[Excerpt from the diff under review:]
Totally untested draft function!
Don't merge this!
"""
# only attempt to align files present in both roots.
Collaborator

I assume given the comment above that this is just a placeholder for testing.

One reason I prescribed treating the entire commit as one big file constructed in dependency order is that it makes the alignment largely invariant to renamed files and more robust to moved definitions (if a rule for always selecting the same topological sort could be established, which coqdep may or may not follow, the alignment would be completely invariant to renamed files).

@SimpleConjugate (Collaborator) Oct 20, 2022

If you have two files whose relative order of compilation doesn't matter, a renaming could change their order in the topological sort.

Contributor (Author)

There might be a compromise method here-- if files have matching paths, then align those files. For any files that do not have a counterpart in the other commit, concat them all in dependency order and attempt a last-ditch alignment.

This preserves the performance of aligning smaller chunks and can still handle aligning moved files.
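For illustration, a rough sketch of that compromise (the align_files callable and the command_data mapping are placeholders, not the PR's actual API):

```python
# Rough sketch of the compromise: align files whose paths match directly, then
# concatenate the leftovers on each side and attempt one last-ditch alignment.
# `align_files` and the `command_data` mapping are placeholders here.
def align_commits_compromise(commit_a, commit_b, align_files):
    paths_a = set(commit_a.command_data)
    paths_b = set(commit_b.command_data)

    alignments = {}
    for path in paths_a & paths_b:
        # Matching paths: align the two versions of the file directly.
        alignments[path] = align_files(
            commit_a.command_data[path], commit_b.command_data[path])

    # Files with no counterpart in the other commit: concatenate the stragglers
    # (sorted by path here as a stand-in for the real dependency order) and try
    # a single alignment over the concatenation.
    leftovers_a = [s for p in sorted(paths_a - paths_b)
                   for s in commit_a.command_data[p]]
    leftovers_b = [s for p in sorted(paths_b - paths_a)
                   for s in commit_b.command_data[p]]
    if leftovers_a or leftovers_b:
        alignments[None] = align_files(leftovers_a, leftovers_b)
    return alignments
```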

@SimpleConjugate (Collaborator) Oct 21, 2022

~~We could also use diffs to re-align only the files in chunks that were modified. This reuses alignment.~~ I see that the original issue already covers this part.

Create an ordered dict of files, where each key corresponds to a list of files whose relative order could change under the dependency sort. Do alignment per chunk. Do different chunks in parallel?
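For illustration, running the per-chunk alignments in parallel might look roughly like this (all names here are placeholders, not the PR's API):

```python
# Sketch of running the per-chunk alignments in parallel.  `chunk_pairs` would
# be a list of (old_chunk, new_chunk) sentence lists produced by grouping files
# on the dependency sort; `align_chunk` is the single-chunk alignment.  Both
# are placeholders; `align_chunk` must be a top-level (picklable) function.
from concurrent.futures import ProcessPoolExecutor


def align_chunks_in_parallel(chunk_pairs, align_chunk, max_workers=None):
    if not chunk_pairs:
        return []
    olds, news = zip(*chunk_pairs)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(align_chunk, olds, news))
```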

@a-gardner1 (Collaborator) Oct 21, 2022

Yes, I suggested something to that effect in my outline of the corresponding GitHub issue (I'm on my phone, not sure how to link right now).

The fact that one can switch two independent files in the topological sort is just a coarser example of being able to switch two independent sentences within a file. Order-preserving alignment cannot adequately address this situation, though we do not know how often it occurs. I think alignment is a fine place to start, but one of the main outcomes I am looking for with this PR is for it to lay the groundwork for alternatives in a plug-and-play fashion.

Indeed, I do not think order is an important property worth preserving as an invariant. I think it is a proxy for the harder-to-obtain dependency graph between definitions (which we will actually have enough information cached to compute). More formally, I'd posit that our objective is to assign sentences (or at least named definitions/lemmas/constants/etc.) from one set $U$ to another $V$ via a minimal-cost injective assignment $\pi$ such that $\pi(U)$ satisfies the dependency relations of $V$ (and $\pi^{-1}(V)$ satisfies the dependency relations of $U$). The cost to be minimized would be the overall cost of the assignment $\sum_{u \in U} C(u, \pi(u))$, where $C$ may be the edit distance or something else (e.g., the product of the edit distance and the distance between the locations of assigned sentences, thus capturing the amount of "work" one must do to move the sentence to its destination). This could easily be cast as a linear program and possibly approximately solved via a constrained Sinkhorn iteration. However, most of the lift to prepare the input for it is the same as for alignment, so it makes sense to start (and, with luck, quit) there.
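For concreteness, here is a minimal sketch of the assignment view using SciPy's Hungarian solver with plain Levenshtein distance as $C$; it ignores the dependency constraints, which are what would turn this into the full linear program:

```python
# Minimal sketch of the assignment formulation: build a pairwise cost matrix
# with the Levenshtein distance and solve the minimum-cost assignment with
# SciPy's Hungarian solver.  Dependency-order constraints are ignored here;
# enforcing them is what would require the full linear program.
import numpy as np
from scipy.optimize import linear_sum_assignment


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def align_sentences(old, new):
    """Return (i, j) index pairs assigning old sentences to new ones."""
    cost = np.array([[levenshtein(u, v) for v in new] for u in old])
    rows, cols = linear_sum_assignment(cost)   # handles rectangular matrices
    return list(zip(rows.tolist(), cols.tolist()))
```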

Contributor (Author)

Yeah, I didn't have a good way to parse git diffs, and the lazy algorithm runs in roughly linear time on unchanged content, so I never got around to it. Though I can see, given the intent to move to an all-pairs method, the desire to immediately discharge as many definitions as possible to avoid a large quadratic runtime.

Collaborator

We could also use `git diff -M` to detect moved (renamed) files.
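For example, with GitPython (which the project already uses), rename detection could be surfaced roughly as follows; the option plumbing and attribute names are assumptions, not verified against this codebase:

```python
# Sketch: use git's rename detection (-M / --find-renames) through GitPython to
# pair up moved files before aligning their contents.  The diff options and
# Diff attribute names are assumptions, not verified against this codebase.
from git import Repo

repo = Repo("/path/to/project")       # hypothetical checkout
old = repo.commit("commit_a")         # placeholder revisions
new = repo.commit("commit_b")

renames = {}
for diff in old.diff(new, find_renames=True):  # roughly `git diff -M`
    if diff.renamed_file:
        renames[diff.rename_from] = diff.rename_to
```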


@tom-p-reichel (Contributor, Author) commented Oct 21, 2022

I got this error after running the extraction script, invoked as follows:
python scripts/data/extract_cache.py --project-names float --max-num-commits 30 --root "/home/tpr/work/ra/PRISM"

where I set up the directory format so that the dataset directory in the repo aligns with the default path.

/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../prism/data/commit_map.py:376: UserWarning: No results found for float
  warnings.warn(f"No results found for {p.name}")
<class 'prism.language.gallina.exception.SexpAnalyzingException'> encountered in project float:
Traceback (most recent call last):
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../
prism/language/gallina/analyze.py", line 894, in analyze_vernac
    control_flag = ControlFlag(vernac_control[0].content)
  File "/usr/lib/python3.8/enum.py", line 339, in __call__
    return cls.__new__(cls, value)
  File "/usr/lib/python3.8/enum.py", line 663, in __new__
    raise ve_exc
ValueError: 'loc' is not a valid ControlFlag

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/seutil/../prism/data/commit_map.py", line 104, in _project_commit_fmap
    result = commit_fmap(project, commit, result)
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/seutil/../prism/data/commit_map.py", line 324, in _commit_fmap_and_update
    result = commit_fmap(project, commit_sha, result)
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 1150, in extract_cache_func
    extract_cache(
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 653, in extract_cache
    extract_cache_new(
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 720, in extract_cache_new
    command_data = extract_vernac_commands(project, files_to_use)
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 563, in extract_vernac_commands
    command_data[filename] = _extract_vernac_commands(
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 398, in _extract_vernac_commands
    vernac = SexpAnalyzer.analyze_vernac(sentence.ast)
  File "/home/tpr/work/ra/coq-pearls/lib/python3.8/site-packages/seutil/../prism/language/gallina/analyze.py", line 896, in analyze_vernac
    raise SexpAnalyzingException(sexp) from e
prism.language.gallina.exception.SexpAnalyzingException: 
in sexp: (loc 
  (
    (
      (fname ToplevelInput) 
      (line_nb 1) 
      (bol_pos 0) 
      (line_nb_last 1) 
      (bol_pos_last 0) 
      (bp 0) 
      (ep 31))))

Done

There's a warning about not being able to find float, but then what was it extracting?

@a-gardner1 (Collaborator)

I got this error after running the extraction script, invoked as follows:

python scripts/data/extract_cache.py --project-names float --max-num-commits 30 --root "/home/tpr/work/ra/PRISM"

[Clipped]

There's a warning about not being able to find float, but then what was it extracting?

The warning just means that nothing was extracted. That is a new error... at first glance, it implies that the entire AST for some command was just a location. We will investigate it later today.

@a-gardner1 (Collaborator)

I've got an MR waiting to merge that fixed a couple of bugs, though I was unable to reproduce your error. One bug was pretty significant and might have caused yours: the switch was not being passed to SerAPI. I was able to produce the cache for one commit of float, which came out to 313 MB (12 MB compressed).

I think the size of the cache can be reduced (e.g., by storing diffs of proof goals instead of the complete set for each sentence), but I'm not sure to what extent.

@a-gardner1 (Collaborator) commented Oct 21, 2022

The changes have been merged. Can you try again? And if you hit the same error, see if you can determine which file and commit caused it. I recommend adding --force-serial for debugging.

@tom-p-reichel (Contributor, Author)

That looks like it fixed it. It's making progress now.

@tom-p-reichel (Contributor, Author)

Though, it used up all 32 GB of RAM on my computer. Then I gave it 10 GB of swap and it still got OOM-killed a while later, just as it was finishing up the first commit. I probably have to do as extract_cache_profile does and pick a couple of files to align.

@a-gardner1 (Collaborator)

Though, it used up all 32 GB of RAM on my computer. Then I gave it 10 GB of swap and it still got OOM-killed a while later, just as it was finishing up the first commit. I probably have to do as extract_cache_profile does and pick a couple of files to align.

Oof, looks like some more profiling is in order. I'm sure we can reduce the memory usage.

@a-gardner1 (Collaborator)

I have pushed a fix for a memory leak due to some missing Py_DECREF calls in the C extension for sexp parsing. I've confirmed via memory profiling that there are no more leaks.

I also have a couple of changes in mind for SerAPI and the cached data representations that ought to significantly reduce both memory usage and time to completion (if the code is currently structured the way I think it is in my head, then I expect to nearly double the overall speed).

@tom-p-reichel (Contributor, Author)

Here's another fun one.

Use a default-commit of e5f4b46405b24efb4ec9956e486df2b87adb10b2 for verdi-chord.

python scripts/data/extract_cache.py --project-names verdi-chord --max-num-commits 1 --root "/home/tpr/work/ra/PRISM" --n-build-workers 15

  File "/usr/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../prism/data/commit_map.py", line 117, in _project_commit_fmap_
    return _project_commit_fmap(*args)
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../prism/data/commit_map.py", line 99, in _project_commit_fmap
    for commit in pbar:
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/prism/data/extract_cache.py", line 753, in cache_extract_commit_iterator
    iterator = ChangedCoqCommitIterator(project, starting_commit_sha)
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../prism/project/repo.py", line 82, in __init__
    self._repo_initial_head = repo.commit(commit_sha)
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/repo/base.py", line 590, in commit
    return self.rev_parse(str(rev) + "^0")
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/repo/fun.py", line 249, in rev_parse
    obj = cast(Commit_ish, name_to_object(repo, rev[:start]))
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/repo/fun.py", line 186, in name_to_object
    return Object.new_from_sha(repo, hex_to_bin(hexsha))
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/objects/base.py", line 94, in new_from_sha
    oinfo = repo.odb.info(sha1)
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/db.py", line 40, in info
    hexsha, typename, size = self._git.get_object_header(bin_to_hex(binsha))
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/cmd.py", line 1338, in get_object_header
    return self.__get_object_header(cmd, ref)
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/cmd.py", line 1325, in __get_object_header
    return self._parse_object_header(cmd.stdout.readline())
  File "/home/tpr/work/ra/PRISM/prism/pearls/lib/python3.8/site-packages/seutil/../git/cmd.py", line 1286, in _parse_object_header
    raise ValueError("SHA %s could not be resolved, git returned: %r" % (tokens[0], header_line.strip()))
ValueError: SHA b'e5f4b46405b24efb4ec9956e486df2b87adb10b2' could not be resolved, git returned: b'e5f4b46405b24efb4ec9956e486df2b87adb10b2 missing'

I thought this was the 2019 filter, but I edited that to 2010 and re-pip-installed, and this still happened.

I can manually clone the verdi repo and check out that hash without issues, but this still happens. This error didn't occur before I changed the default-commit.

This could be, or be related to, this upstream issue: gitpython-developers/GitPython#1016

In the meantime I might just copy paste some definitions from some files of interest for a synthetic test.

@tom-p-reichel (Contributor, Author) commented Oct 27, 2022

OK, added a test by picking an interesting pair of commits from a project. I did some hacky stuff to include definitions in the alignment because those were deleted/added between the commits I picked. I made sure to include the license of the repository I took the examples from. Still needs a test over multiple files.

@tom-p-reichel (Contributor, Author)

With this new test, documentation, and accompanying bugfix for the new test, I think this is now in a reasonable state for someone to take a look at.

@rwhender (Collaborator)

Here's another fun one. Use a default-commit of e5f4b46405b24efb4ec9956e486df2b87adb10b2 for verdi-chord. [Clipped; quoted from the comment above.]
@tom-p-reichel I'm not seeing a commit with that hash in the verdi-chord repository. DistributedComponents/verdi-chord@e5f4b46 gives me a 404. I thought it might be a blob hash in that repo, but it doesn't seem to be that either. Where did that hash come from?

[Two resolved review threads on prism/data/repair/align.py]
@a-gardner1 (Collaborator)

Here's another fun one. Use a default-commit of e5f4b46405b24efb4ec9956e486df2b87adb10b2 for verdi-chord. [Clipped; quoted from the comment above.]

@tom-p-reichel I'm not seeing a commit with that hash in the verdi-chord repository. [Clipped; quoted from the comment above.]

Summary of offline conversation: we think the commit iterators might need to be modified to not follow chains of parents or children, as this hash likely arose as the parent of some commit that no longer exists after a rebase or other history-altering modification to the repository. Instead, we will just have to get the list of all commits in the repository up front, sort them by date, and iterate over the list.
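With GitPython, that iteration scheme might look roughly like this (the repository path is a placeholder):

```python
# Sketch of the proposed scheme: enumerate every commit reachable from any ref
# up front, sort by commit date, and iterate over that list instead of walking
# parent/child chains.  The repository path is a placeholder.
from git import Repo

repo = Repo("/path/to/verdi-chord")
all_commits = sorted(
    repo.iter_commits("--all"),              # every commit on every ref
    key=lambda c: c.committed_datetime)
for commit in all_commits:
    ...  # extract or align this commit
```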

@tom-p-reichel (Contributor, Author) commented Oct 30, 2022

we think the commit iterators might need to be modified to not follow chains of parents or children, as this hash likely arose as the parent of some commit that no longer exists after a rebase or other history-altering modification to the repository

Oh no! Don't do that! No, I think what happened is that there are actually two popular Coq repos named "verdi": one by DistributedComponents and one by uwplse. I had been resolving names to GitHub repos using Google with no problem up until this one, which was ambiguous. My fault.

So that hash literally didn't exist in this repo; it existed in another Coq project named "verdi", which I didn't expect to exist.

@tom-p-reichel (Contributor, Author) commented Nov 3, 2022

Implemented the requested stub. Also found out that I was wrong about which edit distance I was using: Hamming distance is only for fixed-length strings; I was actually using Levenshtein.
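For reference, the distinction in code form (a plain-Python illustration, not the PR's implementation):

```python
# Hamming distance is only defined for equal-length strings; Levenshtein also
# counts insertions and deletions, so it can handle added or removed text.
def hamming(a: str, b: str) -> int:
    assert len(a) == len(b), "Hamming distance needs equal-length inputs"
    return sum(x != y for x, y in zip(a, b))

# hamming("flaw", "lawn") == 4, whereas the Levenshtein distance is 2
# (delete the leading 'f', append an 'n').
```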

@tom-p-reichel (Contributor, Author)

Got some non-trivial performance improvements by running alignment on NumPy arrays of characters (as integers) instead of strings. I had kind of assumed Numba would do that conversion automatically.
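A sketch of that conversion, for anyone following along (not the PR's exact code):

```python
# Sketch of the string-to-integer-array conversion: Numba is much happier with
# fixed-width integer arrays than with Python strings, so encode each sentence
# as an array of code points before calling the jitted alignment kernel.
import numpy as np


def to_codepoints(s: str) -> np.ndarray:
    return np.array([ord(c) for c in s], dtype=np.int32)


a = to_codepoints("Lemma foo : True.")
b = to_codepoints("Lemma foo' : True.")
# a and b can now be passed to an @njit-compiled edit-distance kernel.
```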

@a-gardner1 (Collaborator)

FYI, I am going to replace usages of fast_edit_distance with a third-party implementation of the Levenshtein distance that is considerably faster:

>>> from timeit import timeit
>>> import leven
>>> from prism.util.alignment import fast_edit_distance
>>> fast_edit_distance("flaw", "lawn", True)[0]  # warm-up call so JIT compilation isn't counted by timeit
2.0
>>> timeit(lambda : leven.levenshtein("flaw", "lawn"))
0.2536887312307954
>>> timeit(lambda : fast_edit_distance("flaw", "lawn", True)[0])
31.69242916069925

@a-gardner1 (Collaborator)

The changes in this PR have been rebased and merged with the main branch.

@a-gardner1 closed this Nov 27, 2022