Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PRECOMMIT] Better UX for handling of edge cases for pre-commit hook #780

Closed
Skylion007 opened this issue May 10, 2021 · 3 comments · Fixed by #796
Closed

[PRECOMMIT] Better UX for handling of edge cases for pre-commit hook #780

Skylion007 opened this issue May 10, 2021 · 3 comments · Fixed by #796
Milestone

Comments

@Skylion007
Copy link
Contributor

I've been using the pre-commit JuPyText and found an issue with our current strategy. Currently, we use whichever source file has the more recent timestamp to determine which file should be updated. However, while this works well locally, it can cause issues later. We have a pre-commit install that runs as part of the CircleCI check. Each time the CI is run, the repo is freshly cloned. The expected behavior is that if only one of the two paired files is updated, it should suggest updating the "older" file to the more recently committed file. However, the git clone has freshly cloned each file and our git hook currently relies on timestamps. The notebook has been cloned later in the process than the Python file, therefore the timestamp is newer. Therefore, the older notebook tries to update the more recent script instead of the other way around producing an incorrect diff: the broken pre-commit run can be found here:
https://app.circleci.com/pipelines/github/facebookresearch/habitat-lab/1630/workflows/d81b2754-6461-45d1-bc95-634e82752119/jobs/3793

The expected behavior is that if one file is from a more recent commit, that one should be updated. There will be ambiguity when two incorrect files are from the same commit, but this is when we may be able to fallback from utime.

Solution: We should use the following tree to decide which file to update:

  1. Is the file staged but not committed? Are all other files unchanged from the git HEAD? If so, give priority to the staged file priority and issue an update for the other files. (An unstaged and unchanged file should never overwrite the staged file).
  2. Otherwise, does one of the files have a more recent commit than the other file? Are all other files unchanged from the git HEAD? If so, use these timestamps in liu of fs ones. (An unstaged be more outdated file should always be overwritten by newer one if both are unstaged and differ: ie. in the case of a CI check)
  3. Otherwise, fallback to current behavior. (FS-time)
@mwouts mwouts added this to the 1.11.3 milestone May 17, 2021
@mwouts
Copy link
Owner

mwouts commented May 17, 2021

Hi @Skylion007 👋 , thanks for discussing this!

Sure I agree that is an area where we could improve the logic.

Before we go to that let me mention that in some areas of the program (i.e. the contents manager) we do have a tolerance for these timestamps that might not be in the right order, and a tolerance of 1 second is allowed on timestamp:

jupytext/jupytext/config.py

Lines 135 to 141 in ee468c8

outdated_text_notebook_margin = Float(
1.0,
help="Refuse to overwrite inputs of a ipynb notebooks with those of a "
"text notebook when the text notebook plus margin is older than "
"the ipynb notebook (NB: This option is ignored by Jupytext CLI)",
config=True,
)

In the context of the --pre-commit-mode which we are discussing here we can certainly do a better job. The current implementation is True/False according to whether the file is in the git index (if a member of the pair is), or else we fallback on timestamps:

jupytext/jupytext/cli.py

Lines 940 to 948 in ee468c8

use_git_index_rather_than_timestamp = pre_commit_mode and file_in_git_index(nb_file)
def get_timestamp(path):
if not os.path.isfile(path):
return None
if use_git_index_rather_than_timestamp:
# Files that are in the git index are considered more recent
return file_in_git_index(path)
return os.lstat(path).st_mtime

I think it should be feasible to add an intermediate case and use the git commit order when all the member of the pair are unchanged from git HEAD, but come from different commits. Do you think you could propose a PR for that?

@mwouts
Copy link
Owner

mwouts commented May 30, 2021

Hello @Skylion007 , I have prepared a PR which implements something that sounds like what you suggest (and passes all the test). Would you like to give an eye to #796?

Also I'd be more confident if we could add a test that reproduces the issue documented above. Let me come back to you with a short proposal...

mwouts added a commit that referenced this issue May 30, 2021
@mwouts
Copy link
Owner

mwouts commented May 30, 2021

What I can propose is 1e1b2e5:

@pytest.mark.parametrize(
"commit_order", [["test.py", "test.ipynb"], ["test.ipynb", "test.py"]]
)
@pytest.mark.parametrize("sync_file", ["test.py", "test.ipynb"])
def test_sync_pre_commit_mode_respects_commit_order_780(
tmpdir,
cwd_tmpdir,
tmp_repo,
python_notebook,
commit_order,
sync_file,
):
file_1, file_2 = commit_order
nb = python_notebook
nb.metadata["jupytext"] = {"formats": "ipynb,py:percent"}
nb.cells = [new_code_cell("1 + 1")]
write(nb, file_1)
tmp_repo.git.add(file_1)
tmp_repo.index.commit(file_1)
# This needs to >1 sec because commit timestamps have a one-second resolution
time.sleep(1.2)
nb.cells = [new_code_cell("2 + 2")]
write(nb, file_2)
tmp_repo.git.add(file_2)
tmp_repo.index.commit(file_2)
# Invert file timestamps
ts_1 = os.stat(file_1).st_mtime
ts_2 = os.stat(file_2).st_mtime
os.utime(file_1, (ts_2, ts_2))
os.utime(file_2, (ts_1, ts_1))
jupytext(["--sync", "--pre-commit-mode", sync_file])
for file in commit_order:
nb = read(file)
assert nb.cells[0].source == "2 + 2", file

That test fails on master and passes on the git_timestamp branch.
Do you deem it representative of the issue reported here?

mwouts added a commit that referenced this issue Jun 4, 2021
* New function git_timestamp that returns the commit timestamp

* Reproduce #780 with a test

* The commit time is available as a Unix timestamp with %ct
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants