Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flip core algorithm so everything is no longer the mirror image of Myers's paper #440

Merged
merged 2 commits into from
Dec 27, 2023

Conversation

ExplodingCabbage
Copy link
Collaborator

@ExplodingCabbage ExplodingCabbage commented Dec 21, 2023

Until now, jsdiff's entire implementation has been kind of symmetrically flipped from the algorithm described in the Myers paper. In Myers' paper, each column in the edit graph corresponds to a character in the OLD text (and hence horizontal movements correspond to deletions), while each ROW corresponds to a character in the NEW text (and hence vertical movements correspond to insertions). See page 4:

... each horizontal edge to point (x,y) corresponds to the delete command ‘‘x D’’; and a sequence of
vertical edges from (x,y) to (x,z) corresponds to the insert command ...

The numbers of diagonals, meanwhile, are such that a larger diagonal number corresponds to having a higher x - i.e. having done more deletions:

Number the diagonals in the grid of edit graph vertices so that diagonal k consists of the points (x,y) for which
x − y = k

Note that this is the opposite of how diagonals are numbered in jsdiff:

addPath = bestPath[diagonalPath - 1]
removePath = bestPath[diagonalPath + 1],

The algorithm on page 6 shows us recording only the greatest x coordinate reached on each diagonal in array V, and shows us choosing whether to reach a diagonal via a vertical or horizontal move as follows:

If k = −D or k ≠ D and V[k − 1] < V[k + 1] Then
    x ← V[k + 1]
Else
    x ← V[k − 1]+1

x remaining the same (the top case) corresponds to a vertical move, i.e. an insertion. So the logic here says to do an insertion if the path on the more-deletion-heavy side of our target diagonal has made it further through the old string. This means we break ties by preferring to add an insertion on the end of a path that has previously done more deletions rather than adding a deletion on the end of a path which has previously done more insertions.

jsdiff does the mirror image of all of this. Its diagonalPath numbers go up as you do insertions, not deletions. In bestPath it stores objects that have a newPos property, corresponding to the y coordinate on a Myers edit graph, not the x. And it breaks ties in favour of doing a deletion, not an insertion.

This makes jsdiff's diffs differ from the diffs of more popular/canonical tools like the Unix diff command, which break ties in favour of doing insertions (i.e. in favour of putting deletions earlier). It also makes it confusing to compare this library to other Myers diff implementations or to apply optimizations suggested in Myers's paper (or elsewhere).

This PR rewrites the algo to better match the paper (though it introduces a hack, noted with a TODO, to preserve the incorrect tiebreak behaviour for now; fixing that is a breaking change and so will go out in a later release).

@ExplodingCabbage ExplodingCabbage changed the base branch from prefer-deletions-first to master December 27, 2023 20:51
…s algorithm

It makes it harder than it needs to be to reason about the algorithm when everything is flipped relative to Myers' paper. In Myers' paper, we keep track in an array of the index each diagonal has reached in the old string, and positive diagonal numbers represent diagonals where we have done more deletions than insertions. In JsDiff as it exists on master, we keep track in an array of the index each diagonal has reached in the NEW string, and positive diagonal numbers represent diagonals where we have done more insertions than deletions. Everything is mirrored, and this also causes an actual behaviour mismatch between JsDiff and an accurate Myers diff implementation - namely that when we diff e.g. 'abcd' against 'acbd' we output a diff that does an insertion and then later a deletion, rather than a deletion first and an insertion later like it should be.

This patch makes the code in base.js be a closer match to what's in Myers's paper, which should make comparing this to other implementations or adding any of the optimizations proposed in Myers's paper easier. For now, this DOESN'T change any behaviour (I've added a hack, noted with a TODO, to ensure this) although it'll now be straightforward to fix the bug mentioned above.
@ExplodingCabbage ExplodingCabbage force-pushed the make-algo-closer-match-to-myers-paper branch from c51fc0a to de93e91 Compare December 27, 2023 20:53
@ExplodingCabbage ExplodingCabbage marked this pull request as ready for review December 27, 2023 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant