Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(stdlib): add a proper diff implementation to the experimental package #4992

Merged
merged 2 commits into from
Jul 18, 2022

Conversation

jsternberg
Copy link
Contributor

@jsternberg jsternberg commented Jul 13, 2022

This adds a proper diff transformation that computes a longest common
subsequence table and then uses that to output a diff. This is in
contrast to the existing diff in testing.diff that both integrates
with our testing subsystem (by using yields) and also does a naive
algorithm where it just checks if A[X] = B[X] instead of trying to
determine sequences that are in common.

The limitations of this implementation are as follows:

  • Does not presently restrict memory usage for the LCS table. It should
    use the memory allocator to allocate the table but it constructs a
    matrix from Go slices.
  • Uses a recursive algorithm to trace the LCS table instead of a
    sequential algorithm. This could cause large tables to recurse too
    much.
  • Does not print only the context of a change. If it finds a difference
    in the table, it prints the entire table.

Related to #4815.

Fixes #4926.

Done checklist

  • docs/SPEC.md updated
  • Test cases written

@jsternberg jsternberg requested review from a team as code owners July 13, 2022 17:51
@jsternberg jsternberg requested review from nathanielc and sanderson and removed request for a team July 13, 2022 17:51
@jsternberg jsternberg force-pushed the feat/experimental-diff branch 6 times, most recently from fe2a33c to 2cfa533 Compare July 13, 2022 19:01
Copy link
Contributor

@sanderson sanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few doc updates.

stdlib/experimental/experimental.flux Outdated Show resolved Hide resolved
stdlib/experimental/experimental.flux Outdated Show resolved Hide resolved
stdlib/experimental/experimental.flux Outdated Show resolved Hide resolved
stdlib/experimental/experimental.flux Outdated Show resolved Hide resolved
stdlib/experimental/experimental.flux Outdated Show resolved Hide resolved
@jsternberg jsternberg force-pushed the feat/experimental-diff branch from 4f11569 to b622470 Compare July 13, 2022 19:06
@jsternberg
Copy link
Contributor Author

@sanderson applied all of the suggestions.

Copy link
Contributor

@sanderson sanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

…ckage

This adds a proper diff transformation that computes a longest common
subsequence table and then uses that to output a diff. This is in
contrast to the existing diff in `testing.diff` that both integrates
with our testing subsystem (by using yields) and also does a naive
algorithm where it just checks if `A[X] = B[X]` instead of trying to
determine sequences that are in common.

The limitations of this implementation are as follows:

* Does not presently restrict memory usage for the LCS table. It should
  use the memory allocator to allocate the table but it constructs a
  matrix from Go slices.
* Uses a recursive algorithm to trace the LCS table instead of a
  sequential algorithm. This could cause large tables to recurse too
  much.
* Does not print only the context of a change. If it finds a difference
  in the table, it prints the entire table.
@jsternberg jsternberg force-pushed the feat/experimental-diff branch from b622470 to 800c534 Compare July 13, 2022 20:26
@jsternberg jsternberg requested review from wolffcm and removed request for nathanielc July 14, 2022 14:41
Copy link

@wolffcm wolffcm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some questions, but I like this a lot.

I think for the data volumes that we typically use for diff, this will work fine. If we want to optimize for scale at a later time, we can.

// - want: Input stream for the `-` side of the diff.
// - got: Input stream for the `+` side of the diff.
// - epsilon: Specify how far apart two float values can be, but still be considered equal. Default is `0.000000001`.
// - nansEqual: Consider `NaN` float values as equal. Default is `false`.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this parameter necessary? I can't think of a case where one is using diff and wants NaNs to be considered different. If it's needed for some reason, true would seem like a better default to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd personally say no. I'd even remove epsilon and just consider some default.

My personal reason for removing epsilon is that future optimization of this probably depends on hashing the inputs and doing hash comparisons instead of the current column comparisons. Hash comparisons work for everything except floats where epsilon has to be considered. If we expose an epsilon option, we also have to ensure that whatever we do in the future works with epsilon, but if epsilon is just kind of hidden, we can potentially do something equivalent that doesn't involve a specific implementation.

Now I don't know if that exists. I just know that adding epsilon as a parameter leaks an implementation detail.

If we're fine with that, I'll remove both of these or I can just remove one.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it makes sense to expose epsilon as an option in the math package?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like you would get a lot of unintended diffs if you are comparing floats bit-for-bit? Floats can get really weird. I'm okay with removing epsilon for now though.

I wonder if there's a way to "canonicalize" float values such that very close values would always have the same bit pattern?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that using fmt.Sprintf("%.3f", f) or something might work? It would output the float up to three decimal places which is kind of the equivalent of what epsilon is supposed to do. I'm not sure if that works though and I don't really want to specify how a solution would work. Bit by bit definitely doesn't work though.

I think that's part of my reason for wanting to remove epsilon. I think the above wouldn't work if we had epsilon as a parameter because I could theoretically put any value there as the epsilon and it might not be possible with the above method. Once we figure it out, we can add appropriate parameters if needed.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I heard somewhere that some Intel processors use 80-bit registers to do math, but still store them as 64-bit values, so the compiler generating a store and a load might result in two values comparing as different, even if they would otherwise be identical. If this is the case, we'd be entirely at the mercy of the foibles of the compiler and architecture---seems really hard to work with portably?

experimental.diff(want, got)
|> rename(columns: {_diff: "diff"})
|> testing.diff(want: exp)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have some code to consolidate multiple chunks into one, would it make sense to have some tests for that here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. This seems like something that should get added to internal/debug and we can selectively use it in tests.

schema.wantIdx = append(schema.wantIdx, -1)
schema.gotIdx = append(schema.gotIdx, i)
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the .flux file, the type of this transformation is stream[A] for both sides---is that intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so? Both of the inputs should be equivalent types, but we still have to handle unknown variables. I'm open to having these two be different type signatures though.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might run afoul of the cases where the data is the same but that types technically differ due to our implementation of extensible records... We could loosen the type later if it becomes a problem though.

// is also 1 indexed instead of zero where zero is the base case. That means the first
// element in the table chunks corresponds to index 1 in this table.
//
// See the example in the wikipedia article for details.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is the memory used for this O(n*m) where n and m is the number of rows on the want and got tables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes although it's O(n+1*m+1) because of the extra row/column for the base case.

@jsternberg
Copy link
Contributor Author

I think my only concern with the memory allocations is that they are presently untracked memory allocations so the memory allocator won't abort the query if it's using too much memory. For our uses, I don't think this is an issue. This is mostly an issue for user queries where they might be trying to check if two inputs are the same.

The default epsilon argument will just be the base and nans equal will
also be the default.
@jsternberg jsternberg merged commit 0795f80 into master Jul 18, 2022
@jsternberg jsternberg deleted the feat/experimental-diff branch July 18, 2022 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

testing.diff can break non-test scripts
3 participants