feat(stdlib): add a proper diff implementation to the experimental package #4992
Conversation
Force-pushed from fe2a33c to 2cfa533.
A few doc updates.
Force-pushed from 4f11569 to b622470.
@sanderson applied all of the suggestions.
👍🏻
feat(stdlib): add a proper diff implementation to the experimental package

This adds a proper diff transformation that computes a longest common subsequence table and then uses that to output a diff. This is in contrast to the existing diff in `testing.diff` that both integrates with our testing subsystem (by using yields) and also does a naive algorithm where it just checks if `A[X] = B[X]` instead of trying to determine sequences that are in common.

The limitations of this implementation are as follows:

* Does not presently restrict memory usage for the LCS table. It should use the memory allocator to allocate the table, but it constructs a matrix from Go slices.
* Uses a recursive algorithm to trace the LCS table instead of a sequential algorithm. This could cause large tables to recurse too much.
* Does not print only the context of a change. If it finds a difference in the table, it prints the entire table.
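For readers unfamiliar with the technique, here is a rough, self-contained Go sketch of an LCS-based diff over string slices. It mirrors the approach described above (build the table, then trace it recursively), but it is only an illustration under those assumptions, not the transformation's actual code, which operates on Flux table chunks rather than strings.

```go
package main

import "fmt"

// lcsTable builds the (len(a)+1) x (len(b)+1) longest-common-subsequence
// table. Row 0 and column 0 are the base case (empty prefixes).
func lcsTable(a, b []string) [][]int {
	table := make([][]int, len(a)+1)
	for i := range table {
		table[i] = make([]int, len(b)+1)
	}
	for i := 1; i <= len(a); i++ {
		for j := 1; j <= len(b); j++ {
			if a[i-1] == b[j-1] {
				table[i][j] = table[i-1][j-1] + 1
			} else if table[i-1][j] >= table[i][j-1] {
				table[i][j] = table[i-1][j]
			} else {
				table[i][j] = table[i][j-1]
			}
		}
	}
	return table
}

// printDiff traces the table recursively (the recursive trace mentioned in
// the limitations above) and prints removals as "-" and additions as "+".
func printDiff(table [][]int, a, b []string, i, j int) {
	switch {
	case i > 0 && j > 0 && a[i-1] == b[j-1]:
		printDiff(table, a, b, i-1, j-1)
		fmt.Println("  " + a[i-1])
	case j > 0 && (i == 0 || table[i][j-1] >= table[i-1][j]):
		printDiff(table, a, b, i, j-1)
		fmt.Println("+ " + b[j-1])
	case i > 0 && (j == 0 || table[i][j-1] < table[i-1][j]):
		printDiff(table, a, b, i-1, j)
		fmt.Println("- " + a[i-1])
	}
}

func main() {
	want := []string{"a", "b", "c", "d"}
	got := []string{"a", "c", "d", "e"}
	printDiff(lcsTable(want, got), want, got, len(want), len(got))
}
```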
Force-pushed from b622470 to 800c534.
I had some questions, but I like this a lot.
I think for the data volumes that we typically use for diff, this will work fine. If we want to optimize for scale at a later time, we can.
// - want: Input stream for the `-` side of the diff.
// - got: Input stream for the `+` side of the diff.
// - epsilon: Specify how far apart two float values can be, but still be considered equal. Default is `0.000000001`.
// - nansEqual: Consider `NaN` float values as equal. Default is `false`.
Is this parameter necessary? I can't think of a case where one is using diff and wants NaNs to be considered different. If it's needed for some reason, true would seem like a better default to me.
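For background on why an option like `nansEqual` exists at all (this note is mine, not from the PR discussion): IEEE 754 defines NaN as unequal to itself, so a plain equality check reports a difference even when both sides hold NaN.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	a, b := math.NaN(), math.NaN()
	// Plain equality: NaN != NaN, so a naive comparison flags a difference.
	fmt.Println(a == b) // false
	// Treating NaNs as equal requires an explicit check.
	fmt.Println(math.IsNaN(a) && math.IsNaN(b)) // true
}
```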
I'd personally say no. I'd even remove epsilon and just consider some default.
My personal reason for removing epsilon is that future optimization of this probably depends on hashing the inputs and doing hash comparisons instead of the current column comparisons. Hash comparisons work for everything except floats where epsilon has to be considered. If we expose an epsilon option, we also have to ensure that whatever we do in the future works with epsilon, but if epsilon is just kind of hidden, we can potentially do something equivalent that doesn't involve a specific implementation.
Now I don't know if that exists. I just know that adding epsilon as a parameter leaks an implementation detail.
If we're fine with that, I'll remove both of these or I can just remove one.
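To illustrate the tension being described, here is a minimal sketch (my illustration, not the transformation's comparison code): any bit-level or hash-based equality scheme sees two distinct float values where an epsilon comparison would call them equal, which is why an exposed epsilon parameter constrains that kind of optimization.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Two floats that an epsilon comparison (epsilon = 1e-9) treats as equal.
	a, b := 0.1+0.2, 0.3
	fmt.Println(math.Abs(a-b) < 1e-9) // true

	// Any scheme that hashes or compares the raw bits sees two distinct values.
	fmt.Println(math.Float64bits(a) == math.Float64bits(b)) // false
}
```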
I wonder if it makes sense to expose epsilon as an option in the math package?
It seems like you would get a lot of unintended diffs if you are comparing floats bit-for-bit? Floats can get really weird. I'm okay with removing epsilon for now though.
I wonder if there's a way to "canonicalize" float values such that very close values would always have the same bit pattern?
I was thinking that using fmt.Sprintf("%.3f", f) or something might work? It would output the float up to three decimal places, which is roughly the equivalent of what epsilon is supposed to do. I'm not sure if that works though, and I don't really want to specify how a solution would work. Bit by bit definitely doesn't work though.

I think that's part of my reason for wanting to remove epsilon. I think the above wouldn't work if we had epsilon as a parameter, because I could theoretically put any value there as the epsilon and it might not be possible with the above method. Once we figure it out, we can add appropriate parameters if needed.
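A minimal sketch of that canonicalization idea (my illustration, not what the PR implements): formatting to a fixed precision maps nearby floats to the same string, which can then be hashed or compared directly.

```go
package main

import "fmt"

// canonicalize renders a float at a fixed precision so that sufficiently
// close values collapse to the same representation.
func canonicalize(f float64) string {
	return fmt.Sprintf("%.3f", f)
}

func main() {
	a, b := 1.0001, 1.0004
	fmt.Println(a == b)                             // false: distinct values
	fmt.Println(canonicalize(a) == canonicalize(b)) // true: both "1.000"
}
```

One caveat, which may be part of the uncertainty voiced above: values that straddle a rounding boundary (0.9994 and 0.9996, say) still canonicalize differently even though they are closer together than the implied tolerance, so this is not a drop-in replacement for an arbitrary epsilon.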
I heard somewhere that some Intel processors use 80-bit registers to do math but still store the results as 64-bit values, so the compiler generating a store and a load might result in two values comparing as different, even if they would otherwise be identical. If this is the case, we'd be entirely at the mercy of the foibles of the compiler and architecture, which seems really hard to work with portably?
experimental.diff(want, got)
    |> rename(columns: {_diff: "diff"})
    |> testing.diff(want: exp)
}
You have some code to consolidate multiple chunks into one. Would it make sense to have some tests for that here?
Good idea. This seems like something that should get added to internal/debug and we can selectively use it in tests.
		schema.wantIdx = append(schema.wantIdx, -1)
		schema.gotIdx = append(schema.gotIdx, i)
	}
}
According to the .flux file, the type of this transformation is stream[A] for both sides. Is that intentional?
I think so? Both of the inputs should be equivalent types, but we still have to handle unknown variables. I'm open to having these two be different type signatures though.
We might run afoul of the cases where the data is the same but the types technically differ due to our implementation of extensible records... We could loosen the type later if it becomes a problem though.
// is also 1-indexed instead of zero, where zero is the base case. That means the first
// element in the table chunks corresponds to index 1 in this table.
//
// See the example in the Wikipedia article for details.
So is the memory used for this O(n*m), where n and m are the numbers of rows in the want and got tables?
Yes, although it's O((n+1)*(m+1)) because of the extra row/column for the base case.

I think my only concern with the memory allocations is that they are presently untracked, so the memory allocator won't abort the query if it's using too much memory. For our uses, I don't think this is an issue. This is mostly an issue for user queries where they might be trying to check if two inputs are the same.
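For illustration only, here is a hypothetical sketch of what tracked allocation for the table could look like. The trackedAllocator type is invented for this example and is not the Flux memory allocator API; the point is simply that accounting for the O((n+1)*(m+1)) cells up front lets an oversized diff abort instead of growing untracked memory.

```go
package main

import "fmt"

// trackedAllocator is a stand-in for a query memory allocator: it accounts
// for bytes before they are used and fails once a limit would be exceeded.
// (Hypothetical type for illustration; not the Flux allocator API.)
type trackedAllocator struct {
	limit, used int64
}

func (a *trackedAllocator) account(n int64) error {
	if a.used+n > a.limit {
		return fmt.Errorf("memory limit exceeded: %d + %d > %d", a.used, n, a.limit)
	}
	a.used += n
	return nil
}

// newLCSTable allocates the (n+1) x (m+1) table only after accounting for it.
func newLCSTable(alloc *trackedAllocator, n, m int) ([][]int32, error) {
	cells := int64(n+1) * int64(m+1)
	if err := alloc.account(cells * 4); err != nil { // 4 bytes per int32 cell
		return nil, err
	}
	table := make([][]int32, n+1)
	for i := range table {
		table[i] = make([]int32, m+1)
	}
	return table, nil
}

func main() {
	alloc := &trackedAllocator{limit: 1 << 20} // 1 MiB budget
	if _, err := newLCSTable(alloc, 10000, 10000); err != nil {
		fmt.Println(err) // the roughly 400 MB table is rejected up front
	}
}
```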
The default epsilon argument will just be the base value, and nansEqual will also keep its default.
This adds a proper diff transformation that computes a longest common subsequence table and then uses that to output a diff. This is in contrast to the existing diff in `testing.diff` that both integrates with our testing subsystem (by using yields) and also does a naive algorithm where it just checks if `A[X] = B[X]` instead of trying to determine sequences that are in common.

The limitations of this implementation are as follows:

* Does not presently restrict memory usage for the LCS table. It should use the memory allocator to allocate the table, but it constructs a matrix from Go slices.
* Uses a recursive algorithm to trace the LCS table instead of a sequential algorithm. This could cause large tables to recurse too much.
* Does not print only the context of a change. If it finds a difference in the table, it prints the entire table.
Related to #4815.
Fixes #4926.