feat(stdlib): add a proper diff implementation to the experimental package #4992

jsternberg · 2022-07-13T17:51:37Z

This adds a proper diff transformation that computes a longest common
subsequence table and then uses that to output a diff. This is in
contrast to the existing diff in testing.diff that both integrates
with our testing subsystem (by using yields) and also does a naive
algorithm where it just checks if A[X] = B[X] instead of trying to
determine sequences that are in common.

The limitations of this implementation are as follows:

Does not presently restrict memory usage for the LCS table. It should
use the memory allocator to allocate the table but it constructs a
matrix from Go slices.
Uses a recursive algorithm to trace the LCS table instead of a
sequential algorithm. This could cause large tables to recurse too
much.
Does not print only the context of a change. If it finds a difference
in the table, it prints the entire table.

Related to #4815.

Fixes #4926.

Done checklist

docs/SPEC.md updated
Test cases written

sanderson

A few doc updates.

stdlib/experimental/experimental.flux

jsternberg · 2022-07-13T19:07:12Z

@sanderson applied all of the suggestions.

sanderson

👍🏻

…ckage This adds a proper diff transformation that computes a longest common subsequence table and then uses that to output a diff. This is in contrast to the existing diff in `testing.diff` that both integrates with our testing subsystem (by using yields) and also does a naive algorithm where it just checks if `A[X] = B[X]` instead of trying to determine sequences that are in common. The limitations of this implementation are as follows: * Does not presently restrict memory usage for the LCS table. It should use the memory allocator to allocate the table but it constructs a matrix from Go slices. * Uses a recursive algorithm to trace the LCS table instead of a sequential algorithm. This could cause large tables to recurse too much. * Does not print only the context of a change. If it finds a difference in the table, it prints the entire table.

wolffcm

I had some questions, but I like this a lot.

I think for the data volumes that we typically use for diff, this will work fine. If we want to optimize for scale at a later time, we can.

wolffcm · 2022-07-14T17:25:54Z

stdlib/experimental/experimental.flux

+// - want: Input stream for the `-` side of the diff.
+// - got: Input stream for the `+` side of the diff.
+// - epsilon: Specify how far apart two float values can be, but still be considered equal. Default is `0.000000001`.
+// - nansEqual: Consider `NaN` float values as equal. Default is `false`.


Is this parameter necessary? I can't think of a case where one is using diff and wants NaNs to be considered different. If it's needed for some reason, true would seem like a better default to me.

I'd personally say no. I'd even remove epsilon and just consider some default.

My personal reason for removing epsilon is that future optimization of this probably depends on hashing the inputs and doing hash comparisons instead of the current column comparisons. Hash comparisons work for everything except floats where epsilon has to be considered. If we expose an epsilon option, we also have to ensure that whatever we do in the future works with epsilon, but if epsilon is just kind of hidden, we can potentially do something equivalent that doesn't involve a specific implementation.

Now I don't know if that exists. I just know that adding epsilon as a parameter leaks an implementation detail.

If we're fine with that, I'll remove both of these or I can just remove one.

I wonder if it makes sense to expose epsilon as an option in the math package?

It seems like you would get a lot of unintended diffs if you are comparing floats bit-for-bit? Floats can get really weird. I'm okay with removing epsilon for now though.

I wonder if there's a way to "canonicalize" float values such that very close values would always have the same bit pattern?

I was thinking that using fmt.Sprintf("%.3f", f) or something might work? It would output the float up to three decimal places which is kind of the equivalent of what epsilon is supposed to do. I'm not sure if that works though and I don't really want to specify how a solution would work. Bit by bit definitely doesn't work though.

I think that's part of my reason for wanting to remove epsilon. I think the above wouldn't work if we had epsilon as a parameter because I could theoretically put any value there as the epsilon and it might not be possible with the above method. Once we figure it out, we can add appropriate parameters if needed.

I heard somewhere that some Intel processors use 80-bit registers to do math, but still store them as 64-bit values, so the compiler generating a store and a load might result in two values comparing as different, even if they would otherwise be identical. If this is the case, we'd be entirely at the mercy of the foibles of the compiler and architecture---seems really hard to work with portably?

wolffcm · 2022-07-14T18:25:25Z

stdlib/experimental/diff_test.flux

+    experimental.diff(want, got)
+        |> rename(columns: {_diff: "diff"})
+        |> testing.diff(want: exp)
+}


You have some code to consolidate multiple chunks into one, would it make sense to have some tests for that here?

Good idea. This seems like something that should get added to internal/debug and we can selectively use it in tests.

wolffcm · 2022-07-14T18:48:56Z

stdlib/experimental/diff.go

+			schema.wantIdx = append(schema.wantIdx, -1)
+			schema.gotIdx = append(schema.gotIdx, i)
+		}
+	}


According to the .flux file, the type of this transformation is stream[A] for both sides---is that intentional?

I think so? Both of the inputs should be equivalent types, but we still have to handle unknown variables. I'm open to having these two be different type signatures though.

We might run afoul of the cases where the data is the same but that types technically differ due to our implementation of extensible records... We could loosen the type later if it becomes a problem though.

wolffcm · 2022-07-14T19:00:12Z

stdlib/experimental/diff.go

+// is also 1 indexed instead of zero where zero is the base case. That means the first
+// element in the table chunks corresponds to index 1 in this table.
+//
+// See the example in the wikipedia article for details.


So is the memory used for this O(n*m) where n and m is the number of rows on the want and got tables?

Yes although it's O(n+1*m+1) because of the extra row/column for the base case.

jsternberg · 2022-07-14T19:48:57Z

I think my only concern with the memory allocations is that they are presently untracked memory allocations so the memory allocator won't abort the query if it's using too much memory. For our uses, I don't think this is an issue. This is mostly an issue for user queries where they might be trying to check if two inputs are the same.

The default epsilon argument will just be the base and nans equal will also be the default.

jsternberg requested review from a team as code owners July 13, 2022 17:51

jsternberg requested review from nathanielc and sanderson and removed request for a team July 13, 2022 17:51

jsternberg force-pushed the feat/experimental-diff branch 6 times, most recently from fe2a33c to 2cfa533 Compare July 13, 2022 19:01

sanderson requested changes Jul 13, 2022

View reviewed changes

jsternberg force-pushed the feat/experimental-diff branch from 4f11569 to b622470 Compare July 13, 2022 19:06

sanderson approved these changes Jul 13, 2022

View reviewed changes

jsternberg force-pushed the feat/experimental-diff branch from b622470 to 800c534 Compare July 13, 2022 20:26

jsternberg requested review from wolffcm and removed request for nathanielc July 14, 2022 14:41

wolffcm reviewed Jul 14, 2022

View reviewed changes

refactor: remove nans equal and epsilon as parameters

64ebf3d

The default epsilon argument will just be the base and nans equal will also be the default.

wolffcm approved these changes Jul 18, 2022

View reviewed changes

jsternberg merged commit 0795f80 into master Jul 18, 2022

jsternberg deleted the feat/experimental-diff branch July 18, 2022 15:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(stdlib): add a proper diff implementation to the experimental package #4992

feat(stdlib): add a proper diff implementation to the experimental package #4992

jsternberg commented Jul 13, 2022 •

edited

Loading

sanderson left a comment

jsternberg commented Jul 13, 2022

sanderson left a comment

wolffcm left a comment

wolffcm Jul 14, 2022

jsternberg Jul 14, 2022

wolffcm Jul 15, 2022

wolffcm Jul 15, 2022

jsternberg Jul 15, 2022

wolffcm Jul 15, 2022

wolffcm Jul 14, 2022

jsternberg Jul 14, 2022

wolffcm Jul 14, 2022

jsternberg Jul 14, 2022

wolffcm Jul 15, 2022

wolffcm Jul 14, 2022

jsternberg Jul 14, 2022

jsternberg commented Jul 14, 2022

feat(stdlib): add a proper diff implementation to the experimental package #4992

feat(stdlib): add a proper diff implementation to the experimental package #4992

Conversation

jsternberg commented Jul 13, 2022 • edited Loading

Done checklist

sanderson left a comment

Choose a reason for hiding this comment

jsternberg commented Jul 13, 2022

sanderson left a comment

Choose a reason for hiding this comment

wolffcm left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jsternberg commented Jul 14, 2022

jsternberg commented Jul 13, 2022 •

edited

Loading