Make the Dataset equality inequality messages better #68

MrPowers · 2020-03-31T10:58:03Z

Here's the current content inequality message:

I think it'd be better to align this output. It'd also be better to put "Actual Content | Expected Content" on a new line.

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
[info] Actual Content      | Expected Content
[info] [frank,44,us]       | [frank,44,us]
[info] [li,30,china]       | [li,30,china]
[info] [bob,1,uk]          | [bob,1,france]
[info] [camila,5,peru]     | [camila,5,peru]
[info] [maria,19,colombia] | [maria,19,colombia]
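As a rough illustration of the alignment proposed above, here is a minimal sketch that pads the left column to the width of its longest entry. `AlignedDiff` and `render` are hypothetical names, not part of spark-fast-tests, and rows are assumed to arrive already rendered as strings.

```scala
object AlignedDiff {
  // Hypothetical helper: renders two columns of row strings, padding the
  // left column so that every "|" separator lines up.
  def render(actual: Seq[String], expected: Seq[String]): String = {
    val header = "Actual Content"
    // Left column width: the longest actual row or the header, whichever is wider.
    val width = (header +: actual).map(_.length).max
    // zipAll keeps rows that exist on only one side, using "" for the missing cell.
    val lines = (header -> "Expected Content") +: actual.zipAll(expected, "", "")
    lines
      .map { case (a, e) => a + (" " * (width - a.length)) + " | " + e }
      .mkString("\n")
  }
}
```

Calling `AlignedDiff.render(Seq("[bob,1,uk]", "[maria,19,colombia]"), Seq("[bob,1,france]", "[maria,19,colombia]"))` should give the header plus two rows with the separators in the same column.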

It'd be really nice to suppress all the [info] prefixes, but I'm not sure if that's possible with Scalatest.

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content      | Expected Content
[frank,44,us]       | [frank,44,us]
[li,30,china]       | [li,30,china]
[bob,1,uk]          | [bob,1,france]
[camila,5,peru]     | [camila,5,peru]
[maria,19,colombia] | [maria,19,colombia]

Should we get rid of the square brackets for each row of data too?

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content    | Expected Content
frank,44,us       | frank,44,us
li,30,china       | li,30,china
bob,1,uk          | bob,1,france
camila,5,peru     | camila,5,peru
maria,19,colombia | maria,19,colombia


MrPowers · 2020-03-31T11:14:08Z

@carlsverre @nvander1 @gorros - Can you please take a look and provide thoughts on the best error message we can provide users for DataFrame inequality comparisons? Thanks!

MrPowers · 2020-03-31T11:22:44Z

See here for the utest output that doesn't have all the info warnings: #64

gorros · 2020-04-01T11:49:27Z

I agree.

carlsverre · 2020-04-01T22:42:00Z

I like this but I would add spaces between the values:

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content      | Expected Content
frank, 44, us       | frank, 44, us
li, 30, china       | li, 30, china

Also consider outputting strings wrapped in " characters so it's obvious which parts of the row are inside and outside the string. Consider the case where a string is empty or contains a comma.
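To make the quoting suggestion concrete, here is a small sketch, assuming a row is available as a `Seq[Any]`. `RowRender` is a hypothetical name used only for illustration, and the sketch does not escape quotes embedded in a string.

```scala
object RowRender {
  // Hypothetical helper: wraps String values in double quotes and joins
  // all values with ", " so field boundaries stay visible.
  def render(values: Seq[Any]): String =
    values.map {
      case s: String => "\"" + s + "\"" // a type pattern never matches null
      case null      => "null"
      case other     => other.toString
    }.mkString(", ")
}
```

With this, an empty string renders as `""` and a value such as `a,b` renders as `"a,b"`, so neither can be mistaken for a field separator.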

MrPowers · 2020-04-02T11:06:07Z

@gorros @carlsverre - Here's a PR to migrate spark-fast-tests back to Scalatest (it's currently using utest): #69

I think it'll be easier to develop the optimal Scalatest output if this repo is actually using Scalatest ;)

Let me know your thoughts!

MrPowers · 2020-04-03T09:25:04Z

Here's the current DataFrame comparison message:

Here's the new message (added in this PR):

@carlsverre @gorros @snithish - can you please take a look and let me know if this output looks better / you have any suggestions. Some specific points to note:

I changed the colors for the matching rows from Blue to DarkGray. Do you think that's better? Here's the list of color options.
I needed to prepend "Diffs\n" to get the message to output on a newline in Scalatest. "\n" worked for uTest, but not for Scalatest. I also tried hacking in the null character with "\u0000\n", but Scalatest ignored that too. So looks like we need some sort of real character.

carlsverre · 2020-04-09T22:03:51Z

Good catch - ScalaTest is almost certainly running trim on the string before printing it which will remove all leading/trailing whitespace. I guess a null byte is also considered part of that... If you can get blue to work I think that's better - dark grey can be set to be very similar to the shell bg color in some colorschemes. This looks good to me though - love the new format!
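The snippet below only shows how `java.lang.String.trim` behaves. Whether Scalatest actually calls `trim` on the message is the hypothesis from the comment above and is not verified here. `TrimDemo` is a hypothetical name.

```scala
object TrimDemo {
  // java.lang.String.trim strips leading and trailing characters whose
  // code point is <= U+0020, which includes the null character U+0000.
  val nullPrefixed: String = "\u0000\nActual | Expected".trim

  // A visible first line is not stripped, so the newline after it is kept.
  val labelPrefixed: String = "Diffs\nActual | Expected".trim
}
```

If the message is trimmed this way, it would explain why the "\u0000\n" prefix disappears while a prefix such as "Diffs\n" keeps the table on its own line.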

khampson · 2020-04-24T02:28:47Z

@MrPowers : Re: brackets around the values, I would recommend keeping those, as it helps to avoid subtle issues around spaces and tabs and such that can affect inequality but be hidden without such delimiters, e.g.

[foo, bar] vs. [foo, bar ]

I agree the alignment is definitely a plus.

I agree with @carlsverre that blue would be better than dark grey in terms of colors for equality.

mikenac · 2022-06-02T13:16:51Z

I would love to see something that shows what column values are different. This is especially important for larger data frames that may have 50 columns.
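One possible shape for this, sketched under the assumption that both rows are available as `Seq[Any]` alongside the column names. `ColumnDiff` and the column names in the usage note are hypothetical, not taken from the library.

```scala
object ColumnDiff {
  // Hypothetical helper: returns the names of the columns whose values
  // differ between two rows. Columns beyond the shorter input are ignored.
  def differingColumns(
      columns: Seq[String],
      actual: Seq[Any],
      expected: Seq[Any]
  ): Seq[String] =
    columns.zip(actual.zip(expected)).collect {
      case (name, (a, e)) if a != e => name
    }
}
```

For the `bob` row from the example at the top of this issue, assuming the columns are named `name`, `age`, and `country`, this would report only `country`.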

zeotuan · 2024-10-16T11:45:53Z

@mikenac actually, this might be a good idea. I will close this issue since we have already made the improvements to the Dataset equality/inequality messages, and I'll keep track of your idea in #170.

zeotuan closed this as completed Oct 16, 2024
