Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the Dataset equality inequality messages better #68

Closed
MrPowers opened this issue Mar 31, 2020 · 10 comments
Closed

Make the Dataset equality inequality messages better #68

MrPowers opened this issue Mar 31, 2020 · 10 comments

Comments

@MrPowers
Copy link
Collaborator

MrPowers commented Mar 31, 2020

Here's the current content inequality message:

Screen Shot 2020-03-31 at 5 33 18 AM

I think it'd be better to align this output. It'd also be better to put "Actual Content | Expected Content" on a newline.

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
[info] Actual Content      | Expected Content
[info] [frank,44,us]       | [frank,44,us]
[info] [li,30,china]       | [li,30,china]
[info] [bob,1,uk]          | [bob,1,france]
[info] [camila,5,peru]     | [camila,5,peru]
[info] [maria,19,colombia] | [maria,19,colombia]

It'd be really nice to suppress all the info warnings, but not sure if that's possible with Scalatest.

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content      | Expected Content
[frank,44,us]       | [frank,44,us]
[li,30,china]       | [li,30,china]
[bob,1,uk]          | [bob,1,france]
[camila,5,peru]     | [camila,5,peru]
[maria,19,colombia] | [maria,19,colombia]

Should we get rid of the square brackets for each row of data too?

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content    | Expected Content
frank,44,us       | frank,44,us
li,30,china       | li,30,china
bob,1,uk          | bob,1,france
camila,5,peru     | camila,5,peru
maria,19,colombia | maria,19,colombia
@MrPowers
Copy link
Collaborator Author

@carlsverre @nvander1 @gorros - Can you please take a look and provide thoughts on the best error message we can provide users for DataFrame inequality comparisons? Thanks!

@MrPowers
Copy link
Collaborator Author

MrPowers commented Mar 31, 2020

See here for the utest output that doesn't have all the info warnings: #64

@gorros
Copy link

gorros commented Apr 1, 2020

I agree.

@carlsverre
Copy link
Contributor

carlsverre commented Apr 1, 2020

I like this but I would add spaces between the values:

[info] com.github.mrpowers.spark.fast.tests.DatasetContentMismatch:
Actual Content      | Expected Content
frank, 44, us       | frank, 44, us
li, 30, china       | li, 30, china

Also consider outputting strings wrapped with " characters so it's obvious which parts of the row is inside and outside the string. Consider the case that a string is empty or contains a comma

@MrPowers
Copy link
Collaborator Author

MrPowers commented Apr 2, 2020

@gorros @carlsverre - Here's a PR to migrate spark-fast-tests back to Scalatest (it's currently using utest): #69

I think it'll be easier to develop the optimal Scalatest output if this repo is actually using Scalatest ;)

Let me know your thoughts!

@MrPowers
Copy link
Collaborator Author

MrPowers commented Apr 3, 2020

Here's the current DataFrame comparison message:

Screen Shot 2020-04-02 at 5 03 56 PM

Here's the new message (added in this PR):

Screen Shot 2020-04-02 at 5 05 03 PM

@carlsverre @gorros @snithish - can you please take a look and let me know if this output looks better / you have any suggestions. Some specific points to note:

  • I changed the colors for the matching rows from Blue to DarkGray. Do you think that's better? Here's the list of color options.
  • I needed to prepend "Diffs\n" to get the message to output on a newline in Scalatest. "\n" worked for uTest, but not for Scalatest. I also tried hacking in the null character with "\u0000\n", but Scalatest ignored that too. So looks like we need some sort of real character.

@carlsverre
Copy link
Contributor

Good catch - ScalaTest is almost certainly running trim on the string before printing it which will remove all leading/trailing whitespace. I guess a null byte is also considered part of that... If you can get blue to work I think that's better - dark grey can be set to be very similar to the shell bg color in some colorschemes. This looks good to me though - love the new format!

@khampson
Copy link

khampson commented Apr 24, 2020

@MrPowers : Re: brackets around the values, I would recommend keeping those, as it helps to avoid subtle issues around spaces and tabs and such that can affect inequality but be hidden without such delimiters, e.g.

[foo, bar] vs. [foo, bar ]

I agree the alignment is definitely a plus.

I agree with @carlsverre that blue would be better than dark grey in terms of colors for equality.

@mikenac
Copy link

mikenac commented Jun 2, 2022

I would love to see something that shows what column values are different. This is especially important for larger data frames that may have 50 columns.

@zeotuan
Copy link
Collaborator

zeotuan commented Oct 16, 2024

@mikenac actually this might be a good idea. I will close this issue since we have already Done the improvement on Dataset equality/inequality messages. And keep track of your idea on #170

@zeotuan zeotuan closed this as completed Oct 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants