Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 277: make toCaseFold idempotent #293

Closed
wants to merge 1 commit into from

Conversation

phadej
Copy link
Contributor

@phadej phadej commented Jul 16, 2020

This PR currently only adds a test for #227. Expectedly it fails (when you run locally, run with -p idemp -a 10000 to run just the new tests enough times to find a counterexample).

toCaseFold should be idempotent, but it doesn't seem to be.

foldMapping '\x10c7' s = Yield '\x2d27' (CC s '\x0000' '\x0000')
-- GEORGIAN CAPITAL LETTER AEN
foldMapping '\x10cd' s = Yield '\x2d2d' (CC s '\x0000' '\x0000')
-- NOT FOLDED TO toLower '\x13a0'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of where toLower doesn't work. '\x13a0 is a Cherokee Letter A which case folds to itself (i.e. stays uppercase!).

This file has also plenty of added letters with known name, i.e. in CaseFolding.txt. Those are capital letters for which there is folding to lowercase, but toLower doesn't always fold them. Adding them makes toCaseFold behave uniformly independently of base version.

@phadej phadej changed the title Issue 277 Issue 277: make toCaseFold idempotent Aug 25, 2020
- Add property and regression test that toCaseFold should be idempotent
- Add scripts/tests.sh to run tests with all GHCs.
  There are plenty of setup commands to pass.
- Rework CaseFolding.hs so it considers that toLower behaves
  differently with different GHCs, and therefore fallbacks to it
  only when it behaves consistently.
  For that purpose a helper `scripts/Dump.hs` is added.
  Also MurMurVariant is quick hash which works with all GHCs,
  it should notice if there are changes in the behaviour.

Note: `toLower`, `toUpper`, and `toTitle` would benefit from using
dumped database as well.  This commit is already quite big, so that is
left for a follow up.
@phadej
Copy link
Contributor Author

phadej commented Mar 15, 2021

I won't fix the conflicts

@phadej phadej closed this Mar 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant