Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-22979 Support inverse rule for [] span in RBNF #3326

Merged
merged 1 commit into from
Jan 15, 2025

Conversation

grhoten
Copy link
Member

@grhoten grhoten commented Jan 6, 2025

This feature enhancement only affects the documentation and RBNF syntax. Some changes were made to keep both the ICU4J and ICU4C implementations in sync. Some compiler warnings were also fixed.

I’d like to extend the RBNF syntax to support more complex grammar. I’d like to change the omission rule with square brackets. By default, everything between the square brackets are omitted when the remainder is 0. My proposal will not change this behavior by default, unless a “|” (pipe symbol) is present between the square brackets. You can think of it performing like an else statement. Everything between the beginning square bracket and the pipe acts as it currently does. Everything between the pipe symbol and the end square bracket will be used instead of omitting the text.

This behavior is important for supporting large ordinals in slavic languages. It’s convenient for other languages, like English.

The test case in the prototype and the ticket provides more examples of the change. Below is a simplified example of the new syntax. Right now, we have the following ordinals in English.

%%tieth:
0: tieth;
1: ty-=%spellout-ordinal=;
%spellout-ordinal:
...
20: twen>%%tieth>;
30: thir>%%tieth>;
40: for>%%tieth>;
50: fif>%%tieth>;

That could be simplified to the following rules instead.

%spellout-ordinal:
...
20: twent[y->>|ieth];
30: thirt[y->>|ieth];
40: fort[y->>|ieth];
50: fift[y->>|ieth];

The cardinal and ordinal rules will work on either side of the pipe symbol.

Checklist

  • Required: Issue filed: ICU-22979
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@grhoten grhoten requested a review from richgillam January 6, 2025 22:29
Comment on lines +1989 to +1999
"20: twent[y->>|ieth];\n" +
"30: thirt[y->>|ieth];\n" +
"40: fort[y->>|ieth];\n" +
"50: fift[y->>|ieth];\n" +
"60: sixt[y->>|ieth];\n" +
"70: sevent[y->>|ieth];\n" +
"80: eight[y->>|ieth];\n" +
"90: ninet[y->>|ieth];\n" +
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Languages that define a rule for 31 or private rules for 30 are candidates to use this rule syntax.

Comment on lines +1997 to +2004
"100: <%cardinal< [$(cardinal,one{hundred}other{hundreds})$ >>|$(cardinal,one{hundredth}other{hundredths})$];\n" +
"1000: <%cardinal< [$(cardinal,one{thousand}other{thousands})$ >>|$(cardinal,one{thousandth}other{thousandths})$];\n" +
"1000000: <%cardinal< [$(cardinal,one{million}other{millions})$ >>|$(cardinal,one{millionth}other{millionths})$];\n" +
"1000000000: <%cardinal< [$(cardinal,one{billion}other{billions})$ >>|$(cardinal,one{billionth}other{billionths})$];\n" +
"1000000000000: <%cardinal< [$(cardinal,one{trillion}other{trillions})$ >>|$(cardinal,one{trillionth}other{trillionths})$];\n" +
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This style of large ordinals is currently impossible to support when you have more than 2 cardinal states to consider. It's excessively tedious to repeatedly split and copy the rules. It's worse for languages that are highly inflectional with many grammatical cases.

Languages that use private rules like English's %%tieth and %%th ordinals rules are candidates to use this syntax. Those private rules only define 0 and 1.

Comment on lines +1873 to -1918
char ch;
while (start < descriptionLength) {
// seek to the first non-whitespace character...
// Seek to the first non-whitespace character...
// If the first non-whitespace character is semicolon, skip it and continue
while (start < descriptionLength
&& PatternProps.isWhiteSpace(description.charAt(start)))
&& (PatternProps.isWhiteSpace(ch = description.charAt(start)) || ch == ';'))
{
++start;
}

//if the first non-whitespace character is semicolon, skip it and continue
if (start < descriptionLength && description.charAt(start) == ';') {
start += 1;
continue;
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of this logic in stripWhitespace was out of sync between C++ and Java. Both sides have been brought in sync in this pull request.

richgillam
richgillam previously approved these changes Jan 7, 2025
Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great. A couple things you included as comments on this PR probably ought to be comments in the actual tests or code, and I had a couple small suggestions on your documentation, but the code looks great.

"1000000: <%cardinal< [$(cardinal,one{million}other{millions})$ >>|$(cardinal,one{millionth}other{millionths})$];\n" +
"1000000000: <%cardinal< [$(cardinal,one{billion}other{billions})$ >>|$(cardinal,one{billionth}other{billionths})$];\n" +
"1000000000000: <%cardinal< [$(cardinal,one{trillion}other{trillions})$ >>|$(cardinal,one{trillionth}other{trillionths})$];\n" +
"1000000000000000: =#,##0=$(ordinal,one{st}two{nd}few{rd}other{th})$;";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little unclear on the $ syntax. It looks like you're just using normal plural rules. Do the English plural rules define few as just 3, or you doing that somewhere else in here just for the purposes of the test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other languages need to use the cardinal form for large ordinals. English doesn’t define few. I just wanted to verify that the syntax is usable, and not add additional plural rule tests. I think that these new tests demonstrate that.

I could try to pick a difficult language to test, like Lithuanian. I didn’t have patience to write such rules, and I wanted something readable, reviewable, and quick to write.

Do you really want the masculine singular nominative Lithuanian ordinals written in a test? I was hoping to defer such work and use the Number Format Tester instead of hard coded in tests. As an alternative, I can give you just the thousands line for documentation purposes. That’s simpler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a small sample of Lithianian thousands of the positive degree nominative masculine non-pronomial ordinal would be the following:

1000: [tūkstantis >>|tūkstantas];
2000: <%spellout-cardinal-masculine< [$(cardinal,one{tūkstantis}few{tūkstančiai}other{tūkstančių})$ >>|$(cardinal,one{tūkstantas}few{tūkstanti}other{tūkstantų})$];

Of course, this sample is unvetted, but the structure is what is needed. I got the spellings from Wiktionary. See tūkstantis, tūkstantas, and pirmas for the inflection tables. This pull request is needed to support ordinals larger than 9,999 in Lithuanian. To fully support Lithuanian ordinals, I'd likely need to copy the structure of these 2 rules less than 162 times for only the ordinals. The current Lithianian ordinal rules are a little clunky.

The structure is pretty close to the English example in the test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you had is okay with me; I was just trying to understand it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a couple of sentences before these 2 tests to clarify to describe what's going on. Please check it out.

icu4c/source/i18n/nfrule.cpp Outdated Show resolved Hide resolved
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/i18n/nfrule.cpp is different
  • icu4c/source/i18n/unicode/rbnf.h is different
  • icu4c/source/test/intltest/itrbnf.cpp is different
  • icu4j/main/common_tests/src/test/java/com/ibm/icu/dev/test/format/RbnfTest.java is different
  • icu4j/main/core/src/main/java/com/ibm/icu/text/NFRule.java is different
  • icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedNumberFormat.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took another look. I like your updates to the documentation. No notes.

@grhoten grhoten merged commit 2d6a418 into unicode-org:main Jan 15, 2025
101 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants