Intl.breakIterator #60

Closed
caridy opened this issue Dec 15, 2015 · 18 comments · Fixed by #553
Labels
c: text Component: case mapping, collation, properties Proposal Larger change requiring a proposal s: in progress Status: the issue has an active proposal

Comments

@caridy
Contributor

caridy commented Dec 15, 2015

Standardize Intl.v8BreakIterator.

Backpointers:

Update 1 (Sept 26th, 2016):

@jungshik

/cc @jungshik, @littledan

@srl295
Member

srl295 commented Jan 26, 2016

use cases:
* rendering (Canvas etc… Thai…)
* console rendering (word wrap!)
* counting words/lines/sentences
* Translation tooling…
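
For the counting use case, a minimal sketch using `Intl.Segmenter`, the API this thread later names as the eventual outcome of the proposal. The helper names `countWords` and `countSentences` are illustrative only, and the results depend on the locale data shipped with the engine.

```javascript
// Count word-like segments (punctuation and whitespace are skipped).
function countWords(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  let count = 0;
  for (const { isWordLike } of segmenter.segment(text)) {
    if (isWordLike) count += 1;
  }
  return count;
}

// Count sentence segments as reported by the engine's sentence rules.
function countSentences(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

console.log(countWords('Hello, world!'));
console.log(countSentences('Hi there. How are you?'));
```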

@caridy caridy added this to the 4th Edition milestone Feb 29, 2016
@caridy
Contributor Author

caridy commented Feb 29, 2016

@littledan will champion this one.

@littledan
Member

I think we'd probably want a somewhat different API for this compared to what V8 currently ships, if it's not too late backwards compatibility-wise. The current API looks like this (more docs here: https://code.google.com/p/v8-i18n/wiki/BreakIterator):

  • Create a new break iterator with var instance = new Intl.v8BreakIterator(options)
  • Set the text with instance.adoptText()
  • Get information about the current iteration with instance.current() (for the index) and instance.breakType() (for a string representing the type, probably a CLDR thing)
  • Go to the next place with instance.next(), which returns the new current index.
  • Start at the beginning of the string again with instance.first().
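
As a rough illustration of the five-method flow listed above, the sketch below walks any object that exposes `adoptText`, `first`, `next`, `current` and `breakType`. The helper name `collectBreaks` is made up for this example, and it assumes `next()` returns -1 once the end of the text has been passed. The two-argument constructor call `(locales, options)` in the usage part is also an assumption about what V8 ships, so it is guarded by feature detection.

```javascript
// Collect every break reported by a v8BreakIterator-style object.
// Assumption: next() returns -1 after the last break.
function collectBreaks(iterator, text) {
  iterator.adoptText(text);
  const breaks = [];
  let index = iterator.first();
  while (index !== -1) {
    breaks.push({ index: iterator.current(), breakType: iterator.breakType() });
    index = iterator.next();
  }
  return breaks;
}

// Only runs where the nonstandard V8 class is present.
if (typeof Intl.v8BreakIterator === 'function') {
  const iterator = new Intl.v8BreakIterator(['en'], { type: 'word' });
  console.log(collectBreaks(iterator, 'Hello world'));
}
```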

I think a more ES2015-y way to do it would be to have a method instance.breakText("my string") on the instance which returns an iterator over the breaks in the string. Each item would be an object like {index: 1, breakType: "letter"}. To put the cherry on top, we should probably make the sole method breakText not be a bound function, unlike the current five-method API of bound functions, if this is the general strategy for new APIs.
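
To make the proposed shape concrete, here is a minimal sketch of a `breakText` method that returns an iterator of `{index, breakType}` objects. The class name `SketchBreakIterator` is hypothetical, and the whitespace rule is only a stand-in for real locale-aware segmentation; the point is the API shape, not the breaking rules.

```javascript
class SketchBreakIterator {
  // A prototype generator method, not a per-instance bound function.
  *breakText(text) {
    // Stand-in rule: a break falls at the end of every run of
    // whitespace or non-whitespace characters.
    const runs = /\s+|\S+/g;
    let match;
    while ((match = runs.exec(text)) !== null) {
      yield {
        index: match.index + match[0].length,
        breakType: /\S/.test(match[0]) ? 'letter' : 'none',
      };
    }
  }
}

const breaker = new SketchBreakIterator();
for (const item of breaker.breakText('my string')) {
  console.log(item.index, item.breakType);
}
```

Because each call to `breakText` creates its own iterator state, one instance can walk several strings at the same time, at the cost of one object allocation per break.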

A possible downside is that this could have worse performance (for the object allocation, and also for accounting for the case where multiple strings are being iterated over by the same instance at the same time), but I don't think this proposal would introduce further implications for a high-performance implementation compared to lots of other ES2015 features. It would also mean making a brand new iterator in place of first--would this be very bad performance-wise?

What do you all think of this general API shape?

The first step towards this will be unshipping Intl.v8BreakIterator in V8, as the standardized version will likely be incompatible. Current usage is low, but nonzero, so we'll see how this goes. If there are a lot of complaints, then maybe I'll want to argue for sticking to the current API; or maybe the complaining users would be happy to hear that if they are OK with the new API, then they'll get the support in more browsers.

I don't think I'll be able to write up a proposal for the March TC39 meeting unfortunately.

@littledan
Member

I ended up deciding against unshipping v8BreakIterator in V8 when I unshipped several other nonstandard features (which all had much lower usage counts).

@littledan
Member

I wrote up a quick explainer doc explaining the motivation and a strawman API shape. It seems reasonable to me for this to include both line breaking and grapheme/word/sentence segmentation. Maybe hyphenation could go into the same API, just with a different type "hyphen" rather than an entirely different class (as I imagine the API would be similar).

Does anyone have any thoughts? I'm interested in both web developers and implementers.

@mathiasbynens
Member

The proposed API in https://github.com/littledan/BreakIterator#example looks great! I’m in favor of overloading the type to include 'hyphen' provided the API can remain similar.

@sebmarkbage

sebmarkbage commented Aug 26, 2016

I'm very worried about the performance of this API because the use of this API over native methods is going to be performance critical enough anyway. Additionally, anyone compiling native layout code to asm.js or wasm is going to want the lowest level possible access to that. I've seen nothing to indicate that iterables and the allocations they require can be optimized away in existing engines. Can you even iterate over a significant document without causing multiple young generation GCs? I'd like to see something to suggest that perf concerns are unfounded before moving forward with the alternative design. Otherwise I fear we'll have to use a polyfill anyway.

EDIT: I suppose supporting both would be an ok tradeoff if iterables aren't fast enough yet. Similar to how other iterable APIs have alternative iteration APIs.

The hyphenation API should be different. Unlike line breaks it is often possible to find a hyphenation point in the middle of a string without iterating through all of the possible ones. Using the iterator API would be very inefficient.

The way you do text-layout hyphenation is by first measuring the unhyphenated word, and only then finding the closest point to hyphenate if it is too long, which will give you a single direct value.
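
A rough sketch of that measure-first flow, under stated assumptions: `measure` and `closestPointAtOrBefore` are hypothetical callbacks standing in for the layout engine's text measurement and a direct hyphenation lookup, the measured width is assumed to grow with the prefix length, and the hyphenation points in the usage part are illustrative rather than dictionary data.

```javascript
// Returns { head, tail } for the word, or null if no hyphenation point fits.
function hyphenateToFit(word, maxWidth, measure, closestPointAtOrBefore) {
  // Step 1: measure the unhyphenated word; most words need nothing more.
  if (measure(word) <= maxWidth) return { head: word, tail: '' };

  // Step 2: binary search for the longest prefix that fits with a hyphen.
  let lo = 0;
  let hi = word.length - 1;
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2);
    if (measure(word.slice(0, mid) + '-') <= maxWidth) lo = mid;
    else hi = mid - 1;
  }

  // Step 3: ask directly for the single closest point, without
  // iterating over every possible hyphenation point.
  const point = closestPointAtOrBefore(word, lo);
  if (point <= 0) return null;
  return { head: word.slice(0, point) + '-', tail: word.slice(point) };
}

// Toy stand-ins: monospace measurement and made-up points for one word.
const measure = (text) => text.length;
const toyPoints = { hyphenation: [2, 6] };
const closestPointAtOrBefore = (word, offset) => {
  const candidates = (toyPoints[word] || []).filter((p) => p <= offset);
  return candidates.length > 0 ? candidates[candidates.length - 1] : -1;
};

console.log(hyphenateToFit('hyphenation', 8, measure, closestPointAtOrBefore));
```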

IMO we can just look at what browsers already do rather than trying to be clever. They're designed that way for a reason.

@jungshik

I'd rather not include 'hyphen' in the proposed API.

In addition to what @sebmarkbage wrote, hyphenation can change the input in some languages (e.g. German).

@littledan
Member

@sebmarkbage To the performance concern: What if %SegmentIterator% had an additional "low-level API" with three methods, advance (to imperatively move to the next match, returning undefined; the user could tell if they are at the end by observing the index getting too high, or maybe this could return true at the end), index and breakType (to get properties of the current breakpoint)?
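
One way to picture that low-level interface is the sketch below. The class name `SketchSegmentCursor` is hypothetical, the whitespace rule is a stand-in for real segmentation, and it picks the "return true at the end" variant of `advance` mentioned above. No object is allocated per step; the caller reads `index` and `breakType` off the same cursor.

```javascript
class SketchSegmentCursor {
  constructor(text) {
    this._text = text;
    this._index = 0;
    this._breakType = 'none';
  }

  get index() { return this._index; }
  get breakType() { return this._breakType; }

  // Moves to the next break in place; returns true once the end is reached.
  advance() {
    const text = this._text;
    if (this._index >= text.length) return true;
    const inSpace = /\s/.test(text[this._index]);
    let i = this._index + 1;
    while (i < text.length && /\s/.test(text[i]) === inSpace) i += 1;
    this._index = i;
    this._breakType = inSpace ? 'none' : 'letter';
    return i >= text.length;
  }
}

const cursor = new SketchSegmentCursor('my string');
let done = false;
do {
  done = cursor.advance();
  console.log(cursor.index, cursor.breakType);
} while (!done);
```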

@littledan
Member

I updated the explainer with the low-level segmentation interface, though I won't be surprised if we get pushback for this. I assume it's OK to do an allocation when adopting a different piece of text to perform segmentation over, right?

@sebmarkbage

Short pieces of text are likely to be combined into a single string rope often.

I'm not as concerned about those allocations since new pieces of text are often associated with allocations anyway. The allocations are proportional. E.g. you might have <span>a</span> <em>lot</em> of small <strong>segments</strong> and iterate through them independently but the number of allocations is proportional to the allocations you do for the data structures holding them anyway.

@SebastianZ

> In addition to what @sebmarkbage wrote, hyphenation can change the input in some languages (e.g. German).

As far as I know, the only case in German to which this applied is hyphenation between c and k, which turned the c into another k. E.g. 'Zucker' became 'Zuk-ker'. This rule no longer applies since the orthography reform from 1996.

So, at least in German there is no such issue anymore, though I have no idea if other languages still have similar rules.

Sebastian

@sebmarkbage

@SebastianZ there are a few other cases mentioned here http://www.unicode.org/L2/L2002/02279-muller.htm#4 for example in Swedish "tuggummi" becomes "tugg-gummi".

I think it is fairly rare to handle these special cases correctly but it'd be good for the API to handle it.
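
To illustrate what handling these cases could involve, here is a sketch of a hyphenation result that carries replacement text instead of a bare index. The `{start, end, before, after}` shape and the name `applyHyphenation` are invented for this example and are not part of any proposal.

```javascript
// Replaces word.slice(point.start, point.end) with `before` at the end of
// the first line and `after` at the start of the second line.
function applyHyphenation(word, point) {
  return [
    word.slice(0, point.start) + point.before,
    point.after + word.slice(point.end),
  ];
}

// Plain case: nothing but a hyphen is added.
console.log(applyHyphenation('segment', { start: 3, end: 3, before: '-', after: '' }));
// Swedish example from this thread: a letter is added at the break.
console.log(applyHyphenation('tuggummi', { start: 4, end: 4, before: '-', after: 'g' }));
// Pre-1996 German example from this thread: "c" turns into "k".
console.log(applyHyphenation('Zucker', { start: 2, end: 3, before: 'k-', after: '' }));
```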

@jungshik

We also need to support 'strictness' (for lack of a better term) either as a separate option or as values of 'type'.

CSS3 has 'strict', 'normal', 'loose' (and 'auto') for line-break, and ICU/CLDR support them. (When v8BreakIterator was written, there was no such distinction.)

@littledan
Member

I filed a new issue for strictness at tc39/proposal-intl-segmenter#5 . Let's migrate all additional discussion of feature requests related to segmentation to that repository.

@sffc sffc added s: in progress Status: the issue has an active proposal c: text Component: case mapping, collation, properties and removed enhancement labels Mar 19, 2019
@sffc sffc added the Proposal Larger change requiring a proposal label Jun 5, 2020
@sffc sffc removed this from the 4th Edition milestone Jun 5, 2020
@ryzokuken
Member

Since Intl.Segmenter is almost done, can we close this?

@sffc
Contributor

sffc commented May 8, 2021

> Since Intl.Segmenter is almost done, can we close this?

I think it should be closed when #553 is merged. I'll add it as a linked issue.

@sffc sffc linked a pull request May 8, 2021 that will close this issue