Introduce wrapping using an optimal-fit algorithm #234
This looks very interesting! Is the last line taken into account when calculating the penalties? And regarding overlong lines, is it possible that the optimal-fit algorithm produces overlong lines that would not be overlong when using the first-fit algorithm?
Thanks! I've been playing with it for a while now, but only recently found the time to push it over the finish line :-)
No, the last line does not get the The logic here can be more or less complex: the original code simply added a penalty if the last line had a single word. However, I found that this looks odd in my small test cases where the last line might be
Yes, this is possible. I got curious about it myself and created a pathological case to demonstrate it. Basically, if a line looks like this:
then it's a question of the penalty paid for moving the long word onto the next line. Here I made the word 30 characters long. In the worst case, it is the very last
The gap has a penalty of 900. In addition, there is a 1000 penalty for every new line added, so the total solution costs 1900. This is weighed against the alternative of letting the
I've set the per-character penalty for overflow to 2500. So overflowing costs 2500, which is more than 1900, and thus we end up with two lines. However, if the long word is 50 characters wide, then the cost of leaving a gap is 2500, which together with the per-line penalty changes the balance so that an overflow does happen. The numbers are pretty arbitrary, though I played around with the interactive example program to see what the effects of the parameters are.
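The trade-off described in this comment can be written out as a small calculation. This is only a sketch of the arithmetic quoted above, not the crate's cost function: the constant and function names are hypothetical, and it assumes the line would overflow by exactly one column, which is how the 2500 figure reads.

```rust
// Hypothetical constants that mirror the numbers quoted in the comment above.
const NLINE_PENALTY: u64 = 1000; // paid for every extra line
const OVERFLOW_PENALTY: u64 = 2500; // paid per overflowing column

/// Cost of moving the long word to a new line, leaving `gap` empty
/// columns at the end of the previous line.
fn cost_of_breaking(gap: u64) -> u64 {
    gap * gap + NLINE_PENALTY
}

/// Cost of keeping the long word on the line and overflowing by
/// `overflow` columns.
fn cost_of_overflowing(overflow: u64) -> u64 {
    overflow * OVERFLOW_PENALTY
}

/// True if breaking before the long word is cheaper than overflowing.
fn prefers_break(gap: u64, overflow: u64) -> bool {
    cost_of_breaking(gap) < cost_of_overflowing(overflow)
}

fn main() {
    // 30-column gap: 900 + 1000 = 1900 < 2500, so the line is broken.
    println!("gap 30: break = {}", prefers_break(30, 1));
    // 50-column gap: 2500 + 1000 = 3500 > 2500, so the line overflows.
    println!("gap 50: break = {}", prefers_break(50, 1));
}
```

With a larger assumed overflow the balance tips back towards breaking, since the overflow cost grows linearly in this sketch while the gap cost is fixed.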
This introduces a new wrapping algorithm which finds a globally optimal set of line breaks, taking certain penalties into account. This is inspired by the line breaking algorithm used in TeX, described in the 1981 article Breaking Paragraphs into Lines[1] by Knuth and Plass. The implementation here is based on Python code by David Eppstein[2].

The wrapping algorithm which we’ve been using until now is a “greedy” or “first fit” algorithm with no look-ahead. It simply accumulates words until no more fit on the line. While simple and predictable, this algorithm can produce poor line breaks when a long word is moved to a new line, leaving behind a large gap.

The new “optimal fit” algorithm considers all possible break points and picks the breaks which minimize the gaps at the end of each line. More precisely, the algorithm assigns a penalty to a break point, determined by (target_width - line_width)**2. As an example, if you’re wrapping at 80 columns, a line with 78 characters has a penalty of 4, but a line with only 75 characters has a penalty of 25. Shorter lines are thus penalized more heavily.

The overall optimization minimizes the sum of the squares. The effect is that the algorithm will move short words down to subsequent lines if it lowers the total cost for the paragraph. This can be seen in action if we wrap the text “To be, or not to be: that is the question” in a narrow column with room for only 10 characters.

The greedy algorithm will produce these lines, each annotated with the corresponding penalty:

    "To be, or"    1² =  1
    "not to be:"   0² =  0
    "that is"      3² =  9
    "the"          7² = 49
    "question"     2² =  4

We see that line four with “the” leaves a gap of 7 columns, which gives it a penalty of 49. The sum of the penalties is 63.
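The greedy penalties above can be reproduced with a few lines of Rust. This is a minimal sketch rather than the crate's code: `wrap_first_fit` and `penalty` are hypothetical helper names, widths are counted in bytes (so ASCII text is assumed), and the last line is penalized like any other, as in the example.

```rust
/// Greedy ("first fit") wrapping: keep adding words to the current line
/// until the next word no longer fits, then start a new line.
fn wrap_first_fit(words: &[&str], width: usize) -> Vec<String> {
    let mut lines: Vec<String> = Vec::new();
    let mut current = String::new();
    for word in words {
        if current.is_empty() {
            current.push_str(word);
        } else if current.len() + 1 + word.len() <= width {
            current.push(' ');
            current.push_str(word);
        } else {
            // The word does not fit: finish this line and start the next.
            lines.push(std::mem::take(&mut current));
            current.push_str(word);
        }
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}

/// Squared gap between the target width and the line width.
/// Overlong lines get a gap of zero here; the real algorithm
/// penalizes them separately.
fn penalty(line: &str, width: usize) -> usize {
    let gap = width.saturating_sub(line.len());
    gap * gap
}

fn main() {
    let text = "To be, or not to be: that is the question";
    let words: Vec<&str> = text.split_whitespace().collect();
    let lines = wrap_first_fit(&words, 10);
    let mut total = 0;
    for line in &lines {
        let p = penalty(line, 10);
        total += p;
        println!("{:<12} {}", format!("{:?}", line), p);
    }
    println!("total: {}", total); // 1 + 0 + 9 + 49 + 4 = 63
}
```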
With an optimal wrapping algorithm, the first line is shortened in order to ensure that line four has a smaller gap:

    "To be,"       4² = 16
    "or not to"    1² =  1
    "be: that"     2² =  4
    "is the"       4² = 16
    "question"     2² =  4

This time the sum of the penalties is 41, so the algorithm will prefer these break points over the first ones.

The full algorithm is slightly more complex than this, e.g., lines longer than the line width are penalized heavily to suppress them. Additionally, hyphens are penalized to ensure they only occur when they improve the breaks substantially.

If a paragraph has n places where line breaks can occur, there are potentially 2**n different ways to typeset it. Searching through all possible combinations would be prohibitively slow. However, it turns out that the problem can be formulated as the task of finding column minima in a cost matrix. This matrix has a special form (totally monotone) which lets us use a linear-time algorithm called SMAWK[3] to find the optimal break points.

This means that the time complexity remains O(n) where n is the number of words. Benchmarking shows that wrapping a very long paragraph with ~300 words or 1600 characters takes ~3.5 times as long as before. The first-fit algorithm took 19 microseconds, optimal-fit takes 72 microseconds. This seems more than fast enough, and I’ve thus made the optimal-fit algorithm the default. If desired, the first-fit algorithm can still be selected.

[1]: http://www.eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf
[2]: https://github.com/jfinkels/PADS/blob/master/pads/wrap.py
[3]: https://lib.rs/crates/smawk
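To make the optimization concrete, the optimal breaks for the example can be found with a plain dynamic program over all break points. This sketch is deliberately not the SMAWK-based implementation from this change: it runs in O(n²) instead of O(n), uses only the squared-gap penalty (no per-line, overflow or hyphen penalties), and counts widths in bytes. The name `wrap_optimal` is hypothetical.

```rust
/// Returns the minimal sum of squared gaps and the lines that achieve it.
fn wrap_optimal(words: &[&str], width: usize) -> (usize, Vec<String>) {
    let n = words.len();
    // best[j] = (i, cost): the cheapest way to wrap words[..j] ends with
    // the line words[i..j] and has total penalty `cost`.
    let mut best: Vec<(usize, usize)> = vec![(0, usize::MAX); n + 1];
    best[0] = (0, 0);
    for j in 1..=n {
        let mut line_width = 0;
        // Try every start `i` for the last line, scanning backwards.
        for i in (0..j).rev() {
            line_width += words[i].len();
            if i + 1 < j {
                line_width += 1; // space before the next word on the line
            }
            // Stop once the line is too long, but always allow a single
            // word on a line even if it is overlong.
            if line_width > width && i + 1 < j {
                break;
            }
            let gap = width.saturating_sub(line_width);
            let cost = best[i].1 + gap * gap;
            // `<=` breaks ties in favour of a longer last line.
            if cost <= best[j].1 {
                best[j] = (i, cost);
            }
        }
    }
    // Walk backwards through the recorded break points.
    let mut lines = Vec::new();
    let mut pos = n;
    while pos > 0 {
        let prev = best[pos].0;
        lines.push(words[prev..pos].join(" "));
        pos = prev;
    }
    lines.reverse();
    (best[n].1, lines)
}

fn main() {
    let text = "To be, or not to be: that is the question";
    let words: Vec<&str> = text.split_whitespace().collect();
    let (total, lines) = wrap_optimal(&words, 10);
    for line in &lines {
        println!("{:?}", line);
    }
    println!("total: {}", total); // 41
}
```

In this example the first two lines can also be broken as "To be, or" / "not to" for the same total of 41, so which of the two solutions comes out depends on how ties are broken. The `<=` comparison above happens to select the breaks listed in the commit message.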
Thanks for the explanations! I’m currently thinking about whether it is possible to use this algorithm if I don’t know all line widths in advance. Maybe I can estimate an upper bound for the fragments that fit in the current area using the first-fit algorithm, re-wrap them using the optimal-fit algorithm and then choose the better result. I’ll try to run some experiments. Please let me know if you have any other ideas.
I started writing this in #126, but I think it applies better here...
Thanks for the explanation, that makes sense... The cost function in More concretely, the At every invocation of the cost function, I use these minima to compute the line number for fragment

    // Line number for fragment `i`.
    let line_number = line_numbers.get(i, &minima); // was &values
    let target_width = std::cmp::max(1, line_widths(line_number));

The line numbers are computed and cached by At this point, we don't know what the final line breaks will be for the whole paragraph; this depends on the jumps in the final

    let mut lines = Vec::with_capacity(line_numbers.get(fragments.len(), &minima));
    let mut pos = fragments.len();
    loop {
        let prev = minima[pos].0;
        lines.push(&fragments[prev..pos]);
        pos = prev;
        if pos == 0 {
            break;
        }
    }
    lines.reverse();

However, when computing the cost for One could in principle use the information in Now, I'm not sure that this is an O(1) computation any longer. The amazing guarantee of the SMAWK machinery is that it will evaluate the cost function only O(n) times for an n-word string. I had to introduce caching of line numbers to ensure that we can compute the cost function in constant time, yielding an overall linear-time algorithm. I guess similar caching could be used if you typeset
I think there is potential to make this more flexible and smarter going forward... I'll merge this for now and make a release to get the new API into the hands of people sooner rather than later. I feel we can adjust things in
Sounds good to me!
Just to be clear, that was just a general thought that does not apply to my use case. (While the line height might change, the line width is constant for the current text area.) So while I think that such a feature might be useful for others, I personally don’t need it.