Introduce wrapping using globally optimal breakpoints
This introduces a new wrapping algorithm which finds a globally optimal set of line breaks, taking certain penalties into account. This is inspired by the line breaking algorithm used by TeX, described in the 1981 article Breaking Paragraphs into Lines[1] by Knuth and Plass. The implementation here is based on Python code by David Eppstein[2].

The wrapping algorithm which we’ve been using until now is a “greedy” or “first fit” algorithm with no look-ahead. It simply accumulates words until no more fit on the line. While simple and predictable, this algorithm can produce poor line breaks when a long word is moved to a new line, leaving behind a large gap.

The new “optimal fit” algorithm considers all possible break points and picks the breaks which minimize the gaps at the end of each line. More precisely, the algorithm assigns a penalty to a break point, determined by (target_width - line_width)**2. As an example, if you’re wrapping at 80 columns, a line with 78 characters has a penalty of 4, but a line with only 75 characters has a penalty of 25. Shorter lines are thus penalized more heavily. The overall optimization minimizes the sum of the squares.

The effect is that the algorithm will move short words down to subsequent lines if it lowers the total cost for the paragraph. This can be seen in action if we wrap the text “To be, or not to be: that is the question” in a narrow column with room for only 10 characters. The greedy algorithm will produce these lines, each annotated with the corresponding penalty:

    "To be, or"   1² =  1
    "not to be:"  0² =  0
    "that is"     3² =  9
    "the"         7² = 49
    "question"    2² =  4

We see that line four with “the” leaves a gap of 7 columns, which gives it a penalty of 49. The sum of the penalties is 63.
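The greedy result and its penalties can be checked with a few lines of Python. This is only a minimal sketch for illustration, not the code in this commit: the helper names `wrap_first_fit` and `penalties` are hypothetical, words are split on whitespace, and every line (including the last) is penalized, as in the example.

```python
def wrap_first_fit(text, width):
    """Greedy wrapping: keep adding words until the next one no longer fits."""
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        # A word that is too wide on its own still gets a line to itself.
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


def penalties(lines, width):
    """Squared gap between the target width and each line's width."""
    return [(width - len(line)) ** 2 for line in lines]


lines = wrap_first_fit("To be, or not to be: that is the question", 10)
costs = penalties(lines, 10)
for line, cost in zip(lines, costs):
    print(f"{line!r:<14}{cost:>3}")
print("total:", sum(costs))
```

Run as written, this produces the five lines from the example with penalties 1, 0, 9, 49 and 4, for a total of 63.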
With an optimal wrapping algorithm, the first line is shortened in order to ensure that line four has a smaller gap:

    "To be,"     4² = 16
    "or not to"  1² =  1
    "be: that"   2² =  4
    "is the"     4² = 16
    "question"   2² =  4

This time the sum of the penalties is 41, so the algorithm will prefer these break points over the first ones.

The full algorithm is slightly more complex than this, e.g., lines longer than the line width are penalized heavily to suppress them. Additionally, hyphens are penalized to ensure they only occur when they improve the breaks substantially.

If a paragraph has n places where line breaks can occur, there are potentially 2**n different ways to typeset it. Searching through all possible combinations would be prohibitively slow. However, it turns out that the problem can be formulated as the task of finding minima in a cost matrix. This matrix has a special form (totally monotone) which lets us use a linear-time algorithm called SMAWK[3] to find the optimal break points. This means that the time complexity remains O(n) where n is the number of words.

Benchmarking shows that wrapping a very long paragraph with ~300 words or 1600 characters takes ~3.8 times as long as before. The first-fit algorithm took 19 microseconds, optimal-fit takes 72 microseconds. This seems more than fast enough, and I’ve thus made the optimal-fit algorithm the default. If desired, the first-fit algorithm can still be selected.

[1]: http://www.eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf
[2]: https://github.com/jfinkels/PADS/blob/master/pads/wrap.py
[3]: https://lib.rs/crates/smawk
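The optimal-fit example can be reproduced with a plain dynamic program over break points. The sketch below is an illustration under stated assumptions, not the implementation in this commit: it runs in O(n²) instead of using SMAWK, it ignores hyphenation, it penalizes the last line like every other line, and the names `wrap_optimal_fit` and `overflow_penalty` (and its value of 10000) are hypothetical.

```python
import math


def wrap_optimal_fit(text, width, overflow_penalty=10_000):
    """Return (lines, total_penalty) minimizing the sum of squared gaps."""
    words = text.split()
    n = len(words)
    # best[j] is the minimal total penalty for wrapping words[:j];
    # start[j] is the index of the first word on the last of those lines.
    best = [0] + [math.inf] * n
    start = [0] * (n + 1)
    for j in range(1, n + 1):
        line_width = -1  # cancels the space counted before the first word
        for i in range(j - 1, -1, -1):
            line_width += len(words[i]) + 1
            gap = width - line_width
            cost = gap ** 2
            if gap < 0:
                # Assumed stand-in for "penalized heavily": overlong lines
                # are only chosen when nothing else is possible.
                cost += overflow_penalty
            # "<=" breaks ties in favor of the longer final line.
            if best[i] + cost <= best[j]:
                best[j] = best[i] + cost
                start[j] = i
    # Walk the recorded break points backwards to rebuild the lines.
    lines, j = [], n
    while j > 0:
        i = start[j]
        lines.append(" ".join(words[i:j]))
        j = i
    lines.reverse()
    return lines, best[n]


opt_lines, opt_total = wrap_optimal_fit(
    "To be, or not to be: that is the question", 10
)
for line in opt_lines:
    print(repr(line))
print("total:", opt_total)
```

For the example text at width 10 this yields a total penalty of 41. In this simplified cost model the layout "To be, or" / "not to" / "be: that" / "is the" / "question" also costs 41, so which of the two is returned depends on how ties are broken; the `<=` comparison here selects the layout shown in the example.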