Introduce wrapping using globally optimal breakpoints
This introduces a new wrapping algorithm which finds a globally
optimal set of line breaks, taking certain penalties into account.
This is inspired by the line breaking algorithm used in TeX, described
in the 1981 article Breaking Paragraphs into Lines[1] by Knuth and Plass.
The implementation here is based on Python code by David Eppstein[2].

The wrapping algorithm which we’ve been using until now is a “greedy”
or “first fit” algorithm with no look-ahead. It simply accumulates
words until no more fit on the line. While simple and predictable,
this algorithm can produce poor line breaks when a long word is moved
to a new line, leaving behind a large gap.
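
The first-fit behaviour described above can be sketched as follows.
This is a minimal illustration, not the crate's code: the function
name wrap_first_fit is hypothetical, widths are counted in chars, and
a word wider than the line is simply left to overflow.

```rust
/// Greedy "first fit" wrapping: keep adding words to the current line
/// until the next word would no longer fit, then start a new line.
/// Assumption: one `char` occupies one column.
fn wrap_first_fit(text: &str, width: usize) -> Vec<String> {
    let mut lines: Vec<String> = Vec::new();
    let mut current = String::new();
    let mut current_width = 0;

    for word in text.split_whitespace() {
        let word_width = word.chars().count();
        // A non-empty line needs one extra column for the separating space.
        if !current.is_empty() && current_width + 1 + word_width > width {
            lines.push(std::mem::take(&mut current));
            current_width = 0;
        }
        if !current.is_empty() {
            current.push(' ');
            current_width += 1;
        }
        current.push_str(word);
        current_width += word_width;
    }
    // Flush the last, possibly short, line.
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}
```

There is no look-ahead: once a word has been placed on a line, the
decision is never revisited.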

The new “optimal fit” algorithm considers all possible break points
and picks the breaks which minimize the gaps at the end of each line.
More precisely, the algorithm assigns a penalty to a break point,
determined by (target_width - line_width)**2. As an example, if you’re
wrapping at 80 columns, a line with 78 characters has a penalty of 4,
but a line with only 75 characters has a penalty of 25. Shorter
lines are thus penalized more heavily.
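
As a small sketch of the formula above (the function name penalty is
hypothetical, and returning None for a line that is too long is a
simplification made here, not how the crate treats overlong lines):

```rust
/// Squared gap between the target width and the actual line width.
/// Returns `None` when the line is wider than the target; this sketch
/// deliberately leaves that case out.
fn penalty(target_width: usize, line_width: usize) -> Option<usize> {
    let gap = target_width.checked_sub(line_width)?;
    Some(gap * gap)
}
```

With a target of 80 columns, penalty(80, 78) gives Some(4) and
penalty(80, 75) gives Some(25), matching the numbers above.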

The overall optimization minimizes the sum of the squares. The effect
is that the algorithm will move short words down to subsequent lines
if it lowers the total cost for the paragraph. This can be seen in
action if we wrap the text “To be, or not to be: that is the question”
in a narrow column with room for only 10 characters.

The greedy algorithm will produce these lines, each annotated with the
corresponding penalty:

    "To be, or"   1² =  1
    "not to be:"  0² =  0
    "that is"     3² =  9
    "the"         7² = 49
    "question"    2² =  4

We see that line four with “the” leaves a gap of 7 columns, which
gives it a penalty of 49. The sum of the penalties is 63.

With an optimal wrapping algorithm, the first line is shortened in
order to ensure that line four has a smaller gap:

    "To be,"     4² = 16
    "or not to"  1² =  1
    "be: that"   2² =  4
    "is the"     4² = 16
    "question"   2² =  4

This time the sum of the penalties is 41, so the algorithm will prefer
these break points over the first ones.
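
The two totals can be checked with a few lines of Rust. This is only
a verification sketch: GREEDY, OPTIMAL and total_penalty are names
made up for the illustration, and widths are counted in chars.

```rust
/// The lines produced by the greedy algorithm in the example above.
const GREEDY: [&str; 5] = ["To be, or", "not to be:", "that is", "the", "question"];
/// The lines produced by the optimal algorithm in the example above.
const OPTIMAL: [&str; 5] = ["To be,", "or not to", "be: that", "is the", "question"];

/// Sum of squared gaps over all lines, including the last one.
/// Assumes no line is wider than `width`.
fn total_penalty(lines: &[&str], width: usize) -> usize {
    lines
        .iter()
        .map(|line| {
            let gap = width.saturating_sub(line.chars().count());
            gap * gap
        })
        .sum()
}
```

Here total_penalty(&GREEDY, 10) is 63 and total_penalty(&OPTIMAL, 10)
is 41.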

The full algorithm is slightly more complex than this, e.g., lines
longer than the line width are penalized heavily to suppress them.
Additionally, hyphens are penalized to ensure they only occur when
they improve the breaks substantially.

If a paragraph has n places where line breaks can occur, there are
potentially 2**n different ways to typeset it. Searching through all
possible combinations would be prohibitively slow. However, it turns
out that the problem can be formulated as the task of finding the
minima in a cost matrix. This matrix has a special form (totally monotone)
which lets us use a linear-time algorithm called SMAWK[3] to find the
optimal break points.

This means that the time complexity remains O(n) where n is the number
of words.
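
To make the optimization concrete, here is a plain dynamic-programming
sketch of the same minimization. It is NOT the SMAWK-based code from
this commit: it runs in O(n²) time, uses only the squared-gap penalty
(no hyphen penalty), penalizes the last line like any other, and gives
up when a word is wider than the line. The name wrap_min_cost is
hypothetical.

```rust
/// Finds line breaks minimizing the sum of squared end-of-line gaps.
/// Returns the total penalty and the lines, or `None` if some word is
/// wider than `width`. Quadratic-time sketch, not the SMAWK approach.
fn wrap_min_cost(text: &str, width: usize) -> Option<(usize, Vec<String>)> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let n = words.len();
    // best[i]: minimal penalty for wrapping words[i..], if it is possible.
    let mut best: Vec<Option<usize>> = vec![None; n + 1];
    // next_break[i]: index of the first word on the line after the one
    // starting at word i, in an optimal solution.
    let mut next_break = vec![n; n + 1];
    best[n] = Some(0);

    for i in (0..n).rev() {
        let mut line_width = 0;
        for j in i..n {
            // Width of words[i..=j] joined by single spaces.
            line_width += words[j].chars().count() + if j > i { 1 } else { 0 };
            if line_width > width {
                break;
            }
            let gap = width - line_width;
            if let Some(rest) = best[j + 1] {
                let cost = gap * gap + rest;
                if best[i].map_or(true, |b| cost < b) {
                    best[i] = Some(cost);
                    next_break[i] = j + 1;
                }
            }
        }
    }

    let total = best[0]?;
    // Follow the recorded breaks to rebuild the lines.
    let mut lines = Vec::new();
    let mut i = 0;
    while i < n {
        let j = next_break[i];
        lines.push(words[i..j].join(" "));
        i = j;
    }
    Some((total, lines))
}
```

For “To be, or not to be: that is the question” at width 10 this
finds a total penalty of 41, the same as the optimal layout shown
earlier. Several layouts can share the minimal cost, so the exact
lines depend on how ties are broken.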

Benchmarking shows that wrapping a very long paragraph with ~300 words
or 1600 characters takes ~3.5 times as long as before. The first-fit
algorithm took 19 microseconds; optimal-fit takes 72 microseconds.
This seems more than fast enough, and I’ve thus made the optimal-fit
algorithm the default. If desired, the first-fit algorithm can still be
selected.

[1]: http://www.eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf
[2]: https://github.com/jfinkels/PADS/blob/master/pads/wrap.py
[3]: https://lib.rs/crates/smawk
mgeisler committed Dec 2, 2020
1 parent 674d540 commit 6c0d4c2
Showing 5 changed files with 455 additions and 20 deletions.
1 change: 1 addition & 0 deletions Cargo.toml
@@ -31,6 +31,7 @@ name = "linear"
 harness = false
 
 [dependencies]
+smawk = "0.3"
 unicode-width = "0.1"
 terminal_size = { version = "0.1", optional = true }
 hyphenation = { version = "0.8", optional = true, features = ["embed_en-us"] }
26 changes: 21 additions & 5 deletions benches/linear.rs
@@ -23,12 +23,28 @@ pub fn benchmark(c: &mut Criterion) {
     let mut group = c.benchmark_group("String lengths");
     for length in [100, 200, 400, 800, 1600, 3200, 6400].iter() {
         let text = lorem_ipsum(*length);
-        let options = textwrap::Options::new(LINE_LENGTH);
-        group.bench_with_input(BenchmarkId::new("fill", length), &text, |b, text| {
-            b.iter(|| textwrap::fill(text, &options));
-        });
+        let options = textwrap::Options::new(LINE_LENGTH)
+            .wrap_algorithm(textwrap::core::WrapAlgorithm::OptimalFit);
+        group.bench_with_input(
+            BenchmarkId::new("fill_optimal_fit", length),
+            &text,
+            |b, text| {
+                b.iter(|| textwrap::fill(text, &options));
+            },
+        );
+
+        let options = textwrap::Options::new(LINE_LENGTH)
+            .wrap_algorithm(textwrap::core::WrapAlgorithm::FirstFit);
+        group.bench_with_input(
+            BenchmarkId::new("fill_first_fit", length),
+            &text,
+            |b, text| {
+                b.iter(|| textwrap::fill(text, &options));
+            },
+        );
-        let options: textwrap::Options = options.splitter(Box::new(textwrap::HyphenSplitter));
+        let options: textwrap::Options =
+            textwrap::Options::new(LINE_LENGTH).splitter(Box::new(textwrap::HyphenSplitter));
         group.bench_with_input(BenchmarkId::new("fill_boxed", length), &text, |b, text| {
             b.iter(|| textwrap::fill(text, &options));
         });
17 changes: 17 additions & 0 deletions examples/interactive.rs
@@ -19,6 +19,7 @@ mod unix_only {
     use termion::raw::{IntoRawMode, RawTerminal};
     use termion::screen::AlternateScreen;
     use termion::{color, cursor, style};
+    use textwrap::core::WrapAlgorithm::{FirstFit, OptimalFit};
     use textwrap::{wrap, HyphenSplitter, NoHyphenation, Options, WordSplitter};
 
     #[cfg(feature = "hyphenation")]
@@ -101,6 +102,16 @@ mod unix_only {
         )?;
         left_row += 1;
 
+        write!(
+            stdout,
+            "{}- algorithm: {}{:?}{} (toggle with Ctrl-o)",
+            cursor::Goto(left_col, left_row),
+            style::Bold,
+            options.wrap_algorithm,
+            style::Reset,
+        )?;
+        left_row += 1;
+
         let now = std::time::Instant::now();
         let mut lines = wrap(text, options);
         let elapsed = now.elapsed();
@@ -232,6 +243,12 @@ mod unix_only {
                 Key::Left => options.width = options.width.saturating_sub(1),
                 Key::Right => options.width = options.width.saturating_add(1),
                 Key::Ctrl('b') => options.break_words = !options.break_words,
+                Key::Ctrl('o') => {
+                    options.wrap_algorithm = match options.wrap_algorithm {
+                        OptimalFit => FirstFit,
+                        FirstFit => OptimalFit,
+                    }
+                }
                 Key::Ctrl('s') => {
                     let idx = idx_iter.next().unwrap();
                     std::mem::swap(&mut options.splitter, &mut splitters[idx]);