Encoding mismatches in text boxes with truncation #777

airblade · 2014-10-01T13:33:34Z

Here's my use case: I have a fixed size text box and I want to write text in it. Often that text is too long for the text box and so the text should be truncated. This all works fine for me.

However I would like to indicate when the text has been truncated, perhaps with an ellipsis or simply three full stops (periods). This is how I am trying to achieve it:

box = Prawn::Text::Box.new text, disable_wrap_by_character: true, ...  # sizing options omitted
overflow = box.render dry_run: true
text = text.sub(/ #{overflow}$/, '...') unless overflow.empty?
document.text_box text, disable_wrap_by_character: true, ...  # as above

However I often, though not always, get incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string) errors. Sometimes text is UTF-8, sometimes it's US-ASCII. Sometimes overflow is UTF-8, sometimes it's US-ASCII.

I have tried re-encoding both text and overflow to UTF-8 but then the substitution doesn't work because the strings no longer match.
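
For reference, the error can be reproduced without Prawn. This is a minimal sketch, assuming the overflow comes back as single-byte text tagged ASCII-8BIT; the sample strings are made up for illustration and are not taken from a real render:

```ruby
# Hypothetical sample data; overflow mimics what the dry run might return.
text     = "Ohrring / weiß,schwarz"                            # UTF-8
overflow = "wei\xDF,schwarz".dup.force_encoding("ASCII-8BIT")  # single-byte ß, tagged binary

outcome =
  begin
    # The interpolated high byte fixes the regexp's encoding to ASCII-8BIT,
    # which cannot be matched against a UTF-8 string containing non-ASCII text.
    text.sub(/ #{overflow}$/, "...")
  rescue Encoding::CompatibilityError => e
    e.class
  end

puts outcome
```

When both strings are pure ASCII the match goes through, which would explain why the error only shows up for some inputs.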

Is this the best approach for achieving my goal?
If it is, how can I make it work? :)

airblade · 2014-10-01T13:48:58Z

It looks like when text is UTF-8 and overflow contains any non-ASCII characters, e.g. ß or ü, overflow's encoding is set to ASCII-8BIT.

That seems counter-intuitive to me ;)

Example debugging output:

text: (UTF-8) "AKKESOIR ~ SS14 / RHINESTONE FINGERTIP RING / Ring klein rund / Strass"
overflow: (UTF-8) "Strass"
result: (UTF-8) "AKKESOIR ~ SS14 / RHINESTONE FINGERTIP RING / Ring klein rund / "

text: (UTF-8) "AKKESOIR ~ SS14 / BLACK WHITE CHANDELIERS / Ohrring / weiß,schwarz"
overflow: (ASCII-8BIT) "wei\xDF,schwarz"
result: (UTF-8) "AKKESOIR ~ SS14 / BLACK WHITE CHANDELIERS / Ohrring / "

airblade · 2014-10-01T14:01:02Z

Ah, here's an alternative approach which appears to work.

Instead of using text.sub, take a substring of text whose length is the difference in length between text and overflow. The length method seems to calculate the length as one would hope regardless of the string's encoding.

text = text[0, (text.length - overflow.length - 1)]
text = "#{text}..."

(The -1 gets rid of the trailing space or hyphen.)

Of course the dots may still get truncated... but with disable_wrap_by_character: true I think this is the best I can do.
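
The length-based approach above can be wrapped in a small helper. This is only a sketch: the helper name, the marker argument, and the sample strings are hypothetical, and it assumes the overflow is either UTF-8 or a single-byte encoding, so that its length equals the number of characters cut off:

```ruby
# Hypothetical helper: drop the overflow (plus the one separator character
# before it) from the end of text and append a truncation marker.
def truncate_with_marker(text, overflow, marker = "...")
  return text if overflow.empty?

  kept = [text.length - overflow.length - 1, 0].max
  "#{text[0, kept]}#{marker}"
end

text     = "Ohrring / weiß,schwarz"
overflow = "wei\xDF,schwarz".dup.force_encoding("ASCII-8BIT")

puts truncate_with_marker(text, overflow)
```

Because no regexp or string comparison is involved, the two strings never have to agree on an encoding.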

airblade · 2014-10-01T14:10:41Z

This may be related to #603; I'm not sure.

practicingruby · 2014-10-01T22:13:06Z

@airblade Thanks, I'll take a closer look at this in the next few days, hopefully. We have some lingering encoding issues (mainly on Ruby 1.9.3, but some affect all Ruby versions), and I'd like to get those sorted out if we can.

airblade · 2014-10-02T07:08:06Z

@sandal Thanks. I'm on Ruby 1.9.3p286 and Prawn 1.0.0 – let me know if I can provide any further information.

practicingruby · 2014-10-02T09:34:06Z

@airblade: Would it be possible for you to try to reproduce on Ruby 2.0 or 2.1? Even if it's not feasible for you to upgrade Ruby in your production code, it'll help us narrow this down.

airblade · 2014-10-02T13:19:21Z

@sandal I'll have a go and let you know what I find.

straydogstudio · 2014-10-02T21:16:57Z

@airblade @sandal I've had success forcing the overflow encoding as iso-8859-1 and re-encoding it as UTF-8:

overflow = overflow.force_encoding('iso-8859-1').encode('utf-8')

This is on Ruby 2.0. Of course, forcing the encoding as iso-8859-1 may still be incorrect.
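
To illustrate where the ISO-8859-1 guess holds and where it may not, here is a sketch that decodes the same bytes both ways. It assumes the overflow bytes are Windows-1252 and uses Ruby's own Windows-1252 transcoder as a stand-in; whether that matches Prawn's internal table byte for byte is an assumption. The sample strings are made up:

```ruby
overflow = "jumped \xFCber the lazy dog.".dup.force_encoding("ASCII-8BIT")

# For ü (byte 0xFC) both encodings agree.
as_latin1 = overflow.dup.force_encoding("ISO-8859-1").encode("UTF-8")
as_cp1252 = overflow.dup.force_encoding("Windows-1252").encode("UTF-8")

# Byte 0x80 is where they differ: a control character in ISO-8859-1,
# the euro sign in Windows-1252.
byte_80     = "\x80".dup.force_encoding("ASCII-8BIT")
euro_latin1 = byte_80.dup.force_encoding("ISO-8859-1").encode("UTF-8")
euro_cp1252 = byte_80.dup.force_encoding("Windows-1252").encode("UTF-8")

puts as_latin1
puts as_cp1252
puts euro_latin1.inspect
puts euro_cp1252.inspect
```

So the workaround looks safe for Latin-1 letters such as ü and ß, but bytes in the 0x80 to 0x9F range would come out wrong if the source really is Windows-1252.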

airblade · 2014-10-03T09:53:19Z

Here's a short program that demonstrates the problem. I ran it on Ruby 1.9.3p286 and Ruby 2.1.3, and Prawn 1.0.0 and Prawn 1.3.0. The results are below.

# encoding: utf-8
require 'prawn'

@doc = Prawn::Document.new page_size: 'A4'

def debug(name, string)
  puts "#{name}: (#{string.encoding}) #{string.inspect}"
end

def render_text_box(string)
  Prawn::Text::Box.new(
    string,
    width: 100,
    height: 20,
    document: @doc
  ).render
end

text = "A quick brown fox jumped over the lazy dog."
overflow = render_text_box text
debug 'text', text
debug 'overflow', overflow

text = "A quick brown fox jumped über the lazy dog."
overflow = render_text_box text
debug 'text', text
debug 'overflow', overflow

And here are the results:

# Ruby 1.9.3p286, Prawn 1.0.0
text: (UTF-8) "A quick brown fox jumped over the lazy dog."
overflow: (UTF-8) "jumped over the lazy dog."
text: (UTF-8) "A quick brown fox jumped über the lazy dog."
overflow: (ASCII-8BIT) "jumped \xFCber the lazy dog."

# Ruby 1.9.3p286, Prawn 1.3.0
text: (UTF-8) "A quick brown fox jumped over the lazy dog."
overflow: (ASCII-8BIT) "jumped over the lazy dog."
text: (UTF-8) "A quick brown fox jumped über the lazy dog."
overflow: (ASCII-8BIT) "jumped \xFCber the lazy dog."

# Ruby 2.1.3, Prawn 1.0.0
text: (UTF-8) "A quick brown fox jumped over the lazy dog."
overflow: (UTF-8) "jumped over the lazy dog."
text: (UTF-8) "A quick brown fox jumped über the lazy dog."
overflow: (ASCII-8BIT) "jumped \xFCber the lazy dog."

# Ruby 2.1.3, Prawn 1.3.0
text: (UTF-8) "A quick brown fox jumped over the lazy dog."
overflow: (ASCII-8BIT) "jumped over the lazy dog."
text: (UTF-8) "A quick brown fox jumped über the lazy dog."
overflow: (ASCII-8BIT) "jumped \xFCber the lazy dog."

I would expect the overflow to always have the same encoding as the original text, i.e. UTF-8.

practicingruby · 2014-10-05T11:45:31Z

@airblade: On closer investigation, this behavior isn't exactly a bug, at least in 1.3.0.

Here's the summary of why:

You're using built-in PDF fonts, which are NOT generally UTF-8 friendly. They support the Win-1252 format, which is nearly equivalent to ISO-8859-1 (Latin-1), and that explains why @straydogstudio's workaround might work.
When using built-in fonts, Prawn will attempt to convert UTF-8 to Win-1252, a format that Ruby itself doesn't provide an encoding for, so we need to treat it as ASCII-8BIT and implement the character map ourselves. So from this perspective, the returned overflow value is "right", but it's going to fail to render anything outside of Win-1252's very limited character range.
To get around this problem, you can use any TTF font file with Prawn that supports the range of Unicode characters you want. Most common fonts support everything you'd need.

Using DejaVuSans, I was able to get the following output on both Prawn 1.0 and Prawn 1.3 (I don't think Ruby version matters):

text: (UTF-8) "A quick brown fox jumped over the lazy dog."
overflow: (UTF-8) "fox jumped over the lazy dog."
text: (UTF-8) "A quick brown fox jumped über the lazy dog."
overflow: (UTF-8) "fox jumped über the lazy dog."

I think that's what you were looking for, right?

We need to do a better job of informing people that Prawn's default font selection (and, not coincidentally, the PDF format's defaults) is NOT Unicode friendly, even though Prawn itself handles UTF-8 text fine given fonts that support it. I think this may involve raising a warning or error when non-compatible glyphs are found, and also probably a guide explaining this. I'll open a ticket for those issues.

practicingruby · 2014-10-05T11:57:30Z

The note about the need for better documentation and warning behavior is in #779.

practicingruby · 2014-10-05T14:28:00Z

@airblade Upon closer investigation, the plot thickens! Here's a summary of what's going wrong here:

Because we're using the built-in AFM fonts, Prawn takes the UTF-8 text, converts it to ASCII-8BIT (i.e. just a binary sequence of bytes), and then uses Prawn's WinAnsi lookup table to map the codepoints to their relevant glyphs. So ü would be mapped to codepoint 252, which is actually the same in both Unicode and WinAnsi.

In WinAnsi, the byte value and codepoint are the same (252), but in UTF-8 they are not:

>> "ü".codepoints
=> [252]
>> "ü".bytes
=> [195, 188]

So when we attempt to convert this text back into UTF-8 in various places throughout the text call chain, we're losing information and attempting to treat WinAnsi byte values as if they are equivalent to UTF-8 byte values: They're not!
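
The information loss described above can be seen in plain Ruby. This is a sketch only: it uses Ruby's Windows-1252 encoding as a stand-in for Prawn's WinAnsi lookup table, which is an assumption about how closely the two agree:

```ruby
utf8    = "ü"
winansi = utf8.encode("Windows-1252")   # stand-in for the WinAnsi form

p utf8.codepoints    # one codepoint
p utf8.bytes         # two bytes in UTF-8
p winansi.bytes      # one byte in WinAnsi

# Relabelling the WinAnsi byte as UTF-8 without transcoding yields an
# invalid string, because 0xFC on its own is not a complete UTF-8 sequence.
mislabelled = winansi.dup.force_encoding("UTF-8")
p mislabelled.valid_encoding?

# Transcoding (rather than relabelling) recovers the original text.
round_trip = winansi.encode("UTF-8")
p round_trip == utf8
```

The difference between relabelling the bytes and transcoding them is the crux: only the latter maps byte 252 back to the two-byte UTF-8 form.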

This is going to take further investigation, but it seems like this gets us at least a little closer. Sorry for the long and probably fuzzy explanation above.

airblade · 2014-10-06T07:36:13Z

@sandal Thank you very much for investigating this and for the explanations.

I had no idea that the PDF standard specifies default fonts which only support Win-1252. That's surprising in this day and age but I suppose it's an antique standard. I'll leave a comment on #779.

You're right, using a TTF font solves my immediate problem – so thank you.

As for the thickening plot, I shall follow with interest and try to contribute where I can.

practicingruby · 2014-10-19T12:45:57Z

@airblade: I'm working on a fix that would convert the remaining text back into UTF-8, which I think is a better behavior. But as it turns out, the existing behavior is documented with specs, and there is a way of getting things to work on released versions of Prawn.

By passing the :skip_encoding => true option to text_box, WinANSI text will be correctly rendered. This is a clunky interface, so we should think about changing it, and that's why I'm considering always returning UTF-8.

practicingruby · 2014-10-19T13:39:36Z

@airblade See #793 for a rough proof of concept of how we can improve the API. It's not guaranteed to be stable yet, but it should give you an idea of where I'd like to head with things.

airblade · 2014-10-20T08:39:00Z

@sandal That looks like as simple a solution as can be.

practicingruby closed this as completed Oct 5, 2014

practicingruby reopened this Oct 5, 2014

practicingruby mentioned this issue Oct 5, 2014

Improve awareness of lack of UTF-8 support in PDF built-in (AFM) fonts #779

Closed

practicingruby added the confirmed-bug label Oct 5, 2014

practicingruby added change-request and removed confirmed-bug labels Oct 19, 2014

practicingruby mentioned this issue Oct 19, 2014

Change text box to return remaining text as UTF-8, improve Win1252 handling internally, raise errors or warnings rather than silently replacing invalid glyphs #793

Closed

practicingruby closed this as completed Jan 4, 2015
