Encoding mismatches in text boxes with truncation #777

Closed
airblade opened this issue Oct 1, 2014 · 16 comments

@airblade
Contributor

airblade commented Oct 1, 2014

Here's my use case: I have a fixed size text box and I want to write text in it. Often that text is too long for the text box and so the text should be truncated. This all works fine for me.

However I would like to indicate when the text has been truncated, perhaps with an ellipsis or simply three full stops (periods). This is how I am trying to achieve it:

box = Prawn::Text::Box.new text, disable_wrap_by_character: true, ...  # sizing options omitted
overflow = box.render dry_run: true
text = text.sub(/ #{overflow}$/, '...') unless overflow.empty?
document.text_box text, disable_wrap_by_character: true, ...  # as above

However I often, though not always, get incompatible encoding regexp match (ASCII-8BIT regexp with UTF-8 string) errors. Sometimes text is UTF-8, sometimes it's US-ASCII. Sometimes overflow is UTF-8, sometimes it's US-ASCII.

I have tried re-encoding both text and overflow to UTF-8 but then the substitution doesn't work because the strings no longer match.

  • Is this the best approach for achieving my goal?
  • If it is, how can I make it work? :)
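
A minimal sketch of how the error can arise, assuming overflow comes back as a binary (ASCII-8BIT) string with single-byte characters; the sample strings and the try_sub helper are illustrative, not part of Prawn:

```ruby
# encoding: utf-8

# Hypothetical helper: attempts the substitution from the snippet above and
# reports the encoding error instead of crashing.
def try_sub(text, overflow)
  text.sub(/ #{overflow}$/, '...')
rescue Encoding::CompatibilityError => e
  "error: #{e.message}"
end

# UTF-8 text containing a non-ASCII character.
text = "Ohrring / weiß,schwarz"
# Assumed shape of the overflow: the same characters as single bytes,
# tagged as binary (ASCII-8BIT).
overflow = "wei\xDF,schwarz".dup.force_encoding(Encoding::BINARY)

puts try_sub(text, overflow)

# With ASCII-only strings the encodings are compatible and the match succeeds.
puts try_sub("Ring klein rund / Strass", "Strass")
```

The interpolated binary string makes the whole regexp binary, which Ruby refuses to match against a UTF-8 string that contains non-ASCII characters.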
@airblade
Contributor Author

airblade commented Oct 1, 2014

It looks like when text is UTF-8 and overflow contains any non-ASCII characters, e.g. ß or ü, overflow's encoding is set to ASCII-8BIT.

That seems counter-intuitive to me ;)

Example debugging output:

text: (UTF-8) "AKKESOIR ~ SS14 / RHINESTONE FINGERTIP RING / Ring klein rund / Strass"
overflow: (UTF-8) "Strass"
result: (UTF-8) "AKKESOIR ~ SS14 / RHINESTONE FINGERTIP RING / Ring klein rund / "

text: (UTF-8) "AKKESOIR ~ SS14 / BLACK WHITE CHANDELIERS / Ohrring / weiß,schwarz"
overflow: (ASCII-8BIT) "wei\xDF,schwarz"
result: (UTF-8) "AKKESOIR ~ SS14 / BLACK WHITE CHANDELIERS / Ohrring / "

@airblade
Contributor Author

airblade commented Oct 1, 2014

Ah, here's an alternative approach which appears to work.

Instead of using text.sub, take a substring of text whose length is the difference in length between text and overflow. The length method seems to calculate the length as one would hope regardless of the string's encoding.

text = text[0, (text.length - overflow.length - 1)]
text = "#{text}..."

(The -1 gets rid of the trailing space or hyphen.)
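
Wrapped up as a method, the idea might look like the sketch below. The method name truncate_with_ellipsis and the sample strings are made up for illustration, and it assumes the overflow holds one byte per character so that its length lines up with the character count of the UTF-8 text:

```ruby
# encoding: utf-8

# Hypothetical helper: cut `text` just before the overflow (and the separator
# in front of it), then append a marker. Works on lengths only, so the two
# strings never have to be compared across encodings.
def truncate_with_ellipsis(text, overflow, marker = '...')
  return text if overflow.empty?
  keep = text.length - overflow.length - 1
  keep = 0 if keep < 0
  "#{text[0, keep]}#{marker}"
end

text = "Ohrring / weiß,schwarz"
# Assumed overflow: binary string, one byte per character.
overflow = "wei\xDF,schwarz".dup.force_encoding(Encoding::BINARY)

puts truncate_with_ellipsis(text, overflow)
puts truncate_with_ellipsis("short", "")
```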

Of course the dots may still get truncated... but with disable_wrap_by_character: true I think this is the best I can do.

@airblade
Contributor Author

airblade commented Oct 1, 2014

This may be related to #603; I'm not sure.

@practicingruby
Member

@airblade Thanks, I'll take a closer look at this in the next few days, hopefully. We have some lingering encoding issues (mainly on Ruby 1.9.3, but some affect all Ruby versions), and I'd like to get those sorted out if we can.

@airblade
Contributor Author

airblade commented Oct 2, 2014

@sandal Thanks. I'm on Ruby 1.9.3p286 and Prawn 1.0.0 – let me know if I can provide any further information.

@practicingruby
Member

@airblade: Would it be possible for you to try to reproduce on Ruby 2.0 or 2.1? Even if it's not feasible for you to upgrade Ruby in your production code, it'll help us narrow this down.

@airblade
Contributor Author

airblade commented Oct 2, 2014

@sandal I'll have a go and let you know what I find.

@straydogstudio
Contributor

@airblade @sandal I've had success forcing the overflow encoding as iso-8859-1 and re-encoding it as UTF-8:

overflow = overflow.force_encoding('iso-8859-1').encode('utf-8')

This is on Ruby 2.0. Of course, forcing the encoding as iso-8859-1 may still be incorrect.
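
A small sketch of what that workaround does and where it could go wrong, using made-up sample bytes rather than real Prawn output. It assumes the overflow bytes are single-byte characters in a Latin-1-like code page:

```ruby
# encoding: utf-8

# Assumed overflow as returned for a built-in font: binary, one byte per char.
raw = "wei\xDF,schwarz".dup.force_encoding(Encoding::BINARY)

# Relabel the bytes as ISO-8859-1, then transcode to UTF-8.
as_utf8 = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts as_utf8            # readable UTF-8 text
puts as_utf8.encoding

# Caveat: bytes 0x80..0x9F are C1 control characters in ISO-8859-1, while
# Windows-style code pages place printable characters there (0x80 is the
# euro sign in Windows-1252). Such a byte would come out as U+0080.
odd = "\x80".dup.force_encoding('ISO-8859-1').encode('UTF-8')
puts odd.codepoints.to_a.inspect
```

For text limited to characters that sit at the same position in both code pages, such as ß and ü, the relabel-and-transcode step gives the expected UTF-8 string.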

@airblade
Contributor Author

airblade commented Oct 3, 2014

Here's a short program that demonstrates the problem. I ran it on Ruby 1.9.3p286 and Ruby 2.1.3, and Prawn 1.0.0 and Prawn 1.3.0. The results are below.

# encoding: utf-8
require 'prawn'

@doc = Prawn::Document.new page_size: 'A4'

def debug(name, string)
  puts "#{name}: (#{string.encoding}) #{string.inspect}"
end

def render_text_box(string)
  Prawn::Text::Box.new(
    string,
    width: 100,
    height: 20,
    document: @doc
  ).render
end

text = "A quick brown fox jumped over the lazy dog."
overflow = render_text_box text
debug 'text', text
debug 'overflow', overflow

text = "A quick brown fox jumped über the lazy dog."
overflow = render_text_box text
debug 'text', text
debug 'overflow', overflow

And here are the results:

# Ruby 1.9.3p286, Prawn 1.0.0
text: (UTF-8) "A quick brown fox jumped over the lazy dog."
overflow: (UTF-8) "jumped over the lazy dog."
text: (UTF-8) "A quick brown fox jumped über the lazy dog."
overflow: (ASCII-8BIT) "jumped \xFCber the lazy dog."

# Ruby 1.9.3p286, Prawn 1.3.0
text: (UTF-8) "A quick brown fox jumped over the lazy dog."
overflow: (ASCII-8BIT) "jumped over the lazy dog."
text: (UTF-8) "A quick brown fox jumped über the lazy dog."
overflow: (ASCII-8BIT) "jumped \xFCber the lazy dog."

# Ruby 2.1.3, Prawn 1.0.0
text: (UTF-8) "A quick brown fox jumped over the lazy dog."
overflow: (UTF-8) "jumped over the lazy dog."
text: (UTF-8) "A quick brown fox jumped über the lazy dog."
overflow: (ASCII-8BIT) "jumped \xFCber the lazy dog."

# Ruby 2.1.3, Prawn 1.3.0
text: (UTF-8) "A quick brown fox jumped over the lazy dog."
overflow: (ASCII-8BIT) "jumped over the lazy dog."
text: (UTF-8) "A quick brown fox jumped über the lazy dog."
overflow: (ASCII-8BIT) "jumped \xFCber the lazy dog."

I would expect the overflow to always have the same encoding as the original text, i.e. UTF-8.

@practicingruby
Member

@airblade: On closer investigation, this behavior isn't exactly a bug, at least in 1.3.0.

Here's the summary of why:

  • You're using built-in PDF fonts, which are NOT generally UTF-8 friendly. They support the Win-1252 format, which is nearly equivalent to ISO-8859-1 (Latin-1); this explains why @straydogstudio's workaround might work.
  • When using built-in fonts, Prawn will attempt to convert UTF-8 to Win-1252, a format that Ruby itself doesn't provide an encoding for, so we need to treat it as ASCII-8BIT and implement the character map ourselves. So from this perspective, the returned overflow value is "right", but it's going to fail to render anything outside of Win-1252's very limited character range.
  • To get around this problem, you can use any TTF font file with Prawn that supports the range of Unicode characters you want: Most common fonts support everything you'd need.

Using DejaVuSans, I was able to get the following output on both Prawn 1.0 and Prawn 1.3 (I don't think Ruby version matters):

text: (UTF-8) "A quick brown fox jumped over the lazy dog."
overflow: (UTF-8) "fox jumped over the lazy dog."
text: (UTF-8) "A quick brown fox jumped über the lazy dog."
overflow: (UTF-8) "fox jumped über the lazy dog."

I think that's what you were looking for, right?

We need to do a better job of informing people that Prawn's default font selection (and not coincidentally, the PDF format's defaults) is NOT Unicode friendly, even though Prawn itself handles UTF-8 text fine given fonts that support it. I think this may involve raising a warning or error when non-compatible glyphs are found, and also probably a guide explaining this. I'll open a ticket for those issues.

@practicingruby
Member

Note about need for better documentation / warning behavior is in #779.

@practicingruby practicingruby reopened this Oct 5, 2014
@practicingruby
Member

@airblade Upon closer investigation, the plot thickens! Here's a summary of what's going wrong here:

  • Because we're using the built-in AFM fonts, Prawn takes the UTF-8 text, converts it to ASCII-8BIT (i.e. just a binary sequence of bytes), and then uses Prawn's WinAnsi lookup table to map the codepoints to their relevant glyphs. So ü would be mapped to codepoint 252, which is actually the same in both Unicode and WinAnsi.

In WinAnsi, the byte value and codepoint are the same (252), but in UTF-8, they are not:

>> "ü".codepoints
=> [252]
>> "ü".bytes
=> [195, 188]

So when we attempt to convert this text back into UTF-8 in various places throughout the text call chain, we're losing information and attempting to treat WinAnsi byte values as if they are equivalent to UTF-8 byte values: They're not!
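
The mismatch can be seen in plain Ruby. This is only an illustration, not Prawn's internals; it uses ISO-8859-1 as a stand-in for WinAnsi, on the assumption that ü sits at byte 252 in both:

```ruby
# encoding: utf-8

utf8 = "ü"
puts utf8.codepoints.to_a.inspect   # the Unicode codepoint
puts utf8.bytes.to_a.inspect        # its two-byte UTF-8 encoding

# Single-byte form, standing in for the WinAnsi representation.
single_byte = utf8.encode('ISO-8859-1')
puts single_byte.bytes.to_a.inspect

# Relabelling the single byte as UTF-8 does not convert anything; the result
# is not even a valid UTF-8 sequence.
mislabeled = single_byte.dup.force_encoding('UTF-8')
puts mislabeled.valid_encoding?

# Transcoding, by contrast, maps the byte back to the original character.
transcoded = single_byte.encode('UTF-8')
puts transcoded == utf8
```

Relabelling changes only the tag on the bytes, whereas transcoding rewrites the bytes themselves, which is the step that recovers the original text.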

This is going to take further investigation, but it seems like this gets us at least a little closer. Sorry for the long and probably fuzzy explanation above.

@airblade
Contributor Author

airblade commented Oct 6, 2014

@sandal Thank you very much for investigating this and for the explanations.

I had no idea that the PDF standard specifies default fonts which only support Win-1252. That's surprising in this day and age but I suppose it's an antique standard. I'll leave a comment on #779.

You're right, using a TTF font solves my immediate problem – so thank you.

As for the thickening plot, I shall follow with interest and try to contribute where I can.

@practicingruby
Member

@airblade: I'm working on a fix that would convert the remaining text back into UTF-8, which I think is a better behavior. But as it turns out, the existing behavior is documented with specs, and there is a way of getting things to work on released versions of Prawn.

By passing the :skip_encoding => true option to text_box, WinANSI text will be correctly rendered. This is a clunky interface, so we should think about changing it, and that's why I'm considering always returning UTF-8.

@practicingruby
Member

@airblade See #793 for a rough proof of concept of how we can improve the API. It's not guaranteed to be stable yet, but should give you an idea of where I'd like to head with things.

@airblade
Contributor Author

@sandal That looks like as simple a solution as can be.
