- lexer-strings.rb: avoid an exception on utf8 surrogate pair codepoints #1051

Open
Earlopain
Closes #855

Starting from Ruby 2.4, these escapes are a syntax error, but I don't see an easy way of representing such strings. Right now the parser actually crashes (in all versions), so I'd say this is an improvement.
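To illustrate the kind of guard this is about, here is a minimal sketch, not the actual change in lexer-strings.rb. It assumes the crash comes from a `RangeError` raised when a surrogate codepoint is converted to a UTF-8 string. The helper name `codepoint_to_utf8` is hypothetical and does not exist in the parser gem.

```ruby
# Hypothetical helper (not part of the parser gem): convert a codepoint to a
# UTF-8 string, returning nil instead of raising for codepoints that UTF-8
# cannot encode, such as the surrogates.
def codepoint_to_utf8(codepoint)
  codepoint.chr(Encoding::UTF_8)
rescue RangeError
  # Integer#chr raises RangeError for surrogates and out-of-range values.
  nil
end

# Usage: a caller can decide what to do with nil (emit a diagnostic, skip, ...).
[0x41, 0xD800, 0xE000].each do |cp|
  result = codepoint_to_utf8(cp)
  puts format("U+%04X -> %s", cp, result.nil? ? "invalid" : "ok")
end
```

Returning `nil` keeps the decision about error reporting with the caller, which seems preferable to letting the exception escape from the lexer.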

Output of executing puts "\u{D800}" on all Ruby versions:

Output

===================1.8===================
u{D800}
===================1.9===================
���
===================2.0===================
���
===================2.1===================
���
===================2.2===================
���
===================2.3===================
���
===================2.4===================
surrogate.rb:1: invalid Unicode codepoint
puts "\u{D800}"
         ^
===================2.5===================
surrogate.rb:1: invalid Unicode codepoint
puts "\u{D800}"
         ^~~~
===================2.6===================
surrogate.rb:1: invalid Unicode codepoint
puts "\u{D800}"
         ^~~~
===================2.7===================
surrogate.rb:1: invalid Unicode codepoint
puts "\u{D800}"
         ^~~~
===================3.0===================
surrogate.rb:1: invalid Unicode codepoint
puts "\u{D800}"
         ^~~~
===================3.1===================
surrogate.rb:1: invalid Unicode codepoint
puts "\u{D800}"
         ^~~~
===================3.2===================
surrogate.rb: --> surrogate.rb
invalid Unicode codepoint
> 1  puts "\u{D800}"
surrogate.rb:1: invalid Unicode codepoint (SyntaxError)
puts "\u{D800}"
             ^

===================3.3===================
surrogate.rb: 
surrogate.rb:1: invalid Unicode codepoint (SyntaxError)
puts "\u{D800}"
             ^

===================3.4===================
surrogate.rb: --> surrogate.rb

invalid Unicode escape sequence

> 1  puts "\u{D800}"

surrogate.rb:1: syntax error found (SyntaxError)
> 1 | puts "\u{D800}"
    |          ^~~~ invalid Unicode escape sequence
  2 | 

I used this script to check that integer.chr behaves the same on all Ruby versions:

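# Collect every codepoint at which the validity of Integer#chr flips.
bounds = []
valid1 = true # validity of the previous codepoint
valid2 = true # validity of the current codepoint
(0..(0x110000 - 1)).each do |num|
  begin
    valid1 = valid2
    num.chr(Encoding::UTF_8)
    valid2 = true
  rescue RangeError
    # Raised for codepoints that cannot be encoded as UTF-8.
    valid2 = false
  ensure
    # Record the codepoint whenever validity changed from the previous one.
    bounds << num if valid1 != valid2
  end
end
puts bounds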
Output

===================1.8===================
num_char.rb:7: uninitialized constant Encoding (NameError)
        from num_char.rb:4:in `each'
        from num_char.rb:4
===================1.9===================
55296
57344
===================2.0===================
55296
57344
===================2.1===================
55296
57344
===================2.2===================
55296
57344
===================2.3===================
55296
57344
===================2.4===================
55296
57344
===================2.5===================
55296
57344
===================2.6===================
55296
57344
===================2.7===================
55296
57344
===================3.0===================
55296
57344
===================3.1===================
55296
57344
===================3.2===================
55296
57344
===================3.3===================
55296
57344
===================3.4===================
55296
57344
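The two boundaries printed above are easier to read in hexadecimal: they are the first surrogate codepoint and the first codepoint after the surrogate block. The sketch below only restates that result as a range check; the names `SURROGATE_RANGE` and `surrogate?` are made up for illustration.

```ruby
# The boundaries reported by the script, shown in hexadecimal.
puts format("0x%X", 55296) # prints 0xD800
puts format("0x%X", 57344) # prints 0xE000

# Hypothetical names for illustration: every codepoint from 0xD800 up to and
# including 0xDFFF is a surrogate, so validity flips at 0xD800 and 0xE000.
SURROGATE_RANGE = (0xD800..0xDFFF).freeze

def surrogate?(codepoint)
  SURROGATE_RANGE.cover?(codepoint)
end
```

A check like `surrogate?(0xD800)` can be done before any conversion is attempted, as an alternative to rescuing the `RangeError`.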
