Rebased: Fix handling of invalid UTF-8 (and other character encoding errors).

So I've gone ahead and rebased this onto 2.10.7...

But can I ask what you're leaning towards here? If it's OK, I'm going to go ahead and re-open the issue; that way you can (a) close the issue if/when you choose to merge this, (b) close the pull if you think this will be resolved another way, or (c) close them both if this is a wontfix. It's totally fine however you choose; it's your project, after all. I just get a little antsy with a pull sitting open while new revisions get released, I guess.

Or maybe I'm just crazy? Does no-one else get a bunch of Unicode decoding errors when they try to run this over any significant amount of code?

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829.

Addresses a number of encoding errors, mostly by:
 - For non-ASCII/UTF-8, convert text to UTF-8, replacing missing characters prior to splitting into lines and/or parsing.
 - For ASCII/UTF-8, convert to UTF-16, then back, replacing invalid characters. (This is necessary because Ruby won't convert to/from the same encoding.)
 - Workaround for incorrect (or maybe just extremely obscure) encodings reported by 'charlock'.
   See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb), etc.
 - Includes the following new test cases for the above, all taken from real repositories here on GitHub:
    - [Python/shtest-encoding.py](https://raw.github.com/llvm-mirror/llvm/master/utils/lit/tests/shtest-encoding.py) (invalid UTF-8, error in blob helper)
    - [Text/btParallelConstraintSolver.h](https://raw.github.com/kripken/emscripten/master/tests/bullet/src/BulletMultiThreaded/btParallelConstraintSolver.h) (invalid UTF-8, error in tokenizer)
    - [JavaScript/lang-vb.js](https://raw.github.com/nodesocket/commando/master/js/code-pretty/lang-vb.js) (no equivalent character in UTF-8 from Windows-1252)
    - [JavaScript/xor-sanity.js](https://raw.github.com/mozilla-servo/mozjs/master/js/src/jit-test/tests/jaeger/xor-sanity.js) (bad encoding reported: IBM424_rtl)
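
The UTF-8 round trip described in the list above can be sketched roughly as follows. This is a minimal illustration rather than the code in this pull: `scrub_utf8` is a hypothetical helper name, and it assumes Ruby 1.9 or later, where strings carry an encoding.

```ruby
# Hypothetical helper (not part of Linguist): replace invalid UTF-8 bytes
# by converting to UTF-16BE and back, since the commit notes that a
# UTF-8 to UTF-8 conversion is treated as a no-op.
def scrub_utf8(bytes)
  text = bytes.dup.force_encoding('UTF-8')
  text.encode('UTF-16BE', :invalid => :replace, :undefined => :replace)
      .encode('UTF-8')
end

bad   = "caf\xC3 ok".b        # a lone 0xC3 lead byte is not valid UTF-8
clean = scrub_utf8(bad)

puts clean.valid_encoding?    # prints true
puts clean.inspect            # the bad byte shows up as U+FFFD
```

On Ruby 2.1 and later, `String#scrub` covers the same need directly; the round trip is the option that also works on 1.9.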
Geoff committed Dec 23, 2013
1 parent 3ece15b commit fe9eaca
Showing 8 changed files with 373 additions and 29 deletions.
39 changes: 37 additions & 2 deletions lib/linguist/blob_helper.rb
@@ -137,6 +137,11 @@ def binary?
 elsif encoding.nil?
   true

+# If Charlock returns an ultra-rare encoding which cannot be converted
+# to UTF-8. Probably a false positive, and unrenderable otherwise anyway.
+elsif ''.respond_to?(:encode!) and not Encoding.name_list.include?(encoding)
+  true
+
 # If Charlock says its binary
 else
   detect_encoding[:type] == :binary
@@ -233,6 +238,36 @@ def vendored?
   name =~ VendoredRegexp ? true : false
 end

+# Internal: Explicitly remove invalid UTF-8 sequences by conversion.
+#
+# Avoid throwing an error on invalid byte sequences in UTF-8.
+# Unfortunately, converting to and from the same encoding is a no-op,
+# so if the data is already UTF-8, convert to UTF-16, then back.
+#
+# Only affects Ruby 1.9+ since 1.8 is charset naive.
+#
+# Returns the data blob with invalid characters replaced with \uFFFD if needed.
+def _safe_data
+  if viewable? && data
+    if ''.respond_to?(:encode!) and not encoding.nil?
+      if encoding == 'UTF-8'
+        safe_utf16 = Encoding::Converter.new('UTF-8', 'UTF-16BE',
+          :invalid => :replace, :undefined => :replace)
+        convert_encoding = 'UTF-16BE'
+        convert_data = safe_utf16.convert(data)
+      else
+        convert_encoding = encoding
+        convert_data = data
+      end
+      safe_utf8 = Encoding::Converter.new(convert_encoding, 'UTF-8',
+        :invalid => :replace, :undefined => :replace)
+      safe_utf8.convert(convert_data)
+    else
+      data
+    end
+  end
+end
+
 # Public: Get each line of data
 #
 # Requires Blob#data
@@ -241,7 +276,7 @@ def vendored?
 def lines
   @lines ||=
     if viewable? && data
-      data.split(/\r\n|\r|\n/, -1)
+      _safe_data.split(/\r\n|\r|\n/, -1)
     else
       []
     end
@@ -274,7 +309,7 @@ def sloc
 #
 # Return true or false
 def generated?
-  @_generated ||= Generated.generated?(name, lambda { data })
+  @_generated ||= Generated.generated?(name, lambda { _safe_data })
 end

 # Public: Detects the Language of the blob.
61 changes: 37 additions & 24 deletions lib/linguist/samples.json
@@ -511,8 +511,8 @@
 ".gemrc"
 ]
 },
-"tokens_total": 436395,
-"languages_total": 507,
+"tokens_total": 436487,
+"languages_total": 510,
 "tokens": {
 "ABAP": {
 "*/**": 1,
@@ -18967,10 +18967,10 @@
 },
 "JavaScript": {
 "function": 1210,
-"(": 8513,
-")": 8521,
+"(": 8518,
+")": 8528,
 "{": 2736,
-";": 4052,
+";": 4054,
 "//": 410,
 "jshint": 1,
 "_": 9,
@@ -18990,9 +18990,9 @@
 "constructor": 8,
 "toggle": 10,
 "return": 944,
-"[": 1459,
+"[": 1473,
 "this.isShown": 3,
-"]": 1456,
+"]": 1470,
 "show": 10,
 "that": 33,
 "e": 663,
@@ -19020,7 +19020,7 @@
 "hide": 8,
 "body": 22,
 "modal": 4,
-"-": 705,
+"-": 707,
 "open": 2,
 "fade": 4,
 "hidden": 12,
@@ -19067,7 +19067,7 @@
 "Animal.prototype.move": 2,
 "meters": 4,
 "alert": 11,
-"+": 1135,
+"+": 1137,
 "Snake.__super__.constructor.apply": 2,
 "arguments": 83,
 "Snake.prototype.move": 2,
@@ -19129,7 +19129,7 @@
 "info.versionMinor": 2,
 "parser.incoming.httpVersion": 1,
 "parser.incoming.url": 1,
-"n": 874,
+"n": 875,
 "headers.length": 2,
 "parser.maxHeaderPairs": 4,
 "Math.min": 5,
@@ -19213,7 +19213,7 @@
 "this.socket": 10,
 "this.connection": 8,
 "this.httpVersion": 1,
-"null": 427,
+"null": 429,
 "this.complete": 2,
 "this.headers": 2,
 "this.trailers": 2,
@@ -19300,7 +19300,7 @@
 "this.connection.writable": 3,
 "this.output.length": 5,
 "this._buffer": 2,
-"c": 775,
+"c": 776,
 "this.output.shift": 2,
 "this.outputEncodings.shift": 2,
 "this.connection.write": 4,
@@ -19713,7 +19713,7 @@
 ".type": 2,
 "c.event.handle.apply": 1,
 "oa": 1,
-"r": 261,
+"r": 262,
 "c.data": 12,
 "a.liveFired": 4,
 "i.live": 1,
@@ -19747,7 +19747,7 @@
 "j.handleObj.origHandler.apply": 1,
 "pa": 1,
 "b.replace": 3,
-"/": 290,
+"/": 297,
 "./g": 2,
 ".replace": 38,
 "/g": 37,
@@ -19791,7 +19791,7 @@
 "T": 4,
 "Ta": 1,
 "<[\\w\\W]+>": 4,
-"|": 206,
+"|": 212,
 "#": 13,
 "Ua": 1,
 ".": 91,
@@ -20071,7 +20071,7 @@
 "this.queue": 4,
 "clearQueue": 2,
 "Aa": 3,
-"t": 436,
+"t": 437,
 "ca": 6,
 "Za": 2,
 "r/g": 2,
@@ -20081,7 +20081,7 @@
 "ab": 1,
 "button": 24,
 "input": 25,
-"/i": 22,
+"/i": 23,
 "bb": 2,
 "select": 20,
 "textarea": 8,
@@ -22786,7 +22786,7 @@
 "u17b5": 1,
 "u200c": 1,
 "u200f": 1,
-"u2028": 3,
+"u2028": 5,
 "u202f": 1,
 "u2060": 1,
 "u206f": 1,
@@ -22986,6 +22986,19 @@
 "lt": 55,
 "#x27": 1,
 "#x2F": 1,
+"PR.registerLangHandler": 1,
+"PR.createSimpleLexer": 1,
+"xa0": 2,
+"u2029": 4,
+"u201c": 5,
+"u201d": 5,
+"kwd": 1,
+"com": 1,
+"lit": 1,
+"pln": 1,
+"pun": 1,
+"u2018": 1,
+"u2019": 1,
 "window.Modernizr": 1,
 "Modernizr": 12,
 "enableClasses": 3,
@@ -23278,7 +23291,6 @@
 "result0.push": 1,
 "parse_singleLineComment": 2,
 "parse_multiLineComment": 2,
-"u2029": 2,
 "x0B": 1,
 "uFEFF": 1,
 "u1680": 1,
@@ -24976,7 +24988,8 @@
 "exports.OPERATORS": 1,
 "exports.is_alphanumeric_char": 1,
 "exports.set_logger": 1,
-"logger": 2
+"logger": 2,
+"assertEq": 1
 },
 "JSON": {
 "{": 73,
@@ -46374,7 +46387,7 @@
 "Ioke": 2,
 "Jade": 3,
 "Java": 8987,
-"JavaScript": 76934,
+"JavaScript": 77026,
 "JSON": 183,
 "JSON5": 57,
 "Julia": 247,
@@ -46510,7 +46523,7 @@
 "Ioke": 1,
 "Jade": 1,
 "Java": 6,
-"JavaScript": 20,
+"JavaScript": 22,
 "JSON": 4,
 "JSON5": 2,
 "Julia": 1,
@@ -46558,7 +46571,7 @@
 "Processing": 1,
 "Prolog": 6,
 "Protocol Buffer": 1,
-"Python": 7,
+"Python": 8,
 "R": 2,
 "Racket": 2,
 "Ragel in Ruby Host": 3,
@@ -46600,5 +46613,5 @@
 "Xtend": 2,
 "YAML": 1
 },
-"md5": "7ab5683c610f7e81d6ea5fb470111bbe"
+"md5": "12d1d4ad42ef152a4a4a69da5aa9ddd0"
 }
8 changes: 6 additions & 2 deletions lib/linguist/samples.rb
@@ -101,7 +101,10 @@ def self.data
     db['filenames'][language_name].sort!
   end

-  data = File.read(sample[:path])
+  # Avoid throwing an error on invalid byte sequences. Encoding to and from the same
+  # charset is a no-op, so read in as UTF-16, then to convert to UTF-8. Not an issue in Ruby 1.8.
+  data = ''.respond_to?(:encode!) ? File.read(sample[:path]).encode('UTF-16BE', :invalid => :replace,
+    :undefined => :replace).encode('UTF-8') : File.read(sample[:path])
   Classifier.train!(db, language_name, data)
 end

@@ -114,7 +117,8 @@ def self.data
 # Used to retrieve the interpreter from the shebang line of a file's
 # data.
 def self.interpreter_from_shebang(data)
-  lines = ''.respond_to?(:encode!) ? data.encode('UTF-16BE', :invalid => :replace,
+  lines = data.lines.to_a
+  lines = ''.respond_to?(:encode!) ? data.encode('UTF-16BE', :invalid => :replace,
+    :undefined => :replace).encode('UTF-8').lines.to_a : data.lines.to_a

   if lines.any? && (match = lines[0].match(/(.+)\n?/)) && (bang = match[0]) =~ /^#!/
     bang.sub!(/^#! /, '#!')
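
To see why the shebang lookup benefits from sanitising first: regular-expression matching against a string that contains invalid UTF-8 bytes typically raises `ArgumentError` on Ruby 1.9+, so the first line has to be cleaned before it is matched. The sketch below is a simplified stand-in, not Linguist's `interpreter_from_shebang`: `first_line_interpreter` is a hypothetical name, and it ignores details such as `env` wrappers.

```ruby
# Hypothetical, simplified stand-in for a shebang lookup (not Linguist's
# implementation): sanitise via the UTF-16BE round trip, then match.
def first_line_interpreter(data)
  text = data.dup.force_encoding('UTF-8')
  text = text.encode('UTF-16BE', :invalid => :replace, :undefined => :replace)
             .encode('UTF-8')
  first = text.lines.first
  match = first && first.match(/\A#!\s*(\S+)/)
  match && File.basename(match[1])
end

# The stray 0xE9 byte on the shebang line is invalid UTF-8.
script = "#!/usr/bin/python \xE9\nprint 1\n".b
puts first_line_interpreter(script).inspect
```

Files without a shebang, and empty files, simply yield `nil` in this sketch.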
3 changes: 2 additions & 1 deletion lib/linguist/tokenizer.rb
@@ -55,7 +55,8 @@ def self.tokenize(data)
 #
 # Returns Array of token Strings.
 def extract_tokens(data)
-  s = StringScanner.new(data)
+  s = ''.respond_to?(:encode!) ? StringScanner.new(data.encode('UTF-16BE',
+    :invalid => :replace,:undefined => :replace).encode('UTF-8')) : StringScanner.new(data)

   tokens = []
   until s.eos?
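
The tokenizer change follows the same pattern: sanitise the input, then hand it to `StringScanner`. The toy scanner below illustrates that order of operations only; `sanitize` and `identifiers` are hypothetical names, and the token rule (ASCII identifiers) is far simpler than what Linguist's tokenizer actually does.

```ruby
require 'strscan'

# Hypothetical helper: UTF-16BE round trip to replace invalid bytes.
def sanitize(data)
  data.dup.force_encoding('UTF-8')
      .encode('UTF-16BE', :invalid => :replace, :undefined => :replace)
      .encode('UTF-8')
end

# Toy tokenizer: collect identifier-like tokens, skip everything else
# one character at a time.
def identifiers(data)
  s = StringScanner.new(sanitize(data))
  tokens = []
  until s.eos?
    if (tok = s.scan(/[A-Za-z_]\w*/))
      tokens << tok
    else
      s.getch
    end
  end
  tokens
end

puts identifiers("int x\xFF = y;".b).inspect
```

Because the invalid byte becomes a single U+FFFD character, `getch` steps over it like any other non-identifier character.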
2 changes: 2 additions & 0 deletions samples/JavaScript/lang-vb.js

(Diff not rendered: GitHub does not render some generated files by default.)

1 change: 1 addition & 0 deletions samples/JavaScript/xor-sanity.js
@@ -0,0 +1 @@
+assertEq(-2^31, -31);
3 changes: 3 additions & 0 deletions samples/Python/shtest-encoding.py
@@ -0,0 +1,3 @@
+# RUN: true
+
+# Here is a string that cannot be decoded in line mode: Â.
0 comments on commit fe9eaca