Rebased: Fix handling of invalid UTF-8 (and other character encoding …

…errors). So I've gone ahead and rebased this onto 2.10.7, but can I ask which way you're leaning here? If it's OK, I'm going to go ahead and re-open the issue; that way you can (a) close the issue if/when you choose to merge this, (b) close the pull if you think this will be resolved another way, or (c) close them both if this is a wontfix. It's totally fine however you choose; it's your project, after all. I just get a little antsy with a pull sitting open while new revisions get released, I guess. Or maybe I'm just crazy? Does no one else get a bunch of Unicode decoding errors when they try to run this over any significant amount of code?

This pull request is a proposal to fix github-linguist#830, and closes github-linguist#829.

Addresses a number of encoding errors, mostly by:

- For non-ASCII/UTF-8, convert text to UTF-8, replacing missing characters prior to splitting into lines and/or parsing.
- For ASCII/UTF-8, convert to UTF-16, then back, replacing invalid characters. (This is necessary because Ruby won't convert to/from the same encoding.)
- Workaround for incorrect (or maybe just extremely obscure) encodings reported by 'charlock'. See changes in [blob_helper.rb](https://github.com/pullreq/linguist/blob/master/lib/linguist/blob_helper.rb), etc.
- Includes the following new test cases for the above, all taken from real repositories here on GitHub:
  - [Python/shtest-encoding.py](https://raw.github.com/llvm-mirror/llvm/master/utils/lit/tests/shtest-encoding.py) (invalid UTF-8, error in blob helper)
  - [Text/btParallelConstraintSolver.h](https://raw.github.com/kripken/emscripten/master/tests/bullet/src/BulletMultiThreaded/btParallelConstraintSolver.h) (invalid UTF-8, error in tokenizer)
  - [JavaScript/lang-vb.js](https://raw.github.com/nodesocket/commando/master/js/code-pretty/lang-vb.js) (no equivalent character in UTF-8 from Windows-1252)
  - [JavaScript/xor-sanity.js](https://raw.github.com/mozilla-servo/mozjs/master/js/src/jit-test/tests/jaeger/xor-sanity.js) (bad encoding reported: IBM424_rtl)
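
The UTF-16 round trip described above can be sketched in isolation roughly as follows. This is a minimal illustration under stated assumptions, not the code in this pull: `scrub_utf8` is a hypothetical helper name, and the sample bytes are made up for the example.

```ruby
# Hypothetical helper (not part of linguist): replace invalid UTF-8 byte
# sequences by transcoding to UTF-16BE and back to UTF-8.
def scrub_utf8(bytes)
  text = bytes.dup.force_encoding('UTF-8')
  text.encode('UTF-16BE', :invalid => :replace, :undefined => :replace)
      .encode('UTF-8')
end

# Made-up input: "\xC2" followed by a space is a truncated two-byte sequence.
raw = "caf\xC3\xA9 \xC2 end".dup.force_encoding('BINARY')

clean = scrub_utf8(raw)
puts raw.dup.force_encoding('UTF-8').valid_encoding?  # the input is not valid UTF-8
puts clean.valid_encoding?                            # the result is
puts clean.include?("\uFFFD")                         # the bad byte became U+FFFD
```

Because the intermediate encoding is a Unicode form, Ruby substitutes U+FFFD for each invalid sequence, and the valid characters around it pass through unchanged.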
pullreq · Dec 23, 2013 · fe9eaca · fe9eaca
1 parent 3ece15b
commit fe9eaca
Showing 8 changed files with 373 additions and 29 deletions.
diff --git a/lib/linguist/blob_helper.rb b/lib/linguist/blob_helper.rb
@@ -137,6 +137,11 @@ def binary?
       elsif encoding.nil?
         true
 
+      # If Charlock returns an ultra-rare encoding which cannot be converted
+      # to UTF-8. Probably a false positive, and unrenderable otherwise anyway.
+      elsif ''.respond_to?(:encode!) and not Encoding.name_list.include?(encoding)
+        true
+
       # If Charlock says its binary
       else
         detect_encoding[:type] == :binary
@@ -233,6 +238,36 @@ def vendored?
       name =~ VendoredRegexp ? true : false
     end
 
+    # Internal: Explicitly remove invalid UTF-8 sequences by conversion.
+    #
+    # Avoid throwing an error on invalid byte sequences in UTF-8.
+    # Unfortunately, converting to and from the same encoding is a no-op,
+    # so if the data is already UTF-8, convert to UTF-16, then back.
+    #
+    # Only affects Ruby 1.9+ since 1.8 is charset naive.
+    # 
+    # Returns the data blob with invalid characters replaced with \uFFFD if needed.
+    def _safe_data
+      if viewable? && data
+        if ''.respond_to?(:encode!) and not encoding.nil?
+          if encoding == 'UTF-8'
+            safe_utf16 = Encoding::Converter.new('UTF-8', 'UTF-16BE',
+                        :invalid => :replace, :undefined => :replace)
+            convert_encoding  = 'UTF-16BE'
+            convert_data      = safe_utf16.convert(data)
+          else
+            convert_encoding = encoding
+            convert_data     = data
+          end
+          safe_utf8 = Encoding::Converter.new(convert_encoding, 'UTF-8',
+                           :invalid => :replace, :undefined => :replace)
+          safe_utf8.convert(convert_data)
+        else
+          data
+        end
+      end     
+    end  
+
     # Public: Get each line of data
     #
     # Requires Blob#data
@@ -241,7 +276,7 @@ def vendored?
     def lines
       @lines ||=
         if viewable? && data
-          data.split(/\r\n|\r|\n/, -1)
+          _safe_data.split(/\r\n|\r|\n/, -1)
         else
           []
         end
@@ -274,7 +309,7 @@ def sloc
     #
     # Return true or false
     def generated?
-      @_generated ||= Generated.generated?(name, lambda { data })
+      @_generated ||= Generated.generated?(name, lambda { _safe_data })
     end
 
     # Public: Detects the Language of the blob.

diff --git a/lib/linguist/samples.json b/lib/linguist/samples.json
@@ -511,8 +511,8 @@
       ".gemrc"
     ]
   },
-  "tokens_total": 436395,
-  "languages_total": 507,
+  "tokens_total": 436487,
+  "languages_total": 510,
   "tokens": {
     "ABAP": {
       "*/**": 1,
@@ -18967,10 +18967,10 @@
     },
     "JavaScript": {
       "function": 1210,
-      "(": 8513,
-      ")": 8521,
+      "(": 8518,
+      ")": 8528,
       "{": 2736,
-      ";": 4052,
+      ";": 4054,
       "//": 410,
       "jshint": 1,
       "_": 9,
@@ -18990,9 +18990,9 @@
       "constructor": 8,
       "toggle": 10,
       "return": 944,
-      "[": 1459,
+      "[": 1473,
       "this.isShown": 3,
-      "]": 1456,
+      "]": 1470,
       "show": 10,
       "that": 33,
       "e": 663,
@@ -19020,7 +19020,7 @@
       "hide": 8,
       "body": 22,
       "modal": 4,
-      "-": 705,
+      "-": 707,
       "open": 2,
       "fade": 4,
       "hidden": 12,
@@ -19067,7 +19067,7 @@
       "Animal.prototype.move": 2,
       "meters": 4,
       "alert": 11,
-      "+": 1135,
+      "+": 1137,
       "Snake.__super__.constructor.apply": 2,
       "arguments": 83,
       "Snake.prototype.move": 2,
@@ -19129,7 +19129,7 @@
       "info.versionMinor": 2,
       "parser.incoming.httpVersion": 1,
       "parser.incoming.url": 1,
-      "n": 874,
+      "n": 875,
       "headers.length": 2,
       "parser.maxHeaderPairs": 4,
       "Math.min": 5,
@@ -19213,7 +19213,7 @@
       "this.socket": 10,
       "this.connection": 8,
       "this.httpVersion": 1,
-      "null": 427,
+      "null": 429,
       "this.complete": 2,
       "this.headers": 2,
       "this.trailers": 2,
@@ -19300,7 +19300,7 @@
       "this.connection.writable": 3,
       "this.output.length": 5,
       "this._buffer": 2,
-      "c": 775,
+      "c": 776,
       "this.output.shift": 2,
       "this.outputEncodings.shift": 2,
       "this.connection.write": 4,
@@ -19713,7 +19713,7 @@
       ".type": 2,
       "c.event.handle.apply": 1,
       "oa": 1,
-      "r": 261,
+      "r": 262,
       "c.data": 12,
       "a.liveFired": 4,
       "i.live": 1,
@@ -19747,7 +19747,7 @@
       "j.handleObj.origHandler.apply": 1,
       "pa": 1,
       "b.replace": 3,
-      "/": 290,
+      "/": 297,
       "./g": 2,
       ".replace": 38,
       "/g": 37,
@@ -19791,7 +19791,7 @@
       "T": 4,
       "Ta": 1,
       "<[\\w\\W]+>": 4,
-      "|": 206,
+      "|": 212,
       "#": 13,
       "Ua": 1,
       ".": 91,
@@ -20071,7 +20071,7 @@
       "this.queue": 4,
       "clearQueue": 2,
       "Aa": 3,
-      "t": 436,
+      "t": 437,
       "ca": 6,
       "Za": 2,
       "r/g": 2,
@@ -20081,7 +20081,7 @@
       "ab": 1,
       "button": 24,
       "input": 25,
-      "/i": 22,
+      "/i": 23,
       "bb": 2,
       "select": 20,
       "textarea": 8,
@@ -22786,7 +22786,7 @@
       "u17b5": 1,
       "u200c": 1,
       "u200f": 1,
-      "u2028": 3,
+      "u2028": 5,
       "u202f": 1,
       "u2060": 1,
       "u206f": 1,
@@ -22986,6 +22986,19 @@
       "lt": 55,
       "#x27": 1,
       "#x2F": 1,
+      "PR.registerLangHandler": 1,
+      "PR.createSimpleLexer": 1,
+      "xa0": 2,
+      "u2029": 4,
+      "u201c": 5,
+      "u201d": 5,
+      "kwd": 1,
+      "com": 1,
+      "lit": 1,
+      "pln": 1,
+      "pun": 1,
+      "u2018": 1,
+      "u2019": 1,
       "window.Modernizr": 1,
       "Modernizr": 12,
       "enableClasses": 3,
@@ -23278,7 +23291,6 @@
       "result0.push": 1,
       "parse_singleLineComment": 2,
       "parse_multiLineComment": 2,
-      "u2029": 2,
       "x0B": 1,
       "uFEFF": 1,
       "u1680": 1,
@@ -24976,7 +24988,8 @@
       "exports.OPERATORS": 1,
       "exports.is_alphanumeric_char": 1,
       "exports.set_logger": 1,
-      "logger": 2
+      "logger": 2,
+      "assertEq": 1
     },
     "JSON": {
       "{": 73,
@@ -46374,7 +46387,7 @@
     "Ioke": 2,
     "Jade": 3,
     "Java": 8987,
-    "JavaScript": 76934,
+    "JavaScript": 77026,
     "JSON": 183,
     "JSON5": 57,
     "Julia": 247,
@@ -46510,7 +46523,7 @@
     "Ioke": 1,
     "Jade": 1,
     "Java": 6,
-    "JavaScript": 20,
+    "JavaScript": 22,
     "JSON": 4,
     "JSON5": 2,
     "Julia": 1,
@@ -46558,7 +46571,7 @@
     "Processing": 1,
     "Prolog": 6,
     "Protocol Buffer": 1,
-    "Python": 7,
+    "Python": 8,
     "R": 2,
     "Racket": 2,
     "Ragel in Ruby Host": 3,
@@ -46600,5 +46613,5 @@
     "Xtend": 2,
     "YAML": 1
   },
-  "md5": "7ab5683c610f7e81d6ea5fb470111bbe"
+  "md5": "12d1d4ad42ef152a4a4a69da5aa9ddd0"
 }
diff --git a/lib/linguist/samples.rb b/lib/linguist/samples.rb
@@ -101,7 +101,10 @@ def self.data
           db['filenames'][language_name].sort!
         end
 
-        data = File.read(sample[:path])
+        # Avoid throwing an error on invalid byte sequences. Encoding to and from the same
+        # charset is a no-op, so convert to UTF-16, then back to UTF-8. Not an issue in Ruby 1.8.
+        data = ''.respond_to?(:encode!) ? File.read(sample[:path]).encode('UTF-16BE', :invalid => :replace,
+                  :undefined => :replace).encode('UTF-8') : File.read(sample[:path])
         Classifier.train!(db, language_name, data)
       end
 
@@ -114,7 +117,8 @@ def self.data
   # Used to retrieve the interpreter from the shebang line of a file's
   # data.
   def self.interpreter_from_shebang(data)
-    lines = data.lines.to_a
+    lines = ''.respond_to?(:encode!) ? data.encode('UTF-16BE', :invalid => :replace,
+               :undefined => :replace).encode('UTF-8').lines.to_a : data.lines.to_a
 
     if lines.any? && (match = lines[0].match(/(.+)\n?/)) && (bang = match[0]) =~ /^#!/
       bang.sub!(/^#! /, '#!')

diff --git a/lib/linguist/tokenizer.rb b/lib/linguist/tokenizer.rb
@@ -55,7 +55,8 @@ def self.tokenize(data)
     #
     # Returns Array of token Strings.
     def extract_tokens(data)
-      s = StringScanner.new(data)
+      s = ''.respond_to?(:encode!) ? StringScanner.new(data.encode('UTF-16BE',
+          :invalid => :replace,:undefined => :replace).encode('UTF-8')) : StringScanner.new(data)
 
       tokens = []
       until s.eos?

diff --git a/samples/JavaScript/lang-vb.js b/samples/JavaScript/lang-vb.js
diff --git a/samples/JavaScript/xor-sanity.js b/samples/JavaScript/xor-sanity.js
@@ -0,0 +1 @@
+assertEq(-2^31, -31);
diff --git a/samples/Python/shtest-encoding.py b/samples/Python/shtest-encoding.py
@@ -0,0 +1,3 @@
+# RUN: true
+
+# Here is a string that cannot be decoded in line mode: Â.
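
The `binary?` guard in blob_helper.rb depends on whether Ruby recognizes the encoding name that charlock reports. The sketch below shows one way to check that outside linguist. It deliberately swaps in `Encoding.find`, which looks names up case-insensitively, in place of the exact-match `Encoding.name_list.include?` used in the diff; `known_encoding?` is a hypothetical helper name, not something this commit adds.

```ruby
# Hypothetical helper (not part of linguist): true if Ruby can look up an
# encoding by this name. Encoding.find ignores case and raises
# ArgumentError for names it does not know.
def known_encoding?(name)
  Encoding.find(name)
  true
rescue ArgumentError
  false
end

puts known_encoding?('UTF-8')
puts known_encoding?('windows-1252')
puts known_encoding?('IBM424_rtl')
```

Since `include?` compares strings exactly, a reported name that differs from Ruby's spelling only in case would fail the check in the diff but pass this one; whether that matters depends on the names charlock actually returns.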