Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML5 parsing and error checking #362

Merged
merged 11 commits into from
Dec 13, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion Gemfile
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
# frozen_string_literal: true

source 'http://rubygems.org'
source 'https://rubygems.org'
jeremy marked this conversation as resolved.
Show resolved Hide resolved

gemspec
16 changes: 9 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,7 @@ Below is mostly comprehensive list of checks that HTMLProofer can perform.

### HTML

* Whether your HTML markup is valid. This is done via [Nokogiri](http://www.nokogiri.org/tutorials/ensuring_well_formed_markup.html) to ensure well-formed markup.
* Whether your HTML markup is valid. This is done via [Nokogumbo](https://github.com/rubys/nokogumbo) to validate well-formed HTML5 markup.

## Usage

Expand Down Expand Up @@ -222,7 +222,7 @@ htmlproofer --assume-extension ./_site

### Using through Docker

If you have trouble with (or don't want to) install Ruby/Nokogiri, the command-line tool can be run through Docker. See [html-proofer-docker](https://github.com/18F/html-proofer-docker) for more information.
If you have trouble with (or don't want to) install Ruby/Nokogumbo, the command-line tool can be run through Docker. See [html-proofer-docker](https://github.com/18F/html-proofer-docker) for more information.

### Using as Rack middleware

Expand Down Expand Up @@ -282,9 +282,10 @@ The `HTMLProofer` constructor takes an optional hash of additional options:
| `check_external_hash` | Checks whether external hashes exist (even if the webpage exists). This slows the checker down. | `false` |
| `check_favicon` | Enables the favicon checker. | `false` |
| `check_opengraph` | Enables the Open Graph checker. | `false` |
| `check_html` | Enables HTML validation errors from Nokogiri | `false` |
| `check_html` | Enables HTML validation errors from Nokogumbo | `false` |
| `check_img_http` | Fails an image if it's marked as `http` | `false` |
|`checks_to_ignore`| An array of Strings indicating which checks you'd like to not perform. | `[]`
| `check_sri` | Check that `<link>` and `<script>` external resources use SRI |false |
| `checks_to_ignore`| An array of Strings indicating which checks you do not want to run | `[]`
| `directory_index_file` | Sets the file to look for when a link refers to a directory. | `index.html` |
| `disable_external` | If `true`, does not run the external link checker, which can take a lot of time. | `false` |
| `empty_alt_ignore` | If `true`, ignores images with empty alt tags. | `false` |
Expand Down Expand Up @@ -314,15 +315,16 @@ See below for more information.

### Configuring HTML validation rules

If `check_html` is `true`, Nokogiri performs additional validation on your HTML.
If `check_html` is `true`, Nokogumbo performs additional validation on your HTML.

You can pass in additional options to configure this validation.

| Option | Description | Default |
| :----- | :---------- | :------ |
| `report_invalid_tags` | When `check_html` is enabled, HTML markup that is unknown to Nokogiri are reported as errors. | `false`
| `report_invalid_tags` | When `check_html` is enabled, HTML markup that is unknown to Nokogumbo are reported as errors. | `false`
| `report_missing_doctype` | When `check_html` is enabled, HTML markup with missing or out-of-order `DOCTYPE` are reported as errors. | `false`
| `report_missing_names` | When `check_html` is enabled, HTML markup that are missing entity names are reported as errors. | `false`
| `report_script_embeds` | When `check_html` is enabled, `script` tags containing markup [are reported as errors](http://git.io/vOovv). Enabling this option ignores those errors. | `false`
| `report_script_embeds` | When `check_html` is enabled, `script` tags containing markup are reported as errors. | `false`

For example:

Expand Down
14 changes: 8 additions & 6 deletions bin/htmlproofer
Original file line number Diff line number Diff line change
Expand Up @@ -20,13 +20,13 @@ Mercenary.program(:htmlproofer) do |p|
p.option 'as_links', '--as-links', 'Assumes that `PATH` is a comma-separated array of links to check.'
p.option 'alt_ignore', '--alt-ignore image1,[image2,...]', Array, 'A comma-separated list of Strings or RegExps containing `img`s whose missing `alt` tags are safe to ignore'
p.option 'assume_extension', '--assume-extension', 'Automatically add extension (e.g. `.html`) to file paths, to allow extensionless URLs (as supported by Jekyll 3 and GitHub Pages) (default: `false`).'
p.option 'checks_to_ignore', '--checks-to-ignore check1,[check2,...]', Array, ' An array of Strings indicating which checks you\'d like to not perform.'
p.option 'checks_to_ignore', '--checks-to-ignore check1,[check2,...]', Array, 'A comma-separated list of Strings indicating which checks you do not want to run (default: `[]`)'
p.option 'check_external_hash', '--check-external-hash', 'Checks whether external hashes exist (even if the webpage exists). This slows the checker down (default: `false`).'
p.option 'check_favicon', '--check-favicon', 'Enables the favicon checker (default: `false`).'
p.option 'check_html', '--check-html', 'Enables HTML validation errors from Nokogiri (default: `false`).'
p.option 'check_html', '--check-html', 'Enables HTML validation errors from Nokogumbo (default: `false`).'
p.option 'check_img_http', '--check-img-http', 'Fails an image if it\'s marked as `http` (default: `false`).'
p.option 'check_opengraph', '--check-opengraph', 'Enables the Open Graph checker (default: `false`).'
p.option 'check_sri', '--check-sri', 'Check that `<link>` and `<script>` external resources do use SRI (default: `false`).'
p.option 'check_sri', '--check-sri', 'Check that `<link>` and `<script>` external resources use SRI (default: `false`).'
p.option 'directory_index_file', '--directory-index-file <filename>', String, 'Sets the file to look for when a link refers to a directory. (default: `index.html`)'
p.option 'disable_external', '--disable-external', 'If `true`, does not run the external link checker, which can take a lot of time (default: `false`)'
p.option 'empty_alt_ignore', '--empty-alt-ignore', 'If `true`, ignores images with empty alt tags'
Expand All @@ -37,9 +37,10 @@ Mercenary.program(:htmlproofer) do |p|
p.option 'file_ignore', '--file-ignore file1,[file2,...]', Array, 'A comma-separated list of Strings or RegExps containing file paths that are safe to ignore'
p.option 'http_status_ignore', '--http-status-ignore 123,[xxx, ...]', Array, 'A comma-separated list of numbers representing status codes to ignore.'
p.option 'internal_domains', '--internal-domains domain1,[domain2,...]', Array, 'A comma-separated list of Strings containing domains that will be treated as internal urls.'
p.option 'report_invalid_tags', '--report-invalid-tags', 'Ignore `check_html` errors associated with unknown markup (default: `false`)'
p.option 'report_missing_names', '--report-missing-names', 'Ignore `check_html` errors associated with missing entities (default: `false`)'
p.option 'report_script_embeds', '--report-script-embeds', 'Ignore `check_html` errors associated with `script`s (default: `false`)'
p.option 'report_invalid_tags', '--report-invalid-tags', 'When `check_html` is enabled, HTML markup that is unknown to Nokogumbo are reported as errors (default: `false`)'
p.option 'report_missing_names', '--report-missing-names', 'When `check_html` is enabled, HTML markup that are missing entity names are reported as errors (default: `false`)'
p.option 'report_script_embeds', '--report-script-embeds', 'When `check_html` is enabled, `script` tags containing markup are reported as errors (default: `false`)'
p.option 'report_missing_doctype', '--report-missing-doctype', 'When `check_html` is enabled, HTML markup with missing or out-of-order `DOCTYPE` are reported as errors (default: `false`)'
p.option 'log_level', '--log-level <level>', String, 'Sets the logging level, as determined by Yell. One of `:debug`, `:info`, `:warn`, `:error`, or `:fatal`. (default: `:info`)'
p.option 'only_4xx', '--only-4xx', 'Only reports errors for links that fall within the 4xx status code range'
p.option 'storage_dir', '--storage-dir PATH', String, 'Directory where to store the cache log (default: "tmp/.htmlproofer")'
Expand Down Expand Up @@ -80,6 +81,7 @@ Mercenary.program(:htmlproofer) do |p|
options[:validation][:report_script_embeds] = opts['report_script_embeds'] unless opts['report_script_embeds'].nil?
options[:validation][:report_missing_names] = opts['report_missing_names'] unless opts['report_missing_names'].nil?
options[:validation][:report_invalid_tags] = opts['report_invalid_tags'] unless opts['report_invalid_tags'].nil?
options[:validation][:report_missing_doctype] = opts['report_missing_doctype'] unless opts['report_missing_doctype'].nil?

options[:typhoeus] = HTMLProofer::Configuration.parse_json_option('typhoeus_config', opts['typhoeus_config']) unless opts['typhoeus_config'].nil?

Expand Down
2 changes: 1 addition & 1 deletion html-proofer.gemspec
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ Gem::Specification.new do |gem|

gem.add_dependency 'addressable', '~> 2.3'
gem.add_dependency 'mercenary', '~> 0.3'
gem.add_dependency 'nokogiri', '~> 1.10'
gem.add_dependency 'nokogumbo', '~> 2.0'
gem.add_dependency 'parallel', '~> 1.3'
gem.add_dependency 'rainbow', '~> 3.0'
gem.add_dependency 'typhoeus', '~> 1.3'
Expand Down
2 changes: 1 addition & 1 deletion lib/html-proofer.rb
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ def require_all(path)
begin
require 'awesome_print'
require 'pry-byebug'
rescue LoadError; end # rubocop:disable Lint/HandleExceptions
rescue LoadError; end # rubocop:disable Lint/SuppressedException
module HTMLProofer
def self.check_file(file, options = {})
raise ArgumentError unless file.is_a?(String)
Expand Down
31 changes: 17 additions & 14 deletions lib/html-proofer/check/html.rb
Original file line number Diff line number Diff line change
@@ -1,28 +1,31 @@
# frozen_string_literal: true

class HtmlCheck < ::HTMLProofer::Check
# tags embedded in scripts are used in templating languages: http://git.io/vOovv
SCRIPT_EMBEDS_MSG = /Element script embeds close tag/.freeze
INVALID_TAG_MSG = /Tag ([\w\-:]+) invalid/.freeze
INVALID_PREFIX = /Namespace prefix/.freeze
PARSE_ENTITY_REF = /htmlParseEntityRef: no name/.freeze
DOCTYPE_MSG = /The doctype must be the first token in the document/.freeze

def run
@html.errors.each do |error|
message = error.message
line = error.line

if message =~ INVALID_TAG_MSG || message =~ INVALID_PREFIX
next unless options[:validation][:report_invalid_tags]
end

if message =~ PARSE_ENTITY_REF
next unless options[:validation][:report_missing_names]
end

# tags embedded in scripts are used in templating languages: http://git.io/vOovv
next if !options[:validation][:report_script_embeds] && message =~ SCRIPT_EMBEDS_MSG
add_issue(error.message, line: error.line) if report?(error.message)
end
end

add_issue(message, line: line)
def report?(message)
case message
when SCRIPT_EMBEDS_MSG
options[:validation][:report_script_embeds]
when INVALID_TAG_MSG, INVALID_PREFIX
options[:validation][:report_invalid_tags]
when PARSE_ENTITY_REF
options[:validation][:report_missing_names]
when DOCTYPE_MSG
options[:validation][:report_missing_doctype]
else
true
jeremy marked this conversation as resolved.
Show resolved Hide resolved
end
end
end
2 changes: 1 addition & 1 deletion lib/html-proofer/check/links.rb
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ def run
if missing_href?
next if @link.allow_missing_href?
# HTML5 allows dropping the href: http://git.io/vBX0z
next if @html.internal_subset.name == 'html' && @html.internal_subset.external_id.nil?
next if @html.internal_subset.nil? || (@html.internal_subset.name == 'html' && @html.internal_subset.external_id.nil?)

add_issue('anchor has no href attribute', line: line, content: content)
next
Expand Down
3 changes: 2 additions & 1 deletion lib/html-proofer/configuration.rb
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,8 @@ module Configuration
VALIDATION_DEFAULTS = {
report_script_embeds: false,
report_missing_names: false,
report_invalid_tags: false
report_invalid_tags: false,
report_missing_doctype: false
}.freeze

CACHE_DEFAULTS = {}.freeze
Expand Down
2 changes: 1 addition & 1 deletion lib/html-proofer/middleware.rb
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,7 @@ def call(env)
'response',
Middleware.options
).check_parsed(
Nokogiri::HTML(clean_content(html)), 'response'
Nokogiri::HTML5(html, max_errors: -1), 'response'
)

raise InvalidHtmlError, parsed[:failures] unless parsed[:failures].empty?
Expand Down
2 changes: 1 addition & 1 deletion lib/html-proofer/url_validator.rb
Original file line number Diff line number Diff line change
Expand Up @@ -146,7 +146,7 @@ def response_handler(response, filenames)
href = response.request.base_url.to_s
method = response.request.options[:method]
response_code = response.code
response.body.gsub!("\x00", '')
response.body.delete!("\x00")

debug_msg = if filenames.nil?
"Received a #{response_code} for #{href}"
Expand Down
13 changes: 2 additions & 11 deletions lib/html-proofer/utils.rb
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# frozen_string_literal: true

require 'nokogiri'
require 'nokogumbo'

module HTMLProofer
module Utils
Expand All @@ -15,7 +15,7 @@ def create_nokogiri(path)
path
end

Nokogiri::HTML(clean_content(content))
Nokogiri::HTML5(content)
end

def swap(href, replacement)
Expand All @@ -24,14 +24,5 @@ def swap(href, replacement)
end
href
end

# address a problem with Nokogiri's parsing URL entities
# problem from http://git.io/vBYU1
# solution from http://git.io/vBYUi
def clean_content(string)
string.gsub(%r{(?:https?:)?//([^>]+)}i) do |url|
url.gsub(/&(?!amp;)/, '&amp;')
end
end
end
end
55 changes: 34 additions & 21 deletions spec/html-proofer/command_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -89,9 +89,9 @@
end

it 'works with check-html' do
broken = "#{FIXTURES_DIR}/html/invalid_tag.html"
broken = "#{FIXTURES_DIR}/html/unmatched_end_tag.html"
output = make_bin('--check-html --report-invalid-tags', broken)
expect(output).to match('1 failure')
expect(output).to match('HTML-Proofer finished successfully')
end

it 'works with empty-alt-ignore' do
Expand All @@ -106,25 +106,38 @@
expect(output).to match('successfully')
end

it 'has every option' do
config_keys = HTMLProofer::Configuration::PROOFER_DEFAULTS.keys
bin_file = File.read('bin/htmlproofer')
help_output = make_bin('--help')
readme = File.read('README.md')
config_keys.map(&:to_s).each do |key|
# match options
expect(bin_file).to match(key)
readme.each_line do |line|
next unless line =~ /\| `#{key}`/

description = line.split('|')[2].strip
description.gsub!('A hash', 'A comma-separated list')
description.gsub!('An array', 'A comma-separated list')
description.gsub!(/\[(.+?)\]\(.+?\)/, '\1')
description.sub!(/\.$/, '')
# match README description for option
expect(help_output).to include(description)
end
it 'has every option for proofer defaults' do
match_command_help(HTMLProofer::Configuration::PROOFER_DEFAULTS)
end

it 'has every option for validation defaults' do
match_command_help(HTMLProofer::Configuration::VALIDATION_DEFAULTS)
end
end

def match_command_help(config)
config_keys = config.keys
bin_file = File.read('bin/htmlproofer')
help_output = make_bin('--help')
readme = File.read('README.md')

config_keys.map(&:to_s).each do |key|
# match options
expect(bin_file).to match(key)
matched = false
readme.each_line do |line|
next unless line =~ /\| `#{key}`/

matched = true
description = line.split('|')[2].strip
description.gsub!('A hash', 'A comma-separated list')
description.gsub!('An array', 'A comma-separated list')
description.gsub!(/\[(.+?)\]\(.+?\)/, '\1')
description.sub!(/\.$/, '')
# match README description for option
expect(help_output).to include(description)
end

expect(matched).to be(true), "Could not find `#{key}` explained in README!" unless matched
end
end
16 changes: 8 additions & 8 deletions spec/html-proofer/element_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -4,54 +4,54 @@

describe HTMLProofer::Element do
before(:each) do
@check = HTMLProofer::Check.new('', '', Nokogiri::HTML(''), HTMLProofer::Configuration::PROOFER_DEFAULTS)
@check = HTMLProofer::Check.new('', '', Nokogiri::HTML5(''), HTMLProofer::Configuration::PROOFER_DEFAULTS)
end

describe '#initialize' do
it 'accepts the xmlns attribute' do
nokogiri = Nokogiri::HTML('<a xmlns:cc="http://creativecommons.org/ns#">Creative Commons</a>')
nokogiri = Nokogiri::HTML5('<a xmlns:cc="http://creativecommons.org/ns#">Creative Commons</a>')
checkable = HTMLProofer::Element.new(nokogiri.css('a').first, @check)
expect(checkable.instance_variable_get(:@xmlns_cc)).to eq 'http://creativecommons.org/ns#'
end

it 'assignes the text node' do
nokogiri = Nokogiri::HTML('<p>One')
nokogiri = Nokogiri::HTML5('<p>One')
checkable = HTMLProofer::Element.new(nokogiri.css('p').first, @check)
expect(checkable.instance_variable_get(:@text)).to eq 'One'
end

it 'accepts the content attribute' do
nokogiri = Nokogiri::HTML('<meta name="twitter:card" content="summary">')
nokogiri = Nokogiri::HTML5('<meta name="twitter:card" content="summary">')
checkable = HTMLProofer::Element.new(nokogiri.css('meta').first, @check)
expect(checkable.instance_variable_get(:@content)).to eq 'summary'
end
end

describe '#ignores_pattern_check' do
it 'works for regex patterns' do
nokogiri = Nokogiri::HTML('<script src=/assets/main.js></script>')
nokogiri = Nokogiri::HTML5('<script src=/assets/main.js></script>')
checkable = HTMLProofer::Element.new(nokogiri.css('script').first, @check)
expect(checkable.ignores_pattern_check([%r{\/assets\/.*(js|css|png|svg)}])).to eq true
end

it 'works for string patterns' do
nokogiri = Nokogiri::HTML('<script src=/assets/main.js></script>')
nokogiri = Nokogiri::HTML5('<script src=/assets/main.js></script>')
checkable = HTMLProofer::Element.new(nokogiri.css('script').first, @check)
expect(checkable.ignores_pattern_check(['/assets/main.js'])).to eq true
end
end

describe '#url' do
it 'works for src attributes' do
nokogiri = Nokogiri::HTML('<img src=image.png />')
nokogiri = Nokogiri::HTML5('<img src=image.png />')
checkable = HTMLProofer::Element.new(nokogiri.css('img').first, @check)
expect(checkable.url).to eq 'image.png'
end
end

describe '#ignore' do
it 'works for twitter cards' do
nokogiri = Nokogiri::HTML('<meta name="twitter:url" data-proofer-ignore content="http://example.com/soon-to-be-published-url">')
nokogiri = Nokogiri::HTML5('<meta name="twitter:url" data-proofer-ignore content="http://example.com/soon-to-be-published-url">')
checkable = HTMLProofer::Element.new(nokogiri.css('meta').first, @check)
expect(checkable.ignore?).to eq true
end
Expand Down
14 changes: 7 additions & 7 deletions spec/html-proofer/fixtures/html/html5_tags.html
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
<!DOCTYPE html>
<html>
<body>
<nav>
</nav>
<article>
<p>Some text</p>
</article>
</body>
<body>
<nav></nav>
<article>
<p>Some text</p>
</article>
</body>
</html>
2 changes: 1 addition & 1 deletion spec/html-proofer/fixtures/html/weird_onclick.html
Original file line number Diff line number Diff line change
@@ -1 +1 @@
<a href="http://www.google.com" onClick="window.open('https://calendar.google.com/calendar/embed?src=o3uckm13gkcc5tu4u3846ik0pc%40group.calendar.google.com&ctz=America/Los_Angeles','Team Calendar','')">
<a href="http://www.google.com" onClick="window.open('https://calendar.google.com/calendar/embed?src=o3uckm13gkcc5tu4u3846ik0pc%40group.calendar.google.com&ctz=America/Los_Angeles','Team Calendar','')"></a>
Loading