-
-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speedup, Redux #24
Speedup, Redux #24
Changes from 13 commits
27238e9
0d83048
893d1b2
7b13ef9
0e94fab
b87960c
1fa473c
db5fddc
fa745ae
484c670
8f5d769
97228ef
ce3cc9a
8b9b62c
bfcfcee
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -69,7 +69,9 @@ Note: since `swap` is a bit special, you'll pass in a pair of `RegEx:String` val | |
|
||
## Usage with Jekyll | ||
|
||
Want to use HTML Proofer with your Jekyll site? Awesome. Simply add `gem 'html-proofer'` to your `Gemfile` as described above, and add the following to your `Rakefile`, using `rake test` to execute: | ||
Want to use HTML Proofer with your Jekyll site? Awesome. Simply add `gem 'html-proofer'` | ||
to your `Gemfile` as described above, and add the following to your `Rakefile`, | ||
using `rake test` to execute: | ||
|
||
```ruby | ||
require 'html/proofer' | ||
|
@@ -94,12 +96,15 @@ htmlproof ./_site | |
* Whether your internal links are not broken; this includes hash references (`#linkToMe`) | ||
* Whether external links are working | ||
|
||
## Configuration | ||
## Configuration | ||
|
||
The `HTML::Proofer` constructor takes an optional hash of additional options: | ||
|
||
* `:ext`: the extension (including the `.`) of your HTML files (default: `.html`) | ||
* `:href_swap`: a hash containing key-value pairs of `RegExp => String`. It transforms links that match `RegExp` into `String` via `gsub`. | ||
* `:href_ignore`: an array of Strings containing `href`s that are safe to ignore (default: `mailto`) | ||
* `:href_ignore`: an array of Strings containing `href`s that are safe to ignore (by default, `mailto` is always checked) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does this mean that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Clarified 👍 |
||
* `:disable_external`: if `true`, does not run the external link checker, which can take a lot of time (default: `false`) | ||
|
||
To any `<a>` or `<img>` tag, you may add the `data-proofer-ignore` attribute to ignore the link. | ||
You can also pass in any of Typhoeus' options for the external link check. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What does this look like? Would you mind providing an example? |
||
|
||
To any `<a>` or `<img>` tag, you may add the `data-proofer-ignore` attribute to ignore the link. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,54 +1,117 @@ | ||
require 'nokogiri' | ||
|
||
require File.dirname(__FILE__) + '/proofer/checkable' | ||
require File.dirname(__FILE__) + '/proofer/checks' | ||
|
||
module HTML | ||
class Proofer | ||
def initialize(src, opts={}) | ||
@srcDir = src | ||
@options = {:ext => ".html"}.merge(opts) | ||
# Typhoeus::Config.verbose = true | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What do you think about turning this on if I pass in |
||
@proofer_opts = {:ext => ".html", :href_swap => [], :href_ignore => [], :disable_external => false } | ||
@options = @proofer_opts.merge({:followlocation => true}).merge(opts) | ||
|
||
@failed_tests = [] | ||
end | ||
|
||
def run | ||
total_files = 0 | ||
failed_tests = [] | ||
external_urls = {} | ||
hydra = Typhoeus::Hydra.new | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does this need to be instantiated every time or can we memoize? def hydra
@hydra ||= Typhoeus::Hydra.new
end Then use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Memoize~~ |
||
|
||
puts "Running #{get_checks} checks on #{@srcDir} on *#{@options[:ext]}... \n\n" | ||
|
||
puts "Running #{get_checks} checks on #{@srcDir}... \n\n" | ||
if File.directory? @srcDir | ||
files = Dir.glob("#{@srcDir}/**/*#{@options[:ext]}") | ||
else | ||
files = File.extname(@srcDir) == @options[:ext] ? [@srcDir] : [] | ||
end | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This block might be a great candidate for a method. |
||
|
||
Dir.glob("#{@srcDir}/**/*#{@options[:ext]}") do |path| | ||
files.each do |path| | ||
total_files += 1 | ||
html = HTML::Proofer.create_nokogiri(path) | ||
|
||
get_checks.each do |klass| | ||
check = klass.new(@srcDir, path, html, @options) | ||
check.run | ||
check.hydra.run | ||
failed_tests.concat(check.issues) if check.issues.length > 0 | ||
external_urls.merge!(check.external_urls) | ||
@failed_tests.concat(check.issues) if check.issues.length > 0 | ||
end | ||
end | ||
|
||
# the hypothesis is that Proofer runs way faster if we pull out | ||
# all the external URLs and run the checks at the end. Otherwise, we're halting | ||
# the consuming process for every file. In addition, sorting the list lets | ||
# libcurl keep connections to hosts alive. Finally, we'll make a HEAD request, | ||
# rather than GETing all the contents | ||
external_urls = Hash[external_urls.sort] | ||
|
||
unless @options[:disable_external] | ||
puts "Checking #{external_urls.length} external links... \n\n" | ||
|
||
# Typhoeus won't let you pass any non-Typhoeus option | ||
@proofer_opts.each_key do |opt| | ||
@options.delete opt | ||
end | ||
|
||
external_urls.each_pair do |href, filenames| | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Doesn't have to be in this pull request, but may make sense to break some things out into methods rather than having a single long, functional run method. OO and all that. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @benbalter 👍 ^^ |
||
request = Typhoeus::Request.new(href, @options.merge({:method => :head})) | ||
request.on_complete { |response| response_handler(response, filenames) } | ||
hydra.queue request | ||
end | ||
hydra.run | ||
end | ||
|
||
puts "Ran on #{total_files} files!" | ||
|
||
if failed_tests.empty? | ||
if @failed_tests.empty? | ||
puts "HTML-Proofer finished successfully.".green | ||
exit 0 | ||
else | ||
failed_tests.each do |issue| | ||
@failed_tests.each do |issue| | ||
$stderr.puts issue + "\n\n" | ||
end | ||
|
||
raise "HTML-Proofer found #{failed_tests.length} failures!" | ||
raise "HTML-Proofer found #{@failed_tests.length} failures!" | ||
end | ||
end | ||
|
||
def response_handler(response, filenames) | ||
href = response.options[:effective_url] | ||
method = response.request.options[:method] | ||
response_code = response.code | ||
|
||
if response_code.between?(200, 299) | ||
# continue with no op | ||
elsif response.timed_out? | ||
@failed_tests << "#{filenames.join(' ').blue}: External link #{href} failed: got a time out" | ||
# hey here's a funny bug: sometimes HEAD requests return a 404, even on legitimate pages! The | ||
# bug seems to affect curl; try `curl -I -X HEAD https://help.github.com/changing-author-info` | ||
# so in these cases, try a regular `GET`. if it fails, it fails. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FWIW this is a problem w/ the server not honoring HEAD requests, so isn't curl-specific. RestClient.head('https://help.github.com/changing-author-info')
#=> RestClient::ResourceNotFound: 404 Resource Not Found |
||
elsif response_code == 404 && method == :head | ||
next_response = Typhoeus.get(href, @options) | ||
response_handler(next_response, filenames) | ||
else | ||
if (response_code == 420 || response_code == 503) && method == :head | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Move this up a line to an |
||
# 420s usually come from rate limiting; let's ignore the query and try just the path with a GET | ||
uri = URI(href) | ||
next_response = Typhoeus.get(uri.scheme + "://" + uri.host + uri.path, @options) | ||
response_handler(next_response, filenames) | ||
else | ||
# Received a non-successful http response. | ||
@failed_tests << "#{filenames.join(' ').blue}: External link #{href} failed: #{response_code} #{response.return_message}" | ||
end | ||
end | ||
end | ||
|
||
def self.create_nokogiri(path) | ||
path << "/index.html" if File.directory? path # support for Jekyll-style links | ||
content = File.open(path, "rb") {|f| f.read } | ||
Nokogiri::HTML(File.read(path)) | ||
Nokogiri::HTML(content) | ||
end | ||
|
||
def get_checks | ||
HTML::Proofer::Checks::Check.subclasses | ||
end | ||
|
||
end | ||
end |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -24,24 +24,27 @@ def missing_src? | |
|
||
class Images < ::HTML::Proofer::Checks::Check | ||
def run | ||
@html.css('img').each do |img| | ||
@html.css('img').each do |i| | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How about a more descriptive variable name, e.g. |
||
img = Image.new i, self | ||
|
||
img = Image.new img, self | ||
|
||
# screenshot filenames, return because invalid URL | ||
# screenshot filenames should return because of terrible names | ||
return self.add_issue "image has a terrible filename (#{img.src})" if img.terrible_filename? | ||
|
||
# does the image exist? | ||
if img.missing_src? | ||
self.add_issue "image has no src attribute" | ||
elsif img.remote? | ||
validate_url img.src, "external image #{img.src} does not exist" | ||
return if img.ignore? | ||
add_to_external_urls img.src | ||
else | ||
return if img.ignore? | ||
self.add_issue("internal image #{img.src} does not exist") unless img.exists? | ||
end | ||
|
||
# check alt tag | ||
self.add_issue "image #{img.src} does not have an alt attribute" unless img.valid_alt_tag? | ||
end | ||
|
||
return external_urls | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's your take on explicit There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I personally like to use them as a way of saying "this value will be used as part of something else. For one-off or memoized functions I usually don't. To explain my bizarre logic, in the case above, we're collecting a bunch of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Unless you use them always, I think you can leave it out in this case, because the variable by itself on the last line of a method implies that it is returned and useful. |
||
end | ||
end |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,9 +15,8 @@ def missing_href? | |
class Links < ::HTML::Proofer::Checks::Check | ||
|
||
def run | ||
@html.css('a').each do |link| | ||
|
||
link = Link.new link, self | ||
@html.css('a').each do |l| | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
link = Link.new l, self | ||
return if link.ignore? | ||
|
||
# is there even a href? | ||
|
@@ -28,15 +27,15 @@ def run | |
|
||
# does the file even exist? | ||
if link.remote? | ||
validate_url link.href, "externally linking to #{link.href}, which does not exist" | ||
add_to_external_urls link.href | ||
else | ||
self.add_issue "internally linking to #{link.href}, which does not exist" unless link.exists? | ||
end | ||
|
||
# verify the target hash | ||
if link.hash | ||
if link.remote? | ||
#not yet checked | ||
add_to_external_urls link.href | ||
elsif link.internal? | ||
self.add_issue "linking to internal hash ##{link.hash} that does not exist" unless hash_check @html, link.hash | ||
elsif link.external? | ||
|
@@ -49,6 +48,8 @@ def run | |
end | ||
end | ||
end | ||
|
||
return external_urls | ||
end | ||
|
||
def hash_check(html, href_hash) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,5 +3,4 @@ | |
<a href="folder/">Folder relative to self</a> | ||
<a href="index.html#anchor">Anchor relative to self</a> | ||
|
||
<img alt="sure thing" src="/gpl.png"/>Relative to root | ||
<img alt="sure thing" src="gpl.png"/>Relative to self | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We don't want to check this? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I just moved the test file: https://github.com/gjtorikian/html-proofer/blob/speedup-redux/spec/html/proofer/fixtures/brokenRelativeImages.html relativeLinks should have links, relativeImages, images. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you thought about creating these tasks yourself? Then I could just do:
Which defines a new
jekyll
test and assigns it to the task nametest
?