Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup, Redux #24

Merged
merged 15 commits into from
Dec 31, 2013
Merged

Speedup, Redux #24

merged 15 commits into from
Dec 31, 2013

Conversation

gjtorikian
Copy link
Owner

This branch introduces a few significant changes and optimizations:

  • The ability to pass along any of Typhoeus' options from HTML::Proofer.new
  • A cleanup of the test suite
  • All external URL checks are collected, and the tests are run after everything else has run. This lets you pass in a new option, disable_external, which won't trigger these tests (in case you only want to run them on a CI or something)
  • All external URL checks are now HEAD request. This is way, way faster, since unlike GET, we don't download the page contents, nor do we care for them. A HEAD request is good enough for checking a URL's existence.

Unfortunately, using HEAD introduced some weird bug from curl (I'm on version 7.30). For an example, run this command from the terminal:

curl -I -X HEAD https://help.github.com/changing-author-info

You'll get back a 404, even though the browser sees it as a 301, then a 200. This seems to affect a wide variety of sites.

The workaround is to execute a GET request when you see a 404. If this is a failure, then the page really does not exist.

Although you're under no obligation to, since a lot of the code has changed, it'd be nice for some 👀. /cc @benbalter @afeld @parkr

@benbalter
Copy link
Contributor

All external URL checks are now HEAD request. This is way, way faster, since unlike GET, we don't download the page contents, nor do we care for them. A HEAD request is good enough for checking a URL's existence.

Does this affect anchors in external pages?

You'll get back a 404, even though the browser sees it as a 301, then a 200. This seems to affect a wide variety of sites.

Perhaps there's a flag to follow redirects?

@options.delete opt
end

external_urls.each_pair do |href, filenames|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't have to be in this pull request, but may make sense to break some things out into methods rather than having a single long, functional run method. OO and all that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benbalter 👍 ^^

@gjtorikian
Copy link
Owner Author

Does this affect anchors in external pages?

A GET or a HEAD on a legitimate anchor, or an illegitimate anchor but on a valid page, both return 200:

# legitimate anchor
curl -I -X HEAD https://github.com/gjtorikian/html-proofer#htmlproofer
curl https://github.com/gjtorikian/html-proofer#htmlproofer
# illegitimate
curl -I -X HEAD https://github.com/gjtorikian/html-proofer#htmlprooferssss
curl HEAD https://github.com/gjtorikian/html-proofer#htmlprooferssss

I assume since the underlying page is okay, the anchor is not checked. Probably similar to how the browser just goes "yeah here's a page, I don't know where the anchor is."

Perhaps there's a flag to follow redirects?

Yeah, that's the -L option. Same problem though:

[fix-search-dates] ~/github/help-docs $ curl -L -I -X HEAD https://help.github.com/changing-author-info
HTTP/1.1 404 Not Found
Cache-Control: no-cache
Content-Type: text/html; charset=utf-8
Date: Sun, 29 Dec 2013 20:03:00 GMT
Set-Cookie: _help_session=RmlOaEdiV1MxdGlEMGxqaWZ5bmtNT3NnQlR1UGVJekhhYXVpVmx3dEN0ZEgrTktVQUxHM1hqaEkrMm5wSnhXZmNIb3pVYmVoVlJ6Q3k2U0x4VDFiV0IxNVN2U29kdk02Q0ZWMEl5MHRwY1pZWkg3ZkJUU2cvQXptd3BWY0ZETTBtaGhPYjBTSC93aGtZaGZSQVZvOXR5R2Y1aUMva21waXFUbTFUNkUxYU0wTjhkbDNQVjdhZnkxeHNDWWY5cEhHLS1qMmN3djFmWXFoazRyMnQ5NEM5d093PT0%3D--27e89b46ed72208fffaf58850d0554e8a05ce352; path=/; expires=Fri, 29 Dec 2023 20:03:00 -0000; secure; HttpOnly
Status: 404 Not Found
Strict-Transport-Security: max-age=31536000
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Request-Id: 9d2a104c-d1a0-4f3a-af9c-be35fd9892da
X-Runtime: 0.009505
X-Ua-Compatible: chrome=1
X-Xss-Protection: 1; mode=block
Connection: keep-alive

Proofer now supports iterating over single files, so make sure we’re
using the file’s dirname if it’s not actually a directory
Want to use HTML Proofer with your Jekyll site? Awesome. Simply add `gem 'html-proofer'` to your `Gemfile` as described above, and add the following to your `Rakefile`, using `rake test` to execute:
Want to use HTML Proofer with your Jekyll site? Awesome. Simply add `gem 'html-proofer'`
to your `Gemfile` as described above, and add the following to your `Rakefile`,
using `rake test` to execute:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you thought about creating these tasks yourself? Then I could just do:

require 'html/proofer/tasks'
HTML::Proofer::Tasks.new(:jekyll, :test)

Which defines a new jekyll test and assigns it to the task name test?

@parkr
Copy link
Contributor

parkr commented Dec 29, 2013

This looks AWESOME! Great work. :)


``` ruby
HTML::Proofer.new("out/", {:ext => ".htm", :verbose = > true, :ssl_verifyhost => 2 })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may consider separating the Typhoeus options, so that they don't conflict with your own. Something like

HTML::Proofer.new("out/", {:ext => ".htm", :typhoeus => {:verbose = > true, :ssl_verifyhost => 2 }})

@failed_tests << "#{filenames.join(' ').blue}: External link #{href} failed: got a time out"
# hey here's a funny bug: sometimes HEAD requests return a 404, even on legitimate pages! The
# bug seems to affect curl; try `curl -I -X HEAD https://help.github.com/changing-author-info`
# so in these cases, try a regular `GET`. if it fails, it fails.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this is a problem w/ the server not honoring HEAD requests, so isn't curl-specific.

RestClient.head('https://help.github.com/changing-author-info')
#=> RestClient::ResourceNotFound: 404 Resource Not Found

@afeld
Copy link
Contributor

afeld commented Dec 30, 2013

I think I prefer the way the assertions were done in the tests before. It's great to add some "integration" tests to check that it outputs what you expect, but I wouldn't run it that way for every test.

gjtorikian added a commit that referenced this pull request Dec 31, 2013
@gjtorikian gjtorikian merged commit 3574395 into master Dec 31, 2013
@gjtorikian gjtorikian deleted the speedup-redux branch December 31, 2013 01:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants