Escaping in new.txt and new.zip do not match #6

Closed
ato opened this issue Feb 1, 2018 · 0 comments

HTTrack appears to write the URL escaped in new.txt (e.g. spaces replaced with %20) but unescaped in new.zip. This causes a cache lookup error when the two forms do not match:

Exception in thread "main" java.io.IOException: no cache entry: http://example.org/some%20file.jpg
    at au.gov.nla.httrack2warc.httrack.HttrackCrawl.buildRecord(HttrackCrawl.java:148)

In new.txt entries, HTTrack appears to escape the following characters:

  • spaces
  • double-quotes
  • character codes <= 31
  • character codes >= 127

Notably, this does not include the % character, so a literal %20 in the original URL is indistinguishable from an escaped space. The transformation is therefore not safely reversible.
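To make the ambiguity concrete, here is a minimal Java sketch of the escaping rule as described above. The class and method names are hypothetical, and two details are assumptions not confirmed by HTTrack's behaviour: that escaping is applied per UTF-8 byte and that hex digits are uppercase. It is an illustration of the observed rule, not HTTrack's actual code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of httrack2warc or HTTrack.
final class HttrackEscapeSketch {

    // Percent-encodes only the characters listed in this issue:
    // space, double-quote, codes <= 31 and codes >= 127.
    // '%' itself is deliberately left untouched, as observed.
    // Assumption: escaping works on UTF-8 bytes with uppercase hex.
    static String escapeLikeNewTxt(String url) {
        StringBuilder out = new StringBuilder();
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xff;
            if (c == ' ' || c == '"' || c <= 31 || c >= 127) {
                out.append(String.format("%%%02X", c));
            } else {
                out.append((char) c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two different original URLs...
        String withSpace = "http://example.org/some file.jpg";
        String withLiteralPercent = "http://example.org/some%20file.jpg";

        // ...produce the same escaped form, so unescaping cannot
        // tell which one was originally requested.
        System.out.println(escapeLikeNewTxt(withSpace));
        System.out.println(escapeLikeNewTxt(withLiteralPercent));
    }
}
```

Because the mapping only runs safely in one direction, one possible approach (an assumption, not necessarily what the fix does) is to apply the same escaping to the new.zip entry names and compare escaped forms, rather than trying to unescape the new.txt URL.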

@ato ato added the bug label Feb 1, 2018
@ato ato closed this as completed in bccb83c Feb 2, 2018