Unexpected character #8

Closed
fabstu opened this issue Mar 2, 2020 · 2 comments

fabstu commented Mar 2, 2020

I tried to view a warc file just now with openwayback and it outputs the following. Is this a problem with the warc or with httrack2warc?

WARNING: Bad Record. Trying skip (Record start 782): Unexpected character 41(Expecting d)
Mar 02, 2020 10:47:38 AM org.archive.wayback.resourcestore.indexer.IndexWorker doWork
SEVERE: FAILED to index or upload (crawl.warc)
java.lang.RuntimeException: After retry (Offset 782)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:512)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:40)
	at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:29)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:56)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
	at org.archive.wayback.resourceindex.updater.IndexClient.addSearchResults(IndexClient.java:158)
	at org.archive.wayback.resourcestore.indexer.IndexWorker.doWork(IndexWorker.java:111)
	at org.archive.wayback.resourcestore.indexer.IndexWorker$WorkerThread.run(IndexWorker.java:244)
Caused by: java.io.IOException: Unexpected character 43(Expecting d)
	at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:80)
	at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:68)
	at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:176)
	at org.archive.io.ArchiveReader.get(ArchiveReader.java:144)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.innerNext(ArchiveReader.java:562)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.exceptionNext(ArchiveReader.java:537)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:505)
	... 9 more

The command I used to download the website:

httrack "https://web.archive.org/web/20180611033123/https://github.com/adlio/usgs-waterdata/tree-commit/89c97a80cdd6fba90972fd137fcd5a7a92ad1fff" '-*' '+https://web.archive.org/web/20180611033123*' '+https://archive.org/includes*' '+https://web.archive.org/_static*' '+https://archive.org/images*' '+https://archive.org/services*' '+https://archive.org/components*' '+https://www.archiveteam.org*' -N1005 --advanced-progressinfo --can-go-up-and-down --display --keep-alive --mirror --robots=0 --user-agent='Mozilla/5.0 (X11;U; Linux i686; en-GB; rv:1.9.1) Gecko/20090624 Ubuntu/9.04 (jaunty) Firefox/3.5' --verbose

The command I used to create the warc:

java -jar /Users/fabiansturm/Documents/projects/httrack2warc/target/httrack2warc-0.4.0-shaded.jar /var/folders/rw/35q09zqj5yv5pz4wwg3yjfkm0000gn/T/webcache-download731331670 -o /var/folders/rw/35q09zqj5yv5pz4wwg3yjfkm0000gn/T/http2warc115706301 -C none

After that I renamed it to crawl.warc since I used -C none.

To run the container:

docker pull iipc/openwayback
docker container run -it --rm -v /tmp/owb:/data -p 8089:8080 iipc/openwayback

ato added the bug label Mar 5, 2020

ato commented Mar 5, 2020

This is a bug in httrack2warc, I should have a fix shortly.

ato closed this as completed in 206fe0d Mar 5, 2020

ato commented Mar 5, 2020

Fix released as v0.5.0. Thanks for reporting this. I missed it because I'd only been testing with pywb, which has a rather forgiving parser. I've tested the fixed version with pywb cdx-indexer, openwayback cdx-indexer, jwat and jwarc, and none of them complain about the generated WARCs.
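
For anyone hitting the same message: the stack trace fails in `WARCReader.gotoEOR`, and the "d" in "Expecting d" appears to be hex for byte 0x0d (carriage return). The WARC format requires every record block to be followed by CRLF CRLF, so a strict reader that consumes exactly `Content-Length` bytes and then finds something else reports an error like this. The thread does not say what exactly was wrong in the generated file, so the following is only a minimal sketch of that kind of strict check, assuming an uncompressed WARC (as produced with `-C none`). The names `find_bad_record` and `make_record` are hypothetical and are not part of httrack2warc or OpenWayback.

```python
import io

CRLF = b"\r\n"


def find_bad_record(stream):
    """Return the start offset of the first malformed record, or None.

    A record counts as malformed here if it lacks a WARC version line or a
    usable Content-Length, or if its block is not followed by CRLF CRLF.
    """
    while True:
        start = stream.tell()
        version = stream.readline()
        if not version:
            return None  # clean end of file, every record checked out
        if not version.startswith(b"WARC/"):
            return start

        # Read header lines up to the blank line, keeping only Content-Length.
        length = None
        while True:
            line = stream.readline()
            if line in (CRLF, b""):
                break
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                try:
                    length = int(value.strip())
                except ValueError:
                    return start
        if length is None:
            return start

        # Consume exactly Content-Length bytes, then demand the terminator.
        block = stream.read(length)
        if len(block) != length or stream.read(4) != CRLF + CRLF:
            return start


def make_record(body, declared_length=None):
    """Hypothetical helper: build a tiny record, optionally lying about its length."""
    n = len(body) if declared_length is None else declared_length
    header = (b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: "
              + str(n).encode("ascii") + b"\r\n\r\n")
    return header + body + CRLF + CRLF


first = make_record(b"hello")
good = first + make_record(b"world")
bad = first + make_record(b"world!", declared_length=3)

print(find_bad_record(io.BytesIO(good)))  # no malformed record found
print(find_bad_record(io.BytesIO(bad)) == len(first))  # second record flagged
```

To run the same check on a real file, open it in binary mode and pass the file object, for example `find_bad_record(open("crawl.warc", "rb"))`. A lenient parser may resynchronise on the next `WARC/` line instead of failing, which would be consistent with the bug going unnoticed under pywb.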
