I tried to view a warc file just now with openwayback and it outputs the following. Is this a problem with the warc or with httrack2warc?
WARNING: Bad Record. Trying skip (Record start 782): Unexpected character 41(Expecting d)
Mar 02, 2020 10:47:38 AM org.archive.wayback.resourcestore.indexer.IndexWorker doWork
SEVERE: FAILED to index or upload (crawl.warc)
java.lang.RuntimeException: After retry (Offset 782)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:512)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:40)
at org.archive.wayback.resourcestore.indexer.ArchiveReaderCloseableIterator.next(ArchiveReaderCloseableIterator.java:29)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:56)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
at org.archive.wayback.util.AdaptedIterator.hasNext(AdaptedIterator.java:55)
at org.archive.wayback.resourceindex.updater.IndexClient.addSearchResults(IndexClient.java:158)
at org.archive.wayback.resourcestore.indexer.IndexWorker.doWork(IndexWorker.java:111)
at org.archive.wayback.resourcestore.indexer.IndexWorker$WorkerThread.run(IndexWorker.java:244)
Caused by: java.io.IOException: Unexpected character 43(Expecting d)
at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:80)
at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:68)
at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:176)
at org.archive.io.ArchiveReader.get(ArchiveReader.java:144)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.innerNext(ArchiveReader.java:562)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.exceptionNext(ArchiveReader.java:537)
at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:505)
... 9 more
Fix released as v0.5.0. Thanks for reporting this. I missed it because I'd only been testing with pywb, which has a rather forgiving parser. I've tested the fixed version with pywb cdx-indexer, openwayback cdx-indexer, jwat and jwarc, and all appear to have no complaints about the generated WARCs.
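For anyone who wants a quick sanity check before pointing a strict indexer at a file: the "Expecting d" in the trace above is presumably the hex code for carriage return (0x0d), which the WARC format requires (as CRLF CRLF) after each record's content block. Below is a minimal sketch of that check in Python, assuming an uncompressed WARC opened in binary mode. The function name `check_warc_record_ends` is hypothetical; this is not the code used by openwayback or httrack2warc, just an illustration of what a strict reader looks for.

```python
import io


def check_warc_record_ends(stream):
    """Yield (offset, ok) for each record in an uncompressed WARC stream.

    ok is False when the record cannot be parsed or when its content
    block is not followed by the CRLF CRLF terminator.
    """
    while True:
        offset = stream.tell()
        version_line = stream.readline()
        if not version_line:
            return  # clean end of file
        if not version_line.startswith(b"WARC/"):
            yield offset, False  # not positioned at a record start
            return

        # Read the named header fields up to the blank line and
        # remember the declared Content-Length.
        length = None
        while True:
            header = stream.readline()
            if header in (b"\r\n", b""):
                break
            name, _, value = header.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            yield offset, False  # Content-Length is mandatory
            return

        # Skip the content block, then expect exactly CRLF CRLF.
        stream.seek(length, io.SEEK_CUR)
        ok = stream.read(4) == b"\r\n\r\n"
        yield offset, ok
        if not ok:
            return  # later offsets cannot be trusted after a mismatch
```

Calling `list(check_warc_record_ends(open("crawl.warc", "rb")))` returns one `(offset, ok)` pair per record and stops at the first record whose declared Content-Length does not line up with the terminator, which is the kind of mismatch a forgiving parser might skip over.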
The command I used to download the website:
The command I used to create the warc:
After that I renamed it to crawl.warc since I used -C none.
To run the container: