-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add WACZ support #770
base: master
Are you sure you want to change the base?
Add WACZ support #770
Conversation
Pending a test passing and probably some better documentation, but I would appreciate your feedback on the diff here, @ibnesayeed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer we handle it by externally extracting WACZ files and then indexing those WARCs the usual way until we have a built-in WARC record iterator in py-wacz (an idea that I discussed with @ikreymer already), at which point this change will become obsolete.
That said, I do not have any strong objections against this PR. I have added some inline comments for potential improvements though.
@@ -171,6 +186,8 @@ def index_file_at(warc_paths, encryption_key=None, | |||
cdxj_metadata_lines = generate_cdxj_metadata(cdxj_lines) | |||
cdxj_lines = cdxj_metadata_lines + cdxj_lines | |||
|
|||
cleanup_warc_files_extracted_from_wacz(warc_paths_to_append) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a big issue, but I think the temporary folders created by the mkdtemp()
call will continue to exists (until cleaned up by the OS) because only the files inside them are deleted, not the folders themselves.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs say that the creator is responsible for the deletion, so I think we should handle this. Given each WARC gets a new temp directory, it might be better to just retain the copy of this directory path and delete it along with its contents instead of deleting the WARC then the directory, which would require tracking the directory path, too. Which approach would you rather be implemented, @ibnesayeed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which would require tracking the directory path, too
Not really! It is possible to get the path of the directory if the path of a file is known that it contains.
That said, I would perhaps preferred not holding onto the list of WARC files, instead, operate on each WARC as we discover them, whether those are regular WARC files or those extracted from WACz files. I would deal with one file at a time and loop over for the next one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ibnesayeed This seems like it requires a revamp outside of the scope of this GH issue/PR. I agree that dealing with one WARC at a time would likely be more computationally optimal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that it would require change in the workflow. When done, it would be more space efficient as not all the WARCs need to be extracted from WACZ files upfront, duplicating them on the disk, before processing them.
It is okay to leave it as things are right now and get back to this when we have a WARC record iterator for the WACZ files, when most of these changes will be rendered useless.
While its nice to see more support for WACZ, I am wondering about how you're seeing support the WACZ format in this implementation. It seems like this PR is simply treating WACZ as just ZIP containing some WARC files.
But a WACZ file is more than just a zip of WARC files! More importantly, it has a builtin CDXJ in a specific format, which allows for fast access and does not require reindexing of the WARCs. It also has metadata, fixity information (payload and full record digests), and optional signature. There is a specific structure, extending the frictionless data package, and validated by But if everything but the WARCs are being discarded, I wonder if that should be considered 'supporting WACZ' or just 'importing WARCs from WACZ file'. Can you see a way for ipwb to support more of the WACZ format, such as reusing the existing CDXJ indexes perhaps? In some ways, the current approach is a lossy transformation: WACZ files can already be hosted and used directly from IPFS. By importing into ipwb, you're taking the WARCs from the WACZ, reindexing them, and putting just the WARCs onto IPFS in a slightly different way, while creating a new index that is not stored on IPFS, and discarding other data (metadata, signatures, etc...). I wonder if ipwb could do something else to better leverage the properties of the WACZ format? |
@ikreymer This initial implementation is base support for WACZ. We are hoping to do more with the format in the future beyond treating it as a container for WARCs. Reusing the CDXJ in the WACZ for ipwb's own generated CDXJ index could be interesting. We have not thought about it much yet. @ibnesayeed likely has some ideas / 2¢. Thanks for your input! |
@ikreymer that would have been my preference as well (as noted in a comment earlier #770 (review)), but @machawk1 wanted a more built-in approach, hence this PR.
In that sense IPWB does not even support WARC files, let alone WACZ files. During the indexing and ingestion process, it iterates over each WARC record, splits the headers and payload to store them as small IPFS objects to be used at the time of replay. This is why we are interested in an iterator over the WARC records and not the container (or the container of the container) format.
We can leverage the CDXJ file packaged inside the WACZ file, but not for replay, only as a means to iterate over WARC records during indexing/ingestion. However, for large WACZ files this approach might be slow because the reader pointer will be jumping around to seek to different locations if the CDXJ file is not sorted in the order those records are preserved. Moreover, if the built-in CDXJ excludes certain WARC record types (which might not be the case, but worth mentioning here) then those will not be reachable.
I am not sure if I would call it lossy, because the goal of IPWB is different from some of the other web archiving systems. It operates on the atomic level of records while gluing related pieces together for replay. IPWB was not modeled to interact with WARC/WACZ or any other container/bundle format at the time of replay. For long term preservation, one is encouraged to retain WARC/WACZ files with as much provenance and metadata as possible. IPWB can be improved to incorporate more metadata followed by re-ingesting WARC files from the cold storage. |
@ibnesayeed I would like your re-review here, as I added some logic to retain the temp paths and remove them as required so as to not have side effects. Inferring these might be unreliable, as the WARCs extracted from a WACZ are stored at {unique_temp_path}/archive/warcname.warc.gz rather than simply {unique_temp_path}/warcname.warc.gz. If you would rather we change this to the latter scenario, which should not cause a clash issue due to the unique paths, let me know and I can cleanup the logic. Otherwise, the changes here meet the requirements of #710. The latter scenario can always come in with refactoring. |
Another caveat is whether subdirectories beyond /archive are legal for WARC storage in WACZ. This would cause any removal of the structure in WACZ not being retained in the on-disk WARC to introduce a clash. I would have to check the WACZ spec to see if additional organization like this is legal. If not, the cleaner way would be to removal the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we have some unnecessary code because we are doing things the complicated way. For example, we are maintaining list of extracted WARC files from WACZs to be removed later, then we also maintain list of temporary directories to be removed, though removing the later will automatically remove its contents.
I am not sure if merging this PR would be a good idea as it brings a lot of complexities that will be difficult to maintain and reason about and at some point they might become useless (if and when we have WARC record iterator built in the WACZ library).
wacz_paths = [] | ||
for warc_path in warc_paths: | ||
if is_wacz(warc_path): | ||
(w_paths, dirs_to_cleanup) = extract_warcs_from_wacz(warc_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dirs_to_cleanup
here is overwritten in each loop, so at the end it will only hold the reference to the temporary dirs of the last WACZ file (unless I am missing something) for cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch on dirs_to_cleanup
not being retained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the logic of extracted WARCs and temp directories can be simplified. There is a bit too much disk maintenance. The general gist is that paths to WACZ files could be inter-mingled with WARCs here and thus the WARCs extracted from the WACZ files need to be removed but not the WARCs that were passed in.
|
see more TODOs below before this PR can be merged