Add resume single file upload #386
Thanks for the suggestion @pixelrust, this would be really nice! I think the way to do it would be adding support for multipart uploads. Hopefully I can find time for this someday, but if anyone else has time to work on this, please let me know if you have any questions.
This would help me a lot. I am getting lots of "connection aborted", "bad status line", etc. errors from IA recently, and uploading dozens of large (multi-GB) files is a real chore because ia stops working through the file list after the exception (instead of retrying the whole file, or at least moving on to the next file). Some of the errors I'm getting:
I have been working on this for a few weeks now. The Internet Archive's API seems to differ from the official Amazon S3 API in some places. Additionally, the limits are different: Amazon has a minimum part size of 5 MB, while with the Archive I tested part sizes as low as 1 KB and they still worked. Those are already impractically low, so testing anything even lower would be pointless. Yesterday I slowly started to merge the multipart upload functionality into the actual library. I think I might create a pull request for the main functionality as soon as I have tested it again. Properly adding the support will take more time. I'm also not yet sure about where on the disk the information about the uploaded parts should be saved. Is there a canonical directory for such temporary, transfer-related information yet?
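For context, this is roughly what the standard S3 multipart flow looks like when pointed at IA's S3-compatible endpoint. It's a minimal sketch that assumes the stock InitiateMultipartUpload/UploadPart/CompleteMultipartUpload calls work as on Amazon S3 proper; the credentials, the 100 MiB part size and the lack of retries are placeholders, not what the eventual library code would do:

```python
import re
import requests

ACCESS, SECRET = "IA_ACCESS_KEY", "IA_SECRET_KEY"   # placeholder credentials
ENDPOINT = "https://s3.us.archive.org"
HEADERS = {"authorization": f"LOW {ACCESS}:{SECRET}"}
PART_SIZE = 100 * 1024 * 1024  # 100 MiB, chosen arbitrarily for this sketch


def multipart_upload(identifier: str, key: str, path: str) -> None:
    url = f"{ENDPOINT}/{identifier}/{key}"

    # 1. Initiate the upload; the response XML contains the UploadId.
    r = requests.post(url, params={"uploads": ""}, headers=HEADERS)
    r.raise_for_status()
    upload_id = re.search(r"<UploadId>(.+?)</UploadId>", r.text).group(1)

    # 2. Upload the file in parts, remembering each part's ETag.
    etags = []
    with open(path, "rb") as f:
        part_number = 1
        while chunk := f.read(PART_SIZE):
            pr = requests.put(
                url,
                params={"partNumber": str(part_number), "uploadId": upload_id},
                headers=HEADERS,
                data=chunk,
            )
            pr.raise_for_status()
            etags.append((part_number, pr.headers["ETag"]))
            part_number += 1

    # 3. Complete the upload with the collected part list.
    parts_xml = "".join(
        f"<Part><PartNumber>{n}</PartNumber><ETag>{etag}</ETag></Part>"
        for n, etag in etags
    )
    body = f"<CompleteMultipartUpload>{parts_xml}</CompleteMultipartUpload>"
    cr = requests.post(url, params={"uploadId": upload_id}, headers=HEADERS, data=body)
    cr.raise_for_status()
```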
If we add support for multi-part upload, it should be added as a new method. One issue for archive.org with multi-part uploads is that if you're not really sure what's going on behind the scenes or how to manage these types of uploads, it has the potential to create a lot of cruft on archive.org (partially uploaded files, etc.). This is already a minor issue, and it's not even part of the client yet. So, it will be important to make this feature a non-default that you have to explicitly call, as well as providing very clear documentation on how it works.
That was also a thought I had. I was hoping that the Archive servers just periodically flushed old, unfinished multipart uploads. It would be impossible to implement the proper bookkeeping solely client-side. What if someone starts a 200 GB upload and just never finishes it?
@maxz After talking with others here, I think the right thing to do is clearly document that any partial uploads older than 90 days may be deleted in some circumstances. So, I think that takes care of this concern! I still think it should be a new method and a new option in the CLI.
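To make the opt-in nature concrete, the interface could look something like the following. These are entirely hypothetical names, sketched only to illustrate "new method plus explicit CLI flag"; nothing like this exists in the library yet:

```python
from internetarchive import get_item

item = get_item("my-test-item")  # hypothetical identifier

# Hypothetical new method, deliberately separate from Item.upload() so that
# multipart behaviour is never triggered unless explicitly requested:
item.upload_multipart(              # hypothetical name, does not exist yet
    "huge-file.iso",
    part_size=100 * 1024 * 1024,    # hypothetical parameter
)

# Hypothetical CLI equivalent, again strictly opt-in:
#   ia upload my-test-item huge-file.iso --multipart
```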
Support and thanks to anyone who may be working on this...
No. There is nothing you could really help with regarding this feature. But thank you for the interest. There are many other issues in the tracker which you could resolve.
thank you |
FWIW, I wrote an independent script for multipart uploading a little while ago. After using it a fair bit in different scenarios (including uploading from stdin #326 and parallel connections for better upload speeds from transatlantic connections), I can definitely confirm that it is a headache. Automatic cleanup of stale uploads on the IA side should probably be implemented before making this available to the general user. I also noticed that IA's processing of multipart uploads is much less efficient than direct uploads, which may be a concern as well. Further, there are issues with item creation from a multipart upload (the creation happens asynchronously after the multipart upload is initiated, and uploading the parts only starts working a bit later), and listing in-progress multipart uploads has TLS certificate problems. Much fun, can't recommend at this time.
I also created it outside of the library. I did not test parallel connections because I'm heavily against implementing those. The IA servers are already strained and slow enough, and uploads take long enough in the current state; I don't want people to be able to easily increase that load for not much gain. When someone is dealing with repeatedly interrupted larger uploads, multipart uploads should on the other hand considerably reduce strain, though they obviously carry some additional transfer and processing overhead compared to a flawless upload, depending on the chosen part size.
I don't see that happening any time soon. Sadly it seems like we can't quite work in tandem with the backend and rely on them to fix things. There have been more pressing matters which have gone unfixed for years now (e.g. supporting range requests in the Wayback Machine, which would be required to reliably download files there; I could download the particular files whose download kept aborting for a while, but it is a problem that could always resurface with other files).
I could not observe any problems with uploading the parts after the initialisation, but there certainly are many idiosyncrasies (already written about above) to this part of the IA S3 API which don't even match the version of the S3 API specification they are based on. Completing uploads did not take any noticeable time, but I did not test it with very high part numbers. Some other parts of the API have major problems if the number of elements is too high and no problems if it is spread over multiple requests. But if that is also the case for completing uploads, then that would be another problem, because an upload can't be completed in multiple steps. Properly integrating it into the library is another matter. Right now the biggest burden on my mind regarding this feature is how to write proper tests for it. I could not think of a way yet, and there are no similar tests in the current test suite which I could use as a basis.
It's much more complicated than 'it's already overloaded anyway', but yes, this is part of the reason why I didn't link my script. (I don't want to go into the details, but in the use case behind my script, parallelism is not just beneficial but really required and recommended by IA staff. In the general case, you're absolutely right.)
Neither do I, and there are definitely way more important issues that should've been tackled years ago (something something cookies). So yeah, basically this can't really be implemented reliably without awful crutches.
It depends on a lot of factors, including time of day (IA server load) and latency to IA. If the item doesn't already exist, CreateMultipartUpload creates it, but this doesn't happen immediately, and the client gets back an HTTP 200 before the corresponding item actually exists.
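One way a client could cope with that delay is to poll until the item has materialised. This is just a sketch, assuming the public https://archive.org/metadata/<identifier> endpoint is a reasonable existence check; the timeout and polling interval are guesses:

```python
import time
import requests


def wait_for_item(identifier: str, timeout: float = 600, interval: float = 15) -> bool:
    """Poll archive.org until the item exists or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        r = requests.get(f"https://archive.org/metadata/{identifier}", timeout=60)
        # The metadata endpoint returns an empty JSON object for items
        # that do not exist (yet).
        if r.ok and r.json():
            return True
        time.sleep(interval)
    return False
```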
Yeah, and although the servers are reachable over HTTPS, they serve an invalid TLS certificate. I contacted them about it.
Neither did I, only a few dozen parts. But I didn't mean it's inefficient on the S3 client level. Rather, the processing of multipart uploads on IA's side is what's much less efficient than for direct uploads.
Yeah, this has been bothering me for a while in general. Quite a few parts of IA are poorly tested because testing them would require modification of test items on the live system, which is probably undesirable. But that's really the only way to properly test software that interacts with another system unless you can faithfully emulate that system locally (e.g. a MinIO S3 server for standard S3, but IA isn't fully compatible...).
If you want to write tests against the S3 API, Moto is very good: https://github.com/getmoto/moto. A version modified to emulate IA's incompatibilities shouldn't be that hard to make, although maintenance over time is a different question.
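For illustration, a minimal pytest-style sketch of what such a test could look like against Moto's in-memory S3. It uses `mock_aws` from Moto 5.x (older releases use `mock_s3`) and plain boto3 against stock S3 semantics, so it would only cover the parts of the flow where IA matches Amazon:

```python
import boto3
from moto import mock_aws  # moto >= 5; older versions: `from moto import mock_s3`


@mock_aws
def test_multipart_roundtrip():
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="test-item")

    # Initiate, upload a single (final) part, and complete.
    mpu = s3.create_multipart_upload(Bucket="test-item", Key="file.bin")
    part = s3.upload_part(
        Bucket="test-item",
        Key="file.bin",
        PartNumber=1,
        UploadId=mpu["UploadId"],
        Body=b"hello world",
    )
    s3.complete_multipart_upload(
        Bucket="test-item",
        Key="file.bin",
        UploadId=mpu["UploadId"],
        MultipartUpload={"Parts": [{"ETag": part["ETag"], "PartNumber": 1}]},
    )

    obj = s3.get_object(Bucket="test-item", Key="file.bin")
    assert obj["Body"].read() == b"hello world"
```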
Feature request / enhancement for IA CLI: ability to resume the uploading of a single file.
Use case:
I need to upload relatively big files, with limited upload bandwidth available; each upload can require 12 hours or more to complete; sometimes I just get disconnected for a few minutes, which forces me to start the whole upload process over.
Automatic resume in case of errors would be optimal, but not required for my personal case.
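For what it's worth, a rough sketch of how client-side resume state could be persisted between runs. The directory, file naming and JSON layout here are invented purely for illustration, and tie into the open question above about where such temporary, transfer-related information should live:

```python
import json
from pathlib import Path

STATE_DIR = Path.home() / ".cache" / "ia-multipart"  # hypothetical location


def load_state(identifier: str, filename: str) -> dict:
    """Return any saved upload state (uploadId plus finished parts)."""
    path = STATE_DIR / f"{identifier}--{filename}.json"
    if path.exists():
        return json.loads(path.read_text())
    return {"upload_id": None, "parts": {}}  # part number -> ETag


def save_state(identifier: str, filename: str, state: dict) -> None:
    """Persist the upload state so an interrupted transfer can resume."""
    STATE_DIR.mkdir(parents=True, exist_ok=True)
    path = STATE_DIR / f"{identifier}--{filename}.json"
    path.write_text(json.dumps(state))
```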