
Add resume single file upload #386

Open
pixelrust opened this issue Jan 12, 2021 · 14 comments

@pixelrust

Feature request / enhancement for IA CLI: ability to resume the uploading of a single file.

Use case:
I need to upload relatively big files with limited upload bandwidth available; each upload can require 12 hours or more to complete, and sometimes I get disconnected for just a few minutes, which forces me to restart the whole upload from scratch.

Automatic resume in case of errors would be optimal, but not required for my personal case.

@jjjake
Owner

jjjake commented Jan 20, 2021

Thanks for the suggestion @pixelrust, this would be really nice!

I think the way to do it would be adding support for multipart uploads to internetarchive.

Hopefully I can find time for this someday, but if anyone else has time to work on this please let me know if you have any questions.

@darkstar

darkstar commented Jul 2, 2021

This would help me a lot. I have been getting lots of "connection aborted", "bad status line", etc. errors from IA recently; uploading dozens of large (multi-GB) files is a real chore because ia stops working through the file list after the exception (instead of retrying the whole file, or at least moving on to the next file).

Some of the errors I'm getting:

requests.exceptions.ConnectionError: (ProtocolError('Connection aborted.', BadStatusLine("''",)), 'https://s3.us.archive.org/someitem/some4gbfile.zip')

requests.exceptions.ConnectionError: (ProtocolError('Connection aborted.', error(104, 'Connection reset by peer')), 'https://s3.us.archive.org/someitem/some4gbfile.zip')
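
Not a fix for resuming, but as a stopgap the whole-file upload can at least be wrapped in a retry loop so a dropped connection doesn't kill the rest of the queue. A minimal sketch, assuming the module-level internetarchive.upload helper and an already-configured ia install; the file is still re-sent from the beginning on every failure:

```python
# Stopgap: retry a whole-file upload after connection errors (no resume).
import time
import requests
import internetarchive

def upload_with_retries(identifier, filepath, max_attempts=5, wait=60):
    for attempt in range(1, max_attempts + 1):
        try:
            internetarchive.upload(identifier, files=[filepath], verbose=True)
            return
        except requests.exceptions.ConnectionError as exc:
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(wait)
    raise RuntimeError(f"giving up on {filepath} after {max_attempts} attempts")

upload_with_retries("someitem", "some4gbfile.zip")
```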

@maxz
Contributor

maxz commented Jul 21, 2021

I have been working on this for a few weeks now.

The Internet Archive's API seems to differ from the official Amazon S3 API in some places.
There is no copying of byte ranges, so the copy operation of multipart uploads cannot be implemented.

Additionally, the limits are different. Amazon has a minimum part size of 5 MB; with the Archive, I tested part sizes as low as 1 KB and they still worked. That is already impractically low, so testing anything even lower would be pointless.
I think we should probably stay with the 5 MB and not go lower than 1 MB.
I did not test whether the 10,000-part limit is higher for the Internet Archive, because it is a pain to upload that many parts just for a test.
Additionally, some parts of the multipart upload API are rather unstable or unreliable. The endpoint responsible for listing already uploaded parts should support listing 1,000 parts at once according to the S3 documentation, but the Internet Archive's version just keeps processing for hours without returning anything when queried for that many parts.
When querying 100 parts at once it still seems reliable. Anything in between is a game of chance and seems to depend on the current workload. Querying 1,000+ parts across multiple 100-part queries is no problem.
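
For reference, the paginated part listing looks roughly like the sketch below. It assumes IA's S3-like endpoint accepts the standard ListParts query parameters (max-parts, part-number-marker) and IA's LOW access-key authorization header; whether every parameter is honoured is an assumption based on the tests described above.

```python
# Sketch: list already-uploaded parts in pages of 100, since larger pages
# have been unreliable in practice. Follows the standard S3 ListParts call.
import xml.etree.ElementTree as ET
import requests

S3_URL = "https://s3.us.archive.org"

def list_parts(bucket, key, upload_id, access_key, secret_key, page_size=100):
    parts, marker = [], 0
    while True:
        resp = requests.get(
            f"{S3_URL}/{bucket}/{key}",
            params={"uploadId": upload_id,
                    "max-parts": page_size,
                    "part-number-marker": marker},
            headers={"Authorization": f"LOW {access_key}:{secret_key}"},
        )
        resp.raise_for_status()
        root = ET.fromstring(resp.content)
        ns = root.tag.split("}")[0] + "}" if root.tag.startswith("{") else ""
        for part in root.findall(f"{ns}Part"):
            parts.append({
                "PartNumber": int(part.find(f"{ns}PartNumber").text),
                "ETag": part.find(f"{ns}ETag").text,
            })
        if root.findtext(f"{ns}IsTruncated") != "true":
            return parts
        marker = int(root.findtext(f"{ns}NextPartNumberMarker"))
```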

Yesterday I slowly started to merge the multipart upload functionality into the actual ia code.
I found some functions in there which I had implemented similarly, and therefore extended them to support my use case instead of keeping multiple very similar functions in the code base.

I think I might create a pull request for the main functionality as soon as I have tested it again.

Properly adding the support to ia will then be another pull request.
I'm not entirely sure yet where to add it. My current candidate point is upload_file.
I would greatly appreciate your input here, @jjjake.
And since I have only tested the MD5 version so far, I will also have to test how it interacts with AWS Signature Version 4, because those signatures replace the MD5 hashes as soon as they are present.

I'm also not yet sure where on disk the information about the uploaded parts should be saved. Is there a canonical directory for such temporary, transfer-related information yet?
This part would not be strictly required.
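
To make the open question concrete: there is no canonical directory in the library yet, so the location below is purely an assumption, as are all the names. A sketch of what locally stored resume state could look like, kept under the XDG cache directory:

```python
# Hypothetical on-disk resume state: one JSON file per (identifier, filename),
# holding the upload ID and the ETags of the parts uploaded so far.
# The directory name is an assumption, not an existing internetarchive path.
import hashlib
import json
import os

def state_path(identifier, filename):
    cache = os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
    digest = hashlib.sha1(f"{identifier}/{filename}".encode()).hexdigest()
    directory = os.path.join(cache, "internetarchive", "multipart")
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, f"{digest}.json")

def save_state(identifier, filename, upload_id, parts):
    with open(state_path(identifier, filename), "w") as f:
        json.dump({"upload_id": upload_id, "parts": parts}, f)

def load_state(identifier, filename):
    try:
        with open(state_path(identifier, filename)) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```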

@jjjake
Owner

jjjake commented Jul 26, 2021

If we add support for multi-part upload, it should be added as a new Item method.

One issue for archive.org with multipart uploads is that if you're not really sure what's going on behind the scenes or how to manage these types of uploads, they have the potential to create a lot of cruft on archive.org (partially uploaded files, etc.). This is already a minor issue, and multipart upload isn't even part of the client yet. So it will be important to make this feature a non-default that you have to explicitly call, as well as providing very clear documentation on how it works.
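
To illustrate the "explicit, non-default" shape this could take, a hypothetical method signature is sketched below; none of these names exist in internetarchive, they only show the idea of keeping the multipart path separate from the normal upload.

```python
# Hypothetical shape of an explicit, opt-in Item method; illustrative only.
class Item:
    def upload_file_multipart(self, filepath, *, part_size=5 * 1024 * 1024,
                              resume=True, **request_kwargs):
        """Upload a single file in parts so an interrupted transfer can be
        resumed. Must be called explicitly; plain Item.upload() keeps its
        current single-request behaviour. Partially uploaded files remain on
        archive.org until the multipart upload is completed or aborted, so
        callers are expected to do one or the other."""
        raise NotImplementedError("illustrative sketch only")
```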

@maxz
Contributor

maxz commented Jul 27, 2021

That was also a thought I had. I was hoping that the Archive's servers just periodically flushed old __ia_spool entries.

It would be impossible to implement proper bookkeeping solely client-side. What if someone starts a 200 GB upload and just never finishes it?
The API could periodically be queried for outstanding uploads, so that whenever someone used the command-line tool from any computer with the same account, it would abort uploads older than e.g. 4 weeks.
That would still not cover the case where someone began an upload and simply stopped using ia afterwards.
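
The stale-upload sweep described above could look roughly like this. A sketch, assuming IA's endpoint answers the standard S3 ListMultipartUploads (?uploads) and AbortMultipartUpload (DELETE ...?uploadId=...) calls, which, given the quirks already described, is not guaranteed:

```python
# Sketch: list in-progress multipart uploads for a bucket and abort those
# older than four weeks. Endpoint behaviour is assumed to follow standard S3.
import datetime as dt
import xml.etree.ElementTree as ET
import requests

S3_URL = "https://s3.us.archive.org"
MAX_AGE = dt.timedelta(weeks=4)

def abort_stale_uploads(bucket, access_key, secret_key):
    auth = {"Authorization": f"LOW {access_key}:{secret_key}"}
    resp = requests.get(f"{S3_URL}/{bucket}", params={"uploads": ""}, headers=auth)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    ns = root.tag.split("}")[0] + "}" if root.tag.startswith("{") else ""
    now = dt.datetime.now(dt.timezone.utc)
    for upload in root.findall(f"{ns}Upload"):
        key = upload.findtext(f"{ns}Key")
        upload_id = upload.findtext(f"{ns}UploadId")
        initiated = dt.datetime.fromisoformat(
            upload.findtext(f"{ns}Initiated").replace("Z", "+00:00"))
        if now - initiated > MAX_AGE:
            requests.delete(f"{S3_URL}/{bucket}/{key}",
                            params={"uploadId": upload_id}, headers=auth)
```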

@jjjake
Owner

jjjake commented Jul 27, 2021

@maxz After talking with others here, I think the right thing to do is clearly document that any partial uploads older than 90 days may be deleted in some circumstances. So, I think that takes care of this concern! I still think it should be a new method and a new option in the CLI.

@pixelrust
Author

Support and thanks to anyone who may be working on this...

@theDutchess

Are you still looking for help? I am looking for a job, but I can help out until I start work.

@maxz
Contributor

maxz commented Jan 6, 2022

No. There is nothing you could really help with regarding this feature.
The main parts are finished, but shared parts in Item have to be refactored.
It's unpleasant work and would only become harder by adding more people.

But thank you for the interest. There are many other issues in the tracker which you could resolve.

@theDutchess

thank you

Repository owner deleted a comment from theDutchess Jan 24, 2022
Repository owner deleted a comment from theDutchess Jan 24, 2022
@JustAnotherArchivist
Contributor

FWIW, I wrote an independent script for multipart uploading a little while ago. After using it a fair bit in different scenarios (including uploading from stdin #326 and parallel connections for better upload speeds from transatlantic connections), I can definitely confirm that it is a headache. Automatic cleanup of stale uploads on the IA side should probably be implemented before making this available to the general user. I also noticed that IA's processing of multipart uploads is much less efficient than direct uploads, which may be a concern as well. Further, there are issues with item creation from a multipart upload (the creation happens asynchronously after the multipart upload is initiated, and uploading the parts only starts working a bit later), and listing in-progress multipart uploads has TLS certificate problems. Much fun, can't recommend at this time.

@maxz
Contributor

maxz commented Jan 25, 2022

> FWIW, I wrote an independent script for multipart uploading a little while ago. After using it a fair bit in different scenarios (including uploading from stdin #326 and parallel connections for better upload speeds from transatlantic connections), I can definitely confirm that it is a headache.

I also built it outside of internetarchive at first to test the feature, and integrated it into my own library after experimenting with it in pure curl.

I did not test parallel connections because I'm heavily against implementing those. The IA servers are already strained and slow enough. Uploads take long enough in the current state. I don't want people to just easily increase that load for not much gain. For someone dealing with repeatedly interrupted large uploads, on the other hand, multipart uploads should considerably reduce strain, though they obviously add some transfer and processing overhead compared to a flawless upload, depending on the chosen part size.

> Automatic cleanup of stale uploads on the IA side should probably be implemented before making this available to the general user.

I don't see that happening any time soon. Sadly, it seems like we can't quite work in tandem with the backend and rely on them to fix things. There have been more pressing matters which have gone unfixed for years now (e.g. supporting range requests in the Wayback Machine, which would be required to reliably download files there; I could eventually download the particular files whose downloads kept aborting, but it is a problem that could always resurface with other files).

> I also noticed that IA's processing of multipart uploads is much less efficient than direct uploads, which may be a concern as well. Further, there are issues with item creation from a multipart upload (the creation happens asynchronously after the multipart upload is initiated, and uploading the parts only starts working a bit later), and listing in-progress multipart uploads has TLS certificate problems.

I could not observe any problems with uploading the parts after the initialisation, but there certainly are many idiosyncrasies (already written about above) in this part of the IA S3 API which don't even match the version of the S3 API specification they are based on.
And yes, as you also observed, the endpoint used for listing multipart uploads and the endpoint used to list multipart upload parts both redirect to HTTP endpoints. The documentation also recommends against using those endpoints as an integral part of a multipart upload; it recommends storing the information locally. I'm merely using them as a fallback right now.

Completing uploads did not take any noticeable time, but I did not test it with very high part numbers. Some other parts of the API have major problems if the number of elements is too high, and no problems if the work is spread over multiple requests. If that is also the case for completing uploads, it would be another problem, because an upload can't be completed in multiple steps.
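
For context, completion is a single request that carries every (part number, ETag) pair in one XML body, which is why it can't be spread over several calls. A sketch, assuming IA accepts the standard CompleteMultipartUpload format and the LOW authorization header:

```python
# Sketch: CompleteMultipartUpload sends all parts in one XML document;
# there is no way to complete an upload incrementally.
import requests

S3_URL = "https://s3.us.archive.org"

def complete_upload(bucket, key, upload_id, parts, access_key, secret_key):
    # parts: list of {"PartNumber": int, "ETag": str}, in ascending order
    body = "<CompleteMultipartUpload>" + "".join(
        f"<Part><PartNumber>{p['PartNumber']}</PartNumber>"
        f"<ETag>{p['ETag']}</ETag></Part>" for p in parts
    ) + "</CompleteMultipartUpload>"
    resp = requests.post(
        f"{S3_URL}/{bucket}/{key}",
        params={"uploadId": upload_id},
        data=body.encode(),
        headers={"Authorization": f"LOW {access_key}:{secret_key}",
                 "Content-Type": "application/xml"},
    )
    resp.raise_for_status()
    return resp
```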

Properly integrating it into the internetarchive library is a different matter, though. I've been working on that on and off for months now, whenever I have some time. It mostly involves reading and testing the Item module and then refactoring it faithfully. This has led to some interesting discoveries so far of parts which are currently entirely unneeded or even make things harder to read, and I could get rid of a few of those. But I'm still trying to find the right abstractions so that the whole module ends up in a more readable state while barely requiring any code duplication for the new multipart_upload method.

Right now the biggest burden on my mind regarding this feature is how to write proper tests for it. I could not think of a way yet, and there are no similar tests in the current test suite which I could use as a basis.

@JustAnotherArchivist
Contributor

> I did not test parallel connections because I'm heavily against implementing those. The IA servers are already strained and slow enough. Uploads take long enough in the current state. I don't want people to just easily increase that load for not much gain.

It's much more complicated than 'it's already overloaded anyway', but yes, this is part of the reason why I didn't link my script. (I don't want to go into the details, but in the use case behind my script, parallelism is not just beneficial but really required and recommended by IA staff. In the general case, you're absolutely right.)

> I don't see that happening any time soon.

Neither do I, and there are definitely way more important issues that should've been tackled years ago (something something cookies). So yeah, basically this can't really be implemented reliably without awful crutches.

> I could not observe any problems with uploading the parts after the initialisation

It depends on a lot of factors, including time of day (IA server load) and latency to IA. If the item doesn't already exist, CreateMultipartUpload creates it, but this doesn't happen immediately, and the client gets back an HTTP 200 before the corresponding archive.php task actually runs. For that reason, when you start the first part upload shortly after (within a minute or so in my tests), it fails with a 404 'bucket does not exist' or similar (I don't remember the exact wording). I ended up checking for the bucket's existence repeatedly after the initiation and only starting the first part upload once that succeeded.
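
That workaround reads roughly like the sketch below. It assumes a HEAD request on the bucket URL returns 404 until the archive.php task has created the item, which is an assumption; any request that fails cleanly on a missing item would do.

```python
# Sketch: after CreateMultipartUpload on a new item, poll until the bucket
# actually exists before sending the first part.
import time
import requests

S3_URL = "https://s3.us.archive.org"

def wait_for_bucket(bucket, access_key, secret_key, timeout=300, interval=10):
    headers = {"Authorization": f"LOW {access_key}:{secret_key}"}
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.head(f"{S3_URL}/{bucket}", headers=headers)
        if resp.status_code == 200:
            return
        time.sleep(interval)
    raise TimeoutError(f"bucket {bucket} still missing after {timeout}s")
```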

> And yes, as you also observed, the endpoint used for listing multipart uploads and the endpoint used to list multipart upload parts both redirect to HTTP endpoints. The documentation also recommends against using those endpoints as an integral part of a multipart upload; it recommends storing the information locally. I'm merely using them as a fallback right now.

Yeah, and although the servers are reachable over HTTPS, they serve an invalid TLS certificate. I contacted info@ about this a while ago and was, unsurprisingly, told that they'd look into it but might not fix it 'immediately'.
My script doesn't actually handle this at all; it just keeps the upload ID and parts data in memory and prints it to stderr on crashes. (I never said it was a good script...)

> Completing uploads did not take any noticeable time, but I did not test it with very high part numbers.

Neither did I, only a few dozen parts. But I didn't mean it's inefficient on the S3 client level. Rather, the archive.php tasks on IA are inefficient. Specifically, it first syncs the parts from the S3 nodes to the item server (fine), then calculates the checksums and syncs them to the backup server, then merges the parts into the final file, and then hashes/syncs that again. The hashing and syncing is already one of the slowest parts most of the time, and doing it twice certainly doesn't help... (I do understand why it may be needed, of course, but there is certainly room for optimisation, namely processing the multipart completion in the same task if it's already queued.)
As an anecdotal point of reference, I uploaded 50 GiB of data (ten 5 GiB files) as multiparts to one item the other week. The upload of this only took 20 minutes or so, but it then took IA 10 hours (!) to process this. Normal uploads are typically done in 4-5 minutes for one 5 GiB file, maybe 10 if the particular item server is busy. Perhaps I was just unlucky there; I didn't use this at scale so far.

> Right now the biggest burden on my mind regarding this feature is how to write proper tests for it. I could not think of a way yet, and there are no similar tests in the current test suite which I could use as a basis.

Yeah, this has been bothering me for a while in general. Quite a few parts of IA are poorly tested because testing them would require modifying test items on the live system, which is probably undesirable. But that's really the only way to properly test software that interacts with another system, unless you can faithfully emulate that system locally (e.g. a MinIO S3 server for standard S3, but IA isn't fully compatible...).

@tungol

tungol commented Mar 31, 2024

If you want to write tests against the S3 API, Moto is very good: https://github.com/getmoto/moto

A version modified to emulate IA's incompatibilities shouldn't be that hard to make, although maintenance over time is a different question.
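
As a starting point, a moto-backed test could look like the sketch below. It assumes a recent moto (the mock_aws decorator) plus boto3, and it only exercises standard S3 semantics; IA's deviations (part-listing limits, delayed bucket creation, etc.) would still need a patched backend or live tests.

```python
# Sketch: exercising the multipart flow against moto's in-memory S3.
import boto3
from moto import mock_aws

@mock_aws
def test_multipart_roundtrip():
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="someitem")
    mpu = s3.create_multipart_upload(Bucket="someitem", Key="big.bin")
    data = b"x" * (5 * 1024 * 1024)  # 5 MiB, the real S3 minimum part size
    part = s3.upload_part(Bucket="someitem", Key="big.bin", PartNumber=1,
                          UploadId=mpu["UploadId"], Body=data)
    s3.complete_multipart_upload(
        Bucket="someitem", Key="big.bin", UploadId=mpu["UploadId"],
        MultipartUpload={"Parts": [{"ETag": part["ETag"], "PartNumber": 1}]})
    obj = s3.get_object(Bucket="someitem", Key="big.bin")
    assert obj["Body"].read() == data
```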
