Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSC2246: Asynchronous media uploads #2246

Merged
merged 30 commits into from
Apr 25, 2023
Merged
Changes from 16 commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
b439277
Proposal for asynchronous media uploads
tulir Aug 24, 2019
a83c79c
Add security consideration and mention possible /create request body
tulir Aug 24, 2019
9a395ed
Expand on security consideration
tulir Aug 25, 2019
29e3463
Change error code for existing media PUT
tulir Aug 26, 2019
7cf22be
Add spam prevention error, example response and auth requirements
tulir Aug 26, 2019
658aac8
Add query parameters to control downloading
tulir Aug 1, 2020
bbd7d08
Rename max_stall to max_stall_ms
tulir Aug 1, 2020
0bffcb7
Merge branch 'master' into asynchronous_uploads
tulir Aug 1, 2020
4d009a9
Specify /create request body content
tulir Aug 1, 2020
1cbc04e
Update links, versions and prefixes
tulir Mar 14, 2022
c65f2bf
Add recommended maximum delay for starting upload
tulir Mar 14, 2022
63cef50
Explicitly deny uploading to other users' media IDs
tulir Mar 14, 2022
8ccf85f
Prevent spam by ratelimiting instead of limiting number of IDs
tulir Mar 18, 2022
12e907b
Remove streaming requirement
tulir Mar 18, 2022
d582bb3
Add unstable prefixes for new error codes
tulir Mar 18, 2022
173edf3
Add missing words
tulir Mar 18, 2022
f438754
Change /create endpoint to use v1
sumnerevans Jul 8, 2022
725675c
Reorganize /upload spec and integrate feedback
sumnerevans Jul 8, 2022
d55f1f9
Rename max_stall_ms -> timeout_ms
sumnerevans Jul 8, 2022
955177b
Mention that maximum value for timeout_ms should be imposed by the se…
sumnerevans Jul 8, 2022
823fcca
Mention that the timeout_ms can be ignored if the media exists already
sumnerevans Jul 8, 2022
3b00026
Change M_NOT_YET_UPLOADED to use 504 status code
sumnerevans Jul 8, 2022
9627af2
Remove retry_after_ms optional field
sumnerevans Jul 8, 2022
045c21e
Make unused_expires_at the deadline for the upload to complete rather…
sumnerevans Mar 30, 2023
011031b
Add notes about suggested rate-limiting techniques
sumnerevans Mar 30, 2023
6cb7e31
Recommend 24 hours instead of 1 minute
sumnerevans Mar 30, 2023
fedc697
Merge pull request #3 from beeper/async-uploads-rate-limiting-improve…
tulir Mar 31, 2023
7652f59
Clarify that rate limiting can apply on /create and /upload
sumnerevans Apr 20, 2023
098dd90
Clarify that unused_expires_at is a POSIX millisecond timestamp
sumnerevans Apr 20, 2023
9559ab0
Merge pull request #4 from beeper/asynchronous_uploads
tulir Apr 20, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
98 changes: 98 additions & 0 deletions proposals/2246-asynchronous-uploads.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,98 @@
# Asynchronous media uploads
tulir marked this conversation as resolved.
Show resolved Hide resolved
Sending media to Matrix currently requires that clients first upload the media
to the content repository and then send the event. This is a problem for some
use cases, such as bridges that want to preserve message order, as reuploading
a large file would block all messages.

## Proposal
This proposal proposes a way to send the event containing media before actually
uploading the media, which would make the aformentioned bridge message order
preservation possible without blocking all other messages behind a long upload.

In the future, this new functionality could be used for streaming file
transfers, as requested in [matrix-spec#432].

### Content repository behavior
The proposal adds two new endpoints to the content repository API and modifies
the download and thumbnail endpoints.

#### `POST /_matrix/media/v3/create`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This endpoint, but not the newly-proposed /upload, doesn't qualify for a v3 version just yet under the new versioning scheme:

Suggested change
#### `POST /_matrix/media/v3/create`
#### `POST /_matrix/media/v1/create`

(PUT /upload does qualify for the v3 version because the versioning is done on the path rather than the method, therefore since we're adding a backwards-compatible change to the client-server API version we do not need to bump it to v4. We're also not intending to support the v1 or (implied) v2 versions of /upload which literally exist, so we stay with v3 - the current version).

Create a new MXC URI without content. Like `/upload`, this endpoint requires
auth and returns the `content_uri` that can be used in events.
tulir marked this conversation as resolved.
Show resolved Hide resolved

The request body should be an empty JSON object. In the future, the body could
be used for metadata about the file, such as the mime type or access control
settings (related: [MSC701]).

The server may optionally enforce a maximum age for unused media IDs to delete
media IDs when the client doesn't start the upload in time, or when the upload
was interrupted and not resumed in time. The server should include the maximum
timestamp to start the upload in the `unused_expires_at` field in the response
JSON. The recommended default expiration for starting the upload is 1 minute.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit unsure about this. The resources taken up by an unused media ID is minimal (just a row in the DB), whereas failing to upload media in time means that the associated event forever links to broken data.

The scenario I'm thinking of is dodgy/slow internet connection, where it wouldn't be unreasonable to struggle to upload the media in time. We could increase the recommended default expiry, but I'm not entirely sure what the expiry really buys in practice.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expiry is to prevent clients from retrying/waiting forever for files that are never uploaded. Increasing the suggested time may be reasonable if the upload fails for any reason as it would prevent retrying if you're uploading a large file making that media ID stale.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been thinking about this a bit, and I agree it would be good to support both requests for the media a) initially blocking until the media starts getting uploaded, and b) after a while returning an error code immediately if the media hasn't been uploaded yet.

I guess the questions are:

  1. Do we want to forbid the client from uploading the media after the server has started returning errors for the media?
  2. Relatedly: what error should we return? A 5xx would indicate that it might later succeed and would allow late uploads, or a 404 if we don't.
  3. Do we want or need to tell the client how long they have to upload the media? What would the client do with that information?

Allowing late uploads means that every time a client comes to display the media it will retry, whereas if we return a 404 it could potentially cache that the media will never be uploaded. My hunch though is that clients won't ever bother caching error responses, so there's not much need to disallow late uploads?

I don't think this is a huge issue either way ftr.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Web clients will cache the error response implicitly, particularly with aggressive cache-control headers, fwiw.

A possible attack angle here is that a malicious user/client could reserve a ton of event IDs (hundreds per minute with current rate limits and how cheap this request is to call) and send those into a room. The clients in that room then valiantly try to download that media, exhausting the available connections on the media server while it sits there blocking for media. This is somewhat of a risk if the user were to upload a ton of media to a room today, however servers generally cache or keep the recently uploaded media close so can respond to an influx of download requests quickly, reducing the attack surface by enough to not be as much of a concern.

To resolve this, maybe we don't use rate limits as our primary defence. We should still rate limit to avoid clients sending a billion requests per second, but we could additionally ask that servers employ an internal (undocumented/not-revealed) quota for the number of incomplete uploads a user can have. When the user exceeds that quota, the oldest incomplete upload(s) are expired such that they return immediate 404s to /download and similar.

If the server uses a small enough quota (say, 10?), then at worst the user can cause fewer requests which contribute to the server's maximum open file limit, similar to if the client uploaded large files which force the connection to be held open for longer.

This sort of quota system also allows clients to take an eternity to download and re-upload their files, which I think should be a feature of an async upload. A use case would be a non-messenger client which knows that an MXC URI is needed but the file upload is reliant on user input: it could take a few minutes for the user to find the file on their harddrive or they might realize that it's on a whole other machine and finish the upload the next day.

For clarity: we wouldn't have an expiration timestamp with the above quota system.

Also: I don't think quota is the word I'm after, but it's the word I'm using.


(I am concerned about this enough to block FCP, ftr)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An invisible quota system might be more a problem than a solution, as if a client expects to be able to upload 50 concurrent files, while the server internally only allows 10, then it'll result in unstable/unusable UX. (E.g. those first 40 files erroring out on others' screens, and/or erroring on the uploader's side, depending how this is implemented)

If there's a quota, the server should expose it, so clients can schedule uploads accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's quite a bit different, yes. With this MSC the attacker can exhaust a server's resources trivially, per my comment above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go with the quota idea, it seems like we would want to have an expiration timeout still, in case a client crashes and never finishes the upload.

We don't want a dead upload to permanently cause the user's quota to be lowered.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we make the download/thumbnail endpoints return immediately when if the media isn't uploaded yet, but give some metadata?

For example, upload_status of UPLOAD_NOT_STARTED, UPLOAD_STARTED in those cases. If the media is uploaded, then it can just return immediately as well.

This will allow clients to show something useful in these cases as well.

I think this solution would prevent malicious actors from causing servers to keep open tons of connections as they can respond immediately, and close the connection.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also a new potential usecase on the horizon where we'd want extremely long-lived media uploads (could take literal hours to finish uploading the media), so would be good to consider what a timeout-less system looks like I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that these concerns have been addressed in the latest version of the MSC. The main mechanisms are:

  1. Implementing the "quota" system described above where the quota is on the number of "pending media uploads" (un-expired MXCs that haven't received their upload)
  2. Allowing servers to return earlier than the timeout_ms if desired (for example, if the server is running out of available connections)

I think that exploring a timeout-less system should be done in a separate MSC.


The server should also rate limit requests to create media. When combined with
the maximum age for created media IDs, it effectively prevents spam by creating
lots of unused ids.

##### Example response
```json
{
"content_uri": "mxc://example.com/AQwafuaFswefuhsfAFAgsw",
"unused_expires_at": 1647257217083
}
```

#### `PUT /_matrix/media/v3/upload/{serverName}/{mediaId}`
tulir marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the intention that this endpoint support the content-type header and filename query parameter, as https://spec.matrix.org/v1.6/client-server-api/#post_matrixmediav3upload does? It looks like the sample implementation at https://github.com/turt2live/matrix-media-repo/pull/364/files#diff-3ddbc505e50723b8440369eef4dc1fa055026668151b1dbecbd1725fc6765727 does, but this isn't mentioned in the MSC.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. In the future those details could be moved to the /create call request body, but that wasn't defined here

Upload content to a MXC URI that was created earlier. If the endpoint is called
with a media ID that already has content, the request should be rejected with
the error code `M_CANNOT_OVERWRITE_MEDIA` and HTTP status code 409. The endpoint
tulir marked this conversation as resolved.
Show resolved Hide resolved
should also reject upload requests from users other than the user who created
tulir marked this conversation as resolved.
Show resolved Hide resolved
the media ID. This endpoint requires auth.

If the upload is successful, an empty JSON object and status code 200 is
tulir marked this conversation as resolved.
Show resolved Hide resolved
returned. If the serverName/mediaId combination is not known or not local, an
`M_NOT_FOUND` error is returned. For other errors, such as file size, file type
or user quota errors, the normal `/upload` rules apply.

The client should include a `Content-Length` header to ensure the server knows
if the file was uploaded entirely. If the server receives a different amount of
data than specified in the header, the upload must fail.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a general recommendation not specific to this MSC - I think we can turn this into a clarification issue on the spec and just add it.


#### Changes to the `/download` and `/thumbnail` endpoints
A new query parameter, `max_stall_ms` is added to the endpoints that can
tulir marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also add the words "the server can and should impose a maximum value for this parameter".

download media. It's an integer that specifies the maximum number of
milliseconds that the client is willing to wait to start receiving data.
turt2live marked this conversation as resolved.
Show resolved Hide resolved
The default value is 20000 (20 seconds).

If the data is not available before the specified time is up, the content
repository returns a `M_NOT_YET_UPLOADED` error with a HTTP 404 status code.
tulir marked this conversation as resolved.
Show resolved Hide resolved
The error may include an additional `retry_after_ms` field to suggest when the
client should try again.
erikjohnston marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is ever going to be respected - clients will just tightloop /download instead of waiting. We should just remove it.


For the `/download` endpoint, the server could also stream data directly as it
is being uploaded. However, streaming creates several implementation and spec
complications (e.g. how to stream if the media repo has multiple workers, what
to do if the upload is interrupted), so specifying exactly how streaming works
is left for another MSC.
Comment on lines +96 to +100
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, this is sort of an implied feature from most sane media repo implementations: it's quite rare that a server ever wants to buffer the media into memory so will stream it from a cache, disk, or external service as it receives the data. By extension, they already will have had to figure out what stream interruption, multiple workers, etc means. However, clients obviously shouldn't rely on this functionality existing, but I also don't think it necessarily needs an MSC either.

Copy link
Contributor

@ShadowJonathan ShadowJonathan Jun 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this paragraph has some hidden history, this is a change from the original requirement (include streaming) after I commented about some security concerns here.

TLDR: it's not as much about being able to stream, but that the file is mutable and uncommitted while it's streamed, that there's an edge case where; if a client closes the upload stream, starts again, but then uploads a wholly different file, how should everything then fail/proceed gracefully to accommodate this new situation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that sounds like a thing we should fix as part of this MSC regardless, to be honest. Does a partial failed upload count as uploading the media? (ie: does the server mark the media as "uploaded" at the start or the request handling or at the end?) what happens if the client tries to upload the media twice (concurrently)?

Copy link
Contributor

@ShadowJonathan ShadowJonathan Jun 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also poked at the "partial uploads" problem here; imo, there shouldn't be partial uploads, and the client should send content-length in the request body to have the server make sure it received everything before it commits the file.

Concurrent uploading is something I hadn't considered.


## Potential issues
Other clients may time out the download if the sender takes too long to upload
media.

## Alternatives

## Security considerations
turt2live marked this conversation as resolved.
Show resolved Hide resolved

## Unstable prefix
While this MSC is not in a released version of the spec, implementations should
use `fi.mau.msc2246` as a prefix and as an `unstable_features` flag in the
`/versions` endpoint.

* `POST /_matrix/media/unstable/fi.mau.msc2246/create`
* `PUT /_matrix/media/unstable/fi.mau.msc2246/upload/{serverName}/{mediaId}`
* `?fi.mau.msc2246.max_stall_ms`
* `FI.MAU.MSC2246_NOT_YET_UPLOADED`
* `FI.MAU.MSC2246_CANNOT_OVERWRITE_MEDIA`

[matrix-spec#432]: https://github.com/matrix-org/matrix-spec/issues/432
[MSC701]: https://github.com/matrix-org/matrix-doc/issues/701