[WIP] MSC2706: IPFS as a media repository for Matrix #2706

Draft
wants to merge 1 commit into
base: old_master
77 changes: 77 additions & 0 deletions proposals/2706-IPFS.md
@@ -0,0 +1,77 @@
# MSC2706: IPFS as a media repository

The current media/content repository in Matrix is somewhat reliant on the origin server staying
online indefinitely to serve the media, which is not always the case. Some servers may be bandwidth
constrained (don't want to be dealing with thousands of people requesting media from them) or simply
go down for maintenance/indefinite closure. When this happens, it would be useful to have media
stored on other nodes and have a way to contact them.

We could invent our own system for finding out which other servers have a copy of the given media
and gossip it, or we could rely on a solution that has solved this problem.

[IPFS](https://ipfs.io/) describes itself as a peer-to-peer hypermedia protocol and fits perfectly
within Matrix's vision of an open, secure, and decentralised world. It handles media distribution
for free (from our perspective) and is easily integrated into Matrix.

## Proposal

If not obvious by now, the proposal is to use IPFS within Matrix for media handling. Unfortunately,
this proposal does not recommend using `ipfs://` URIs in place of `mxc://` for backwards compatibility
reasons; however, if sufficient adoption is achieved then Matrix could easily switch over to that.
For now, clients and servers *should* handle `ipfs://` URIs if they see them, but this proposal
mostly focuses on introducing IPFS in a backwards compatible manner.

**TODO: Decide if not using `ipfs://` is a mistake.**

IPFS uses "content IDs" (or "cid") to reference media which are compatible with Matrix's media IDs (**TODO: CONFIRM**),
making migration even easier. To support backwards compatibility with older clients
and servers, the media ID is proposed to be formatted as `ipfs:<cid>` for IPFS-hosted media. This
will allow legacy servers and clients to contact their homeserver and resolve it to an IPFS gateway

Who pins the IPFS content? IMO, server-side pinning creates an opportunity for managing retention and redaction using IPFS.

Member Author

What do you mean by pinning here?


From https://docs.ipfs.io/concepts/persistence/,

> IPFS nodes treat the data they store like a cache, meaning that there is no guarantee that the data will continue to be stored. "Pinning" a CID tells an IPFS server that the data is important and mustn't be thrown away.

AFAIK, the current method of retrieving Matrix media effectively "pins" media on all participant servers. Ideally, a server could do a reference count on IPFS resources and pin them accordingly. The difficult part is that there is no standard way of determining which media an event references without knowing its schema. I.e., if I create a new event type and upload media with it, the server has no clear way of pulling the media out of that new event except by searching for all mxc URLs.

A future P2P world could use a more conservative pinning algorithm.
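
The "search for all mxc URLs" fallback described above could look roughly like the following Python sketch. The helper names `find_mxc_uris` and `count_media_references` and the regular expression are assumptions made for illustration; nothing here is specified by the MSC, and a real server would also need to handle redactions and state.

```python
import re
from collections import Counter
from typing import Any, Iterable, Iterator

# Loose pattern for illustration only: origin, a slash, then a media ID.
MXC_RE = re.compile(r"mxc://[^/\s]+/\S+")


def find_mxc_uris(value: Any) -> Iterator[str]:
    """Recursively yield every MXC URI found in a JSON-like value."""
    if isinstance(value, str):
        yield from MXC_RE.findall(value)
    elif isinstance(value, dict):
        for item in value.values():
            yield from find_mxc_uris(item)
    elif isinstance(value, list):
        for item in value:
            yield from find_mxc_uris(item)


def count_media_references(events: Iterable[dict]) -> Counter:
    """Count how many events reference each MXC URI (once per event)."""
    counts: Counter = Counter()
    for event in events:
        counts.update(set(find_mxc_uris(event.get("content", {}))))
    return counts
```

A server taking this approach might pin any media whose count is above zero and unpin it once the count drops to zero, without needing to understand the event schema.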

Member Author

I'm not sure this is the MSC to solve that problem tbh. The server doesn't need to pin it, and in popular enough rooms the media will get shared across other nodes naturally.

We could try and pin the media to a server, however in a p2p environment we'd probably want to do the opposite in support of freedom?

Member Author

fwiw it is not proposed (and won't be proposed when this MSC is de-drafted) to have the old media system disappear. It would still exist, just at a lesser prominence than IPFS.


> I'm not sure this is the MSC to solve that problem tbh.

Yeah, probably best to leave that up to a media pruning MSC.

> The server doesn't need to pin it, and in popular enough rooms the media will get shared across other nodes naturally.

If not pinned, all participant nodes will prune it if it's not accessed for a while, so at least one node has to pin it. This could be just the originating server, but if that server goes offline, the media can't be accessed. Retaining access after a server goes offline may also be beyond the scope of this MSC. Adding a file to IPFS pins it by default (unless otherwise specified), though, so if the originating server adds it, others will be able to access it.

> We could try and pin the media to a server, however in a p2p environment we'd probably want to do the opposite in support of freedom?

True... AFAIK, each client would serve as a pinning node in this case, so technically no "servers."


Pinning it on clients can reveal your IP address to all other participants.
In large rooms this is harder because it's harder to tell what IP belongs to what user, but in smaller rooms, it's easier.
In DMs it's trivial:

  • send a file
  • see the one IP that pinned it

to be served while indicating to supporting implementations that they do not need to contact the
origin server and can instead use IPFS directly to retrieve the media.

For completeness, an example IPFS-styled MXC URI would be `mxc://example.org/ipfs:cidgoeshere`.
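
As a rough sketch of how an implementation might recognise such a URI, the Python below splits an MXC URI into its origin and media ID and checks for the proposed `ipfs:` prefix. The function name `parse_mxc` and the shape of the returned tuple are assumptions for illustration, not part of the proposal, and the CID itself is not validated.

```python
from typing import Optional, Tuple

IPFS_PREFIX = "ipfs:"  # media ID prefix proposed by this MSC


def parse_mxc(uri: str) -> Tuple[str, str, Optional[str]]:
    """Split an MXC URI into (origin, media_id, cid).

    cid is None when the media ID does not use the proposed `ipfs:` prefix.
    Hypothetical helper: the CID is returned as-is, without validation.
    """
    scheme = "mxc://"
    if not uri.startswith(scheme):
        raise ValueError("not an MXC URI: " + uri)
    origin, sep, media_id = uri[len(scheme):].partition("/")
    if not origin or not sep or not media_id:
        raise ValueError("MXC URI needs an origin and a media ID: " + uri)
    if not media_id.startswith(IPFS_PREFIX):
        return origin, media_id, None
    cid = media_id[len(IPFS_PREFIX):]
    if not cid:
        raise ValueError("empty CID in IPFS-styled media ID: " + uri)
    return origin, media_id, cid
```

A supporting implementation could fetch the media over IPFS whenever `cid` is not `None`, while a legacy one would keep requesting the full media ID from its homeserver as usual.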

This may conflict with #2703 (`unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"`). Maybe #2703 could be expanded to allow certain exceptions for protocols like IPFS?

Member

I believe #2703 could use `pchar = unreserved / pct-encoded / sub-delims / ":" / "@"` instead, sans `pct-encoded`?

@lidel lidel Feb 25, 2021

If #2703 used self-describing CIDs, you would not need the `ipfs:` prefix.
You could leverage the codec field (table) for different types of Media ID.


It might also be helpful to use a CID that points to an IPLD object containing metadata, such as filename, MIME type, or Content-Disposition data, as well as a link to the actual data. This could help implement #2702.


Because clients can embed an IPFS node into themselves or [access IPFS from the browser](https://github.com/ipfs/in-web-browsers/blob/master/ADDRESSING.md),
it would be extremely useful to allow the client to bypass the `/upload` endpoint and publish its
own MXC URI after having used a local IPFS node. Considering `ipfs://` support is not proposed here,
clients will need to get a homeserver name/origin to put into the `mxc://` URI. They'll also need to
know if the server even supports IPFS to be able to bypass `/upload` entirely.

Multiple problems with using IPFS on clients:

  • In a room with many members, an upload would cost multiples of the file size in bandwidth because the first few nodes will download from the uploader.
  • JS-IPFS isn't capable of seeding content (browser clients can't seed).
  • IPFS uses some background traffic to maintain its DHT connections, which wastes mobile data. Unacceptable on phones.
  • This would generate much more traffic than needed for downloading:
    • If viewers don't seed, then everybody will download from the uploader. In a 200-member room, an 8 MiB image would cost roughly 1.6 GiB for the uploader. Unacceptable on mobile data.
    • If viewers seed, then viewing an image would cost multiples of the file size in traffic (download it once, then seed it to multiple peers). Unacceptable on mobile data.
  • If the uploader has a slow connection, it will be slow for everybody.
  • It has to be pinned on the client, which means it will use a lot of storage. Unacceptable on phones. If it were done that way, I would have to delete Element from my phone because it would use more space than I have.
  • If a client sends a file in a DM and then goes offline, the chat partner can't download it.
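
A quick check of the uploader's cost in the no-seeding case, assuming every other room member downloads the file directly from the uploader exactly once. The function name `uploader_egress_bytes` is made up for this sketch.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def uploader_egress_bytes(members: int, file_size_bytes: int) -> int:
    """Worst case: every member except the uploader fetches the file from the uploader."""
    if members < 1:
        raise ValueError("a room needs at least one member")
    return (members - 1) * file_size_bytes


# 200-member room, 8 MiB image, nobody else seeds.
total = uploader_egress_bytes(200, 8 * MIB)
print(f"{total / GIB:.2f} GiB")  # 199 * 8 MiB = 1592 MiB, i.e. about 1.55 GiB
```

The result is slightly under the rounded figure quoted above because the uploader does not download its own file.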

@lidel lidel Jul 6, 2022

All the above, including privacy concerns when providing data to IPFS swarm, assume a full IPFS node runs on the client that is also the only provider for the CID.

However, that is not the only possible architecture. P2P part of IPFS is optional. I'd argue even running full node is optional:

  • CID could be created on the client, and sent to Matrix server as CAR file for pinning and providing.
  • Or the original file could be sent to Matrix server, and it would do the work and produce CID for it.

In both cases, the client does not leak IP by providing data to the IPFS network.
In both cases, Matrix server performs a similar storage/retrieval function to what it does right now.
The only real difference is standardizing on identifying data with CIDs.

In my mind, the value of using IPFS in Matrix is CIDs. Content-addressed identifiers allow the community to keep the data alive in addition to Matrix server operators, and use the same data outside of Matrix (and benefit from the pinning and caching on various layers).

Matrix servers could cap costs and set up policy to "pin CIDs for X amount of time/space" and if people want them to be available for longer, they can cache them on their clients, external pinning services, or run their own IPFS node and start reproviding it to the IPFS swarm on their own.

When opening very old messages which are no longer kept around by the Matrix server, one could still retrieve the content, as long as it was pinned somewhere.

  • Providing CIDs to IPFS swarm can be handled by Matrix server, OR multiple external pinning services.
  • Even when Matrix server does not have the data anymore, HTTP Gateways allow delegating retrieval of IPFS content without running full node or providing anything to the network.
    • Client can request deserialized data (and trust the gateway did it correctly - fine if it is provided by Matrix server), or make a trustless request for a verifiable `application/vnd.ipld.raw` or `application/vnd.ipld.car` response and do inexpensive validation and deserialization on the client – this way, any gateway can be used; it no longer needs to be run by the same operator as the Matrix server.
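
To illustrate why client-side validation of a raw block is inexpensive, here is a minimal Python sketch that recomputes a CID from the bytes a gateway returned and compares it with the requested one. It only covers one narrow case, which is an assumption of this sketch: a CIDv1 with the `raw` codec and a sha2-256 multihash, encoded as lower-case base32. A real client would parse the codec and hash function out of the CID it was given, and larger files are split into multiple blocks that need CAR/DAG validation instead.

```python
import base64
import hashlib


def cidv1_raw_sha256(block: bytes) -> str:
    """Compute the base32 CIDv1 (raw codec, sha2-256) for a single block."""
    digest = hashlib.sha256(block).digest()
    # <CID version 0x01> <codec 0x55 = raw> <multihash: 0x12 = sha2-256, 0x20 = 32-byte length> <digest>
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # Multibase prefix "b" means lower-case base32 without padding.
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")


def verify_raw_block(expected_cid: str, block: bytes) -> bool:
    """Return True if the block hashes to the CID that was requested."""
    return cidv1_raw_sha256(block) == expected_cid
```

Because the check depends only on the bytes and the CID, it does not matter which gateway served the response.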


To permit the bypass of `/upload`, a new capability is proposed: `m.ipfs`. When present, this indicates
to the client that the server's media repo is IPFS-capable and thus can be bypassed. Clients will still
need to know an origin to provide in the MXC URI however. Clients should use the following steps to
determine an appropriate origin:

1. The one they were explicitly provided (in the case of a user wanting to use a particular gateway).
2. The origin specified by the optional `preferred_origin` in the `m.ipfs` capability.
3. The domain name for the user's ID, as a default option.
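
The steps above can be sketched in Python as follows. The function name `choose_origin` and its parameters are hypothetical; `ipfs_capability` stands for the content of the `m.ipfs` capability as a parsed JSON object, and only the `preferred_origin` field comes from this proposal.

```python
from typing import Optional


def choose_origin(user_id: str,
                  explicit_origin: Optional[str] = None,
                  ipfs_capability: Optional[dict] = None) -> str:
    """Pick the origin to put into an `mxc://` URI, in the proposed order."""
    # 1. An origin the client was explicitly given (e.g. a user-chosen gateway).
    if explicit_origin:
        return explicit_origin
    # 2. The optional `preferred_origin` from the `m.ipfs` capability.
    preferred = (ipfs_capability or {}).get("preferred_origin")
    if preferred:
        return preferred
    # 3. The domain of the user's ID: everything after the first colon,
    #    so a port such as `example.org:8448` is kept intact.
    _localpart, sep, domain = user_id.partition(":")
    if not sep or not domain:
        raise ValueError("user ID has no domain: " + user_id)
    return domain
```

For example, `choose_origin("@alice:example.org")` falls through to step 3 and returns `example.org`.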

----

This proposal encourages client implementations to embed IPFS support to avoid having to contact
the homeserver for content. Clients might still wish to use functionality like thumbnails from the
server; however, if specified well enough by other MSCs, a client could feasibly use the `thumbnail_uri`
provided by the sending client to display appropriate content without ever having to contact the
homeserver.

## Potential issues

**TODO: Investigate ways to mitigate.**

* Retention and redaction, erasure.
* Spam, abuse, etc
* Quarantining content (not currently specified, but should be considered).

## Alternatives

**TODO: Find other solutions than IPFS and explain why they're bad.**

## Security considerations

**TODO: This.**

IPFS is known to have anonymity and privacy issues (fingerprintable, no audits, no Tor support), so it might be problematic for anonymous users. Running it on homeservers could mitigate this issue.


IPFS was never designed to be private. It was designed to be censorship and attack resistant, but not private.
So it's important to run it on servers instead of clients.


## Unstable prefix

While this MSC is not in a released version of the spec, `io.t2bot.ipfs` should be used in place of
`m.ipfs`. No special endpoints, version flags, or other prefixes are required for this MSC.
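
A client might look the capability up under both names while the MSC is unstable. The Python sketch below assumes the response body of the existing capabilities endpoint has already been parsed into a dict with a top-level `capabilities` object; the helper name `ipfs_capability` and the choice to prefer the stable name when both are present are assumptions of this sketch.

```python
from typing import Optional

STABLE_NAME = "m.ipfs"
UNSTABLE_NAME = "io.t2bot.ipfs"


def ipfs_capability(capabilities_response: dict) -> Optional[dict]:
    """Return the IPFS capability object, or None if the server does not advertise it."""
    caps = capabilities_response.get("capabilities", {})
    for name in (STABLE_NAME, UNSTABLE_NAME):
        value = caps.get(name)
        if isinstance(value, dict):
            return value
    return None
```

Callers should compare the result with `None` rather than test its truthiness, since an advertised capability with no fields would be an empty dict.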