[WIP] MSC2706: IPFS as a media repository for Matrix #2706

Draft. Wants to merge 1 commit into base old_master.

turt2live
Member

Rendered

This is inspired by matrix-media-repo's work towards IPFS, but still needs work.

This is done with a community hat on:

Signed-off-by: Travis Ralston <travis@t2bot.io>

@turt2live added the kind:feature and proposal labels Jul 28, 2020

IPFS uses "content IDs" (or "cid") to reference media which are compatible with Matrix's media IDs (**TODO: CONFIRM**),
making the process even easier to migrate. To support backwards compatibility with older clients
and servers, the media ID is proposed to be formatted as `ipfs:<cid>` for IPFS-hosted media. This
will allow legacy servers and clients to contact their homeserver and resolve it to an IPFS gateway

Who pins the IPFS content? IMO, server-side pinning creates an opportunity for managing retention and redaction using IPFS.

Member Author


What do you mean by pinning here?


From https://docs.ipfs.io/concepts/persistence/,

> IPFS nodes treat the data they store like a cache, meaning that there is no guarantee that the data will continue to be stored. "Pinning" a CID tells an IPFS server that the data is important and mustn't be thrown away.

AFAIK, the current method of retrieving Matrix media effectively "pins" media on all participant servers. Ideally, a server could keep a reference count on IPFS resources and pin them accordingly. The difficult part is that there is no standard way of determining which media an event references without knowing its schema. I.e., if I create a new event type and upload media with it, the server has no clear way of pulling the media out of that new event except by searching for all mxc URLs.
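
As a rough illustration of the "search for all mxc URLs" fallback described above, here is a minimal Python sketch that walks event content of any shape and counts how many events mention each MXC URI. The regular expression, the helper names, and the `org.example.custom` event type are assumptions made for this example; none of them come from the Matrix spec or from this MSC.

```python
import re
from collections import Counter
from typing import Any, Iterator

# Assumption: an MXC URI looks like "mxc://<server-name>/<media-id>". The
# character classes are deliberately loose and are not taken from the spec.
MXC_RE = re.compile(r"mxc://[A-Za-z0-9.\-:\[\]]+/[A-Za-z0-9._~:\-]+")


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string found anywhere inside a JSON-like value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def count_media_refs(events: list) -> Counter:
    """Count, per MXC URI, how many events mention it at least once."""
    refs: Counter = Counter()
    for event in events:
        found = set()
        for text in iter_strings(event.get("content", {})):
            found.update(MXC_RE.findall(text))
        refs.update(found)
    return refs


events = [
    {"type": "m.room.message",
     "content": {"msgtype": "m.image", "url": "mxc://example.org/abc123"}},
    # A made-up event type whose schema the server does not know.
    {"type": "org.example.custom",
     "content": {"attachments": [{"file": "mxc://example.org/abc123"},
                                 {"file": "mxc://example.org/ipfs:cidgoeshere"}]}},
]
refs = count_media_refs(events)
```

A server following this idea could keep a CID pinned while its count stays above zero. The sketch ignores redactions and encrypted events, where the URI is not visible to the server.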

A future P2P world could use a more conservative pinning algorithm.

Member Author


I'm not sure this is the MSC to solve that problem tbh. The server doesn't need to pin it, and in popular enough rooms the media will get shared across other nodes naturally.

We could try and pin the media to a server, however in a p2p environment we'd probably want to do the opposite in support of freedom?

Member Author


fwiw it is not proposed (and won't be proposed when this MSC is de-drafted) to have the old media system disappear. It would still exist, just at a lesser prominence than IPFS.


> I'm not sure this is the MSC to solve that problem tbh.

Yeah, probably best to leave that up to a media pruning MSC.

> The server doesn't need to pin it, and in popular enough rooms the media will get shared across other nodes naturally.

If not pinned, all participant nodes will prune it if it's not accessed for a while, so at least one node has to pin it. This could be just the originating server, but if that server goes offline, the media can't be accessed. Retaining access after a server goes offline may also be beyond the scope of this MSC. Adding a file to IPFS pins it by default, though (unless otherwise specified), so if the originating server adds it, others will be able to access it.

> We could try and pin the media to a server, however in a p2p environment we'd probably want to do the opposite in support of freedom?

True... AFAIK, each client would serve as a pinning node in this case, so technically no "servers."


Pinning it on clients can reveal your IP address to all other participants.
In large rooms this is harder because it's harder to tell what IP belongs to what user, but in smaller rooms, it's easier.
In DMs it's trivial:

  • send a file
  • see the one IP that pinned it

to be served while indicating to supporting implementations that they do not need to contact the
origin server and can instead use IPFS directly to retrieve the media.

For completeness, an example IPFS-styled MXC URI would be `mxc://example.org/ipfs:cidgoeshere`.
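
To make the proposed `ipfs:<cid>` media ID format concrete, the following Python sketch splits an MXC URI and reports whether the media ID is IPFS-styled. `MediaRef` and `parse_mxc` are hypothetical names for this example, and the handling is an assumption based only on the format quoted above, not behaviour defined by this MSC.

```python
from typing import NamedTuple, Optional

IPFS_PREFIX = "ipfs:"  # prefix proposed in the MSC text quoted above


class MediaRef(NamedTuple):
    server_name: str
    media_id: str
    cid: Optional[str]  # set only for IPFS-styled media IDs


def parse_mxc(uri: str) -> MediaRef:
    """Split an MXC URI and detect the proposed ipfs:<cid> media ID form."""
    scheme = "mxc://"
    if not uri.startswith(scheme):
        raise ValueError("not an MXC URI")
    server_name, sep, media_id = uri[len(scheme):].partition("/")
    if not sep or not server_name or not media_id:
        raise ValueError("MXC URI must be mxc://<server>/<media-id>")
    cid = None
    if media_id.startswith(IPFS_PREFIX):
        cid = media_id[len(IPFS_PREFIX):]
        if not cid:
            raise ValueError("empty CID after ipfs: prefix")
    return MediaRef(server_name, media_id, cid)


ipfs_ref = parse_mxc("mxc://example.org/ipfs:cidgoeshere")
legacy_ref = parse_mxc("mxc://example.org/abc123")
```

A client that understands the prefix could fetch `ipfs_ref.cid` over IPFS, while a legacy client would treat the whole media ID as opaque and ask `example.org` for it as usual.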

This may conflict with #2703 (`unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"`). Maybe #2703 could be expanded to allow certain exceptions for protocols like IPFS?

Member


I believe #2703 could use `pchar = unreserved / pct-encoded / sub-delims / ":" / "@"` instead, sans `pct-encoded`?
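
The difference between the two grammars can be checked mechanically. This Python sketch assumes #2703 restricts media IDs to the RFC 3986 `unreserved` set as quoted above, and compares that with `pchar` minus `pct-encoded`. The pattern names and the `allowed` helper are made up for this example.

```python
import re

# RFC 3986: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = r"A-Za-z0-9\-._~"
# RFC 3986: sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
SUB_DELIMS = r"!$&'()*+,;="

# Media ID grammar as quoted from #2703 in the comment above (an assumption here).
UNRESERVED_ONLY = re.compile(rf"[{UNRESERVED}]+")
# pchar without pct-encoded, as suggested in the reply.
PCHAR_NO_PCT = re.compile(rf"[{UNRESERVED}{SUB_DELIMS}:@]+")


def allowed(media_id: str, grammar: "re.Pattern[str]") -> bool:
    """Return True if the whole media ID matches the given grammar."""
    return grammar.fullmatch(media_id) is not None


strict_ok = allowed("ipfs:cidgoeshere", UNRESERVED_ONLY)   # ":" is not unreserved
relaxed_ok = allowed("ipfs:cidgoeshere", PCHAR_NO_PCT)     # ":" is a pchar
```

Under the stricter grammar the colon in `ipfs:cidgoeshere` is rejected, which is the conflict raised above. The relaxed grammar accepts it while still rejecting percent-encoded sequences.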

lidel commented Feb 25, 2021

If #2703 used self-describing CID then you would not need ipfs: prefix.
You could leverage codec field (table) for different types of Media ID.

to be served while indicating to supporting implementations that they do not need to contact the
origin server and can instead use IPFS directly to retrieve the media.

For completeness, an example IPFS-styled MXC URI would be `mxc://example.org/ipfs:cidgoeshere`.

It might also be helpful to use a CID that points to an IPLD object containing metadata, such as filename, MIME type, or Content-Disposition data, as well as a link to the actual data. This could help implement #2702.

notramo commented Jan 9, 2021

Two huge problems with this:

  • When the user originally uploads a file, it can reveal their IP address to everyone in the room, since theirs is the only node that has the file at the beginning.
  • IPFS nodes crash all consumer routers with an Intel Puma 6 or 7 chip after a few minutes. https://badmodems.com/

Multiple smaller problems:

  • In a room with many members, an upload would cost multiples of the file size in bandwidth because the first few nodes will download from the uploader
  • JS-IPFS isn't capable of seeding content (browser clients can't seed)
  • IPFS uses some background traffic to keep the DHT connections alive, which wastes mobile data. Unacceptable on phones.
  • This would generate much more traffic than needed for downloading
    • If viewers don't seed, then everybody will download from the uploader. In a 200 member room, an 8 MiB image would cost about 1.6 GiB for the uploader. Unacceptable on mobile data.
    • If viewers seed, then viewing an image would cost multiples of the file size in traffic. (Download it once, and seed it for multiple peers.) Unacceptable on mobile data.
  • If the uploader has a slow connection, it will be slow for everybody
  • It has to be pinned on the client, which means it will use a lot of storage. Unacceptable on phones. If it were done that way, I would have to delete Element from my phone because it would use more space than I have.
  • If a client sends a file in a DM and then goes offline, the chat partner can't download it
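
The bandwidth figures above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are invented here, and it assumes every other room member downloads the file exactly once with no protocol overhead.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def uploader_cost_no_seeding(members: int, file_bytes: int) -> int:
    """Bytes the uploader sends if every other member downloads from them."""
    return (members - 1) * file_bytes


def viewer_cost_with_seeding(file_bytes: int, peers_served: int) -> int:
    """Bytes a viewer moves: one download plus one upload per peer served."""
    return file_bytes * (1 + peers_served)


def server_mediated_cost(file_bytes: int) -> int:
    """Bytes a client moves when the homeserver handles distribution."""
    return file_bytes


no_seed = uploader_cost_no_seeding(200, 8 * MIB)
print(f"uploader, nobody seeds: {no_seed / GIB:.2f} GiB")
print(f"viewer seeding to 3 peers: {viewer_cost_with_seeding(8 * MIB, 3) / MIB:.0f} MiB")
print(f"client via homeserver: {server_mediated_cost(8 * MIB) / MIB:.0f} MiB")
```

With 199 downloaders the uploader sends 1592 MiB, which is in line with the rough 1.6 GiB figure quoted in the list.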

Possible solution:
Upload to the server; the server hashes and pins the file, and then every server can access, cache, replicate, or download it over IPFS.
Benefits:

  • Network traffic equals the size of the file both when downloading and uploading.
    • No unnecessary seeding or DHT gossip
  • Servers have fast connections
  • Servers have a lot of storage to pin it
  • Servers are always online, so the uploader can go offline after uploading
  • No change to the client-server spec.
  • No client-side implementation needed.

@turt2live
Member Author

@notramo please use threads to receive replies.

notramo commented Jan 11, 2021

@turt2live What do you mean by threads?

momack2 commented Feb 25, 2021

This is exciting! Anything needed from us on the IPFS side to land this? (cc @aschmahmann @Stebalien @lidel)

lidel commented Feb 25, 2021

@notramo thank you for this comprehensive list.

Small clarifications/updates:

  • Crashing consumer routers happens only when you run non-desktop settings on a consumer network.
    Consumer IPFS nodes in Brave and IPFS Desktop run with a lower connection count (aiming to average 50-300).
  • JS-IPFS is capable of delegating seeding content to preload nodes (config, FAQ)
  • DHT traffic can be decreased by setting the routing type to dhtclient. go-ipfs >0.5.0 detects when a node is behind a NAT and runs as a dhtclient automatically.

That being said, the approach you proposed (using IPFS to simplify backend media management) sounds sensible.

Matrix would no longer need to worry about facilitating data transfers: only CID would have to be passed around, and the actual data would be found and fetched over IPFS.

Moreover, instance operators could decide that their IPFS node acts only as a hot cache, and/or pin data using a vendor-agnostic API to remote services like Pinata (expect more in the future). This simplifies archival of old media without the need to manage an archive on your own.

Leveraging IPFS on the client can be added later, but I believe that as long as you use CIDs and content paths in URLs, user agents like Brave or IPFS Companion will be able to leverage them and load data from IPFS thanks to the protocol upgrade path.

Let us know if you have any questions / concerns / ideas.

@turt2live added the needs-implementation label Jun 8, 2021

@turt2live
Member Author

(please use comments on the actual diff if you're expecting engagement from MSC authors or the SCT - otherwise feedback is outright ignored)


## Security considerations

**TODO: This.**

IPFS is known to have anonymity and privacy issues (fingerprintable, no audits, no Tor support), so it might be problematic for anonymous users. Running it on homeservers could mitigate this issue.


IPFS was never designed to be private. It was designed to be censorship and attack resistant, but not private.
So it's important to run it on servers instead of clients.

notramo left a comment

Please don't connect to IPFS from clients as it's a terrible idea.


it would be extremely useful to allow the client to bypass the `/upload` endpoint and publish its
own MXC URI after having used a local IPFS node. Considering `ipfs://` support is not proposed here,
clients will need to get a homeserver name/origin to put into the `mxc://` URI. They'll also need to
know if the server even supports IPFS to be able to bypass `/upload` entirely.

Multiple problems with using IPFS on clients:

  • In a room with many members, an upload would cost multiples of the file size in bandwidth because the first few nodes will download from the uploader
  • JS-IPFS isn't capable of seeding content (browser clients can't seed)
  • IPFS uses some background traffic to keep the DHT connections alive, which wastes mobile data. Unacceptable on phones.
  • This would generate much more traffic than needed for downloading
    • If viewers don't seed, then everybody will download from the uploader. In a 200 member room, an 8 MiB image would cost about 1.6 GiB for the uploader. Unacceptable on mobile data.
    • If viewers seed, then viewing an image would cost multiples of the file size in traffic. (Download it once, and seed it for multiple peers.) Unacceptable on mobile data.
  • If the uploader has a slow connection, it will be slow for everybody
  • It has to be pinned on the client, which means it will use a lot of storage. Unacceptable on phones. If it were done that way, I would have to delete Element from my phone because it would use more space than I have.
  • If a client sends a file in a DM and then goes offline, the chat partner can't download it

lidel commented Jul 6, 2022

All the above, including privacy concerns when providing data to IPFS swarm, assume a full IPFS node runs on the client that is also the only provider for the CID.

However, that is not the only possible architecture. P2P part of IPFS is optional. I'd argue even running full node is optional:

  • CID could be created on the client, and sent to Matrix server as CAR file for pinning and providing.
  • Or the original file could be sent to Matrix server, and it would do the work and produce CID for it.

In both cases, the client does not leak IP by providing data to the IPFS network.
In both cases, Matrix server performs a similar storage/retrieval function to what it does right now.
The only real difference is standardizing on identifying data with CIDs.

In my mind, the value of using IPFS in Matrix is CIDs. Content-addressed identifiers allow the community to keep the data alive in addition to Matrix server operators, and to use the same data outside of Matrix (and benefit from the pinning and caching on various layers).

Matrix servers could cap costs and set up policy to "pin CIDs for X amount of time/space" and if people want them to be available for longer, they can cache them on their clients, external pinning services, or run their own IPFS node and start reproviding it to the IPFS swarm on their own.
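
One way to picture a "pin CIDs for X amount of time/space" policy is the Python sketch below. The `Pin` record and `select_unpins` helper are hypothetical and no real IPFS API is called; the function only decides which CIDs a server would stop pinning, given an age limit and a storage cap.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pin:
    cid: str
    size_bytes: int
    pinned_at: float  # Unix timestamp of when the server pinned it


def select_unpins(pins: List[Pin], now: float, max_age_s: float,
                  max_total_bytes: int) -> List[str]:
    """Pick CIDs to unpin: expired pins first, then oldest-first until under the cap."""
    unpin = [p.cid for p in pins if now - p.pinned_at > max_age_s]
    fresh = sorted((p for p in pins if now - p.pinned_at <= max_age_s),
                   key=lambda p: p.pinned_at)  # oldest first
    total = sum(p.size_bytes for p in fresh)
    for p in fresh:
        if total <= max_total_bytes:
            break
        unpin.append(p.cid)
        total -= p.size_bytes
    return unpin


pins = [Pin("cid-a", 10, 0.0), Pin("cid-b", 10, 50.0), Pin("cid-c", 10, 90.0)]
to_unpin = select_unpins(pins, now=100.0, max_age_s=60.0, max_total_bytes=10)
```

Here `cid-a` is dropped for age and `cid-b` for space, leaving only the newest pin. Anyone who still wants the older content would need to have pinned it elsewhere, as the comment suggests.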

When opening very old messages which are no longer kept around by the Matrix server, one could still retrieve the content, as long as it was pinned somewhere.

  • Providing CIDs to IPFS swarm can be handled by Matrix server, OR multiple external pinning services.
  • Even when Matrix server does not have the data anymore, HTTP Gateways allow delegating retrieval of IPFS content without running full node or providing anything to the network.
    • Client can request deserialized data (and trust the gateway did it correctly - fine if it is provided by Matrix server), or make a trustless request for verifiable application/vnd.ipld.raw or vnd.ipld.car response and do inexpensive validation and deserialization on the client – this way, any gateway can be used, it no longer needs to be run by the same operator as Matrix server.
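
The "inexpensive validation" of a trustless response can be sketched for the simplest case: a single block returned as `application/vnd.ipld.raw` whose CID is a base32 CIDv1 with the `raw` codec and a sha2-256 multihash. The Python below uses only the standard library. The function names are invented for this example, and other CID versions, codecs, and hash functions are out of scope.

```python
import base64
import hashlib

RAW_CODEC = 0x55  # multicodec code for "raw"
SHA2_256 = 0x12   # multihash code for sha2-256
HEADER = bytes([0x01, RAW_CODEC, SHA2_256, 32])  # CIDv1, raw, sha2-256, 32-byte digest


def make_raw_cid(data: bytes) -> str:
    """Build a base32 CIDv1 (raw codec, sha2-256) for a single block."""
    # Every header value is below 0x80, so each varint fits in one byte.
    cid_bytes = HEADER + hashlib.sha256(data).digest()
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")


def verify_raw_block(cid: str, block: bytes) -> bool:
    """Check that a block fetched from any gateway matches the CID requested."""
    if not cid.startswith("b"):
        return False  # only the base32 multibase prefix is handled here
    body = cid[1:].upper()
    try:
        raw = base64.b32decode(body + "=" * (-len(body) % 8))
    except ValueError:
        return False
    if len(raw) != len(HEADER) + 32 or raw[:len(HEADER)] != HEADER:
        return False
    return hashlib.sha256(block).digest() == raw[len(HEADER):]


block = b"hello matrix"
cid = make_raw_cid(block)
ok = verify_raw_block(cid, block)
tampered = verify_raw_block(cid, b"something else")
```

Because the check depends only on the bytes and the CID, the gateway does not need to be trusted. Larger files are typically split into a DAG of blocks, which would need a CAR response and a traversal that this sketch does not attempt.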

DarkKirb left a comment

I expanded a bit on this MSC here: DarkKirb@d8516e5

The biggest difference between the current MSC and my changes is that I added a pinning endpoint. I also changed the model a bit from the original vision. In many cases, nothing will change for the client, except that they can download the media from an IPFS gateway of their choice. Clients will continue uploading to the media server by default, due to the concerns others have listed here, which are now also listed in the MSC. I linked some related RFCs.
