Re-organizing exchanges, the dag, and blockstores #255
Comments
Most things make sense to me, I like this design, though I'm not fully familiar with the current design so I might miss some things. Here are my notes so far:
Good points. Thanks!
I didn't want to overload the concept of generic DAGs with "The DAG". However, in go-ipfs, I believe this is called DAG anyways so... you're probably right.
Good point. I used Kill mostly because my mail program uses the term to "kill" threads. However,
I don't think that would work well as described but you bring up a good point. Simply returning random blocks from a query would:
Also, while that might make it easier to stream data from multiple peers, that would make duplicate data much worse for cases where we already have subparts of the graph. One thing we could do, if the query language ends up being powerful enough, is to ask one peer for the first half of the children of a node and another peer for the second half (without knowing the children and/or number of them up-front when we initiate the query). We'd still receive the root node at least twice but that's not that bad (it's receiving long paths through the DAG that's the problem). However, this is an application-level optimization we can add in later.
I agree. It should be a combination of us telling the DAG and the DAG introspecting into the data (possibly allowing plugins to register rules). As for how to tell the DAG, we can:
As a note, we have discovered that caching blocks in memory doesn't give a big benefit in synthetic workloads. It might be different in real-world scenarios, but we will need more metrics on that.
What about caching fully parsed nodes? Also, I wouldn't be surprised if caching blocks starts to make a difference as we fix some of our other performance bottlenecks.
@Stebalien caching fully parsed nodes is fine IFF we make sure the ipldnode.Node.Copy() method is implemented correctly all over (or some other method of mutation protection is implemented).
I'm 👍 to the Block/Node/DAG layering. This makes sense to me and is what I would lean towards doing in any case. For the query, mixing the
I haven't yet put much thought into the event proposal, but a quick idea on 'selecting which events' you want: it could be done by passing a filter on the context used to subscribe.
Another performance issue to keep in mind: when giving bitswap the dag, we don't necessarily want to unmarshal every single block we're sending out. Either bitswap needs to be careful about this, or we need to have lazy unmarshaling support on Node objects.
Will comment more as I digest things.
My plan is to get rid of Copy from the interface entirely as nodes are immutable. If you want to mutate a node, you either need to:
My worry is that we'll want to be able to select over several queries at once (from multiple sources). We could use lots of go routines but that seems a bit icky.
I'm planning on using types to differentiate between different events so doing that would be a bit tricky (hello reflection). I really don't think subscribing to everything and throwing away stuff you're not interested in will be an issue in general.
True... for DAGSwap (IPLD selectors), we'll need the fully deserialized object in some cases, so I think lazy unmarshaling support with link caches is probably the best bet here.
So, this had some interesting ideas but is way too monolithic (with no clear a, then b, then c path) to be a useful roadmap.
So, after writing this, I realized that it should be multiple proposals/documents (it branched off in many directions). However, I'll clean that up later (and probably turn it into PRs on the relevant spec repos). I'm posting it here to get some early(ish) feedback.
So, there have been rumblings about improving the blockstore, blockservice, dag, and exchange APIs and how they relate. While I've seen a lot of one-off PRs and miscellaneous issues, I haven't seen any concerted efforts to re-imagine how all these pieces fit together. This proposal attempts to collect all these issues in one place and sketch out a proposal for addressing them in a cohesive manner.
Note: I have no intention of trying to actually design the IPLD selector language in this doc. I only talk about IPLD selectors because quite a few things will need to change to make them work once we've designed them and we absolutely need them for performance.
Observations
First, let's start with some observations. I've tried to add links to relevant issues but I know I'm missing quite a few so please add any you can think of.
IPLD Selectors
Small aside for those that aren't familiar with the concept of IPLD selectors...
Currently, to fetch a DAG, we ask for the root, enumerate its children, ask for its children, etc., recursively. This means two things for us:
IPLD selectors allow us to ask for an entire DAG, or select some portion of a DAG, all at once to avoid this back and forth.
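To make the cost concrete, here's a toy sketch of the current fetch pattern; the Cid, Node, and NodeGetter types below are stand-ins for illustration, not the real go-ipfs ones.

```go
package sketch

import "context"

// Toy stand-ins for illustration only; the real types are go-cid CIDs,
// IPLD nodes, and the NodeGetter interface.
type Cid string

type Node interface {
	Links() []Cid
}

type NodeGetter interface {
	Get(ctx context.Context, c Cid) (Node, error)
}

// fetchAll shows the current pattern: one request per node, and we only
// learn about a node's children after the node itself has arrived, so a
// DAG of depth d costs at least d sequential round trips.
func fetchAll(ctx context.Context, get NodeGetter, root Cid) error {
	queue := []Cid{root}
	for len(queue) > 0 {
		c := queue[0]
		queue = queue[1:]
		nd, err := get.Get(ctx, c) // potentially a full network round trip
		if err != nil {
			return err
		}
		queue = append(queue, nd.Links()...)
	}
	return nil
}

// With IPLD selectors, the same intent could become a single streaming
// request, e.g. something like dag.Query(ctx, "/ipld/QmObj/**")
// (a hypothetical API).
```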
Back on topic...
The Exchange
The Exchange currently lives inside the BlockService. However, to support IPLD selectors, it will need to understand IPLD and should therefore have access to a DAG, not a blockservice/blockstore.
Issues:
Multiple Sources
In the future, we'd like to add Ethereum, Bitcoin, and possibly even services like GitHub as external read-only BlockStores. The current architecture:
Issues:
The Exchange Interface
The exchange interface is responsible for two things:
Bitswap Is Messy
According to @Kubuxu, it has grown organically over time and needs a cleanup. According to @whyrusleeping, the provider service should go elsewhere.
Other issues include:
Issues:
Slow IO Path
We're doing too much in the IO path. That is, when a user adds/removes a file, we should make doing that the top priority and potentially delay everything else if it bogs us down. For example, we currently use a lot of resources publishing provider records as we add files to IPFS instead of doing this lazily after the fact. I believe a redesign of how the exchange plugs into the storage layer can both fix this problem and make it unlikely to happen again as we add more exchanges.
N Different Event Systems
We have two "notifiee" systems for handling stream add/remove events, a Has function for notifying the Exchange that we now have a block, and we're about to get yet another event system built into the blockservice to help with GC. All of these systems have slightly different semantics.
Issues:
Wantlists Can Get Large
We should consider sending diffs. This is especially a problem when downloading large dags.
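As a rough illustration, a diff-based wantlist message might only carry what changed since the last message. The field names below are invented; the real bitswap message format is a protobuf and may end up looking quite different.

```go
package sketch

// WantlistDiff is a hypothetical wire structure for incremental wantlist
// updates: only additions and removals since the last message are sent.
type WantlistDiff struct {
	Add    []WantEntry // entries added since the last message
	Remove []string    // CIDs (shown as strings here) we no longer want
	Full   bool        // if true, Add is the complete wantlist (a resync)
}

// WantEntry is a single wanted CID plus its priority.
type WantEntry struct {
	Cid      string
	Priority int32
}
```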
Goals
So, now that we've discussed some of the outstanding issues, let's lay out a few concrete goals. We need to support:
Finally, we need to make it easier to reason about our code, simplify some of our interfaces, and remove as much code as we can.
Design Discussion
Instead of just laying out the design, I'm going to try to walk through it motivating every choice. I should probably summarize this sans-motivation below but good and accurate summaries are actually quite time-consuming to write...
IPLD Selectors and Multiple Sources
By themselves, the following three features are fairly straightforward:
However, this all becomes rather tricky when you combine either the tiered source or the parallel source requirements with IPLD selectors.
Latency Tiers
When combining tiered sources with IPLD selectors, different latency tiers may have different parts of the desired DAG and there's no way to know this before partially executing the query. This will likely be a very common situation for websites that use IPNS, as updates to the website will always change the root node but won't usually change many other nodes.
For example, take the following simple query to retrieve a node and all of its children (we'll want to support this query regardless of the query language we choose):
/ipld/QmObj/**
We'd first query our 0-latency (memory) tier to collect everything reachable from the root (QmObj) that we have in memory. Then, we'd go to disk and finally to bitswap. However, now let's say that bitswap returns a node that links to /ipld/QmSomethingWeHave. We now need to tell all of our peers to stop sending us nodes from that sub-DAG and start over from the top (querying memory, then disk, then the network). Worse, determining that we have something may not be very fast (it may require asking a local NAS or the peer next door).
So, we need some way to "kill" subtrees. There are at least two ways to do this:
For the reasons outlined above, I'm going to propose the second option.
However, for this kill mechanism to be effective, we need some time to actually propagate this information. I propose two compatible solutions: pick a good search algorithm that gives us time to kill subtrees and use a per-query send window mechanism.
First, the order in which we process the DAG being queried is actually quite important.
If we use a depth-first search, the client has (almost) no time between receiving the root of a tree and its first child. This means it won't have any time to "kill" that subtree before it receives blocks it doesn't want.
If we use a breadth-first search, the client has all the time in the world (assuming a large branching factor) to tell the server not to send it a subtree. Unfortunately, time efficient BFS algorithms use memory linear in the number of nodes.
However, we can do a hybrid that should work fairly well: traverse the immediate children of a node first, then recurse on each of the children. This gives us a "buffer" of the branching factor in which we can "kill" the first child without ever receiving a block we don't want. One potential worry with this strategy is that we could have to load blocks from disk twice if we have a very large, very flat DAG (where the children of a single node don't fit into memory). However, at this point, I think that extra disk read will be the least of our concerns.
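A sketch of that hybrid order, reusing the toy Cid, Node, and NodeGetter types from the earlier sketch, with a hypothetical Killset tracking the subtrees the client has told us to skip:

```go
package sketch

import "context"

// Killset is a hypothetical record of subtrees the client has asked us
// to skip (because it already has them).
type Killset interface {
	Killed(c Cid) bool
}

// sendSubtree sends root, then walks the DAG in the hybrid order: all
// immediate children of a node are sent before we descend into any of
// them, giving the client a branching-factor-sized window in which to
// kill an unwanted subtree before its contents arrive.
func sendSubtree(ctx context.Context, get NodeGetter, root Cid, kills Killset, send func(Cid, Node) error) error {
	nd, err := get.Get(ctx, root)
	if err != nil {
		return err
	}
	if err := send(root, nd); err != nil {
		return err
	}
	return sendChildren(ctx, get, nd, kills, send)
}

func sendChildren(ctx context.Context, get NodeGetter, parent Node, kills Killset, send func(Cid, Node) error) error {
	kids := make([]Node, 0, len(parent.Links()))
	// Pass 1 (breadth-first step): send every immediate child.
	for _, c := range parent.Links() {
		if kills.Killed(c) {
			continue // the client already has this subtree; skip it entirely
		}
		nd, err := get.Get(ctx, c)
		if err != nil {
			return err
		}
		if err := send(c, nd); err != nil {
			return err
		}
		// Holding the children in memory avoids re-reading them below; for
		// a very flat DAG a real implementation might keep only CIDs and
		// accept a second disk read instead.
		kids = append(kids, nd)
	}
	// Pass 2 (depth-first step): recurse into each child.
	for _, nd := range kids {
		if err := sendChildren(ctx, get, nd, kills, send); err != nil {
			return err
		}
	}
	return nil
}
```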
Second, we should use a window mechanism like TCP. This will allow us to avoid buffering a bunch of blocks we don't want without noticing (because we're behind on processing incoming blocks). This will also help with parallel requests as we'll see in a moment.
Parallel Requests
The other wrinkle is fetching data from multiple sources; we need to avoid downloading too many duplicate blocks but want to parallelize our downloads as much as possible. Currently, we use bitswap sessions to avoid downloading too many duplicate blocks but, if we're going to start asking for entire DAGs at a time, we'll probably want to do a bit better than that.
To avoid downloading duplicate blocks from multiple peers, we can use windows. We can ask many peers (limited using the session mechanism) to execute a query but give them all small windows. Then we can pick the first peer that responds and extend its window. If that peer fails to send us data within some time frame, we can pick a different one and extend its window. We'll receive some duplicate data up-front but hopefully not too much.
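Roughly, the per-query bookkeeping this implies might look like the following; all names and constants are invented for illustration, and today's closest equivalent logic lives in bitswap sessions.

```go
package sketch

import (
	"sync"
	"time"
)

// Invented constants for illustration only.
const (
	probeWindow   = 4                // small window handed to every candidate peer
	primaryWindow = 256              // extended window for the peer we settle on
	stallTimeout  = 10 * time.Second // how long before we give up on the primary
)

// peerWindow tracks how much a single peer is allowed to send us.
type peerWindow struct {
	window   int
	lastSent time.Time
}

// querySession is a hypothetical per-query tracker: many peers start with
// small windows, the first responder becomes primary and gets a larger
// window, and a stalled primary is demoted so someone else can take over.
type querySession struct {
	mu      sync.Mutex
	peers   map[string]*peerWindow // keyed by peer ID
	primary string
}

func newQuerySession() *querySession {
	return &querySession{peers: make(map[string]*peerWindow)}
}

// onBlock is called whenever a peer delivers a block for this query.
func (s *querySession) onBlock(peer string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	pw, ok := s.peers[peer]
	if !ok {
		pw = &peerWindow{window: probeWindow}
		s.peers[peer] = pw
	}
	pw.lastSent = time.Now()
	if s.primary == "" || s.primary == peer {
		s.primary = peer
		pw.window = primaryWindow // extend the window of the peer we picked
	}
}

// checkStalled demotes a primary that has gone quiet; the next peer to
// deliver a block will be promoted in its place.
func (s *querySession) checkStalled() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.primary == "" {
		return
	}
	if pw := s.peers[s.primary]; time.Since(pw.lastSent) > stallTimeout {
		pw.window = probeWindow
		s.primary = ""
	}
}
```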
To better download from multiple sources, we should pick nodes that we know aren't going to be sent to us for a while† and add sub-queries rooted at those nodes to want-lists for other peers (peers from which we're not currently downloading anything). If one of these other peers responds, we can add that node to our do-not-want-list for the peer on which we're currently executing the original query.
Finally, we should consider having peers send "do-not-have" lists/messages, especially when executing queries and encountering sub-graphs that they cannot traverse. This will allow peers to pre-emptively execute that part of the query on different nodes.
†We can put a lower bound on how many blocks need to be processed and sent to us before the remote peer will process a given node, assuming that the query execution algorithm is deterministic and fully defined by the spec.
Event Systems
There are a few design questions to consider when coming up with an event system:
Currently, we use blocking event systems that usually wait forever if a subscriber is slow to handle an event. As far as I can tell, this is our major source of slowdown when adding large files/directories to IPFS. The blockstore appears to notify the exchange in a blocking manner and then wait for the exchange to start publishing provides for the block in question before continuing (we do rate-limit provides so at least we don't DoS ourselves). So, I'd like to use a non-blocking event system.
Given that requirement, we now need to deal with slow subscribers. To deal with occasional slowdowns, we can do a bit of limited buffering but that still won't cut it if we're producing events at a faster rate than can be handled by a subsystem. There are three possible solutions to this:
Infinite buffering is a great way to slow everything down, run out of memory, and/or expose oneself to DoS attacks so that's a non-starter.
Dropping events makes the system very hard to reason about so I'd rather not even consider that.
Finally, we're left with disconnecting slow subscribers. Basically, when a small event buffer (buffered channel) fills up and we would block publishing an event to this subscriber, we'd close the channel instead. Then, when this subscriber catches up, it would first sleep a bit (possibly using an exponential backoff), then re-subscribe and finally catch-up (re-initialize its state).
Why sleep with a backoff after being unsubscribed? When a subscriber gets disconnected, that means that the subsystem producing the events is moving significantly faster than subsystem processing events and the subsystem processing events should wait for everything to die down (assuming processing the events isn't time-sensitive).
To concretize this, if we used such an event system to notify the provider system of new blocks, the provider system would automatically pause for a bit if we start adding more files than it can handle in a short period of time.
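As a sketch of what the subscriber side could look like under this scheme; subscribe, resync, and handle are placeholders for whatever the real subsystem does, not proposed APIs.

```go
package sketch

import (
	"context"
	"time"
)

// runSubscriber is a hypothetical subscriber loop: subscribe returns a
// fresh event channel, resync rebuilds local state after (re)connecting,
// and handle processes a single event.
func runSubscriber(ctx context.Context, subscribe func() <-chan interface{}, resync func(), handle func(interface{})) {
	backoff := 100 * time.Millisecond
	for ctx.Err() == nil {
		events := subscribe()
		resync() // catch up on anything we missed while disconnected
		connected := true
		for connected {
			select {
			case ev, ok := <-events:
				if !ok {
					connected = false // we were dropped for being too slow
					break
				}
				handle(ev)
			case <-ctx.Done():
				return
			}
		}
		// The producer is outpacing us; back off before re-subscribing.
		// (A real implementation would also reset the backoff after a
		// healthy run.)
		time.Sleep(backoff)
		if backoff < 10*time.Second {
			backoff *= 2
		}
	}
}
```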
Architecture
TODO: This needs to be fleshed out a lot but it's getting late.
TODO: We need to work on the GC story here... this will be hard. We'll probably have to add new interfaces for GC able stores?
TODO: We might want to consider adding new interfaces for stores that we consider local enough to provide out of?
TODO: Filecoin/Bitswap relationship
TODO: Don't like the names in this section? Suggest new ones. I spent very little to negative (for kicks) effort on names (e.g., TheDAG).
I propose introducing the following interfaces (precise(ish) interfaces in the code section at the bottom):
BlockResolver: Read-only block fetcher. Doesn't have an existing equivalent.
BlockStore: Read-write blockstore (inherits BlockResolver). Replaces Blockstore; there is no BlockService (we won't need it).
NodeResolver: Read-only DAG/Node fetcher. Replaces NodeGetter.
NodeStore: Read-write DAG store (inherits NodeResolver). Replaces DAGService.
DAGResolver: NodeResolver that can be queried with IPLD selectors.
DAGStore: NodeStore that can be queried with IPLD selectors.
Diagram
See the following sections for a rough description of what this is.
Things in this diagram:
Other:
TheDAG
It all begins at TheDAG, the top-level DAGStore. This DAG talks to a set of remote and local DAGStores, DAGResolvers, NodeStores, and NodeResolvers. It's responsible for fetching and storing DAGs from/in these subordinate services. In the current system, we do this at the block level, but that's not sufficient when we have IPLD selectors.
Latency Tiers
In this diagram, I only include two latency tiers: local and remote. In reality, these would be broken up into finer-grained tiers that may factor in other metrics like cost (e.g., for Filecoin) and/or API limits (e.g., for GitHub).
Local Stores
TheDAG will usually talk to two local storage providers: LocalNodeStore and MemNodeStore. Depending on the machine, some of the remote stores (e.g., the Ethereum one if one is running a full client) may move over here and become local stores.
MemNodeStore is a node-level cache and MemBlockStore is a block-level cache. TheDAG is responsible for telling MemNodeStore what to store, and MemNodeStore is responsible for determining when to push down to MemBlockStore to be stored as blocks, when to store parsed nodes in MemNodeStore, and when to evict them altogether. In reality, MemBlockStore will probably be folded into MemNodeStore but it's nice to think about them as if they were different services.
LocalNodeStore is the service through which all nodes are persisted locally. It talks to LocalBlockStore which, in turn, writes data to the datastores. In IPFS today, LocalBlockStore would be the Blockstore.
Remote Stores
TheDAG can talk to many remote stores but will usually talk to BitswapClient at a minimum. BitswapClient runs all the client-side bitswap logic and shouldn't need to know anything about being a bitswap server. All communication between BitswapClient and BitswapServer happens through TheDAG and through the DecisionEngine (the engine that decides which clients should have priority when bitswapping).
RemoteNodeStore is a generic NodeStore that wraps arbitrary remote BlockStores. It's mostly here for illustration purposes.
Ethereum, GitHub, and Bitcoin are CID-specific remote DAGResolvers. TheDAG should have enough information to decide when and how to use these.
Filecoin exists for illustration purposes. This is where Filecoin could hypothetically plug in to IPFS (although BitswapClient may merge with it and BitswapServer may become a filecoin retrieval miner).
Exchanging and Providing
There are three pieces here: ExchangeNodeStore, BitswapServer, and DecisionEngine.
ExchangeNodeStore isn't a real NodeStore; it's just a facade that watches the rest of the system for nodes that we want to provide to the network, publishes provider records, and gives bitswap (and friends) access to these nodes. In the future, it could also handle some basic authentication by having bitswap pass through peer-id information (e.g., in the context or possibly using a dedicated method).
BitswapServer is the server-side logic of bitswap (it fulfills want requests and, eventually, IPLD selector requests), and DecisionEngine is the same as our current decision engine.
(Note: BestestExchange is an imaginary exchange to make a point.)
We actually have a few design decisions here.
For one, I'm tempted to push a lot of the ExchangeNodeStore into TheDAG (leaving the actual provider record publishing logic in a different service). My reasoning is that TheDAG needs to know about permissions (for when we get per-application permissions) and what's stored at what latency tiers anyways. However, we need to be careful not to get too carried away pushing stuff into TheDAG because it's convenient.
Code
Event System
Simple implementation of the proposed event system. This implementation expects subscribers to implement their own backoff mechanisms if desired.
Another limitation of this interface is that there's no way to specify what events one cares about. However, I don't think that will be much of an issue in practice (you can always add multiple event "hubs" per object if necessary).
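A minimal sketch of such an event hub, assuming each subscriber gets a small buffered channel that is closed when it falls behind; Hub, Subscribe, and Publish are placeholder names rather than the actual proposed API.

```go
package sketch

import "sync"

// Hub is a hypothetical non-blocking event hub. Each subscriber gets a
// small buffered channel; if the buffer is full when an event is
// published, the subscriber is dropped and its channel closed. A closed
// channel is the signal to back off, re-subscribe, and re-sync state.
type Hub struct {
	mu   sync.Mutex
	subs map[chan interface{}]struct{}
}

func NewHub() *Hub {
	return &Hub{subs: make(map[chan interface{}]struct{})}
}

// Subscribe returns a channel of events. bufSize should be small; it only
// needs to absorb short bursts.
func (h *Hub) Subscribe(bufSize int) <-chan interface{} {
	ch := make(chan interface{}, bufSize)
	h.mu.Lock()
	h.subs[ch] = struct{}{}
	h.mu.Unlock()
	return ch
}

// Publish never blocks: slow subscribers are disconnected instead.
func (h *Hub) Publish(ev interface{}) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.subs {
		select {
		case ch <- ev:
		default:
			// The subscriber fell behind: drop it. It will notice the
			// closed channel, sleep with backoff, and re-subscribe.
			delete(h.subs, ch)
			close(ch)
		}
	}
}
```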
Interfaces
These are the proposed replacements for the exchange, blockstore, blockservice, dagservice, etc. You'll notice there are no blockservice or exchange interfaces. We shouldn't need them (use events!).
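As a rough sketch of how the layering could nest in Go; the toy types, method names, and the Query type below are guesses for illustration, not the actual definitions.

```go
package sketch

import "context"

// Toy stand-ins: the real types would be go-cid CIDs, go-block-format
// blocks, and IPLD nodes. Query is whatever the (yet to be designed)
// IPLD selector language compiles to.
type (
	Cid   string
	Block []byte
	Query string
)

type Node interface {
	Links() []Cid
}

// BlockResolver: read-only block fetching.
type BlockResolver interface {
	GetBlock(ctx context.Context, c Cid) (Block, error)
}

// BlockStore: read-write blocks (inherits BlockResolver).
type BlockStore interface {
	BlockResolver
	PutBlock(ctx context.Context, c Cid, b Block) error
	DeleteBlock(ctx context.Context, c Cid) error
}

// NodeResolver: read-only node fetching.
type NodeResolver interface {
	GetNode(ctx context.Context, c Cid) (Node, error)
}

// NodeStore: read-write nodes (inherits NodeResolver).
type NodeStore interface {
	NodeResolver
	PutNode(ctx context.Context, n Node) (Cid, error)
	DeleteNode(ctx context.Context, c Cid) error
}

// DAGResolver: a NodeResolver that can be queried with IPLD selectors.
type DAGResolver interface {
	NodeResolver
	Resolve(ctx context.Context, q Query) (<-chan Node, error)
}

// DAGStore: a NodeStore that can be queried with IPLD selectors. It
// declares Resolve directly rather than embedding DAGResolver so the
// sketch compiles on older Go versions, but it satisfies DAGResolver too.
type DAGStore interface {
	NodeStore
	Resolve(ctx context.Context, q Query) (<-chan Node, error)
}
```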
TODO: This lacks events. It would be nice to just add Evented to every interface but it's not that simple. Getting add events from, e.g., remote DAGs/blockservices usually won't be possible so we may want to be pickier about this... Really, I think we mostly need events at the TheDAG level (and the ExchangeNodeStore level if we have that).
Diagram DOT File