Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

renepay: an experimental payment plugin #6376

Merged
merged 7 commits into from
Jul 31, 2023
Merged

renepay: an experimental payment plugin #6376

merged 7 commits into from
Jul 31, 2023

Conversation

Lagrang3
Copy link
Collaborator

@Lagrang3 Lagrang3 commented Jul 4, 2023

Overview

This PR about implementing a payment plugin called renepay that seeks to
construct optimal Multi-Path-Payments (o-MPP) in terms of fees and reliability.
The original idea was published by Rene Pickhardt and Stefan Richter [1].
There exists a python implementation of o-MPP by Pickhardt (pickhardtpayments).

This work was possible thanks to a Build on L2 Grant.

Pickhardtpayments

Any node in the Lightning Network (LN) has knowledge of the existence of public
channels and their capacities, but the channel liquidity is unknown unless that
channel is local to the node.
Therefore, when a node constructs a route to send a payment it is assuming that
all channels along the path have enough liquidity to forward the payment amount.
If the payment fails, alternative routes can be tried and the process repeats.

It is reasonable to assume that the smaller the payment amount it is more likely
that the payment attempt succeeds. With the use of MPP one could split the
payment into smaller parts and send it along different routes. However, also the
more routes a MPP has, the more likely it will fail. One must find a compromise
between the number of routes and the size of each payment part as well as an
strategy to select routes if one wishes to maximize the probability of success
of the payment.

Pickhardt-Richter [1] paper formalizes this problem stating that the liquidity
of every channel is a random variable distributed in the integer range between
0 and C, where C is the capacity of the channel. For simplicity we are
neglecting the channel reserved fund.
A MPP can also be modeled as flow in the LN, that is an integer value is
associated to each channel, which corresponds to the amount of satoshis this
channel must forward so that this MPP can be completed. Notice that a channel
could be used by two different routes within the same MPP and the flow
associated to that channel is the sum of all funds that need to get through it.
Formally a flow is an integer function in the domain of the channels.

For a realistic payment the Flow constraints include:

  1. for any channel c the flow cannot exceed the channel's capacity:
    0 <= Flow(c) <= Capacity(c);
  2. there is satoshi conservation, hence each node has a satoshi balance computed
    as the sum of all incoming flows minus the outgoing flows,
    the sum of all nodes balance must be zero.
  3. also for the payment to considered valid one must have that
    the destination node have an balance which equals the requested payment
    amount.
    The payer node have a negative balance and must correspond to the
    payment amount plus the routing fees.
    Routing nodes will have a positive balance that corresponds to the total fees
    they collect.

The probability of success of a certain MPP, or its associated Flow,
is then the multiplication of the
probabilities that each channel is able to forward the corresponding Flow:

    Prob(Flow) = Product_{channel c} Prob( c can forward Flow(c))

Maximizing this probability is equivalent to minimizing the function -log Prob(Flow)

    - log Prob(Flow) = - sum_{channel c} log Prob( c can forward Flow(c))

If we assume that:

  1. the probability distribution of the liquidity of every channel c is such that
    -log Prob(c can forward x) is always a convex function of x;
  2. and the routing fees are negligible, hence we can assume that the balance of
    every node different from the source and the destination is zero;

then the problem of finding an optimal Flow that satisfies the capacity and balance
constraints, while maximizing Prob(Flow) can be solved efficiently using
polynomial algorithms. Chapter 14 of Ahuja-Magnanti-Orlin's Network Flows book [2]
is dedicated to this class of problems.

References

[1] Rene Pickhardt, Stefan Richter. Optimally Reliable & Cheap Payment Flows on the Lightning Network https://arxiv.org/abs/2107.05322
[2] R.K. Ahuja, T.L. Magnanti, and J.B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, 1993.

Copy link
Member

@cdecker cdecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to submit these comments as usual, sorry if they are already out of date.

I'm not quite through the entire PR, but what I've seen is looking excellent 🚀

plugins/renepay/debug.h Outdated Show resolved Hide resolved
plugins/renepay/dijkstra.c Outdated Show resolved Hide resolved
plugins/renepay/dijkstra.c Outdated Show resolved Hide resolved
plugins/renepay/dijkstra.c Outdated Show resolved Hide resolved
@rustyrussell rustyrussell added this to the v23.08 milestone Jul 25, 2023
@rustyrussell
Copy link
Contributor

Great, I will rebase this, since db changes broke it...

Copy link
Contributor

@rustyrussell rustyrussell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was so nice to read! 🧡

No serious changes, mainly tweaks to things which work, but can improve, and answering some TODO.

For review, I squashed and rebased your commits without any changes, and split the ccan intro into a separate commit. Then I played with some fixups on top for you to look at:

https://github.com/rustyrussell/lightning/tree/guilt/pr-6376

plugins/renepay/dijkstra.c Outdated Show resolved Hide resolved
plugins/renepay/heap.h Outdated Show resolved Hide resolved
plugins/renepay/mcf.c Outdated Show resolved Hide resolved
plugins/renepay/mcf.c Outdated Show resolved Hide resolved
plugins/renepay/mcf.c Show resolved Hide resolved
plugins/renepay/pay.c Outdated Show resolved Hide resolved
fraction);
pay_plugin->last_time = now_sec;

// TODO(eduardo): are there route hints for B12?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. We have blinded paths instead, which must be used. Though, a node can just include a dummy blinded path to/from itself.

So, if it doesn't, we need to isolate the destination from any real channels, and create fake "channels" from the entry point using the blinded paths (they have info on fees and cltv). But that's just typing, basically, since algo stays the same.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have several blinded paths which do not sum to the full but a higher amount? basically like receivers MPP split? In that case the minimum cost flow problem could be redefined as a single source multi destination problem but the constraints become weired as the blinded paths probably expect either the full amount or no amount at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, you can offer multiple paths with lesser capacity, it will Just Work. Or multiple paths each with sufficient capacity, and sender chooses how to use them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hard to solve before the release. Can I just drop support for Bolt12 until we fix this?

plugins/renepay/pay.c Outdated Show resolved Hide resolved
// * notification. */
// // plugin_log(pay_plugin->plugin,LOG_DBG,"received shutdown notification, freeing data.");
// pay_plugin->ctx = tal_free(pay_plugin->ctx);
// return notification_handled(cmd);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird, this should work...

But you shouldn't do this anyway.

plugins/renepay/pay.h Show resolved Hide resolved
Comment on lines +68 to +73
/* Remove self from map when done */
// TODO(eduardo):
// Is this desctructor really necessary? the chan_extra will deallocated
// when the chan_extra_map is freed. Anyways valgrind complains that the
// hash table is removing the element with a freed pointer.
// tal_add_destructor2(ce, destroy_chan_extra, chan_extra_map);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you free the "ce" though, the pointer stays in the hash table! valgrind is right, this can be fatal.

You have two choices:

  1. Remove self from table on free automatically, using destructor. This is simplest!
  2. Remove from table manually. This optimization might happen after benchmarking, only!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the chan_extra stored in the chan_extra_map are allocated with the map as the parent, their destructor will first remove them from the hash table. The problem IMO is that tal will free the hash table first and then call the destructor of every child which will try to remove every chan_extra_map from a hash table that no longer exists.

I'm stuck here.
One possible solution would be to allocate the chan_extras with another parent context and ensure that that context is freed before the hash table, maybe by having that ctx as parent of the hashtable itself.

Copy link
Collaborator

@renepickhardt renepickhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not through with reviewing and have only had a high level look at the code and comments and also left only a few high level comments yet.

I am confident that this PR is a huge step in the right direction and I am looking forward to finnish / extend the review soon.

plugins/renepay/pay.c Outdated Show resolved Hide resolved
plugins/renepay/pay.c Outdated Show resolved Hide resolved
/*
* mu (μ) is used as follows in the cost function:
*
* -log((c_e + 1 - f_e) / (c_e + 1)) + μ fee
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes the paper suggests a weighted arithmetic mean between routing cost and uncertainty cost. However given all the issues that one can see from this I want to open the discussion about other means (like harmonic as deiscussed in #4771 ) or at least adding laplacian smoothing by adding some constant (c.f.: https://en.wikipedia.org/wiki/Additive_smoothing )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of adding prob. cost P and routing cost R using a fraction f (0<=f<=1):

cost = P * f + (1-f) R

so that when f=1 you get the entire costs is related to uncertainty, while with f=0 the costs is only fees. The good thing about this choice is that the interval to search is compact [0,1], and one can do binary search to look for the best value of f.
Then we upgraded this to integer numbers by adopting the current parameter mu that lies in the interval [0,MU_MAX] and

cost = P*mu + R*(MU_MAX-mu)

in essence is the same but with integers.

Allow me some time to study the discussion #4771.

plugins/renepay/payment.h Outdated Show resolved Hide resolved
plugins/renepay/flow.c Show resolved Hide resolved
plugins/renepay/flow.c Show resolved Hide resolved
plugins/renepay/flow.c Show resolved Hide resolved
bfee = c->half[dir].base_fee,
delay = c->half[dir].delay;

return pfee + bfee* base_fee_penalty+ delay*delay_feefactor;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the CLTV still be a penalty in finding payment flows?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How could we consider it otherwise? If the delay is non-important then we set the delay_feefactor=0 or just a small number.

* problem is:
*
* Find a routing solution that pays the least of fees while keeping
* the probability of success above a certain value `min_probability`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

META COMMENT: -log(Probability) can be defined as an uncertainty cost and routing fees can be considered as a routing cost. In that case we would wish to minimize the total cost which consists of the combination of both routing and uncertainty cost. I think that is a more general and less confusing way to formulate the problem.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We could formulate the problem in that way, the issue is that we don't know how to combine the uncertainty cost and routing cost, and in practice as a user I am interested in getting my payment through quickly and with low fees. It's a two dimensional optimization problem, in general the cheapest routes will not correspond to the most reliable ones. We thought the easiest user experience will be: get me any solution that satisfies this constraint in uncertainty (eg. prob must be greater than 50%) and routing costs (eg. fees are less than 10sats), in the background I will try to go as cheap as possible, but those bounds have the priority. That was the simplest choice I could make.

This process is taken care of by the binary search in the minflow function and the is_better function decides the priorities (both in mcf.c).

In the future we can improve this.

* arc_2 -> [a+(b-a)*f1, a+(b-a)*f2)
* arc_3 -> [a+(b-a)*f2, a+(b-a)*f3)
*
* where f1 = 0.5, f2 = 0.8, f3 = 0.95;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the non uniform split of the capacity! However I believe 0.5 is a pretty high probability. I fear this will trigger and saturate too large payments. I'd love to have a discussion on the choices. also one does not need to take a uniform distribution for the probability to begin with. LND uses a a bimodal distribution. However that it hard to linearize.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were just chosen by hand:

  • 4 sections seemed like a nice number because it is a power of two (so that arcs can be identified with bits)
  • from 0 to 0.5 the uniform distribution of liquidity gives an almost linear cost function, then the following pieces must have ever decreasing capacities to preserve linearity, that's why from 0.5 to the next pivot I chose dx=0.3, resulting in 0.8, while the last segment is 0.8 + 0.15 = 0.95

The last two segments are worst linear approximation than the first.

Figure_1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are more parameters can we have to tune with experience.

Copy link
Collaborator Author

@Lagrang3 Lagrang3 Jul 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@renepickhardt, there is another detail here that bothers me, it is the fact that we cut the tail of the distribution at 95% of the channel capacity. That means there might be solutions that the algorithm will not see becauseit will think that the channel is already saturated at 95%.
We need either to add more arcs above that increasing the computational and memory burden, or produce a less precise linearization, for example like extending the last bin from 80% to 99%.

This is easy to fix, because these are just parameters.

Comment on lines 1173 to 1188
struct flow *fp = tal(list_ctx,struct flow);
struct flow *fp = tal(this_ctx,struct flow);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tallocate this off flows...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caveat: these are flow candidates, minflow produces different sets flows and selects one. Ownership of the flows to the flow array is elegant and allows me to do:

if(is_better(candidate_flow_array,best_flow_array))
{
    tmp = best_flow_array;
    best_flow_array = candidate_flow_array;
    tal_free(tmp);
}

Comment on lines 1218 to 1240
/* Stablish ownership. */
for(int i=0;i<tal_count(flows);++i)
{
flows[pos++] = tal_steal(flows,ld->flow_path);
flows[i] = tal_steal(flows,flows[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and skip this loop.

plugins/renepay/pay.c Outdated Show resolved Hide resolved
[ Split into separate commit --RR ]
Signed-off-by: Lagrang3 <eduardo.quintana@pm.me>
@rustyrussell
Copy link
Contributor

Trivial rebase which merged many of the fixup commits together, made flake8 happy with the final commit, and added a Changelog-Added line.

Ack bbc9591

(Cleanups can go in followup PRs...)

rustyrussell and others added 6 commits July 31, 2023 11:21
Signed-off-by: Lagrang3 <eduardo.quintana@pm.me>
Changelog-Added: Plugins: `renepay`: an experimental pay plugin implementing Pickhardt payments (`renepay` and `renepaystatus`).
Signed-off-by: Lagrang3 <eduardo.quintana@pm.me>
- remove internal gheap checks
- add check for arc_t.chanidx overflow
- remove outdated comments
- check the delta flow bounds before augmenting along a path
- get_flow_paths uses a dynamic tal array instead of a list.
- fix a unit test that depended on the order of returned flows
- fix bug: lightnind doesn't like if I reuse the partid of a failed
  flow, therefore use a higher partid than any of the previous attempts.
- plugin_err instead of LOG_BROKEN if sendpay fails and we cannot get a
  an error code.
- fix wrong comments.
- remove the background timer.
- This is a bugfix. Previous to this the MCF network was built using the
knowledge of the min and max liquidity but it didn't take into account
pending HTLCs.
- Also remove the min_prob_success option but hardcode a 90% value.

Removing some options that are not relevant to the user, they're kept
for developer mode only:
- base_fee_penalty
- min_prob_success
- prob_cost_factor
- remove heap.h, not used

Signed-off-by: Lagrang3 <eduardo.quintana@pm.me>
The global is an *internal* hack because dijkstra_item_mover doesn't
take a context arg!  It should be used with care.

Easy, since all the accessors exist: we just hand in the struct dijkstra.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
- adopt "const <type> *"convention
- remove use_shadow option for some pyln tests
- show prob. information of flows into paynotes
- show prob. of success of entire payment flow in paynotes
- minflow: We were not releasing the memory of flow arrays when replacing
  them with a new canditate.
- use memleak_scan_obj in memleak_check
- replace u64 with size_t

Signed-off-by: Lagrang3 <eduardo.quintana@pm.me>
Signed-off-by: Lagrang3 <eduardo.quintana@pm.me>
@rustyrussell
Copy link
Contributor

... and fixed a typo in run-mcf-diamond.c, which had string length wrong:

diff --git b/plugins/renepay/test/run-mcf-diamond.c a/plugins/renepay/test/run-mcf-diamond.c
index 3ca75cec6..b764d5b0e 100644
--- b/plugins/renepay/test/run-mcf-diamond.c
+++ a/plugins/renepay/test/run-mcf-diamond.c
@@ -80,10 +80,10 @@ int main(int argc, char *argv[])
 	assert(node_id_from_hexstr("0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518", 66, &l2));
 	assert(node_id_from_hexstr("035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d", 66, &l3));
 	assert(node_id_from_hexstr("0382ce59ebf18be7d84677c2e35f23294b9992ceca95491fcf8a56c6cb2d9de199", 66, &l4));
-	assert(short_channel_id_from_str("1x2x0", 7, &scid12));
-	assert(short_channel_id_from_str("1x3x0", 7, &scid13));
-	assert(short_channel_id_from_str("2x4x0", 7, &scid24));
-	assert(short_channel_id_from_str("3x4x0", 7, &scid34));
+	assert(short_channel_id_from_str("1x2x0", 5, &scid12));
+	assert(short_channel_id_from_str("1x3x0", 5, &scid13));
+	assert(short_channel_id_from_str("2x4x0", 5, &scid24));
+	assert(short_channel_id_from_str("3x4x0", 5, &scid34));
 
 	mods = gossmap_localmods_new(tmpctx);

And rebased on master to get latest test flake fixes.

@rustyrussell
Copy link
Contributor

Ack 5056ebc

@rustyrussell rustyrussell merged commit e5695b3 into ElementsProject:master Jul 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants