Probabilistic channel scoring #1227
Conversation
lightning/src/routing/router.rs
Outdated
// The RouteHintHop fields which will eventually be used if this hop is used in a final Route.
// Note that node_features is calculated separately after our initial graph walk.
node_id: NodeId,
short_channel_id: u64,
-channel_features: &'a ChannelFeatures,
+channel_features: ChannelFeatures,
Holy malloc-traffic-in-a-tight-loop Batman! Somewhat astoundingly to me this doesn't actually seem to slow things down much, but I'd still strongly prefer to avoid the additional malloc traffic if possible. This does so and drops further fields on this struct too, which is nice:
diff --git a/lightning/src/routing/router.rs b/lightning/src/routing/router.rs
index 14844171e..a822a68d7 100644
--- a/lightning/src/routing/router.rs
+++ b/lightning/src/routing/router.rs
@@ -331,6 +331,7 @@ impl cmp::PartialOrd for RouteGraphNode {
/// A wrapper around the various hop representations.
///
/// Used to construct a [`PathBuildingHop`] and to estimate [`EffectiveCapacity`].
+#[derive(Clone, Debug)]
enum CandidateRouteHop<'a> {
/// A hop from the payer to the next hop to the payee, where the outbound liquidity is known.
FirstHop {
@@ -409,22 +410,19 @@ impl<'a> CandidateRouteHop<'a> {
/// Fee values should be updated only in the context of the whole path, see update_value_and_recompute_fees.
/// These fee values are useful to choose hops as we traverse the graph "payee-to-payer".
#[derive(Clone, Debug)]
-struct PathBuildingHop {
+struct PathBuildingHop<'a> {
// The RouteHintHop fields which will eventually be used if this hop is used in a final Route.
// Note that node_features is calculated separately after our initial graph walk.
node_id: NodeId,
short_channel_id: u64,
- channel_features: ChannelFeatures,
+ candidate: CandidateRouteHop<'a>,
fee_msat: u64,
- cltv_expiry_delta: u32,
/// Minimal fees required to route to the source node of the current hop via any of its inbound channels.
src_lowest_inbound_fees: RoutingFees,
- /// Fees of the channel used in this hop.
- channel_fees: RoutingFees,
/// All the fees paid *after* this channel on the way to the destination
next_hops_fee_msat: u64,
- /// Fee paid for the use of the current channel (see channel_fees).
+ /// Fee paid for the use of the current channel (see candidate.fees()).
/// The value will be actually deducted from the counterparty balance on the previous link.
hop_use_fee_msat: u64,
/// Used to compare channels when choosing the for routing.
@@ -432,10 +430,6 @@ struct PathBuildingHop {
/// an estimated cost of reaching this hop.
/// Might get stale when fees are recomputed. Primarily for internal use.
total_fee_msat: u64,
- /// This is useful for update_value_and_recompute_fees to make sure
- /// we don't fall below the minimum. Should not be updated manually and
- /// generally should not be accessed.
- htlc_minimum_msat: u64,
/// A mirror of the same field in RouteGraphNode. Note that this is only used during the graph
/// walk and may be invalid thereafter.
path_htlc_minimum_msat: u64,
@@ -459,11 +453,11 @@ struct PathBuildingHop {
// Instantiated with a list of hops with correct data in them collected during path finding,
// an instance of this struct should be further modified only via given methods.
#[derive(Clone)]
-struct PaymentPath {
- hops: Vec<(PathBuildingHop, NodeFeatures)>,
+struct PaymentPath<'a> {
+ hops: Vec<(PathBuildingHop<'a>, NodeFeatures)>,
}
-impl PaymentPath {
+impl<'a> PaymentPath<'a> {
// TODO: Add a value_msat field to PaymentPath and use it instead of this function.
fn get_value_msat(&self) -> u64 {
self.hops.last().unwrap().0.fee_msat
@@ -514,7 +508,7 @@ impl PaymentPath {
// set it too high just to maliciously take more fees by exploiting this
// match htlc_minimum_msat logic.
let mut cur_hop_transferred_amount_msat = total_fee_paid_msat + value_msat;
- if let Some(extra_fees_msat) = cur_hop.htlc_minimum_msat.checked_sub(cur_hop_transferred_amount_msat) {
+ if let Some(extra_fees_msat) = cur_hop.candidate.htlc_minimum_msat().checked_sub(cur_hop_transferred_amount_msat) {
// Note that there is a risk that *previous hops* (those closer to us, as we go
// payee->our_node here) would exceed their htlc_maximum_msat or available balance.
//
@@ -542,7 +536,7 @@ impl PaymentPath {
// Irrelevant for the first hop, as it doesn't have the previous hop, and the use of
// this channel is free for us.
if i != 0 {
- if let Some(new_fee) = compute_fees(cur_hop_transferred_amount_msat, cur_hop.channel_fees) {
+ if let Some(new_fee) = compute_fees(cur_hop_transferred_amount_msat, cur_hop.candidate.fees()) {
cur_hop.hop_use_fee_msat = new_fee;
total_fee_paid_msat += new_fee;
} else {
@@ -870,18 +864,15 @@ where L::Target: Logger {
PathBuildingHop {
node_id: $dest_node_id.clone(),
short_channel_id: 0,
- channel_features: $candidate.features(),
+ candidate: $candidate.clone(),
fee_msat: 0,
- cltv_expiry_delta: 0,
src_lowest_inbound_fees: RoutingFees {
base_msat: fee_base_msat,
proportional_millionths: fee_proportional_millionths,
},
- channel_fees: $candidate.fees(),
next_hops_fee_msat: u64::max_value(),
hop_use_fee_msat: u64::max_value(),
total_fee_msat: u64::max_value(),
- htlc_minimum_msat: $candidate.htlc_minimum_msat(),
path_htlc_minimum_msat,
path_penalty_msat: u64::max_value(),
was_processed: false,
@@ -977,11 +968,8 @@ where L::Target: Logger {
old_entry.total_fee_msat = total_fee_msat;
old_entry.node_id = $dest_node_id.clone();
old_entry.short_channel_id = short_channel_id;
- old_entry.channel_features = $candidate.features();
+ old_entry.candidate = $candidate.clone();
old_entry.fee_msat = 0; // This value will be later filled with hop_use_fee_msat of the following channel
- old_entry.cltv_expiry_delta = $candidate.cltv_expiry_delta();
- old_entry.channel_fees = $candidate.fees();
- old_entry.htlc_minimum_msat = $candidate.htlc_minimum_msat();
old_entry.path_htlc_minimum_msat = path_htlc_minimum_msat;
old_entry.path_penalty_msat = path_penalty_msat;
#[cfg(any(test, feature = "fuzztarget"))]
@@ -1287,12 +1275,10 @@ where L::Target: Logger {
// so that fees paid for a HTLC forwarding on the current channel are
// associated with the previous channel (where they will be subtracted).
ordered_hops.last_mut().unwrap().0.fee_msat = new_entry.hop_use_fee_msat;
- ordered_hops.last_mut().unwrap().0.cltv_expiry_delta = new_entry.cltv_expiry_delta;
ordered_hops.push((new_entry.clone(), NodeFeatures::empty()));
}
ordered_hops.last_mut().unwrap().0.fee_msat = value_contribution_msat;
ordered_hops.last_mut().unwrap().0.hop_use_fee_msat = 0;
- ordered_hops.last_mut().unwrap().0.cltv_expiry_delta = final_cltv_expiry_delta;
log_trace!(logger, "Found a path back to us from the target with {} hops contributing up to {} msat: {:?}",
ordered_hops.len(), value_contribution_msat, ordered_hops);
@@ -1463,7 +1449,7 @@ where L::Target: Logger {
// Now, substract the overpaid value from the most-expensive path.
// TODO: this could also be optimized by also sorting by feerate_per_sat_routed,
// so that the sender pays less fees overall. And also htlc_minimum_msat.
- cur_route.sort_by_key(|path| { path.hops.iter().map(|hop| hop.0.channel_fees.proportional_millionths as u64).sum::<u64>() });
+ cur_route.sort_by_key(|path| { path.hops.iter().map(|hop| hop.0.candidate.fees().proportional_millionths as u64).sum::<u64>() });
let expensive_payment_path = cur_route.first_mut().unwrap();
// We already dropped all the small channels above, meaning all the
// remaining channels are larger than remaining overpaid_value_msat.
@@ -1481,16 +1467,21 @@ where L::Target: Logger {
drawn_routes.sort_by_key(|paths| paths.iter().map(|path| path.get_total_fee_paid_msat()).sum::<u64>());
let mut selected_paths = Vec::<Vec<Result<RouteHop, LightningError>>>::new();
for payment_path in drawn_routes.first().unwrap() {
- selected_paths.push(payment_path.hops.iter().map(|(payment_hop, node_features)| {
+ let mut path = payment_path.hops.iter().map(|(payment_hop, node_features)| {
Ok(RouteHop {
pubkey: PublicKey::from_slice(payment_hop.node_id.as_slice()).map_err(|_| LightningError{err: format!("Public key {:?} is invalid", &payment_hop.node_id), action: ErrorAction::IgnoreAndLog(Level::Trace)})?,
node_features: node_features.clone(),
short_channel_id: payment_hop.short_channel_id,
- channel_features: payment_hop.channel_features.clone(),
+ channel_features: payment_hop.candidate.features(),
fee_msat: payment_hop.fee_msat,
- cltv_expiry_delta: payment_hop.cltv_expiry_delta,
+ cltv_expiry_delta: payment_hop.candidate.cltv_expiry_delta(),
})
- }).collect());
+ }).collect::<Vec<_>>();
+ let mut prev_cltv_expiry_delta = final_cltv_expiry_delta;
+ for hop in path.iter_mut().rev() {
+ core::mem::swap(&mut prev_cltv_expiry_delta, &mut hop.as_mut().unwrap().cltv_expiry_delta);
+ }
+ selected_paths.push(path);
}
if let Some(features) = &payee.features {
Heh, thought there might be some concern over that. Thanks for the workaround! Will incorporate it.
Note that you can also drop the short_channel_id field trivially. We should also be able to drop the node_id field, even, given that it's all findable, but it's a bit more work (and probably involves dropping the use of the DirectedChannelInfo constructors entirely, or at least moving the NodeId into the DirectionalChannelInfos instead of having it outside of them in the ChannelInfo).
Any chance you could edit the patch to use tabs for indentation? Probably a copy/paste issue.
> Note that you can also drop the short_channel_id field trivially.

Looks like your most recent version did so. 🙂

> We should also be able to drop the node_id field, even, given that it's all findable, but it's a bit more work (and probably involves dropping the use of the DirectedChannelInfo constructors entirely, or at least moving the NodeId into the DirectionalChannelInfos instead of having it outside of them in the ChannelInfo)

Haven't taken a look at this in detail, but note that DirectedChannelInfo has accessors for the source and target NodeIds.
> Haven't taken a look at this in detail, but note that DirectedChannelInfo has accessors for the source and target NodeIds.

Right, I guess we could store the DirectedChannelInfo directly in the CandidateRouteHop::PublicHop variant, but for now it's not there and there's no easy way to get at the NodeId because it loses direction information. I hacked up a solution to this to see and it wasn't really any faster despite saving the 32 bytes. I think there's a direction there, but I may have added more branching, which made things worse enough that saving the memcpys didn't help.
Ah, yeah.. I think I couldn't get that to work when returning the DirectedChannelInfo from a closure. Probably just need to have the variant hold the ChannelInfo reference, too. But I also liked that direction was not an Option.
Maybe not worth worrying about now.
Force-pushed from d64b8d4 to 63b509f.
All generally pretty straightforward once I wrapped my head around the type indirection, which I always find a bit confusing, but it's not clear how to cut down the size of the patch much more without making it harder to reason about the logic.
lightning/src/routing/scoring.rs
Outdated
enum Probability {
    Zero,
    One,
    Ratio { numerator: u64, denominator: u64 },
}
Why not just store an f64, given it's going to be divided at the callsite? I'm not really sure why we need a struct here at all; f64 is pretty well understood for probability, and you can still check for equality with zero and one without loss of precision.
Had been considering whether having it in this form would be useful for approximating the log as per #1166 (comment). Basically, avoid the division and approximate the log of the numerator and denominator separately, taking the difference. Though, admittedly, the approximation technique wasn't very clear to me from that comment. Could use f64 if you prefer.
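
For illustration, a minimal sketch of that idea, using an integer-only base-2 log as a stand-in for whatever approximation #1166 had in mind (the helper names and precision here are assumptions, not the PR's code):

// Hypothetical sketch: approximate -log2(numerator / denominator) as
// log2(denominator) - log2(numerator), avoiding the division entirely.
fn log2_integer_part(x: u64) -> f64 {
    debug_assert!(x > 0);
    // Integer part only; a fuller version would add a fractional correction.
    (63 - x.leading_zeros()) as f64
}

fn negative_log2_ratio(numerator: u64, denominator: u64) -> f64 {
    debug_assert!(0 < numerator && numerator <= denominator);
    log2_integer_part(denominator) - log2_integer_part(numerator)
}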
Right, fair, avoiding the FPU in the inner loop of the router may be useful, but given this is an internal API that we can change later, I don't see why we should eat the complexity now vs just adding it later if we want to do some approximation. Sadly, our benchmarks for routing currently don't hit the scoring code much because the scorer is uninitialized initially. Doesn't need to happen in this PR, but ideally we'd add the scorer to the benchmark(s).
Leaving in for now to try and update the benchmark in parallel with review, but happy to remove it before merge if needed.
While the approximation of the negative log probability is just this mentioned fraction, all my tests show that this linear approximation is not too useful, as it (similarly to the linear fee rate) tends to saturate channels when computing the optimal split. As far as I understand, here we are not computing optimal splits anyway, but I thought it useful to point out that the convexity of the negative log prob is actually very desirable.
This problem of fully saturating channels when switching to the linearized problem can be avoided by changing the model of the uncertainty network and constraining channel capacities to be lower than they actually are. However, as of right now it is totally not clear to me how to do this properly without arbitrary choices. Yet it would allow us to use simplex-based min cost flow solvers for linear min cost flow problems, which would yield a huge runtime improvement (assuming the base fee is set to zero or heavily disincentivised, as ZmnSCPxj suggested).
@renepickhardt Not sure I quite understand the comment. The discussion here is only about approximating log10, not about approximating the negative log of the success probability.
So you're saying we should just use a linear division instead of log entirely? That is confusing to me.
lightning/src/routing/scoring.rs
Outdated
/// [`ProbabilisticScoringParameters`] for details.
///
/// [1]: https://arxiv.org/abs/2107.05322
pub type ProbabilisticScorer<G> = ProbabilisticScorerUsingTime::<G, ConfiguredTime>;
Hmm, if we're not gonna delete Scorer (I think we can, tbh, given it's naive and we have a trait for it anyway, but if you don't want to that's fine), then we should probably put this new stuff in its own module, no?
I figured we'd get rid of Scorer eventually. Could do so now, if you want... I have no strong preference. But I didn't want to use the same name, given the serialization formats are different. Not sure if it makes sense to blindly upgrade users (if we could even detect that we're reading the old format?) or force them to by changing the type name to ProbabilisticScorer. Let me know what you think.
Hmmm, fair, I guess if the intent is to leave both for a version or two and then drop Scorer, that seems fine to me. Fine to leave them in the same module for now, then.
lightning/src/routing/scoring.rs
Outdated
Probability::One => 0,
Probability::Ratio { numerator, denominator } => {
    let success_probability = numerator as f64 / denominator as f64;
    (-(success_probability.log10()) * amount_msat as f64) as u64
}
Does it make sense to multiply by the amount here, given we're not trying to do any bandwidth-based scoring techniques? If we do eventually get a smarter router, it seems like that will just blow up the path into a ton of MPP retries. I'm a bit confused because this seems quite non-intuitive to me; I'd think you'd want some (configurable) constant multiplied by the success probability log.
Wasn't really sure about this and had posed a question here: #1172 (comment). Had left the amount in the draft while waiting for a response, but yeah, maybe a constant/param makes more sense.
I was primarily concerned with mixing two costs (a probability and the channel fees). Wasn't sure if we should normalize to use msats (as done here) -- and if so, how -- or have the scorer work in terms of either a probability or a penalty (e.g., by returning an enum) and having the router determine whether the channel fees should be considered at all based on the variant.
Right, we have to normalize, but to normalize we'd want to make it a knob, I think - "msats charged per log(failure %)", though. The lack of knob here means users aren't able to control by how much they want to avoid failures, ie they may be willing to take a little longer to pay by retrying a few times in order to avoid paying additional fees. Multiplying by the amount being sent doesn't seem like a useful "penalty per log(failure rate)" answer.
Added a liquidity_penalty_multiplier_msat parameter and used that instead of the amount.
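
For reference, a rough sketch of the resulting penalty shape under that change (the function name and exact rounding are assumptions based on the snippet quoted above):

// The configurable multiplier scales the negative log of the success
// probability, rather than scaling by the payment amount.
fn liquidity_penalty_msat(success_probability: f64, liquidity_penalty_multiplier_msat: u64) -> u64 {
    debug_assert!(0.0 < success_probability && success_probability <= 1.0);
    (-(success_probability.log10()) * liquidity_penalty_multiplier_msat as f64) as u64
}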
Force-pushed from 99e637b to 22d285d.
Codecov Report
@@ Coverage Diff @@
## main #1227 +/- ##
==========================================
+ Coverage 90.41% 90.56% +0.14%
==========================================
Files 71 71
Lines 38356 39248 +892
==========================================
+ Hits 34680 35544 +864
- Misses 3676 3704 +28
Continue to review full report at Codecov.
Force-pushed from 22d285d to eb0b930.
Force-pushed from eb0b930 to 055a356.
Benchmark using …
Benchmark using …
Benchmark using …
Note that each iteration of the …
lightning/src/routing/router.rs
Outdated
@@ -4968,6 +4994,8 @@ mod benches {
 	let dst = PublicKey::from_slice(nodes.keys().skip(seed % nodes.len()).next().unwrap().as_slice()).unwrap();
 	let payee = Payee::from_node_id(dst).with_features(InvoiceFeatures::known());
 	let amt = seed as u64 % 1_000_000;
+	let params = ProbabilisticScoringParameters::default();
+	let scorer = ProbabilisticScorer::new(params, &src, &graph);
Can you add new benchmarks - one with the classic Scorer with no data (or, better yet, a dummy scorer that always returns 0 that can be inlined), one with a Scorer with random data, and one with ProbabilisticScorer. In the interest of expediting this PR we can skip the second, because the first just requires copying the function and maybe a five-line struct definition.
Is it worth keeping around both generate_routes and generate_mpp_routes variations of the benchmarks? AFAICT, get_route will use MPP if the node announcement in the NetworkGraph indicates support, even if the invoice doesn't. Looking at some results, there isn't a material difference between the two for a given scorer.
test routing::router::benches::generate_mpp_routes_with_default_scorer ... bench: 92,654,109 ns/iter (+/- 31,531,162)
test routing::router::benches::generate_mpp_routes_with_probabilistic_scorer ... bench: 107,804,711 ns/iter (+/- 53,006,509)
test routing::router::benches::generate_mpp_routes_with_zero_penalty_scorer ... bench: 75,926,475 ns/iter (+/- 37,127,351)
test routing::router::benches::generate_routes_with_default_scorer ... bench: 92,888,708 ns/iter (+/- 32,657,001)
test routing::router::benches::generate_routes_with_probabilistic_scorer ... bench: 107,996,503 ns/iter (+/- 51,133,247)
test routing::router::benches::generate_routes_with_zero_penalty_scorer ... bench: 75,003,204 ns/iter (+/- 36,967,684)
Ah, seems I can hit the non-MPP case by using Some(InvoiceFeatures::empty()) instead of None.
Added some commits up front that:
- Refactor out a benchmark utility to use with different scorers and features
- Exercise first_hops, which simplifies testing ProbabilisticScorer
- Seed the scorer with success and failure data
- Benchmark a zero-penalty scorer
The commits are fairly self-contained, but let me know if you prefer that I pull them out into a separate PR.
Force-pushed from 0544c75 to 25f8816.
lightning/src/routing/router.rs
Outdated
match self { | ||
CandidateRouteHop::FirstHop { .. } => 0, | ||
CandidateRouteHop::PublicHop { info, .. } => { | ||
info.direction().unwrap().cltv_expiry_delta as u32 |
Not obvious to me that this is safe to unwrap, similar elsewhere 🤔 Could we add a comment?
Good catch! One case was inadvertently removed in a fixup. The other is guarded by an if let. Added a comment to the field.
lightning/src/routing/router.rs
Outdated
let mut prev_cltv_expiry_delta = final_cltv_expiry_delta;
for hop in path.iter_mut().rev() {
    core::mem::swap(&mut prev_cltv_expiry_delta, &mut hop.as_mut().unwrap().cltv_expiry_delta);
}
I see tests fail without this but I'm confused on what it's accomplishing, comment?
This change was from Matt's fixup, which removed modification of this field in ordered_hops inside the path_walk loop. Added a comment and used fold to propagate the delta backwards. @TheBlueMatt Let me know if the comment makes sense or could be expanded somehow.
Right, this was previously in the if node_id == our_node_id loop, above - see the comment there for more info. It moved here to handle moving the data into the CandidateRouteHop.
lightning/src/routing/scoring.rs
Outdated
let leading_zeros = x.leading_zeros();
let integer_part = (63 - leading_zeros) as f64;
let fractional_part = LOG2_FRACTIONAL_PART[
    (((x << leading_zeros) >> (63 - FRACTIONAL_BITS)) & FRACTIONAL_BITMASK) as usize
];
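
To make the quoted fragment concrete, here is a self-contained sketch of the same lookup-table technique, with an assumed 4-bit fractional table (the table width and values in the actual PR may differ):

// Precomputed log2(1 + i/16) for i in 0..16 (assumed table size).
const FRACTIONAL_BITS: u32 = 4;
const FRACTIONAL_BITMASK: u64 = (1u64 << FRACTIONAL_BITS) - 1;
const LOG2_FRACTIONAL_PART: [f64; 16] = [
    0.0000, 0.0875, 0.1699, 0.2479, 0.3219, 0.3923, 0.4594, 0.5236,
    0.5850, 0.6439, 0.7004, 0.7549, 0.8074, 0.8580, 0.9069, 0.9542,
];

fn log2_approx(x: u64) -> f64 {
    debug_assert!(x > 0);
    let leading_zeros = x.leading_zeros();
    let integer_part = (63 - leading_zeros) as f64;
    // Normalize x to [1, 2) and use the top FRACTIONAL_BITS bits below the
    // most significant bit to index the precomputed table.
    let fractional_part = LOG2_FRACTIONAL_PART[
        (((x << leading_zeros) >> (63 - FRACTIONAL_BITS)) & FRACTIONAL_BITMASK) as usize
    ];
    integer_part + fractional_part
}

fn log10_approx(x: u64) -> f64 {
    // log10(x) = log2(x) * log10(2)
    log2_approx(x) * core::f64::consts::LOG10_2
}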
This commit kinda adds a good amount of complexity imo... Are we sure the perf gainz are worthwhile? Maybe better for follow-up?
Here are the latest benchmark numbers both before and after this optimization.
Without approximation:
test routing::router::benches::generate_mpp_routes_with_default_scorer ... bench: 88,374,024 ns/iter (+/- 19,792,155)
test routing::router::benches::generate_mpp_routes_with_probabilistic_scorer ... bench: 104,953,746 ns/iter (+/- 48,003,617)
test routing::router::benches::generate_mpp_routes_with_zero_penalty_scorer ... bench: 74,455,389 ns/iter (+/- 37,400,486)
test routing::router::benches::generate_routes_with_default_scorer ... bench: 43,661,367 ns/iter (+/- 11,777,146)
test routing::router::benches::generate_routes_with_probabilistic_scorer ... bench: 52,373,973 ns/iter (+/- 25,136,830)
test routing::router::benches::generate_routes_with_zero_penalty_scorer ... bench: 36,233,816 ns/iter (+/- 18,920,800)
With approximation:
test routing::router::benches::generate_mpp_routes_with_default_scorer ... bench: 90,098,517 ns/iter (+/- 20,624,987)
test routing::router::benches::generate_mpp_routes_with_probabilistic_scorer ... bench: 102,870,120 ns/iter (+/- 50,438,276)
test routing::router::benches::generate_mpp_routes_with_zero_penalty_scorer ... bench: 74,644,550 ns/iter (+/- 39,539,073)
test routing::router::benches::generate_routes_with_default_scorer ... bench: 43,759,396 ns/iter (+/- 10,977,942)
test routing::router::benches::generate_routes_with_probabilistic_scorer ... bench: 50,402,866 ns/iter (+/- 25,601,834)
test routing::router::benches::generate_routes_with_zero_penalty_scorer ... bench: 35,911,935 ns/iter (+/- 17,529,777)
So roughly 2 ms faster, though the numbers may be a little noisy. Happy to drop the commit if you both agree. Would save me the effort of updating the unit test assertions. 😛
My initial reaction is that those gains seem somewhat not worth it or at least better suited for follow-up, but would have to get @TheBlueMatt's thoughts
Hmm, this seems... not useful at all? It's maybe useful, but clearly dwarfed by noise, so unclear if useful. I'd say avoid it until a followup PR?
Yeah, FWIW I think the noise is somewhat inherent to the benchmark setup. There's bound to be variance if the distance between source and destination nodes varies between iterations. We may want to use a smaller set of pairs but use all of them on each iteration instead of just one per iteration.
Ran some benchmarks with a single path greater than 7 hops. The variance went down, but results were similar in terms of performance. Reverted the approximation commit and removed the Probability enum.
Force-pushed from 25f8816 to d2806cb.
I am in the middle of review and couldn't add more comments, as the diff was changed. I hope the comments are not lost, but I thought maybe I can just review the comments that have already been made.
fn default() -> Self {
    Self {
        liquidity_penalty_multiplier_msat: 1000,
        liquidity_offset_half_life: Duration::from_secs(3600),
How is that value being chosen? Is there empirical evidence, or is this just an arbitrary choice to start with and collect experience? Might make sense to make this configurable either in config or in the API that invokes path selection. In this way, if the returned candidate paths are not satisfiable, one might be able to give the half-life duration as an argument and make all the knowledge be forgotten, for example by setting the value to 0 or 1 second.
> How is that value being chosen? Is there empirical evidence, or is this just an arbitrary choice to start with and collect experience? Might make sense to make this configurable either in config or in the API that invokes path selection.

This is an arbitrary value and is only configurable upon initialization as of now.

> In this way, if the returned candidate paths are not satisfiable, one might be able to give the half-life duration as an argument and make all the knowledge be forgotten, for example by setting the value to 0 or 1 second.

Hmm... not sure the best way to go about this (and if we'd want to do so). The caller is parameterized with any scorer, one which may not have any concept of decay. I suppose the Score interface could have a method that returns an "empty" instance of itself to use temporarily.
Because the data is updated during processing I'm not sure how it could be configurable per-route-finding call, but it is configurable.
/// Returns the [`EffectiveCapacity`] of the channel in a specific direction.
///
/// This is either the total capacity from the funding transaction, if known, or the
/// `htlc_maximum_msat` for the direction as advertised by the gossip network, if known,
Let there be a channel of capacity 100M msats as seen in the funding tx, and let the htlc_maximum_msat be 10M msats. This means of course that the probability for any single HTLC to be forwarded that has an amount larger than 10M msats should be 0. However, I don't think the probability to forward exactly 10M msats should be close to 0, but rather close to 0.9 = (100M - 10M) / (100M) ~ (100M - 10M + 1) / (100M + 1). What I am trying to say: in probabilistic scoring we should always start from the full capacity. Of course, given prior knowledge, we might use terms such as effective capacity that reduce the interval in which uncertainty exists. I will have to find the position in the code where the actual score is being computed. Still, I thought I'd leave this comment here.
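
The formula in that example, written out as a sketch under the uniform-liquidity assumption discussed in this thread (the function name is illustrative, not from the PR):

// Success probability of forwarding `amount_msat` through a channel of
// `capacity_msat`, assuming its liquidity is uniformly distributed.
fn success_probability(capacity_msat: u64, amount_msat: u64) -> f64 {
    assert!(amount_msat <= capacity_msat);
    (capacity_msat + 1 - amount_msat) as f64 / (capacity_msat + 1) as f64
}

// success_probability(100_000_000, 10_000_000) ≈ 0.9, matching the example above.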
Yeah, I had spoken with @TheBlueMatt about this, but in the context of MPP. In get_route we reduce the effective capacity by any potential in-flight HTLCs for the same payment. But, as you noted, if htlc_maximum_msat <<< capacity then this may not be desirable, as we should be able to route more through the channel using multiple HTLCs. IIRC, Matt mentioned that implementations typically default htlc_maximum_msat to a value close to the channel capacity, which may alleviate both concerns even though not necessarily correct.
One thought I had was to pass the EffectiveCapacity to the scorer (instead of converting to msats), possibly with in-flight amounts, and have it determine whether or not to subtract those. This is partly why I wanted to add EffectiveCapacity. I guess ideally htlc_maximum_msat would only need to be considered in get_route, obviating the need for the MaximumHTLC variant.
One further issue is that many clients only have the htlc_maximum_msat value - some clients do not verify the on-chain output at all (subject to some potential DoS attacks). Depending on the direction gossip goes, this may be increasingly common.
lightning/src/routing/scoring.rs
Outdated
impl Default for ProbabilisticScoringParameters {
    fn default() -> Self {
        Self {
            liquidity_penalty_multiplier_msat: 1000,
This seems only relevant if (as one should) fees are also taken into consideration as a feature to compute the cost of a hop / path (I think you internally name the cost Score, so please don't get confused when I talk about cost instead of score). As our benchmarks, and prior to them simulations in c-lightning, demonstrated (ElementsProject/lightning#4771 (comment)), using the harmonic mean instead of a weighted arithmetic mean to combine several features (negative log probs, fees, cltv delta) worked better. In case you would do the same, this value seems useless. Otherwise one could of course ask why it is chosen as 1000 right now.
> This seems only relevant if (as one should) fees are also taken into consideration as a feature to compute the cost of a hop / path... In case you would do the same, this value seems useless.

Currently, the scorer returns a penalty in msats for using a channel, which is simply added to the fees to get the cost:

rust-lightning/lightning/src/routing/router.rs
Lines 957 to 962 in d2806cb

let old_cost = cmp::max(old_entry.total_fee_msat, old_entry.path_htlc_minimum_msat)
    .checked_add(old_entry.path_penalty_msat)
    .unwrap_or_else(|| u64::max_value());
let new_cost = cmp::max(total_fee_msat, path_htlc_minimum_msat)
    .checked_add(path_penalty_msat)
    .unwrap_or_else(|| u64::max_value());

Given a scorer that computes a negative log of a success probability, how would you recommend combining this "score" with the fees to get a cost?

> Otherwise one could of course ask why it is chosen as 1000 right now.

FYI, the value is an arbitrary default which can be customized by the user. That said, we may want a more reasonable default.
What do you mean "better", here? There is no objective metric by which one could argue one way of combining different goals is "better" than another - hence why this is configurable.
What do you mean "better", here? There is no objective metric by which one could argue one way of combining different goals is "better" than another - hence why this is configurable.
Yes you have already mentioned that on the mailing list and I agree with you (never disagreed) that there is not one single objective, universally true metric here. The way I used the word "better" was having in mind to combine the relevant features which had previously been identified (eg probabilities or fees)
That being said the topic here goes far beyond Lightning Path finding / payment delivery. In many cases where you have several features (like in our case Probability to model reliability (and with it potentially speed), fees to model actual costs, latency or geodistance to model speed, ...) you will always have the questions: "How should those features be combined into one single score / metric?". It turns out that many people intuitively take the arithmetic mean which is essentially just adding the features or a weighted arithmetic mean which is almost the same as normalizing the features and adding them without weight. You are currently also doing this and I would strongly advice against doing that.
You take your old score and add the new one and you currently weight it with a factor of 1000 as in liquidity_penalty_multiplier_msat
. In almost all situations in my life when I was doing data science / machine learning it turned out that in practice the arithmetic mean a+w*b
is not as useful as using a*b/(a+b)
(I ignore constants). The later seems to have several advantages in practise:
- You do not need to learn and test for the weight(s)
- features (in our case for channels) might not be correlated. so a weighting for one channel might be totally wrong for another channel
- It seems to be more resilient against outliers and seems to be able to capture the mix of features better even if they are in different orders of magnitude or are differently distributed.
I have discussed this in great length with the c-lightning team - in particular @cdecker - and my preliminary simulations as well as our mainnet experiments and evaluations show that arithmetic mean heavily underperformed in comparison to the other way to compute a mean. Also recently I was made aware that there might actually be deeper mathematical motivation in there which in the field of economics and risk management is being used all the time and might have already been discovered in the early 18th hundred by Bernoulli in this work - though I haven't looked deeper into this yet.
Also on a side note. The arithmetic mean obviously has the advantage of being linear and useful for the case of min cost flows if all the features are linear and convex. This is not true for the mean that I suggest. But as your code currently only looks at dijkstra we don't need to have a linear or convex cost function.
Also as suggested to the c-lightning team. if you don't believe me it should be very easy to have both versions of mean computation and do a simple A/B test against any of your favourite test criteria (I usually like "number of attempts" or time to delivery) to see if you want to keep the arithmetic mean here
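
For concreteness, a toy sketch of the two combining rules under discussion (purely illustrative; constants are ignored, as noted above, and the variable roles are assumptions):

// Weighted arithmetic combine, as currently used (a = fee-based cost,
// b = negative log success probability, w = a multiplier-style weight).
fn arithmetic_combine(a: f64, w: f64, b: f64) -> f64 {
    a + w * b
}

// The a*b/(a+b) form suggested above; its value is dominated by the
// smaller of the two features.
fn harmonic_combine(a: f64, b: f64) -> f64 {
    if a + b == 0.0 { 0.0 } else { a * b / (a + b) }
}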
I'm still very confused by this. I read through the links you provided, and none of them discuss failure modes or seem to provide any justification for the harmonic mean here, aside from some anecdotal suggestions. The discussion from c-lightning just links to an lnd issue, which provides no justification, and it looks like c-lightning stuck with simply adding the scores as well?
Indeed, the code here (and the current model) focuses entirely on fees as the metric we're optimizing for. As a result of this, we denominate the scorer as "amount of fee you're willing to pay to avoid this channel". This is not only nicely simple from a theoretical perspective, but also simple enough that our users can at least roughly understand the metric and can set the tunables here without much interaction from us - they can easily run trivial calculations to determine how much their "avoid channels" cost will be, and can tune things appropriately based on how much fee they're willing to pay. In the future we could even allow these tunables to be set on a per-payment basis, given we have a model here that doesn't incorporate the concrete fee values.
Indeed, in the future, I'd very much like to have multiple top-to-bottom routing algorithms, eg there has been very little work done on how you'd design a lightning routing algorithm "for privacy", without seriously considering fees or time-to-pay/-retries, but even just as-is, users having the ability to drastically tune the knob between retry-count and fees-paid seems pretty important, which gets much harder with the harmonic mean, IIUC.
That shows what c-lightning does, and it was evaluated by conducting a mainnet evaluation: https://medium.com/blockstream/c-lightning-v0-10-2-bitcoin-dust-consensus-rule-33e777d58657
The blog article left out the comparison between the means, but I remember evaluating the statistics for test payments myself.
Of course, the evaluation that @cdecker and I did might not have been sufficiently large or realistic in its assumptions.
btw, I think the much bigger argument for using an arithmetic mean is that it is linear, which means that if the features are linear the cost function stays linear (or if the features are convex the cost function stays convex). When using min cost flow solvers to compute optimal splits (given the cost function, and assuming the cost function properly captures the goals of the node), the arithmetic mean will not have a negative impact on the runtime of the min cost flow solvers (assuming all features are at worst convex). As the cost function is currently only used in Dijkstra, we need neither linearity nor convexity, so we don't care.
/// Returns the success probability of routing the given HTLC `amount_msat` through the channel
/// in this direction.
fn success_probability(&self, amount_msat: u64) -> Probability {
    let max_liquidity_msat = self.max_liquidity_msat();
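
As context for the comments below, a plausible shape for the rest of this method, written as a free-function sketch (the bounds handling here is an assumption, not the PR's exact code): amounts above the max bound never succeed, amounts at or below the min bound always succeed, and in between the probability falls off linearly.

enum Probability {
    Zero,
    One,
    Ratio { numerator: u64, denominator: u64 },
}

// Sketch over the decayed liquidity bounds of a channel direction.
fn success_probability(amount_msat: u64, min_liquidity_msat: u64, max_liquidity_msat: u64) -> Probability {
    debug_assert!(min_liquidity_msat <= max_liquidity_msat);
    if amount_msat > max_liquidity_msat {
        Probability::Zero
    } else if amount_msat <= min_liquidity_msat {
        Probability::One
    } else {
        Probability::Ratio {
            numerator: max_liquidity_msat + 1 - amount_msat,
            denominator: max_liquidity_msat + 1 - min_liquidity_msat,
        }
    }
}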
Regarding one of my earlier comments: I tried to chase down the logic of how this value is being derived, and as far as I can tell this is the channel capacity or a potentially decayed learned bound, but it is not constrained by htlc_max_msat like in the effective capacity data structure. I just leave the comment, as I would like to suggest a second pair of eyes double-checks this.
The EffectiveCapacity::as_msats is passed to the scorer's channel_penalty_msat method, which is used as the capacity here via ChannelLiquidity::as_directed.
I believe this value comes from the as_directed call in channel_penalty_msat, which gets its information from the available_liquidity in the router, which calls DirectedChannelInfo::effective_capacity, which is bound by the htlc_maximum_msat. Not quite sure whether you were saying this should or should not be, but it is.
Yeah, I addressed this in another comment. I say for probability computation you should use the actual capacity (reduced by potential knowledge about existing liquidity) instead of capping this at htlc_maximum_msat. Using htlc_maximum_msat will result in potentially strange results (see this comment: #1227 (comment)).
impl<L: DerefMut<Target = u64>, T: Time, U: DerefMut<Target = T>> DirectedChannelLiquidity<L, T, U> {
    /// Adjusts the channel liquidity balance bounds when failing to route `amount_msat`.
    fn failed_at_channel(&mut self, amount_msat: u64) {
I just have to say that I really love to see this and the following functions being implemented here, and that you will track the uncertainty / information gain properly! I will be really curious how much this also improves payment delivery in practice in comparison to the recent updates that we have seen from c-lightning, which use only probabilistic scoring but forget most of the learnt information directly.
Hopefully Cash App switches and uses the new one and we can get some good data from that!
/// Adjusts the channel liquidity balance bounds when successfully routing `amount_msat`.
fn successful(&mut self, amount_msat: u64) {
    let max_liquidity_msat = self.max_liquidity_msat().checked_sub(amount_msat).unwrap_or(0);
I am a bit confused whether this and the prior two functions always work properly with concurrency, though it seems to me that concurrency is actually not messing up the bookkeeping here. But how do you handle the decay of the knowledge when subtracting from max_liquidity? This could happen at two very different moments. Let's say you forward one HTLC at 12:00 and the next at 12:45, and both have been successful. How will you compute the max liquidity at 13:05? As far as I understand, the knowledge from the first routing event should have decayed to half at that time. I am not sure how this is reflected in this data structure. It might not be a problem, as the half-life is chosen arbitrarily at this time anyway, but I thought I should point this out.
> I am a bit confused whether this and the prior two functions always work properly with concurrency, though it seems to me that concurrency is actually not messing up the bookkeeping here.

It's accomplished by having get_route take a non-mut reference to a Score, while the event handler needs a mut reference. The compiler ensures we only ever have one mut reference, which we accomplish through using a Mutex. If you're curious how this works, see:
rust-lightning/lightning/src/routing/scoring.rs
Lines 123 to 146 in e4387fa

/// A scorer that is accessed under a lock.
///
/// Needed so that calls to [`Score::channel_penalty_msat`] in [`find_route`] can be made while
/// having shared ownership of a scorer but without requiring internal locking in [`Score`]
/// implementations. Internal locking would be detrimental to route finding performance and could
/// result in [`Score::channel_penalty_msat`] returning a different value for the same channel.
///
/// [`find_route`]: crate::routing::router::find_route
pub trait LockableScore<'a> {
    /// The locked [`Score`] type.
    type Locked: 'a + Score;
    /// Returns the locked scorer.
    fn lock(&'a self) -> Self::Locked;
}

/// (C-not exported)
impl<'a, T: 'a + Score> LockableScore<'a> for Mutex<T> {
    type Locked = MutexGuard<'a, T>;
    fn lock(&'a self) -> MutexGuard<'a, T> {
        Mutex::lock(self).unwrap()
    }
}

rust-lightning/lightning-invoice/src/payment.rs
Lines 166 to 167 in e4387fa

R: for <'a> Router<<<S as Deref>::Target as LockableScore<'a>>::Locked>,
S::Target: for <'a> LockableScore<'a>,
> But how do you handle the decay of the knowledge when subtracting from max_liquidity? ... As far as I understand, the knowledge from the first routing event should have decayed to half at that time. I am not sure how this is reflected in this data structure.

max_liquidity_msat and min_liquidity_msat in DirectedChannelLiquidity return the decayed values. The internal representation in ChannelLiquidity consists of min and max un-decayed offsets, along with the time the values were last modified. The decay is realized in the offset representation any time they need to be modified. See set_min_liquidity_msat and set_max_liquidity_msat.
I did not look at the test cases yet. Overall, I really like the tracking of the channel liquidity to make path finding decisions, and I am really looking forward to seeing how the final version will perform!
As already addressed in the first part of my review, I think I have now discovered the line where you seem to reduce the effective liquidity too heavily in computing the probabilities, and I have provided a counterexample (assuming my understanding of the code is correct) in my comment. While of course channels should have a 0 probability if the amount is larger than the htlc_max_size constraint, the probabilities for amounts smaller than or equal to the constraint should still be computed with the actual channel capacity and not this effective capacity measure.
Also, I did not see how you handle channel reserve values. Technically one would always have to subtract the reserves too in the computation of the probabilities, as the reserves are not accessible for routing, but I guess the effect can be neglected as we usually try not to saturate the full liquidity anyway.
Let me know if you need help with benchmarking and with evaluating in a mainnet experiment.
lightning/src/routing/scoring.rs
Outdated
Probability::Zero => u64::max_value(),
Probability::One => 0,
Probability::Ratio { numerator, denominator } => {
    let log_success_probability = log10_approx(numerator, denominator);
While the log is merely introduced to switch from the multiplicative nature of channel probabilities for an entire path to adding the scores per hop, making this metric usable for shortest path algos and min cost flows, the base of the log does not really matter for the actual computation / ranking of the scored paths. Of course, it might matter when combining with other features in a weighted arithmetic mean...
However, from an information-theoretic point of view, the base-2 log is usually used, as that would measure the entropy and information gain in bits. It would also allow us to have very fast approximations, as we would just need the first significant bit of a number.
> While the log is merely introduced to switch from the multiplicative nature of channel probabilities for an entire path to adding the scores per hop... the base of the log does not really matter for the actual computation / ranking of the scored paths. Of course, it might matter when combining with other features in a weighted arithmetic mean...

Note the earlier comment about how we combine penalty and fees to get the cost.

> However, from an information-theoretic point of view, the base-2 log is usually used, as that would measure the entropy and information gain in bits. It would also allow us to have very fast approximations, as we would just need the first significant bit of a number.

IIRC, when using base 2, I would have a zero penalty for larger success probabilities where base 10 would give me something small but non-zero. Though this may also have had to do with the value I was multiplying by.
Additionally, when using log2(numerator) - log2(denominator), it's easy to get a penalty of zero if the amount is close to the min liquidity. That is, subtracting them from the max liquidity to get the numerator and denominator, respectively, may give two values with identical most significant bits (i.e., zero when subtracted).
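
A concrete instance of that issue (illustrative values only): with an integer, most-significant-bit-only log2, two nearby values share the same MSB, so their difference rounds to zero even though the true ratio is not 1.

fn msb_log2(x: u64) -> u32 {
    63 - x.leading_zeros()
}

fn main() {
    let numerator = 900u64;    // e.g. max_liquidity - amount
    let denominator = 1000u64; // e.g. max_liquidity - min_liquidity
    assert_eq!(msb_log2(numerator), 9);
    assert_eq!(msb_log2(denominator), 9);
    // -log2(900/1000) ≈ 0.152, but the MSB approximation yields 9 - 9 = 0.
    assert_eq!(msb_log2(denominator) - msb_log2(numerator), 0);
}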
> While the log is merely introduced to switch from the multiplicative nature of channel probabilities for an entire path to adding the scores per hop, making this metric usable for shortest path algos and min cost flows, the base of the log does not really matter for the actual computation / ranking of the scored paths. Of course it might when combining with other features in a weighted arithmetic mean...

> Note earlier comment about how we combine penalty and fees to get the cost.

Note my reply: #1227 (comment). I advise against the way you currently do that.
> IIRC, when using base 2, I would get a zero penalty for larger success probabilities where base 10 would give me something small but non-zero. Though this may also have had to do with the value I was multiplying by.

nit: which - when rounding is taking place - might actually be OK. Small payments kind of don't mind which channel to take: they have a really high probability, and the log could be close to 0 or exactly 0 (though technically only for a probability of 1 do we want zero as a result).
> Additionally, when using log2(numerator) - log2(denominator), it's easy to get a penalty of zero if the amount is close to the min liquidity. That is, subtracting them from max liquidity to get the numerator and denominator, respectively, may give two values with identical most significant bits (i.e., zero when subtracted).

Shouldn't that only be true if you take integers of the base-2 log, or, like I suggested, estimate this with the position of the first non-zero bit of the numerator and denominator?
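To make this exchange concrete, here is a rough sketch (not this PR's code; `log2_approx` is a hypothetical stand-in for the PR's `log10_approx`) of the integer base-2 approximation via the most significant bit, showing how it collapses to zero when numerator and denominator share the same MSB:

```rust
// Hypothetical integer log2 via the most significant bit; this is the
// "first significant bit" approximation discussed above.
fn log2_approx(numerator: u64, denominator: u64) -> i32 {
    // floor(log2(x)) is the position of the most significant set bit.
    let msb = |x: u64| 63 - x.leading_zeros() as i32;
    msb(numerator) - msb(denominator)
}

fn main() {
    // Sending 1_000 msat over a channel with 1_000_000 msat of max liquidity:
    // success probability ~= (1_000_001 - 1_000) / 1_000_001.
    let (numerator, denominator) = (999_001u64, 1_000_001u64);
    // Both values have their most significant bit at position 19, so the
    // integer approximation yields 0 even though the probability is not 1.
    assert_eq!(log2_approx(numerator, denominator), 0);
}
```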
lightning/src/routing/scoring.rs
Outdated
let capacity_msat = network_graph.channels()
	.get(&hop.short_channel_id)
	.and_then(|channel| channel.as_directed_to(&target).map(|d| d.effective_capacity()))
If the effective_capacity is filled with values like htlc_maxsize_msat, I fear that the computed probabilities will be lower than the actual probabilities and the negative log probabilities will be overestimated. One might argue that this might not even be a problem if we make the same mistake for all channels, but the htlc_maxsize_msat value will differ between channels. Let me make this clear with an example (see the sketch below).
Let's assume we want to route 1000 msat and we have two channels to choose from:
a) a channel of size 3000 msats with a max HTLC size of 3000 msats
b) a channel of size 30000000 msats with a max HTLC size of 2000 msats
With your definition of effective capacity being bounded by the max HTLC size, you would give channel a) a higher likelihood, when in reality the success probability of channel b) should be much higher!
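A rough sketch of the numbers in this example, using the uniform-liquidity success probability P = (capacity + 1 - amount) / (capacity + 1) from the paper; the function and its names are illustrative, not this PR's API:

```rust
// Illustrative only: success probability under a uniform liquidity prior,
// as in the Pickhardt/Richter model.
fn success_probability(amount_msat: u64, capacity_msat: u64) -> f64 {
    (capacity_msat + 1 - amount_msat) as f64 / (capacity_msat + 1) as f64
}

fn main() {
    let amount_msat = 1_000;
    // a) 3_000 msat channel: P ~= 0.667
    println!("a) {:.3}", success_probability(amount_msat, 3_000));
    // b) 30_000_000 msat channel: P ~= 0.99997, clearly the better choice
    println!("b) {:.5}", success_probability(amount_msat, 30_000_000));
    // But if b)'s "effective capacity" is capped at its 2_000 msat max HTLC
    // size, it scores P ~= 0.5 and looks worse than a), inverting the ranking.
    println!("b, capped) {:.3}", success_probability(amount_msat, 2_000));
}
```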
Yep, I assume this is the same point made in the following conversation?
https://github.com/lightningdevkit/rust-lightning/pull/1227/files#r787934743
Yes, it is the same - only that here is the point where it is actually being initialized. I think this should really be fixed. While stuffing several small HTLCs into a channel in the context of MPP is one concern, I think the much bigger concern is that the probability computation becomes severely wrong. Large channels especially should be highly attractive / likely. But I guess folks with large channels will want to cap the max size of HTLCs, which means that you would potentially not find such channels, which have a high chance of having enough liquidity.
This may be worth fixing, though in practice today it's a relatively rare issue. It's also worth noting that not all users have chain data; they don't bother fetching the channel's capacity by looking at the chain and instead use htlc_maximum_msat to determine it, because it's the best info they have. Our users aren't the only ones doing this; I believe Electrum does this as well. Thus, we really need the implementation here to be robust even if we're using htlc_maximum_msat data to infer channel sizes. I'm not sure how best to do that, but just ensuring we pipe the values from on-chain through doesn't really solve the concern.
Ultimately this needs to be resolved at the spec level, of course, but given the values are basically always the same today, we may just wait until we have a new gossip format for that.
Addressed this as per a top-level comment, but dropped it in favor of a follow-up as per offline discussion. The commit was a bit more involved and had a performance regression, though pre-computing DirectedChannelInfo::htlc_maximum_msat seems to fix it, which is odd given it should only be calculated once.
Looks like this needs a rebase now.
Sure does! Mind if I squash the fixups for 3c2df0c? Should make it a little easier.
lightning/src/routing/scoring.rs
Outdated
/// such as a chain data, network gossip, or invoice hints, the latter indicating sufficient
/// capacity (i.e., near [`u64::max_value`]). Thus, implementations should be overflow-safe.
I thought the previous wording might've been clearer here, suggestion being: "For hints provided in the invoice, we assume the channel has sufficient capacity to accept the invoice's full amount, and provide a channel_capacity_msat of u64::max_value. Thus..."
Reworded a bit to tighten it up. Technically, it can be less than u64::max_value since in-flight HTLCs are deducted for the time being.
Basically there, mostly doc comments. Still need to finish discussing #1227 (comment)
/// Knowledge about channel liquidity balances takes the form of upper and lower bounds on the
/// possible liquidity. Certainty of the bounds is decreased over time using a decay function. See
/// [`ProbabilisticScoringParameters`] for details.
///
I feel like we should say something here about how this is likely to be really great for probing-based nodes or nodes with high payment flow, but because we learn the specific liquidity on a channel, instead of learning how often a channel is available, this is unlikely to be a great approach for nodes which have very low payment volume. Scorer isn't a ton better, admittedly, but we likely want to extend this to measure node "badness" later.
Thanks for the update. I think it's still a little vague. I suppose the decaying is also bad for long-term learning, but my other long-term-learning concern here is just about how we go about learning to begin with. Our whole thing here is about trying to remember the current channel balances of every channel on the network. We don't so much learn anything as try to keep a mental map of the whole network up-to-date all the time. That's great if you're probing or have payment flows, but otherwise we're really not learning at all, IMO.
Fully agree! Those seem to be two orthogonal problems:
- The probability that liquidity is available (as a probabilistic model for the reliability)
- The actual observed reliability of a specific node / channel (as a provenance score or distribution)
I claim - but obviously don't know - that the first one will be much more impactful. In particular, for the first one we know how to properly quantify this, as the probabilistic model gives us the proper information theoretic framework.
That being said, I am very interested in examining the second one and investigating whether it is useful to include it in the cost function. I have drafts of an approach for assigning provenance scores as a probability distribution, which can be used to create the joint distribution between this and the liquidity-based model. The good thing is that models for the second question can be learned from log files and also tested against them. I will be very happy to collaborate on this, in particular as you seem to have access to data / log files.
Right, I assume that both will be important (though somewhat annoyingly the lightning onion errors don't tell us which happened when we get a failure - liquidity limit or node offline), but either way we don't yet do either, and that would be bad for some users :)
Re-worded the first part of the sentence as discussed offline.
Having two opposing conjectures / hypotheses is one of the best starting points for research. As said, I am very interested in these two questions (1. How to define a reliability-based probabilistic provenance score for nodes / channels. 2. How useful this will be in the cost function.) and happy to collaborate and find out. Also very happy to at least anecdotally be proven wrong on my conjecture.
max_liquidity_offset_msat,
capacity_msat,
last_updated: &self.last_updated,
now: T::now(),
Wow, that's a lot of system-time-gets Batman! I guess it's pretty quick because it's usually rdtsc on x86 when warm, but we should probably fix this later. Maybe add a comment here noting that when we remove old entries, we should just decay them then and not bother decaying during lookups.
> Wow, that's a lot of system-time-gets Batman! I guess it's pretty quick because it's usually rdtsc on x86 when warm, but we should probably fix this later.

This should be similar behavior to Scorer::decayed_penalty_msat, right?

> Maybe add a comment here noting that when we remove old entries, we should just decay them then and not bother decaying during lookups.

Not sure I follow. The penalty may not be accurate if we don't decay during lookup. Could you clarify what you mean by "remove old entries"?
> This should be similar behavior to Scorer::decayed_penalty_msat, right?

I think so? That doesn't mean it's great, though. IIRC it can be quite cheap on x86, but on other platforms it can be a full syscall.

> Not sure I follow. The penalty may not be accurate if we don't decay during lookup. Could you clarify what you mean by "remove old entries"?

I'm assuming that when we implement the "remove entries for closed channels" logic, we'll instead just remove entries on a timer once they have decayed almost all the way to zero. Once we have a timer, we can just do the decays on the timer, instead of doing the decays while routing.
Gotcha. Yeah, that would be a simple way to remove closed channels. I'd hope there would be a way to use the scorer without a background thread, though.
Another thought would be to push Time into either the Score or find_route interface so T::now could be called once and passed in as needed. Or even build it into LockableScore in some way such that locking gives back some (S, T) implementing Score.
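A minimal sketch of the first idea, under assumed trait shapes rather than LDK's actual interfaces; the point is only that the system time is read once per route computation and threaded through:

```rust
use std::time::Instant;

// Hypothetical trait shape, not LDK's actual Score interface: the current
// time is passed in rather than fetched per channel via T::now().
trait Score {
    fn channel_penalty_msat(&self, short_channel_id: u64, amount_msat: u64, now: Instant) -> u64;
}

struct ConstantScorer(u64);

impl Score for ConstantScorer {
    fn channel_penalty_msat(&self, _scid: u64, _amount_msat: u64, _now: Instant) -> u64 {
        // A real scorer would use `now` to decay its liquidity bounds.
        self.0
    }
}

fn path_penalty_msat<S: Score>(scorer: &S, path: &[u64], amount_msat: u64) -> u64 {
    let now = Instant::now(); // single system-time read for the whole path
    path.iter().map(|scid| scorer.channel_penalty_msat(*scid, amount_msat, now)).sum()
}

fn main() {
    let scorer = ConstantScorer(500);
    assert_eq!(path_penalty_msat(&scorer, &[1, 2, 3], 10_000), 1_500);
}
```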
Yea, that would work too. I dunno what the right answer is. We already require a background thread for the network graph to expire channels, which we'd discussed at the time either doing there or on serialization, and decided the background thread was simpler. I guess we could have an explicit "channel is dead" call here like we do for the network graph, but that feels like about as much work as just hooking it up to the background thread / a timer? Anyway, it's something we can discuss another time. I don't think this needs fixing here, but I worry this is only practically performant on x86.
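For context on what the decaying above would look like, a rough sketch of half-life-based decay of the learned bounds; the right-shift-per-half-life approach and the parameter names are assumptions for illustration, not necessarily what ships in this PR:

```rust
// Illustrative half-life decay: each elapsed half-life halves the learned
// offset, relaxing the bounds back toward the no-knowledge state
// [0, capacity]. Names are hypothetical.
fn decayed_offset_msat(offset_msat: u64, elapsed_secs: u64, half_life_secs: u64) -> u64 {
    let half_lives = (elapsed_secs / half_life_secs) as u32;
    // checked_shr returns None if we'd shift past the word size.
    offset_msat.checked_shr(half_lives).unwrap_or(0)
}

fn main() {
    // A learned 1_000_000 msat lower-bound offset with a one-hour half-life:
    assert_eq!(decayed_offset_msat(1_000_000, 0, 3_600), 1_000_000);
    assert_eq!(decayed_offset_msat(1_000_000, 3_600, 3_600), 500_000);
    assert_eq!(decayed_offset_msat(1_000_000, 2 * 3_600, 3_600), 250_000);
}
```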
Note the failing ...
One further note about the "why not use this without lots of payment flows" comment at #1227 (comment) but I'm good with where we're at.
LGTM 🚀
A channel's capacity may be inferred or learned and is used to make routing decisions, including as a parameter to channel scoring. Define an EffectiveCapacity for this purpose. Score::channel_penalty_msat takes the effective capacity (less in-flight HTLCs for the same payment), and never None. Thus, for hops given in an invoice, the effective capacity is now considered (near) infinite if over a private channel, or based on learned information if over a public channel. If a Score implementation needs the effective capacity when updating a channel's score, i.e. in payment_path_failed or payment_path_successful, it can access the channel's EffectiveCapacity via the NetworkGraph by first looking up the channel and then specifying which direction is desired using ChannelInfo::as_directed.
Add a Score implementation based on "Optimally Reliable & Cheap Payment Flows on the Lightning Network" by Rene Pickhardt and Stefan Richter [1]. Given the uncertainty of channel liquidity balances, probability distributions are defined based on knowledge learned from successful and unsuccessful attempts. Then the negative log of the success probability is used to determine the cost of routing a specific HTLC amount through a channel. [1]: https://arxiv.org/abs/2107.05322
ProbabilisticScorer uses successful and unsuccessful payments to gain more certainty of a channel's liquidity balance. Decay this knowledge over time to indicate decreasing certainty about the liquidity balance.
Add a Score implementation based on Optimally Reliable & Cheap Payment Flows on the Lightning Network by Rene Pickhardt and Stefan Richter [1]. Given the uncertainty of channel liquidity balances, probability distributions are defined based on knowledge learned from successful and unsuccessful attempts. Then the negative log of the success probability is used to determine the cost of routing a specific HTLC amount through a channel.
Knowledge about channel liquidity balances takes the form of upper and lower bounds on the possible liquidity. Certainty of the bounds is decreased over time using a decay function.
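As a worked illustration of that cost function (the parameter names, multiplier value, and exact bound arithmetic here are assumptions for the sketch, not this PR's precise constants):

```rust
// Illustrative penalty: negative log10 of the success probability, scaled
// by a liquidity penalty multiplier. Bounds come from previous attempts.
fn penalty_msat(amount_msat: u64, min_liquidity_msat: u64, max_liquidity_msat: u64) -> u64 {
    let multiplier_msat = 10_000.0; // hypothetical scaling constant
    if amount_msat <= min_liquidity_msat {
        return 0; // known to succeed: probability 1, and -log10(1) = 0
    }
    if amount_msat >= max_liquidity_msat {
        return u64::MAX; // known to fail: probability 0
    }
    // Uniform prior over [min, max]: P = (max + 1 - amount) / (max + 1 - min)
    let numerator = (max_liquidity_msat + 1 - amount_msat) as f64;
    let denominator = (max_liquidity_msat + 1 - min_liquidity_msat) as f64;
    (-(numerator / denominator).log10() * multiplier_msat) as u64
}

fn main() {
    // No knowledge yet (bounds [0, capacity]); sending half the capacity
    // has probability ~0.5, so the penalty is ~0.301 * 10_000 msat.
    assert_eq!(penalty_msat(500_000, 0, 1_000_000), 3_010);
}
```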
Resolves #1170.
Resolves #1172.