
Probabilistic channel scoring #1227

Merged

Conversation

Contributor
@jkczyz jkczyz commented Jan 5, 2022

Add a Score implementation based on "Optimally Reliable & Cheap Payment Flows on the Lightning Network" by Rene Pickhardt and Stefan Richter [1].

Given the uncertainty of channel liquidity balances, probability distributions are defined based on knowledge learned from successful and unsuccessful attempts. Then the negative log of the success probability is used to determine the cost of routing a specific HTLC amount through a channel.

Knowledge about channel liquidity balances takes the form of upper and lower bounds on the possible liquidity. Certainty of the bounds is decreased over time using a decay function.

Resolves #1170.
Resolves #1172.
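The mechanics described above can be sketched roughly as follows. This is a minimal sketch, not LDK's actual API: the function names, the uniform-prior probability model, and the msat scaling constant are all assumptions for illustration.

```rust
// Hypothetical sketch of the scoring idea: given learned lower/upper bounds on
// a channel's liquidity, estimate the probability that an HTLC of `amount_msat`
// can be forwarded, assuming a uniform prior over the possible balances.
fn success_probability(amount_msat: u64, min_liquidity_msat: u64, max_liquidity_msat: u64) -> f64 {
    if amount_msat <= min_liquidity_msat {
        1.0 // We know at least this much liquidity is available.
    } else if amount_msat > max_liquidity_msat {
        0.0 // We know the channel cannot carry this amount.
    } else {
        // Chance that the unknown balance in [min, max] covers `amount_msat`.
        (max_liquidity_msat + 1 - amount_msat) as f64
            / (max_liquidity_msat + 1 - min_liquidity_msat) as f64
    }
}

// The routing cost is then the negative log of the success probability, scaled
// into msats (the scaling factor here is illustrative, not the PR's exact choice).
fn penalty_msat(success_probability: f64, scale_msat: u64) -> u64 {
    if success_probability == 0.0 {
        u64::MAX
    } else {
        (-success_probability.log10() * scale_msat as f64) as u64
    }
}

fn main() {
    let p = success_probability(500, 0, 999);
    assert!(p > 0.0 && p < 1.0);
    // A certain hop costs nothing; an uncertain hop costs more.
    assert_eq!(penalty_msat(1.0, 10_000), 0);
    assert!(penalty_msat(p, 10_000) > 0);
    println!("p = {:.3}, penalty = {} msat", p, penalty_msat(p, 10_000));
}
```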

// The RouteHintHop fields which will eventually be used if this hop is used in a final Route.
// Note that node_features is calculated separately after our initial graph walk.
node_id: NodeId,
short_channel_id: u64,
channel_features: &'a ChannelFeatures,
channel_features: ChannelFeatures,
Collaborator

Holy malloc-traffic-in-a-tight-loop Batman! Somewhat astoundingly to me this doesn't actually seem to slow things down much, but I'd still strongly prefer to avoid the additional malloc traffic if possible. This does so and drops further fields on this struct too, which is nice:

diff --git a/lightning/src/routing/router.rs b/lightning/src/routing/router.rs
index 14844171e..a822a68d7 100644
--- a/lightning/src/routing/router.rs
+++ b/lightning/src/routing/router.rs
@@ -331,6 +331,7 @@ impl cmp::PartialOrd for RouteGraphNode {
 /// A wrapper around the various hop representations.
 ///
 /// Used to construct a [`PathBuildingHop`] and to estimate [`EffectiveCapacity`].
+#[derive(Clone, Debug)]
 enum CandidateRouteHop<'a> {
        /// A hop from the payer to the next hop to the payee, where the outbound liquidity is known.
        FirstHop {
@@ -409,22 +410,19 @@ impl<'a> CandidateRouteHop<'a> {
 /// Fee values should be updated only in the context of the whole path, see update_value_and_recompute_fees.
 /// These fee values are useful to choose hops as we traverse the graph "payee-to-payer".
 #[derive(Clone, Debug)]
-struct PathBuildingHop {
+struct PathBuildingHop<'a> {
        // The RouteHintHop fields which will eventually be used if this hop is used in a final Route.
        // Note that node_features is calculated separately after our initial graph walk.
        node_id: NodeId,
        short_channel_id: u64,
-       channel_features: ChannelFeatures,
+       candidate: CandidateRouteHop<'a>,
        fee_msat: u64,
-       cltv_expiry_delta: u32,
 
        /// Minimal fees required to route to the source node of the current hop via any of its inbound channels.
        src_lowest_inbound_fees: RoutingFees,
-       /// Fees of the channel used in this hop.
-       channel_fees: RoutingFees,
        /// All the fees paid *after* this channel on the way to the destination
        next_hops_fee_msat: u64,
-       /// Fee paid for the use of the current channel (see channel_fees).
+       /// Fee paid for the use of the current channel (see candidate.fees()).
        /// The value will be actually deducted from the counterparty balance on the previous link.
        hop_use_fee_msat: u64,
        /// Used to compare channels when choosing the for routing.
@@ -432,10 +430,6 @@ struct PathBuildingHop {
        /// an estimated cost of reaching this hop.
        /// Might get stale when fees are recomputed. Primarily for internal use.
        total_fee_msat: u64,
-       /// This is useful for update_value_and_recompute_fees to make sure
-       /// we don't fall below the minimum. Should not be updated manually and
-       /// generally should not be accessed.
-       htlc_minimum_msat: u64,
        /// A mirror of the same field in RouteGraphNode. Note that this is only used during the graph
        /// walk and may be invalid thereafter.
        path_htlc_minimum_msat: u64,
@@ -459,11 +453,11 @@ struct PathBuildingHop {
 // Instantiated with a list of hops with correct data in them collected during path finding,
 // an instance of this struct should be further modified only via given methods.
 #[derive(Clone)]
-struct PaymentPath {
-       hops: Vec<(PathBuildingHop, NodeFeatures)>,
+struct PaymentPath<'a> {
+       hops: Vec<(PathBuildingHop<'a>, NodeFeatures)>,
 }
 
-impl PaymentPath {
+impl<'a> PaymentPath<'a> {
        // TODO: Add a value_msat field to PaymentPath and use it instead of this function.
        fn get_value_msat(&self) -> u64 {
                self.hops.last().unwrap().0.fee_msat
@@ -514,7 +508,7 @@ impl PaymentPath {
                        // set it too high just to maliciously take more fees by exploiting this
                        // match htlc_minimum_msat logic.
                        let mut cur_hop_transferred_amount_msat = total_fee_paid_msat + value_msat;
-                       if let Some(extra_fees_msat) = cur_hop.htlc_minimum_msat.checked_sub(cur_hop_transferred_amount_msat) {
+                       if let Some(extra_fees_msat) = cur_hop.candidate.htlc_minimum_msat().checked_sub(cur_hop_transferred_amount_msat) {
                                // Note that there is a risk that *previous hops* (those closer to us, as we go
                                // payee->our_node here) would exceed their htlc_maximum_msat or available balance.
                                //
@@ -542,7 +536,7 @@ impl PaymentPath {
                        // Irrelevant for the first hop, as it doesn't have the previous hop, and the use of
                        // this channel is free for us.
                        if i != 0 {
-                               if let Some(new_fee) = compute_fees(cur_hop_transferred_amount_msat, cur_hop.channel_fees) {
+                               if let Some(new_fee) = compute_fees(cur_hop_transferred_amount_msat, cur_hop.candidate.fees()) {
                                        cur_hop.hop_use_fee_msat = new_fee;
                                        total_fee_paid_msat += new_fee;
                                } else {
@@ -870,18 +864,15 @@ where L::Target: Logger {
                                                        PathBuildingHop {
                                                                node_id: $dest_node_id.clone(),
                                                                short_channel_id: 0,
-                                                               channel_features: $candidate.features(),
+                                                               candidate: $candidate.clone(),
                                                                fee_msat: 0,
-                                                               cltv_expiry_delta: 0,
                                                                src_lowest_inbound_fees: RoutingFees {
                                                                        base_msat: fee_base_msat,
                                                                        proportional_millionths: fee_proportional_millionths,
                                                                },
-                                                               channel_fees: $candidate.fees(),
                                                                next_hops_fee_msat: u64::max_value(),
                                                                hop_use_fee_msat: u64::max_value(),
                                                                total_fee_msat: u64::max_value(),
-                                                               htlc_minimum_msat: $candidate.htlc_minimum_msat(),
                                                                path_htlc_minimum_msat,
                                                                path_penalty_msat: u64::max_value(),
                                                                was_processed: false,
@@ -977,11 +968,8 @@ where L::Target: Logger {
                                                                old_entry.total_fee_msat = total_fee_msat;
                                                                old_entry.node_id = $dest_node_id.clone();
                                                                old_entry.short_channel_id = short_channel_id;
-                                                               old_entry.channel_features = $candidate.features();
+                                                               old_entry.candidate = $candidate.clone();
                                                                old_entry.fee_msat = 0; // This value will be later filled with hop_use_fee_msat of the following channel
-                                                               old_entry.cltv_expiry_delta = $candidate.cltv_expiry_delta();
-                                                               old_entry.channel_fees = $candidate.fees();
-                                                               old_entry.htlc_minimum_msat = $candidate.htlc_minimum_msat();
                                                                old_entry.path_htlc_minimum_msat = path_htlc_minimum_msat;
                                                                old_entry.path_penalty_msat = path_penalty_msat;
                                                                #[cfg(any(test, feature = "fuzztarget"))]
@@ -1287,12 +1275,10 @@ where L::Target: Logger {
                                        // so that fees paid for a HTLC forwarding on the current channel are
                                        // associated with the previous channel (where they will be subtracted).
                                        ordered_hops.last_mut().unwrap().0.fee_msat = new_entry.hop_use_fee_msat;
-                                       ordered_hops.last_mut().unwrap().0.cltv_expiry_delta = new_entry.cltv_expiry_delta;
                                        ordered_hops.push((new_entry.clone(), NodeFeatures::empty()));
                                }
                                ordered_hops.last_mut().unwrap().0.fee_msat = value_contribution_msat;
                                ordered_hops.last_mut().unwrap().0.hop_use_fee_msat = 0;
-                               ordered_hops.last_mut().unwrap().0.cltv_expiry_delta = final_cltv_expiry_delta;
 
                                log_trace!(logger, "Found a path back to us from the target with {} hops contributing up to {} msat: {:?}",
                                        ordered_hops.len(), value_contribution_msat, ordered_hops);
@@ -1463,7 +1449,7 @@ where L::Target: Logger {
                                // Now, substract the overpaid value from the most-expensive path.
                                // TODO: this could also be optimized by also sorting by feerate_per_sat_routed,
                                // so that the sender pays less fees overall. And also htlc_minimum_msat.
-                               cur_route.sort_by_key(|path| { path.hops.iter().map(|hop| hop.0.channel_fees.proportional_millionths as u64).sum::<u64>() });
+                               cur_route.sort_by_key(|path| { path.hops.iter().map(|hop| hop.0.candidate.fees().proportional_millionths as u64).sum::<u64>() });
                                let expensive_payment_path = cur_route.first_mut().unwrap();
                                // We already dropped all the small channels above, meaning all the
                                // remaining channels are larger than remaining overpaid_value_msat.
@@ -1481,16 +1467,21 @@ where L::Target: Logger {
        drawn_routes.sort_by_key(|paths| paths.iter().map(|path| path.get_total_fee_paid_msat()).sum::<u64>());
        let mut selected_paths = Vec::<Vec<Result<RouteHop, LightningError>>>::new();
        for payment_path in drawn_routes.first().unwrap() {
-               selected_paths.push(payment_path.hops.iter().map(|(payment_hop, node_features)| {
+               let mut path = payment_path.hops.iter().map(|(payment_hop, node_features)| {
                        Ok(RouteHop {
                                pubkey: PublicKey::from_slice(payment_hop.node_id.as_slice()).map_err(|_| LightningError{err: format!("Public key {:?} is invalid", &payment_hop.node_id), action: ErrorAction::IgnoreAndLog(Level::Trace)})?,
                                node_features: node_features.clone(),
                                short_channel_id: payment_hop.short_channel_id,
-                               channel_features: payment_hop.channel_features.clone(),
+                               channel_features: payment_hop.candidate.features(),
                                fee_msat: payment_hop.fee_msat,
-                               cltv_expiry_delta: payment_hop.cltv_expiry_delta,
+                               cltv_expiry_delta: payment_hop.candidate.cltv_expiry_delta(),
                        })
-               }).collect());
+               }).collect::<Vec<_>>();
+               let mut prev_cltv_expiry_delta = final_cltv_expiry_delta;
+               for hop in path.iter_mut().rev() {
+                       core::mem::swap(&mut prev_cltv_expiry_delta, &mut hop.as_mut().unwrap().cltv_expiry_delta);
+               }
+               selected_paths.push(path);
        }
 
        if let Some(features) = &payee.features {

Contributor Author

Heh, thought there might be some concern over that. Thanks for the workaround! Will incorporate it.

Collaborator

@TheBlueMatt TheBlueMatt Jan 5, 2022

Note that you can also drop the short_channel_id field trivially. We should even be able to drop the node_id field, given that it's all findable, but it's a bit more work (and probably involves dropping the use of the DirectedChannelInfo constructors entirely, or at least moving the NodeId into the DirectionalChannelInfos instead of being outside of them in the ChannelInfo).

Contributor Author

Any chance you could edit the patch to use tabs for indentation? Probably a copy/paste issue.

Contributor Author

Note that you can also drop the short_channel_id field trivially.

Looks like your most recent version did so. 🙂

We should even be able to drop the node_id field, given that it's all findable, but it's a bit more work (and probably involves dropping the use of the DirectedChannelInfo constructors entirely, or at least moving the NodeId into the DirectionalChannelInfos instead of being outside of them in the ChannelInfo).

Haven't taken a look at this in detail, but note that DirectedChannelInfo has accessors for the source and target NodeIds.

Collaborator

Haven't taken a look at this in detail, but note that DirectedChannelInfo has accessors for the source and target NodeIds.

Right, I guess we could store the DirectedChannelInfo directly in the CandidateRouteHop::PublicHop variant, but for now it's not there and there's no easy way to get at the NodeId because it loses direction information. I hacked up a solution to this to see, and it wasn't really any faster despite saving the 32 bytes. I think there's a direction there, but I may have added more branching, which made things worse enough that saving the memcpys didn't help.

Contributor Author

Ah, yeah... I think I couldn't get that to work when returning the DirectedChannelInfo from a closure. Probably just need to have the variant hold the ChannelInfo reference, too. But I also liked that direction was not an Option.

Contributor Author

Maybe not worth worrying about now.

@jkczyz jkczyz force-pushed the 2021-12-probabilistic-scorer branch from d64b8d4 to 63b509f Compare January 6, 2022 01:18
Collaborator

@TheBlueMatt TheBlueMatt left a comment

All generally pretty straightforward once I wrapped my head around the type indirection, which I always find a bit confusing, but it's not clear how to cut down the size of the patch much more without making it harder to reason about the logic.

lightning/src/routing/network_graph.rs (outdated, resolved)
enum Probability {
Zero,
One,
Ratio { numerator: u64, denominator: u64 },
Collaborator

Why not just store an f64 given it's gonna be divided at the callsite? I'm not really sure why we need a struct here at all; f64 is pretty well understood for probability, and you can still check for equality with zero and one without loss of precision.

Contributor Author

Had been considering whether having it in this form would be useful for approximating the log as per #1166 (comment). Basically, avoid the division by approximating the log of the numerator and denominator separately and taking the difference. Though, admittedly, the approximation technique wasn't very clear to me from that comment. Could use f64 if you prefer.
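For reference, the division-avoiding trick being described is presumably just the log identity (this is my reading of the comment, not something stated in the thread):

```latex
-\log_{10}\frac{n}{d} = \log_{10} d - \log_{10} n
```

so the two logs can be approximated separately on integers and subtracted, with no floating-point division in the hot path.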

Collaborator

Right, fair, avoiding the FPU in the inner loop of the router may be useful, but given this is an internal API that we can change later, I don't see why we should eat the complexity now vs just adding it later if we want to do some approximation. Sadly, our benchmarks for routing currently don't hit the scoring code much because the scorer is uninitialized initially. Doesn't need to happen in this PR, but ideally we'd add the scorer to the benchmark(s).

Contributor Author

Leaving in for now to try and update the benchmark in parallel with review, but happy to remove it before merge if needed.

renepickhardt

While the approximation of the negative log probability is just the mentioned fraction, all my tests show that this linear approximation is not too useful, as it (similarly to a linear fee rate) tends to saturate channels when computing the optimal split. As far as I understand, we are not computing optimal splits here anyway, but I thought it useful to point out that the convexity of the negative log probability is actually very desirable.
This problem of fully saturating channels when switching to the linearized problem can be avoided by changing the model of the uncertainty network and constraining channel capacities to be lower than they actually are. However, as of right now it is totally unclear to me how to do this properly without arbitrary choices. Yet it would allow us to use simplex-based solvers for linear min-cost flow problems, which would yield a huge runtime improvement (assuming the base fee is set to zero or heavily disincentivized, as ZmnSCPxj suggested).

Contributor Author

@renepickhardt Not sure I quite understand the comment. The discussion here is only about approximating log10, not about approximating the negative log of the success probability.

Collaborator

@TheBlueMatt TheBlueMatt Jan 19, 2022

So you're saying we should just use a linear division instead of log entirely? That is confusing to me.

/// [`ProbabilisticScoringParameters`] for details.
///
/// [1]: https://arxiv.org/abs/2107.05322
pub type ProbabilisticScorer<G> = ProbabilisticScorerUsingTime::<G, ConfiguredTime>;
Collaborator

Hmm, if we're not gonna delete Scorer (I think we can, tbh, given it's naive and we have a trait for it anyway, but if you don't want to that's fine), then we should probably put this new stuff in its own module, no?

Contributor Author

@jkczyz jkczyz Jan 6, 2022

I figured we'd get rid of Scorer eventually. Could do so now, if you want... I have no strong preference. But I didn't want to use the same name given the serialization formats are different. Not sure if it makes sense to blindly upgrade users (if we could even detect that we're reading the old format?) or force them to upgrade by changing the type name to ProbabilisticScorer. Let me know what you think.

Collaborator

@TheBlueMatt TheBlueMatt Jan 7, 2022

Hmmm, fair, I guess if the intent is to leave both for a version or two and then drop Scorer after a version or two that seems fine to me. Fine to leave them in the same module for now, then.

Probability::One => 0,
Probability::Ratio { numerator, denominator } => {
let success_probability = numerator as f64 / denominator as f64;
(-(success_probability.log10()) * amount_msat as f64) as u64
Collaborator

Does it make sense to multiply by the amount here given we're not trying to do any bandwidth-based scoring techniques? If we do eventually get a smarter router, it seems like that will just blow up the path into a ton of MPP retries. I'm a bit confused because this doesn't seem intuitive to me at all; I'd think you'd want some (configurable) constant multiplied by the success probability log.

Contributor Author

Wasn't really sure about this and had posed a question here: #1172 (comment). Had left the amount in the draft while waiting for a response, but yeah maybe a constant/param makes more sense.

I was primarily concerned with mixing two costs (a probability and the channel fees). Wasn't sure if we should normalize to use msats (as done here) -- and if so, how -- or have the scorer work in terms of either a probability or a penalty (e.g., by returning an enum) and have the router determine if the channel fees should be considered at all based on the variant.

Collaborator

Right, we have to normalize, but to normalize we'd want to make it a knob, I think: "msats charged per log(failure %)". The lack of a knob here means users aren't able to control by how much they want to avoid failures, i.e., they may be willing to take a little longer to pay by retrying a few times in order to avoid paying additional fees. Multiplying by the amount being sent doesn't seem like a useful "penalty per log(failure rate)" answer.

Contributor Author

Added a liquidity_penalty_multiplier_msat parameter and used that instead of the amount.
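As a rough illustration of the resulting knob (the helper name and signature here are assumed for the sketch, not the PR's exact code), the router effectively compares fee plus scaled negative-log-probability across candidate channels, so a larger multiplier shifts preference toward reliable channels:

```rust
// Hypothetical sketch: a user-set `liquidity_penalty_multiplier_msat` scales
// -log10(success probability) into msats, which is added to the channel's fee.
fn channel_cost_msat(
    fee_msat: u64,
    success_probability: f64,
    liquidity_penalty_multiplier_msat: u64,
) -> u64 {
    // Negative log of the success probability, normalized into msats.
    let penalty =
        (-success_probability.log10() * liquidity_penalty_multiplier_msat as f64) as u64;
    fee_msat.saturating_add(penalty)
}

fn main() {
    // A cheap but unreliable channel vs. a pricier, reliable one: with this
    // multiplier the reliable channel wins despite its higher fee.
    let cheap_flaky = channel_cost_msat(100, 0.25, 10_000);
    let pricey_solid = channel_cost_msat(2_000, 0.95, 10_000);
    assert!(cheap_flaky > pricey_solid);
    println!("flaky: {} msat, solid: {} msat", cheap_flaky, pricey_solid);
}
```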

lightning/src/routing/scoring.rs (outdated, resolved)
lightning/src/routing/network_graph.rs (outdated, resolved)
@jkczyz jkczyz force-pushed the 2021-12-probabilistic-scorer branch 2 times, most recently from 99e637b to 22d285d Compare January 7, 2022 15:54
codecov-commenter commented Jan 7, 2022

Codecov Report

Merging #1227 (18e467b) into main (457e48e) will increase coverage by 0.14%.
The diff coverage is 93.51%.

❗ Current head 18e467b differs from pull request most recent head d5db85e. Consider uploading reports for the commit d5db85e to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1227      +/-   ##
==========================================
+ Coverage   90.41%   90.56%   +0.14%     
==========================================
  Files          71       71              
  Lines       38356    39248     +892     
==========================================
+ Hits        34680    35544     +864     
- Misses       3676     3704      +28     
Impacted Files Coverage Δ
lightning-invoice/src/payment.rs 92.75% <ø> (ø)
lightning/src/util/test_utils.rs 82.26% <ø> (ø)
lightning/src/routing/router.rs 91.79% <80.79%> (-0.02%) ⬇️
lightning/src/routing/network_graph.rs 89.98% <90.74%> (+0.25%) ⬆️
lightning/src/routing/scoring.rs 95.03% <96.66%> (+1.76%) ⬆️
lightning-background-processor/src/lib.rs 93.04% <100.00%> (ø)
lightning-invoice/src/utils.rs 84.07% <100.00%> (ø)
lightning/src/ln/channelmanager.rs 84.47% <100.00%> (ø)
lightning/src/ln/functional_test_utils.rs 95.23% <100.00%> (ø)
lightning/src/ln/functional_tests.rs 97.21% <100.00%> (-0.07%) ⬇️
... and 8 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 457e48e...d5db85e.

@jkczyz jkczyz mentioned this pull request Jan 7, 2022
@jkczyz jkczyz force-pushed the 2021-12-probabilistic-scorer branch from 22d285d to eb0b930 Compare January 11, 2022 16:03
@jkczyz jkczyz marked this pull request as ready for review January 11, 2022 16:11
@jkczyz jkczyz force-pushed the 2021-12-probabilistic-scorer branch from eb0b930 to 055a356 Compare January 11, 2022 21:39
@TheBlueMatt TheBlueMatt added this to the 0.0.105 milestone Jan 13, 2022
Contributor Author

jkczyz commented Jan 13, 2022

Benchmark using Scorer::with_fixed_penalty(0):

test routing::router::benches::generate_mpp_routes        ... bench:  50,453,566 ns/iter (+/- 34,709,542)
test routing::router::benches::generate_routes            ... bench:  49,923,040 ns/iter (+/- 37,960,924)

Benchmark using ProbabilisticScorer with log10:

test routing::router::benches::generate_mpp_routes        ... bench:  71,292,994 ns/iter (+/- 47,668,027)
test routing::router::benches::generate_routes            ... bench:  68,730,551 ns/iter (+/- 48,475,471)

Benchmark using ProbabilisticScorer with log10_approx:

test routing::router::benches::generate_mpp_routes        ... bench:  67,494,709 ns/iter (+/- 44,359,544)
test routing::router::benches::generate_routes            ... bench:  65,806,830 ns/iter (+/- 48,331,735)

Note that each iteration of the ProbabilisticScorer versions needs to create a new ProbabilisticScorer and thus must copy the node's pubkey with NodeId::from_pubkey.

@@ -4968,6 +4994,8 @@ mod benches {
let dst = PublicKey::from_slice(nodes.keys().skip(seed % nodes.len()).next().unwrap().as_slice()).unwrap();
let payee = Payee::from_node_id(dst).with_features(InvoiceFeatures::known());
let amt = seed as u64 % 1_000_000;
let params = ProbabilisticScoringParameters::default();
let scorer = ProbabilisticScorer::new(params, &src, &graph);
Collaborator

Can you add new benchmarks - one with the classic Scorer with no data (or, better yet, a dummy scorer that always returns 0 and can be inlined), one with a Scorer with random data, and one with ProbabilisticScorer. In the interest of expediting this PR we can skip the second, because the first just requires copying the function and maybe a five-line struct definition.

Contributor Author

Is it worth keeping around both generate_routes and generate_mpp_routes variations of the benchmarks? AFAICT, get_route will use MPP if the node announcement in the NetworkGraph indicates support, even if the invoice doesn't. Looking at some results, there isn't a material difference between the two for a given scorer.

test routing::router::benches::generate_mpp_routes_with_default_scorer       ... bench:  92,654,109 ns/iter (+/- 31,531,162)
test routing::router::benches::generate_mpp_routes_with_probabilistic_scorer ... bench: 107,804,711 ns/iter (+/- 53,006,509)
test routing::router::benches::generate_mpp_routes_with_zero_penalty_scorer  ... bench:  75,926,475 ns/iter (+/- 37,127,351)
test routing::router::benches::generate_routes_with_default_scorer           ... bench:  92,888,708 ns/iter (+/- 32,657,001)
test routing::router::benches::generate_routes_with_probabilistic_scorer     ... bench: 107,996,503 ns/iter (+/- 51,133,247)
test routing::router::benches::generate_routes_with_zero_penalty_scorer      ... bench:  75,003,204 ns/iter (+/- 36,967,684)

Contributor Author

Ah, seems I can hit the non-MPP case by using Some(InvoiceFeatures::empty()) instead of None.

Added some commits up front that:

  • Refactors out a benchmark utility to use with different scorers and features
  • Exercises first_hops, which simplifies testing ProbabilisticScorer
  • Seeds the scorer with success and failure data
  • Benchmarks a zero-penalty scorer

The commits are fairly self-contained, but let me know if you prefer that I pull them out into a separate PR.

@valentinewallace valentinewallace self-requested a review January 14, 2022 17:47
@jkczyz jkczyz force-pushed the 2021-12-probabilistic-scorer branch 3 times, most recently from 0544c75 to 25f8816 Compare January 17, 2022 04:56
lightning/src/routing/network_graph.rs (outdated, resolved)
match self {
CandidateRouteHop::FirstHop { .. } => 0,
CandidateRouteHop::PublicHop { info, .. } => {
info.direction().unwrap().cltv_expiry_delta as u32
Contributor

Not obvious to me that this is safe to unwrap, similar elsewhere 🤔 Could we add a comment?

Contributor Author

Good catch! One case was inadvertently removed in a fixup. The other is guarded by an if let. Added a comment to the field.

Comment on lines 1473 to 1476
let mut prev_cltv_expiry_delta = final_cltv_expiry_delta;
for hop in path.iter_mut().rev() {
core::mem::swap(&mut prev_cltv_expiry_delta, &mut hop.as_mut().unwrap().cltv_expiry_delta);
}
Contributor

I see tests fail without this, but I'm confused about what it's accomplishing. Comment?

Contributor Author
@jkczyz jkczyz Jan 19, 2022

This change was from Matt's fixup which removed modification of this field in ordered_hops inside the path_walk loop. Added a comment and used fold to propagate the delta backwards. @TheBlueMatt Let me know if the comment makes sense or could be expanded somehow.

Collaborator

Right, this was previously in the if node_id == our_node_id loop above (see the comment there for more info); it moved here to handle moving the data into the CandidateRouteHop.
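For readers following along, the effect of the swap loop quoted above can be sketched in isolation. This is a simplified illustration using plain u32 slices; shift_cltv_deltas is a hypothetical helper of mine, not LDK code. Each hop ends up with the cltv_expiry_delta of the hop after it, and the last hop receives final_cltv_expiry_delta.

```rust
// Hypothetical simplification of the backwards shift discussed above: walk
// the path in reverse, swapping so each hop takes the delta of its successor.
fn shift_cltv_deltas(deltas: &mut [u32], final_cltv_expiry_delta: u32) {
    let mut prev_cltv_expiry_delta = final_cltv_expiry_delta;
    for delta in deltas.iter_mut().rev() {
        core::mem::swap(&mut prev_cltv_expiry_delta, delta);
    }
}

fn main() {
    let mut deltas = [10, 20, 30];
    shift_cltv_deltas(&mut deltas, 40);
    // Each hop now carries its successor's delta; the last hop got 40.
    assert_eq!(deltas, [20, 30, 40]);
}
```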

lightning/src/routing/scoring.rs (resolved thread)
lightning/src/routing/scoring.rs (outdated, resolved thread)
let leading_zeros = x.leading_zeros();
let integer_part = (63 - leading_zeros) as f64;
let fractional_part = LOG2_FRACTIONAL_PART[
(((x << leading_zeros) >> (63 - FRACTIONAL_BITS)) & FRACTIONAL_BITMASK) as usize
Contributor
@valentinewallace valentinewallace Jan 18, 2022

This commit kinda adds a good amount of complexity imo... Are we sure the perf gainz are worthwhile? Maybe better for follow-up?

Contributor Author

Here are the latest benchmark numbers both before and after this optimization.

Without approximation:

test routing::router::benches::generate_mpp_routes_with_default_scorer       ... bench:  88,374,024 ns/iter (+/- 19,792,155)
test routing::router::benches::generate_mpp_routes_with_probabilistic_scorer ... bench: 104,953,746 ns/iter (+/- 48,003,617)
test routing::router::benches::generate_mpp_routes_with_zero_penalty_scorer  ... bench:  74,455,389 ns/iter (+/- 37,400,486)
test routing::router::benches::generate_routes_with_default_scorer           ... bench:  43,661,367 ns/iter (+/- 11,777,146)
test routing::router::benches::generate_routes_with_probabilistic_scorer     ... bench:  52,373,973 ns/iter (+/- 25,136,830)
test routing::router::benches::generate_routes_with_zero_penalty_scorer      ... bench:  36,233,816 ns/iter (+/- 18,920,800)

With approximation:

test routing::router::benches::generate_mpp_routes_with_default_scorer       ... bench:  90,098,517 ns/iter (+/- 20,624,987)
test routing::router::benches::generate_mpp_routes_with_probabilistic_scorer ... bench: 102,870,120 ns/iter (+/- 50,438,276)
test routing::router::benches::generate_mpp_routes_with_zero_penalty_scorer  ... bench:  74,644,550 ns/iter (+/- 39,539,073)
test routing::router::benches::generate_routes_with_default_scorer           ... bench:  43,759,396 ns/iter (+/- 10,977,942)
test routing::router::benches::generate_routes_with_probabilistic_scorer     ... bench:  50,402,866 ns/iter (+/- 25,601,834)
test routing::router::benches::generate_routes_with_zero_penalty_scorer      ... bench:  35,911,935 ns/iter (+/- 17,529,777)

So roughly 2 ms faster, though the numbers may be a little noisy. Happy to drop the commit if you both agree. It would save me the effort of updating the unit test assertions. 😛

Contributor

My initial reaction is that those gains seem somewhat not worth it or at least better suited for follow-up, but would have to get @TheBlueMatt's thoughts

Collaborator

Hmm, this seems... not useful at all? It's maybe useful, but clearly dwarfed by noise, so it's unclear if it's useful. I'd say avoid it until a followup PR?

Contributor Author
@jkczyz jkczyz Jan 19, 2022

Yeah, FWIW I think the noise is somewhat inherent to the benchmark setup. There's bound to be variance if the distance between source and destination nodes varies between iterations. We may want to use a smaller set of pairs but use all of them on each iteration instead of just one per iteration.

Contributor Author

Ran some benchmarks with a single path greater than 7 hops. The variance went down but similar results in terms of performance. Reverted the approximation commit and removed the Probability enum.
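For reference, the kind of fixed-point log approximation being benchmarked in this thread can be sketched as follows. This mirrors the quoted fragment but is an illustrative reimplementation, not LDK's actual code: FRACTIONAL_BITS = 4 is an assumption of mine, and the table entry log2(1 + i/16) is computed on the fly here rather than looked up from a precomputed LOG2_FRACTIONAL_PART table.

```rust
// Sketch of a fixed-point base-2 log approximation (illustrative, not LDK's
// exact code): the integer part of log2(x) is the index of the most
// significant bit; the fractional part comes from the next few bits.
const FRACTIONAL_BITS: u32 = 4;
const FRACTIONAL_BITMASK: u64 = (1 << FRACTIONAL_BITS) - 1;

fn log2_approx(x: u64) -> f64 {
    assert!(x > 0);
    let leading_zeros = x.leading_zeros();
    let integer_part = (63 - leading_zeros) as f64;
    // The FRACTIONAL_BITS bits immediately below the MSB select the fractional
    // entry; a real implementation would precompute these in a lookup table.
    let i = ((x << leading_zeros) >> (63 - FRACTIONAL_BITS)) & FRACTIONAL_BITMASK;
    let fractional_part = (1.0 + i as f64 / (1u64 << FRACTIONAL_BITS) as f64).log2();
    integer_part + fractional_part
}

fn main() {
    assert_eq!(log2_approx(1), 0.0);
    assert_eq!(log2_approx(1 << 20), 20.0);
    // 3 = 0b11: integer part 1, fractional part log2(1.5).
    assert!((log2_approx(3) - 3f64.log2()).abs() < 0.05);
}
```

The appeal of this shape is that `leading_zeros` compiles to a single instruction on most targets, so the whole approximation is a shift, a mask, and a table load.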

lightning/src/routing/scoring.rs (resolved thread)
@jkczyz jkczyz force-pushed the 2021-12-probabilistic-scorer branch from 25f8816 to d2806cb on January 19, 2022 17:18
@renepickhardt renepickhardt left a comment

I am in the middle of reviewing and couldn't add more comments as the diff was changed. I hope the comments are not lost, but I thought maybe I can just review the comments that have already been made.

fn default() -> Self {
Self {
liquidity_penalty_multiplier_msat: 1000,
liquidity_offset_half_life: Duration::from_secs(3600),


How is that value being chosen? Is there empirical evidence or is this just an arbitrary choice to start with and collect experience? Might make sense to make this configurable either in config or in the API that invokes path selection. In this way, if the returned candidate paths are not satisfiable, one might be able to give the half-life duration as an argument and make all the knowledge be forgotten, for example by setting the value to 0 or 1 second.

Contributor Author

How is that value being chosen? Is there empirical evidence or is this just an arbitrary choice to start with and collect experience? Might make sense to make this configurable either in config or in the API that invokes path selection.

This is an arbitrary value and is only configurable upon initialization as of now.

In this way, if the returned candidate paths are not satisfiable, one might be able to give the half-life duration as an argument and make all the knowledge be forgotten, for example by setting the value to 0 or 1 second.

Hmm... not sure the best way to go about this (and if we'd want to do so). The caller is parameterized with any scorer, one which may not have any concepts of decay. I suppose the Score interface could have a method that returns an "empty" instance of itself to use temporarily.

Collaborator

Because the data is updated during processing I'm not sure how it could be configurable per-route-finding call, but it is configurable.
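For intuition, the half-life decay being discussed can be sketched as below. This is a hedged illustration of the documented behavior (learned knowledge halves every liquidity_offset_half_life), not LDK's actual implementation; decayed_offset_msat is a name I made up.

```rust
use std::time::Duration;

// Illustrative sketch of half-life decay: a learned liquidity offset loses
// half of its weight every `half_life`. Name and signature are hypothetical,
// not LDK's actual API.
fn decayed_offset_msat(offset_msat: u64, elapsed: Duration, half_life: Duration) -> u64 {
    let half_lives = elapsed.as_secs_f64() / half_life.as_secs_f64();
    (offset_msat as f64 * 0.5f64.powf(half_lives)) as u64
}

fn main() {
    let half_life = Duration::from_secs(3600); // the default under discussion
    // No time elapsed: the offset is untouched.
    assert_eq!(decayed_offset_msat(1_000_000, Duration::from_secs(0), half_life), 1_000_000);
    // After exactly one half-life, roughly half the offset remains.
    let d = decayed_offset_msat(1_000_000, Duration::from_secs(3600), half_life);
    assert!((d as i64 - 500_000).abs() <= 1);
}
```

Setting the half-life to a tiny value, as Rene suggests, would make `0.5^half_lives` collapse toward zero almost immediately, effectively forgetting all learned bounds.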

/// Returns the [`EffectiveCapacity`] of the channel in a specific direction.
///
/// This is either the total capacity from the funding transaction, if known, or the
/// `htlc_maximum_msat` for the direction as advertised by the gossip network, if known,


Let there be a channel of capacity 100M msats as seen in the funding tx, and let the htlc_maximum_msat be 10M msats. This means of course that the probability for any single htlc with an amount larger than 10M msats to be forwarded should be 0. However, I don't think the probability to forward exactly 10M msats should be close to 0 but rather close to 0.9 = (100M - 10M) / (100M) ~ (100M - 10M + 1) / (100M + 1). What I am trying to say: in probabilistic scoring we should always start from the full capacity. Of course, given prior knowledge, we might use a term like effective capacity that reduces the interval in which uncertainty exists. I will have to find the position in the code where the actual score is being computed; still, I thought I'd leave this comment here.

Contributor Author
@jkczyz jkczyz Jan 19, 2022

Yeah, I had spoken to @TheBlueMatt about this but in the context of MPP. In get_route we reduce the effective capacity by any potential in-flight HTLCs for the same payment. But, as you noted, if htlc_maximum_msat <<< capacity then this may not be desirable, as we should be able to route more through the channel using multiple HTLCs. IIRC, Matt mentioned that implementations typically default htlc_maximum_msat to a value close to the channel capacity, which may alleviate both concerns even though it's not necessarily correct.

One thought I had was to pass the EffectiveCapacity to the scorer (instead of converting to msats) possibly with in-flight amounts and have it determine whether or not to subtract those. This is partly why I wanted to add EffectiveCapacity. I guess ideally htlc_maximum_msat would only need to be considered in get_route, obviating the need for the MaximumHTLC variant.

Collaborator

One further issue is many clients only have the htlc_maximum_msat value - some clients do not verify the on-chain output at all (subject to some potential DoS attacks). Depending on the direction gossip goes this may be increasingly common.

enum Probability {
Zero,
One,
Ratio { numerator: u64, denominator: u64 },


While the approximation of the negative log probability is just this mentioned fraction, all my tests show that this linear approximation is not too useful, as it (similarly to a linear fee rate) tends to saturate channels when computing the optimal split. As far as I understand we are not computing optimal splits here anyway, but I thought it is useful to point out that the convexity of the negative log prob is actually very desirable.
This problem of fully saturating channels when switching to the linearized problem can be avoided by changing the model of the uncertainty network and constraining channel capacities to be lower than they actually are. However, as of right now it is totally unclear to me how to do this properly without arbitrary choices. Yet it would allow us to use simplex-based solvers for linear min cost flow problems, which would yield a huge runtime improvement (assuming the base fee is set to zero or heavily disincentivised, as ZmnSCPxj suggested).

impl Default for ProbabilisticScoringParameters {
fn default() -> Self {
Self {
liquidity_penalty_multiplier_msat: 1000,


This seems only relevant if (as one should) fees are also taken into consideration as a feature to compute the cost of a hop / path (I think you internally name the cost Score, so please don't get confused when I talk about cost instead of score). As our benchmarks, and prior to them simulations in c-lightning, demonstrated (ElementsProject/lightning#4771 (comment)), using the harmonic mean instead of a weighted arithmetic mean to combine several features (negative log probs, fees, cltv delta) worked better. In case you would do the same, this value seems useless. Otherwise one could of course ask why it is chosen as 1000 right now.

Contributor Author

This seems only relevant if (as one should) fees are also taken into consideration as a feature to compute the cost of a hop / path (I think you internally name the cost Score, so please don't get confused when I talk about cost instead of score). As our benchmarks, and prior to them simulations in c-lightning, demonstrated (ElementsProject/lightning#4771 (comment)), using the harmonic mean instead of a weighted arithmetic mean to combine several features (negative log probs, fees, cltv delta) worked better. In case you would do the same, this value seems useless.

Currently, the scorer returns a penalty in msats for using a channel, which is simply added to the fees to get the cost:

let old_cost = cmp::max(old_entry.total_fee_msat, old_entry.path_htlc_minimum_msat)
.checked_add(old_entry.path_penalty_msat)
.unwrap_or_else(|| u64::max_value());
let new_cost = cmp::max(total_fee_msat, path_htlc_minimum_msat)
.checked_add(path_penalty_msat)
.unwrap_or_else(|| u64::max_value());

Given a scorer that computes a negative log of a success probability, how would you recommend combining this "score" with the fees to get a cost?

Otherwise one could of course ask why it is chosen as 1000 right now.

FYI, the value is an arbitrary default which can be customized by the user. That said, we may want a more reasonable default.

Collaborator

What do you mean "better", here? There is no objective metric by which one could argue one way of combining different goals is "better" than another - hence why this is configurable.


What do you mean "better", here? There is no objective metric by which one could argue one way of combining different goals is "better" than another - hence why this is configurable.

Yes, you have already mentioned that on the mailing list and I agree with you (never disagreed) that there is not one single objective, universally true metric here. The way I used the word "better" was with combining the relevant features which had previously been identified (e.g. probabilities or fees) in mind.

That being said, the topic here goes far beyond Lightning path finding / payment delivery. In many cases where you have several features (like, in our case, probability to model reliability (and with it potentially speed), fees to model actual costs, latency or geodistance to model speed, ...) you will always have the question: "How should those features be combined into one single score / metric?". It turns out that many people intuitively take the arithmetic mean, which is essentially just adding the features, or a weighted arithmetic mean, which is almost the same as normalizing the features and adding them without weight. You are currently also doing this and I would strongly advise against it.

You take your old score and add the new one, and you currently weight it with a factor of 1000 as in liquidity_penalty_multiplier_msat. In almost all situations in my life when I was doing data science / machine learning, it turned out that in practice the arithmetic mean a+w*b is not as useful as using a*b/(a+b) (I ignore constants). The latter seems to have several advantages in practice:

  1. You do not need to learn and test for the weight(s)
  2. Features (in our case, for channels) might not be correlated, so a weighting for one channel might be totally wrong for another channel
  3. It seems to be more resilient against outliers and seems to capture the mix of features better even if they are in different orders of magnitude or are differently distributed.

I have discussed this at great length with the c-lightning team, in particular @cdecker, and my preliminary simulations as well as our mainnet experiments and evaluations show that the arithmetic mean heavily underperformed in comparison to the other way of computing a mean. Also, recently I was made aware that there might actually be deeper mathematical motivation here, which in the fields of economics and risk management is used all the time and might have already been discovered in the early 1700s by Bernoulli in this work, though I haven't looked deeper into this yet.

Also, on a side note: the arithmetic mean obviously has the advantage of being linear and useful for the case of min cost flows if all the features are linear and convex. This is not true for the mean that I suggest. But as your code currently only uses Dijkstra, we don't need a linear or convex cost function.

Also, as suggested to the c-lightning team: if you don't believe me, it should be very easy to have both versions of the mean computation and do a simple A/B test against any of your favourite test criteria (I usually like "number of attempts" or time to delivery) to see if you want to keep the arithmetic mean here.
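A tiny numerical sketch of the difference being argued here (my own illustration, code from neither project): combining a probability cost `a` with a fee cost `b` via the weighted sum `a + w*b` versus Rene's suggested `a*b/(a+b)`.

```rust
// Illustrative comparison (not from LDK or c-lightning) of two ways to
// combine two cost features, e.g. a -log(probability) term and a fee term.
fn weighted_sum(a: f64, w: f64, b: f64) -> f64 { a + w * b }
fn combined(a: f64, b: f64) -> f64 { a * b / (a + b) }

fn main() {
    let prob_cost = 2.0;
    let fee_outlier = 1.0e9; // one wildly out-of-range feature
    // The weighted sum is dominated by the outlier...
    assert!(weighted_sum(prob_cost, 1.0, fee_outlier) > 1.0e8);
    // ...while a*b/(a+b) stays bounded by the smaller feature.
    assert!(combined(prob_cost, fee_outlier) < prob_cost);
}
```

The `a*b/(a+b)` form is bounded above by min(a, b), which is where the claimed outlier resilience comes from; the weighted sum grows without bound as either feature does.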

Collaborator

I'm still very confused by this; I read through the links you provide and none of them discuss failure modes or seem to provide any justification for the harmonic mean here, aside from some anecdotal suggestions. The discussion from c-lightning just links to an lnd issue, which provides no justification, and it looks like c-lightning stuck with simply adding the scores as well?

Indeed, the code here (and the current model) focuses entirely on fees as the metric we're optimizing for. As a result of this, we denominate the scorer as "amount of fee you're willing to pay to avoid this channel". This is not only nicely simple from a theoretical perspective, but also simple enough that our users can at least roughly understand the metric and can set the tunables here without much interaction from us - they can easily run trivial calculations to determine how much their "avoid channels" cost will be, and can tune things appropriately based on how much fee they're willing to pay. In the future we could even allow these tunables to be set on a per-payment basis, given we have a model here that doesn't incorporate the concrete fee values.

Indeed, in the future, I'd very much like to have multiple top-to-bottom routing algorithms, eg there has been very little work done on how you'd design a lightning routing algorithm "for privacy", without seriously considering fees or time-to-pay/-retries, but even just as-is, users having the ability to drastically tune the knob between retry-count and fees-paid seems pretty important, which gets much harder with the harmonic mean, IIUC.


https://github.com/ElementsProject/lightning/blob/a98ccac7774297f255f0a4f585384609af87c3d6/plugins/libplugin-pay.c#L735

shows what c-lightning does, and it was evaluated by conducting a mainnet evaluation: https://medium.com/blockstream/c-lightning-v0-10-2-bitcoin-dust-consensus-rule-33e777d58657

The blog article however left out the comparison between the means but I remember evaluating the statistics for test payments myself.

Of course the evaluation that @cdecker and I did might not have been sufficiently large or realistic from the assumptions.


btw, I think the much bigger argument for using an arithmetic mean is that it is linear, which means that if the features are linear the cost function stays linear (or if the features are convex the cost function stays convex). When using min cost flow solvers to compute optimal splits (given the cost function, and assuming the cost function properly captures the goals of the node), the arithmetic mean will not have a negative impact on the runtime of the min cost flow solvers (assuming all features are at worst convex). As the cost function is currently only used in Dijkstra, we need neither linearity nor convexity, so we don't care.

lightning/src/routing/scoring.rs (outdated, resolved thread)
/// Returns the success probability of routing the given HTLC `amount_msat` through the channel
/// in this direction.
fn success_probability(&self, amount_msat: u64) -> Probability {
let max_liquidity_msat = self.max_liquidity_msat();


Regarding one of my earlier comments, I tried to chase down the logic of how this value is derived, and as far as I can tell this is the channel capacity or a potentially decayed learned bound, but it is not constrained by htlc_maximum_msat like in the effective capacity data structure. I just leave the comment as I would like to suggest that a second pair of eyes double-check this.

Contributor Author

The EffectiveCapacity::as_msats is passed to the scorer's channel_penalty_msat method, which is used as the capacity here via ChannelLiquidity::as_directed.

Collaborator

I believe this value comes from the as_directed call in channel_penalty_msat which gets its information from the available_liquidity in the router, which calls DirectedChannelInfo::effective_capacity which is bound by the htlc_maximum_msat. Not quite sure whether you were saying this should or should not be, but it is.


Yeah, I addressed this in another comment. I say for probability computation you should use the actual capacity (reduced by potential knowledge about existing liquidity) instead of capping it at htlc_maximum_msat. Using htlc_maximum_msat will result in potentially strange results (see this comment: #1227 (comment))


impl<L: DerefMut<Target = u64>, T: Time, U: DerefMut<Target = T>> DirectedChannelLiquidity<L, T, U> {
/// Adjusts the channel liquidity balance bounds when failing to route `amount_msat`.
fn failed_at_channel(&mut self, amount_msat: u64) {


I just have to say that I really love to see this and the following functions being implemented here, and that you will track the uncertainty / information gain properly! I will be really curious how much this also improves payment delivery in practice in comparison to the recent updates that we have seen from c-lightning, using only probabilistic scoring but forgetting most of the learnt information directly.

Collaborator

Hopefully Cash App switches and uses the new one and we can get some good data from that!


/// Adjusts the channel liquidity balance bounds when successfully routing `amount_msat`.
fn successful(&mut self, amount_msat: u64) {
let max_liquidity_msat = self.max_liquidity_msat().checked_sub(amount_msat).unwrap_or(0);


I am a bit confused about whether this and the prior two functions always work properly with concurrency, but it seems to me that concurrency is actually not messing up the bookkeeping here. But how do you handle the decay of the knowledge when subtracting from max_liquidity? This could happen at two very different moments. Let's say you forward one htlc at 12:00 and the next at 12:45, and both have been successful. How will you compute the max liquidity at 13:05? As far as I understand, the knowledge from the first routing event should have decayed to half at that time. I am not sure how this is reflected in this data structure. It might not be a problem, as the half-life is chosen arbitrarily at this time anyway, but I thought I should point this out.

Contributor Author

I am a bit confused about whether this and the prior two functions always work properly with concurrency, but it seems to me that concurrency is actually not messing up the bookkeeping here.

It's accomplished by having get_route take a non-mut reference to a Score, while the event handler needs a mut reference. The compiler ensures we only ever have one mut reference, which we accomplish through using a Mutex. If you're curious how this works see:

/// A scorer that is accessed under a lock.
///
/// Needed so that calls to [`Score::channel_penalty_msat`] in [`find_route`] can be made while
/// having shared ownership of a scorer but without requiring internal locking in [`Score`]
/// implementations. Internal locking would be detrimental to route finding performance and could
/// result in [`Score::channel_penalty_msat`] returning a different value for the same channel.
///
/// [`find_route`]: crate::routing::router::find_route
pub trait LockableScore<'a> {
/// The locked [`Score`] type.
type Locked: 'a + Score;
/// Returns the locked scorer.
fn lock(&'a self) -> Self::Locked;
}
/// (C-not exported)
impl<'a, T: 'a + Score> LockableScore<'a> for Mutex<T> {
type Locked = MutexGuard<'a, T>;
fn lock(&'a self) -> MutexGuard<'a, T> {
Mutex::lock(self).unwrap()
}
}

R: for <'a> Router<<<S as Deref>::Target as LockableScore<'a>>::Locked>,
S::Target: for <'a> LockableScore<'a>,

But how do you handle the decay of the knowledge when subtracting from max_liquidity? This could happen at two very different moments. Let's say you forward one htlc at 12:00 and the next at 12:45, and both have been successful. How will you compute the max liquidity at 13:05? As far as I understand, the knowledge from the first routing event should have decayed to half at that time. I am not sure how this is reflected in this data structure. It might not be a problem, as the half-life is chosen arbitrarily at this time anyway, but I thought I should point this out.

max_liquidity_msat and min_liquidity_msat in DirectedChannelLiquidity return the decayed values. The internal representation in ChannelLiquidity consists of min and max un-decayed offsets along with a time the values were modified. The decay is realized in the offset representation any time they need to be modified. See set_min_liquidity_msat and set_max_liquidity_msat.

@renepickhardt renepickhardt left a comment

I did not look at the test cases yet. Overall I really like the tracking of the channel liquidity to make path finding decisions and I am really looking forward to see how the final version will perform!

As already addressed in my first part of the review, I think I have now discovered the line where you seem to reduce the effective liquidity too heavily in computing the probabilities, and I have provided a counterexample (assuming my understanding of the code is correct) in my comment. While of course channels should have a 0 probability if the amount is larger than the htlc_max_size constraint, the probabilities for amounts smaller than or equal to the constraint should still be computed with the actual channel capacity and not this effective capacity measure.

Also, I did not see how you handle channel reserve values. Technically one would always have to subtract the reserves too in the computation of the probabilities, as the reserves are not accessible for routing, but I guess the effect can be neglected as we usually try not to saturate the full liquidity anyway.

Let me know if you need help with benchmarking and with evaluating in a mainnet experiment.

Probability::Zero => u64::max_value(),
Probability::One => 0,
Probability::Ratio { numerator, denominator } => {
let log_success_probability = log10_approx(numerator, denominator);


While the log is merely introduced to switch from the multiplicative nature of channel probabilities for an entire path to adding the scores per hop, making this metric usable for shortest path algos and min cost flows, the base of the log does not really matter for the actual computation / ranking of the scored paths. Of course it might when combining with other features in a weighted arithmetic mean...

However, from an information-theoretic point of view the base-2 log is usually used, as that would measure the entropy and information gain in bits. It would also allow us to have very fast approximations, as we would just need the first significant bit of a number.

Contributor Author

While the log is merely introduced to switch from the multiplicative nature of channel probabilities for an entire path to adding the scores per hop, making this metric usable for shortest path algos and min cost flows, the base of the log does not really matter for the actual computation / ranking of the scored paths. Of course it might when combining with other features in a weighted arithmetic mean...

Note earlier comment about how we combine penalty and fees to get the cost.

However, from an information-theoretic point of view the base-2 log is usually used, as that would measure the entropy and information gain in bits. It would also allow us to have very fast approximations, as we would just need the first significant bit of a number.

IIRC, when using base 2, I would have a zero penalty for larger success probabilities where base 10 would give me something small but non-zero. Though this may also have had to do with the value I was multiplying by.

Additionally, when using log2(numerator) - log2(denominator), it's easy to get a penalty of zero if the amount is close to the min liquidity. That is, subtracting them from max liquidity to get the numerator and denominator, respectively, may give two values with identical most significant bits (i.e., zero when subtracted).


While the log is merely introduced to switch from the multiplicative nature of channel probabilities for an entire path to adding the scores per hop and making this metric usable for shortest path algos and min cost flows the base of the log does not really matter. for the actual computation / ranking of the scored paths. Of corse it might when combinging with other features in a weighted arithmetic mean...

Note earlier comment about how we combine penalty and fees to get the cost.

note my reply: #1227 (comment) I advise against the way you currently do that.

IIRC, when using base 2, I would have a zero penalty for larger success probabilities where base 10 would give me something small but non-zero. Though this may also have had to do with the value I was multiplying by.

nit: which, when rounding is taking place, might actually be ok. Small payments kind of don't mind which channel to take. They have a really high probability, and the log could be close to 0 or exactly 0 (though technically only for a probability of 1 do we want zero as a result).

Additionally, when using log2(numerator) - log2(denominator), it's easy to get a zero penalty if the amount is close to the min liquidity. That is, subtracting them from max liquidity to get the numerator and denominator, respectively, may give two values with identical most significant bits (i.e., zero when subtracted).

Shouldn't that only be true if you take integer values of the base-2 log or, as I suggested, estimate it with the position of the first non-zero bit of the numerator and denominator?
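To make this rounding concern concrete, here is a hedged sketch (hypothetical function names, not the actual scorer internals) comparing an exact floating-point negative-log penalty with a most-significant-bit approximation of log2, assuming a uniform liquidity prior on [min, max]:

```rust
// Sketch only: assumes a uniform liquidity prior on [min, max] and an
// `amount` within those bounds; names are hypothetical, not LDK's API.

/// Success probability of forwarding `amount` msats:
/// P = (max + 1 - amount) / (max + 1 - min).
fn success_probability(amount: u64, min: u64, max: u64) -> f64 {
    (max + 1 - amount) as f64 / (max + 1 - min) as f64
}

/// Exact penalty: multiplier * -log2(P).
fn exact_penalty_msat(amount: u64, min: u64, max: u64, multiplier: u64) -> u64 {
    (-success_probability(amount, min, max).log2() * multiplier as f64) as u64
}

/// Integer approximation using only the most significant bit:
/// log2(n) ~= 63 - n.leading_zeros(). This truncates to zero whenever the
/// numerator and denominator share the same MSB position.
fn approx_penalty_msat(amount: u64, min: u64, max: u64, multiplier: u64) -> u64 {
    let numerator = max + 1 - amount;
    let denominator = max + 1 - min;
    let log2_num = 63 - numerator.leading_zeros() as u64;
    let log2_den = 63 - denominator.leading_zeros() as u64;
    (log2_den - log2_num) * multiplier
}

fn main() {
    // 100 msat over a [0, 1000] channel: P = 901/1001, so the exact penalty
    // is roughly 152 msat at a multiplier of 1000, while 901 and 1001 share
    // MSB position 9 and the approximation collapses to 0.
    println!("exact:  {} msat", exact_penalty_msat(100, 0, 1000, 1000));
    println!("approx: {} msat", approx_penalty_msat(100, 0, 1000, 1000));
}
```

This illustrates both points above: an integer base-2 log zeroes out the penalty exactly when the amount is small relative to the bounds, which may or may not be acceptable depending on how the penalty is combined with fees.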


let capacity_msat = network_graph.channels()
.get(&hop.short_channel_id)
.and_then(|channel| channel.as_directed_to(&target).map(|d| d.effective_capacity()))

If the effective_capacity is filled with values like htlc_maxsize_msat, I fear that the computed probabilities will be lower than the actual probabilities and the negative log probabilities will be overestimated. One might argue that this might not even be a problem if we make the same mistake for all channels, but the htlc_maxsize_msat value will differ across channels. Let me make this clear with an example.

Let's assume we want to route 1000 msat and we have two channels to choose from:

a) a channel of size 3000 msats with a max htlc size of 3000 msats
b) a channel of size 30000000 msats with a max htlc size of 2000 msats

With your definition of effective capacity being bounded by the max htlc size, you would give channel a) a higher likelihood, when in reality the success probability of channel b) should be much higher!
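A hedged sketch of this example, using the uniform-prior success probability P = (capacity + 1 - amount) / (capacity + 1) (the function name is illustrative, not the PR's code):

```rust
// Sketch only: uniform liquidity prior on [0, capacity]; not LDK's API.
fn success_probability(amount_msat: u64, capacity_msat: u64) -> f64 {
    (capacity_msat + 1 - amount_msat) as f64 / (capacity_msat + 1) as f64
}

fn main() {
    let amount = 1_000;
    let p_a = success_probability(amount, 3_000); // channel a): ~0.667
    let p_b_true = success_probability(amount, 30_000_000); // channel b): ~0.99997
    // Treating the max htlc size (2000 msat) as channel b)'s effective capacity:
    let p_b_capped = success_probability(amount, 2_000); // ~0.500
    println!("a: {:.3}, b true: {:.5}, b capped: {:.3}", p_a, p_b_true, p_b_capped);
    // The cap makes b) look worse than a), even though b)'s real success
    // probability is far higher.
}
```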

Contributor Author

Yep, I assume this is the same point made in the following conversation?

https://github.com/lightningdevkit/rust-lightning/pull/1227/files#r787934743

Yes, it is the same, only that here is the point where it is actually being initialized. I think this should really be fixed. While stuffing several small HTLCs into a channel in the context of MPP is one concern, I think the much bigger concern is that the probability computation becomes severely wrong. Large channels especially should be highly attractive / likely. But I guess folks with large channels will want to cap the max size of HTLCs, which means you would potentially not find such channels, which have a high chance of having the liquidity.

Collaborator

This may be worth fixing, though in practice today it's a relatively rare issue. It's also worth noting that not all users have chain data, and those don't bother fetching the channel's capacity by looking at the chain, using htlc_maximum_msat instead to determine it, because it's the best info they have. Our users aren't the only ones doing this; I believe Electrum does this as well. Thus, we really need the implementation here to be robust even if we're using htlc_maximum_msat data to infer channel sizes. I'm not sure how best to do that, but just ensuring we pipe the values from on-chain through doesn't really solve the concern.

Ultimately this needs to be resolved at the spec level, of course, but given the values are basically always the same today we may just wait until we have a new gossip format for that.

Contributor Author

Addressed this as per a top-level comment, but dropped it in favor of a follow-up as per offline discussion. The commit was a bit more involved and had a performance regression, though pre-computing DirectedChannelInfo::htlc_maximum_msat seems to fix it, which is odd given it should only be calculated once.

@TheBlueMatt (Collaborator)

Looks like this needs rebase now.

@jkczyz (Contributor Author)

jkczyz commented Jan 20, 2022

Looks like this needs rebase now.

Sure does! Mind if I squash the fixups for 3c2df0c? Should make it a little easier.

@jkczyz jkczyz force-pushed the 2021-12-probabilistic-scorer branch from e0d8080 to 97fdf76 Compare February 1, 2022 03:23
lightning/src/routing/network_graph.rs Outdated Show resolved Hide resolved
lightning/src/routing/scoring.rs Outdated Show resolved Hide resolved
lightning/src/routing/scoring.rs Outdated Show resolved Hide resolved
Comment on lines 90 to 91
/// such as chain data, network gossip, or invoice hints, the latter indicating sufficient
/// capacity (i.e., near [`u64::max_value`]). Thus, implementations should be overflow-safe.
Contributor

I thought the previous wording might've been clearer here, suggestion being: "For hints provided in the invoice, we assume the channel has sufficient capacity to accept the invoice's full amount, and provide a channel_capacity_msat of u64::max_value. Thus..."

Contributor Author

Reworded a bit to tighten it up. Technically, it can be less than u64::max_value since in-flight HTLCs are deducted for the time being.

lightning/src/routing/scoring.rs Outdated Show resolved Hide resolved
lightning/src/routing/scoring.rs Outdated Show resolved Hide resolved
@jkczyz jkczyz force-pushed the 2021-12-probabilistic-scorer branch 2 times, most recently from 421833c to e4c49d7 Compare February 1, 2022 17:39
lightning/src/routing/scoring.rs Show resolved Hide resolved
lightning/src/routing/network_graph.rs Outdated Show resolved Hide resolved
lightning/src/routing/scoring.rs Outdated Show resolved Hide resolved
lightning/src/routing/scoring.rs Outdated Show resolved Hide resolved
@jkczyz jkczyz force-pushed the 2021-12-probabilistic-scorer branch from e4c49d7 to 575cae0 Compare February 1, 2022 23:38
@TheBlueMatt (Collaborator) left a comment

Basically there, mostly doc comments. Still need to finish discussing #1227 (comment)

lightning/src/routing/router.rs Outdated Show resolved Hide resolved
lightning/src/routing/router.rs Outdated Show resolved Hide resolved
/// Knowledge about channel liquidity balances takes the form of upper and lower bounds on the
/// possible liquidity. Certainty of the bounds is decreased over time using a decay function. See
/// [`ProbabilisticScoringParameters`] for details.
///
Collaborator

I feel like we should say something here about how this is likely to be really great for probing-based nodes or nodes with high payment flow, but because we learn the specific liquidity on a channel, instead of learning how often a channel is available, this is unlikely to be a great approach for nodes which have very low payment volume. Scorer isn't a ton better, admittedly, but we likely want to extend this to measure node "badness" later.

Collaborator

Thanks for the update. I think it's still a little vague. I suppose the decaying is also bad for long-term learning, but my other long-term-learning concern here is just about how we go about learning to begin with. Our whole thing here is about trying to remember the current channel balances on every channel on the network. We don't so much learn anything as try to keep a mental map of the whole network up-to-date all the time. That's great if you're probing or have payment flows, but otherwise we're really not learning at all, IMO.

Fully agree! Those seem to be two orthogonal problems.

  1. The probability that liquidity is available (as a probabilistic model for the reliability)
  2. The actual observed reliability of a specific node / channel (as a provenance score or distribution)

I claim (but obviously don't know) that the first one will be much more impactful. In particular, for the first one we know how to properly quantify this, as the probabilistic model gives us the proper information-theoretic framework.

That being said, I am very interested in examining the second one and investigating whether it is useful to include it in the cost function. I have drafts of an approach for assigning provenance scores as a probability distribution that can be used to create the joint distribution between this and the liquidity-based model. The good thing is that models for the second question can be learnt from log files and also tested against the log files. I will be very happy to collaborate on this, in particular as you seem to have access to data / log files.

Collaborator

Right, I assume that both will be important (though somewhat annoyingly the lightning onion errors don't tell us which happened when we get a failure: liquidity limit or node offline), but either way we don't yet do either and that would be bad for some users :)

Contributor Author

Re-worded the first part of the sentence as discussed offline.

Having two opposing conjectures / hypotheses is one of the best starting points for research. As said, I am very interested in these two questions (1. how to define a reliability-based probabilistic provenance score for nodes / channels; 2. how useful this will be in the cost function) and happy to collaborate to find this out. Also very happy to at least anecdotally be proven wrong on my conjecture.

lightning/src/routing/scoring.rs Show resolved Hide resolved
lightning/src/routing/scoring.rs Outdated Show resolved Hide resolved
lightning/src/routing/scoring.rs Show resolved Hide resolved
lightning/src/routing/scoring.rs Outdated Show resolved Hide resolved
max_liquidity_offset_msat,
capacity_msat,
last_updated: &self.last_updated,
now: T::now(),
Collaborator

Wow, that's a lot of system-time-gets, Batman! I guess it's pretty quick because it's usually rdtsc on x86 when warm, but we should probably fix this later. Maybe add a comment here noting that when we remove old entries, we should just decay them then and not bother decaying during lookups.

Contributor Author

Wow, that's a lot of system-time-gets, Batman! I guess it's pretty quick because it's usually rdtsc on x86 when warm, but we should probably fix this later.

This should be similar behavior to Scorer::decayed_penalty_msat, right?

Maybe add a comment here noting that when we remove old entries, we should just decay them then and not bother decaying during lookups.

Not sure I follow. The penalty may not be accurate if we don't decay during lookup. Could you clarify what you mean by "remove old entries"?

Collaborator

This should be similar behavior to Scorer::decayed_penalty_msat, right?

I think so? That doesn't mean it's great, though. IIRC it can be quite cheap on x86, but on other platforms it can be a full syscall.

Not sure I follow. The penalty may not be accurate if we don't decay during lookup. Could you clarify what you mean by "remove old entries"?

I'm assuming when we implement the "remove entries for closed channels" logic, we'll instead just remove entries on a timer that have decayed almost all the way to zero. Once we have a timer, we can just do the decays on the timer, instead of doing the decays while routing.

Contributor Author

Gotcha. Yeah, that would be a simple way to remove closed channels. I'd hope there would be a way to use the scorer without a background thread, though.

Another thought would be to push Time into either the Score or find_route interface so T::now could be called once and passed in as needed. Or even build it into LockableScore in some way such that locking gives back some (S, T) implementing Score.

Collaborator

Yea, that would work too. I dunno what the right answer is. We already require a background thread for the network graph to expire channels; at the time we'd discussed either doing that there or on serialization, and decided the background thread was simpler. I guess we could have an explicit "channel is dead" call here like we do for the network graph, but that feels like about as much work as just hooking it up to the background thread / a timer? Anyway, it's something we can discuss another time. I don't think this needs fixing here, but I worry this is only practically performant on x86.
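For reference, the half-life decay being discussed can be sketched as follows (parameter names assumed; the actual scorer's implementation may differ), whether it runs during lookups or on a timer:

```rust
use std::time::Duration;

// Sketch only: halve the liquidity-bound offsets once per elapsed half-life.
// Whole half-lives only; a real implementation may interpolate between them.
fn decayed_offset_msat(offset_msat: u64, elapsed: Duration, half_life: Duration) -> u64 {
    let half_lives = (elapsed.as_secs() / half_life.as_secs()) as u32;
    // checked_shr returns None for shifts >= 64, i.e. fully decayed.
    offset_msat.checked_shr(half_lives).unwrap_or(0)
}

fn main() {
    let half_life = Duration::from_secs(3_600);
    // Two half-lives: a 1_000_000 msat offset decays to 250_000 msat.
    let decayed = decayed_offset_msat(1_000_000, Duration::from_secs(7_200), half_life);
    println!("{}", decayed);
}
```

Decaying on a timer would mean calling something like this once per period for every entry, rather than fetching the current time on every channel lookup during routing.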

@jkczyz jkczyz force-pushed the 2021-12-probabilistic-scorer branch 2 times, most recently from 18e467b to f372fe9 Compare February 2, 2022 23:34
@TheBlueMatt (Collaborator)

Note the failing benchmarks CI run is already reported as an upstream rustc nightly bug at rust-lang/rust#93599

TheBlueMatt previously approved these changes Feb 3, 2022
@TheBlueMatt (Collaborator) left a comment

One further note about the "why not use this without lots of payment flows" comment at #1227 (comment) but I'm good with where we're at.

@valentinewallace (Contributor) left a comment

LGTM 🚀

lightning/src/routing/network_graph.rs Outdated Show resolved Hide resolved
A channel's capacity may be inferred or learned and is used to make
routing decisions, including as a parameter to channel scoring. Define
an EffectiveCapacity for this purpose. Score::channel_penalty_msat takes
the effective capacity (less in-flight HTLCs for the same payment), and
never None. Thus, for hops given in an invoice, the effective capacity
is now considered (near) infinite if over a private channel or based on
learned information if over a public channel.

If a Score implementation needs the effective capacity when updating a
channel's score, i.e. in payment_path_failed or payment_path_successful,
it can access the channel's EffectiveCapacity via the NetworkGraph by
first looking up the channel and then specifying which direction is
desired using ChannelInfo::as_directed.
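As a toy illustration of that lookup path (these are simplified stand-ins, not the actual NetworkGraph / ChannelInfo types):

```rust
// Toy model only: simplified stand-ins for LDK's graph types.
#[derive(Clone, Copy, PartialEq, Debug)]
struct NodeId(u8);

#[derive(Clone, Copy, PartialEq, Debug)]
enum EffectiveCapacity {
    /// Capacity in msats from chain data or gossip, capped by htlc_maximum.
    Known(u64),
    /// Invoice hint over a private channel: assumed (near) infinite.
    Infinite,
}

struct ChannelInfo {
    node_one: NodeId,
    node_two: NodeId,
    capacity_msat: Option<u64>,
    // Simplification: one htlc_maximum value for both directions.
    htlc_maximum_msat: Option<u64>,
}

impl ChannelInfo {
    /// Effective capacity in the direction *to* `target`, or None if
    /// `target` is not a party to the channel.
    fn as_directed_to(&self, target: &NodeId) -> Option<EffectiveCapacity> {
        if *target != self.node_one && *target != self.node_two {
            return None;
        }
        Some(match (self.capacity_msat, self.htlc_maximum_msat) {
            (Some(cap), Some(max)) => EffectiveCapacity::Known(cap.min(max)),
            (Some(cap), None) => EffectiveCapacity::Known(cap),
            (None, Some(max)) => EffectiveCapacity::Known(max),
            (None, None) => EffectiveCapacity::Infinite,
        })
    }
}

fn main() {
    let channel = ChannelInfo {
        node_one: NodeId(1),
        node_two: NodeId(2),
        capacity_msat: Some(30_000_000),
        htlc_maximum_msat: Some(2_000),
    };
    println!("{:?}", channel.as_directed_to(&NodeId(2)));
}
```

Note the cap by htlc_maximum here mirrors the PR's current behavior, which the review discussion above flags as a source of underestimated probabilities for large channels.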
Add a Score implementation based on "Optimally Reliable & Cheap Payment
Flows on the Lightning Network" by Rene Pickhardt and Stefan Richter [1].
Given the uncertainty of channel liquidity balances, probability
distributions are defined based on knowledge learned from successful and
unsuccessful attempts. Then the negative log of the success probability
is used to determine the cost of routing a specific HTLC amount through
a channel.

[1]: https://arxiv.org/abs/2107.05322
ProbabilisticScorer uses successful and unsuccessful payments to gain
more certainty of a channel's liquidity balance. Decay this knowledge
over time to indicate decreasing certainty about the liquidity balance.
@jkczyz jkczyz dismissed stale reviews from valentinewallace and TheBlueMatt via d5db85e February 3, 2022 02:23
@jkczyz jkczyz force-pushed the 2021-12-probabilistic-scorer branch from f372fe9 to d5db85e Compare February 3, 2022 02:23
Status: Done
6 participants