Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pay: Avoid clobbering the channel_hints at all costs #4195

Merged
merged 4 commits into from
Nov 17, 2020

Conversation

cdecker
Copy link
Member

@cdecker cdecker commented Nov 11, 2020

Due to the asynchronous getroute and createonion calls we were issuing multiple concurrent getroute calls for multi-part payments. This is particularly painful with the presplit modifier which causes all of the initial parts to be scheduled at the same time, competing for a route.

Since #4093 we no longer have an asynchronous call to getroute which alleviates the problem, but doesn't resolve it: createonion is still asynchronous and we were only updating the channel_hints after it was called. In addition plugins can yield control of the io_loop if they perform any asynchronous RPC call, resulting in the same issue again.

This PR moves the application of the route to the channel_hints up to just before the createonion call, which makes the route creation and postprocessing synchronous unless a plugin performs an RPC call.

In order to address the final venue for colliding route computations we introduce a new state, that serves only to repeat the getroute call, but nothing else (theoretically a plugin could hook into it as well, but that's unlikely to be of use).

@cdecker cdecker marked this pull request as draft November 11, 2020 18:03
@cdecker
Copy link
Member Author

cdecker commented Nov 11, 2020

As far as I can see the test_mpp_adaptive test is the only one failing at the moment, will need to dig into that one.

@m-schmoock
Copy link
Collaborator

@cdecker

I'm now running this on mainnet and will report if this works better in the next days...

@m-schmoock
Copy link
Collaborator

m-schmoock commented Nov 13, 2020

Tried this changes today with a 150$ payment on mainnet (this is impossible with unpatched code for me).
With this patch it worked in 5 seconds on the first attempt.

I captured stdout/info/debug incase someone needs to look...

@rustyrussell
Copy link
Contributor

@niftynei I think this may need consideration for rc2?

@cdecker
Copy link
Member Author

cdecker commented Nov 16, 2020

@niftynei I think this may need consideration for rc2?

👍 I'll try to fix up the failing mpp_adaptive test asap

@cdecker cdecker force-pushed the pay-getroute-retry branch 4 times, most recently from 37be428 to 2646e1c Compare November 16, 2020 12:46
@cdecker
Copy link
Member Author

cdecker commented Nov 16, 2020

Ok, now ready to review. Added the following changes:

  • Fixed the route applicability check (I had a parenthesis in the wrong place)
  • Split the applicability condition into multiple, documented, parts in order to avoid the above.

See end-to-end diff between the last version and current version.

@cdecker cdecker marked this pull request as ready for review November 16, 2020 13:45
Copy link
Collaborator

@m-schmoock m-schmoock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today I retried this both versions (this and also the one from yesterday 23ab1c5 ) on mainnet in order to get a payment from my node to my Phoenix wallet (not connected to my node, phoenix has their own channel management). I tried first 0.01btc then 0.006btc and then 0.003btc (~50$). Unfortunately all of these attempts failed for both Git revisions today.

Last time, when I paid a bitrefill invoice of ~150$, it worked almost instantly (which does not work on master for me). We should test this version and the master version internally between the devs to see if one or the other is really better.

Comment on lines 483 to 484
if (remove)
continue;
Copy link
Collaborator

@m-schmoock m-schmoock Nov 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function argument remove is never written to within this function. Thus, if remove is true it can be expected that thisif (remove) continue; statement leads to a pointless(?) iteration through this route and it would be the same as if (remove) break; just a little slower. Is this intended?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, I just didn't want to indent the whole for-loop into an if :-)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then why not break instead of continue? But not that important ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an excellent usecase for goto

@cdecker
Copy link
Member Author

cdecker commented Nov 16, 2020

Today I retried this both versions (this and also the one from yesterday 23ab1c5 ) on mainnet in order to get a payment from my node to my Phoenix wallet (not connected to my node, phoenix has their own channel management). I tried first 0.01btc then 0.006btc and then 0.003btc (~50$). Unfortunately all of these attempts failed for both Git revisions today.

Last time, when I paid a bitrefill invoice of ~150$, it worked almost instantly (which does not work on master for me). We should test this version and the master version internally between the devs to see if one or the other is really better.

Kind of hard to tell where the issue is. It could be local, but it could also be in the network (min-cut across all paths from you to the destination doesn't have sufficient capacity), or it might be the recipient not having incoming capacity (though Phoenix opens channels on-demand iirc).

I think the PR itself is an improvement, since we reduce the time to completion in all cases (not attempting parts/routes that are doomed to fail). While I was initially thrilled to see that you managed to perform a payment with such a high value, I don't think that or the more recent negative result are representative or indicative of the full effect this PR might have or not.

As such I'd judge this PR more on its own merits to increase correctness and efficiency, and consider improvements to the success rate as anecdotal until we can perform really representative measurements.

Regarding representative measurements, an idea that came to mind was to create a fake-MPP receiver plugin, that'll pretend unknown HTLCs destined for the local node are part of an MPP payment and hold on to the parts for the 60 second timeout just like a real MPP recipient would, and return different errors depending of whether the MPP payment would have been successful. This way we could use random payment hashes, that are not redemable, to test MPP payments. With a number of devs running this plugin, we could perform MPP tests without actually sending any funds back and forth (always a contentious issue among friends) 😉

@m-schmoock
Copy link
Collaborator

I made sure the payment would have fit in the channel I already have on phoenix (you can see your dynamic channel list). In any case your right we can't compare the cases, since bitrefill != phoenix and the situation channel wise might have been different suboptimal this time.

@niftynei niftynei added this to the v0.9.2 milestone Nov 16, 2020
We were delaying the channel_hint update till after the `createonion`
call which gave us the same situation we had with concurrent
`getroute` calls. Now we update the hints as soon as the plugins have
had their say in the route construction. If we still fail, either
because a modifier changed the route causing the failure, or because
we interleaved the route computation for multiple parts, we reset the
attempt and retry inline (i.e., without creating a new sub-payment).

Notice that interleaved route computations now only happen if the
modifier makes an async call to some RPC or similar.
This allows us to atomically update all channel_hints and determine if
we had a collision and therefore should retry.
This adds a new state `PAYMENT_STEP_RETRY_GETROUTE` which is used to
retry just that one step, without spawning a completely new
attempt. It's a new state so that modifiers do not act on it twice.

Changelog-Fixed: pay: Improved the performance of the `pay` command considerably by avoiding conflicting changes to our local network view.
I previously mistyped the rather lengthy conditions for failures, so
let's dissect it into its smaller components and add rationale behind
the individual parts of the decision.
Copy link
Contributor

@niftynei niftynei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 72efbbb

Turned the continue to a goto for maximum readability.

Comment on lines 483 to 484
if (remove)
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an excellent usecase for goto

@rustyrussell rustyrussell merged commit 313976e into ElementsProject:master Nov 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants