Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(swaps): recover crashed swap deals #1081

Merged
merged 2 commits into from
Sep 12, 2019
Merged

Conversation

sangaman
Copy link
Collaborator

@sangaman sangaman commented Jul 5, 2019

Closes #1079 & #1183. This builds off of PR #1080.

This commit attempts to recover swap deals that were interrupted due to a system or xud crash. In the case where we are the maker and have attempted to send payment for the second leg of the swap, we attempt to query the swap client for the preimage of that payment in case it went through. We can then use that preimage to try to claim the payment from the first leg of the swap. In all other cases, we simply attempt to close any open invoices and mark the swap deal as having errored.

Raiden currently does not expose an API call to query for the preimage of a completed payment. Any pending swaps are listed in the GetInfo response.

The recovery attempts happen once on xud startup by looking for any swap deals in the database that have an Active state.

I haven't tested this code yet but it is ready to review, if I have time I will also start adding some test cases. This may be a good candidate for a simulation test, since recovering funds in the case of xud crashing while waiting for the sendPayment response depends substantially on lnd.

This commit includes a suite of test cases for the newly added functionality. It also adds a new suite of simulation tests to test how xud responds to instability. It simulates crashes immediately after the maker sends payment as part of a swap. When the maker comes back online, it will fetch the preimage for the successful payment and use it to settle the incoming payment from the interrupted swap.

This tests two key cases, one where the maker's payment goes through while the maker is offline, and another where the payment goes through only after the maker has come back online. In the latter case, the maker must detect that the payment is in a pending state and monitor it in case it goes through.

@sangaman sangaman added the swaps label Jul 5, 2019
@sangaman sangaman requested review from a user and kilrau July 5, 2019 20:29
@sangaman sangaman self-assigned this Jul 5, 2019
@sangaman sangaman force-pushed the swaps/recover-swap-deals branch 2 times, most recently from f50795b to 16216a5 Compare July 7, 2019 15:15
@kilrau
Copy link
Contributor

kilrau commented Jul 23, 2019

Questions:

  • how about timeouts? If xud goes down and forgets timeout how can we be sure the payment failed?
  • this issue covers the scenario: we are sending a payment and xud crashes before we get the success/fail answer. Which others are covered?
  • Problem: listpayments doesn't return pending payments: check lnd & raiden apis!? Answer: listchannels shows pending htlcs (problem: when sendpayment is used we don't know which route the payment was on. EDIT: there is a payment hash that enables us to identify a payment in the list of pendingn HTLCs).

TODO:

  • list all scenarios to be covered: e.g. what if lnd and raiden crashes and comes back (without xud crashing restarting). Revamp Breaking swaps tests #942 with @kilrau
  • all this for raiden

Misc:

This may be a good candidate for a simulation test, since recovering funds in the case of xud crashing while waiting for the sendPayment response depends substantially on lnd.

True, but isn't that exactly what Moshe's breaking swap tests are supposed to do? In any way, it's close. So I think we should unify these.

@kilrau kilrau requested a review from offerm July 23, 2019 15:04
@kilrau kilrau mentioned this pull request Jul 23, 2019
@kilrau
Copy link
Contributor

kilrau commented Jul 25, 2019

Also: in the recovery cases which are not handled here, e.g. lnd crashing this oldie comes in relevant I think: #684 @sangaman

@sangaman
Copy link
Collaborator Author

sangaman commented Aug 1, 2019

Here's the summary of what I'm trying to cover in this PR, I'm going to push out updated code later today that covers the pending payment cases more thoroughly. The notable gap that I have is how we can claim or settle a raiden payment upon resuming xud. I think in terms of how to handle edge cases where lnd or raiden crash while xud is running (if a payment goes through without us getting a response, for instance), that should be covered in a separate PR since the scope of this PR is already quite large.

Phases

SwapCreated, SwapRequested, & SwapAgreed

No payments have begun and the swap cannot currently be recovered or resumed, so we mark the swap deal as errored in the database.

SendingPayment

Taker

The taker is not at risk of losing funds, as only they know the preimage for any pending payments. In this case, the appropriate action is to cancel any invoices (lnd only) for the swap and mark it as errored in the database.

Canceling the invoice for lnd should cancel any pending HTLCs that the maker would have made to us.

Maker

The maker is at risk of losing funds. If there is a crash on either xud or lnd while we are attempting to send payment, we need to determine whether our payment went through or not. If it goes through, we need to claim our payment on the other chain before it expires. Once we're certain it has not and will not go through, we can cancel our invoice on the receiving network (lnd only).

To determine the status of our payment attempt we call ListPayments in lnd and /api/(version)/payments/(token_address)/(target_address) as well as as /api/(version)/pending_transfersin raiden.

  • For failed payments or payments that are not found (indicating that they were never actually attempted), we cancel any invoices (lnd only) and mark the swap as errored in the database.
  • For successful payments, we retrieve the preimage from the Payment message (for lnd) or EventPaymentReceivedSuccess event (for raiden).
    • For lnd, we can claim the payment on the other network immediately using SettleInvoice.
    • For raiden, there is not currently a way to "push" a preimage to raiden so that it can settle a payment - more research/development is needed on this issue and for now we will log a warning.
  • For pending payments, we will schedule a timer to continue querying for the status of pending payments until they have all resolved.

PaymentReceived

We have received payment for our side of the swap, and the only further action would be to mark the swap as completed in the database.

@sangaman
Copy link
Collaborator Author

sangaman commented Aug 5, 2019

The updated approach tracks any outgoing payments that are still pending upon xud restarting. If an outgoing payment is still in flight and we do not have the preimage for it, we add it to a set of "pending" swaps and check on it on a scheduled interval until we can determine whether it has failed or succeeded. I set the timer to trigger every 5 minutes to check on any swaps that are still pending.

A new SwapRecovery class is introduced to contain the logic for recovering interrupted swap deals and for tracking swaps that are still pending.

Swaps that time out during the SendingPayment phase also get handled by by the swap recovery logic, since we will need to continue checking on them to determine whether our payment ever goes through.

The key missing piece is that Raiden currently does not expose an API call to push a preimage to claim an incoming payment, instead we print a warning to the log for now. One solution to this shortcoming would be to have Raiden continuously query the resolver endpoint if it does not receive a response, in which case once xud learns of the preimage it can pass it to raiden the next time there is a resolve request for that hash.

Another consideration would be to list any pending swaps via the rpc layer as a way to inform users that xud is monitoring pending payments. This could be added to the GetInfo call and/or a separate call to list pending swaps (this could also return swaps that are ongoing in xud during normal operation).

The next step is to add simulation tests to ensure that this functionality is working as expected, since this is hard to test manually. I'll add that to this PR in a separate commit.


public beginTimer = () => {
if (!this.pendingSwapsTimer) {
this.pendingSwapsTimer = setInterval(this.checkPendingSwaps, 300000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment that this is in ms

@kilrau
Copy link
Contributor

kilrau commented Aug 6, 2019

Moshe's test cases (#1048, #1049, #1050) should pass after this is merged @erkarl

@kilrau kilrau requested a review from moshababo August 22, 2019 18:55
@sangaman sangaman force-pushed the swaps/recover-swap-deals branch from e464677 to 2656d12 Compare August 26, 2019 12:50
@sangaman sangaman added the security securing xud from unauthorized actions label Aug 26, 2019
@sangaman sangaman force-pushed the swaps/recover-swap-deals branch from 2656d12 to 7e5a8f6 Compare August 26, 2019 13:00
@sangaman sangaman force-pushed the swaps/recover-swap-deals branch from 7e5a8f6 to 8d4b763 Compare August 26, 2019 13:39
@sangaman sangaman requested a review from a user August 26, 2019 13:47
Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks that the logic that detects pending HTLCs for lnd is working. 👍

I simulated a xud crash immediately after sending a payment to the taker:

deal.rPreimage = await swapClient.sendPayment(deal);
console.log('maker got preimage', deal.rPreimage);
process.exit(1); // xud "crashes" after sending payment
// return deal.rPreimage;

The taker claimed the payment successfully. When I restarted maker's xud node I got the following output:

[1] 30/08/2019 14:37:25.460 [SWAPS] info: recovering swap deal 5df992aee84c500633500827bcd5e98a8c84e339996d50acddd4fb7d59cbeace
[1] 30/08/2019 14:37:25.462 [SWAPS] error: TypeError: this.invoices[methodName] is not a function
[1]     at Promise (/home/ar/xud/dist/lndclient/LndClient.js:113:42)
[1]     at new Promise (<anonymous>)
[1]     at LndClient.unaryInvoiceCall (/home/ar/xud/dist/lndclient/LndClient.js:108:20)
[1]     at LndClient.listPayments (/home/ar/xud/dist/lndclient/LndClient.js:587:25)
[1]     at LndClient.<anonymous> (/home/ar/xud/dist/lndclient/LndClient.js:562:41)
[1]     at Generator.next (<anonymous>)
[1]     at /home/ar/xud/dist/lndclient/LndClient.js:7:71
[1]     at new Promise (<anonymous>)
[1]     at __awaiter (/home/ar/xud/dist/lndclient/LndClient.js:3:12)
[1]     at LndClient.lookupPayment (/home/ar/xud/dist/lndclient/LndClient.js:561:41)
[1]     at SwapRecovery.<anonymous> (/home/ar/xud/dist/swaps/SwapRecovery.js:76:69)
[1]     at Generator.next (<anonymous>)
[1]     at /home/ar/xud/dist/swaps/SwapRecovery.js:7:71
[1]     at new Promise (<anonymous>)
[1]     at __awaiter (/home/ar/xud/dist/swaps/SwapRecovery.js:3:12)
[1]     at SwapRecovery.recoverDeal (/home/ar/xud/dist/swaps/SwapRecovery.js:51:38)
[1]     at swapDealInstances.forEach (/home/ar/xud/dist/swaps/Swaps.js:64:39)
[1]     at Array.forEach (<anonymous>)
[1]     at Swaps.<anonymous> (/home/ar/xud/dist/swaps/Swaps.js:61:31)
[1]     at Generator.next (<anonymous>)
[1]     at fulfilled (/home/ar/xud/dist/swaps/Swaps.js:4:58)

@sangaman
Copy link
Collaborator Author

sangaman commented Sep 7, 2019

@erkarl I moved the type and fixed the issue you ran into, thanks for catching that.

@sangaman sangaman requested a review from a user September 7, 2019 12:29
@kilrau kilrau added the P1 top priority label Sep 9, 2019
@ghost
Copy link

ghost commented Sep 9, 2019

@sangaman after testing this with your updated code I'm getting the following output for maker's xud:

[1] 09/09/2019 12:11:20.721 [SWAPS] info: recovered LTC swap payment of 1000 using preimage f8a7cd8560654c7a64855ff5a769ae309505d347431cce07c446873a8d524a14

However, when I check the channel balances it looks like the maker did not receive any LTC. I also checked the balance on lnd level.

Looking at the listchannels for LTC the HTLC is still showing up as pending.

@sangaman
Copy link
Collaborator Author

sangaman commented Sep 9, 2019

@sangaman after testing this with your updated code I'm getting the following output for maker's xud:

[1] 09/09/2019 12:11:20.721 [SWAPS] info: recovered LTC swap payment of 1000 using preimage f8a7cd8560654c7a64855ff5a769ae309505d347431cce07c446873a8d524a14

However, when I check the channel balances it looks like the maker did not receive any LTC. I also checked the balance on lnd level.

Looking at the listchannels for LTC the HTLC is still showing up as pending.

That's strange, as this means setle invoice should have been called. I'll see if I can reproduce.

@sangaman sangaman force-pushed the swaps/recover-swap-deals branch from 724679c to e06e29e Compare September 11, 2019 06:05
@sangaman sangaman requested a review from kilrau September 11, 2019 06:11
@sangaman
Copy link
Collaborator Author

@erkarl @kilrau I fixed the issue with the recovered payment not settling correctly, and added a new suite of "instability" simulation tests that use a new instability branch. It wasn't easy but it simulates exactly the cases we are trying to address with this PR and the tests pass consistently. See the second commit message and the updated top post of this PR for more details.

@ghost
Copy link

ghost commented Sep 11, 2019

@sangaman looks like I'm now able to successfully recover and claim the payment when the maker's xud crashes after sending the payment. 💯

It's really nice that we now have simulation tests in place for these scenarios. My only concern is the maintainability of the instability branch. That branch will pretty quickly be behind the master. Would it make sense to add the BREAKSWAP environment variable check to production code so that we're always testing against the latest changes?

Also, should we add the instability tests to our CI pipeline in .travis.yml?

@sangaman
Copy link
Collaborator Author

I think keeping the instability branch separate is fine, the code changes are minimal and it should be simple enough to just merge the latest without conflicts most of the time. One key change is it shortens the delay between checking on pending swaps from 30 seconds to 1 second to make the tests go faster, but I think every second would be too often for actual use.

I'm neutral on whether it should be part of the travis tests, I followed the security tests which aren't currently run in travis.

This commit attempts to recover swap deals that were interrupted due to
a system or `xud` crash. In the case where we are the maker and have
attempted to send payment for the second leg of the swap, we attempt to
query the swap client for the preimage of that payment in case it went
through. We can then use that preimage to try to claim the payment from
the first leg of the swap. In case the payment is known to have failed,
we simply attempt to close any open invoices and mark the swap deal as
having errored.

If an outgoing payment is still in flight and we do not have the
preimage for it, we add it to a set of "pending" swaps and check on it
on a scheduled interval until we can determine whether it has failed
or succeeded.

A new `SwapRecovery` class is introduced to contain the logic for
recovering interrupted swap deals and for tracking swaps that are
still pending. Any pending swaps are listed in the `GetInfo` response.

Raiden currently does not expose an API call to push a preimage to claim
an incoming payment or to reject an incoming payment, instead we print
a warning to the log for now.

The recovery attempts happen on `xud` startup by looking for any swap
deals in the database that have an `Active` state.

This commit includes a suite of test cases for the newly added
functionality.

Closes #1079.
This adds a new suite of simulation tests to test how `xud` responds to
instability. It simulates crashes immediately after the maker sends
payment as part of a swap. When the maker comes back online, it will
fetch the preimage for the successful payment and use it to settle
the incoming payment from the interrupted swap.

This tests two key cases, one where the maker's payment goes through
while the maker is offline, and another where the payment goes through
only after the maker has come back online. In the latter case, the maker
must detect that the payment is in a pending state and monitor it in
case it goes through.

Closes #1183.
@sangaman sangaman force-pushed the swaps/recover-swap-deals branch from e06e29e to d401b0c Compare September 11, 2019 22:50
@sangaman sangaman merged commit 46b7adf into master Sep 12, 2019
@sangaman sangaman deleted the swaps/recover-swap-deals branch September 12, 2019 00:53
@kilrau kilrau mentioned this pull request Sep 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 top priority security securing xud from unauthorized actions swaps
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Persist swap phase in db
2 participants