cmd/dcrdex: Unable to stop app after shutting down dcrd node. #321

Closed
JoeGruffins opened this issue May 1, 2020 · 13 comments · Fixed by #325
@JoeGruffins
Member

If you shut down your dcrd node/asset before shutting down the server, the server hangs on shutdown. The logs stop here:

^C2020-05-01 21:03:08.847 [INF] MAIN: Received signal (interrupt). Shutting down...
2020-05-01 21:03:08.847 [INF] MAIN: Stopping DEX...
2020-05-01 21:03:08.847 [INF] DEX: Stopping subsystems...
2020-05-01 21:03:08.847 [INF] COMM: RPC server shutting down...
2020-05-01 21:03:08.847 [DBG] COMM: RPC listener done for 127.0.0.1:7232
2020-05-01 21:03:08.847 [INF] COMM: RPC server shutdown complete
2020-05-01 21:03:08.847 [INF] DEX: Comms Server shutdown.
2020-05-01 21:03:08.847 [INF] DEX: BookRouter shutdown.
2020-05-01 21:03:08.847 [DBG] MKT: Market "dcr_btc" stopped.
2020-05-01 21:03:08.847 [INF] DEX: Market[dcr_btc] shutdown.
2020-05-01 21:03:08.847 [INF] DEX: Swapper shutdown.
2020-05-01 21:03:08.847 [INF] DEX: Auth manager shutdown.

Starting the node back up does not seem to help.

@buck54321
Member

I've seen this too. Probably start by looking around here:

btc.run(ctx)
err := btc.wallet.LockUnspent(true, nil)

@JoeGruffins you want to investigate?

@JoeGruffins
Member Author

sure!

@chappjc
Member

chappjc commented May 1, 2020

Part of updating our dcrd/rpcclient dependency will be using the new context.Context input arg. That might be related if it's an RPC that hangs. That won't help with btc though, since I don't think its rpcclient is updated.

@JoeGruffins
Member Author

JoeGruffins commented May 2, 2020

I have tracked this down to the behavior of our rpcclient.

The dcrd backend polls for best blocks:

bestHash, err := dcr.node.GetBestBlockHash()

When there is no connection and auto-reconnect is on, the client does not return an error. It saves all requests to fire them all at once when a reconnect happens. So when not connected, the call stops there and never makes it back out to the shutdown path.
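
To make the failure mode concrete, here is a minimal sketch of that polling pattern. This is hypothetical; names like pollBestBlock and blockFetcher are illustrative, not actual dcrdex identifiers.

// A hypothetical sketch of the polling loop described above.
package dcr

import (
	"context"
	"log"
	"time"

	"github.com/decred/dcrd/chaincfg/chainhash"
)

// blockFetcher stands in for the rpcclient methods the backend uses.
type blockFetcher interface {
	GetBestBlockHash() (*chainhash.Hash, error)
}

// pollBestBlock checks the chain tip on a ticker until ctx is canceled.
func pollBestBlock(ctx context.Context, node blockFetcher) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // shutdown path; never reached while a call is queued
		case <-ticker.C:
			// With auto-reconnect on, a disconnected client queues this
			// request instead of failing it, so the call blocks here and
			// ctx.Done() is never selected again.
			bestHash, err := node.GetBestBlockHash()
			if err != nil {
				log.Printf("GetBestBlockHash error: %v", err)
				continue
			}
			log.Printf("tip: %s", bestHash)
		}
	}
}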

This issue can be solved by turning off autoreconnect here:

config := &rpcclient.ConnConfig{
	Host:         host,
	Endpoint:     "ws", // websocket
	User:         user,
	Pass:         pass,
	Certificates: dcrdCerts,
}

by adding the line DisableAutoReconnect: true,
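
With that field added, the same config reads:

config := &rpcclient.ConnConfig{
	Host:                 host,
	Endpoint:             "ws", // websocket
	User:                 user,
	Pass:                 pass,
	Certificates:         dcrdCerts,
	DisableAutoReconnect: true, // fail RPCs while disconnected instead of queueing them
}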

We "solved" a similar issue in dcrstakpool by writing a reconnect function. jrick also has a convenient package https://github.com/jrick/wsrpc that I think simplifies things. However, we would still need to handle reconnects manually.

Also, with auto-reconnect on, the client does reconnect and shutdown continues when using testnet. For some reason it does not behave the same when using the simnet harness.

> Part of updating our dcrd/rpcclient dependency will be using the new context.Context input arg.

Or maybe that's the fix? Not sure yet, but passing a context and having the client stop waiting once ctx.Done() fires would also solve the issue.

@chappjc
Member

chappjc commented May 4, 2020

PR #325 looks good to fix the hang, but let's open an issue or two for: (1) running with auto-reconnect disabled to prevent RPC calls from simply hanging, and (2) using the newer rpcclient API that takes Contexts, which would solve this another way and give us a way to set timeouts on any given RPC call. I'm not certain, but it seems like setting timeouts on RPC calls might allow us to keep auto-reconnect.
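
For instance, a per-call timeout with the Context-enabled API might look like this. This is a sketch assuming v6-style method signatures; whether the websocket transport actually honors the deadline is exactly what would need verifying.

// Bound any single RPC to ten seconds.
callCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
bestHash, err := client.GetBestBlockHash(callCtx)
if err != nil {
	// If the client honors the context, err would reflect a
	// deadline-exceeded condition when the call times out.
	return err
}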

@chappjc
Member

chappjc commented May 4, 2020

@JoeGruffins Assuming we stick with rpcclient, could you investigate the two approaches I named? What are the pros/cons of autoreconnect, and does the Context-enabled rpcclient API effectively fix the issues without requiring manual reconnect code?

@JoeGruffins
Member Author

Is the rpcclient that uses a context dcrd/rpcclient/v6? If so, I don't see where it has changed to use a context throughout.

@chappjc
Member

chappjc commented May 5, 2020

> Is the rpcclient that uses a context dcrd/rpcclient/v6? If so, I don't see where it has changed to use a context throughout.

There's a context.Context input on just about all the Client methods, at least on master. e.g.:

https://github.com/decred/dcrd/blob/ce2195fbc3de0ee60ebaebfe7a967f0d8b041498/rpcclient/chain.go#L56-L60

// GetBestBlockHash returns the hash of the best block in the longest block
// chain.
func (c *Client) GetBestBlockHash(ctx context.Context) (*chainhash.Hash, error) {
	return c.GetBestBlockHashAsync(ctx).Receive()
}

Currently that's rpcclient/v6.

@JoeGruffins
Member Author

It looks like that context doesn't concern websocket connections, only http.

https://github.com/decred/dcrd/blob/ce2195fbc3de0ee60ebaebfe7a967f0d8b041498/rpcclient/infrastructure.go#L878-L910

Will test out v6 anyway to see if canceling that ctx has the desired effect for us.
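
Roughly this kind of harness (hypothetical code) should show whether cancellation unblocks a pending call:

ctx, cancel := context.WithCancel(context.Background())
errC := make(chan error, 1)
go func() {
	_, err := client.GetBestBlockHash(ctx) // blocks while dcrd is down
	errC <- err
}()
time.Sleep(time.Second) // pretend a shutdown arrives while disconnected
cancel()
select {
case err := <-errC:
	log.Printf("call returned after cancel: %v", err) // the behavior we want
case <-time.After(5 * time.Second):
	log.Println("call still blocked after cancel") // the hang persists
}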

@chappjc
Member

chappjc commented May 6, 2020

But https://github.com/decred/dcrd/blob/ce2195fbc3de0ee60ebaebfe7a967f0d8b041498/rpcclient/infrastructure.go#L1403

chappjc reopened this May 6, 2020
@JoeGruffins
Member Author

JoeGruffins commented May 6, 2020

@chappjc
Member

chappjc commented May 6, 2020

Ohhhh, that's unfortunate. Seems like if the context is cancelled then it should cease reconnection attempts.

@JoeGruffins
Member Author

rpcclient/v6 should now be able to handle this better: decred/dcrd#2198. I guess closing this issue will depend on v6 being finalized, and on us using it.
