Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disconnect sockets when they're removed #260

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

TylerLubeck
Copy link

Hiya! I've been plagued by docker/for-mac/issues/1374 for a while now, so I wanted to dig in and see if I could do anything about it. I think I might have found a potential fix for it, but here's the kicker: I can not for the life of me figure out how to compile VPNKit, with or without this patch.

So setting aside the "this hasn't been tested because it hasn't been compiled" thing for a minute, would you be up for checking my thought process?

  • It was mentioned here that the problem might be coming from VPNKit rather than docker for mac
  • It was mentioned in a few places (including here) that people have noticed a buildup of sockets in a CLOSE_WAIT state
  • It's mentioned towards the end of this section in the documentation that the Mirage stack is shut down for a src/dest pair if that pair hasn't seen traffic within a certain time frame

So I'm wondering if shutting down that Mirage stack is leaving sockets around that need to be formally closed.

I'm in a bit (a lot) over my head jumping in here so I'd love to treat this PR as a discussion point to see if I can't help track down the cause of docker/for-mac/issues/1374, if that's something that would work for y'all?

Back to the compiling thing:
The instructions in the README for compiling don't seem to quite cover it. The guide here gets me closer, but I still run in to issues. If anybody has steps I can follow to configure ocaml on my Mac and get vpnkit to compile, I'd love to see them! If not, I'll keep chipping away at it and try to submit a different PR with instructions

@samoht
Copy link
Member

samoht commented Jul 26, 2017

Thanks for your contribution!

I am working on refreshing the dependencies of vpnkit to make it easier to compile, but this needs quite few changes to be approved by @djs55 first (who is coming back from PTO next week and who will be able to comment more on your issue).

@TylerLubeck
Copy link
Author

Brilliant, thank you!

@djs55
Copy link
Collaborator

djs55 commented Jul 31, 2017

@TylerLubeck thanks for taking a look at this! I hadn't considered a possible interaction between the silent filter rule dropping and the connection leak.

If the remote machine closes the connection then the socket on the Mac host should enter the CLOSE_WAIT state (as observed) and the call to Mirage_flow.proxy will call close on the half of the connection going to the VM which should transmit a segment with the FIN flag set. If the filter rule was dropped then replies would never be heard (I believe new outgoing packets would be sent to a fresh TCP/IP stack and ignored). The original connection in the original stack would remain in the state Send_fin for a long time (I suspect).

There's also a (slightly tacky) background keep-alive thread whose job is supposed to keep prodding the host to make sure the connection still exists but this may not be working. I notice the Port.write function does not update the last_active_time but the packet receive callback does. So if the keep-alive were not provoking the VM to send a packet, the last_active_time would not be updated making the timeout more likely to hit.

I'll see if I can add some more debugging information, so we can see more easily what's going on. Thanks for your help so far!

@TylerLubeck
Copy link
Author

So, to be honest with you, a lot of what you said is going over my head. I have no particular reason to suspect that it's a bug in the silent filter rule dropping other than that that's the first area I traced it to that looks kinda funny.

I'd love to get to a point that I can compile things and try them, but until then I would be more than happy to take any binaries you can produce and test them either myself or with a handful of people I work with that are experiencing the same issue if that's something you'd be up for.

@samoht
Copy link
Member

samoht commented Aug 2, 2017

@TylerLubeck if you do:

$ opam remote add vpnkit ./repo/darwin
$ opam update
$ opam install --deps-only vpnkit
$ make

that should now work. If not, please report any issue and I will fix it :-)

We are still trying to upgrade all the dependencies to the latest version (so the opam remote add line would not be needed anymore) but it a bit more painful than expected, see the discussions on #262

EDIT: replaced slirp by vpnkit

@TylerLubeck
Copy link
Author

Hey @samoht, two things came up:

opam install --deps-only slirp fails with [ERROR] No package named slirp found.

I tried running opam install --deps-only vpnkit like the README says now, and it fails with [ERROR] The compilation of mirage-conduit failed at "ocamlfind query conduit.mirage". The log file associated with the install contains this:

# opam-version 1.2.2
# os           darwin
# command      ocamlfind query conduit.mirage
# path         /Users/tylerl/.opam/4.04.1/build/mirage-conduit.2.2.0
# compiler     4.04.1
# env-file     /Users/tylerl/.opam/4.04.1/build/mirage-conduit.2.2.0/mirage-conduit-10012-dc3e96.env
# stdout-file  /Users/tylerl/.opam/4.04.1/build/mirage-conduit.2.2.0/mirage-conduit-10012-dc3e96.out
# stderr-file  /Users/tylerl/.opam/4.04.1/build/mirage-conduit.2.2.0/mirage-conduit-10012-dc3e96.err

I'm happy to get you more logs, move this to a separate issue, or just sit tight until your work over in #262 is done. Just let me know, and thank you for your help so far!

@samoht
Copy link
Member

samoht commented Aug 3, 2017

@TylerLubeck thanks for your patience. It should work better with #266 (and indeed you should be using vpnkit instead of slirp).

@MagnusS
Copy link
Collaborator

MagnusS commented Aug 3, 2017

@samoht I still see the ocamlfind query conduit.mirage error after #266. It looks like the mirage-conduit.2.2.0 package in repo/darwin is missing an upper bound on conduit compared to https://github.com/ocaml/opam-repository/blob/master/packages/mirage-conduit/mirage-conduit.2.2.0/opam -- my installed version is conduit 1.0.0 and ocamlfind query conduit.mirage fails (it succeeds with 0.15.0). I tried to update the dependencies in repo, but the update script fails with:

The following dependencies couldn't be met:
  - vpnkit -> mirage-conduit = 2.2.0 -> mirage-dns < 3.0.0 -> dns < 1.0.0
Your request can't be satisfied:
  - Conflicting version constraints for dns
  - dns.0.19.0 is in conflict with mirage-types-lwt.2.8.999
  - dns.0.19.1 is in conflict with mirage-types-lwt.2.8.999
  - dns.0.20.0 is in conflict with mirage-types-lwt.2.8.999
  - dns.0.20.1 is in conflict with mirage-types-lwt.2.8.999

@samoht
Copy link
Member

samoht commented Aug 3, 2017

Ha I think I found the issue: ocaml/opam-repository@dba6c74 added an upper bound on dns. Fixing that now.

@samoht
Copy link
Member

samoht commented Aug 3, 2017

See #267 for a fix. Don't forget to add a local repo as master is still using a bunch of unreleased packages (hopefully not for long).

@TylerLubeck
Copy link
Author

Success! I can now build everything. I'll keep poking around to see if I can track this thing down. Thanks for your help!

@TylerLubeck TylerLubeck force-pushed the close_sockets_when_removed branch 2 times, most recently from 7ca240f to a0aa83f Compare August 3, 2017 22:08
attempts to fix docker/for-mac/issues/#1374
@TylerLubeck TylerLubeck force-pushed the close_sockets_when_removed branch from a0aa83f to a39ea09 Compare August 3, 2017 22:10
@djs55
Copy link
Collaborator

djs55 commented Aug 7, 2017

I've started having more of a look too. I added this debug patch: djs55@41cdadc which

  • logs the connection table every 10s
  • times out switch ports after 5s

I've let it time out an ssh connection to the host and the socket is still open after 1.5hrs so far (which seems bad). Out of curiosity I'm going to leave it overnight to see if the connection is still open in the morning.

@TylerLubeck
Copy link
Author

That's great, thank you!

Another question for y'all - how can I get these custom versions running with Docker for Mac so I can reproduce my test cases? I've been digging through documentation and have tried to replace the binaries with my new compiled ones, but no luck with anything so far. Any suggestions?

djs55 added a commit to djs55/vpnkit that referenced this pull request Aug 8, 2017
Previously there was no way to locate the connections associated with
an endpoint to shut them down. This patch adds a map of TCP `id` to
`unit Lwt.u` and a function `Endpoint.destroy` which triggers the
disconnection of all the active connections.

Related to moby#260

Signed-off-by: David Scott <dave.scott@docker.com>
@djs55
Copy link
Collaborator

djs55 commented Aug 8, 2017

@TylerLubeck could you try a binary for me? The CI just built this from #269. If you download and untar the archive, there should be a vpnkit binary somewhere inside. If you replace the version in /Applications/Docker.app/Contents/Resources/bin/vpnkit and restart the Docker app, it should (hopefully) close more connections when timing out switch ports (as you suggested!)

@TylerLubeck
Copy link
Author

Awesome, thank you. I'll get this distributed out to a few folks I'm working with that are able to reproduce this on a much more regular basis, and we'll monitor for a bit and get you logs as they come up.

Thank you so much for digging in to this!!!

@djs55
Copy link
Collaborator

djs55 commented Aug 17, 2017

@TylerLubeck have you heard any feedback? I think I'll probably merge the PR in #269 since I think it's good to have anyway, but I'd love to know if it makes a difference to the people suffering from the original problem.

djs55 added a commit to djs55/vpnkit that referenced this pull request Aug 17, 2017
Previously there was no way to locate the connections associated with
an endpoint to shut them down. This patch adds a map of TCP `id` to
`unit Lwt.u` and a function `Endpoint.destroy` which triggers the
disconnection of all the active connections.

Related to moby#260

Signed-off-by: David Scott <dave.scott@docker.com>
djs55 added a commit to djs55/vpnkit that referenced this pull request Aug 17, 2017
Previously there was no way to locate the connections associated with
an endpoint to shut them down. This patch adds a map of TCP `id` to
`unit Lwt.u` and a function `Endpoint.destroy` which triggers the
disconnection of all the active connections.

Related to moby#260

Signed-off-by: David Scott <dave.scott@docker.com>
@TylerLubeck
Copy link
Author

Unfortunately I haven't heard much back yet, but I'll bug some folks to get some information

@sulphur
Copy link

sulphur commented Aug 22, 2017

i'm having this issue as well on rc3 i juste tried the binary but it didn't helped. I have no problem on rc1. In my case i have haproxy and each haproxy check leave connection on CLOSE_WAIT

@TylerLubeck
Copy link
Author

Hey @djs55, just heard back from some folks and unfortunately this didn't seem to help the problem. That said, we're also not seeing CLOSE_WAIT anymore so I'm not sure where to go. This also doesn't line up with what @sulphur says above, which is odd. I'm gonna set us up to run this overnight and see if we can get some CLOSE_WAITs back.

@sulphur
Copy link

sulphur commented Nov 1, 2017

any news on that ? I'm still hagin the CLOSE_WAIT and FIN_WAIT_2. I've just tested on 17.10.0-ce-mac36. I've moved back to 17.06 for now where lsof -p $(pgrep vpnkit) | wc -l is around 100 and can stay like that for days on my machine. With the latset version my dev environement (which is quite heavy) jumps to 2000 in like 2 minutes after starting up.

@toblerpwn
Copy link

toblerpwn commented Nov 6, 2017

I'm having the same case as @sulphur fwiw. lsof -p $(pgrep vpnkit) | wc -l simply climbs to whatever limit (quickly).

Some additional info that may or may not help:

  1. In my environment, this only seems to happen when Docker containers are running socket.io or an HTTP server on Node.js - and only when they run along side a separate HAProxy container.

  2. The amount of open connections does not reduce even after running docker stop and docker rm on all of my Docker for Mac containers. Quitting the Docker for Mac daemon is the only way to reduce the number of open connections / finally do away with the CLOSE_WAIT/FIN_WAIT_2 connections.

I've got some more detail here: docker/for-mac#1374 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants