Networking issues (connection reset) with podman (and discrepancies podman/docker) #9083

Closed
r-cheologist opened this issue Jan 25, 2021 · 22 comments
Labels
kind/bug, locked - please file new issue/PR, rootless, slirp4netns, stale-issue

Comments

r-cheologist commented Jan 25, 2021

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

Networking issues (connection reset) with podman (and discrepancies podman/docker)

Steps to reproduce the issue:

  1. podman pull rocker/tidyverse

  2. podman run -d -p 127.0.0.1:8787:8787 -v /tmp:/tmp -e ROOT=TRUE -e DISABLE_AUTH=TRUE --tz=local rocker/tidyverse

  3. Access RStudio using a browser at localhost:8787 and run touch ~/test.txt in its console;

  4. podman stop -l (note printed hash)

  5. podman commit <HASH> local_test

  6. podman rm <HASH>

  7. podman run -d -p 127.0.0.1:8787:8787 -v /tmp:/tmp -e ROOT=TRUE -e DISABLE_AUTH=TRUE --tz=local local_test

  8. Try accessing localhost:8787 (a curl check is sketched right after these steps)
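
For step 8, a quick non-browser check from the host (a minimal sketch; assumes curl is available):

> curl -v http://127.0.0.1:8787/
# In the failing case this shows the connection being reset
# ("Recv failure: Connection reset by peer" or similar);
# a healthy container returns an HTTP response from RStudio.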

Describe the results you received:

  1. I can't connect to the localhost:8787 port - connection reset.

  2. What is furthermore strange: a) pushing the image to the registry of a private GitLab instance and then b) pulling it using docker and running it with docker run -d -p 127.0.0.1:8787:8787 -v /tmp:/tmp -e ROOT=TRUE -e disable_auth=TRUE local_test works just fine. Deleting the local image in podman and pulling it back from the same registry makes no difference to the connection-reset behavior. A sketch of that push/pull round trip follows.
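
A minimal sketch of that round trip (registry.example.com/group is a hypothetical placeholder for the private GitLab registry path):

> podman login registry.example.com
> podman tag local_test registry.example.com/group/local_test:latest
> podman push registry.example.com/group/local_test:latest
# On the Docker side:
> docker pull registry.example.com/group/local_test:latest
> docker run -d -p 127.0.0.1:8787:8787 -v /tmp:/tmp -e ROOT=TRUE -e disable_auth=TRUE registry.example.com/group/local_test:latest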

Describe the results you expected:
Network access as with the original container, and behavior matching Docker.

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Version:      2.2.1
API Version:  2.1.0
Go Version:   go1.15.6
Git Commit:   a0d478edea7f775b7ce32f8eb1a01e75374486cb
Built:        Tue Dec  8 22:48:23 2020
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.18.0
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: Unknown
    path: /usr/bin/conmon
    version: 'conmon version 2.0.25, commit: 05ce716ac6d1cfeeb27b9280832abd2e9d1a085f'
  cpus: 8
  distribution:
    distribution: arch
    version: unknown
  eventLogger: journald
  hostname: KI-P0695
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1004
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1002
      size: 1
    - container_id: 1
      host_id: 100000
      size: 65536
  kernel: 5.10.10-arch1-1
  linkmode: dynamic
  memFree: 48069582848
  memTotal: 67370360832
  ociRuntime:
    name: runc
    package: Unknown
    path: /usr/bin/runc
    version: |-
      runc version 1.0.0-rc92
      commit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
      spec: 1.0.2-dev
  os: linux
  remoteSocket:
    path: /run/user/1002/podman/podman.sock
  rootless: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: Unknown
    version: |-
      slirp4netns version 1.1.8
      commit: d361001f495417b880f20329121e3aa431a8f90f
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.1
  swapFree: 64135098368
  swapTotal: 64135098368
  uptime: 4h 0m 42.59s (Approximately 0.17 days)
registries:
  search:
  - docker.io
  - registry.fedoraproject.org
  - quay.io
  - registry.access.redhat.com
  - registry.centos.org
store:
  configFile: /home/professional/.config/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mount_program:
      Executable: /usr/bin/fuse-overlayfs
      Package: Unknown
      Version: |-
        fusermount3 version: 3.10.1
        fuse-overlayfs: version 1.4
        FUSE library version 3.10.1
        using FUSE kernel interface version 7.31
  graphRoot: /home/professional/.local/share/containers/storage
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageStore:
    number: 5
  runRoot: /run/user/1002/containers
  volumePath: /home/professional/.local/share/containers/storage/volumes
version:
  APIVersion: 2.1.0
  Built: 1607464103
  BuiltTime: Tue Dec  8 22:48:23 2020
  GitCommit: a0d478edea7f775b7ce32f8eb1a01e75374486cb
  GoVersion: go1.15.6
  OsArch: linux/amd64
  Version: 2.2.1

Package info (e.g. output of rpm -q podman or apt list podman):

> pacman -Q podman
podman 2.2.1-1

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide?

I appear to be running the latest release. 3.0 seems to be bringing in some networking-related fixes; will that take care of my issue?
The issue does not appear to be covered in the troubleshooting guide.

Additional environment details (AWS, VirtualBox, physical, etc.):

NA.

openshift-ci-robot added the kind/bug label Jan 25, 2021
mheon (Member) commented Jan 25, 2021

Root or rootless podman?

r-cheologist (Author):

# podman info | grep rootless
rootless: true
# podman unshare cat /proc/self/uid_map
         0       1002          1
         1     100000      65536

r-cheologist (Author):

Edited the issue to revisit the troubleshooting-guide and latest-version questions.

mheon added the rootless and slirp4netns labels Jan 26, 2021
mheon (Member) commented Jan 26, 2021

Can you, with no other containers running, try the first part of your reproducer (everything up to and including podman stop -l on the first container) and then check the output of mount to see if there are any nsfs mounts present? Also, check if any slirp4netns processes are still running at that point.
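
Condensed, those two checks might look like this (a sketch; run as the rootless user after podman stop -l, and pgrep is just a stand-in for ps | grep):

> mount | grep nsfs
# Any remaining nsfs mount here would be a leftover container network namespace.
> pgrep -a slirp4netns
# Any output here means a slirp4netns process is still running.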

r-cheologist (Author):

Here's what I came up with:

  1. AS ROOT:

     ROOT> lsns | grep podman
    

    --> NO output

  2. AS USER - start the container:

     USER> podman run -d -p 127.0.0.1:8787:8787 -v /tmp:/tmp -e ROOT=TRUE -e disable_auth=TRUE --tz=local rocker/tidyverse
    
  3. AGAIN AS ROOT:

     ROOT> lsns | grep podman
     4026532491 user       10  1377 <MYUSER>    podman
     4026532492 mnt         5  1377 <MYUSER>    podman
     
     ROOT> ps -A | grep slirp4netns
     1393 pts/0    00:00:00 slirp4netns
    
  4. AS USER - stop the container:

     USER> podman stop -l
    
  5. AS ROOT:

     ROOT> lsns | grep podman
     4026532491 user        1  1377 <MYUSER>    podman
     4026532492 mnt         1  1377 <MYUSER>    podman
     ROOT> ps -A | grep slirp4netns
    

    --> NO output of ps -A | grep slirp4netns

Do I interpret this correctly as the nsfs mounts erroneously persisting?

mheon (Member) commented Jan 26, 2021

That's a user and mount namespace - I'm looking for the network namespace. I would expect the user namespace to be persisted by our pause process.

An alternative would be podman unshare mount | grep fuse-overlayfs. This isn't a perfect check (it's verifying whether the container's filesystem is still mounted, not the network - but we unmount the filesystem and clean up the network in the same place, so it's a good indication of whether that is actually firing). If there is any output, we still have a mounted filesystem, and I can assume the cleanup process did not succeed in cleaning up the container's storage and networking. Also, a podman inspect --format '{{ .State.Status }}' on the container after it is stopped would help - I'd expect to see Exited. If the container is still in Stopped then cleanup did not happen.
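
Condensed, the checks above look roughly like this (a sketch; <CONTAINER> is a placeholder for the container name or ID):

> podman unshare mount | grep fuse-overlayfs
# Any output: the container's filesystem is still mounted, so cleanup did not fire.
> podman inspect --format '{{ .State.Status }}' <CONTAINER>
# Expect "exited"; "stopped" would mean cleanup did not happen.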

edsantiago (Member):

@r-cheologist ping, have you had a chance to try @mheon's suggestions?

r-cheologist (Author) commented Feb 4, 2021

Sorry for the delay.

podman unshare mount | grep fuse-overlayfs

(after stopping the container) gives NO output. Does it matter in this context that I'm on btrfs?

podman inspect --format '{{ .State.Status }}' <CONTAINER_HASH>

produces:

exited

There were several updates over the last few days (to my Arch system), but the problem persists as described above.

mheon (Member) commented Feb 5, 2021

Alright. Exited indicates that we successfully tore down the network stack, so that idea's a bust.

Can you try adding the --net=slirp4netns:port_handler=slirp4netns option to your podman run commands and see if that resolves it?
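
Applied to the reproducer, that option would look roughly like this (a sketch; everything except the --net option is unchanged from the steps above):

> podman run -d -p 127.0.0.1:8787:8787 --net=slirp4netns:port_handler=slirp4netns -v /tmp:/tmp -e ROOT=TRUE -e DISABLE_AUTH=TRUE --tz=local local_test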

r-cheologist (Author):

Adding --net=slirp4netns:port_handler=slirp4netns to the podman run commands does not make it work, though the error changes to: This site can’t be reached. The webpage at http://localhost:8787/ might be temporarily down or it may have moved permanently to a new web address. ERR_SOCKET_NOT_CONNECTED.

mheon (Member) commented Feb 8, 2021

Can you confirm that this is only on the second invocation of Podman, as it was before? Or is this for every invocation of Podman now?

@AkihiroSuda PTAL

r-cheologist (Author):

Yes, I followed my recipe above exactly and end up with a running container from the new image that is not accessible at localhost:8787.

r-cheologist (Author):

Is there any further follow-up I could provide?

github-actions (bot):

A friendly reminder that this issue had no activity for 30 days.

rhatdan (Member) commented Mar 22, 2021

@AkihiroSuda @mheon What is the scoop on this one?

mheon (Member) commented Mar 23, 2021

I suppose this could be related to the Conmon issue we've been tracking where ports are held open - testing with newest released Conmon could help. If it's not that, it's more on the slirp side of the fence from what I can see.
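
On an Arch/Manjaro system, the installed Conmon can be checked and updated independently of Podman, roughly like this (a sketch; conmon is the Arch package name):

> conmon --version        # version actually in use
> pacman -Qi conmon       # installed package details
> sudo pacman -S conmon   # pull the newest packaged Conmon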

r-cheologist (Author):

podman version 3.0.1 is what I currently have in Manjaro testing. That's likely not recent enough, right?

mheon (Member) commented Mar 26, 2021

That's the Podman version - Conmon is a separate utility binary we ship. The versions of the two are independent.

github-actions (bot):

A friendly reminder that this issue had no activity for 30 days.

rhatdan (Member) commented Apr 27, 2021

This is not a Podman issue, so I am going to close.

rhatdan closed this as completed Apr 27, 2021
r-cheologist (Author):

As of the following versions I can report the issue as resolved (on Manjaro testing):

> podman -v
podman version 3.1.2
> conmon --version
conmon version 2.0.27
commit: 65fad4bfcb250df0435ea668017e643e7f462155
> slirp4netns -v
slirp4netns version 1.1.9
commit: 4e37ea557562e0d7a64dc636eff156f64927335e
libslirp: 4.4.0
SLIRP_CONFIG_VERSION_MAX: 3
libseccomp: 2.5.1

marcopolo4k commented May 19, 2023

I'm seeing this on AlmaLinux 9.2 and Rocky 8.7. I've tried podman system reset and podman system prune --all --force. I found the rootlessport process was holding the port open, and killed it to solve the issue temporarily, but I bet it's going to come back. Is this issue (#9083) related? A sketch of that kill workaround follows the transcript below.

» podman -v
podman version 4.4.1
» conmon --version
conmon version 2.1.7
commit: fab2fef7227d2dc16478d29f1185953f81451702
» slirp4netns -v
slirp4netns version 1.2.0
commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
libslirp: 4.4.0
SLIRP_CONFIG_VERSION_MAX: 3
libseccomp: 2.5.2
» podman pod create --publish 8080:8080 --publish 8000:8000 "$POD_NAME";
c889f1873504a85c23ec45cc008dd196d05f096c0d21fc7858e3adc8b9f66f41
» podman run --pod="$POD_NAME" --name="${POD_NAME}-db" --detach --volume one-db:/var/lib/postgresql/data -e=POSTGRES_DB=onedb -e=POSTGRES_USER=one -e=POSTGRES_PASSWORD=onepass docker.io/library/postgres:15-alpine;
ERRO[0003] Starting some container dependencies
ERRO[0003] "rootlessport listen tcp 0.0.0.0:8000: bind: address already in use"
Error: starting some containers: internal libpod error
» 126»
» 126» sudo netstat -plan | grep :8000
tcp6       0      0 :::8000                 :::*                    LISTEN      40976/rootlessport
»
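
The temporary workaround described above, sketched (40976 is the rootlessport PID from the netstat output; ss is an alternative to netstat):

» sudo ss -ltnp | grep :8000   # confirm which process is holding the port
» kill 40976                   # kill the stale rootlessport process (run as its owning user)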

Related question.

github-actions bot added the locked - please file new issue/PR label Aug 23, 2023
github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023