Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too many open files under RustBoard (EMFILE) #4955

Open
wchargin opened this issue May 11, 2021 · 6 comments
Open

Too many open files under RustBoard (EMFILE) #4955

wchargin opened this issue May 11, 2021 · 6 comments
Labels
core:rustboard //tensorboard/data/server/...

Comments

@wchargin
Copy link
Contributor

I am getting a lot of warnings about too many open files -- is there a way to reduce or cap the number of open file descriptors?

2021-05-11T14:31:46Z WARN rustboard_core::run] Failed to open event file EventFileBuf("[RUN NAME]"): Os { code: 24, kind: Other, message: "Too many open files" }

I don't have that many runs (~2000), so it shouldn't really be an issue. Using lsof to count the number of open FDs shows over 12k being used...

>> lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head
   6210 tokio-run
   6210 Reloader-
   1035 StdinWatc
   1035 server
   1035 Reloader
    184 gmain
    168 gdbus
    134 grpc_glob
     85 bash
     80 snapd

Compared to <500 in "slow" mode.

>> lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head
    427 tensorboa
    184 gmain
    168 gdbus
     85 bash
     80 snapd
     72 systemd
     71 screen
     52 dconf\x20
     51 dbus-daem
     48 llvmpipe-

In my case, the "slow" mode actually loads files faster since it doesn't run into this issue.

Originally posted by @Raphtor in #4784 (comment)

@wchargin
Copy link
Contributor Author

Hi @Raphtor! Thanks for the report and the helpful info. Some questions:

  • Could you run diagnose_tensorboard.py in the same environment
    from which you usually run TensorBoard and post the full output in a
    comment (sanitizing as desired if you want to redact anything)?

  • Are you able to share the log directory with us? If not, could you
    describe the structure of the event files? You say that you only
    have ~2000 runs, but I wonder if each run tends to have many event
    files (can happen if your training workers restart a lot). If so,
    it’s possible that that explains the difference, since the
    particulars around how we handle multiple event files in the same
    directory differ somewhat.

    Broadly, there are three potential behaviors. In all cases, we read
    all event files in lexicographical order. When we hit EOF on an
    event file, we keep polling it iff…

    • in all-files mode: always keep polling it
    • in last-file mode: keep polling iff it’s the last file
    • in multifile mode: keep polling iff its last event was not
      too long ago (defaults to 86400 seconds = 1 day)

    TensorBoard with --load_fast=false uses last-file mode by default
    (and can also be told to use multifile mode), but with
    --load_fast=true uses all-files mode.

  • Can you also reproduce the issue when running TensorBoard with

    --load_fast=false --reload_multifile=true --reload_multifile_inactive_secs=-1
    

    ? Same train of thought as above; this enables multifile mode with
    an unbounded age threshold, making it equivalent to all-files mode.
    If this reproduces the issue, we can probably fix this by making
    --load_fast=true also implement last-file and/or multifile modes,
    which would be nice, anyway.

  • What lsof do you have? My lsof (4.93.2, Linux) uses the first
    column for the command name, but (e.g.) tensorboard and bash are
    process names whereas Reloader and StdinWatcher are thread
    names. So my lsof output has lines like:

    COMMAND    PID     USER   FD      TYPE  DEVICE SIZE/OFF     NODE NAME
    server  692802 wchargin   11r      REG   254,1 11096888 15361542 /HOMEDIR/tensorboard_data/mnist/lr_1E-03,conv=2,fc=2/events.out.tfevents.1563406405.HOSTNAME
    

    …and I don’t see how your lsof | awk '{ print $1 }' is giving the
    output that you’re seeing. Probably just a reporting thing, but I’d
    like to be able to reproduce your interaction if possible.

@Raphtor
Copy link

Raphtor commented May 11, 2021

Thanks for your quick response! Here are the things you requested.

diagnose_tensorboard.py output

Diagnostics

Diagnostics output
--- check: autoidentify
INFO: diagnose_tensorboard.py version e43767ef2b648d0d5d57c00f38ccbd38390e38da

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=6, micro=9, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='nai-testing-2', release='4.15.0-143-generic', version='#147-Ubuntu SMP Wed Apr 14 16:10:11 UTC 2021', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: '[VIRTUAL ENV]'

--- check: installed_packages
INFO: installed: tb-nightly==2.6.0a20210510
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
WARNING: no installation among: ['tensorflow-estimator', 'tensorflow-estimator-2.0-preview', 'tf-estimator-nightly']
INFO: installed: tensorboard-data-server==0.6.1

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.6.0a20210510'

--- check: tensorflow_python_version
Traceback (most recent call last):
  File "diagnose_tensorboard.py", line 522, in main
    suggestions.extend(check())
  File "diagnose_tensorboard.py", line 75, in wrapper
    result = fn()
  File "diagnose_tensorboard.py", line 278, in tensorflow_python_version
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'

--- check: tensorboard_data_server_version
INFO: data server binary: '[VIRTUAL ENV]/lib/python3.6/site-packages/tensorboard_data_server/bin/server'
Traceback (most recent call last):
  File "diagnose_tensorboard.py", line 522, in main
    suggestions.extend(check())
  File "diagnose_tensorboard.py", line 75, in wrapper
    result = fn()
  File "diagnose_tensorboard.py", line 301, in tensorboard_data_server_version
    check=True,
  File "/usr/lib/python3.6/subprocess.py", line 423, in run
    with Popen(*popenargs, **kwargs) as process:
TypeError: __init__() got an unexpected keyword argument 'capture_output'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'[VIRTUAL ENV]/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::1', 0, 0, 0)), (<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): '[HOST]'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=6431253, st_dev=64768, st_nlink=2, st_uid=1003, st_gid=1003, st_size=4096, st_atime=1620705993, st_mtime=1620752382, st_ctime=1620752382)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['[VIRTUAL ENV]/lib/python3.6/site-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==0.12.0
aiohttp==3.7.4.post0
aiohttp-cors==0.7.0
aioredis==1.3.1
async-timeout==3.0.1
attrs==21.2.0
blessings==1.7
cachetools==4.2.2
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
colorama==0.4.4
contextvars==2.4
cycler==0.10.0
dataclasses==0.8
filelock==3.0.12
google-api-core==1.26.3
google-auth==1.30.0
google-auth-oauthlib==0.4.4
googleapis-common-protos==1.53.0
gpustat==0.6.0
grpcio==1.37.1
hiredis==2.0.0
idna==2.10
idna-ssl==1.1.0
immutables==0.15
importlib-metadata==4.0.1
joblib==1.0.1
jsonschema==3.2.0
kiwisolver==1.3.1
Markdown==3.3.4
matplotlib==3.1.0
msgpack==1.0.2
multidict==5.1.0
nengo==3.1.0
numpy==1.19.5
nvidia-ml-py3==7.352.0
oauthlib==3.1.0
opencensus==0.7.12
opencensus-context==0.1.2
packaging==20.9
pandas==1.1.5
pip==21.1.1
prometheus-client==0.10.1
protobuf==3.16.0
psutil==5.8.0
py-spy==0.3.6
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==2.4.7
pyrsistent==0.17.3
python-dateutil==2.8.1
pytz==2021.1
PyYAML==5.4.1
ray==1.3.0
redis==3.5.3
requests==2.25.1
requests-oauthlib==1.3.0
rsa==4.7.2
scikit-learn==0.23.2
scipy==1.5.2
setuptools==56.0.0
six==1.16.0
tabulate==0.8.9
tb-nightly==2.6.0a20210510
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorboardX==2.2
threadpoolctl==2.1.0
torch==1.7.1
tqdm==4.60.0
typing-extensions==3.10.0.0
urllib3==1.26.4
Werkzeug==1.0.1
wheel==0.36.2
yarl==1.6.3
zipp==3.4.1

Next steps

No action items identified. Please copy ALL of the above output,
including the lines containing only backticks, into your GitHub issue
or comment. Be sure to redact any sensitive information.

logdir structure

I am using ray[tune] to tune hyperparameters for the training of a RNN in pytorch. It produces a structure sort of like:

- experiment/
     - basic-variant-state-[datetime].json        
     -  experiment_state-[datetime].json                                                       
     -  '[function name]_[hyperparameters set 1]_[datetime]'/
          - events.out.tfevents.1620750386.nai-testing-2
          - params.json
          - params.pkl
          - progress.csv
          - result.json
      ....
      -  '[function name]_[hyperparameters set 2000]_[datetime]'/
               

The files in the runs are each <50KB

Using multifile mode

Seems not to reproduce the issue.

lsof version

I am using lsof 4.89 -- the whole command lists all FDs, sorts and counts the unique FDs for each process name. However, I am no longer sure that its accurately counting what we want, since running it just now with Ray running shows that over 20K open FDs but no error...

@wchargin
Copy link
Contributor Author

An easy workaround here would be to just increase the fd cap to the hard
limit. On my system, ulimit -n -S (the soft limit) is 1024, but
ulimit -n -H (the hard limit) is 1048576. I don’t think that raising
this should have any adverse effects: since we never touch fds directly,
any such bugs would have to be in our dependencies, which are all fairly
widely used.

At a glance, pulling in the rlimit crate and adding something like

fn increase_fd_limit() -> std::io::Result<()> {
    #[cfg(unix)]
    {
        use rlimit::Resource;
        let (old_soft_limit, hard_limit) = Resource::NOFILE.get()?;
        Resource::NOFILE.set(hard_limit, hard_limit)?;
        debug!("Changed file descriptor limit from {} to {}", old_soft_limit, hard_limit);
    }
    #[cfg(not(unix))]
    {
        debug!("Non-Unix; leaving file descriptor limit alone");
    }
    Ok(())
}

fn try_increase_fd_limit() {
    if let Err(e) = increase_fd_limit() {
        warn!("Failed to increase file descriptor limit: {}", e);
    }
}

to cli.rs and calling it from main should do the trick.

@stephanwlee stephanwlee added the core:rustboard //tensorboard/data/server/... label Oct 5, 2021
@Tilps
Copy link

Tilps commented Nov 8, 2021

I've also run into this problem after an upgrade. I have 10s of thousands of runs however...
Attempting to increase the file limit seemed to get some weird behaviour where it stopped being able to be served remotely (but maybe I screwed something else up...) I've gone back to slow loading for now...

@janacht
Copy link

janacht commented Apr 21, 2022

I've also encountered the problem and found that raising the "open files" limit by executing e.g.

ulimit -n 50000

solves the problem for me (without requiring superuser permissions).

@AlexanderYaroshevichIAC

fyi
ulimit only works for the current shell session

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core:rustboard //tensorboard/data/server/...
Projects
None yet
Development

No branches or pull requests

6 participants