[release/air] air_example_gptj_deepspeed_fine_tuning.gce release test is failing #36274

Closed
justinvyu opened this issue Jun 9, 2023 · 0 comments · Fixed by #36276
Labels: P0 (issues that should be fixed in short order), release-test (release test)

justinvyu commented Jun 9, 2023

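This release test fails because it cannot download the model from AWS S3. The `aws s3 sync` call fails on GCE, but its exit status is not checked, so the error only surfaces later as a `FileNotFoundError` when the script tries to read the `hash` file: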

Error log:

    subprocess.run(
        [
            "aws",
            "s3",
            "sync",
            "--quiet",
            "s3://large-dl-models-mirror/models--EleutherAI--gpt-j-6B/main/",
            os.path.join(path, "snapshots", "main"),
        ]
    )  # aws s3 sync fails on GCE
    with open(os.path.join(path, "snapshots", "main", "hash"), "r") as f:
        f_hash = f.read().strip()
Traceback (most recent call last):
  File "/tmp/tmp7_lq_gyw", line 146, in <module>
    _ = run_on_every_node(download_model)
  File "/tmp/tmp7_lq_gyw", line 113, in run_on_every_node
    return ray.get(refs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2540, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(FileNotFoundError): ray::download_model() (pid=6869, ip=10.138.1.165)
  File "/tmp/tmp7_lq_gyw", line 137, in download_model
FileNotFoundError: [Errno 2] No such file or directory: '/home/ray/.cache/huggingface/hub/models--EleutherAI--gpt-j-6B/snapshots/main/hash'
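
For illustration only, here is a minimal sketch of how the download step could fail at the sync call itself instead of later at `open(...)`. It is not the fix that landed in #36276. The helper name `run_sync_and_read_hash` and its parameters are hypothetical. The only library behavior relied on is that `subprocess.run` returns a `CompletedProcess` with a `returncode`, and raises `FileNotFoundError` when the executable is missing.

```python
import os
import subprocess


def run_sync_and_read_hash(sync_cmd, snapshot_dir):
    """Run a sync command, then return the contents of the `hash` file in snapshot_dir.

    Hypothetical helper: raises RuntimeError with the command's stderr if the
    sync fails, rather than letting a later open() raise FileNotFoundError.
    """
    os.makedirs(snapshot_dir, exist_ok=True)
    try:
        result = subprocess.run(sync_cmd, capture_output=True, text=True)
    except FileNotFoundError as exc:
        # The executable itself (e.g. the `aws` CLI) is not installed on this node.
        raise RuntimeError(f"could not launch {sync_cmd[0]!r}: {exc}") from exc

    if result.returncode != 0:
        raise RuntimeError(
            f"sync command exited with code {result.returncode}: "
            f"{result.stderr.strip()}"
        )

    hash_path = os.path.join(snapshot_dir, "hash")
    if not os.path.isfile(hash_path):
        raise RuntimeError(f"sync reported success but {hash_path} is missing")

    with open(hash_path, "r") as f:
        return f.read().strip()


# Usage mirroring the snippet in the error log (assumes `path` is defined
# as in the original script and the `aws` CLI is on PATH):
#
# snapshot_dir = os.path.join(path, "snapshots", "main")
# f_hash = run_sync_and_read_hash(
#     ["aws", "s3", "sync", "--quiet",
#      "s3://large-dl-models-mirror/models--EleutherAI--gpt-j-6B/main/",
#      snapshot_dir],
#     snapshot_dir,
# )
```

With a check like this, the Ray task error would report the sync failure and its stderr directly, which should make the root cause on GCE easier to see than the downstream missing-file error.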