
S3 bucket script #624

Merged
merged 18 commits into from Apr 3, 2025
Conversation

@benjaminking (Collaborator) commented Jan 9, 2025

This PR modifies the previous S3 bucket script to support automatic mounting of the S3 bucket in a dev container for both WSL2-based Ubuntu systems and standalone Ubuntu systems. It splits the logic of the previous script into two portions:

  • the part that does not depend on any SILNLP repo files (this logic has been moved into the Dockerfile) and
  • the part that uses files in the SILNLP repo (this remains in the s3_bucket_setup.sh script, which is run as a post-install command)

The contents of the S3 bucket should now be accessible at /silnlp upon creation of the dev container. The major change required to allow this to run on standalone Ubuntu systems is to run the Docker container with a different AppArmor profile. For simplicity, I configured the container to run in "unconfined" mode, but if more fine-grained security is desired (or if ClearML requires it), we should be able to create a custom AppArmor profile that we distribute in the SILNLP repo.
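Disabling AppArmor confinement for a dev container is done through Docker's `--security-opt` flag, which devcontainer.json can pass via `runArgs`. The snippet below is only an illustrative sketch of that wiring, not this PR's actual configuration; the script path in `postCreateCommand` is an assumption:

```json
{
  "build": { "dockerfile": "Dockerfile" },
  // Run without AppArmor confinement so the container can perform FUSE mounts.
  // "unconfined" could later be swapped for a custom profile name.
  "runArgs": ["--security-opt", "apparmor=unconfined"],
  // Hypothetical location for the repo-dependent setup script from this PR.
  "postCreateCommand": "bash .devcontainer/s3_bucket_setup.sh"
}
```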

This has been tested on both WSL2-based Ubuntu (in Windows 11) and on standalone Linux Mint, an Ubuntu-based distro.



@mshannon-sil (Collaborator) left a comment

I tested on my Ubuntu machine, and it's also working there. I'll check with Bryan from TechOps to see if he has any security concerns with running the container like this.

Reviewed 1 of 3 files at r1, all commit messages.
Reviewable status: 1 of 3 files reviewed, 3 unresolved discussions (waiting on @benjaminking)


.devcontainer/Dockerfile line 26 at r1 (raw file):

    curl \
    unzip
RUN apt-get update

I'm not sure it's necessary to call apt-get update again, since it's already called right before this, unless apt-get needs to be updated only after certain packages have been installed.


.devcontainer/Dockerfile line 28 at r1 (raw file):

RUN apt-get update
RUN apt-get install --no-install-recommends -y \
    less \

I don't see these two packages used anywhere. Are they just being added as helpful developer tools? I think they'd be useful, especially since I use nano quite a bit, but it'd be cleaner to move them to the larger apt-get install section directly above.


.devcontainer/Dockerfile line 54 at r1 (raw file):

CMD ["bash"]
# Set up the S3 bucket
RUN apt update

Is this separated out because it needs to use apt instead of apt-get, or could it be added to the longer apt-get install earlier in the Dockerfile?

@benjaminking (Collaborator, Author)

I updated the Dockerfile based on your comments. First, less and nano were included by mistake; I had them in my Dockerfile for convenience. (Strangely, to get apt-get to recognize them, you have to call apt-get update a second time.) I've removed them for now.

Also, for reasons I don't understand, if you want to install fuse3 and rclone on WSL2, they have to be in a separate apt-get command from the other installs. I've moved them up to the same section as the other apt-get installs. I also switched to using the bundled version of rclone instead of downloading it from its website, and that doesn't seem to cause any problems.
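A consolidated install step along the lines discussed might look like the sketch below; the exact package list and layer layout are illustrative, not the PR's final Dockerfile:

```dockerfile
# Single apt-get layer covering the previously separate installs,
# including fuse3 and the Ubuntu-archive build of rclone (instead of
# the version downloaded from rclone.org).
RUN apt-get update && apt-get install --no-install-recommends -y \
        curl \
        unzip \
        fuse3 \
        rclone \
    && rm -rf /var/lib/apt/lists/*
```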

@mshannon-sil (Collaborator) left a comment

Reviewed 1 of 3 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @benjaminking)


.devcontainer/Dockerfile line 48 at r2 (raw file):

ENV SIL_NLP_CACHE_PROJECT_DIR=/root/.cache/silnlp/projects
# Set up the S3 bucket
RUN mkdir -p /silnlp

To be consistent with the README instructions, which mount at ~/S, it's best to mount it at /root/S. I double-checked in my Docker container, and ~ is equivalent to the /root folder.

@benjaminking (Collaborator, Author)

I've just pushed a commit that changes the mount point to ~/S, adds an AppArmor profile, and updates devcontainer.json to use the new AppArmor profile.

To register the new AppArmor profile, you need to run sudo apparmor_parser -r -W docker-apparmor. This should only need to be done once for each machine.

I believe that Docker will ignore the AppArmor profile on systems that don't support AppArmor (or don't have it installed), so there shouldn't be any impact to non-Linux users. The AppArmor profile is also the same as the default Docker profile, except that it allows the container to perform mounts.
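As a rough illustration of the delta from docker-default, a mount-permitting profile might look like the sketch below. The profile name, flags, and includes are assumptions patterned on Docker's default template; the real file ships in the repo:

```
# Sketch of a docker-default-style AppArmor profile that also permits mounts.
#include <tunables/global>

profile docker-apparmor flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>

  network,
  capability,
  file,
  umount,

  # The one meaningful departure from docker-default: allow mount syscalls
  # so rclone/FUSE can attach the bucket inside the container.
  mount,
}
```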

@ddaspit (Collaborator) left a comment

Reviewed 4 of 4 files at r3, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @benjaminking)

@ddaspit (Collaborator) left a comment

LGTM, but I will let someone who has a better understanding of these things approve it.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @benjaminking)

@mshannon-sil (Collaborator) left a comment

@benjaminking Sorry this sat in my review backlog for so long. I've tested it and it's working for me. Now that we have an AppArmor profile, would we like to extend this beyond dev containers to the AQuA/Cheetah servers as well? It might help avoid the checkpoint-upload failures on those servers.

Reviewed 3 of 4 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @benjaminking)

@benjaminking (Collaborator, Author)

That's ok. It's been on my backlog as well. I think the next step is to install the AppArmor profile on one of our servers, modify the Docker setup in clearml_connection.py to use the new profile, and test that the container is still created normally. If that works, we should be able to test connecting to the S3 bucket automatically.

@mshannon-sil (Collaborator)

I actually already installed the AppArmor profile on the AQuA and Cheetah servers about an hour ago to prepare for testing, using the command you listed. So it should be all set for any changes you'd like to make to clearml_connection.py.

@benjaminking (Collaborator, Author)

Excellent. I will try to test out creating a Docker container using the new profile.

@benjaminking (Collaborator, Author)

I was able to successfully create a Docker container with the new AppArmor profile and run a task with it. My next step is to try to automatically connect to the S3 bucket from inside the container.

@benjaminking (Collaborator, Author)

It looks like my attempts at connecting to the S3 bucket from ClearML are failing because the rclone.conf file on the Docker image no longer has an [s3silnlp] section. Have we removed support for S3 on ClearML already?

(There are workarounds I could probably use, but I wasn't sure of the status of S3 on ClearML.)

@mshannon-sil (Collaborator)

@TaperChipmunk32 I think we were still supporting AWS for a little while longer, so it might be good to add the section back to rclone.conf? Also, I thought that IDX might be using their own AWS S3 bucket with SILNLP, and if that's the case it would be good to leave the support in.

@benjaminking We are switching over to MinIO next week, and support for it was recently added to master, so it also might make sense to try adapting your changes to MinIO.

@TaperChipmunk32 (Collaborator) commented Mar 17, 2025

> @TaperChipmunk32 I think we were still supporting AWS for a little while longer, so it might be good to add the section back to rclone.conf? Also, I thought that IDX might be using their own AWS S3 bucket with SILNLP, and if that's the case it would be good to leave the support in.
>
> @benjaminking We are switching over to MinIO next week, and support for it was recently added to master, so it also might make sense to try adapting your changes to MinIO.

Good point about IDX, if they do still use AWS, we can leave AWS support in. I'll just need to update the warning message and add back AWS to rclone.conf in a separate PR.

@ddaspit (Collaborator) left a comment

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @benjaminking)

@TaperChipmunk32 (Collaborator)

@ddaspit I just realized something from our discussions on the translate.py PR. Once this is merged in, we could potentially require the bucket to be mounted. That would mean we no longer need most of the S3-specific functions and libraries (boto3, S3Path, copy_experiment_to_bucket, etc.) or the cache system in silnlp, since rclone would handle all of the uploads/downloads. The exceptions would be clean_s3.py and sync_minio_b2.py.

@TaperChipmunk32 (Collaborator)

@benjaminking I can add a commit to switch this PR over to using MinIO and B2 instead of AWS, if you would like me to.

@benjaminking (Collaborator, Author)

Yes, please feel free to change it over to using MinIO and Backblaze.

@TaperChipmunk32 (Collaborator)

> Yes, please feel free to change it over to using MinIO and Backblaze.

Support should be in for MinIO now.

@ddaspit (Collaborator) left a comment

@TaperChipmunk32 That is a good point. I would love to be able to strip out all of the S3 code. We should discuss that option further. It might be the way forward for #685.

Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @benjaminking)

@benjaminking (Collaborator, Author)

> That is a good point. I would love to be able to strip out all of the S3 code. We should discuss that option further.

I am looking again into alternative ways to automatically mount the bucket on the AQuA servers, which is the last major hurdle before we can merge this in.

@benjaminking (Collaborator, Author)

I was able to get the AQuA machines to automatically connect to the MinIO bucket on ClearML jobs. We should be able to look into merging this soon, if everyone is comfortable with the way I implemented this.

@mshannon-sil (Collaborator) left a comment

Reviewed 1 of 2 files at r5, 5 of 5 files at r7, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @benjaminking)


minio_bucket_setup.sh line 5 at r7 (raw file):

sed -i -e "s#access_key_id = x*#access_key_id = $MINIO_ACCESS_KEY#" ~/.config/rclone/rclone.conf
sed -i -e "s#secret_access_key = x*#secret_access_key = $MINIO_SECRET_KEY#" ~/.config/rclone/rclone.conf

Does this script need to have the endpoint variable as well? Also, this script looks very similar to s3_bucket_setup.sh. Do they serve different use cases?

@benjaminking (Collaborator, Author)

I don't think it should need the endpoint variable. My understanding is that users of the dev container will have the truenas.psonet... value for their endpoint variable (and that value is already present in the rclone.conf file), rather than the one with the IP address, where we would need to run the sed command. As long as update_hosts.sh is called prior to this script, users should be able to connect.

I had actually intended for minio_bucket_setup.sh to replace s3_bucket_setup.sh, but forgot to remove the old one. I will make that change.
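The substitution step quoted above can be exercised in isolation. This sketch writes a throwaway config with placeholder keys (the section name, file path, and demo credentials are made up; only the sed commands come from the script under review) and applies the same substitutions:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway config standing in for ~/.config/rclone/rclone.conf.
conf=/tmp/rclone.conf.demo
cat > "$conf" <<'EOF'
[miniosilnlp]
type = s3
provider = Minio
access_key_id = xxxxxxxx
secret_access_key = xxxxxxxx
EOF

# Same substitutions as minio_bucket_setup.sh, with demo credentials.
MINIO_ACCESS_KEY=demo-access
MINIO_SECRET_KEY=demo-secret
sed -i -e "s#access_key_id = x*#access_key_id = $MINIO_ACCESS_KEY#" "$conf"
sed -i -e "s#secret_access_key = x*#secret_access_key = $MINIO_SECRET_KEY#" "$conf"

grep access_key "$conf"
```

Note that the `x*` patterns only rewrite the placeholder runs of `x` characters, so rerunning the script against an already-populated config leaves real keys untouched.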

@mshannon-sil (Collaborator) left a comment

:lgtm:

Reviewed 2 of 2 files at r8, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @benjaminking)

@benjaminking (Collaborator, Author) left a comment

:lgtm:

Reviewable status: 5 of 8 files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)


.devcontainer/Dockerfile line 48 at r2 (raw file):

Previously, mshannon-sil wrote…

> To be consistent with the readme instructions, which mount at ~/S, it's best to mount it at /root/S. I double checked in my docker container, and the ~ is equivalent to the /root folder.

Done. (Just trying to wrap up all of the unresolved threads on Reviewable so we can merge.)

@benjaminking merged commit 2a43129 into master on Apr 3, 2025
1 check was pending