Cluster Toolkit is open-source software offered by Google Cloud that makes it easy for customers to deploy AI/ML and HPC environments on Google Cloud.
Cluster Toolkit allows customers to deploy turnkey AI/ML and HPC environments (compute, networking, storage, etc.) following Google Cloud best practices, in a repeatable manner. The Cluster Toolkit is designed to be highly customizable and extensible, and aims to address the AI/ML and HPC deployment needs of a broad range of customers.
The Toolkit comes with a suite of tutorials, examples, and full documentation for a suite of modules that have been designed for AI/ML and HPC use cases. More information can be found on the Google Cloud Docs.
Running through the quickstart tutorial is the recommended path to get started with the Cluster Toolkit.
If you prefer a self-directed path, use the following commands to build the gcluster binary:
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit
cd cluster-toolkit
make
./gcluster --version
./gcluster --help
NOTE: You may need to install dependencies first.
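Before running make, it can help to confirm that the tools used in this README are on your PATH. The following is a minimal sketch, not an authoritative dependency check: missing_tools is a hypothetical helper name, and the tool list in the usage comment (git, make, go, terraform, packer) is an assumption based on the steps in this document and on gcluster being built from Go sources.

```shell
# missing_tools TOOL...
# Prints "missing: TOOL" for each named command that is not found on PATH.
# Returns 0 if every tool is present, 1 if at least one is missing.
missing_tools() {
  mt_status=0
  for mt_tool in "$@"; do
    if ! command -v "$mt_tool" >/dev/null 2>&1; then
      echo "missing: $mt_tool"
      mt_status=1
    fi
  done
  return "$mt_status"
}

# Example (the tool list is an assumption; adjust it to your environment):
#   missing_tools git make go terraform packer
```

The helper only checks that each command exists; it does not verify versions.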
Take the following steps to deploy and test Arc's Slurm staging cluster.
gcloud iam service-accounts enable --project=arc-ops 592634133521-compute@developer.gserviceaccount.com
gcloud projects add-iam-policy-binding arc-ops --member=serviceAccount:592634133521-compute@developer.gserviceaccount.com --role=roles/editor
gcloud auth application-default login
git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
cd cluster-toolkit/
make
./gcluster --version
Also install Terraform:
Example for macOS
brew tap hashicorp/tap
brew install hashicorp/tap/terraform
terraform -help
Slurm config files can be edited in community/modules/scheduler/schedmd-slurm-gcp-v6-controller/etc/htc-slurm.conf.tpl
A cluster blueprint is a YAML file that defines the cluster. The gcluster command, built in the previous step, uses the cluster blueprint to create a deployment folder, which can then be used to deploy the cluster.
Create the deployment folder from the hpc-slurm.yaml blueprint with:
./gcluster create examples/hpc-slurm.yaml \
-l ERROR --vars project_id=arc-ops
Deploy the cluster with:
./gcluster deploy hpc-slurm
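The create and deploy steps above can be chained so that the deploy only runs if the deployment folder was created successfully. This is a hedged sketch: create_and_deploy and the DRY_RUN variable are hypothetical conveniences, not part of gcluster. The deployment name passed as the second argument is assumed to match the deployment_name variable in the blueprint.

```shell
# create_and_deploy BLUEPRINT DEPLOYMENT PROJECT_ID
# Creates the deployment folder from BLUEPRINT, then deploys it.
# If DRY_RUN is set to a non-empty value, the commands are printed
# instead of being executed.
create_and_deploy() {
  cad_blueprint=$1
  cad_deployment=$2
  cad_project=$3
  ${DRY_RUN:+echo} ./gcluster create "$cad_blueprint" \
    -l ERROR --vars "project_id=$cad_project" &&
    ${DRY_RUN:+echo} ./gcluster deploy "$cad_deployment"
}

# Example, using the values from the steps above:
#   DRY_RUN=1
#   create_and_deploy examples/hpc-slurm.yaml hpc-slurm arc-ops
```

Running with DRY_RUN set first is a cheap way to review the exact commands before any resources are provisioned.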
- Navigate to the Compute Engine console.
- Connect to the hpcslurm-login-* VM using SSH-in-browser.
- Confirm the Slurm cluster is online:
[jeremy_sullivan_arcinstitute_org@hpcslurm-slurm-login-001 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute up infinite 20 idle~ hpcslurm-computenodeset-[0-19]
debug* up infinite 4 idle~ hpcslurm-debugnodeset-[0-3]
Run a test job:
srun -N 3 hostname
This command creates three compute nodes for your HPC cluster. This may take a minute while Slurm auto-scales to create the three nodes.
When the job finishes, you should see output similar to:
srun -N 3 hostname
hpcslurm-debug-ghpc-0
hpcslurm-debug-ghpc-1
hpcslurm-debug-ghpc-2
The auto-scaled nodes are automatically destroyed by the Slurm controller if left idle for more than 60 seconds.
To avoid incurring charges to our Google Cloud account for the resources used in the staging cluster, follow these steps:
- Go to the VM instances page and check that the compute nodes are deleted. Compute nodes use the following naming convention:
hpcslurm-debug-ghpc-*
If you see any of these nodes, wait several minutes for them to be automatically deleted. This might take up to four minutes.
- After the compute nodes are removed, run the following command:
./gcluster destroy hpc-slurm --auto-approve
...
Destroy complete!
Resources: xx destroyed.
- Go to the VM instances page and check that the VMs are deleted.
Note: If the destroy command is run before Slurm shuts down the auto-scale nodes then the destroy command might fail. In this case, you can delete the VMs manually and rerun the destroy command.
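To reduce the chance of the destroy command failing because auto-scaled nodes still exist, you can poll until no compute nodes are left and only then destroy the deployment. The sketch below is one possible approach: list_compute_nodes and wait_until_empty are hypothetical helper names, the name filter assumes the hpcslurm-debug-ghpc-* naming convention described above, and the retry count and delay are illustrative.

```shell
# list_compute_nodes PROJECT_ID
# Prints the names of auto-scaled compute nodes that still exist.
list_compute_nodes() {
  gcloud compute instances list \
    --project="$1" \
    --filter="name~^hpcslurm-debug-ghpc-" \
    --format="value(name)"
}

# wait_until_empty LISTER TRIES DELAY [ARGS...]
# Calls "LISTER ARGS..." up to TRIES times, sleeping DELAY seconds between
# attempts. Returns 0 as soon as LISTER succeeds and prints nothing.
# A failing LISTER is never treated as "empty". Returns 1 after TRIES attempts.
wait_until_empty() {
  wue_lister=$1
  wue_tries=$2
  wue_delay=$3
  shift 3
  while [ "$wue_tries" -gt 0 ]; do
    if wue_out=$("$wue_lister" "$@") && [ -z "$wue_out" ]; then
      return 0
    fi
    wue_tries=$((wue_tries - 1))
    if [ "$wue_tries" -gt 0 ]; then sleep "$wue_delay"; fi
  done
  return 1
}

# Example: poll every 30 seconds for roughly four minutes, then destroy.
#   wait_until_empty list_compute_nodes 8 30 arc-ops &&
#     ./gcluster destroy hpc-slurm --auto-approve
```

Passing the lister as an argument keeps the polling logic independent of gcloud, so it can be exercised with a stub function.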
Learn about the components that make up the Cluster Toolkit and more on how it works on the Google Cloud Docs Product Overview.
Terraform can discover credentials for authenticating to Google Cloud Platform in several ways. We summarize Terraform's documentation for using gcloud from your workstation and for automatically finding credentials in cloud environments. We do not recommend following HashiCorp's instructions for downloading service account keys.
You can generate cloud credentials associated with your Google Cloud account using the following command:
gcloud auth application-default login
You will be prompted to open your web browser and authenticate to Google Cloud and make your account accessible from the command-line. Once this command completes, Terraform will automatically use your "Application Default Credentials."
If you receive failure messages containing "quota project", change the quota project associated with your Application Default Credentials with the following command, providing your current project ID as the argument:
gcloud auth application-default set-quota-project ${PROJECT_ID}
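A quick way to confirm that Application Default Credentials are usable before running Terraform is to ask gcloud for an access token and discard it. The following sketch assumes gcloud is installed; have_adc and check_adc are hypothetical helper names.

```shell
# have_adc
# Succeeds if gcloud can obtain an access token from the
# Application Default Credentials. The token itself is discarded.
have_adc() {
  gcloud auth application-default print-access-token >/dev/null 2>&1
}

# check_adc
# Prints a short status message and returns non-zero if no usable
# credentials were found.
check_adc() {
  if have_adc; then
    echo "Application Default Credentials found"
  else
    echo "No usable credentials; run: gcloud auth application-default login"
    return 1
  fi
}
```

Calling check_adc at the top of a deployment script makes a credentials problem visible before any Terraform command runs.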
In virtualized settings, the cloud credentials of accounts can be attached directly to the execution environment. For example, a VM or a container can have a service account attached to it. The Google Cloud Shell is an interactive command line environment which inherits the credentials of the user logged in to the Google Cloud Console.
Many of the above examples are easily executed within a Cloud Shell environment. Be aware that Cloud Shell has several limitations, in particular an inactivity timeout that will close running shells after 20 minutes. Please consider it only for blueprints that are quickly deployed.
The Cluster Toolkit officially supports the following VM images:
- HPC CentOS 7
- HPC Rocky Linux 8
- Debian 11
- Ubuntu 20.04 LTS
For more information on these and other images, see docs/vm-images.md.
Warning: Slurm Terraform modules cannot be directly used on the standard OS images. They must be used in combination with images built for the versioned release of the Terraform module.
The Cluster Toolkit provides modules and examples for implementing pre-built and custom Slurm VM images; see Slurm on GCP.
The Toolkit contains "validator" functions that perform basic tests of the blueprint to ensure that deployment variables are valid and that the AI/ML and HPC environment can be provisioned in your Google Cloud project. Further information can be found in dedicated documentation.
In a new GCP project there are several APIs that must be enabled to deploy your cluster. These will be caught when you perform terraform apply, but you can save time by enabling them upfront.
See Google Cloud Docs for instructions.
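Enabling APIs upfront can be scripted with gcloud services enable, which accepts several service names at once. The sketch below is illustrative only: enable_apis and DRY_RUN are hypothetical, and the two services in the usage comment (Compute Engine and Filestore) are an assumption based on the resources mentioned in this README. The authoritative list is in the Google Cloud Docs.

```shell
# enable_apis PROJECT_ID SERVICE...
# Enables the listed services in PROJECT_ID. If DRY_RUN is set to a
# non-empty value, the gcloud command is printed instead of executed.
enable_apis() {
  ea_project=$1
  shift
  ${DRY_RUN:+echo} gcloud services enable "$@" --project="$ea_project"
}

# Example (the service list is an assumption; check the docs for your blueprint):
#   enable_apis arc-ops compute.googleapis.com file.googleapis.com
```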
You may need to request additional quota to be able to deploy and use your cluster.
See Google Cloud Docs for more information.
You can view your billing reports for your cluster on the Cloud Billing Reports page. To view the Cloud Billing reports for your Cloud Billing account, including viewing the cost information for all of the Cloud projects that are linked to the account, you need a role that includes the billing.accounts.getSpendingInformation permission on your Cloud Billing account.
To view the Cloud Billing reports for your Cloud Billing account:
- In the Google Cloud Console, go to Navigation Menu > Billing.
- At the prompt, choose the Cloud Billing account for which you'd like to view reports. The Billing Overview page opens for the selected billing account.
- In the Billing navigation menu, select Reports.
On the right side, expand the Filters view and then filter by label, specifying the key ghpc_deployment (or ghpc_blueprint) and the desired value.
Confirm that you have properly set up Google Cloud credentials.
Please see the dedicated troubleshooting guide for Slurm.
When terraform apply fails, Terraform generally provides a useful error message. Here are some common reasons for the deployment to fail:
- GCP Access: The credentials being used to call terraform apply do not have access to the GCP project. This can be fixed by granting access in IAM & Admin.
- Disabled APIs: The GCP project must have the proper APIs enabled. See Enable GCP APIs.
- Insufficient Quota: The GCP project does not have enough quota to provision the requested resources. See GCP Quotas.
- Filestore resource limit: When regularly deploying Filestore instances with a new VPC you may see an error during deployment such as: System limit for internal resources has been reached. See this doc for the solution.
- Required permission not found:
  - Example: Required 'compute.projects.get' permission for 'projects/... forbidden
  - Credentials may not be set, or are not set correctly. Please follow instructions at Cloud credentials on your workstation.
  - Ensure proper permissions are set in the cloud console IAM section.
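When triaging a failed deployment from a saved log, matching the message against the phrases quoted in this README can point to the right entry in the list above. The following is a rough sketch: classify_apply_error and its category labels are hypothetical, and it recognizes only the message fragments that appear in this document.

```shell
# classify_apply_error MESSAGE
# Prints a short category label for MESSAGE, based only on the error
# fragments quoted in this README. Anything else is "unknown".
classify_apply_error() {
  case $1 in
    *"System limit for internal resources has been reached"*)
      echo "filestore-resource-limit" ;;
    *"Required '"*"' permission"*)
      echo "missing-permission" ;;
    *"quota project"*)
      echo "quota-project" ;;
    *)
      echo "unknown" ;;
  esac
}

# Example:
#   classify_apply_error "$(cat apply.log)"   # apply.log is a hypothetical file
```

An "unknown" result means only that the message is not one of the quoted examples; the Terraform output itself remains the best guide.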
If terraform destroy fails with an error such as the following:
│ Error: Error when reading or editing Subnetwork: googleapi: Error 400: The subnetwork resource 'projects/<project_name>/regions/<region>/subnetworks/<subnetwork_name>' is already being used by 'projects/<project_name>/zones/<zone>/instances/<instance_name>', resourceInUseByAnotherResource
or
│ Error: Error waiting for Deleting Network: The network resource 'projects/<project_name>/global/networks/<vpc_network_name>' is already being used by 'projects/<project_name>/global/firewalls/<firewall_rule_name>'
These errors indicate that the VPC network cannot be destroyed because resources were added outside of Terraform and that those resources depend upon the network. These resources should be deleted manually. The first message indicates that a new VM has been added to a subnetwork within the VPC network. The second message indicates that a new firewall rule has been added to the VPC network. If your error message does not look like these, examine it carefully to identify the type of resource to delete and its unique name. In the two messages above, the resource names appear toward the end of the error message. The following links will take you directly to the areas within the Cloud Console for managing VMs and Firewall rules. Make certain that your project ID is selected in the drop-down menu at the top-left.
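Because the blocking resource appears in quotes after "is already being used by", it can be pulled out of the error text mechanically. This is a minimal sketch that assumes the message has the same shape as the two examples above; blocking_resource and resource_kind are hypothetical helper names.

```shell
# blocking_resource MESSAGE
# Prints the resource path quoted after "is already being used by".
# Prints nothing if MESSAGE does not contain that phrase.
blocking_resource() {
  printf '%s\n' "$1" |
    sed -n "s/.*is already being used by '\([^']*\)'.*/\1/p"
}

# resource_kind RESOURCE_PATH
# Prints the second-to-last path component, which names the resource
# collection (for example "instances" or "firewalls").
resource_kind() {
  printf '%s\n' "$1" | awk -F/ 'NF >= 2 { print $(NF-1) }'
}
```

For the second example message, blocking_resource yields the firewall rule path and resource_kind applied to that path yields firewalls, which tells you which Cloud Console page to open.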
The deployment will be created with the following directory structure:
<<OUTPUT_PATH>>/<<DEPLOYMENT_NAME>>/{<<DEPLOYMENT_GROUPS>>}/
If an output directory is provided with the --output/-o flag, the deployment directory will be created in the output directory, represented as <<OUTPUT_PATH>> here. If not provided, <<OUTPUT_PATH>> will default to the current working directory.
The deployment directory is created in <<OUTPUT_PATH>> as a directory matching the provided deployment_name deployment variable (vars) in the blueprint.
Within the deployment directory are directories representing each deployment group in the blueprint, named the same as the group field for each element in deployment_groups.
Each deployment group directory contains all of the configuration scripts and modules needed to deploy. The modules are in a directory named modules, each in a subdirectory named the same as its source module; for example, the vpc module is in a directory named vpc.
A hidden directory containing meta information and backups is also created and named .ghpc.
From the hpc-slurm.yaml example, we get the following deployment directory:
hpc-slurm/
  primary/
    main.tf
    modules/
    providers.tf
    terraform.tfvars
    variables.tf
    versions.tf
  .ghpc/
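The path rule described above (output path, then deployment name, then deployment group) can be written as a small helper, which is handy when scripting against a deployment folder. This is a sketch under the layout described in this section; deployment_group_dir is a hypothetical name.

```shell
# deployment_group_dir DEPLOYMENT_NAME GROUP [OUTPUT_PATH]
# Prints the directory of one deployment group. OUTPUT_PATH defaults to
# the current working directory, mirroring the --output/-o default.
deployment_group_dir() {
  dgd_output=${3:-.}
  # Strip one trailing slash so "out/" and "out" give the same result.
  printf '%s/%s/%s\n' "${dgd_output%/}" "$1" "$2"
}

# Example: inspect the Terraform plan for the primary group.
#   terraform -chdir="$(deployment_group_dir hpc-slurm primary)" plan
```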
See Cloud Docs on Installing Dependencies.
The Toolkit supports Packer templates in the contemporary HCL2 file format and not in the legacy JSON file format. We require the use of Packer 1.7.9 or above, and recommend using the latest release.
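The minimum-version requirement can be checked numerically rather than by eye; a plain string comparison would wrongly rank 1.10.0 below 1.7.9. The sketch below compares up to three dot-separated numeric fields. version_ge is a hypothetical helper, and it assumes plain numeric versions such as 1.7.9 with no "v" prefix or pre-release suffix.

```shell
# version_ge HAVE WANT
# Succeeds if version HAVE is greater than or equal to version WANT.
# Missing fields count as 0, so "2" is treated as "2.0.0".
version_ge() {
  vg_have="$1.0.0"
  vg_want="$2.0.0"
  for vg_i in 1 2 3; do
    vg_h=$(printf '%s\n' "$vg_have" | cut -d. -f"$vg_i")
    vg_w=$(printf '%s\n' "$vg_want" | cut -d. -f"$vg_i")
    if [ "$vg_h" -gt "$vg_w" ]; then return 0; fi
    if [ "$vg_h" -lt "$vg_w" ]; then return 1; fi
  done
  return 0
}

# Example, with installed_version set by you to the Packer version in use:
#   version_ge "$installed_version" 1.7.9 || echo "Packer is too old"
```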
The Toolkit's Packer template module documentation describes input variables and their behavior. An image-building example and usage instructions are provided. The example integrates Packer, Terraform and startup-script runners to demonstrate the power of customizing images using the same scripts that can be applied at boot-time.
The following setup is in addition to the dependencies needed to build and run the Cluster Toolkit.
Please use the pre-commit hooks configured in this repository to ensure that all changes are validated, tested and properly documented before pushing code changes. The pre-commit hooks configured in the Cluster Toolkit have a set of dependencies that need to be installed before the hooks can pass.
Follow these steps to install and set up pre-commit in your cloned repository:
- Install pre-commit using the instructions from the pre-commit website.
- Install TFLint using the instructions from the TFLint documentation.
  NOTE: The version of TFLint must be compatible with the Google plugin version identified in tflint.hcl. Versions of the plugin >=0.20.0 should use tflint>=0.40.0. These versions are readily available via GitHub or package managers. Please review the TFLint Ruleset for Google Release Notes for up-to-date requirements.
- Install ShellCheck using the instructions from the ShellCheck documentation.
- The other dev dependencies can be installed by running the following command in the project root directory:
  make install-dev-deps
- Pre-commit is enabled on a repo-by-repo basis by running the following command in the project root directory:
  pre-commit install
Now pre-commit is configured to automatically run before you commit.
While macOS is a supported environment for building and executing the Toolkit, it is not supported for Toolkit development due to GNU-specific shell scripts.
If developing on a Mac, a workaround is to install GNU tooling by installing coreutils and findutils from a package manager such as Homebrew or conda.
Please refer to the contributing file in our GitHub repository, or to Google’s Open Source documentation.