From 1115fa582339c7667dab4d080a180475a305a6dd Mon Sep 17 00:00:00 2001 From: hongye-sun <43763191+hongye-sun@users.noreply.github.com> Date: Thu, 18 Apr 2019 11:22:00 -0700 Subject: [PATCH] Apply latest doc review changes to github docs (#1128) * Apply latest doc review changes to github docs * merge changes from tech writer * adding missing dataproc components --- components/gcp/bigquery/query/README.md | 105 ++++++++----- components/gcp/bigquery/query/sample.ipynb | 113 ++++++++----- .../gcp/dataflow/launch_python/README.md | 112 +++++++------ .../gcp/dataflow/launch_python/sample.ipynb | 115 ++++++++------ .../gcp/dataflow/launch_template/README.md | 100 +++++++----- .../gcp/dataflow/launch_template/sample.ipynb | 102 +++++++----- .../gcp/dataproc/create_cluster/README.md | 94 ++++++----- .../gcp/dataproc/create_cluster/sample.ipynb | 110 +++++++------ .../gcp/dataproc/delete_cluster/README.md | 72 +++++---- .../gcp/dataproc/delete_cluster/sample.ipynb | 94 ++++++----- .../gcp/dataproc/submit_hadoop_job/README.md | 109 +++++++------ .../dataproc/submit_hadoop_job/sample.ipynb | 123 +++++++++------ .../gcp/dataproc/submit_hive_job/README.md | 85 +++++----- .../gcp/dataproc/submit_hive_job/sample.ipynb | 101 +++++++----- .../gcp/dataproc/submit_pig_job/README.md | 89 ++++++----- .../gcp/dataproc/submit_pig_job/sample.ipynb | 105 +++++++------ .../gcp/dataproc/submit_pyspark_job/README.md | 87 +++++----- .../dataproc/submit_pyspark_job/sample.ipynb | 101 +++++++----- .../gcp/dataproc/submit_spark_job/README.md | 104 +++++++----- .../dataproc/submit_spark_job/sample.ipynb | 123 +++++++++------ .../dataproc/submit_sparksql_job/README.md | 71 +++++---- .../dataproc/submit_sparksql_job/sample.ipynb | 86 +++++----- .../gcp/ml_engine/batch_predict/README.md | 104 +++++++----- .../gcp/ml_engine/batch_predict/sample.ipynb | 111 ++++++++----- components/gcp/ml_engine/deploy/README.md | 141 +++++++++++------ components/gcp/ml_engine/deploy/sample.ipynb | 148 ++++++++++++------ components/gcp/ml_engine/train/README.md | 120 ++++++++------ components/gcp/ml_engine/train/sample.ipynb | 125 +++++++++------ 28 files changed, 1767 insertions(+), 1183 deletions(-) diff --git a/components/gcp/bigquery/query/README.md b/components/gcp/bigquery/query/README.md index ea6b36faf19..f42dff1e85e 100644 --- a/components/gcp/bigquery/query/README.md +++ b/components/gcp/bigquery/query/README.md @@ -1,49 +1,78 @@ -# Submitting a query using BigQuery -A Kubeflow Pipeline component to submit a query to Google Cloud Bigquery service and dump outputs to a Google Cloud Storage blob. +# Name -## Intended Use -The component is intended to export query data from BiqQuery service to Cloud Storage. +Gather training data by querying BigQuery -## Runtime arguments -Name | Description | Data type | Optional | Default -:--- | :---------- | :-------- | :------- | :------ -query | The query used by Bigquery service to fetch the results. | String | No | -project_id | The project to execute the query job. | GCPProjectID | No | -dataset_id | The ID of the persistent dataset to keep the results of the query. If the dataset does not exist, the operation will create a new one. | String | Yes | ` ` -table_id | The ID of the table to keep the results of the query. If absent, the operation will generate a random id for the table. | String | Yes | ` ` -output_gcs_path | The path to the Cloud Storage bucket to store the query output. | GCSPath | Yes | ` ` -dataset_location | The location to create the dataset. Defaults to `US`. 
| String | Yes | `US` -job_config | The full config spec for the query job. See [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) for details. | Dict | Yes | ` ` +# Labels + +GCP, BigQuery, Kubeflow, Pipeline + + +# Summary + +A Kubeflow Pipeline component to submit a query to BigQuery and store the result in a Cloud Storage bucket. + + +# Details + + +## Intended use + +Use this Kubeflow component to: +* Select training data by submitting a query to BigQuery. +* Output the training data into a Cloud Storage bucket as CSV files. + + +## Runtime arguments: + + +| Argument | Description | Optional | Data type | Accepted values | Default | +|----------|-------------|----------|-----------|-----------------|---------| +| query | The query used by BigQuery to fetch the results. | No | String | | | +| project_id | The project ID of the Google Cloud Platform (GCP) project to use to execute the query. | No | GCPProjectID | | | +| dataset_id | The ID of the persistent BigQuery dataset to store the results of the query. If the dataset does not exist, the operation will create a new one. | Yes | String | | None | +| table_id | The ID of the BigQuery table to store the results of the query. If the table ID is absent, the operation will generate a random ID for the table. | Yes | String | | None | +| output_gcs_path | The path to the Cloud Storage bucket to store the query output. | Yes | GCSPath | | None | +| dataset_location | The location where the dataset is created. Defaults to US. | Yes | String | | US | +| job_config | The full configuration specification for the query job. See [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) for details. | Yes | Dict | A JSONobject which has the same structure as [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) | None | +## Input data schema + +The input data is a BigQuery job containing a query that pulls data f rom various sources. + + +## Output: -## Outputs Name | Description | Type :--- | :---------- | :--- output_gcs_path | The path to the Cloud Storage bucket containing the query output in CSV format. | GCSPath -## Cautions and requirements +## Cautions & requirements + To use the component, the following requirements must be met: -* BigQuery API is enabled -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -```python -bigquery_query_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) +* The BigQuery API is enabled. +* The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow Pipeline cluster. For example: -``` + ``` + bigquery_query_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* The Kubeflow user service account is a member of the `roles/bigquery.admin` role of the project. +* The Kubeflow user service account is a member of the `roles/storage.objectCreator `role of the Cloud Storage output bucket. -* The Kubeflow user service account is a member of `roles/bigquery.admin` role of the project. 
-* The Kubeflow user service account is also a member of `roles/storage.objectCreator` role of the Cloud Storage output bucket. +## Detailed description +This Kubeflow Pipeline component is used to: +* Submit a query to BigQuery. + * The query results are persisted in a dataset table in BigQuery. + * An extract job is created in BigQuery to extract the data from the dataset table and output it to a Cloud Storage bucket as CSV files. -## Detailed Description -The component does several things: -1. Creates persistent dataset and table if they do not exist. -1. Submits a query to BigQuery service and persists the result to the table. -1. Creates an extraction job to output the table data to a Cloud Storage bucket in CSV format. + Use the code below as an example of how to run your BigQuery job. -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +### Sample +Note: The following sample code works in an IPython notebook or directly in Python code. + +#### Set sample parameters ```python @@ -64,13 +93,6 @@ bigquery_query_op = comp.load_component_from_url( help(bigquery_query_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/bigquery/_query.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb) -* [BigQuery query REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query) - - ### Sample Note: The following sample code works in IPython notebook or directly in Python code. @@ -161,3 +183,12 @@ run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arg ```python !gsutil cat OUTPUT_PATH ``` + +## References +* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/bigquery/_query.py) +* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb) +* [BigQuery query REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query) + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/bigquery/query/sample.ipynb b/components/gcp/bigquery/query/sample.ipynb index ee1945c637c..9da2362ef87 100644 --- a/components/gcp/bigquery/query/sample.ipynb +++ b/components/gcp/bigquery/query/sample.ipynb @@ -4,50 +4,80 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a query using BigQuery \n", - "A Kubeflow Pipeline component to submit a query to Google Cloud Bigquery service and dump outputs to a Google Cloud Storage blob. \n", + "# Name\n", "\n", - "## Intended Use\n", - "The component is intended to export query data from BiqQuery service to Cloud Storage. 
\n", + "Gather training data by querying BigQuery \n", "\n", - "## Runtime arguments\n", - "Name | Description | Data type | Optional | Default\n", - ":--- | :---------- | :-------- | :------- | :------\n", - "query | The query used by Bigquery service to fetch the results. | String | No |\n", - "project_id | The project to execute the query job. | GCPProjectID | No |\n", - "dataset_id | The ID of the persistent dataset to keep the results of the query. If the dataset does not exist, the operation will create a new one. | String | Yes | ` `\n", - "table_id | The ID of the table to keep the results of the query. If absent, the operation will generate a random id for the table. | String | Yes | ` `\n", - "output_gcs_path | The path to the Cloud Storage bucket to store the query output. | GCSPath | Yes | ` `\n", - "dataset_location | The location to create the dataset. Defaults to `US`. | String | Yes | `US`\n", - "job_config | The full config spec for the query job. See [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) for details. | Dict | Yes | ` `\n", "\n", + "# Labels\n", + "\n", + "GCP, BigQuery, Kubeflow, Pipeline\n", + "\n", + "\n", + "# Summary\n", + "\n", + "A Kubeflow Pipeline component to submit a query to BigQuery and store the result in a Cloud Storage bucket.\n", + "\n", + "\n", + "# Details\n", + "\n", + "\n", + "## Intended use\n", + "\n", + "Use this Kubeflow component to:\n", + "* Select training data by submitting a query to BigQuery.\n", + "* Output the training data into a Cloud Storage bucket as CSV files.\n", + "\n", + "\n", + "## Runtime arguments:\n", + "\n", + "\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|----------|-------------|----------|-----------|-----------------|---------|\n", + "| query | The query used by BigQuery to fetch the results. | No | String | | |\n", + "| project_id | The project ID of the Google Cloud Platform (GCP) project to use to execute the query. | No | GCPProjectID | | |\n", + "| dataset_id | The ID of the persistent BigQuery dataset to store the results of the query. If the dataset does not exist, the operation will create a new one. | Yes | String | | None |\n", + "| table_id | The ID of the BigQuery table to store the results of the query. If the table ID is absent, the operation will generate a random ID for the table. | Yes | String | | None |\n", + "| output_gcs_path | The path to the Cloud Storage bucket to store the query output. | Yes | GCSPath | | None |\n", + "| dataset_location | The location where the dataset is created. Defaults to US. | Yes | String | | US |\n", + "| job_config | The full configuration specification for the query job. See [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) for details. | Yes | Dict | A JSONobject which has the same structure as [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) | None |\n", + "## Input data schema\n", + "\n", + "The input data is a BigQuery job containing a query that pulls data f rom various sources. 
\n", + "\n", + "\n", + "## Output:\n", "\n", - "## Outputs\n", "Name | Description | Type\n", ":--- | :---------- | :---\n", "output_gcs_path | The path to the Cloud Storage bucket containing the query output in CSV format. | GCSPath\n", "\n", - "## Cautions and requirements\n", + "## Cautions & requirements\n", + "\n", "To use the component, the following requirements must be met:\n", - "* BigQuery API is enabled\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", "\n", - "```python\n", - "bigquery_query_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + "* The BigQuery API is enabled.\n", + "* The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow Pipeline cluster. For example:\n", + "\n", + " ```\n", + " bigquery_query_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* The Kubeflow user service account is a member of the `roles/bigquery.admin` role of the project.\n", + "* The Kubeflow user service account is a member of the `roles/storage.objectCreator `role of the Cloud Storage output bucket.\n", + "\n", + "## Detailed description\n", + "This Kubeflow Pipeline component is used to:\n", + "* Submit a query to BigQuery.\n", + " * The query results are persisted in a dataset table in BigQuery.\n", + " * An extract job is created in BigQuery to extract the data from the dataset table and output it to a Cloud Storage bucket as CSV files.\n", "\n", - "```\n", + " Use the code below as an example of how to run your BigQuery job.\n", "\n", - "* The Kubeflow user service account is a member of `roles/bigquery.admin` role of the project.\n", - "* The Kubeflow user service account is also a member of `roles/storage.objectCreator` role of the Cloud Storage output bucket.\n", + "### Sample\n", "\n", - "## Detailed Description\n", - "The component does several things:\n", - "1. Creates persistent dataset and table if they do not exist.\n", - "1. Submits a query to BigQuery service and persists the result to the table.\n", - "1. Creates an extraction job to output the table data to a Cloud Storage bucket in CSV format.\n", + "Note: The following sample code works in an IPython notebook or directly in Python code.\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. 
Install KFP SDK\n" + "#### Set sample parameters" ] }, { @@ -86,13 +116,6 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/bigquery/_query.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb)\n", - "* [BigQuery query REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query)\n", - "\n", - "\n", "### Sample\n", "\n", "Note: The following sample code works in IPython notebook or directly in Python code.\n", @@ -241,6 +264,20 @@ "source": [ "!gsutil cat OUTPUT_PATH" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/bigquery/_query.py)\n", + "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb)\n", + "* [BigQuery query REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." + ] } ], "metadata": { @@ -259,7 +296,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataflow/launch_python/README.md b/components/gcp/dataflow/launch_python/README.md index 9d6490db9c7..514609a8a39 100644 --- a/components/gcp/dataflow/launch_python/README.md +++ b/components/gcp/dataflow/launch_python/README.md @@ -1,54 +1,65 @@ -# Executing an Apache Beam Python job in Cloud Dataflow -A Kubeflow Pipeline component that submits an Apache Beam job (authored in Python) to Cloud Dataflow for execution. The Python Beam code is run with the Cloud Dataflow Runner. +# Name +Data preparation by executing an Apache Beam job in Cloud Dataflow -## Intended Use -Use this component to run a Python Beam code to submit a Dataflow job as a step of a KFP pipeline. The component will wait until the job finishes. +# Labels +GCP, Cloud Dataflow, Apache Beam, Python, Kubeflow + +# Summary +A Kubeflow Pipeline component that prepares data by submitting an Apache Beam job (authored in Python) to Cloud Dataflow for execution. The Python Beam code is run with Cloud Dataflow Runner. + +# Details +## Intended use + +Use this component to run a Python Beam code to submit a Cloud Dataflow job as a step of a Kubeflow pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -python_file_path | The Cloud Storage or the local path to the python file being run. | String | No | -project_id | The ID of the parent project of the Dataflow job. | GCPProjectID | No | -staging_dir | The Cloud Storage directory for keeping staging files. 
A random subdirectory will be created under the directory to keep job info for resuming the job in case of failure and it will be passed as `staging_location` and `temp_location` command line args of the beam code. | GCSPath | Yes | ` ` -requirements_file_path | The Cloud Storageor the local path to the pip requirements file. | String | Yes | ` ` -args | The list of arguments to pass to the python file. | List | Yes | `[]` -wait_interval | The seconds to wait between calls to get the job status. | Integer | Yes | `30` - -## Output: -Name | Description | Type -:--- | :---------- | :--- -job_id | The id of the created dataflow job. | String - -## Cautions and requirements +Name | Description | Optional | Data type| Accepted values | Default | +:--- | :----------| :----------| :----------| :----------| :---------- | +python_file_path | The path to the Cloud Storage bucket or local directory containing the Python file to be run. | | GCSPath | | | +project_id | The ID of the Google Cloud Platform (GCP) project containing the Cloud Dataflow job.| | GCPProjectID | | | +staging_dir | The path to the Cloud Storage directory where the staging files are stored. A random subdirectory will be created under the staging directory to keep the job information.This is done so that you can resume the job in case of failure. `staging_dir` is passed as the command line arguments (`staging_location` and `temp_location`) of the Beam code. | Yes | GCPPath | | None | +requirements_file_path | The path to the Cloud Storage bucket or local directory containing the pip requirements file. | Yes | GCSPath | | None | +args | The list of arguments to pass to the Python file. | No | List | A list of string arguments | None | +wait_interval | The number of seconds to wait between calls to get the status of the job. | Yes | Integer | | 30 | + +## Input data schema + +Before you use the component, the following files must be ready in a Cloud Storage bucket: +- A Beam Python code file. +- A `requirements.txt` file which includes a list of dependent packages. + +The Beam Python code should follow the [Beam programming guide](https://beam.apache.org/documentation/programming-guide/) as well as the following additional requirements to be compatible with this component: +- It accepts the command line arguments `--project`, `--temp_location`, `--staging_location`, which are [standard Dataflow Runner options](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#setting-other-cloud-pipeline-options). +- It enables `info logging` before the start of a Cloud Dataflow job in the Python code. This is important to allow the component to track the status and ID of the job that is created. For example, calling `logging.getLogger().setLevel(logging.INFO)` before any other code. + + +## Output +Name | Description +:--- | :---------- +job_id | The id of the Cloud Dataflow job that is created. + +## Cautions & requirements To use the components, the following requirements must be met: -* Dataflow API is enabled. -* The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a KFP cluster. For example: +- Cloud Dataflow API is enabled. +- The component is running under a secret Kubeflow user service account in a Kubeflow Pipeline cluster. For example: ``` component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) ``` -* The Kubeflow user service account is a member of `roles/dataflow.developer` role of the project. 
-* The Kubeflow user service account is a member of `roles/storage.objectViewer` role of the Cloud Storage Objects `python_file_path` and `requirements_file_path`. -* The Kubeflow user service account is a member of `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir`. +The Kubeflow user service account is a member of: +- `roles/dataflow.developer` role of the project. +- `roles/storage.objectViewer` role of the Cloud Storage Objects `python_file_path` and `requirements_file_path`. +- `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir`. ## Detailed description -Before using the component, make sure the following files are prepared in a Cloud Storage bucket. -* A Beam Python code file. -* A `requirements.txt` file which includes a list of dependent packages. - -The Beam Python code should follow [Beam programing model](https://beam.apache.org/documentation/programming-guide/) and the following additional requirements to be compatible with this component: -* It accepts command line arguments: `--project`, `--temp_location`, `--staging_location`, which are [standard Dataflow Runner options](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#setting-other-cloud-pipeline-options). -* Enable info logging before the start of a Dataflow job in the Python code. This is important to allow the component to track the status and ID of create job. For example: calling `logging.getLogger().setLevel(logging.INFO)` before any other code. - The component does several things during the execution: -* Download `python_file_path` and `requirements_file_path` to local files. -* Start a subprocess to launch the Python program. -* Monitor the logs produced from the subprocess to extract Dataflow job information. -* Store Dataflow job information in `staging_dir` so the job can be resumed in case of failure. -* Wait for the job to finish. - -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +- Downloads `python_file_path` and `requirements_file_path` to local files. +- Starts a subprocess to launch the Python program. +- Monitors the logs produced from the subprocess to extract the Cloud Dataflow job information. +- Stores the Cloud Dataflow job information in `staging_dir` so the job can be resumed in case of failure. +- Waits for the job to finish. +The steps to use the component in a pipeline are: +1. Install the Kubeflow Pipelines SDK: @@ -70,17 +81,9 @@ dataflow_python_op = comp.load_component_from_url( help(dataflow_python_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_python.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_python/sample.ipynb) -* [Dataflow Python Quickstart](https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python) - ### Sample - -Note: the sample code below works in both IPython notebook or python code directly. - -In this sample, we run a wordcount sample code in a KFP pipeline. The output will be stored in a Cloud Storage bucket. Here is the sample code: +Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template. 
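Before the pipeline walkthrough below, the following minimal sketch illustrates the shape of a Beam program that satisfies the requirements listed under Input data schema (info logging enabled before the pipeline starts, and the standard Dataflow Runner options accepted on the command line). It is a hedged illustration only, not the wordcount code used in the sample; the input and output paths are placeholders.

```python
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    # Required by this component: enable info logging before the pipeline
    # starts so the component can extract the Dataflow job ID from the logs.
    logging.getLogger().setLevel(logging.INFO)

    # --project, --temp_location and --staging_location are passed in by the
    # component as standard Dataflow Runner options and parsed here.
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
         | 'Split' >> beam.FlatMap(lambda line: line.split())
         | 'Count' >> beam.combiners.Count.PerElement()
         | 'Format' >> beam.Map(lambda word_count: '%s: %d' % word_count)
         | 'Write' >> beam.io.WriteToText('gs://my-bucket/wordcount/output'))  # placeholder output path


if __name__ == '__main__':
    run()
```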
+In this sample, we run a wordcount sample code in a Kubeflow Pipeline. The output will be stored in a Cloud Storage bucket. Here is the sample code: ```python @@ -292,3 +295,12 @@ run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arg ```python !gsutil cat $OUTPUT_FILE ``` + +## References +* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_python.py) +* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_python/sample.ipynb) +* [Dataflow Python Quickstart](https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python) + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataflow/launch_python/sample.ipynb b/components/gcp/dataflow/launch_python/sample.ipynb index 93113512c4d..61a663439ec 100644 --- a/components/gcp/dataflow/launch_python/sample.ipynb +++ b/components/gcp/dataflow/launch_python/sample.ipynb @@ -4,56 +4,67 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Executing an Apache Beam Python job in Cloud Dataflow\n", - "A Kubeflow Pipeline component that submits an Apache Beam job (authored in Python) to Cloud Dataflow for execution. The Python Beam code is run with the Cloud Dataflow Runner.\n", + "# Name\n", + "Data preparation by executing an Apache Beam job in Cloud Dataflow\n", "\n", - "## Intended Use\n", - "Use this component to run a Python Beam code to submit a Dataflow job as a step of a KFP pipeline. The component will wait until the job finishes.\n", + "# Labels\n", + "GCP, Cloud Dataflow, Apache Beam, Python, Kubeflow\n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component that prepares data by submitting an Apache Beam job (authored in Python) to Cloud Dataflow for execution. The Python Beam code is run with Cloud Dataflow Runner.\n", + "\n", + "# Details\n", + "## Intended use\n", + "\n", + "Use this component to run a Python Beam code to submit a Cloud Dataflow job as a step of a Kubeflow pipeline. \n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "python_file_path | The Cloud Storage or the local path to the python file being run. | String | No |\n", - "project_id | The ID of the parent project of the Dataflow job. | GCPProjectID | No |\n", - "staging_dir | The Cloud Storage directory for keeping staging files. A random subdirectory will be created under the directory to keep job info for resuming the job in case of failure and it will be passed as `staging_location` and `temp_location` command line args of the beam code. | GCSPath | Yes | ` `\n", - "requirements_file_path | The Cloud Storageor the local path to the pip requirements file. | String | Yes | ` `\n", - "args | The list of arguments to pass to the python file. | List | Yes | `[]`\n", - "wait_interval | The seconds to wait between calls to get the job status. 
| Integer | Yes | `30`\n", + "Name | Description | Optional | Data type| Accepted values | Default |\n", + ":--- | :----------| :----------| :----------| :----------| :---------- |\n", + "python_file_path | The path to the Cloud Storage bucket or local directory containing the Python file to be run. | | GCSPath | | |\n", + "project_id | The ID of the Google Cloud Platform (GCP) project containing the Cloud Dataflow job.| | GCPProjectID | | |\n", + "staging_dir | The path to the Cloud Storage directory where the staging files are stored. A random subdirectory will be created under the staging directory to keep the job information.This is done so that you can resume the job in case of failure. `staging_dir` is passed as the command line arguments (`staging_location` and `temp_location`) of the Beam code. | Yes | GCPPath | | None |\n", + "requirements_file_path | The path to the Cloud Storage bucket or local directory containing the pip requirements file. | Yes | GCSPath | | None |\n", + "args | The list of arguments to pass to the Python file. | No | List | A list of string arguments | None |\n", + "wait_interval | The number of seconds to wait between calls to get the status of the job. | Yes | Integer | | 30 |\n", + "\n", + "## Input data schema\n", + "\n", + "Before you use the component, the following files must be ready in a Cloud Storage bucket:\n", + "- A Beam Python code file.\n", + "- A `requirements.txt` file which includes a list of dependent packages.\n", + "\n", + "The Beam Python code should follow the [Beam programming guide](https://beam.apache.org/documentation/programming-guide/) as well as the following additional requirements to be compatible with this component:\n", + "- It accepts the command line arguments `--project`, `--temp_location`, `--staging_location`, which are [standard Dataflow Runner options](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#setting-other-cloud-pipeline-options).\n", + "- It enables `info logging` before the start of a Cloud Dataflow job in the Python code. This is important to allow the component to track the status and ID of the job that is created. For example, calling `logging.getLogger().setLevel(logging.INFO)` before any other code.\n", + "\n", "\n", - "## Output:\n", - "Name | Description | Type\n", - ":--- | :---------- | :---\n", - "job_id | The id of the created dataflow job. | String\n", + "## Output\n", + "Name | Description\n", + ":--- | :----------\n", + "job_id | The id of the Cloud Dataflow job that is created.\n", "\n", - "## Cautions and requirements\n", + "## Cautions & requirements\n", "To use the components, the following requirements must be met:\n", - "* Dataflow API is enabled.\n", - "* The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a KFP cluster. For example:\n", + "- Cloud Dataflow API is enabled.\n", + "- The component is running under a secret Kubeflow user service account in a Kubeflow Pipeline cluster. 
For example:\n", "```\n", "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", "```\n", - "* The Kubeflow user service account is a member of `roles/dataflow.developer` role of the project.\n", - "* The Kubeflow user service account is a member of `roles/storage.objectViewer` role of the Cloud Storage Objects `python_file_path` and `requirements_file_path`.\n", - "* The Kubeflow user service account is a member of `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir`.\n", + "The Kubeflow user service account is a member of:\n", + "- `roles/dataflow.developer` role of the project.\n", + "- `roles/storage.objectViewer` role of the Cloud Storage Objects `python_file_path` and `requirements_file_path`.\n", + "- `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir`. \n", "\n", "## Detailed description\n", - "Before using the component, make sure the following files are prepared in a Cloud Storage bucket.\n", - "* A Beam Python code file.\n", - "* A `requirements.txt` file which includes a list of dependent packages.\n", - "\n", - "The Beam Python code should follow [Beam programing model](https://beam.apache.org/documentation/programming-guide/) and the following additional requirements to be compatible with this component:\n", - "* It accepts command line arguments: `--project`, `--temp_location`, `--staging_location`, which are [standard Dataflow Runner options](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#setting-other-cloud-pipeline-options).\n", - "* Enable info logging before the start of a Dataflow job in the Python code. This is important to allow the component to track the status and ID of create job. For example: calling `logging.getLogger().setLevel(logging.INFO)` before any other code.\n", - "\n", "The component does several things during the execution:\n", - "* Download `python_file_path` and `requirements_file_path` to local files.\n", - "* Start a subprocess to launch the Python program.\n", - "* Monitor the logs produced from the subprocess to extract Dataflow job information.\n", - "* Store Dataflow job information in `staging_dir` so the job can be resumed in case of failure.\n", - "* Wait for the job to finish.\n", - "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "- Downloads `python_file_path` and `requirements_file_path` to local files.\n", + "- Starts a subprocess to launch the Python program.\n", + "- Monitors the logs produced from the subprocess to extract the Cloud Dataflow job information.\n", + "- Stores the Cloud Dataflow job information in `staging_dir` so the job can be resumed in case of failure.\n", + "- Waits for the job to finish.\n", + "The steps to use the component in a pipeline are:\n", + "1. 
Install the Kubeflow Pipelines SDK:\n" ] }, { @@ -92,17 +103,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_python.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_python/sample.ipynb)\n", - "* [Dataflow Python Quickstart](https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python)\n", - "\n", "### Sample\n", - "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", - "\n", - "In this sample, we run a wordcount sample code in a KFP pipeline. The output will be stored in a Cloud Storage bucket. Here is the sample code:" + "Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n", + "In this sample, we run a wordcount sample code in a Kubeflow Pipeline. The output will be stored in a Cloud Storage bucket. Here is the sample code:" ] }, { @@ -377,6 +380,20 @@ "source": [ "!gsutil cat $OUTPUT_FILE" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_python.py)\n", + "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_python/sample.ipynb)\n", + "* [Dataflow Python Quickstart](https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." + ] } ], "metadata": { @@ -395,7 +412,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataflow/launch_template/README.md b/components/gcp/dataflow/launch_template/README.md index cf5240af1f2..d04adad6363 100644 --- a/components/gcp/dataflow/launch_template/README.md +++ b/components/gcp/dataflow/launch_template/README.md @@ -1,43 +1,55 @@ -# Submitting a job to Cloud Dataflow service using a template -A Kubeflow Pipeline component to submit a job from a dataflow template to Cloud Dataflow service. +# Name +Data preparation by using a template to submit a job to Cloud Dataflow -## Intended Use +# Labels +GCP, Cloud Dataflow, Kubeflow, Pipeline -A Kubeflow Pipeline component to submit a job from a dataflow template to Google Cloud Dataflow service. +# Summary +A Kubeflow Pipeline component to prepare data by using a template to submit a job to Cloud Dataflow. + +# Details + +## Intended use +Use this component when you have a pre-built Cloud Dataflow template and want to launch it as a step in a Kubeflow Pipeline. 
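As a quick orientation before the argument reference, here is a minimal, hypothetical sketch of the component used as a single pipeline step. The component URL is a placeholder for the `component.yaml` location shown in the sample further below, the project and bucket names are assumptions, and the argument names follow the runtime arguments table that follows.

```python
import json

import kfp.components as comp
import kfp.dsl as dsl
import kfp.gcp as gcp

# Placeholder: substitute the component.yaml URL used in the sample below.
COMPONENT_SPEC_URI = 'https://.../dataflow/launch_template/component.yaml'
dataflow_template_op = comp.load_component_from_url(COMPONENT_SPEC_URI)


@dsl.pipeline(name='Dataflow launch template sketch')
def launch_template_sketch(
        project_id='my-project',                              # assumed GCP project ID
        gcs_path='gs://dataflow-templates/latest/Word_Count',  # Google-provided template
        staging_dir='gs://my-bucket/staging'):                 # assumed staging bucket
    dataflow_template_op(
        project_id=project_id,
        gcs_path=gcs_path,
        # launch_parameters follows the LaunchTemplateParameters schema.
        launch_parameters=json.dumps({
            'parameters': {
                'inputFile': 'gs://dataflow-samples/shakespeare/kinglear.txt',  # assumed input
                'output': 'gs://my-bucket/wordcount/out'                        # assumed output prefix
            }
        }),
        staging_dir=staging_dir,
        wait_interval=30).apply(gcp.use_gcp_secret('user-gcp-sa'))
```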
## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The ID of the Cloud Platform project to which the job belongs. | GCPProjectID | No | -gcs_path | A Cloud Storage path to the job creation template. It must be a valid Cloud Storage URL beginning with `gs://`. | GCSPath | No | -launch_parameters | The parameters that are required for the template being launched. The Schema is defined in [LaunchTemplateParameters Parameters](https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters). | Dict | Yes | `{}` -location | The regional endpoint to which the job request is directed. | GCPRegion | Yes | `` -validate_only | If true, the request is validated but not actually executed. | Bool | Yes | `False` -staging_dir | The Cloud Storage path for keeping staging files. A random subdirectory will be created under the directory to keep job info for resuming the job in case of failure. | GCSPath | Yes | `` -wait_interval | The seconds to wait between calls to get the job status. | Integer | Yes |`30` - -## Output: -Name | Description | Type -:--- | :---------- | :--- -job_id | The id of the created dataflow job. | String - -## Cautions and requirements -To use the components, the following requirements must be met: -* Dataflow API is enabled. -* The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a KFP cluster. For example: -``` -component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) -``` -* The Kubeflow user service account is a member of `roles/dataflow.developer` role of the project. -* The Kubeflow user service account is a member of `roles/storage.objectViewer` role of the Cloud Storage Object `gcs_path`. -* The Kubeflow user service account is a member of `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir`. +Argument | Description | Optional | Data type | Accepted values | Default | +:--- | :---------- | :----------| :----------| :---------- | :----------| +project_id | The ID of the Google Cloud Platform (GCP) project to which the job belongs. | No | GCPProjectID | | | +gcs_path | The path to a Cloud Storage bucket containing the job creation template. It must be a valid Cloud Storage URL beginning with 'gs://'. | No | GCSPath | | | +launch_parameters | The parameters that are required to launch the template. The schema is defined in [LaunchTemplateParameters](https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters). The parameter `jobName` is replaced by a generated name. | Yes | Dict | A JSON object which has the same structure as [LaunchTemplateParameters](https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters) | None | +location | The regional endpoint to which the job request is directed.| Yes | GCPRegion | | None | +staging_dir | The path to the Cloud Storage directory where the staging files are stored. A random subdirectory will be created under the staging directory to keep the job information. This is done so that you can resume the job in case of failure.| Yes | GCSPath | | None | +validate_only | If True, the request is validated but not executed. | Yes | Boolean | | False | +wait_interval | The number of seconds to wait between calls to get the status of the job. | Yes | Integer | | 30 | + +## Input data schema + +The input `gcs_path` must contain a valid Cloud Dataflow template. 
The template can be created by following the instructions in [Creating Templates](https://cloud.google.com/dataflow/docs/guides/templates/creating-templates). You can also use [Google-provided templates](https://cloud.google.com/dataflow/docs/guides/templates/provided-templates). + +## Output +Name | Description +:--- | :---------- +job_id | The id of the Cloud Dataflow job that is created. + +## Caution & requirements + +To use the component, the following requirements must be met: +- Cloud Dataflow API is enabled. +- The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow Pipeline cluster. For example: + ``` + component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* The Kubeflow user service account is a member of: + - `roles/dataflow.developer` role of the project. + - `roles/storage.objectViewer` role of the Cloud Storage Object `gcs_path.` + - `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir.` ## Detailed description -The input `gcs_path` must contain a valid Dataflow template. The template can be created by following the guide [Creating Templates](https://cloud.google.com/dataflow/docs/guides/templates/creating-templates). Or, you can use [Google-provided templates](https://cloud.google.com/dataflow/docs/guides/templates/provided-templates). - -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +You can execute the template locally by following the instructions in [Executing Templates](https://cloud.google.com/dataflow/docs/guides/templates/executing-templates). See the sample code below to learn how to execute the template. +Follow these steps to use the component in a pipeline: +1. Install the Kubeflow Pipeline SDK: @@ -59,17 +71,10 @@ dataflow_template_op = comp.load_component_from_url( help(dataflow_template_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_template.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_template/sample.ipynb) -* [Cloud Dataflow Templates overview](https://cloud.google.com/dataflow/docs/guides/templates/overview) - ### Sample -Note: the sample code below works in both IPython notebook or python code directly. - -In this sample, we run a Google provided word count template from `gs://dataflow-templates/latest/Word_Count`. The template takes a text file as input and output word counts to a Cloud Storage bucket. Here is the sample input: +Note: The following sample code works in an IPython notebook or directly in Python code. +In this sample, we run a Google-provided word count template from `gs://dataflow-templates/latest/Word_Count`. The template takes a text file as input and outputs word counts to a Cloud Storage bucket. 
Here is the sample input: ```python @@ -159,3 +164,14 @@ run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arg ```python !gsutil cat $OUTPUT_PATH* ``` + +## References + +* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_template.py) +* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_template/sample.ipynb) +* [Cloud Dataflow Templates overview](https://cloud.google.com/dataflow/docs/guides/templates/overview) + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. + diff --git a/components/gcp/dataflow/launch_template/sample.ipynb b/components/gcp/dataflow/launch_template/sample.ipynb index 706d69549a6..ec313804895 100644 --- a/components/gcp/dataflow/launch_template/sample.ipynb +++ b/components/gcp/dataflow/launch_template/sample.ipynb @@ -4,45 +4,57 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a job to Cloud Dataflow service using a template\n", - "A Kubeflow Pipeline component to submit a job from a dataflow template to Cloud Dataflow service.\n", + "# Name\n", + "Data preparation by using a template to submit a job to Cloud Dataflow\n", "\n", - "## Intended Use\n", + "# Labels\n", + "GCP, Cloud Dataflow, Kubeflow, Pipeline\n", "\n", - "A Kubeflow Pipeline component to submit a job from a dataflow template to Google Cloud Dataflow service.\n", + "# Summary\n", + "A Kubeflow Pipeline component to prepare data by using a template to submit a job to Cloud Dataflow.\n", + "\n", + "# Details\n", + "\n", + "## Intended use\n", + "Use this component when you have a pre-built Cloud Dataflow template and want to launch it as a step in a Kubeflow Pipeline.\n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The ID of the Cloud Platform project to which the job belongs. | GCPProjectID | No |\n", - "gcs_path | A Cloud Storage path to the job creation template. It must be a valid Cloud Storage URL beginning with `gs://`. | GCSPath | No |\n", - "launch_parameters | The parameters that are required for the template being launched. The Schema is defined in [LaunchTemplateParameters Parameters](https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters). | Dict | Yes | `{}`\n", - "location | The regional endpoint to which the job request is directed. | GCPRegion | Yes | ``\n", - "validate_only | If true, the request is validated but not actually executed. | Bool | Yes | `False`\n", - "staging_dir | The Cloud Storage path for keeping staging files. A random subdirectory will be created under the directory to keep job info for resuming the job in case of failure. | GCSPath | Yes | ``\n", - "wait_interval | The seconds to wait between calls to get the job status. 
| Integer | Yes |`30`\n", + "Argument | Description | Optional | Data type | Accepted values | Default |\n", + ":--- | :---------- | :----------| :----------| :---------- | :----------|\n", + "project_id | The ID of the Google Cloud Platform (GCP) project to which the job belongs. | No | GCPProjectID | | |\n", + "gcs_path | The path to a Cloud Storage bucket containing the job creation template. It must be a valid Cloud Storage URL beginning with 'gs://'. | No | GCSPath | | |\n", + "launch_parameters | The parameters that are required to launch the template. The schema is defined in [LaunchTemplateParameters](https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters). The parameter `jobName` is replaced by a generated name. | Yes | Dict | A JSON object which has the same structure as [LaunchTemplateParameters](https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters) | None |\n", + "location | The regional endpoint to which the job request is directed.| Yes | GCPRegion | | None |\n", + "staging_dir | The path to the Cloud Storage directory where the staging files are stored. A random subdirectory will be created under the staging directory to keep the job information. This is done so that you can resume the job in case of failure.| Yes | GCSPath | | None |\n", + "validate_only | If True, the request is validated but not executed. | Yes | Boolean | | False |\n", + "wait_interval | The number of seconds to wait between calls to get the status of the job. | Yes | Integer | | 30 |\n", "\n", - "## Output:\n", - "Name | Description | Type\n", - ":--- | :---------- | :---\n", - "job_id | The id of the created dataflow job. | String\n", + "## Input data schema\n", "\n", - "## Cautions and requirements\n", - "To use the components, the following requirements must be met:\n", - "* Dataflow API is enabled.\n", - "* The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a KFP cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* The Kubeflow user service account is a member of `roles/dataflow.developer` role of the project.\n", - "* The Kubeflow user service account is a member of `roles/storage.objectViewer` role of the Cloud Storage Object `gcs_path`.\n", - "* The Kubeflow user service account is a member of `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir`.\n", + "The input `gcs_path` must contain a valid Cloud Dataflow template. The template can be created by following the instructions in [Creating Templates](https://cloud.google.com/dataflow/docs/guides/templates/creating-templates). You can also use [Google-provided templates](https://cloud.google.com/dataflow/docs/guides/templates/provided-templates).\n", "\n", - "## Detailed description\n", - "The input `gcs_path` must contain a valid Dataflow template. The template can be created by following the guide [Creating Templates](https://cloud.google.com/dataflow/docs/guides/templates/creating-templates). Or, you can use [Google-provided templates](https://cloud.google.com/dataflow/docs/guides/templates/provided-templates).\n", + "## Output\n", + "Name | Description\n", + ":--- | :----------\n", + "job_id | The id of the Cloud Dataflow job that is created.\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. 
Install KFP SDK\n" + "## Caution & requirements\n", + "\n", + "To use the component, the following requirements must be met:\n", + "- Cloud Dataflow API is enabled.\n", + "- The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow Pipeline cluster. For example:\n", + " ```\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* The Kubeflow user service account is a member of:\n", + " - `roles/dataflow.developer` role of the project.\n", + " - `roles/storage.objectViewer` role of the Cloud Storage Object `gcs_path.`\n", + " - `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir.` \n", + "\n", + "## Detailed description\n", + "You can execute the template locally by following the instructions in [Executing Templates](https://cloud.google.com/dataflow/docs/guides/templates/executing-templates). See the sample code below to learn how to execute the template.\n", + "Follow these steps to use the component in a pipeline:\n", + "1. Install the Kubeflow Pipeline SDK:\n" ] }, { @@ -81,17 +93,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_template.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_template/sample.ipynb)\n", - "* [Cloud Dataflow Templates overview](https://cloud.google.com/dataflow/docs/guides/templates/overview)\n", - "\n", "### Sample\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", - "\n", - "In this sample, we run a Google provided word count template from `gs://dataflow-templates/latest/Word_Count`. The template takes a text file as input and output word counts to a Cloud Storage bucket. Here is the sample input:" + "Note: The following sample code works in an IPython notebook or directly in Python code.\n", + "In this sample, we run a Google-provided word count template from `gs://dataflow-templates/latest/Word_Count`. The template takes a text file as input and outputs word counts to a Cloud Storage bucket. Here is the sample input:" ] }, { @@ -239,6 +244,21 @@ "source": [ "!gsutil cat $OUTPUT_PATH*" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "\n", + "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_template.py)\n", + "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_template/sample.ipynb)\n", + "* [Cloud Dataflow Templates overview](https://cloud.google.com/dataflow/docs/guides/templates/overview)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). 
To the extent of a direct conflict of terms, the AI Hub Terms of Service will control.\n" + ] } ], "metadata": { @@ -257,7 +277,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/create_cluster/README.md b/components/gcp/dataproc/create_cluster/README.md index 1945c524085..2ffedc57163 100644 --- a/components/gcp/dataproc/create_cluster/README.md +++ b/components/gcp/dataproc/create_cluster/README.md @@ -1,44 +1,62 @@ -# Creating a Cluster with Cloud Dataproc -A Kubeflow Pipeline component to create a cluster in Cloud Dataproc service. +# Name +Data processing by creating a cluster in Cloud Dataproc -## Intended Use -This component can be used at the start of a KFP pipeline to create a temporary Dataproc cluster to run Dataproc jobs as subsequent steps in the pipeline. The cluster can be later recycled by the [Dataproc delete cluster component](https://github.com/kubeflow/pipelines/tree/master/components/gcp/dataproc/delete_cluster). +# Label +Cloud Dataproc, cluster, GCP, Cloud Storage, KubeFlow, Pipeline + + +# Summary +A Kubeflow Pipeline component to create a cluster in Cloud Dataproc. + +# Details +## Intended use + +Use this component at the start of a Kubeflow Pipeline to create a temporary Cloud Dataproc cluster to run Cloud Dataproc jobs as steps in the pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No | -region | The Cloud Dataproc region runs the newly created cluster. | GCPRegion | No | -name | The name of the newly created cluster. Cluster names within a project must be unique. Names of deleted clusters can be reused. | String | Yes | ` ` -name_prefix | The prefix of the cluster name. | String | Yes | ` ` -initialization_actions | List of Cloud Storage URIs of executables to execute on each node after the configuration is completed. By default, executables are run on the master and all the worker nodes. | List | Yes | `[]` -config_bucket | A Cloud Storage bucket used to stage the job dependencies, the configuration files, and the job driver console’s output. | GCSPath | Yes | ` ` -image_version | The version of the software inside the cluster. | String | Yes | ` ` -cluster | The full [cluster config] (https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#Cluster). | Dict | Yes | `{}` -wait_interval | The number of seconds to pause between polling the operation done status. | Integer | Yes | `30` + +| Argument | Description | Optional | Data type | Accepted values | Default | +|----------|-------------|----------|-----------|-----------------|---------| +| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectID | | | +| region | The Cloud Dataproc region to create the cluster in. | No | GCPRegion | | | +| name | The name of the cluster. Cluster names within a project must be unique. You can reuse the names of deleted clusters. | Yes | String | | None | +| name_prefix | The prefix of the cluster name. | Yes | String | | None | +| initialization_actions | A list of Cloud Storage URIs identifying executables to execute on each node after the configuration is completed. By default, executables are run on the master and all the worker nodes. 
| Yes | List | | None | +| config_bucket | The Cloud Storage bucket to use to stage the job dependencies, the configuration files, and the job driver console’s output. | Yes | GCSPath | | None | +| image_version | The version of the software inside the cluster. | Yes | String | | None | +| cluster | The full [cluster configuration](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#Cluster). | Yes | Dict | | None | +| wait_interval | The number of seconds to pause before polling the operation. | Yes | Integer | | 30 | ## Output Name | Description | Type :--- | :---------- | :--- -cluster_name | The cluster name of the created cluster. | String +cluster_name | The name of the cluster. | String + +Note: You can recycle the cluster by using the [Dataproc delete cluster component](https://github.com/kubeflow/pipelines/tree/master/components/gcp/dataproc/delete_cluster). + ## Cautions & requirements -To use the component, you must: -* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -``` -component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) -``` -* Grant Kubeflow user service account the read access to the Cloud Storage buckets which contains initialization action files. -* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project. -## Detailed Description -This component creates a new Dataproc cluster by using [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create). +To use the component, you must: +* Set up the GCP project by following these [steps](https://cloud.google.com/dataproc/docs/guides/setup-project). +* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -Here are the steps to use the component in a pipeline: -1. Install KFP SDK + ``` + component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* Grant the following types of access to the Kubeflow user service account: + * Read access to the Cloud Storage buckets which contains initialization action files. + * The role, `roles/dataproc.editor` on the project. + +## Detailed description + +This component creates a new Dataproc cluster by using the [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create). + +Follow these steps to use the component in a pipeline: + +1. 
Install the Kubeflow Pipeline SDK: @@ -60,16 +78,8 @@ dataproc_create_cluster_op = comp.load_component_from_url( help(dataproc_create_cluster_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_create_cluster.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/create_cluster/sample.ipynb) -* [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create) - - ### Sample - -Note: the sample code below works in both IPython notebook or python code directly. +Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template. #### Set sample parameters @@ -142,3 +152,13 @@ experiment = client.create_experiment(EXPERIMENT_NAME) run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` + +## References +* [Kubernetes Engine for Kubeflow](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) +* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_create_cluster.py) +* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/create_cluster/sample.ipynb) +* [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create) + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataproc/create_cluster/sample.ipynb b/components/gcp/dataproc/create_cluster/sample.ipynb index 1c9a000406d..16a7dd8c60b 100644 --- a/components/gcp/dataproc/create_cluster/sample.ipynb +++ b/components/gcp/dataproc/create_cluster/sample.ipynb @@ -4,46 +4,64 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Creating a Cluster with Cloud Dataproc\n", - "A Kubeflow Pipeline component to create a cluster in Cloud Dataproc service.\n", + "# Name\n", + "Data processing by creating a cluster in Cloud Dataproc\n", "\n", - "## Intended Use\n", - "This component can be used at the start of a KFP pipeline to create a temporary Dataproc cluster to run Dataproc jobs as subsequent steps in the pipeline. 
The cluster can be later recycled by the [Dataproc delete cluster component](https://github.com/kubeflow/pipelines/tree/master/components/gcp/dataproc/delete_cluster).\n", "\n", + "# Label\n", + "Cloud Dataproc, cluster, GCP, Cloud Storage, KubeFlow, Pipeline\n", + "\n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component to create a cluster in Cloud Dataproc.\n", + "\n", + "# Details\n", + "## Intended use\n", + "\n", + "Use this component at the start of a Kubeflow Pipeline to create a temporary Cloud Dataproc cluster to run Cloud Dataproc jobs as steps in the pipeline.\n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n", - "region | The Cloud Dataproc region runs the newly created cluster. | GCPRegion | No |\n", - "name | The name of the newly created cluster. Cluster names within a project must be unique. Names of deleted clusters can be reused. | String | Yes | ` `\n", - "name_prefix | The prefix of the cluster name. | String | Yes | ` `\n", - "initialization_actions | List of Cloud Storage URIs of executables to execute on each node after the configuration is completed. By default, executables are run on the master and all the worker nodes. | List | Yes | `[]`\n", - "config_bucket | A Cloud Storage bucket used to stage the job dependencies, the configuration files, and the job driver console’s output. | GCSPath | Yes | ` `\n", - "image_version | The version of the software inside the cluster. | String | Yes | ` `\n", - "cluster | The full [cluster config] (https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#Cluster). | Dict | Yes | `{}`\n", - "wait_interval | The number of seconds to pause between polling the operation done status. | Integer | Yes | `30`\n", + "\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|----------|-------------|----------|-----------|-----------------|---------|\n", + "| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectID | | |\n", + "| region | The Cloud Dataproc region to create the cluster in. | No | GCPRegion | | |\n", + "| name | The name of the cluster. Cluster names within a project must be unique. You can reuse the names of deleted clusters. | Yes | String | | None |\n", + "| name_prefix | The prefix of the cluster name. | Yes | String | | None |\n", + "| initialization_actions | A list of Cloud Storage URIs identifying executables to execute on each node after the configuration is completed. By default, executables are run on the master and all the worker nodes. | Yes | List | | None |\n", + "| config_bucket | The Cloud Storage bucket to use to stage the job dependencies, the configuration files, and the job driver console’s output. | Yes | GCSPath | | None |\n", + "| image_version | The version of the software inside the cluster. | Yes | String | | None |\n", + "| cluster | The full [cluster configuration](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#Cluster). | Yes | Dict | | None |\n", + "| wait_interval | The number of seconds to pause before polling the operation. | Yes | Integer | | 30 |\n", "\n", "## Output\n", "Name | Description | Type\n", ":--- | :---------- | :---\n", - "cluster_name | The cluster name of the created cluster. 
| String\n", + "cluster_name | The name of the cluster. | String\n", + "\n", + "Note: You can recycle the cluster by using the [Dataproc delete cluster component](https://github.com/kubeflow/pipelines/tree/master/components/gcp/dataproc/delete_cluster).\n", + "\n", "\n", "## Cautions & requirements\n", - "To use the component, you must:\n", - "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* Grant Kubeflow user service account the read access to the Cloud Storage buckets which contains initialization action files.\n", - "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n", - "\n", - "## Detailed Description\n", - "This component creates a new Dataproc cluster by using [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create).\n", - "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "\n", + "To use the component, you must:\n", + "* Set up the GCP project by following these [steps](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", + "\n", + " ```\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* Grant the following types of access to the Kubeflow user service account:\n", + " * Read access to the Cloud Storage buckets which contains initialization action files.\n", + " * The role, `roles/dataproc.editor` on the project.\n", + "\n", + "## Detailed description\n", + "\n", + "This component creates a new Dataproc cluster by using the [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create). \n", + "\n", + "Follow these steps to use the component in a pipeline:\n", + "\n", + "1. Install the Kubeflow Pipeline SDK:\n" ] }, { @@ -82,22 +100,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_create_cluster.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/create_cluster/sample.ipynb)\n", - "* [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create)\n", - "\n", - "\n", "### Sample\n", + "Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ "#### Set sample parameters" ] }, @@ -205,6 +210,21 @@ "run_name = pipeline_func.__name__ + ' run'\n", "run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "* [Kubernetes Engine for Kubeflow](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts)\n", + "* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_create_cluster.py)\n", + "* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/create_cluster/sample.ipynb)\n", + "* [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." + ] } ], "metadata": { @@ -223,7 +243,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/delete_cluster/README.md b/components/gcp/dataproc/delete_cluster/README.md index fb2be0b9722..5cb238c607f 100644 --- a/components/gcp/dataproc/delete_cluster/README.md +++ b/components/gcp/dataproc/delete_cluster/README.md @@ -1,33 +1,43 @@
-# Deleting a Cluster with Cloud Dataproc
-A Kubeflow Pipeline component to delete a cluster in Cloud Dataproc service.
+# Name
+
+Data preparation by deleting a cluster in Cloud Dataproc
+
+# Label
+Cloud Dataproc, cluster, GCP, Cloud Storage, Kubeflow, Pipeline
+
+
+# Summary
+A Kubeflow Pipeline component to delete a cluster in Cloud Dataproc.
+
+## Intended use
+Use this component at the end of a Kubeflow Pipeline to delete the temporary Cloud Dataproc cluster that was created to run Cloud Dataproc jobs as steps in the pipeline. This component is usually used with an [exit handler](https://github.com/kubeflow/pipelines/blob/master/samples/basic/exit_handler.py) so that it runs even if earlier steps in the pipeline fail.

-## Intended Use
-Use the component to recycle a Dataproc cluster as one of the step in a KFP pipeline. This component is usually used with an [exit handler](https://github.com/kubeflow/pipelines/blob/master/samples/basic/exit_handler.py) to run at the end of a pipeline.

 ## Runtime arguments
-Name | Description | Type | Optional | Default
-:--- | :---------- | :--- | :------- | :------
-project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |
-region | The Cloud Dataproc region runs the cluster to delete. | GCPRegion | No |
-name | The cluster name to delete. | String | No |
-wait_interval | The number of seconds to pause between polling the delete operation done status. | Integer | Yes | `30`
+| Argument | Description | Optional | Data type | Accepted values | Default |
+|----------|-------------|----------|-----------|-----------------|---------|
+| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. 
| No | GCPProjectID | | | +| region | The Cloud Dataproc region in which to handle the request. | No | GCPRegion | | | +| name | The name of the cluster to delete. | No | String | | | +| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 | + ## Cautions & requirements To use the component, you must: -* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -``` -component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) -``` -* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project. +* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). +* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -## Detailed Description -This component deletes a Dataproc cluster by using [Dataproc delete cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete). + ``` + component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project. -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +## Detailed description +This component deletes a Dataproc cluster by using [Dataproc delete cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete). +Follow these steps to use the component in a pipeline: +1. Install the Kubeflow Pipeline SDK: ```python @@ -48,20 +58,13 @@ dataproc_delete_cluster_op = comp.load_component_from_url( help(dataproc_delete_cluster_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_delete_cluster.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/delete_cluster/sample.ipynb) -* [Dataproc delete cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete) - - ### Sample -Note: the sample code below works in both IPython notebook or python code directly. +Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template. #### Prerequisites -Before running the sample code, you need to [create a Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). +[Create a Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) before running the sample code. 
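#### Use the component with an exit handler

The delete step is usually registered as an exit handler so that the temporary cluster is removed even when an upstream step fails. The sketch below illustrates that wiring; it assumes `dataproc_delete_cluster_op` has been loaded with `load_component_from_url` as shown above, and the project ID, region, cluster name, and container image are placeholder values.

```python
import kfp.dsl as dsl
import kfp.gcp as gcp


@dsl.pipeline(
    name='Dataproc delete cluster via exit handler',
    description='Deletes a temporary Cloud Dataproc cluster when the pipeline finishes.'
)
def delete_cluster_pipeline(
    project_id='my-project-id',    # placeholder GCP project ID
    region='us-central1',          # placeholder Cloud Dataproc region
    name='my-temporary-cluster'    # placeholder name of the cluster to delete
):
    # Define the delete step first so that it can be registered as the exit op.
    delete_cluster_task = dataproc_delete_cluster_op(
        project_id=project_id,
        region=region,
        name=name
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))

    with dsl.ExitHandler(delete_cluster_task):
        # Steps placed inside the exit handler run first; the delete step
        # runs after they complete, even if one of them fails.
        dsl.ContainerOp(
            name='placeholder-job-step',
            image='library/bash:4.4.23',
            command=['echo', 'Dataproc jobs that use the temporary cluster run here'])
```

Compile and run this function in the same way as the sample pipeline in the sections that follow.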
#### Set sample parameters @@ -122,3 +125,14 @@ experiment = client.create_experiment(EXPERIMENT_NAME) run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` + +## References + +* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_delete_cluster.py) +* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/delete_cluster/sample.ipynb) +* [Dataproc delete cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete) + + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataproc/delete_cluster/sample.ipynb b/components/gcp/dataproc/delete_cluster/sample.ipynb index 15ad51550e9..d0de6367956 100644 --- a/components/gcp/dataproc/delete_cluster/sample.ipynb +++ b/components/gcp/dataproc/delete_cluster/sample.ipynb @@ -4,34 +4,45 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Deleting a Cluster with Cloud Dataproc\n", - "A Kubeflow Pipeline component to delete a cluster in Cloud Dataproc service.\n", + "# Name\n", + "\n", + "Data preparation by deleting a cluster in Cloud Dataproc\n", + "\n", + "# Label\n", + "Cloud Dataproc, cluster, GCP, Cloud Storage, Kubeflow, Pipeline\n", + "\n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component to delete a cluster in Cloud Dataproc.\n", + "\n", + "## Intended use\n", + "Use this component at the start of a Kubeflow Pipeline to delete a temporary Cloud Dataproc cluster to run Cloud Dataproc jobs as steps in the pipeline. This component is usually used with an [exit handler](https://github.com/kubeflow/pipelines/blob/master/samples/basic/exit_handler.py) to run at the end of a pipeline.\n", "\n", - "## Intended Use\n", - "Use the component to recycle a Dataproc cluster as one of the step in a KFP pipeline. This component is usually used with an [exit handler](https://github.com/kubeflow/pipelines/blob/master/samples/basic/exit_handler.py) to run at the end of a pipeline.\n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n", - "region | The Cloud Dataproc region runs the cluster to delete. | GCPRegion | No |\n", - "name | The cluster name to delete. | String | No |\n", - "wait_interval | The number of seconds to pause between polling the delete operation done status. | Integer | Yes | `30`\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|----------|-------------|----------|-----------|-----------------|---------|\n", + "| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectID | | |\n", + "| region | The Cloud Dataproc region in which to handle the request. | No | GCPRegion | | |\n", + "| name | The name of the cluster to delete. 
| No | String | | |\n", + "| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 |\n", + "\n", "\n", "## Cautions & requirements\n", "To use the component, you must:\n", - "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n", - "\n", - "## Detailed Description\n", + "* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", + "\n", + " ```\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.\n", + "\n", + "## Detailed description\n", "This component deletes a Dataproc cluster by using [Dataproc delete cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete).\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "Follow these steps to use the component in a pipeline:\n", + "1. Install the Kubeflow Pipeline SDK:" ] }, { @@ -70,31 +81,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_delete_cluster.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/delete_cluster/sample.ipynb)\n", - "* [Dataproc delete cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete)\n", - "\n", - "\n", "### Sample\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ + "Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n", + "\n", "#### Prerequisites\n", "\n", - "Before running the sample code, you need to [create a Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster)." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ + "[Create a Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) before running the sample code.\n", + "\n", "#### Set sample parameters" ] }, @@ -190,6 +184,22 @@ "run_name = pipeline_func.__name__ + ' run'\n", "run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "\n", + "* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_delete_cluster.py)\n", + "* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/delete_cluster/sample.ipynb)\n", + "* [Dataproc delete cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete)\n", + "\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." + ] } ], "metadata": { @@ -208,7 +218,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/submit_hadoop_job/README.md b/components/gcp/dataproc/submit_hadoop_job/README.md index 1d5bf42ff88..d1ae5d3c975 100644 --- a/components/gcp/dataproc/submit_hadoop_job/README.md +++ b/components/gcp/dataproc/submit_hadoop_job/README.md @@ -1,22 +1,36 @@ -# Submitting a Hadoop Job to Cloud Dataproc -A Kubeflow Pipeline component to submit an Apache Hadoop MapReduce job on Apache Hadoop YARN in Google Cloud Dataproc service. +# Name +Data preparation using Hadoop MapReduce on YARN with Cloud Dataproc -## Intended Use -Use the component to run an Apache Hadoop MapReduce job as one preprocessing step in a KFP pipeline. +# Label +Cloud Dataproc, GCP, Cloud Storage, Hadoop, YARN, Apache, MapReduce + + +# Summary +A Kubeflow Pipeline component to prepare data by submitting an Apache Hadoop MapReduce job on Apache Hadoop YARN to Cloud Dataproc. + +# Details +## Intended use +Use the component to run an Apache Hadoop MapReduce job as one preprocessing step in a Kubeflow Pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No | -region | The Dataproc region that handles the request. | GCPRegion | No | -cluster_name | The name of the cluster that runs the job. | String | No | -main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the JAR file containing the main class to execute. Examples: `gs://foo-bucket/analytics-binaries/extract-useful-metrics-mr.jar` `hdfs:/tmp/test-samples/custom-wordcount.jar` `file:///home/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar` | GCSPath | No | -main_class | The name of the driver's main class. The JARfile that contains the class must be in the default CLASSPATH or specified in `hadoop_job.jarFileUris`. | String | No | -args | The arguments to pass to the driver. 
Do not include arguments, such as -libjars or -Dfoo=bar, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | Yes | List | `[]`
-hadoop_job | The payload of a [HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob). | Dict | Yes | `{}`
-job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}`
-wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30`
+| Argument | Description | Optional | Data type | Accepted values | Default |
+|----------|-------------|----------|-----------|-----------------|---------|
+| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectID | | |
+| region | The Dataproc region to handle the request. | No | GCPRegion | | |
+| cluster_name | The name of the cluster to run the job. | No | String | | |
+| main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the JAR file containing the main class to execute. | No | GCSPath | | |
+| main_class | The name of the driver's main class. The JAR file that contains the class must be either in the default CLASSPATH or specified in `hadoop_job.jarFileUris`. | No | String | | |
+| args | The arguments to pass to the driver. Do not include arguments, such as -libjars or -Dfoo=bar, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | Yes | List | | None |
+| hadoop_job | The payload of a [HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob). | Yes | Dict | | None |
+| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None |
+| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 |
+
+Note: Examples of `main_jar_file_uri` values:
+- `gs://foo-bucket/analytics-binaries/extract-useful-metrics-mr.jar`
+- `hdfs:/tmp/test-samples/custom-wordcount.jar`
+- `file:///home/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar`
+
+A minimal invocation that uses these arguments is sketched below, after the Cautions & requirements section.
+
 ## Output
 Name | Description | Type
@@ -25,19 +39,22 @@ job_id | The ID of the created job. | String
 ## Cautions & requirements
 To use the component, you must:
-* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).
-* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).
-* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:
-```
-component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))
-```
-* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.
+* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).
+* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).
+* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:
+
+  ```python
+  component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))
+  ```
+* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project. 
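As a minimal illustration of the arguments above, the following sketch wires the component into a one-step pipeline that runs the prebuilt WordCount example (`org.apache.hadoop.examples.WordCount`) on an existing cluster, the same job used in the sample later in this document. It assumes that `dataproc_submit_hadoop_job_op` has been loaded with `load_component_from_url` as described in the steps that follow; the project ID, region, cluster name, and Cloud Storage paths are placeholder values, and the list-typed `args` input is passed here as a JSON-encoded string (adjust this if your component version accepts a plain list).

```python
import json

import kfp.dsl as dsl
import kfp.gcp as gcp

# Placeholder values -- replace these with your own project, region, cluster, and bucket.
PROJECT_ID = 'my-project-id'
REGION = 'us-central1'
CLUSTER_NAME = 'my-existing-cluster'
INPUT_GCS_PATH = 'gs://my-bucket/wordcount/input.txt'
OUTPUT_GCS_PATH = 'gs://my-bucket/wordcount/output/'


@dsl.pipeline(
    name='Dataproc submit Hadoop job',
    description='Runs the prebuilt WordCount example on an existing Cloud Dataproc cluster.'
)
def wordcount_pipeline():
    dataproc_submit_hadoop_job_op(
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        # The WordCount class ships with the cluster image, so no JAR file is passed here.
        main_jar_file_uri='',
        main_class='org.apache.hadoop.examples.WordCount',
        # WordCount takes the input path and the output folder as its two arguments.
        args=json.dumps([INPUT_GCS_PATH, OUTPUT_GCS_PATH]),
        wait_interval=30
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))
```

As in the sample, the output folder must be empty (or removed) before the job runs.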
+ +## Detailed description -## Detailed Description This component creates a Hadoop job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit). -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +Follow these steps to use the component in a pipeline: + +1. Install the Kubeflow Pipeline SDK: @@ -59,28 +76,23 @@ dataproc_submit_hadoop_job_op = comp.load_component_from_url( help(dataproc_submit_hadoop_job_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hadoop_job.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hadoop_job/sample.ipynb) -* [Dataproc HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob) +## Sample +Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template. -### Sample -Note: the sample code below works in both IPython notebook or python code directly. - -#### Setup a Dataproc cluster +### Setup a Dataproc cluster [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code. -#### Prepare Hadoop job -Upload your Hadoop jar file to a Cloud Storage (GCS) bucket. In the sample, we will use a jar file that is pre-installed in the main cluster, so there is no need to provide the `main_jar_file_uri`. We only set `main_class` to be `org.apache.hadoop.examples.WordCount`. +### Prepare a Hadoop job +Upload your Hadoop JAR file to a Cloud Storage bucket. In the sample, we will use a JAR file that is preinstalled in the main cluster, so there is no need to provide `main_jar_file_uri`. Here is the [WordCount example source code](https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/WordCount.java). -To package a self-contained Hadoop MapReduceapplication from the source code, follow the [MapReduce Tutorial](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html). +To package a self-contained Hadoop MapReduce application from the source code, follow the [MapReduce Tutorial](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html). + -#### Set sample parameters +### Set sample parameters ```python @@ -101,12 +113,10 @@ The input file is a simple text file: !gsutil cat $INTPUT_GCS_PATH ``` -#### Clean up existing output files (Optional) +### Clean up the existing output files (optional) +This is needed because the sample code requires the output folder to be a clean folder. To continue to run the sample, make sure that the service account of the notebook server has access to the `OUTPUT_GCS_PATH`. -This is needed because the sample code requires the output folder to be a clean folder. -To continue to run the sample, make sure that the service account of the notebook server has access to the `OUTPUT_GCS_PATH`. - -**CAUTION**: This will remove all blob files under `OUTPUT_GCS_PATH`. +CAUTION: This will remove all blob files under `OUTPUT_GCS_PATH`. 
```python @@ -177,10 +187,19 @@ run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` -#### Inspect the outputs -The sample in the notebook will count the words in the input text and save them in sharded files. Here is the command to inspect them: +### Inspect the output +The sample in the notebook will count the words in the input text and save them in sharded files. The command to inspect the output is: ```python !gsutil cat $OUTPUT_GCS_PATH/* ``` + +## References +* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hadoop_job.py) +* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hadoop_job/sample.ipynb) +* [Dataproc HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob) + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataproc/submit_hadoop_job/sample.ipynb b/components/gcp/dataproc/submit_hadoop_job/sample.ipynb index 6fa0f822be7..dc4b1230ebe 100644 --- a/components/gcp/dataproc/submit_hadoop_job/sample.ipynb +++ b/components/gcp/dataproc/submit_hadoop_job/sample.ipynb @@ -4,24 +4,38 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a Hadoop Job to Cloud Dataproc\n", - "A Kubeflow Pipeline component to submit an Apache Hadoop MapReduce job on Apache Hadoop YARN in Google Cloud Dataproc service.\n", + "# Name\n", + "Data preparation using Hadoop MapReduce on YARN with Cloud Dataproc\n", "\n", - "## Intended Use\n", - "Use the component to run an Apache Hadoop MapReduce job as one preprocessing step in a KFP pipeline. \n", + "# Label\n", + "Cloud Dataproc, GCP, Cloud Storage, Hadoop, YARN, Apache, MapReduce\n", + "\n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component to prepare data by submitting an Apache Hadoop MapReduce job on Apache Hadoop YARN to Cloud Dataproc.\n", + "\n", + "# Details\n", + "## Intended use\n", + "Use the component to run an Apache Hadoop MapReduce job as one preprocessing step in a Kubeflow Pipeline. \n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n", - "region | The Dataproc region that handles the request. | GCPRegion | No |\n", - "cluster_name | The name of the cluster that runs the job. | String | No |\n", - "main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the JAR file containing the main class to execute. Examples: `gs://foo-bucket/analytics-binaries/extract-useful-metrics-mr.jar` `hdfs:/tmp/test-samples/custom-wordcount.jar` `file:///home/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar` | GCSPath | No |\n", - "main_class | The name of the driver's main class. The JARfile that contains the class must be in the default CLASSPATH or specified in `hadoop_job.jarFileUris`. | String | No |\n", - "args | The arguments to pass to the driver. 
Do not include arguments, such as -libjars or -Dfoo=bar, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | Yes | List | `[]`\n", - "hadoop_job | The payload of a [HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob). | Dict | Yes | `{}`\n", - "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}`\n", - "wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30`\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|----------|-------------|----------|-----------|-----------------|---------|\n", + "| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectID | | |\n", + "| region | The Dataproc region to handle the request. | No | GCPRegion | | |\n", + "| cluster_name | The name of the cluster to run the job. | No | String | | |\n", + "| main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the JAR file containing the main class to execute. | No | List | | |\n", + "| main_class | The name of the driver's main class. The JAR file that contains the class must be either in the default CLASSPATH or specified in `hadoop_job.jarFileUris`. | No | String | | |\n", + "| args | The arguments to pass to the driver. Do not include arguments, such as -libjars or -Dfoo=bar, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | Yes | List | | None |\n", + "| hadoop_job | The payload of a [HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob). | Yes | Dict | | None |\n", + "| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None |\n", + "| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 |\n", + "\n", + "Note: \n", + "`main_jar_file_uri`: The examples for the files are : \n", + "- `gs://foo-bucket/analytics-binaries/extract-useful-metrics-mr.jar` \n", + "- `hdfs:/tmp/test-samples/custom-wordcount.jarfile:///home/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar`\n", + "\n", "\n", "## Output\n", "Name | Description | Type\n", @@ -30,19 +44,22 @@ "\n", "## Cautions & requirements\n", "To use the component, you must:\n", - "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", - "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n", + "* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", + "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. 
For example:\n", + "\n", + " ```python\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.\n", + "\n", + "## Detailed description\n", "\n", - "## Detailed Description\n", "This component creates a Hadoop job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "Follow these steps to use the component in a pipeline:\n", + "\n", + "1. Install the Kubeflow Pipeline SDK:\n" ] }, { @@ -81,33 +98,23 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hadoop_job.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hadoop_job/sample.ipynb)\n", - "* [Dataproc HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob)\n", - "\n", - "### Sample\n", + "## Sample\n", + "Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", "\n", - "#### Setup a Dataproc cluster\n", + "### Setup a Dataproc cluster\n", "[Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code.\n", "\n", "\n", - "#### Prepare Hadoop job\n", - "Upload your Hadoop jar file to a Cloud Storage (GCS) bucket. In the sample, we will use a jar file that is pre-installed in the main cluster, so there is no need to provide the `main_jar_file_uri`. We only set `main_class` to be `org.apache.hadoop.examples.WordCount`.\n", + "### Prepare a Hadoop job\n", + "Upload your Hadoop JAR file to a Cloud Storage bucket. In the sample, we will use a JAR file that is preinstalled in the main cluster, so there is no need to provide `main_jar_file_uri`. \n", "\n", "Here is the [WordCount example source code](https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/WordCount.java).\n", "\n", - "To package a self-contained Hadoop MapReduceapplication from the source code, follow the [MapReduce Tutorial](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html)." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Set sample parameters" + "To package a self-contained Hadoop MapReduce application from the source code, follow the [MapReduce Tutorial](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html).\n", + "\n", + "\n", + "### Set sample parameters" ] }, { @@ -150,12 +157,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### Clean up existing output files (Optional)\n", - "\n", - "This is needed because the sample code requires the output folder to be a clean folder.\n", - "To continue to run the sample, make sure that the service account of the notebook server has access to the `OUTPUT_GCS_PATH`.\n", + "### Clean up the existing output files (optional)\n", + "This is needed because the sample code requires the output folder to be a clean folder. To continue to run the sample, make sure that the service account of the notebook server has access to the `OUTPUT_GCS_PATH`.\n", "\n", - "**CAUTION**: This will remove all blob files under `OUTPUT_GCS_PATH`." + "CAUTION: This will remove all blob files under `OUTPUT_GCS_PATH`." ] }, { @@ -262,8 +267,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### Inspect the outputs\n", - "The sample in the notebook will count the words in the input text and save them in sharded files. Here is the command to inspect them:" + "### Inspect the output\n", + "The sample in the notebook will count the words in the input text and save them in sharded files. The command to inspect the output is:" ] }, { @@ -274,6 +279,20 @@ "source": [ "!gsutil cat $OUTPUT_GCS_PATH/*" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hadoop_job.py)\n", + "* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hadoop_job/sample.ipynb)\n", + "* [Dataproc HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." + ] } ], "metadata": { @@ -292,7 +311,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/submit_hive_job/README.md b/components/gcp/dataproc/submit_hive_job/README.md index 8cd1d0b01c9..f73bc257f1c 100644 --- a/components/gcp/dataproc/submit_hive_job/README.md +++ b/components/gcp/dataproc/submit_hive_job/README.md @@ -1,22 +1,29 @@ -# Submitting a Hive Job to Cloud Dataproc -A Kubeflow Pipeline component to submit a Hive job to Google Cloud Dataproc service. +# Name +Data preparation using Apache Hive on YARN with Cloud Dataproc -## Intended Use -Use the component to run an Apache Hive job as one preprocessing step in a KFP pipeline. 
+# Label +Cloud Dataproc, GCP, Cloud Storage, YARN, Hive, Apache + +# Summary +A Kubeflow Pipeline component to prepare data by submitting an Apache Hive job on YARN to Cloud Dataproc. + +# Details +## Intended use +Use the component to run an Apache Hive job as one preprocessing step in a Kubeflow Pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No | -region | The Dataproc region that handles the request. | GCPRegion | No | -cluster_name | The name of the cluster that runs the job. | String | No | -queries | The queries to execute. You do not need to terminate a query with a semicolon. Multiple queries can be specified in one string by separating each with a semicolon. | List | Yes | `[]` -query_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the script that contains Hive queries. | GCSPath | Yes | ` ` -script_variables | Mapping of query variable names to values (equivalent to the Hive command: SET name="value";). | List | Yes | `[]` -hive_job | The payload of a [HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob). | Dict | Yes | `{}` -job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}` -wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30` +| Argument | Description | Optional | Data type | Accepted values | Default | +|----------|-------------|----------|-----------|-----------------|---------| +| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectId | | | +| region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | | +| cluster_name | The name of the cluster to run the job. | No | String | | | +| queries | The queries to execute the Hive job. Specify multiple queries in one string by separating them with semicolons. You do not need to terminate queries with semicolons. | Yes | List | | None | +| query_file_uri | The HCFS URI of the script that contains the Hive queries. | Yes | GCPPath | | None | +| script_variables | Mapping of the query’s variable names to their values (equivalent to the Hive command: SET name="value";). | Yes | Dict | | None | +| hive_job | The payload of a [HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob) | Yes | Dict | | None | +| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None | +| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 | ## Output Name | Description | Type @@ -25,19 +32,20 @@ job_id | The ID of the created job. | String ## Cautions & requirements To use the component, you must: -* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). -* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -``` -component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) -``` -* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project. 
+* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). +* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). +* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: + + ``` + component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project. -## Detailed Description +## Detailed description This component creates a Hive job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit). -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +Follow these steps to use the component in a pipeline: +1. Install the Kubeflow Pipeline SDK: @@ -59,23 +67,21 @@ dataproc_submit_hive_job_op = comp.load_component_from_url( help(dataproc_submit_hive_job_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hive_job.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hive_job/sample.ipynb) -* [Dataproc HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob) - ### Sample -Note: the sample code below works in both IPython notebook or python code directly. +Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template. + #### Setup a Dataproc cluster + [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code. -#### Prepare Hive query -Directly put your Hive queries in the `queries` list or upload your Hive queries into a file to a Cloud Storage (GCS) bucket and place the path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a public CSV file from GCS. +#### Prepare a Hive query + +Put your Hive queries in the queries list, or upload your Hive queries into a file saved in a Cloud Storage bucket and then enter the Cloud Storage bucket’s path in `query_file_uri.` In this sample, we will use a hard coded query in the queries list to select data from a public CSV file from Cloud Storage. 
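For illustration, the snippet below builds such a query list in Python. The table name, schema, and Cloud Storage path are hypothetical placeholders, and the JSON encoding at the end reflects a common way of passing list-typed arguments to the component; adjust it to match how the `queries` argument is declared in your pipeline.

```python
import json

# Placeholder Cloud Storage folder that holds the CSV file(s) to query.
CSV_FOLDER_GCS_PATH = 'gs://my-bucket/hive-sample-data/'

# Create an external table over the CSV data, then run a simple aggregation.
# Queries are separate list items; trailing semicolons are not required.
CREATE_TABLE_QUERY = '''
CREATE EXTERNAL TABLE IF NOT EXISTS sample_records (
    id INT,
    name STRING,
    value DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '{}'
'''.format(CSV_FOLDER_GCS_PATH)

SELECT_QUERY = 'SELECT name, AVG(value) FROM sample_records GROUP BY name'

# JSON-encode the list before handing it to the component's `queries` argument.
queries = json.dumps([CREATE_TABLE_QUERY, SELECT_QUERY])
```

The resulting `queries` value is what you pass to the component in the pipeline definition shown later in this document.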
+ +For more details, see the [Hive language manual.](https://cwiki.apache.org/confluence/display/Hive/LanguageManual) -For more details, please checkout [Hive language manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual) #### Set sample parameters @@ -166,3 +172,12 @@ experiment = client.create_experiment(EXPERIMENT_NAME) run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` + +## References +* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hive_job.py) +* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hive_job/sample.ipynb) +* [Dataproc HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob) + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataproc/submit_hive_job/sample.ipynb b/components/gcp/dataproc/submit_hive_job/sample.ipynb index a6081328ebe..bfd32c6558a 100644 --- a/components/gcp/dataproc/submit_hive_job/sample.ipynb +++ b/components/gcp/dataproc/submit_hive_job/sample.ipynb @@ -4,24 +4,31 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a Hive Job to Cloud Dataproc\n", - "A Kubeflow Pipeline component to submit a Hive job to Google Cloud Dataproc service. \n", + "# Name\n", + "Data preparation using Apache Hive on YARN with Cloud Dataproc\n", "\n", - "## Intended Use\n", - "Use the component to run an Apache Hive job as one preprocessing step in a KFP pipeline. \n", + "# Label\n", + "Cloud Dataproc, GCP, Cloud Storage, YARN, Hive, Apache\n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component to prepare data by submitting an Apache Hive job on YARN to Cloud Dataproc.\n", + "\n", + "# Details\n", + "## Intended use\n", + "Use the component to run an Apache Hive job as one preprocessing step in a Kubeflow Pipeline.\n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n", - "region | The Dataproc region that handles the request. | GCPRegion | No |\n", - "cluster_name | The name of the cluster that runs the job. | String | No |\n", - "queries | The queries to execute. You do not need to terminate a query with a semicolon. Multiple queries can be specified in one string by separating each with a semicolon. | List | Yes | `[]`\n", - "query_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the script that contains Hive queries. | GCSPath | Yes | ` `\n", - "script_variables | Mapping of query variable names to values (equivalent to the Hive command: SET name=\"value\";). | List | Yes | `[]`\n", - "hive_job | The payload of a [HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob). | Dict | Yes | `{}`\n", - "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). 
| Dict | Yes | `{}`\n", - "wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30`\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|----------|-------------|----------|-----------|-----------------|---------|\n", + "| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectId | | |\n", + "| region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | |\n", + "| cluster_name | The name of the cluster to run the job. | No | String | | |\n", + "| queries | The queries to execute the Hive job. Specify multiple queries in one string by separating them with semicolons. You do not need to terminate queries with semicolons. | Yes | List | | None |\n", + "| query_file_uri | The HCFS URI of the script that contains the Hive queries. | Yes | GCPPath | | None |\n", + "| script_variables | Mapping of the query’s variable names to their values (equivalent to the Hive command: SET name=\"value\";). | Yes | Dict | | None |\n", + "| hive_job | The payload of a [HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob) | Yes | Dict | | None |\n", + "| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None |\n", + "| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 |\n", "\n", "## Output\n", "Name | Description | Type\n", @@ -30,19 +37,20 @@ "\n", "## Cautions & requirements\n", "To use the component, you must:\n", - "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", - "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n", - "\n", - "## Detailed Description\n", + "* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", + "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", + "\n", + " ```\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.\n", + "\n", + "## Detailed description\n", "This component creates a Hive job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "Follow these steps to use the component in a pipeline:\n", + "1. 
Install the Kubeflow Pipeline SDK:\n" ] }, { @@ -81,29 +89,22 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hive_job.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hive_job/sample.ipynb)\n", - "* [Dataproc HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob)\n", - "\n", "### Sample\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", + "Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n", + "\n", "\n", "#### Setup a Dataproc cluster\n", + "\n", "[Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code.\n", "\n", - "#### Prepare Hive query\n", - "Directly put your Hive queries in the `queries` list or upload your Hive queries into a file to a Cloud Storage (GCS) bucket and place the path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a public CSV file from GCS.\n", + "#### Prepare a Hive query\n", + "\n", + "Put your Hive queries in the queries list, or upload your Hive queries into a file saved in a Cloud Storage bucket and then enter the Cloud Storage bucket’s path in `query_file_uri.` In this sample, we will use a hard coded query in the queries list to select data from a public CSV file from Cloud Storage.\n", + "\n", + "For more details, see the [Hive language manual.](https://cwiki.apache.org/confluence/display/Hive/LanguageManual)\n", + "\n", "\n", - "For more details, please checkout [Hive language manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ "#### Set sample parameters" ] }, @@ -229,6 +230,20 @@ "run_name = pipeline_func.__name__ + ' run'\n", "run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hive_job.py)\n", + "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hive_job/sample.ipynb)\n", + "* [Dataproc HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." 
+ ] } ], "metadata": { @@ -247,7 +262,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/submit_pig_job/README.md b/components/gcp/dataproc/submit_pig_job/README.md index 252b0cad638..70ead813b0e 100644 --- a/components/gcp/dataproc/submit_pig_job/README.md +++ b/components/gcp/dataproc/submit_pig_job/README.md @@ -1,22 +1,31 @@ -# Submitting a Pig Job to Cloud Dataproc -A Kubeflow Pipeline component to submit a Pig job to Google Cloud Dataproc service. +# Name +Data preparation using Apache Pig on YARN with Cloud Dataproc -## Intended Use -Use the component to run an Apache Pig job as one preprocessing step in a KFP pipeline. +# Label +Cloud Dataproc, GCP, Cloud Storage, YARN, Pig, Apache, Kubeflow, pipelines, components + + +# Summary +A Kubeflow Pipeline component to prepare data by submitting an Apache Pig job on YARN to Cloud Dataproc. + + +# Details +## Intended use +Use the component to run an Apache Pig job as one preprocessing step in a Kubeflow Pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No | -region | The Dataproc region that handles the request. | GCPRegion | No | -cluster_name | The name of the cluster that runs the job. | String | No | -queries | The queries to execute. You do not need to terminate a query with a semicolon. Multiple queries can be specified in one string by separating each with a semicolon. | List | Yes | `[]` -query_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the script that contains Pig queries.| GCSPath | Yes | ` ` -script_variables | Optional. Mapping of query variable names to values (equivalent to the Pig command: SET name="value";).| List | Yes | `[]` -pig_job | The payload of a [PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob). | Dict | Yes | `{}` -job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs).| Dict | Yes | `{}` -wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30` +| Argument | Description | Optional | Data type | Accepted values | Default | +|----------|-------------|----------|-----------|-----------------|---------| +| project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No | GCPProjectID | | | +| region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | | +| cluster_name | The name of the cluster to run the job. | No | String | | | +| queries | The queries to execute the Pig job. Specify multiple queries in one string by separating them with semicolons. You do not need to terminate queries with semicolons. | Yes | List | | None | +| query_file_uri | The HCFS URI of the script that contains the Pig queries. | Yes | GCSPath | | None | +| script_variables | Mapping of the query’s variable names to their values (equivalent to the Pig command: SET name="value";). | Yes | Dict | | None | +| pig_job | The payload of a [PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob). | Yes | Dict | | None | +| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). 
| Yes | Dict | | None | +| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 | ## Output Name | Description | Type @@ -24,20 +33,22 @@ Name | Description | Type job_id | The ID of the created job. | String ## Cautions & requirements + To use the component, you must: -* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). -* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -``` -component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) -``` -* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project. +* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). +* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). +* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: + + ``` + component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project. -## Detailed Description +## Detailed description This component creates a Pig job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit). -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +Follow these steps to use the component in a pipeline: +1. Install the Kubeflow Pipeline SDK: @@ -59,23 +70,21 @@ dataproc_submit_pig_job_op = comp.load_component_from_url( help(dataproc_submit_pig_job_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_pig_job.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_pig_job/sample.ipynb) -* [Dataproc PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob) - ### Sample -Note: the sample code below works in both IPython notebook or python code directly. +Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template. + #### Setup a Dataproc cluster + [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code. -#### Prepare Pig query -Directly put your Pig queries in the `queries` list or upload your Pig queries into a file to a Google Cloud Storage (GCS) bucket and place the path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a local `passwd` file. -For more details, please checkout [Pig documentation](http://pig.apache.org/docs/latest/) +#### Prepare a Pig query + +Either put your Pig queries in the `queries` list, or upload your Pig queries into a file to a Cloud Storage bucket and then enter the Cloud Storage bucket’s path in `query_file_uri`. 
In this sample, we will use a hard coded query in the `queries` list to select data from a local `passwd` file. + +For more details on Apache Pig, see the [Pig documentation.](http://pig.apache.org/docs/latest/) #### Set sample parameters @@ -154,7 +163,11 @@ run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` +## References +* [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) +* [Pig documentation](http://pig.apache.org/docs/latest/) +* [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs) +* [PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob) -```python - -``` +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataproc/submit_pig_job/sample.ipynb b/components/gcp/dataproc/submit_pig_job/sample.ipynb index 9da409b8e1d..b695b2eadaa 100644 --- a/components/gcp/dataproc/submit_pig_job/sample.ipynb +++ b/components/gcp/dataproc/submit_pig_job/sample.ipynb @@ -4,24 +4,33 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a Pig Job to Cloud Dataproc\n", - "A Kubeflow Pipeline component to submit a Pig job to Google Cloud Dataproc service. \n", + "# Name\n", + "Data preparation using Apache Pig on YARN with Cloud Dataproc\n", "\n", - "## Intended Use\n", - "Use the component to run an Apache Pig job as one preprocessing step in a KFP pipeline. \n", + "# Label\n", + "Cloud Dataproc, GCP, Cloud Storage, YARN, Pig, Apache, Kubeflow, pipelines, components\n", + "\n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component to prepare data by submitting an Apache Pig job on YARN to Cloud Dataproc.\n", + "\n", + "\n", + "# Details\n", + "## Intended use\n", + "Use the component to run an Apache Pig job as one preprocessing step in a Kubeflow Pipeline.\n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n", - "region | The Dataproc region that handles the request. | GCPRegion | No |\n", - "cluster_name | The name of the cluster that runs the job. | String | No |\n", - "queries | The queries to execute. You do not need to terminate a query with a semicolon. Multiple queries can be specified in one string by separating each with a semicolon. | List | Yes | `[]`\n", - "query_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the script that contains Pig queries.| GCSPath | Yes | ` `\n", - "script_variables | Optional. Mapping of query variable names to values (equivalent to the Pig command: SET name=\"value\";).| List | Yes | `[]`\n", - "pig_job | The payload of a [PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob). | Dict | Yes | `{}`\n", - "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs).| Dict | Yes | `{}`\n", - "wait_interval | The number of seconds to pause between polling the operation. 
| Integer | Yes | `30`\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|----------|-------------|----------|-----------|-----------------|---------|\n", + "| project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No | GCPProjectID | | |\n", + "| region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | |\n", + "| cluster_name | The name of the cluster to run the job. | No | String | | |\n", + "| queries | The queries to execute the Pig job. Specify multiple queries in one string by separating them with semicolons. You do not need to terminate queries with semicolons. | Yes | List | | None |\n", + "| query_file_uri | The HCFS URI of the script that contains the Pig queries. | Yes | GCSPath | | None |\n", + "| script_variables | Mapping of the query’s variable names to their values (equivalent to the Pig command: SET name=\"value\";). | Yes | Dict | | None |\n", + "| pig_job | The payload of a [PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob). | Yes | Dict | | None |\n", + "| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None |\n", + "| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 |\n", "\n", "## Output\n", "Name | Description | Type\n", @@ -29,20 +38,22 @@ "job_id | The ID of the created job. | String\n", "\n", "## Cautions & requirements\n", + "\n", "To use the component, you must:\n", - "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", - "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n", - "\n", - "## Detailed Description\n", + "* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", + "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", + "\n", + " ```\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.\n", + "\n", + "## Detailed description\n", "This component creates a Pig job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "Follow these steps to use the component in a pipeline:\n", + "1. 
Install the Kubeflow Pipeline SDK:\n" ] }, { @@ -81,29 +92,22 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_pig_job.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_pig_job/sample.ipynb)\n", - "* [Dataproc PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob)\n", - "\n", "### Sample\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", + "Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n", + "\n", "\n", "#### Setup a Dataproc cluster\n", + "\n", "[Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code.\n", "\n", - "#### Prepare Pig query\n", - "Directly put your Pig queries in the `queries` list or upload your Pig queries into a file to a Google Cloud Storage (GCS) bucket and place the path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a local `passwd` file.\n", "\n", - "For more details, please checkout [Pig documentation](http://pig.apache.org/docs/latest/)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ + "#### Prepare a Pig query\n", + "\n", + "Either put your Pig queries in the `queries` list, or upload your Pig queries into a file to a Cloud Storage bucket and then enter the Cloud Storage bucket’s path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a local `passwd` file.\n", + "\n", + "For more details on Apache Pig, see the [Pig documentation.](http://pig.apache.org/docs/latest/)\n", + "\n", "#### Set sample parameters" ] }, @@ -218,11 +222,18 @@ ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], - "source": [] + "source": [ + "## References\n", + "* [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) \n", + "* [Pig documentation](http://pig.apache.org/docs/latest/)\n", + "* [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs)\n", + "* [PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." 
+ ] } ], "metadata": { @@ -241,7 +252,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/submit_pyspark_job/README.md b/components/gcp/dataproc/submit_pyspark_job/README.md index 3a5f6db5f89..7ba0533cb3e 100644 --- a/components/gcp/dataproc/submit_pyspark_job/README.md +++ b/components/gcp/dataproc/submit_pyspark_job/README.md @@ -1,21 +1,31 @@ -# Submitting a PySpark Job to Cloud Dataproc -A Kubeflow Pipeline component to submit a PySpark job to Google Cloud Dataproc service. +# Name +Data preparation using PySpark on Cloud Dataproc + + +# Label +Cloud Dataproc, GCP, Cloud Storage,PySpark, Kubeflow, pipelines, components + + +# Summary +A Kubeflow Pipeline component to prepare data by submitting a PySpark job to Cloud Dataproc. + + +# Details +## Intended use +Use the component to run an Apache PySpark job as one preprocessing step in a Kubeflow Pipeline. -## Intended Use -Use the component to run an Apache PySpark job as one preprocessing step in a KFP pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No | -region | The Dataproc region that handles the request. | GCPRegion | No | -cluster_name | The name of the cluster that runs the job. | String | No | -main_python_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the main Python file to use as the driver. Must be a .py file. | GCSPath | No | -args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | List | Yes | `[]` -pyspark_job | The payload of a [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob). | Dict | Yes | `{}` -job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}` -wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30` +| Argument | Description | Optional | Data type | Accepted values | Default | +|----------------------|------------|----------|--------------|-----------------|---------| +| project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No | GCPProjectID | | | +| region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | | +| cluster_name | The name of the cluster to run the job. | No | String | | | +| main_python_file_uri | The HCFS URI of the Python file to use as the driver. This must be a .py file. | No | GCSPath | | | +| args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | Yes | List | | None | +| pyspark_job | The payload of a [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob). | Yes | Dict | | None | +| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None | ## Output Name | Description | Type @@ -23,21 +33,24 @@ Name | Description | Type job_id | The ID of the created job. 
| String ## Cautions & requirements + To use the component, you must: -* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). -* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -``` -component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) -``` -* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project. +* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). +* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). +* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -## Detailed Description -This component creates a PySpark job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit). + ``` + component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project. -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +## Detailed description +This component creates a PySpark job from the [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit). + +Follow these steps to use the component in a pipeline: + +1. Install the Kubeflow Pipeline SDK: ```python @@ -58,21 +71,19 @@ dataproc_submit_pyspark_job_op = comp.load_component_from_url( help(dataproc_submit_pyspark_job_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_pyspark_job.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_pyspark_job/sample.ipynb) -* [Dataproc PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob) - ### Sample -Note: the sample code below works in both IPython notebook or python code directly. +Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template. + #### Setup a Dataproc cluster + [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code. -#### Prepare PySpark job -Upload your PySpark code file to a Cloud Storage bucket. For example, thisis a publicly accessible hello-world.py in Cloud Storage: + +#### Prepare a PySpark job + +Upload your PySpark code file to a Cloud Storage bucket. 
For example, this is a publicly accessible `hello-world.py` in Cloud Storage: ```python @@ -151,7 +162,11 @@ run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` +## References -```python +* [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) +* [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob) +* [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs) -``` +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataproc/submit_pyspark_job/sample.ipynb b/components/gcp/dataproc/submit_pyspark_job/sample.ipynb index 6fac3c069c3..f9f8bc09245 100644 --- a/components/gcp/dataproc/submit_pyspark_job/sample.ipynb +++ b/components/gcp/dataproc/submit_pyspark_job/sample.ipynb @@ -4,23 +4,33 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a PySpark Job to Cloud Dataproc\n", - "A Kubeflow Pipeline component to submit a PySpark job to Google Cloud Dataproc service. \n", + "# Name\n", + "Data preparation using PySpark on Cloud Dataproc\n", + "\n", + "\n", + "# Label\n", + "Cloud Dataproc, GCP, Cloud Storage,PySpark, Kubeflow, pipelines, components\n", + "\n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component to prepare data by submitting a PySpark job to Cloud Dataproc.\n", + "\n", + "\n", + "# Details\n", + "## Intended use\n", + "Use the component to run an Apache PySpark job as one preprocessing step in a Kubeflow Pipeline.\n", "\n", - "## Intended Use\n", - "Use the component to run an Apache PySpark job as one preprocessing step in a KFP pipeline. \n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n", - "region | The Dataproc region that handles the request. | GCPRegion | No |\n", - "cluster_name | The name of the cluster that runs the job. | String | No |\n", - "main_python_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the main Python file to use as the driver. Must be a .py file. | GCSPath | No |\n", - "args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | List | Yes | `[]`\n", - "pyspark_job | The payload of a [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob). | Dict | Yes | `{}`\n", - "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}`\n", - "wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30`\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|----------------------|------------|----------|--------------|-----------------|---------|\n", + "| project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No | GCPProjectID | | |\n", + "| region | The Cloud Dataproc region to handle the request. 
| No | GCPRegion | | |\n", + "| cluster_name | The name of the cluster to run the job. | No | String | | |\n", + "| main_python_file_uri | The HCFS URI of the Python file to use as the driver. This must be a .py file. | No | GCSPath | | |\n", + "| args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | Yes | List | | None |\n", + "| pyspark_job | The payload of a [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob). | Yes | Dict | | None |\n", + "| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None |\n", "\n", "## Output\n", "Name | Description | Type\n", @@ -28,20 +38,24 @@ "job_id | The ID of the created job. | String\n", "\n", "## Cautions & requirements\n", + "\n", "To use the component, you must:\n", - "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", - "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n", - "\n", - "## Detailed Description\n", - "This component creates a PySpark job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).\n", - "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", + "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", + "\n", + " ```\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.\n", + "\n", + "## Detailed description\n", + "\n", + "This component creates a PySpark job from the [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).\n", + "\n", + "Follow these steps to use the component in a pipeline:\n", + "\n", + "1. 
Install the Kubeflow Pipeline SDK:" ] }, { @@ -80,21 +94,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_pyspark_job.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_pyspark_job/sample.ipynb)\n", - "* [Dataproc PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob)\n", - "\n", "### Sample\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", + "Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n", + "\n", "\n", "#### Setup a Dataproc cluster\n", + "\n", "[Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code.\n", "\n", - "#### Prepare PySpark job\n", - "Upload your PySpark code file to a Cloud Storage bucket. For example, thisis a publicly accessible hello-world.py in Cloud Storage:" + "\n", + "#### Prepare a PySpark job\n", + "\n", + "Upload your PySpark code file to a Cloud Storage bucket. For example, this is a publicly accessible `hello-world.py` in Cloud Storage:" ] }, { @@ -219,11 +231,18 @@ ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], - "source": [] + "source": [ + "## References\n", + "\n", + "* [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) \n", + "* [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob)\n", + "* [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." + ] } ], "metadata": { @@ -242,7 +261,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/submit_spark_job/README.md b/components/gcp/dataproc/submit_spark_job/README.md index 4c7ad7fcda8..5cad85794b5 100644 --- a/components/gcp/dataproc/submit_spark_job/README.md +++ b/components/gcp/dataproc/submit_spark_job/README.md @@ -1,22 +1,36 @@ -# Submitting a Spark Job to Cloud Dataproc -A Kubeflow Pipeline component to submit a Spark job to Google Cloud Dataproc service. +# Name -## Intended Use -Use the component to run an Apache Spark job as one preprocessing step in a KFP pipeline. +Data preparation using Spark on YARN with Cloud Dataproc + + +# Label + +Cloud Dataproc, GCP, Cloud Storage, Spark, Kubeflow, pipelines, components, YARN + + +# Summary + +A Kubeflow Pipeline component to prepare data by submitting a Spark job on YARN to Cloud Dataproc. + +# Details + +## Intended use + +Use the component to run an Apache Spark job as one preprocessing step in a Kubeflow Pipeline. 
## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No | -region | The Dataproc region that handles the request. | GCPRegion | No | -cluster_name | The name of the cluster that runs the job. | String | No | -main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the jar file that contains the main class. | GCSPath | No | -main_class | The name of the driver's main class. The jar file that contains the class must be in the default CLASSPATH or specified in `spark_job.jarFileUris`. | String | No | -args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | List | Yes | `[]` -spark_job | The payload of a [SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob). | Dict | Yes | `{}` -job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}` -wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30` +Argument | Description | Optional | Data type | Accepted values | Default | +:--- | :---------- | :--- | :------- | :------| :------| +project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to.|No | GCPProjectID | | | +region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | | +cluster_name | The name of the cluster to run the job. | No | String | | | +main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the JAR file that contains the main class. | No | GCSPath | | | +main_class | The name of the driver's main class. The JAR file that contains the class must be either in the default CLASSPATH or specified in `spark_job.jarFileUris`.| No | | | | +args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission.| Yes | | | | +spark_job | The payload of a [SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob).| Yes | | | | +job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | | | | +wait_interval | The number of seconds to wait between polling the operation. | Yes | | | 30 | ## Output Name | Description | Type @@ -24,22 +38,33 @@ Name | Description | Type job_id | The ID of the created job. | String ## Cautions & requirements + To use the component, you must: -* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). -* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -``` -component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) -``` -* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project. -## Detailed Description + + +* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). +* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). 
+* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: + + ``` + component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` + + +* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project. + + +## Detailed description + This component creates a Spark job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit). -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +Follow these steps to use the component in a pipeline: + +1. Install the Kubeflow Pipeline SDK: + ```python %%capture --no-stderr @@ -59,25 +84,21 @@ dataproc_submit_spark_job_op = comp.load_component_from_url( help(dataproc_submit_spark_job_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_spark_job.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_spark_job/sample.ipynb) -* [Dataproc SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob) - ### Sample +Note: The following sample code works in an IPython notebook or directly in Python code. -Note: the sample code below works in both IPython notebook or python code directly. -#### Setup a Dataproc cluster +#### Set up a Dataproc cluster [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code. -#### Prepare Spark job -Upload your Spark jar file to a Cloud Storage (GCS) bucket. In the sample, we will use a jar file that is pre-installed in the main cluster `file:///usr/lib/spark/examples/jars/spark-examples.jar`. -Here is the [Pi example source code](https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaSparkPi.java). +#### Prepare a Spark job +Upload your Spark JAR file to a Cloud Storage bucket. In the sample, we use a JAR file that is preinstalled in the main cluster: `file:///usr/lib/spark/examples/jars/spark-examples.jar`. + +Here is the [source code of the sample](https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaSparkPi.java). + +To package a self-contained Spark application, follow these [instructions](https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications). -To package a self-contained spark application, follow the [instructions](https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications). 
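+
+As a rough sketch only (the project ID, region, cluster name, and pipeline name below are placeholders, and the complete example continues in the sections that follow), the JAR and main class would be wired into the `dataproc_submit_spark_job_op` factory loaded above roughly like this:
+
+```python
+import json
+
+import kfp.dsl as dsl
+import kfp.gcp as gcp
+
+EXAMPLES_JAR = 'file:///usr/lib/spark/examples/jars/spark-examples.jar'
+
+@dsl.pipeline(name='Spark Pi sketch', description='Illustrative use of the Spark job component.')
+def spark_pi_sketch(
+    project_id='my-project-id',   # placeholder
+    region='us-central1',         # placeholder
+    cluster_name='my-cluster'     # placeholder
+):
+    dataproc_submit_spark_job_op(
+        project_id=project_id,
+        region=region,
+        cluster_name=cluster_name,
+        # SparkPi is the standard example class shipped in spark-examples.jar; the JAR
+        # that contains it is supplied through spark_job.jarFileUris, as described above.
+        main_class='org.apache.spark.examples.SparkPi',
+        spark_job=json.dumps({'jarFileUris': [EXAMPLES_JAR]}),
+        # List and dict arguments are passed as JSON-serialized strings.
+        args=json.dumps(['1000'])
+    ).apply(gcp.use_gcp_secret('user-gcp-sa'))
+```
+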
#### Set sample parameters @@ -154,7 +175,12 @@ run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` +## References -```python +* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_spark_job.py) +* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_spark_job/sample.ipynb) +* [Dataproc SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob) -``` +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataproc/submit_spark_job/sample.ipynb b/components/gcp/dataproc/submit_spark_job/sample.ipynb index 0681629ce31..3d2b79cdc42 100644 --- a/components/gcp/dataproc/submit_spark_job/sample.ipynb +++ b/components/gcp/dataproc/submit_spark_job/sample.ipynb @@ -4,24 +4,38 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a Spark Job to Cloud Dataproc\n", - "A Kubeflow Pipeline component to submit a Spark job to Google Cloud Dataproc service. \n", + "# Name\n", "\n", - "## Intended Use\n", - "Use the component to run an Apache Spark job as one preprocessing step in a KFP pipeline. \n", + "Data preparation using Spark on YARN with Cloud Dataproc\n", + "\n", + "\n", + "# Label\n", + "\n", + "Cloud Dataproc, GCP, Cloud Storage, Spark, Kubeflow, pipelines, components, YARN\n", + "\n", + "\n", + "# Summary\n", + "\n", + "A Kubeflow Pipeline component to prepare data by submitting a Spark job on YARN to Cloud Dataproc.\n", + "\n", + "# Details\n", + "\n", + "## Intended use\n", + "\n", + "Use the component to run an Apache Spark job as one preprocessing step in a Kubeflow Pipeline.\n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n", - "region | The Dataproc region that handles the request. | GCPRegion | No |\n", - "cluster_name | The name of the cluster that runs the job. | String | No |\n", - "main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the jar file that contains the main class. | GCSPath | No |\n", - "main_class | The name of the driver's main class. The jar file that contains the class must be in the default CLASSPATH or specified in `spark_job.jarFileUris`. | String | No |\n", - "args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | List | Yes | `[]`\n", - "spark_job | The payload of a [SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob). | Dict | Yes | `{}`\n", - "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}`\n", - "wait_interval | The number of seconds to pause between polling the operation. 
| Integer | Yes | `30`\n", + "Argument | Description | Optional | Data type | Accepted values | Default |\n", + ":--- | :---------- | :--- | :------- | :------| :------| \n", + "project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to.|No | GCPProjectID | | |\n", + "region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | | \n", + "cluster_name | The name of the cluster to run the job. | No | String | | |\n", + "main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the JAR file that contains the main class. | No | GCSPath | | |\n", + "main_class | The name of the driver's main class. The JAR file that contains the class must be either in the default CLASSPATH or specified in `spark_job.jarFileUris`.| No | | | | \n", + "args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission.| Yes | | | |\n", + "spark_job | The payload of a [SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob).| Yes | | | |\n", + "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | | | |\n", + "wait_interval | The number of seconds to wait between polling the operation. | Yes | | | 30 |\n", "\n", "## Output\n", "Name | Description | Type\n", @@ -29,20 +43,32 @@ "job_id | The ID of the created job. | String\n", "\n", "## Cautions & requirements\n", + "\n", "To use the component, you must:\n", - "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", - "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n", - "\n", - "## Detailed Description\n", + "\n", + "\n", + "\n", + "* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", + "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", + "\n", + " ```\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "\n", + "\n", + "* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.\n", + "\n", + "\n", + "## Detailed description\n", + "\n", "This component creates a Spark job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "Follow these steps to use the component in a pipeline:\n", + "\n", + "\n", + "\n", + "1. 
Install the Kubeflow Pipeline SDK:" ] }, { @@ -81,31 +107,22 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_spark_job.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_spark_job/sample.ipynb)\n", - "* [Dataproc SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob)\n", - "\n", "### Sample\n", + "Note: The following sample code works in an IPython notebook or directly in Python code.\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", "\n", - "#### Setup a Dataproc cluster\n", + "#### Set up a Dataproc cluster\n", "[Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code.\n", "\n", - "#### Prepare Spark job\n", - "Upload your Spark jar file to a Cloud Storage (GCS) bucket. In the sample, we will use a jar file that is pre-installed in the main cluster `file:///usr/lib/spark/examples/jars/spark-examples.jar`. \n", "\n", - "Here is the [Pi example source code](https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaSparkPi.java).\n", + "#### Prepare a Spark job\n", + "Upload your Spark JAR file to a Cloud Storage bucket. In the sample, we use a JAR file that is preinstalled in the main cluster: `file:///usr/lib/spark/examples/jars/spark-examples.jar`.\n", + "\n", + "Here is the [source code of the sample](https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaSparkPi.java).\n", + "\n", + "To package a self-contained Spark application, follow these [instructions](https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications).\n", + "\n", "\n", - "To package a self-contained spark application, follow the [instructions](https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ "#### Set sample parameters" ] }, @@ -218,11 +235,19 @@ ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], - "source": [] + "source": [ + "## References\n", + "\n", + "* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_spark_job.py)\n", + "* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_spark_job/sample.ipynb)\n", + "* [Dataproc SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." 
+ ] } ], "metadata": { @@ -241,7 +266,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/submit_sparksql_job/README.md b/components/gcp/dataproc/submit_sparksql_job/README.md index 841e582a06a..4b743859ad8 100644 --- a/components/gcp/dataproc/submit_sparksql_job/README.md +++ b/components/gcp/dataproc/submit_sparksql_job/README.md @@ -1,22 +1,30 @@ -# Submitting a SparkSql Job to Cloud Dataproc -A Kubeflow Pipeline component to submit a SparkSql job to Google Cloud Dataproc service. +# Name +Data preparation using SparkSQL on YARN with Cloud Dataproc -## Intended Use -Use the component to run an Apache SparkSql job as one preprocessing step in a KFP pipeline. +# Label +Cloud Dataproc, GCP, Cloud Storage, YARN, SparkSQL, Kubeflow, pipelines, components + +# Summary +A Kubeflow Pipeline component to prepare data by submitting a SparkSql job on YARN to Cloud Dataproc. + +# Details + +## Intended use +Use the component to run an Apache SparkSql job as one preprocessing step in a Kubeflow Pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No | -region | The Dataproc region that handles the request. | GCPRegion | No | -cluster_name | The name of the cluster that runs the job. | String | No | -queries | The queries to execute. You do not need to terminate a query with a semicolon. Multiple queries can be specified in one string by separating each with a semicolon. | List | Yes | `[]` -query_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the script that contains SQL queries.| GCSPath | Yes | ` ` -script_variables | Mapping of query variable names to values (equivalent to the Spark SQL command: SET name="value";). | List | Yes | `[]` -sparksql_job | The payload of a [SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob). | Dict | Yes | `{}` -job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}` -wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30` +Argument| Description | Optional | Data type| Accepted values| Default | +:--- | :---------- | :--- | :------- | :------ | :------ +project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No| GCPProjectID | | | +region | The Cloud Dataproc region to handle the request. | No | GCPRegion| +cluster_name | The name of the cluster to run the job. | No | String| | | +queries | The queries to execute the SparkSQL job. Specify multiple queries in one string by separating them with semicolons. You do not need to terminate queries with semicolons. | Yes | List | | None | +query_file_uri | The HCFS URI of the script that contains the SparkSQL queries.| Yes | GCSPath | | None | +script_variables | Mapping of the query’s variable names to their values (equivalent to the SparkSQL command: SET name="value";).| Yes| Dict | | None | +sparksql_job | The payload of a [SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob). | Yes | Dict | | None | +job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). 
| Yes | Dict | | None | +wait_interval | The number of seconds to pause between polling the operation. | Yes |Integer | | 30 | ## Output Name | Description | Type @@ -25,20 +33,19 @@ job_id | The ID of the created job. | String ## Cautions & requirements To use the component, you must: -* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). +* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). * [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: +* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: ``` component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) ``` -* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project. +* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project. ## Detailed Description This component creates a Pig job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit). -Here are the steps to use the component in a pipeline: -1. Install KFP SDK - +Follow these steps to use the component in a pipeline: +1. Install the Kubeflow Pipeline SDK: ```python @@ -59,23 +66,17 @@ dataproc_submit_sparksql_job_op = comp.load_component_from_url( help(dataproc_submit_sparksql_job_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_sparksql_job.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_sparksql_job/sample.ipynb) -* [Dataproc SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob) - ### Sample -Note: the sample code below works in both IPython notebook or python code directly. +Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template. #### Setup a Dataproc cluster [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code. -#### Prepare SparkSQL job -Directly put your SparkSQL queries in the `queires` list or upload your SparkSQL queries into a file to a Google Cloud Storage (GCS) bucket and place the path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a public CSV file from GCS. +#### Prepare a SparkSQL job +Either put your SparkSQL queries in the `queires` list, or upload your SparkSQL queries into a file to a Cloud Storage bucket and then enter the Cloud Storage bucket’s path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a public CSV file from Cloud Storage. 
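To make the two options concrete, here is a minimal sketch of how the `queries` and `query_file_uri` values might be defined before they are passed to the component. The table definition, query text, and `gs://example-bucket/...` paths are placeholders for illustration only, not this sample's real values.

```python
# Option 1: pass the statements inline through the `queries` argument.
# Separate multiple statements with semicolons; the last statement does not
# need a trailing semicolon. (Placeholder table, columns, and bucket path.)
queries = ["""
DROP TABLE IF EXISTS natality_csv;
CREATE EXTERNAL TABLE natality_csv (year BIGINT, plurality BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'gs://example-bucket/natality/';
SELECT year, COUNT(*) FROM natality_csv GROUP BY year
"""]

# Option 2: upload a .sql script to Cloud Storage and pass its path instead,
# leaving `queries` empty. (Placeholder path.)
query_file_uri = 'gs://example-bucket/sparksql/prepare_data.sql'
```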
-For more details about Spark SQL, please checkout the [programming guide](https://spark.apache.org/docs/latest/sql-programming-guide.html) +For more details about Spark SQL, see [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html) #### Set sample parameters @@ -167,7 +168,11 @@ run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` +## References +* [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html) +* [SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob) +* [Cloud Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs) -```python -``` +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataproc/submit_sparksql_job/sample.ipynb b/components/gcp/dataproc/submit_sparksql_job/sample.ipynb index 7d8709fa8c7..7e1ec4b84e8 100644 --- a/components/gcp/dataproc/submit_sparksql_job/sample.ipynb +++ b/components/gcp/dataproc/submit_sparksql_job/sample.ipynb @@ -4,24 +4,32 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a SparkSql Job to Cloud Dataproc\n", - "A Kubeflow Pipeline component to submit a SparkSql job to Google Cloud Dataproc service. \n", + "# Name\n", + "Data preparation using SparkSQL on YARN with Cloud Dataproc\n", "\n", - "## Intended Use\n", - "Use the component to run an Apache SparkSql job as one preprocessing step in a KFP pipeline. \n", + "# Label\n", + "Cloud Dataproc, GCP, Cloud Storage, YARN, SparkSQL, Kubeflow, pipelines, components \n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component to prepare data by submitting a SparkSql job on YARN to Cloud Dataproc.\n", + "\n", + "# Details\n", + "\n", + "## Intended use\n", + "Use the component to run an Apache SparkSql job as one preprocessing step in a Kubeflow Pipeline.\n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n", - "region | The Dataproc region that handles the request. | GCPRegion | No |\n", - "cluster_name | The name of the cluster that runs the job. | String | No |\n", - "queries | The queries to execute. You do not need to terminate a query with a semicolon. Multiple queries can be specified in one string by separating each with a semicolon. | List | Yes | `[]`\n", - "query_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the script that contains SQL queries.| GCSPath | Yes | ` `\n", - "script_variables | Mapping of query variable names to values (equivalent to the Spark SQL command: SET name=\"value\";). | List | Yes | `[]`\n", - "sparksql_job | The payload of a [SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob). | Dict | Yes | `{}`\n", - "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}`\n", - "wait_interval | The number of seconds to pause between polling the operation. 
| Integer | Yes | `30`\n", + "Argument| Description | Optional | Data type| Accepted values| Default |\n", + ":--- | :---------- | :--- | :------- | :------ | :------\n", + "project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No| GCPProjectID | | |\n", + "region | The Cloud Dataproc region to handle the request. | No | GCPRegion|\n", + "cluster_name | The name of the cluster to run the job. | No | String| | |\n", + "queries | The queries to execute the SparkSQL job. Specify multiple queries in one string by separating them with semicolons. You do not need to terminate queries with semicolons. | Yes | List | | None | \n", + "query_file_uri | The HCFS URI of the script that contains the SparkSQL queries.| Yes | GCSPath | | None |\n", + "script_variables | Mapping of the query’s variable names to their values (equivalent to the SparkSQL command: SET name=\"value\";).| Yes| Dict | | None |\n", + "sparksql_job | The payload of a [SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob). | Yes | Dict | | None |\n", + "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None |\n", + "wait_interval | The number of seconds to pause between polling the operation. | Yes |Integer | | 30 |\n", "\n", "## Output\n", "Name | Description | Type\n", @@ -30,19 +38,19 @@ "\n", "## Cautions & requirements\n", "To use the component, you must:\n", - "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", + "* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", "```\n", "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", "```\n", - "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n", + "* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.\n", "\n", "## Detailed Description\n", "This component creates a Pig job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "Follow these steps to use the component in a pipeline:\n", + "1. 
Install the Kubeflow Pipeline SDK:" ] }, { @@ -81,29 +89,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_sparksql_job.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_sparksql_job/sample.ipynb)\n", - "* [Dataproc SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob)\n", - "\n", "### Sample\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", + "Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n", "\n", "#### Setup a Dataproc cluster\n", "[Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code.\n", "\n", - "#### Prepare SparkSQL job\n", - "Directly put your SparkSQL queries in the `queires` list or upload your SparkSQL queries into a file to a Google Cloud Storage (GCS) bucket and place the path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a public CSV file from GCS.\n", + "#### Prepare a SparkSQL job\n", + "Either put your SparkSQL queries in the `queires` list, or upload your SparkSQL queries into a file to a Cloud Storage bucket and then enter the Cloud Storage bucket’s path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a public CSV file from Cloud Storage.\n", + "\n", + "For more details about Spark SQL, see [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)\n", "\n", - "For more details about Spark SQL, please checkout the [programming guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ "#### Set sample parameters" ] }, @@ -231,11 +228,18 @@ ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], - "source": [] + "source": [ + "## References\n", + "* [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)\n", + "* [SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob)\n", + "* [Cloud Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs)\n", + "\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." 
+ ] } ], "metadata": { @@ -254,7 +258,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/ml_engine/batch_predict/README.md b/components/gcp/ml_engine/batch_predict/README.md index 1e38885b54f..c6674458606 100644 --- a/components/gcp/ml_engine/batch_predict/README.md +++ b/components/gcp/ml_engine/batch_predict/README.md @@ -1,23 +1,49 @@ -# Batch predicting using Cloud Machine Learning Engine -A Kubeflow Pipeline component to submit a batch prediction job against a trained model to Cloud ML Engine service. +# Name + +Batch prediction using Cloud Machine Learning Engine + + +# Label + +Cloud Storage, Cloud ML Engine, Kubeflow, Pipeline, Component + + +# Summary + +A Kubeflow Pipeline component to submit a batch prediction job against a deployed model on Cloud ML Engine. + + +# Details + ## Intended use -Use the component to run a batch prediction job against a deployed model in Cloud Machine Learning Engine. The prediction output will be stored in a Cloud Storage bucket. + +Use the component to run a batch prediction job against a deployed model on Cloud ML Engine. The prediction output is stored in a Cloud Storage bucket. + ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The ID of the parent project of the job. | GCPProjectID | No | -model_path | Required. The path to the model. It can be one of the following paths: | String | No | -input_paths | The Cloud Storage location of the input data files. May contain wildcards. For example: `gs://foo/*.csv` | List | No | -input_data_format | The format of the input data files. See [DataFormat](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat). | String | No | -output_path | The Cloud Storage location for the output data. | GCSPath | No | -region | The region in Compute Engine where the prediction job is run. | GCPRegion | No | -output_data_format | The format of the output data files. See [DataFormat](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat). | String | Yes | `JSON` -prediction_input | The JSON input parameters to create a prediction job. See [PredictionInput](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#PredictionInput) to know more. | Dict | Yes | ` ` -job_id_prefix | The prefix of the generated job id. | String | Yes | ` ` -wait_interval | A time-interval to wait for in case the operation has a long run time. | Integer | Yes | `30` + +| Argument | Description | Optional | Data type | Accepted values | Default | +|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------------|-----------------|---------| +| project_id | The ID of the Google Cloud Platform (GCP) project of the job. | No | GCPProjectID | | | +| model_path | The path to the model. It can be one of the following:
| No | GCSPath | | | +| input_paths | The path to the Cloud Storage location containing the input data files. It can contain wildcards, for example, `gs://foo/*.csv` | No | List | GCSPath | | +| input_data_format | The format of the input data files. See [REST Resource: projects.jobs](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat) for more details. | No | String | DataFormat | | +| output_path | The path to the Cloud Storage location for the output data. | No | GCSPath | | | +| region | The Compute Engine region where the prediction job is run. | No | GCPRegion | | | +| output_data_format | The format of the output data files. See [REST Resource: projects.jobs](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat) for more details. | Yes | String | DataFormat | JSON | +| prediction_input | The JSON input parameters to create a prediction job. See [PredictionInput](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#PredictionInput) for more information. | Yes | Dict | | None | +| job_id_prefix | The prefix of the generated job id. | Yes | String | | None | +| wait_interval | The number of seconds to wait in case the operation has a long run time. | Yes | | | 30 | + + +## Input data schema + +The component accepts the following as input: + +* A trained model: It can be a model file in Cloud Storage, a deployed model, or a version in Cloud ML Engine. Specify the path to the model in the `model_path `runtime argument. +* Input data: The data used to make predictions against the trained model. The data can be in [multiple formats](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat). The data path is specified by `input_paths` and the format is specified by `input_data_format`. ## Output Name | Description | Type @@ -29,25 +55,28 @@ output_path | The output path of the batch prediction job | GCSPath ## Cautions & requirements To use the component, you must: -* Setup cloud environment by following the [guide](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#setup). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -```python -mlengine_predict_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) +* Set up a cloud environment by following this [guide](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#setup). +* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -``` -* Grant Kubeflow user service account the read access to the Cloud Storage buckets which contains the input data. -* Grant Kubeflow user service account the write access to the Cloud Storage bucket of the output directory. + ```python + mlengine_predict_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` + + +* Grant the following types of access to the Kubeflow user service account: + * Read access to the Cloud Storage buckets which contains the input data. + * Write access to the Cloud Storage bucket of the output directory. 
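Before the step-by-step walkthrough, the following sketch shows how the runtime arguments and the service account requirements above typically come together in a single pipeline step. It assumes the keyword arguments mirror the runtime argument names in the table above; the project ID, bucket paths, and component URL are placeholders, and the real component URL appears in the loading step below.

```python
import kfp.dsl as dsl
import kfp.gcp as gcp
import kfp.components as comp

# Placeholder URL; substitute the component URL shown in the loading step below.
COMPONENT_SPEC_URI = 'https://raw.githubusercontent.com/kubeflow/pipelines/<version>/components/gcp/ml_engine/batch_predict/component.yaml'
mlengine_batch_predict_op = comp.load_component_from_url(COMPONENT_SPEC_URI)

@dsl.pipeline(name='Batch predict sketch')
def batch_predict_sketch():
    mlengine_batch_predict_op(
        project_id='my-gcp-project',                        # placeholder GCP project
        model_path='gs://my-bucket/models/census/export/',  # placeholder trained-model path
        input_paths='["gs://my-bucket/data/*.csv"]',        # JSON-encoded list of input files
        input_data_format='TEXT',                           # a DataFormat value
        output_path='gs://my-bucket/predictions/',          # placeholder output bucket
        region='us-central1',
        output_data_format='JSON',
        wait_interval=30
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))              # run under the Kubeflow user service account
```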
+ + +## Detailed description +Follow these steps to use the component in a pipeline: -## Detailed Description -The component accepts following input data: -* A trained model: it can be a model file in Cloud Storage, or a deployed model or version in Cloud Machine Learning Engine. The path to the model is specified by the `model_path` parameter. -* Input data: the data will be used to make predictions against the input trained model. The data can be in [multiple formats](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat). The path of the data is specified by `input_paths` parameter and the format is specified by `input_data_format` parameter. -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +1. Install the Kubeflow Pipeline SDK: + @@ -69,17 +98,11 @@ mlengine_batch_predict_op = comp.load_component_from_url( help(mlengine_batch_predict_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/ml_engine/_batch_predict.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/ml_engine/batch_predict/sample.ipynb) -* [Cloud Machine Learning Engine job REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs) ### Sample Code +Note: The following sample code works in an IPython notebook or directly in Python code. -Note: the sample code below works in both IPython notebook or python code directly. - -In this sample, we batch predict against a pre-built trained model from `gs://ml-pipeline-playground/samples/ml_engine/census/trained_model/` and use the test data from `gs://ml-pipeline-playground/samples/ml_engine/census/test.json`. +In this sample, you batch predict against a pre-built trained model from `gs://ml-pipeline-playground/samples/ml_engine/census/trained_model/` and use the test data from `gs://ml-pipeline-playground/samples/ml_engine/census/test.json`. #### Inspect the test data @@ -175,3 +198,12 @@ run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arg OUTPUT_FILES_PATTERN = OUTPUT_GCS_PATH + '*' !gsutil cat OUTPUT_FILES_PATTERN ``` + +## References +* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/ml_engine/_batch_predict.py) +* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/ml_engine/batch_predict/sample.ipynb) +* [Cloud Machine Learning Engine job REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs) + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. 
diff --git a/components/gcp/ml_engine/batch_predict/sample.ipynb b/components/gcp/ml_engine/batch_predict/sample.ipynb index 4a88302f70b..92985e1b112 100644 --- a/components/gcp/ml_engine/batch_predict/sample.ipynb +++ b/components/gcp/ml_engine/batch_predict/sample.ipynb @@ -4,25 +4,51 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Batch predicting using Cloud Machine Learning Engine\n", - "A Kubeflow Pipeline component to submit a batch prediction job against a trained model to Cloud ML Engine service.\n", + "# Name\n", + "\n", + "Batch prediction using Cloud Machine Learning Engine\n", + "\n", + "\n", + "# Label\n", + "\n", + "Cloud Storage, Cloud ML Engine, Kubeflow, Pipeline, Component\n", + "\n", + "\n", + "# Summary\n", + "\n", + "A Kubeflow Pipeline component to submit a batch prediction job against a deployed model on Cloud ML Engine.\n", + "\n", + "\n", + "# Details\n", + "\n", "\n", "## Intended use\n", - "Use the component to run a batch prediction job against a deployed model in Cloud Machine Learning Engine. The prediction output will be stored in a Cloud Storage bucket.\n", + "\n", + "Use the component to run a batch prediction job against a deployed model on Cloud ML Engine. The prediction output is stored in a Cloud Storage bucket.\n", + "\n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The ID of the parent project of the job. | GCPProjectID | No |\n", - "model_path | Required. The path to the model. It can be one of the following paths: | String | No |\n", - "input_paths | The Cloud Storage location of the input data files. May contain wildcards. For example: `gs://foo/*.csv` | List | No |\n", - "input_data_format | The format of the input data files. See [DataFormat](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat). | String | No |\n", - "output_path | The Cloud Storage location for the output data. | GCSPath | No |\n", - "region | The region in Compute Engine where the prediction job is run. | GCPRegion | No |\n", - "output_data_format | The format of the output data files. See [DataFormat](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat). | String | Yes | `JSON`\n", - "prediction_input | The JSON input parameters to create a prediction job. See [PredictionInput](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#PredictionInput) to know more. | Dict | Yes | ` `\n", - "job_id_prefix | The prefix of the generated job id. | String | Yes | ` `\n", - "wait_interval | A time-interval to wait for in case the operation has a long run time. | Integer | Yes | `30`\n", + "\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------------|-----------------|---------|\n", + "| project_id | The ID of the Google Cloud Platform (GCP) project of the job. | No | GCPProjectID | | |\n", + "| model_path | The path to the model. It can be one of the following:
| No | GCSPath | | |\n", + "| input_paths | The path to the Cloud Storage location containing the input data files. It can contain wildcards, for example, `gs://foo/*.csv` | No | List | GCSPath | |\n", + "| input_data_format | The format of the input data files. See [REST Resource: projects.jobs](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat) for more details. | No | String | DataFormat | |\n", + "| output_path | The path to the Cloud Storage location for the output data. | No | GCSPath | | |\n", + "| region | The Compute Engine region where the prediction job is run. | No | GCPRegion | | |\n", + "| output_data_format | The format of the output data files. See [REST Resource: projects.jobs](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat) for more details. | Yes | String | DataFormat | JSON |\n", + "| prediction_input | The JSON input parameters to create a prediction job. See [PredictionInput](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#PredictionInput) for more information. | Yes | Dict | | None |\n", + "| job_id_prefix | The prefix of the generated job id. | Yes | String | | None |\n", + "| wait_interval | The number of seconds to wait in case the operation has a long run time. | Yes | | | 30 |\n", + "\n", + "\n", + "## Input data schema\n", + "\n", + "The component accepts the following as input:\n", + "\n", + "* A trained model: It can be a model file in Cloud Storage, a deployed model, or a version in Cloud ML Engine. Specify the path to the model in the `model_path `runtime argument.\n", + "* Input data: The data used to make predictions against the trained model. The data can be in [multiple formats](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat). The data path is specified by `input_paths` and the format is specified by `input_data_format`.\n", "\n", "## Output\n", "Name | Description | Type\n", @@ -34,25 +60,28 @@ "## Cautions & requirements\n", "\n", "To use the component, you must:\n", - "* Setup cloud environment by following the [guide](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#setup).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", "\n", - "```python\n", - "mlengine_predict_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + "* Set up a cloud environment by following this [guide](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#setup).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. 
For example:\n", + "\n", + " ```python\n", + " mlengine_predict_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "\n", "\n", - "```\n", - "* Grant Kubeflow user service account the read access to the Cloud Storage buckets which contains the input data.\n", - "* Grant Kubeflow user service account the write access to the Cloud Storage bucket of the output directory.\n", + "* Grant the following types of access to the Kubeflow user service account:\n", + " * Read access to the Cloud Storage buckets which contains the input data.\n", + " * Write access to the Cloud Storage bucket of the output directory.\n", "\n", "\n", - "## Detailed Description\n", + "## Detailed description\n", "\n", - "The component accepts following input data:\n", - "* A trained model: it can be a model file in Cloud Storage, or a deployed model or version in Cloud Machine Learning Engine. The path to the model is specified by the `model_path` parameter.\n", - "* Input data: the data will be used to make predictions against the input trained model. The data can be in [multiple formats](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#DataFormat). The path of the data is specified by `input_paths` parameter and the format is specified by `input_data_format` parameter.\n", + "Follow these steps to use the component in a pipeline:\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "\n", + "\n", + "1. Install the Kubeflow Pipeline SDK:\n", + "\n" ] }, { @@ -91,17 +120,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/ml_engine/_batch_predict.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/ml_engine/batch_predict/sample.ipynb)\n", - "* [Cloud Machine Learning Engine job REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs)\n", "\n", "### Sample Code\n", + "Note: The following sample code works in an IPython notebook or directly in Python code. \n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", - "\n", - "In this sample, we batch predict against a pre-built trained model from `gs://ml-pipeline-playground/samples/ml_engine/census/trained_model/` and use the test data from `gs://ml-pipeline-playground/samples/ml_engine/census/test.json`. 
\n", + "In this sample, you batch predict against a pre-built trained model from `gs://ml-pipeline-playground/samples/ml_engine/census/trained_model/` and use the test data from `gs://ml-pipeline-playground/samples/ml_engine/census/test.json`.\n", "\n", "#### Inspect the test data" ] @@ -255,6 +278,20 @@ "OUTPUT_FILES_PATTERN = OUTPUT_GCS_PATH + '*'\n", "!gsutil cat OUTPUT_FILES_PATTERN" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/ml_engine/_batch_predict.py)\n", + "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/ml_engine/batch_predict/sample.ipynb)\n", + "* [Cloud Machine Learning Engine job REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." + ] } ], "metadata": { @@ -273,7 +310,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/ml_engine/deploy/README.md b/components/gcp/ml_engine/deploy/README.md index ea88337ebbb..de191af2c78 100644 --- a/components/gcp/ml_engine/deploy/README.md +++ b/components/gcp/ml_engine/deploy/README.md @@ -1,55 +1,98 @@ -# Deploying a trained model to Cloud Machine Learning Engine -A Kubeflow Pipeline component to deploy a trained model from a Cloud Storage path to a Cloud Machine Learning Engine service. +# Name + +Deploying a trained model to Cloud Machine Learning Engine + + +# Label + +Cloud Storage, Cloud ML Engine, Kubeflow, Pipeline + + +# Summary + +A Kubeflow Pipeline component to deploy a trained model from a Cloud Storage location to Cloud ML Engine. + + +# Details + ## Intended use -Use the component to deploy a trained model to Cloud Machine Learning Engine service. The deployed model can serve online or batch predictions in a KFP pipeline. - -## Runtime arguments: -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -model_uri | The Cloud Storage URI which contains a model file. Commonly used TF model search paths (export/exporter) will be used. | GCSPath | No | -project_id | The ID of the parent project of the serving model. | GCPProjectID | No | -model_id | The user-specified name of the model. If it is not provided, the operation uses a random name. | String | Yes | ` ` -version_id | The user-specified name of the version. If it is not provided, the operation uses a random name. | String | Yes | ` ` -runtime_version | The [Cloud ML Engine runtime version](https://cloud.google.com/ml-engine/docs/tensorflow/runtime-version-list) to use for this deployment. If it is not set, the Cloud ML Engine uses the default stable version, 1.0. | String | Yes | ` ` -python_version | The version of Python used in the prediction. If it is not set, the default version is `2.7`. Python `3.5` is available when the runtime_version is set to `1.4` and above. 
Python `2.7` works with all supported runtime versions. | String | Yes | ` ` -version | The JSON payload of the new [Version](https://cloud.google.com/ml-engine/reference/rest/v1/projects.models.versions). | Dict | Yes | ` ` -replace_existing_version | A Boolean flag that indicates whether to replace existing version in case of conflict. | Bool | Yes | False -set_default | A Boolean flag that indicates whether to set the new version as default version in the model. | Bool | Yes | False -wait_interval | A time-interval to wait for in case the operation has a long run time. | Integer | Yes | 30 - -## Output: -Name | Description | Type -:--- | :---------- | :--- -model_uri | The Cloud Storage URI of the trained model. | GCSPath -model_name | The name of the serving model. | String -version_name | The name of the deployed version of the model. | String + +Use the component to deploy a trained model to Cloud ML Engine. The deployed model can serve online or batch predictions in a Kubeflow Pipeline. + + +## Runtime arguments + +| Argument | Description | Optional | Data type | Accepted values | Default | +|--------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------------|-----------------|---------| +| model_uri | The URI of a Cloud Storage directory that contains a trained model file.
Or
An [Estimator export base directory](https://www.tensorflow.org/guide/saved_model#perform_the_export) that contains a list of subdirectories named by timestamp. The directory with the latest timestamp is used to load the trained model file. | No | GCSPath | | | +| project_id | The ID of the Google Cloud Platform (GCP) project of the serving model. | No | GCPProjectID | | | +| model_id | The name of the trained model. | Yes | String | | None | +| version_id | The name of the version of the model. If it is not provided, the operation uses a random name. | Yes | String | | None | +| runtime_version | The Cloud ML Engine runtime version to use for this deployment. If it is not provided, the default stable version, 1.0, is used. | Yes | String | | None | +| python_version | The version of Python used in the prediction. If it is not provided, version 2.7 is used. You can use Python 3.5 if runtime_version is set to 1.4 or above. Python 2.7 works with all supported runtime versions. | Yes | String | | 2.7 | +| model | The JSON payload of the new [model](https://cloud.google.com/ml-engine/reference/rest/v1/projects.models). | Yes | Dict | | None | +| version | The new [version](https://cloud.google.com/ml-engine/reference/rest/v1/projects.models.versions) of the trained model. | Yes | Dict | | None | +| replace_existing_version | Indicates whether to replace the existing version in case of a conflict (if the same version number is found.) | Yes | Boolean | | FALSE | +| set_default | Indicates whether to set the new version as the default version in the model. | Yes | Boolean | | FALSE | +| wait_interval | The number of seconds to wait in case the operation has a long run time. | Yes | Integer | | 30 | + + + +## Input data schema + +The component looks for a trained model in the location specified by the `model_uri` runtime argument. The accepted trained models are: + + +* [Tensorflow SavedModel](https://cloud.google.com/ml-engine/docs/tensorflow/exporting-for-prediction) +* [Scikit-learn & XGBoost model](https://cloud.google.com/ml-engine/docs/scikit/exporting-for-prediction) + +The accepted file formats are: + +* *.pb +* *.pbtext +* model.bst +* model.joblib +* model.pkl + +`model_uri` can also be an [Estimator export base directory, ](https://www.tensorflow.org/guide/saved_model#perform_the_export)which contains a list of subdirectories named by timestamp. The directory with the latest timestamp is used to load the trained model file. + +## Output +| Name | Description | Type | +|:------- |:---- | :--- | +| job_id | The ID of the created job. | String | +| job_dir | The Cloud Storage path that contains the trained model output files. | GCSPath | + ## Cautions & requirements To use the component, you must: -* Setup cloud environment by following the [guide](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#setup). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -```python -mlengine_deploy_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) +* [Set up the cloud environment](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#setup). +* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. 
For example: -``` -* Grant Kubeflow user service account the read access to the Cloud Storage buckets which contains the trained model. + ``` + ```python + mlengine_deploy_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + + ``` +* Grant read access to the Cloud Storage bucket that contains the trained model to the Kubeflow user service account. -## Detailed Description +## Detailed description -The component does: -* Search for the trained model from the user provided Cloud Storage path. -* Create a new model if user provided model doesn’t exist. -* Delete the existing model version if `replace_existing_version` is enabled. -* Create a new model version from the trained model. -* Set the new version as the default version of the model if ‘set_default’ is enabled. +Use the component to: +* Locate the trained model at the Cloud Storage location you specify. +* Create a new model if a model provided by you doesn’t exist. +* Delete the existing model version if `replace_existing_version` is enabled. +* Create a new version of the model from the trained model. +* Set the new version as the default version of the model if `set_default` is enabled. + +Follow these steps to use the component in a pipeline: + +1. Install the Kubeflow Pipeline SDK: -Here are the steps to use the component in a pipeline: -1. Install KFP SDK @@ -71,18 +114,10 @@ mlengine_deploy_op = comp.load_component_from_url( help(mlengine_deploy_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/ml_engine/_deploy.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/ml_engine/deploy/sample.ipynb) -* [Cloud Machine Learning Engine Model REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.models) -* [Cloud Machine Learning Engine Version REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.versions) - - ### Sample Note: The following sample code works in IPython notebook or directly in Python code. -In this sample, we will deploy a pre-built trained model from `gs://ml-pipeline-playground/samples/ml_engine/census/trained_model/` to Cloud Machine Learning Engine service. The deployed model is named `kfp_sample_model`. A new version will be created every time when the sample is run, and the latest version will be set as the default version of the deployed model. +In this sample, you deploy a pre-built trained model from `gs://ml-pipeline-playground/samples/ml_engine/census/trained_model/` to Cloud ML Engine. The deployed model is `kfp_sample_model`. A new version is created every time the sample is run, and the latest version is set as the default version of the deployed model. 
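As a quick orientation before the full sample, here is a minimal sketch of the deploy step on its own. It assumes the keyword arguments follow the runtime argument names listed above; the project ID, runtime version, and component URL are placeholders, and the complete runnable version follows below.

```python
import kfp.dsl as dsl
import kfp.gcp as gcp
import kfp.components as comp

# Placeholder URL; substitute the component URL used in the loading step above.
COMPONENT_SPEC_URI = 'https://raw.githubusercontent.com/kubeflow/pipelines/<version>/components/gcp/ml_engine/deploy/component.yaml'
mlengine_deploy_op = comp.load_component_from_url(COMPONENT_SPEC_URI)

@dsl.pipeline(name='Deploy sketch')
def deploy_sketch():
    mlengine_deploy_op(
        model_uri='gs://ml-pipeline-playground/samples/ml_engine/census/trained_model/',
        project_id='my-gcp-project',      # placeholder project ID
        model_id='kfp_sample_model',
        runtime_version='1.10',           # placeholder runtime version
        replace_existing_version=False,
        set_default=True,                 # make the new version the model's default
        wait_interval=30
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))  # run under the Kubeflow user service account
```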
#### Set sample parameters @@ -157,3 +192,13 @@ experiment = client.create_experiment(EXPERIMENT_NAME) run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` + +## References +* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/ml_engine/_deploy.py) +* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/ml_engine/deploy/sample.ipynb) +* [Cloud Machine Learning Engine Model REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.models) +* [Cloud Machine Learning Engine Version REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.versions) + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/ml_engine/deploy/sample.ipynb b/components/gcp/ml_engine/deploy/sample.ipynb index e7c0fff2039..1d3926a83ce 100644 --- a/components/gcp/ml_engine/deploy/sample.ipynb +++ b/components/gcp/ml_engine/deploy/sample.ipynb @@ -4,57 +4,100 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Deploying a trained model to Cloud Machine Learning Engine\n", - "A Kubeflow Pipeline component to deploy a trained model from a Cloud Storage path to a Cloud Machine Learning Engine service.\n", + "# Name\n", + "\n", + "Deploying a trained model to Cloud Machine Learning Engine \n", + "\n", + "\n", + "# Label\n", + "\n", + "Cloud Storage, Cloud ML Engine, Kubeflow, Pipeline\n", + "\n", + "\n", + "# Summary\n", + "\n", + "A Kubeflow Pipeline component to deploy a trained model from a Cloud Storage location to Cloud ML Engine.\n", + "\n", + "\n", + "# Details\n", + "\n", "\n", "## Intended use\n", - "Use the component to deploy a trained model to Cloud Machine Learning Engine service. The deployed model can serve online or batch predictions in a KFP pipeline.\n", - "\n", - "## Runtime arguments:\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "model_uri | The Cloud Storage URI which contains a model file. Commonly used TF model search paths (export/exporter) will be used. | GCSPath | No |\n", - "project_id | The ID of the parent project of the serving model. | GCPProjectID | No | \n", - "model_id | The user-specified name of the model. If it is not provided, the operation uses a random name. | String | Yes | ` `\n", - "version_id | The user-specified name of the version. If it is not provided, the operation uses a random name. | String | Yes | ` `\n", - "runtime_version | The [Cloud ML Engine runtime version](https://cloud.google.com/ml-engine/docs/tensorflow/runtime-version-list) to use for this deployment. If it is not set, the Cloud ML Engine uses the default stable version, 1.0. | String | Yes | ` ` \n", - "python_version | The version of Python used in the prediction. If it is not set, the default version is `2.7`. Python `3.5` is available when the runtime_version is set to `1.4` and above. Python `2.7` works with all supported runtime versions. 
| String | Yes | ` `\n", - "version | The JSON payload of the new [Version](https://cloud.google.com/ml-engine/reference/rest/v1/projects.models.versions). | Dict | Yes | ` `\n", - "replace_existing_version | A Boolean flag that indicates whether to replace existing version in case of conflict. | Bool | Yes | False\n", - "set_default | A Boolean flag that indicates whether to set the new version as default version in the model. | Bool | Yes | False\n", - "wait_interval | A time-interval to wait for in case the operation has a long run time. | Integer | Yes | 30\n", - "\n", - "## Output:\n", - "Name | Description | Type\n", - ":--- | :---------- | :---\n", - "model_uri | The Cloud Storage URI of the trained model. | GCSPath\n", - "model_name | The name of the serving model. | String\n", - "version_name | The name of the deployed version of the model. | String\n", + "\n", + "Use the component to deploy a trained model to Cloud ML Engine. The deployed model can serve online or batch predictions in a Kubeflow Pipeline.\n", + "\n", + "\n", + "## Runtime arguments\n", + "\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|--------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------------|-----------------|---------|\n", + "| model_uri | The URI of a Cloud Storage directory that contains a trained model file.
Or
An [Estimator export base directory](https://www.tensorflow.org/guide/saved_model#perform_the_export) that contains a list of subdirectories named by timestamp. The directory with the latest timestamp is used to load the trained model file. | No | GCSPath | | |\n", + "| project_id | The ID of the Google Cloud Platform (GCP) project of the serving model. | No | GCPProjectID | | |\n", + "| model_id | The name of the trained model. | Yes | String | | None |\n", + "| version_id | The name of the version of the model. If it is not provided, the operation uses a random name. | Yes | String | | None |\n", + "| runtime_version | The Cloud ML Engine runtime version to use for this deployment. If it is not provided, the default stable version, 1.0, is used. | Yes | String | | None |\n", + "| python_version | The version of Python used in the prediction. If it is not provided, version 2.7 is used. You can use Python 3.5 if runtime_version is set to 1.4 or above. Python 2.7 works with all supported runtime versions. | Yes | String | | 2.7 |\n", + "| model | The JSON payload of the new [model](https://cloud.google.com/ml-engine/reference/rest/v1/projects.models). | Yes | Dict | | None |\n", + "| version | The new [version](https://cloud.google.com/ml-engine/reference/rest/v1/projects.models.versions) of the trained model. | Yes | Dict | | None |\n", + "| replace_existing_version | Indicates whether to replace the existing version in case of a conflict (if the same version number is found.) | Yes | Boolean | | FALSE |\n", + "| set_default | Indicates whether to set the new version as the default version in the model. | Yes | Boolean | | FALSE |\n", + "| wait_interval | The number of seconds to wait in case the operation has a long run time. | Yes | Integer | | 30 |\n", + "\n", + "\n", + "\n", + "## Input data schema\n", + "\n", + "The component looks for a trained model in the location specified by the `model_uri` runtime argument. The accepted trained models are:\n", + "\n", + "\n", + "* [Tensorflow SavedModel](https://cloud.google.com/ml-engine/docs/tensorflow/exporting-for-prediction) \n", + "* [Scikit-learn & XGBoost model](https://cloud.google.com/ml-engine/docs/scikit/exporting-for-prediction)\n", + "\n", + "The accepted file formats are:\n", + "\n", + "* *.pb\n", + "* *.pbtext\n", + "* model.bst\n", + "* model.joblib\n", + "* model.pkl\n", + "\n", + "`model_uri` can also be an [Estimator export base directory, ](https://www.tensorflow.org/guide/saved_model#perform_the_export)which contains a list of subdirectories named by timestamp. The directory with the latest timestamp is used to load the trained model file.\n", + "\n", + "## Output\n", + "| Name | Description | Type |\n", + "|:------- |:---- | :--- |\n", + "| job_id | The ID of the created job. | String |\n", + "| job_dir | The Cloud Storage path that contains the trained model output files. | GCSPath |\n", + "\n", "\n", "## Cautions & requirements\n", "\n", "To use the component, you must:\n", - "* Setup cloud environment by following the [guide](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#setup).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. 
For example:\n", "\n", - "```python\n", - "mlengine_deploy_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + "* [Set up the cloud environment](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#setup).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", + "\n", + " ```\n", + " ```python\n", + " mlengine_deploy_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", "\n", - "```\n", - "* Grant Kubeflow user service account the read access to the Cloud Storage buckets which contains the trained model.\n", + " ```\n", "\n", + "* Grant read access to the Cloud Storage bucket that contains the trained model to the Kubeflow user service account.\n", "\n", - "## Detailed Description\n", + "## Detailed description\n", "\n", - "The component does:\n", - "* Search for the trained model from the user provided Cloud Storage path.\n", - "* Create a new model if user provided model doesn’t exist.\n", - "* Delete the existing model version if `replace_existing_version` is enabled.\n", - "* Create a new model version from the trained model.\n", - "* Set the new version as the default version of the model if ‘set_default’ is enabled.\n", + "Use the component to: \n", + "* Locate the trained model at the Cloud Storage location you specify.\n", + "* Create a new model if a model provided by you doesn’t exist.\n", + "* Delete the existing model version if `replace_existing_version` is enabled.\n", + "* Create a new version of the model from the trained model.\n", + "* Set the new version as the default version of the model if `set_default` is enabled.\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "Follow these steps to use the component in a pipeline:\n", + "\n", + "1. Install the Kubeflow Pipeline SDK:\n", + "\n" ] }, { @@ -93,18 +136,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/ml_engine/_deploy.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/ml_engine/deploy/sample.ipynb)\n", - "* [Cloud Machine Learning Engine Model REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.models)\n", - "* [Cloud Machine Learning Engine Version REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.versions)\n", - "\n", - "\n", "### Sample\n", "Note: The following sample code works in IPython notebook or directly in Python code.\n", "\n", - "In this sample, we will deploy a pre-built trained model from `gs://ml-pipeline-playground/samples/ml_engine/census/trained_model/` to Cloud Machine Learning Engine service. The deployed model is named `kfp_sample_model`. A new version will be created every time when the sample is run, and the latest version will be set as the default version of the deployed model.\n", + "In this sample, you deploy a pre-built trained model from `gs://ml-pipeline-playground/samples/ml_engine/census/trained_model/` to Cloud ML Engine. The deployed model is `kfp_sample_model`. 
A new version is created every time the sample is run, and the latest version is set as the default version of the deployed model.\n", "\n", "#### Set sample parameters" ] @@ -215,6 +250,21 @@ "run_name = pipeline_func.__name__ + ' run'\n", "run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/ml_engine/_deploy.py)\n", + "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/ml_engine/deploy/sample.ipynb)\n", + "* [Cloud Machine Learning Engine Model REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.models)\n", + "* [Cloud Machine Learning Engine Version REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.versions)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." + ] } ], "metadata": { @@ -233,7 +283,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/ml_engine/train/README.md b/components/gcp/ml_engine/train/README.md index cc3e8b7ae18..0322cfc0a83 100644 --- a/components/gcp/ml_engine/train/README.md +++ b/components/gcp/ml_engine/train/README.md @@ -1,57 +1,74 @@ -# Submitting a Cloud ML training job as a pipeline step -A Kubeflow Pipeline component to submit a Cloud Machine Learning (Cloud ML) Engine training job as a step in a pipeline. +# Name +Submitting a Cloud Machine Learning Engine training job as a pipeline step -## Intended Use -This component is intended to submit a training job to Cloud Machine Learning (ML) Engine from a Kubeflow Pipelines workflow. +# Label +GCP, Cloud ML Engine, Machine Learning, pipeline, component, Kubeflow, Kubeflow Pipeline + +# Summary +A Kubeflow Pipeline component to submit a Cloud ML Engine training job as a step in a pipeline. + +# Details +## Intended use +Use this component to submit a training job to Cloud ML Engine from a Kubeflow Pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The ID of the parent project of the job. | GCPProjectID | No | -python_module | The Python module name to run after installing the packages. | String | Yes | `` -package_uris | The Cloud Storage location of the packages (that contain the training program and any additional dependencies). The maximum number of package URIs is 100. | List | Yes | `` -region | The Compute Engine region in which the training job is run. | GCPRegion | Yes | `` -args | The command line arguments to pass to the program. | List | Yes | `` -job_dir | The list of arguments to pass to the Python file. | GCSPath | Yes | `` -python_version | A Cloud Storage path in which to store the training outputs and other data needed for training. This path is passed to your TensorFlow program as the `job-dir` command-line argument. 
The benefit of specifying this field is that Cloud ML validates the path for use in training. | String | Yes | `` -runtime_version | The Cloud ML Engine runtime version to use for training. If not set, Cloud ML Engine uses the default stable version, 1.0. | String | Yes | `` -master_image_uri | The Docker image to run on the master replica. This image must be in Container Registry. | GCRPath | Yes | `` -worker_image_uri | The Docker image to run on the worker replica. This image must be in Container Registry. | GCRPath | Yes | `` -training_input | The input parameters to create a training job. It is the JSON payload of a [TrainingInput](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#TrainingInput) | Dict | Yes | `` -job_id_prefix | The prefix of the generated job id. | String | Yes | `` -wait_interval | A time-interval to wait for between calls to get the job status. | Integer | Yes | `30` - -## Outputs -Name | Description | Type -:--- | :---------- | :--- -job_id | The ID of the created job. | String -job_dir | The output path in Cloud Storage of the trainning job, which contains the trained model files. | GCSPath +| Argument | Description | Optional | Data type | Accepted values | Default | +|:------------------|:------------------|:----------|:--------------|:-----------------|:-------------| +| project_id | The ID of the Google Cloud Platform (GCP) project of the job. | No | GCPProjectID | | | +| python_module | The name of the Python module to run after installing the training program. | Yes | String | | None | +| package_uris | The Cloud Storage location of the packages that contain the training program and any additional dependencies. The maximum number of package URIs is 100. | Yes | List | | None | +| region | The Compute Engine region in which the training job is run. | Yes | GCPRegion | | us-central1 | +| args | The command line arguments to pass to the training program. | Yes | List | | None | +| job_dir | A Cloud Storage path in which to store the training outputs and other data needed for training. This path is passed to your TensorFlow program as the `job-dir` command-line argument. The benefit of specifying this field is that Cloud ML validates the path for use in training. | Yes | GCSPath | | None | +| python_version | The version of Python used in training. If it is not set, the default version is 2.7. Python 3.5 is available when the runtime version is set to 1.4 and above. | Yes | String | | None | +| runtime_version | The runtime version of Cloud ML Engine to use for training. If it is not set, Cloud ML Engine uses the default. | Yes | String | | 1 | +| master_image_uri | The Docker image to run on the master replica. This image must be in Container Registry. | Yes | GCRPath | | None | +| worker_image_uri | The Docker image to run on the worker replica. This image must be in Container Registry. | Yes | GCRPath | | None | +| training_input | The input parameters to create a training job. | Yes | Dict | [TrainingInput](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#TrainingInput) | None | +| job_id_prefix | The prefix of the job ID that is generated. | Yes | String | | None | +| wait_interval | The number of seconds to wait between API calls to get the status of the job. | Yes | Integer | | 30 | + + + +## Input data schema + +The component accepts two types of inputs: +* A list of Python packages from Cloud Storage. 
+ * You can manually build a Python package and upload it to Cloud Storage by following this [guide](https://cloud.google.com/ml-engine/docs/tensorflow/packaging-trainer#manual-build). +* A Docker container from Container Registry. + * Follow this [guide](https://cloud.google.com/ml-engine/docs/using-containers) to publish and use a Docker container with this component. + +## Output +| Name | Description | Type | +|:------- |:---- | :--- | +| job_id | The ID of the created job. | String | +| job_dir | The Cloud Storage path that contains the trained model output files. | GCSPath | + ## Cautions & requirements To use the component, you must: -* Setup cloud environment by following the [guide](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#setup). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -```python -mlengine_train_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) +* Set up a cloud environment by following this [guide](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#setup). +* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -``` -* Grant Kubeflow user service account the read access to the Cloud Storage buckets which contains the input data, packages or docker images. -* Grant Kubeflow user service account the write access to the Cloud Storage bucket of the output directory. + ``` + mlengine_train_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* Grant the following access to the Kubeflow user service account: + * Read access to the Cloud Storage buckets which contain the input data, packages, or Docker images. + * Write access to the Cloud Storage bucket of the output directory. -## Detailed Description +## Detailed description -The component accepts one of the two types of executable inputs: -* A list of Python packages from Cloud Storage. You may manually build a Python package by following the [guide](https://cloud.google.com/ml-engine/docs/tensorflow/packaging-trainer#manual-build) and [upload it to Cloud Storage](https://cloud.google.com/ml-engine/docs/tensorflow/packaging-trainer#uploading_packages_manually). -* Docker container from Google Container Registry (GCR). Follow the [guide](https://cloud.google.com/ml-engine/docs/using-containers) to publish and use a Docker container with this component. +The component builds the [TrainingInput](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#TrainingInput) payload and submits a job via the [Cloud ML Engine REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs). -The component builds the payload of a [TrainingInput](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#TrainingInput) and submit a job by Cloud Machine Learning Engine REST API. +The steps to use the component in a pipeline are: -Here are the steps to use the component in a pipeline: -1. Install KFP SDK + +1. 
Install the Kubeflow Pipeline SDK: @@ -73,18 +90,12 @@ mlengine_train_op = comp.load_component_from_url( help(mlengine_train_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/ml_engine/_train.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/ml_engine/train/sample.ipynb) -* [Cloud Machine Learning Engine job REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs) - - ### Sample -Note: The following sample code works in IPython notebook or directly in Python code. +Note: The following sample code works in an IPython notebook or directly in Python code. -In this sample, we use the code from [census estimator sample](https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/estimator) to train a model in Cloud Machine Learning Engine service. In order to pass the code to the service, we need to package the python code and upload it in a Cloud Storage bucket. Make sure that you have read and write permissions on the bucket that you use as the working directory. +In this sample, you use the code from the [census estimator sample](https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/estimator) to train a model in Cloud ML Engine. To upload the code to Cloud ML Engine, package the Python code and upload it to a Cloud Storage bucket. +Note: You must have read and write permissions on the bucket that you use as the working directory. #### Set sample parameters @@ -208,13 +219,18 @@ run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arg #### Inspect the results -Follow the `Run` link to open the KFP UI. In the step logs, you should be able to click on the links to: -* Job dashboard -* And realtime logs on Stackdriver - Use the following command to inspect the contents in the output directory: ```python !gsutil ls $OUTPUT_GCS_PATH ``` + +## References +* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/ml_engine/_train.py) +* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/ml_engine/train/sample.ipynb) +* [Cloud Machine Learning Engine job REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs) + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. 
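To make the mapping from the runtime arguments above to an actual pipeline step concrete, here is a minimal sketch of a pipeline that wires `mlengine_train_op` into a single training step. It is illustrative only: the component specification URL, project ID, bucket paths, trainer module name, argument values, and runtime version are placeholders or assumptions you would replace with your own, and the list-typed arguments are passed as JSON strings, as the sample code in this document does.

```python
import json

import kfp.dsl as dsl
import kfp.gcp as gcp
import kfp.components as comp

# Placeholder values -- substitute your own project, bucket, trainer package,
# and the component.yaml URL for the release of the component you use.
COMPONENT_SPEC_URI = 'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/gcp/ml_engine/train/component.yaml'  # assumed location
PROJECT_ID = 'your-gcp-project'
TRAINER_PACKAGE = 'gs://your-bucket/census/trainer-0.1.tar.gz'
OUTPUT_GCS_PATH = 'gs://your-bucket/census/output'

# Load the train component into a reusable op factory.
mlengine_train_op = comp.load_component_from_url(COMPONENT_SPEC_URI)

@dsl.pipeline(
    name='CloudML training sample',
    description='Submits a Cloud ML Engine training job as a single pipeline step.'
)
def train_pipeline(
        project_id=PROJECT_ID,
        python_module='trainer.task',                # assumed module name inside the trainer package
        package_uris=json.dumps([TRAINER_PACKAGE]),
        region='us-central1',
        args=json.dumps(['--train-steps', '1000']),  # example trainer arguments
        job_dir=OUTPUT_GCS_PATH,
        runtime_version='1.10',                      # assumed runtime version
        wait_interval='30'):
    # Each argument in the table above becomes a parameter of the op; applying the
    # 'user-gcp-sa' secret gives the step the Kubeflow user service account credentials.
    mlengine_train_op(
        project_id=project_id,
        python_module=python_module,
        package_uris=package_uris,
        region=region,
        args=args,
        job_dir=job_dir,
        runtime_version=runtime_version,
        wait_interval=wait_interval
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))
```

You would then compile and submit this pipeline with the Kubeflow Pipelines compiler and client (`kfp.compiler` and `kfp.Client`), exactly as the sample code later in this document does, and inspect the step's `job_id` and `job_dir` outputs from the pipeline run.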
diff --git a/components/gcp/ml_engine/train/sample.ipynb b/components/gcp/ml_engine/train/sample.ipynb index d8f5e58d0aa..718c73dccbd 100644 --- a/components/gcp/ml_engine/train/sample.ipynb +++ b/components/gcp/ml_engine/train/sample.ipynb @@ -4,59 +4,76 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a Cloud ML training job as a pipeline step\n", - "A Kubeflow Pipeline component to submit a Cloud Machine Learning (Cloud ML) Engine training job as a step in a pipeline.\n", + "# Name\n", + "Submitting a Cloud Machine Learning Engine training job as a pipeline step\n", "\n", - "## Intended Use\n", - "This component is intended to submit a training job to Cloud Machine Learning (ML) Engine from a Kubeflow Pipelines workflow.\n", + "# Label\n", + "GCP, Cloud ML Engine, Machine Learning, pipeline, component, Kubeflow, Kubeflow Pipeline\n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component to submit a Cloud ML Engine training job as a step in a pipeline.\n", + "\n", + "# Details\n", + "## Intended use\n", + "Use this component to submit a training job to Cloud ML Engine from a Kubeflow Pipeline. \n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The ID of the parent project of the job. | GCPProjectID | No |\n", - "python_module | The Python module name to run after installing the packages. | String | Yes | ``\n", - "package_uris | The Cloud Storage location of the packages (that contain the training program and any additional dependencies). The maximum number of package URIs is 100. | List | Yes | ``\n", - "region | The Compute Engine region in which the training job is run. | GCPRegion | Yes | ``\n", - "args | The command line arguments to pass to the program. | List | Yes | ``\n", - "job_dir | The list of arguments to pass to the Python file. | GCSPath | Yes | ``\n", - "python_version | A Cloud Storage path in which to store the training outputs and other data needed for training. This path is passed to your TensorFlow program as the `job-dir` command-line argument. The benefit of specifying this field is that Cloud ML validates the path for use in training. | String | Yes | ``\n", - "runtime_version | The Cloud ML Engine runtime version to use for training. If not set, Cloud ML Engine uses the default stable version, 1.0. | String | Yes | ``\n", - "master_image_uri | The Docker image to run on the master replica. This image must be in Container Registry. | GCRPath | Yes | ``\n", - "worker_image_uri | The Docker image to run on the worker replica. This image must be in Container Registry. | GCRPath | Yes | ``\n", - "training_input | The input parameters to create a training job. It is the JSON payload of a [TrainingInput](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#TrainingInput) | Dict | Yes | ``\n", - "job_id_prefix | The prefix of the generated job id. | String | Yes | ``\n", - "wait_interval | A time-interval to wait for between calls to get the job status. | Integer | Yes | `30`\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|:------------------|:------------------|:----------|:--------------|:-----------------|:-------------|\n", + "| project_id | The ID of the Google Cloud Platform (GCP) project of the job. | No | GCPProjectID | | |\n", + "| python_module | The name of the Python module to run after installing the training program. 
| Yes | String | | None |\n", + "| package_uris | The Cloud Storage location of the packages that contain the training program and any additional dependencies. The maximum number of package URIs is 100. | Yes | List | | None |\n", + "| region | The Compute Engine region in which the training job is run. | Yes | GCPRegion | | us-central1 |\n", + "| args | The command line arguments to pass to the training program. | Yes | List | | None |\n", + "| job_dir | A Cloud Storage path in which to store the training outputs and other data needed for training. This path is passed to your TensorFlow program as the `job-dir` command-line argument. The benefit of specifying this field is that Cloud ML validates the path for use in training. | Yes | GCSPath | | None |\n", + "| python_version | The version of Python used in training. If it is not set, the default version is 2.7. Python 3.5 is available when the runtime version is set to 1.4 and above. | Yes | String | | None |\n", + "| runtime_version | The runtime version of Cloud ML Engine to use for training. If it is not set, Cloud ML Engine uses the default. | Yes | String | | 1 |\n", + "| master_image_uri | The Docker image to run on the master replica. This image must be in Container Registry. | Yes | GCRPath | | None |\n", + "| worker_image_uri | The Docker image to run on the worker replica. This image must be in Container Registry. | Yes | GCRPath | | None |\n", + "| training_input | The input parameters to create a training job. | Yes | Dict | [TrainingInput](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#TrainingInput) | None |\n", + "| job_id_prefix | The prefix of the job ID that is generated. | Yes | String | | None |\n", + "| wait_interval | The number of seconds to wait between API calls to get the status of the job. | Yes | Integer | | 30 |\n", + "\n", + "\n", + "\n", + "## Input data schema\n", + "\n", + "The component accepts two types of inputs:\n", + "* A list of Python packages from Cloud Storage.\n", + " * You can manually build a Python package and upload it to Cloud Storage by following this [guide](https://cloud.google.com/ml-engine/docs/tensorflow/packaging-trainer#manual-build).\n", + "* A Docker container from Container Registry. \n", + " * Follow this [guide](https://cloud.google.com/ml-engine/docs/using-containers) to publish and use a Docker container with this component.\n", + "\n", + "## Output\n", + "| Name | Description | Type |\n", + "|:------- |:---- | :--- |\n", + "| job_id | The ID of the created job. | String |\n", + "| job_dir | The Cloud Storage path that contains the trained model output files. | GCSPath |\n", "\n", - "## Outputs\n", - "Name | Description | Type\n", - ":--- | :---------- | :---\n", - "job_id | The ID of the created job. | String\n", - "job_dir | The output path in Cloud Storage of the trainning job, which contains the trained model files. | GCSPath\n", "\n", "## Cautions & requirements\n", "\n", "To use the component, you must:\n", - "* Setup cloud environment by following the [guide](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#setup).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. 
For example:\n", "\n", - "```python\n", - "mlengine_train_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + "* Set up a cloud environment by following this [guide](https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction#setup).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", "\n", - "```\n", - "* Grant Kubeflow user service account the read access to the Cloud Storage buckets which contains the input data, packages or docker images.\n", - "* Grant Kubeflow user service account the write access to the Cloud Storage bucket of the output directory.\n", + " ```\n", + " mlengine_train_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", "\n", + "* Grant the following access to the Kubeflow user service account: \n", + " * Read access to the Cloud Storage buckets which contain the input data, packages, or Docker images.\n", + " * Write access to the Cloud Storage bucket of the output directory.\n", "\n", - "## Detailed Description\n", + "## Detailed description\n", "\n", - "The component accepts one of the two types of executable inputs:\n", - "* A list of Python packages from Cloud Storage. You may manually build a Python package by following the [guide](https://cloud.google.com/ml-engine/docs/tensorflow/packaging-trainer#manual-build) and [upload it to Cloud Storage](https://cloud.google.com/ml-engine/docs/tensorflow/packaging-trainer#uploading_packages_manually).\n", - "* Docker container from Google Container Registry (GCR). Follow the [guide](https://cloud.google.com/ml-engine/docs/using-containers) to publish and use a Docker container with this component. \n", + "The component builds the [TrainingInput](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#TrainingInput) payload and submits a job via the [Cloud ML Engine REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs).\n", "\n", - "The component builds the payload of a [TrainingInput](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#TrainingInput) and submit a job by Cloud Machine Learning Engine REST API.\n", + "The steps to use the component in a pipeline are:\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "\n", + "1. 
Install the Kubeflow Pipeline SDK:\n" ] }, { @@ -95,18 +112,12 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/ml_engine/_train.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/ml_engine/train/sample.ipynb)\n", - "* [Cloud Machine Learning Engine job REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs)\n", - "\n", - "\n", "### Sample\n", - "Note: The following sample code works in IPython notebook or directly in Python code.\n", + "Note: The following sample code works in an IPython notebook or directly in Python code.\n", "\n", - "In this sample, we use the code from [census estimator sample](https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/estimator) to train a model in Cloud Machine Learning Engine service. In order to pass the code to the service, we need to package the python code and upload it in a Cloud Storage bucket. Make sure that you have read and write permissions on the bucket that you use as the working directory. \n", + "In this sample, you use the code from the [census estimator sample](https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/estimator) to train a model in Cloud ML Engine. To upload the code to Cloud ML Engine, package the Python code and upload it to a Cloud Storage bucket. \n", "\n", + "Note: You must have read and write permissions on the bucket that you use as the working directory.\n", "#### Set sample parameters" ] }, @@ -301,10 +312,6 @@ "source": [ "#### Inspect the results\n", "\n", - "Follow the `Run` link to open the KFP UI. In the step logs, you should be able to click on the links to:\n", - "* Job dashboard\n", - "* And realtime logs on Stackdriver\n", - "\n", "Use the following command to inspect the contents in the output directory:" ] }, @@ -316,6 +323,20 @@ "source": [ "!gsutil ls $OUTPUT_GCS_PATH" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/ml_engine/_train.py)\n", + "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/ml_engine/train/sample.ipynb)\n", + "* [Cloud Machine Learning Engine job REST API](https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." + ] } ], "metadata": { @@ -334,7 +355,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4,