From 1115fa582339c7667dab4d080a180475a305a6dd Mon Sep 17 00:00:00 2001 From: hongye-sun <43763191+hongye-sun@users.noreply.github.com> Date: Thu, 18 Apr 2019 11:22:00 -0700 Subject: [PATCH] Apply latest doc review changes to github docs (#1128) * Apply latest doc review changes to github docs * merge changes from tech writer * adding missing dataproc components --- components/gcp/bigquery/query/README.md | 105 ++++++++----- components/gcp/bigquery/query/sample.ipynb | 113 ++++++++----- .../gcp/dataflow/launch_python/README.md | 112 +++++++------ .../gcp/dataflow/launch_python/sample.ipynb | 115 ++++++++------ .../gcp/dataflow/launch_template/README.md | 100 +++++++----- .../gcp/dataflow/launch_template/sample.ipynb | 102 +++++++----- .../gcp/dataproc/create_cluster/README.md | 94 ++++++----- .../gcp/dataproc/create_cluster/sample.ipynb | 110 +++++++------ .../gcp/dataproc/delete_cluster/README.md | 72 +++++---- .../gcp/dataproc/delete_cluster/sample.ipynb | 94 ++++++----- .../gcp/dataproc/submit_hadoop_job/README.md | 109 +++++++------ .../dataproc/submit_hadoop_job/sample.ipynb | 123 +++++++++------ .../gcp/dataproc/submit_hive_job/README.md | 85 +++++----- .../gcp/dataproc/submit_hive_job/sample.ipynb | 101 +++++++----- .../gcp/dataproc/submit_pig_job/README.md | 89 ++++++----- .../gcp/dataproc/submit_pig_job/sample.ipynb | 105 +++++++------ .../gcp/dataproc/submit_pyspark_job/README.md | 87 +++++----- .../dataproc/submit_pyspark_job/sample.ipynb | 101 +++++++----- .../gcp/dataproc/submit_spark_job/README.md | 104 +++++++----- .../dataproc/submit_spark_job/sample.ipynb | 123 +++++++++------ .../dataproc/submit_sparksql_job/README.md | 71 +++++---- .../dataproc/submit_sparksql_job/sample.ipynb | 86 +++++----- .../gcp/ml_engine/batch_predict/README.md | 104 +++++++----- .../gcp/ml_engine/batch_predict/sample.ipynb | 111 ++++++++----- components/gcp/ml_engine/deploy/README.md | 141 +++++++++++------ components/gcp/ml_engine/deploy/sample.ipynb | 148 ++++++++++++------ components/gcp/ml_engine/train/README.md | 120 ++++++++------ components/gcp/ml_engine/train/sample.ipynb | 125 +++++++++------ 28 files changed, 1767 insertions(+), 1183 deletions(-) diff --git a/components/gcp/bigquery/query/README.md b/components/gcp/bigquery/query/README.md index ea6b36faf19..f42dff1e85e 100644 --- a/components/gcp/bigquery/query/README.md +++ b/components/gcp/bigquery/query/README.md @@ -1,49 +1,78 @@ -# Submitting a query using BigQuery -A Kubeflow Pipeline component to submit a query to Google Cloud Bigquery service and dump outputs to a Google Cloud Storage blob. +# Name -## Intended Use -The component is intended to export query data from BiqQuery service to Cloud Storage. +Gather training data by querying BigQuery -## Runtime arguments -Name | Description | Data type | Optional | Default -:--- | :---------- | :-------- | :------- | :------ -query | The query used by Bigquery service to fetch the results. | String | No | -project_id | The project to execute the query job. | GCPProjectID | No | -dataset_id | The ID of the persistent dataset to keep the results of the query. If the dataset does not exist, the operation will create a new one. | String | Yes | ` ` -table_id | The ID of the table to keep the results of the query. If absent, the operation will generate a random id for the table. | String | Yes | ` ` -output_gcs_path | The path to the Cloud Storage bucket to store the query output. | GCSPath | Yes | ` ` -dataset_location | The location to create the dataset. Defaults to `US`. 
| String | Yes | `US` -job_config | The full config spec for the query job. See [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) for details. | Dict | Yes | ` ` +# Labels + +GCP, BigQuery, Kubeflow, Pipeline + + +# Summary + +A Kubeflow Pipeline component to submit a query to BigQuery and store the result in a Cloud Storage bucket. + + +# Details + + +## Intended use + +Use this Kubeflow component to: +* Select training data by submitting a query to BigQuery. +* Output the training data into a Cloud Storage bucket as CSV files. + + +## Runtime arguments: + + +| Argument | Description | Optional | Data type | Accepted values | Default | +|----------|-------------|----------|-----------|-----------------|---------| +| query | The query used by BigQuery to fetch the results. | No | String | | | +| project_id | The project ID of the Google Cloud Platform (GCP) project to use to execute the query. | No | GCPProjectID | | | +| dataset_id | The ID of the persistent BigQuery dataset to store the results of the query. If the dataset does not exist, the operation will create a new one. | Yes | String | | None | +| table_id | The ID of the BigQuery table to store the results of the query. If the table ID is absent, the operation will generate a random ID for the table. | Yes | String | | None | +| output_gcs_path | The path to the Cloud Storage bucket to store the query output. | Yes | GCSPath | | None | +| dataset_location | The location where the dataset is created. Defaults to US. | Yes | String | | US | +| job_config | The full configuration specification for the query job. See [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) for details. | Yes | Dict | A JSONobject which has the same structure as [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) | None | +## Input data schema + +The input data is a BigQuery job containing a query that pulls data f rom various sources. + + +## Output: -## Outputs Name | Description | Type :--- | :---------- | :--- output_gcs_path | The path to the Cloud Storage bucket containing the query output in CSV format. | GCSPath -## Cautions and requirements +## Cautions & requirements + To use the component, the following requirements must be met: -* BigQuery API is enabled -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -```python -bigquery_query_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) +* The BigQuery API is enabled. +* The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow Pipeline cluster. For example: -``` + ``` + bigquery_query_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* The Kubeflow user service account is a member of the `roles/bigquery.admin` role of the project. +* The Kubeflow user service account is a member of the `roles/storage.objectCreator `role of the Cloud Storage output bucket. -* The Kubeflow user service account is a member of `roles/bigquery.admin` role of the project. 
-* The Kubeflow user service account is also a member of `roles/storage.objectCreator` role of the Cloud Storage output bucket. +## Detailed description +This Kubeflow Pipeline component is used to: +* Submit a query to BigQuery. + * The query results are persisted in a dataset table in BigQuery. + * An extract job is created in BigQuery to extract the data from the dataset table and output it to a Cloud Storage bucket as CSV files. -## Detailed Description -The component does several things: -1. Creates persistent dataset and table if they do not exist. -1. Submits a query to BigQuery service and persists the result to the table. -1. Creates an extraction job to output the table data to a Cloud Storage bucket in CSV format. + Use the code below as an example of how to run your BigQuery job. -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +### Sample +Note: The following sample code works in an IPython notebook or directly in Python code. + +#### Set sample parameters ```python @@ -64,13 +93,6 @@ bigquery_query_op = comp.load_component_from_url( help(bigquery_query_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/bigquery/_query.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb) -* [BigQuery query REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query) - - ### Sample Note: The following sample code works in IPython notebook or directly in Python code. @@ -161,3 +183,12 @@ run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arg ```python !gsutil cat OUTPUT_PATH ``` + +## References +* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/bigquery/_query.py) +* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb) +* [BigQuery query REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query) + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/bigquery/query/sample.ipynb b/components/gcp/bigquery/query/sample.ipynb index ee1945c637c..9da2362ef87 100644 --- a/components/gcp/bigquery/query/sample.ipynb +++ b/components/gcp/bigquery/query/sample.ipynb @@ -4,50 +4,80 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a query using BigQuery \n", - "A Kubeflow Pipeline component to submit a query to Google Cloud Bigquery service and dump outputs to a Google Cloud Storage blob. \n", + "# Name\n", "\n", - "## Intended Use\n", - "The component is intended to export query data from BiqQuery service to Cloud Storage. 
\n", + "Gather training data by querying BigQuery \n", "\n", - "## Runtime arguments\n", - "Name | Description | Data type | Optional | Default\n", - ":--- | :---------- | :-------- | :------- | :------\n", - "query | The query used by Bigquery service to fetch the results. | String | No |\n", - "project_id | The project to execute the query job. | GCPProjectID | No |\n", - "dataset_id | The ID of the persistent dataset to keep the results of the query. If the dataset does not exist, the operation will create a new one. | String | Yes | ` `\n", - "table_id | The ID of the table to keep the results of the query. If absent, the operation will generate a random id for the table. | String | Yes | ` `\n", - "output_gcs_path | The path to the Cloud Storage bucket to store the query output. | GCSPath | Yes | ` `\n", - "dataset_location | The location to create the dataset. Defaults to `US`. | String | Yes | `US`\n", - "job_config | The full config spec for the query job. See [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) for details. | Dict | Yes | ` `\n", "\n", + "# Labels\n", + "\n", + "GCP, BigQuery, Kubeflow, Pipeline\n", + "\n", + "\n", + "# Summary\n", + "\n", + "A Kubeflow Pipeline component to submit a query to BigQuery and store the result in a Cloud Storage bucket.\n", + "\n", + "\n", + "# Details\n", + "\n", + "\n", + "## Intended use\n", + "\n", + "Use this Kubeflow component to:\n", + "* Select training data by submitting a query to BigQuery.\n", + "* Output the training data into a Cloud Storage bucket as CSV files.\n", + "\n", + "\n", + "## Runtime arguments:\n", + "\n", + "\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|----------|-------------|----------|-----------|-----------------|---------|\n", + "| query | The query used by BigQuery to fetch the results. | No | String | | |\n", + "| project_id | The project ID of the Google Cloud Platform (GCP) project to use to execute the query. | No | GCPProjectID | | |\n", + "| dataset_id | The ID of the persistent BigQuery dataset to store the results of the query. If the dataset does not exist, the operation will create a new one. | Yes | String | | None |\n", + "| table_id | The ID of the BigQuery table to store the results of the query. If the table ID is absent, the operation will generate a random ID for the table. | Yes | String | | None |\n", + "| output_gcs_path | The path to the Cloud Storage bucket to store the query output. | Yes | GCSPath | | None |\n", + "| dataset_location | The location where the dataset is created. Defaults to US. | Yes | String | | US |\n", + "| job_config | The full configuration specification for the query job. See [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) for details. | Yes | Dict | A JSONobject which has the same structure as [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) | None |\n", + "## Input data schema\n", + "\n", + "The input data is a BigQuery job containing a query that pulls data f rom various sources. 
\n", + "\n", + "\n", + "## Output:\n", "\n", - "## Outputs\n", "Name | Description | Type\n", ":--- | :---------- | :---\n", "output_gcs_path | The path to the Cloud Storage bucket containing the query output in CSV format. | GCSPath\n", "\n", - "## Cautions and requirements\n", + "## Cautions & requirements\n", + "\n", "To use the component, the following requirements must be met:\n", - "* BigQuery API is enabled\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", "\n", - "```python\n", - "bigquery_query_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + "* The BigQuery API is enabled.\n", + "* The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow Pipeline cluster. For example:\n", + "\n", + " ```\n", + " bigquery_query_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* The Kubeflow user service account is a member of the `roles/bigquery.admin` role of the project.\n", + "* The Kubeflow user service account is a member of the `roles/storage.objectCreator `role of the Cloud Storage output bucket.\n", + "\n", + "## Detailed description\n", + "This Kubeflow Pipeline component is used to:\n", + "* Submit a query to BigQuery.\n", + " * The query results are persisted in a dataset table in BigQuery.\n", + " * An extract job is created in BigQuery to extract the data from the dataset table and output it to a Cloud Storage bucket as CSV files.\n", "\n", - "```\n", + " Use the code below as an example of how to run your BigQuery job.\n", "\n", - "* The Kubeflow user service account is a member of `roles/bigquery.admin` role of the project.\n", - "* The Kubeflow user service account is also a member of `roles/storage.objectCreator` role of the Cloud Storage output bucket.\n", + "### Sample\n", "\n", - "## Detailed Description\n", - "The component does several things:\n", - "1. Creates persistent dataset and table if they do not exist.\n", - "1. Submits a query to BigQuery service and persists the result to the table.\n", - "1. Creates an extraction job to output the table data to a Cloud Storage bucket in CSV format.\n", + "Note: The following sample code works in an IPython notebook or directly in Python code.\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. 
Install KFP SDK\n" + "#### Set sample parameters" ] }, { @@ -86,13 +116,6 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/bigquery/_query.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb)\n", - "* [BigQuery query REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query)\n", - "\n", - "\n", "### Sample\n", "\n", "Note: The following sample code works in IPython notebook or directly in Python code.\n", @@ -241,6 +264,20 @@ "source": [ "!gsutil cat OUTPUT_PATH" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/bigquery/_query.py)\n", + "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb)\n", + "* [BigQuery query REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." + ] } ], "metadata": { @@ -259,7 +296,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataflow/launch_python/README.md b/components/gcp/dataflow/launch_python/README.md index 9d6490db9c7..514609a8a39 100644 --- a/components/gcp/dataflow/launch_python/README.md +++ b/components/gcp/dataflow/launch_python/README.md @@ -1,54 +1,65 @@ -# Executing an Apache Beam Python job in Cloud Dataflow -A Kubeflow Pipeline component that submits an Apache Beam job (authored in Python) to Cloud Dataflow for execution. The Python Beam code is run with the Cloud Dataflow Runner. +# Name +Data preparation by executing an Apache Beam job in Cloud Dataflow -## Intended Use -Use this component to run a Python Beam code to submit a Dataflow job as a step of a KFP pipeline. The component will wait until the job finishes. +# Labels +GCP, Cloud Dataflow, Apache Beam, Python, Kubeflow + +# Summary +A Kubeflow Pipeline component that prepares data by submitting an Apache Beam job (authored in Python) to Cloud Dataflow for execution. The Python Beam code is run with Cloud Dataflow Runner. + +# Details +## Intended use + +Use this component to run a Python Beam code to submit a Cloud Dataflow job as a step of a Kubeflow pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -python_file_path | The Cloud Storage or the local path to the python file being run. | String | No | -project_id | The ID of the parent project of the Dataflow job. | GCPProjectID | No | -staging_dir | The Cloud Storage directory for keeping staging files. 
A random subdirectory will be created under the directory to keep job info for resuming the job in case of failure and it will be passed as `staging_location` and `temp_location` command line args of the beam code. | GCSPath | Yes | ` ` -requirements_file_path | The Cloud Storageor the local path to the pip requirements file. | String | Yes | ` ` -args | The list of arguments to pass to the python file. | List | Yes | `[]` -wait_interval | The seconds to wait between calls to get the job status. | Integer | Yes | `30` - -## Output: -Name | Description | Type -:--- | :---------- | :--- -job_id | The id of the created dataflow job. | String - -## Cautions and requirements +Name | Description | Optional | Data type| Accepted values | Default | +:--- | :----------| :----------| :----------| :----------| :---------- | +python_file_path | The path to the Cloud Storage bucket or local directory containing the Python file to be run. | | GCSPath | | | +project_id | The ID of the Google Cloud Platform (GCP) project containing the Cloud Dataflow job.| | GCPProjectID | | | +staging_dir | The path to the Cloud Storage directory where the staging files are stored. A random subdirectory will be created under the staging directory to keep the job information.This is done so that you can resume the job in case of failure. `staging_dir` is passed as the command line arguments (`staging_location` and `temp_location`) of the Beam code. | Yes | GCPPath | | None | +requirements_file_path | The path to the Cloud Storage bucket or local directory containing the pip requirements file. | Yes | GCSPath | | None | +args | The list of arguments to pass to the Python file. | No | List | A list of string arguments | None | +wait_interval | The number of seconds to wait between calls to get the status of the job. | Yes | Integer | | 30 | + +## Input data schema + +Before you use the component, the following files must be ready in a Cloud Storage bucket: +- A Beam Python code file. +- A `requirements.txt` file which includes a list of dependent packages. + +The Beam Python code should follow the [Beam programming guide](https://beam.apache.org/documentation/programming-guide/) as well as the following additional requirements to be compatible with this component: +- It accepts the command line arguments `--project`, `--temp_location`, `--staging_location`, which are [standard Dataflow Runner options](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#setting-other-cloud-pipeline-options). +- It enables `info logging` before the start of a Cloud Dataflow job in the Python code. This is important to allow the component to track the status and ID of the job that is created. For example, calling `logging.getLogger().setLevel(logging.INFO)` before any other code. + + +## Output +Name | Description +:--- | :---------- +job_id | The id of the Cloud Dataflow job that is created. + +## Cautions & requirements To use the components, the following requirements must be met: -* Dataflow API is enabled. -* The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a KFP cluster. For example: +- Cloud Dataflow API is enabled. +- The component is running under a secret Kubeflow user service account in a Kubeflow Pipeline cluster. For example: ``` component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) ``` -* The Kubeflow user service account is a member of `roles/dataflow.developer` role of the project. 
-* The Kubeflow user service account is a member of `roles/storage.objectViewer` role of the Cloud Storage Objects `python_file_path` and `requirements_file_path`. -* The Kubeflow user service account is a member of `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir`. +The Kubeflow user service account is a member of: +- `roles/dataflow.developer` role of the project. +- `roles/storage.objectViewer` role of the Cloud Storage Objects `python_file_path` and `requirements_file_path`. +- `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir`. ## Detailed description -Before using the component, make sure the following files are prepared in a Cloud Storage bucket. -* A Beam Python code file. -* A `requirements.txt` file which includes a list of dependent packages. - -The Beam Python code should follow [Beam programing model](https://beam.apache.org/documentation/programming-guide/) and the following additional requirements to be compatible with this component: -* It accepts command line arguments: `--project`, `--temp_location`, `--staging_location`, which are [standard Dataflow Runner options](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#setting-other-cloud-pipeline-options). -* Enable info logging before the start of a Dataflow job in the Python code. This is important to allow the component to track the status and ID of create job. For example: calling `logging.getLogger().setLevel(logging.INFO)` before any other code. - The component does several things during the execution: -* Download `python_file_path` and `requirements_file_path` to local files. -* Start a subprocess to launch the Python program. -* Monitor the logs produced from the subprocess to extract Dataflow job information. -* Store Dataflow job information in `staging_dir` so the job can be resumed in case of failure. -* Wait for the job to finish. - -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +- Downloads `python_file_path` and `requirements_file_path` to local files. +- Starts a subprocess to launch the Python program. +- Monitors the logs produced from the subprocess to extract the Cloud Dataflow job information. +- Stores the Cloud Dataflow job information in `staging_dir` so the job can be resumed in case of failure. +- Waits for the job to finish. +The steps to use the component in a pipeline are: +1. Install the Kubeflow Pipelines SDK: @@ -70,17 +81,9 @@ dataflow_python_op = comp.load_component_from_url( help(dataflow_python_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_python.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_python/sample.ipynb) -* [Dataflow Python Quickstart](https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python) - ### Sample - -Note: the sample code below works in both IPython notebook or python code directly. - -In this sample, we run a wordcount sample code in a KFP pipeline. The output will be stored in a Cloud Storage bucket. Here is the sample code: +Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template. 
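Before walking through the wordcount sample, the following minimal sketch shows what a Beam Python file launched by this component could look like. It is illustrative only and is not part of the component: it demonstrates just the two requirements listed under Input data schema (accepting the standard Dataflow options and enabling info logging), while the `--output` argument and the trivial transform are placeholders.

```python
import argparse
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    # Enable info logging first so the component can detect the Cloud Dataflow
    # job ID from the launcher's log output.
    logging.getLogger().setLevel(logging.INFO)

    parser = argparse.ArgumentParser()
    parser.add_argument('--output', required=True)  # placeholder pipeline-specific argument
    known_args, pipeline_args = parser.parse_known_args(argv)

    # --project, --temp_location, and --staging_location supplied by the
    # component are consumed here as standard pipeline options.
    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as p:
        (p
         | 'Create' >> beam.Create(['hello dataflow'])          # placeholder input
         | 'Write' >> beam.io.WriteToText(known_args.output))   # placeholder sink


if __name__ == '__main__':
    run()
```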
+In this sample, we run a wordcount sample code in a Kubeflow Pipeline. The output will be stored in a Cloud Storage bucket. Here is the sample code: ```python @@ -292,3 +295,12 @@ run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arg ```python !gsutil cat $OUTPUT_FILE ``` + +## References +* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_python.py) +* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_python/sample.ipynb) +* [Dataflow Python Quickstart](https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python) + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataflow/launch_python/sample.ipynb b/components/gcp/dataflow/launch_python/sample.ipynb index 93113512c4d..61a663439ec 100644 --- a/components/gcp/dataflow/launch_python/sample.ipynb +++ b/components/gcp/dataflow/launch_python/sample.ipynb @@ -4,56 +4,67 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Executing an Apache Beam Python job in Cloud Dataflow\n", - "A Kubeflow Pipeline component that submits an Apache Beam job (authored in Python) to Cloud Dataflow for execution. The Python Beam code is run with the Cloud Dataflow Runner.\n", + "# Name\n", + "Data preparation by executing an Apache Beam job in Cloud Dataflow\n", "\n", - "## Intended Use\n", - "Use this component to run a Python Beam code to submit a Dataflow job as a step of a KFP pipeline. The component will wait until the job finishes.\n", + "# Labels\n", + "GCP, Cloud Dataflow, Apache Beam, Python, Kubeflow\n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component that prepares data by submitting an Apache Beam job (authored in Python) to Cloud Dataflow for execution. The Python Beam code is run with Cloud Dataflow Runner.\n", + "\n", + "# Details\n", + "## Intended use\n", + "\n", + "Use this component to run a Python Beam code to submit a Cloud Dataflow job as a step of a Kubeflow pipeline. \n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "python_file_path | The Cloud Storage or the local path to the python file being run. | String | No |\n", - "project_id | The ID of the parent project of the Dataflow job. | GCPProjectID | No |\n", - "staging_dir | The Cloud Storage directory for keeping staging files. A random subdirectory will be created under the directory to keep job info for resuming the job in case of failure and it will be passed as `staging_location` and `temp_location` command line args of the beam code. | GCSPath | Yes | ` `\n", - "requirements_file_path | The Cloud Storageor the local path to the pip requirements file. | String | Yes | ` `\n", - "args | The list of arguments to pass to the python file. | List | Yes | `[]`\n", - "wait_interval | The seconds to wait between calls to get the job status. 
| Integer | Yes | `30`\n", + "Name | Description | Optional | Data type| Accepted values | Default |\n", + ":--- | :----------| :----------| :----------| :----------| :---------- |\n", + "python_file_path | The path to the Cloud Storage bucket or local directory containing the Python file to be run. | | GCSPath | | |\n", + "project_id | The ID of the Google Cloud Platform (GCP) project containing the Cloud Dataflow job.| | GCPProjectID | | |\n", + "staging_dir | The path to the Cloud Storage directory where the staging files are stored. A random subdirectory will be created under the staging directory to keep the job information.This is done so that you can resume the job in case of failure. `staging_dir` is passed as the command line arguments (`staging_location` and `temp_location`) of the Beam code. | Yes | GCPPath | | None |\n", + "requirements_file_path | The path to the Cloud Storage bucket or local directory containing the pip requirements file. | Yes | GCSPath | | None |\n", + "args | The list of arguments to pass to the Python file. | No | List | A list of string arguments | None |\n", + "wait_interval | The number of seconds to wait between calls to get the status of the job. | Yes | Integer | | 30 |\n", + "\n", + "## Input data schema\n", + "\n", + "Before you use the component, the following files must be ready in a Cloud Storage bucket:\n", + "- A Beam Python code file.\n", + "- A `requirements.txt` file which includes a list of dependent packages.\n", + "\n", + "The Beam Python code should follow the [Beam programming guide](https://beam.apache.org/documentation/programming-guide/) as well as the following additional requirements to be compatible with this component:\n", + "- It accepts the command line arguments `--project`, `--temp_location`, `--staging_location`, which are [standard Dataflow Runner options](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#setting-other-cloud-pipeline-options).\n", + "- It enables `info logging` before the start of a Cloud Dataflow job in the Python code. This is important to allow the component to track the status and ID of the job that is created. For example, calling `logging.getLogger().setLevel(logging.INFO)` before any other code.\n", + "\n", "\n", - "## Output:\n", - "Name | Description | Type\n", - ":--- | :---------- | :---\n", - "job_id | The id of the created dataflow job. | String\n", + "## Output\n", + "Name | Description\n", + ":--- | :----------\n", + "job_id | The id of the Cloud Dataflow job that is created.\n", "\n", - "## Cautions and requirements\n", + "## Cautions & requirements\n", "To use the components, the following requirements must be met:\n", - "* Dataflow API is enabled.\n", - "* The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a KFP cluster. For example:\n", + "- Cloud Dataflow API is enabled.\n", + "- The component is running under a secret Kubeflow user service account in a Kubeflow Pipeline cluster. 
For example:\n", "```\n", "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", "```\n", - "* The Kubeflow user service account is a member of `roles/dataflow.developer` role of the project.\n", - "* The Kubeflow user service account is a member of `roles/storage.objectViewer` role of the Cloud Storage Objects `python_file_path` and `requirements_file_path`.\n", - "* The Kubeflow user service account is a member of `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir`.\n", + "The Kubeflow user service account is a member of:\n", + "- `roles/dataflow.developer` role of the project.\n", + "- `roles/storage.objectViewer` role of the Cloud Storage Objects `python_file_path` and `requirements_file_path`.\n", + "- `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir`. \n", "\n", "## Detailed description\n", - "Before using the component, make sure the following files are prepared in a Cloud Storage bucket.\n", - "* A Beam Python code file.\n", - "* A `requirements.txt` file which includes a list of dependent packages.\n", - "\n", - "The Beam Python code should follow [Beam programing model](https://beam.apache.org/documentation/programming-guide/) and the following additional requirements to be compatible with this component:\n", - "* It accepts command line arguments: `--project`, `--temp_location`, `--staging_location`, which are [standard Dataflow Runner options](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#setting-other-cloud-pipeline-options).\n", - "* Enable info logging before the start of a Dataflow job in the Python code. This is important to allow the component to track the status and ID of create job. For example: calling `logging.getLogger().setLevel(logging.INFO)` before any other code.\n", - "\n", "The component does several things during the execution:\n", - "* Download `python_file_path` and `requirements_file_path` to local files.\n", - "* Start a subprocess to launch the Python program.\n", - "* Monitor the logs produced from the subprocess to extract Dataflow job information.\n", - "* Store Dataflow job information in `staging_dir` so the job can be resumed in case of failure.\n", - "* Wait for the job to finish.\n", - "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "- Downloads `python_file_path` and `requirements_file_path` to local files.\n", + "- Starts a subprocess to launch the Python program.\n", + "- Monitors the logs produced from the subprocess to extract the Cloud Dataflow job information.\n", + "- Stores the Cloud Dataflow job information in `staging_dir` so the job can be resumed in case of failure.\n", + "- Waits for the job to finish.\n", + "The steps to use the component in a pipeline are:\n", + "1. 
Install the Kubeflow Pipelines SDK:\n" ] }, { @@ -92,17 +103,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_python.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_python/sample.ipynb)\n", - "* [Dataflow Python Quickstart](https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python)\n", - "\n", "### Sample\n", - "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", - "\n", - "In this sample, we run a wordcount sample code in a KFP pipeline. The output will be stored in a Cloud Storage bucket. Here is the sample code:" + "Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n", + "In this sample, we run a wordcount sample code in a Kubeflow Pipeline. The output will be stored in a Cloud Storage bucket. Here is the sample code:" ] }, { @@ -377,6 +380,20 @@ "source": [ "!gsutil cat $OUTPUT_FILE" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_python.py)\n", + "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_python/sample.ipynb)\n", + "* [Dataflow Python Quickstart](https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." + ] } ], "metadata": { @@ -395,7 +412,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataflow/launch_template/README.md b/components/gcp/dataflow/launch_template/README.md index cf5240af1f2..d04adad6363 100644 --- a/components/gcp/dataflow/launch_template/README.md +++ b/components/gcp/dataflow/launch_template/README.md @@ -1,43 +1,55 @@ -# Submitting a job to Cloud Dataflow service using a template -A Kubeflow Pipeline component to submit a job from a dataflow template to Cloud Dataflow service. +# Name +Data preparation by using a template to submit a job to Cloud Dataflow -## Intended Use +# Labels +GCP, Cloud Dataflow, Kubeflow, Pipeline -A Kubeflow Pipeline component to submit a job from a dataflow template to Google Cloud Dataflow service. +# Summary +A Kubeflow Pipeline component to prepare data by using a template to submit a job to Cloud Dataflow. + +# Details + +## Intended use +Use this component when you have a pre-built Cloud Dataflow template and want to launch it as a step in a Kubeflow Pipeline. 
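For orientation, the sketch below shows roughly how a pre-built template could be launched as a pipeline step. It assumes that `dataflow_template_op` has already been loaded as shown later in this document; the project ID, bucket paths, and the Word Count template's `inputFile`/`output` parameter names are placeholder assumptions, and `launch_parameters` simply mirrors the [LaunchTemplateParameters](https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters) structure described in the arguments table that follows.

```python
import kfp.dsl as dsl
import kfp.gcp as gcp


@dsl.pipeline(
    name='Dataflow launch template sketch',
    description='Illustrative only; all values are placeholders.'
)
def dataflow_template_sketch(
        project_id='my-project',                                # placeholder project ID
        output_path='gs://my-bucket/wordcount/out'):            # placeholder output prefix
    dataflow_template_op(
        project_id=project_id,
        gcs_path='gs://dataflow-templates/latest/Word_Count',
        # Mirrors LaunchTemplateParameters; inputFile/output are assumed to be
        # the parameter names expected by the Word Count template.
        launch_parameters={
            'parameters': {
                'inputFile': 'gs://my-bucket/input/text.txt',   # placeholder input file
                'output': output_path
            }
        },
        staging_dir='gs://my-bucket/staging',                   # placeholder staging dir
        wait_interval=30
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))
```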
## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The ID of the Cloud Platform project to which the job belongs. | GCPProjectID | No | -gcs_path | A Cloud Storage path to the job creation template. It must be a valid Cloud Storage URL beginning with `gs://`. | GCSPath | No | -launch_parameters | The parameters that are required for the template being launched. The Schema is defined in [LaunchTemplateParameters Parameters](https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters). | Dict | Yes | `{}` -location | The regional endpoint to which the job request is directed. | GCPRegion | Yes | `` -validate_only | If true, the request is validated but not actually executed. | Bool | Yes | `False` -staging_dir | The Cloud Storage path for keeping staging files. A random subdirectory will be created under the directory to keep job info for resuming the job in case of failure. | GCSPath | Yes | `` -wait_interval | The seconds to wait between calls to get the job status. | Integer | Yes |`30` - -## Output: -Name | Description | Type -:--- | :---------- | :--- -job_id | The id of the created dataflow job. | String - -## Cautions and requirements -To use the components, the following requirements must be met: -* Dataflow API is enabled. -* The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a KFP cluster. For example: -``` -component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) -``` -* The Kubeflow user service account is a member of `roles/dataflow.developer` role of the project. -* The Kubeflow user service account is a member of `roles/storage.objectViewer` role of the Cloud Storage Object `gcs_path`. -* The Kubeflow user service account is a member of `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir`. +Argument | Description | Optional | Data type | Accepted values | Default | +:--- | :---------- | :----------| :----------| :---------- | :----------| +project_id | The ID of the Google Cloud Platform (GCP) project to which the job belongs. | No | GCPProjectID | | | +gcs_path | The path to a Cloud Storage bucket containing the job creation template. It must be a valid Cloud Storage URL beginning with 'gs://'. | No | GCSPath | | | +launch_parameters | The parameters that are required to launch the template. The schema is defined in [LaunchTemplateParameters](https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters). The parameter `jobName` is replaced by a generated name. | Yes | Dict | A JSON object which has the same structure as [LaunchTemplateParameters](https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters) | None | +location | The regional endpoint to which the job request is directed.| Yes | GCPRegion | | None | +staging_dir | The path to the Cloud Storage directory where the staging files are stored. A random subdirectory will be created under the staging directory to keep the job information. This is done so that you can resume the job in case of failure.| Yes | GCSPath | | None | +validate_only | If True, the request is validated but not executed. | Yes | Boolean | | False | +wait_interval | The number of seconds to wait between calls to get the status of the job. | Yes | Integer | | 30 | + +## Input data schema + +The input `gcs_path` must contain a valid Cloud Dataflow template. 
The template can be created by following the instructions in [Creating Templates](https://cloud.google.com/dataflow/docs/guides/templates/creating-templates). You can also use [Google-provided templates](https://cloud.google.com/dataflow/docs/guides/templates/provided-templates). + +## Output +Name | Description +:--- | :---------- +job_id | The id of the Cloud Dataflow job that is created. + +## Caution & requirements + +To use the component, the following requirements must be met: +- Cloud Dataflow API is enabled. +- The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow Pipeline cluster. For example: + ``` + component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* The Kubeflow user service account is a member of: + - `roles/dataflow.developer` role of the project. + - `roles/storage.objectViewer` role of the Cloud Storage Object `gcs_path.` + - `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir.` ## Detailed description -The input `gcs_path` must contain a valid Dataflow template. The template can be created by following the guide [Creating Templates](https://cloud.google.com/dataflow/docs/guides/templates/creating-templates). Or, you can use [Google-provided templates](https://cloud.google.com/dataflow/docs/guides/templates/provided-templates). - -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +You can execute the template locally by following the instructions in [Executing Templates](https://cloud.google.com/dataflow/docs/guides/templates/executing-templates). See the sample code below to learn how to execute the template. +Follow these steps to use the component in a pipeline: +1. Install the Kubeflow Pipeline SDK: @@ -59,17 +71,10 @@ dataflow_template_op = comp.load_component_from_url( help(dataflow_template_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_template.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_template/sample.ipynb) -* [Cloud Dataflow Templates overview](https://cloud.google.com/dataflow/docs/guides/templates/overview) - ### Sample -Note: the sample code below works in both IPython notebook or python code directly. - -In this sample, we run a Google provided word count template from `gs://dataflow-templates/latest/Word_Count`. The template takes a text file as input and output word counts to a Cloud Storage bucket. Here is the sample input: +Note: The following sample code works in an IPython notebook or directly in Python code. +In this sample, we run a Google-provided word count template from `gs://dataflow-templates/latest/Word_Count`. The template takes a text file as input and outputs word counts to a Cloud Storage bucket. 
Here is the sample input: ```python @@ -159,3 +164,14 @@ run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arg ```python !gsutil cat $OUTPUT_PATH* ``` + +## References + +* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_template.py) +* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_template/sample.ipynb) +* [Cloud Dataflow Templates overview](https://cloud.google.com/dataflow/docs/guides/templates/overview) + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. + diff --git a/components/gcp/dataflow/launch_template/sample.ipynb b/components/gcp/dataflow/launch_template/sample.ipynb index 706d69549a6..ec313804895 100644 --- a/components/gcp/dataflow/launch_template/sample.ipynb +++ b/components/gcp/dataflow/launch_template/sample.ipynb @@ -4,45 +4,57 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a job to Cloud Dataflow service using a template\n", - "A Kubeflow Pipeline component to submit a job from a dataflow template to Cloud Dataflow service.\n", + "# Name\n", + "Data preparation by using a template to submit a job to Cloud Dataflow\n", "\n", - "## Intended Use\n", + "# Labels\n", + "GCP, Cloud Dataflow, Kubeflow, Pipeline\n", "\n", - "A Kubeflow Pipeline component to submit a job from a dataflow template to Google Cloud Dataflow service.\n", + "# Summary\n", + "A Kubeflow Pipeline component to prepare data by using a template to submit a job to Cloud Dataflow.\n", + "\n", + "# Details\n", + "\n", + "## Intended use\n", + "Use this component when you have a pre-built Cloud Dataflow template and want to launch it as a step in a Kubeflow Pipeline.\n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The ID of the Cloud Platform project to which the job belongs. | GCPProjectID | No |\n", - "gcs_path | A Cloud Storage path to the job creation template. It must be a valid Cloud Storage URL beginning with `gs://`. | GCSPath | No |\n", - "launch_parameters | The parameters that are required for the template being launched. The Schema is defined in [LaunchTemplateParameters Parameters](https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters). | Dict | Yes | `{}`\n", - "location | The regional endpoint to which the job request is directed. | GCPRegion | Yes | ``\n", - "validate_only | If true, the request is validated but not actually executed. | Bool | Yes | `False`\n", - "staging_dir | The Cloud Storage path for keeping staging files. A random subdirectory will be created under the directory to keep job info for resuming the job in case of failure. | GCSPath | Yes | ``\n", - "wait_interval | The seconds to wait between calls to get the job status. 
| Integer | Yes |`30`\n", + "Argument | Description | Optional | Data type | Accepted values | Default |\n", + ":--- | :---------- | :----------| :----------| :---------- | :----------|\n", + "project_id | The ID of the Google Cloud Platform (GCP) project to which the job belongs. | No | GCPProjectID | | |\n", + "gcs_path | The path to a Cloud Storage bucket containing the job creation template. It must be a valid Cloud Storage URL beginning with 'gs://'. | No | GCSPath | | |\n", + "launch_parameters | The parameters that are required to launch the template. The schema is defined in [LaunchTemplateParameters](https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters). The parameter `jobName` is replaced by a generated name. | Yes | Dict | A JSON object which has the same structure as [LaunchTemplateParameters](https://cloud.google.com/dataflow/docs/reference/rest/v1b3/LaunchTemplateParameters) | None |\n", + "location | The regional endpoint to which the job request is directed.| Yes | GCPRegion | | None |\n", + "staging_dir | The path to the Cloud Storage directory where the staging files are stored. A random subdirectory will be created under the staging directory to keep the job information. This is done so that you can resume the job in case of failure.| Yes | GCSPath | | None |\n", + "validate_only | If True, the request is validated but not executed. | Yes | Boolean | | False |\n", + "wait_interval | The number of seconds to wait between calls to get the status of the job. | Yes | Integer | | 30 |\n", "\n", - "## Output:\n", - "Name | Description | Type\n", - ":--- | :---------- | :---\n", - "job_id | The id of the created dataflow job. | String\n", + "## Input data schema\n", "\n", - "## Cautions and requirements\n", - "To use the components, the following requirements must be met:\n", - "* Dataflow API is enabled.\n", - "* The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a KFP cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* The Kubeflow user service account is a member of `roles/dataflow.developer` role of the project.\n", - "* The Kubeflow user service account is a member of `roles/storage.objectViewer` role of the Cloud Storage Object `gcs_path`.\n", - "* The Kubeflow user service account is a member of `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir`.\n", + "The input `gcs_path` must contain a valid Cloud Dataflow template. The template can be created by following the instructions in [Creating Templates](https://cloud.google.com/dataflow/docs/guides/templates/creating-templates). You can also use [Google-provided templates](https://cloud.google.com/dataflow/docs/guides/templates/provided-templates).\n", "\n", - "## Detailed description\n", - "The input `gcs_path` must contain a valid Dataflow template. The template can be created by following the guide [Creating Templates](https://cloud.google.com/dataflow/docs/guides/templates/creating-templates). Or, you can use [Google-provided templates](https://cloud.google.com/dataflow/docs/guides/templates/provided-templates).\n", + "## Output\n", + "Name | Description\n", + ":--- | :----------\n", + "job_id | The id of the Cloud Dataflow job that is created.\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. 
Install KFP SDK\n" + "## Caution & requirements\n", + "\n", + "To use the component, the following requirements must be met:\n", + "- Cloud Dataflow API is enabled.\n", + "- The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow Pipeline cluster. For example:\n", + " ```\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* The Kubeflow user service account is a member of:\n", + " - `roles/dataflow.developer` role of the project.\n", + " - `roles/storage.objectViewer` role of the Cloud Storage Object `gcs_path.`\n", + " - `roles/storage.objectCreator` role of the Cloud Storage Object `staging_dir.` \n", + "\n", + "## Detailed description\n", + "You can execute the template locally by following the instructions in [Executing Templates](https://cloud.google.com/dataflow/docs/guides/templates/executing-templates). See the sample code below to learn how to execute the template.\n", + "Follow these steps to use the component in a pipeline:\n", + "1. Install the Kubeflow Pipeline SDK:\n" ] }, { @@ -81,17 +93,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_template.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_template/sample.ipynb)\n", - "* [Cloud Dataflow Templates overview](https://cloud.google.com/dataflow/docs/guides/templates/overview)\n", - "\n", "### Sample\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", - "\n", - "In this sample, we run a Google provided word count template from `gs://dataflow-templates/latest/Word_Count`. The template takes a text file as input and output word counts to a Cloud Storage bucket. Here is the sample input:" + "Note: The following sample code works in an IPython notebook or directly in Python code.\n", + "In this sample, we run a Google-provided word count template from `gs://dataflow-templates/latest/Word_Count`. The template takes a text file as input and outputs word counts to a Cloud Storage bucket. Here is the sample input:" ] }, { @@ -239,6 +244,21 @@ "source": [ "!gsutil cat $OUTPUT_PATH*" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "\n", + "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataflow/_launch_template.py)\n", + "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataflow/launch_template/sample.ipynb)\n", + "* [Cloud Dataflow Templates overview](https://cloud.google.com/dataflow/docs/guides/templates/overview)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). 
To the extent of a direct conflict of terms, the AI Hub Terms of Service will control.\n" + ] } ], "metadata": { @@ -257,7 +277,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/create_cluster/README.md b/components/gcp/dataproc/create_cluster/README.md index 1945c524085..2ffedc57163 100644 --- a/components/gcp/dataproc/create_cluster/README.md +++ b/components/gcp/dataproc/create_cluster/README.md @@ -1,44 +1,62 @@ -# Creating a Cluster with Cloud Dataproc -A Kubeflow Pipeline component to create a cluster in Cloud Dataproc service. +# Name +Data processing by creating a cluster in Cloud Dataproc -## Intended Use -This component can be used at the start of a KFP pipeline to create a temporary Dataproc cluster to run Dataproc jobs as subsequent steps in the pipeline. The cluster can be later recycled by the [Dataproc delete cluster component](https://github.com/kubeflow/pipelines/tree/master/components/gcp/dataproc/delete_cluster). +# Label +Cloud Dataproc, cluster, GCP, Cloud Storage, KubeFlow, Pipeline + + +# Summary +A Kubeflow Pipeline component to create a cluster in Cloud Dataproc. + +# Details +## Intended use + +Use this component at the start of a Kubeflow Pipeline to create a temporary Cloud Dataproc cluster to run Cloud Dataproc jobs as steps in the pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No | -region | The Cloud Dataproc region runs the newly created cluster. | GCPRegion | No | -name | The name of the newly created cluster. Cluster names within a project must be unique. Names of deleted clusters can be reused. | String | Yes | ` ` -name_prefix | The prefix of the cluster name. | String | Yes | ` ` -initialization_actions | List of Cloud Storage URIs of executables to execute on each node after the configuration is completed. By default, executables are run on the master and all the worker nodes. | List | Yes | `[]` -config_bucket | A Cloud Storage bucket used to stage the job dependencies, the configuration files, and the job driver console’s output. | GCSPath | Yes | ` ` -image_version | The version of the software inside the cluster. | String | Yes | ` ` -cluster | The full [cluster config] (https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#Cluster). | Dict | Yes | `{}` -wait_interval | The number of seconds to pause between polling the operation done status. | Integer | Yes | `30` + +| Argument | Description | Optional | Data type | Accepted values | Default | +|----------|-------------|----------|-----------|-----------------|---------| +| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectID | | | +| region | The Cloud Dataproc region to create the cluster in. | No | GCPRegion | | | +| name | The name of the cluster. Cluster names within a project must be unique. You can reuse the names of deleted clusters. | Yes | String | | None | +| name_prefix | The prefix of the cluster name. | Yes | String | | None | +| initialization_actions | A list of Cloud Storage URIs identifying executables to execute on each node after the configuration is completed. By default, executables are run on the master and all the worker nodes. 
| Yes | List | | None | +| config_bucket | The Cloud Storage bucket to use to stage the job dependencies, the configuration files, and the job driver console’s output. | Yes | GCSPath | | None | +| image_version | The version of the software inside the cluster. | Yes | String | | None | +| cluster | The full [cluster configuration](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#Cluster). | Yes | Dict | | None | +| wait_interval | The number of seconds to pause before polling the operation. | Yes | Integer | | 30 | ## Output Name | Description | Type :--- | :---------- | :--- -cluster_name | The cluster name of the created cluster. | String +cluster_name | The name of the cluster. | String + +Note: You can recycle the cluster by using the [Dataproc delete cluster component](https://github.com/kubeflow/pipelines/tree/master/components/gcp/dataproc/delete_cluster). + ## Cautions & requirements -To use the component, you must: -* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -``` -component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) -``` -* Grant Kubeflow user service account the read access to the Cloud Storage buckets which contains initialization action files. -* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project. -## Detailed Description -This component creates a new Dataproc cluster by using [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create). +To use the component, you must: +* Set up the GCP project by following these [steps](https://cloud.google.com/dataproc/docs/guides/setup-project). +* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -Here are the steps to use the component in a pipeline: -1. Install KFP SDK + ``` + component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* Grant the following types of access to the Kubeflow user service account: + * Read access to the Cloud Storage buckets which contains initialization action files. + * The role, `roles/dataproc.editor` on the project. + +## Detailed description + +This component creates a new Dataproc cluster by using the [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create). + +Follow these steps to use the component in a pipeline: + +1. 
Install the Kubeflow Pipeline SDK: @@ -60,16 +78,8 @@ dataproc_create_cluster_op = comp.load_component_from_url( help(dataproc_create_cluster_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_create_cluster.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/create_cluster/sample.ipynb) -* [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create) - - ### Sample - -Note: the sample code below works in both IPython notebook or python code directly. +Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template. #### Set sample parameters @@ -142,3 +152,13 @@ experiment = client.create_experiment(EXPERIMENT_NAME) run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` + +## References +* [Kubernetes Engine for Kubeflow](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) +* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_create_cluster.py) +* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/create_cluster/sample.ipynb) +* [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create) + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataproc/create_cluster/sample.ipynb b/components/gcp/dataproc/create_cluster/sample.ipynb index 1c9a000406d..16a7dd8c60b 100644 --- a/components/gcp/dataproc/create_cluster/sample.ipynb +++ b/components/gcp/dataproc/create_cluster/sample.ipynb @@ -4,46 +4,64 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Creating a Cluster with Cloud Dataproc\n", - "A Kubeflow Pipeline component to create a cluster in Cloud Dataproc service.\n", + "# Name\n", + "Data processing by creating a cluster in Cloud Dataproc\n", "\n", - "## Intended Use\n", - "This component can be used at the start of a KFP pipeline to create a temporary Dataproc cluster to run Dataproc jobs as subsequent steps in the pipeline. 
The cluster can be later recycled by the [Dataproc delete cluster component](https://github.com/kubeflow/pipelines/tree/master/components/gcp/dataproc/delete_cluster).\n", "\n", + "# Label\n", + "Cloud Dataproc, cluster, GCP, Cloud Storage, KubeFlow, Pipeline\n", + "\n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component to create a cluster in Cloud Dataproc.\n", + "\n", + "# Details\n", + "## Intended use\n", + "\n", + "Use this component at the start of a Kubeflow Pipeline to create a temporary Cloud Dataproc cluster to run Cloud Dataproc jobs as steps in the pipeline.\n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n", - "region | The Cloud Dataproc region runs the newly created cluster. | GCPRegion | No |\n", - "name | The name of the newly created cluster. Cluster names within a project must be unique. Names of deleted clusters can be reused. | String | Yes | ` `\n", - "name_prefix | The prefix of the cluster name. | String | Yes | ` `\n", - "initialization_actions | List of Cloud Storage URIs of executables to execute on each node after the configuration is completed. By default, executables are run on the master and all the worker nodes. | List | Yes | `[]`\n", - "config_bucket | A Cloud Storage bucket used to stage the job dependencies, the configuration files, and the job driver console’s output. | GCSPath | Yes | ` `\n", - "image_version | The version of the software inside the cluster. | String | Yes | ` `\n", - "cluster | The full [cluster config] (https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#Cluster). | Dict | Yes | `{}`\n", - "wait_interval | The number of seconds to pause between polling the operation done status. | Integer | Yes | `30`\n", + "\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|----------|-------------|----------|-----------|-----------------|---------|\n", + "| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectID | | |\n", + "| region | The Cloud Dataproc region to create the cluster in. | No | GCPRegion | | |\n", + "| name | The name of the cluster. Cluster names within a project must be unique. You can reuse the names of deleted clusters. | Yes | String | | None |\n", + "| name_prefix | The prefix of the cluster name. | Yes | String | | None |\n", + "| initialization_actions | A list of Cloud Storage URIs identifying executables to execute on each node after the configuration is completed. By default, executables are run on the master and all the worker nodes. | Yes | List | | None |\n", + "| config_bucket | The Cloud Storage bucket to use to stage the job dependencies, the configuration files, and the job driver console’s output. | Yes | GCSPath | | None |\n", + "| image_version | The version of the software inside the cluster. | Yes | String | | None |\n", + "| cluster | The full [cluster configuration](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#Cluster). | Yes | Dict | | None |\n", + "| wait_interval | The number of seconds to pause before polling the operation. | Yes | Integer | | 30 |\n", "\n", "## Output\n", "Name | Description | Type\n", ":--- | :---------- | :---\n", - "cluster_name | The cluster name of the created cluster. 
| String\n", + "cluster_name | The name of the cluster. | String\n", + "\n", + "Note: You can recycle the cluster by using the [Dataproc delete cluster component](https://github.com/kubeflow/pipelines/tree/master/components/gcp/dataproc/delete_cluster).\n", + "\n", "\n", "## Cautions & requirements\n", - "To use the component, you must:\n", - "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* Grant Kubeflow user service account the read access to the Cloud Storage buckets which contains initialization action files.\n", - "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n", - "\n", - "## Detailed Description\n", - "This component creates a new Dataproc cluster by using [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create).\n", - "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "\n", + "To use the component, you must:\n", + "* Set up the GCP project by following these [steps](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", + "\n", + " ```\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* Grant the following types of access to the Kubeflow user service account:\n", + " * Read access to the Cloud Storage buckets which contains initialization action files.\n", + " * The role, `roles/dataproc.editor` on the project.\n", + "\n", + "## Detailed description\n", + "\n", + "This component creates a new Dataproc cluster by using the [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create). \n", + "\n", + "Follow these steps to use the component in a pipeline:\n", + "\n", + "1. Install the Kubeflow Pipeline SDK:\n" ] }, { @@ -82,22 +100,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_create_cluster.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/create_cluster/sample.ipynb)\n", - "* [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create)\n", - "\n", - "\n", "### Sample\n", + "Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ "#### Set sample parameters" ] }, @@ -205,6 +210,21 @@ "run_name = pipeline_func.__name__ + ' run'\n", "run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "* [Kubernetes Engine for Kubeflow](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts)\n", + "* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_create_cluster.py)\n", + "* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/create_cluster/sample.ipynb)\n", + "* [Dataproc create cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/create)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." + ] } ], "metadata": { @@ -223,7 +243,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/delete_cluster/README.md b/components/gcp/dataproc/delete_cluster/README.md index fb2be0b9722..5cb238c607f 100644 --- a/components/gcp/dataproc/delete_cluster/README.md +++ b/components/gcp/dataproc/delete_cluster/README.md @@ -1,33 +1,43 @@ -# Deleting a Cluster with Cloud Dataproc -A Kubeflow Pipeline component to delete a cluster in Cloud Dataproc service. +# Name + +Data preparation by deleting a cluster in Cloud Dataproc + +# Label +Cloud Dataproc, cluster, GCP, Cloud Storage, Kubeflow, Pipeline + + +# Summary +A Kubeflow Pipeline component to delete a cluster in Cloud Dataproc. + +## Intended use +Use this component at the start of a Kubeflow Pipeline to delete a temporary Cloud Dataproc cluster to run Cloud Dataproc jobs as steps in the pipeline. This component is usually used with an [exit handler](https://github.com/kubeflow/pipelines/blob/master/samples/basic/exit_handler.py) to run at the end of a pipeline. -## Intended Use -Use the component to recycle a Dataproc cluster as one of the step in a KFP pipeline. This component is usually used with an [exit handler](https://github.com/kubeflow/pipelines/blob/master/samples/basic/exit_handler.py) to run at the end of a pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No | -region | The Cloud Dataproc region runs the cluster to delete. | GCPRegion | No | -name | The cluster name to delete. | String | No | -wait_interval | The number of seconds to pause between polling the delete operation done status. | Integer | Yes | `30` +| Argument | Description | Optional | Data type | Accepted values | Default | +|----------|-------------|----------|-----------|-----------------|---------| +| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. 
| No | GCPProjectID | | | +| region | The Cloud Dataproc region in which to handle the request. | No | GCPRegion | | | +| name | The name of the cluster to delete. | No | String | | | +| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 | + ## Cautions & requirements To use the component, you must: -* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -``` -component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) -``` -* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project. +* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). +* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -## Detailed Description -This component deletes a Dataproc cluster by using [Dataproc delete cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete). + ``` + component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project. -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +## Detailed description +This component deletes a Dataproc cluster by using [Dataproc delete cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete). +Follow these steps to use the component in a pipeline: +1. Install the Kubeflow Pipeline SDK: ```python @@ -48,20 +58,13 @@ dataproc_delete_cluster_op = comp.load_component_from_url( help(dataproc_delete_cluster_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_delete_cluster.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/delete_cluster/sample.ipynb) -* [Dataproc delete cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete) - - ### Sample -Note: the sample code below works in both IPython notebook or python code directly. +Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template. #### Prerequisites -Before running the sample code, you need to [create a Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). +[Create a Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) before running the sample code. 
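The intended use above pairs this component with an exit handler so that the temporary cluster is deleted even when an upstream step fails. The following sketch illustrates that pattern; the component URL, project ID, region, and cluster name are placeholders, not values taken from this sample.

```python
import kfp.dsl as dsl
import kfp.gcp as gcp
import kfp.components as comp

# Hypothetical URL; point this at the published delete_cluster component.yaml.
dataproc_delete_cluster_op = comp.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/gcp/dataproc/delete_cluster/component.yaml')

@dsl.pipeline(
    name='Dataproc delete cluster exit handler',
    description='Deletes the temporary cluster when the run finishes, even on failure.'
)
def delete_cluster_pipeline(
    project_id='my-project',   # placeholder
    region='us-central1',      # placeholder
    name='my-cluster'          # placeholder
):
    # Create the delete step first so it can be registered as the exit handler.
    delete_cluster = dataproc_delete_cluster_op(
        project_id=project_id,
        region=region,
        name=name).apply(gcp.use_gcp_secret('user-gcp-sa'))

    with dsl.ExitHandler(delete_cluster):
        # Steps that use the cluster (for example, submit_*_job components) go here.
        pass
```

Registering the delete step as the exit handler guarantees cleanup at the end of every run; the sample that follows calls the component directly against an existing cluster instead.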
#### Set sample parameters @@ -122,3 +125,14 @@ experiment = client.create_experiment(EXPERIMENT_NAME) run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` + +## References + +* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_delete_cluster.py) +* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/delete_cluster/sample.ipynb) +* [Dataproc delete cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete) + + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataproc/delete_cluster/sample.ipynb b/components/gcp/dataproc/delete_cluster/sample.ipynb index 15ad51550e9..d0de6367956 100644 --- a/components/gcp/dataproc/delete_cluster/sample.ipynb +++ b/components/gcp/dataproc/delete_cluster/sample.ipynb @@ -4,34 +4,45 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Deleting a Cluster with Cloud Dataproc\n", - "A Kubeflow Pipeline component to delete a cluster in Cloud Dataproc service.\n", + "# Name\n", + "\n", + "Data preparation by deleting a cluster in Cloud Dataproc\n", + "\n", + "# Label\n", + "Cloud Dataproc, cluster, GCP, Cloud Storage, Kubeflow, Pipeline\n", + "\n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component to delete a cluster in Cloud Dataproc.\n", + "\n", + "## Intended use\n", + "Use this component at the start of a Kubeflow Pipeline to delete a temporary Cloud Dataproc cluster to run Cloud Dataproc jobs as steps in the pipeline. This component is usually used with an [exit handler](https://github.com/kubeflow/pipelines/blob/master/samples/basic/exit_handler.py) to run at the end of a pipeline.\n", "\n", - "## Intended Use\n", - "Use the component to recycle a Dataproc cluster as one of the step in a KFP pipeline. This component is usually used with an [exit handler](https://github.com/kubeflow/pipelines/blob/master/samples/basic/exit_handler.py) to run at the end of a pipeline.\n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n", - "region | The Cloud Dataproc region runs the cluster to delete. | GCPRegion | No |\n", - "name | The cluster name to delete. | String | No |\n", - "wait_interval | The number of seconds to pause between polling the delete operation done status. | Integer | Yes | `30`\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|----------|-------------|----------|-----------|-----------------|---------|\n", + "| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectID | | |\n", + "| region | The Cloud Dataproc region in which to handle the request. | No | GCPRegion | | |\n", + "| name | The name of the cluster to delete. 
| No | String | | |\n", + "| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 |\n", + "\n", "\n", "## Cautions & requirements\n", "To use the component, you must:\n", - "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n", - "\n", - "## Detailed Description\n", + "* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", + "\n", + " ```\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.\n", + "\n", + "## Detailed description\n", "This component deletes a Dataproc cluster by using [Dataproc delete cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete).\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "Follow these steps to use the component in a pipeline:\n", + "1. Install the Kubeflow Pipeline SDK:" ] }, { @@ -70,31 +81,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_delete_cluster.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/delete_cluster/sample.ipynb)\n", - "* [Dataproc delete cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete)\n", - "\n", - "\n", "### Sample\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ + "Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n", + "\n", "#### Prerequisites\n", "\n", - "Before running the sample code, you need to [create a Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster)." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ + "[Create a Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) before running the sample code.\n", + "\n", "#### Set sample parameters" ] }, @@ -190,6 +184,22 @@ "run_name = pipeline_func.__name__ + ' run'\n", "run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "\n", + "* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_delete_cluster.py)\n", + "* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/delete_cluster/sample.ipynb)\n", + "* [Dataproc delete cluster REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters/delete)\n", + "\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." + ] } ], "metadata": { @@ -208,7 +218,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/submit_hadoop_job/README.md b/components/gcp/dataproc/submit_hadoop_job/README.md index 1d5bf42ff88..d1ae5d3c975 100644 --- a/components/gcp/dataproc/submit_hadoop_job/README.md +++ b/components/gcp/dataproc/submit_hadoop_job/README.md @@ -1,22 +1,36 @@ -# Submitting a Hadoop Job to Cloud Dataproc -A Kubeflow Pipeline component to submit an Apache Hadoop MapReduce job on Apache Hadoop YARN in Google Cloud Dataproc service. +# Name +Data preparation using Hadoop MapReduce on YARN with Cloud Dataproc -## Intended Use -Use the component to run an Apache Hadoop MapReduce job as one preprocessing step in a KFP pipeline. +# Label +Cloud Dataproc, GCP, Cloud Storage, Hadoop, YARN, Apache, MapReduce + + +# Summary +A Kubeflow Pipeline component to prepare data by submitting an Apache Hadoop MapReduce job on Apache Hadoop YARN to Cloud Dataproc. + +# Details +## Intended use +Use the component to run an Apache Hadoop MapReduce job as one preprocessing step in a Kubeflow Pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No | -region | The Dataproc region that handles the request. | GCPRegion | No | -cluster_name | The name of the cluster that runs the job. | String | No | -main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the JAR file containing the main class to execute. Examples: `gs://foo-bucket/analytics-binaries/extract-useful-metrics-mr.jar` `hdfs:/tmp/test-samples/custom-wordcount.jar` `file:///home/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar` | GCSPath | No | -main_class | The name of the driver's main class. The JARfile that contains the class must be in the default CLASSPATH or specified in `hadoop_job.jarFileUris`. | String | No | -args | The arguments to pass to the driver. 
Do not include arguments, such as -libjars or -Dfoo=bar, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | Yes | List | `[]` -hadoop_job | The payload of a [HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob). | Dict | Yes | `{}` -job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}` -wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30` +| Argument | Description | Optional | Data type | Accepted values | Default | +|----------|-------------|----------|-----------|-----------------|---------| +| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectID | | | +| region | The Dataproc region to handle the request. | No | GCPRegion | | | +| cluster_name | The name of the cluster to run the job. | No | String | | | +| main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the JAR file containing the main class to execute. | No | List | | | +| main_class | The name of the driver's main class. The JAR file that contains the class must be either in the default CLASSPATH or specified in `hadoop_job.jarFileUris`. | No | String | | | +| args | The arguments to pass to the driver. Do not include arguments, such as -libjars or -Dfoo=bar, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | Yes | List | | None | +| hadoop_job | The payload of a [HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob). | Yes | Dict | | None | +| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None | +| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 | + +Note: +`main_jar_file_uri`: The examples for the files are : +- `gs://foo-bucket/analytics-binaries/extract-useful-metrics-mr.jar` +- `hdfs:/tmp/test-samples/custom-wordcount.jarfile:///home/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar` + ## Output Name | Description | Type @@ -25,19 +39,22 @@ job_id | The ID of the created job. | String ## Cautions & requirements To use the component, you must: -* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). -* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -``` -component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) -``` -* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project. +* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). +* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). +* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: + + ```python + component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project. 
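With these requirements in place, submitting the job is a single pipeline step. The sketch below is illustrative only: the component URL, project, region, cluster name, and Cloud Storage paths are placeholders, and it assumes the pre-installed WordCount example class that the sample later in this document uses.

```python
import kfp.dsl as dsl
import kfp.gcp as gcp
import kfp.components as comp

# Hypothetical URL; point this at the published submit_hadoop_job component.yaml.
dataproc_submit_hadoop_job_op = comp.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/gcp/dataproc/submit_hadoop_job/component.yaml')

@dsl.pipeline(
    name='Dataproc submit Hadoop job',
    description='Runs the pre-installed WordCount example on an existing cluster.'
)
def wordcount_pipeline(
    project_id='my-project',                    # placeholder
    region='us-central1',                       # placeholder
    cluster_name='my-cluster',                  # placeholder
    input_gcs_path='gs://my-bucket/input.txt',  # placeholder
    output_gcs_path='gs://my-bucket/output/'    # placeholder
):
    dataproc_submit_hadoop_job_op(
        project_id=project_id,
        region=region,
        cluster_name=cluster_name,
        main_class='org.apache.hadoop.examples.WordCount',
        # `args` is documented as a List above; depending on your SDK version you
        # may need to serialize it (for example with json.dumps) before passing it.
        args=[input_gcs_path, output_gcs_path]
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))
```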
+ +## Detailed description -## Detailed Description This component creates a Hadoop job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit). -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +Follow these steps to use the component in a pipeline: + +1. Install the Kubeflow Pipeline SDK: @@ -59,28 +76,23 @@ dataproc_submit_hadoop_job_op = comp.load_component_from_url( help(dataproc_submit_hadoop_job_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hadoop_job.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hadoop_job/sample.ipynb) -* [Dataproc HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob) +## Sample +Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template. -### Sample -Note: the sample code below works in both IPython notebook or python code directly. - -#### Setup a Dataproc cluster +### Setup a Dataproc cluster [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code. -#### Prepare Hadoop job -Upload your Hadoop jar file to a Cloud Storage (GCS) bucket. In the sample, we will use a jar file that is pre-installed in the main cluster, so there is no need to provide the `main_jar_file_uri`. We only set `main_class` to be `org.apache.hadoop.examples.WordCount`. +### Prepare a Hadoop job +Upload your Hadoop JAR file to a Cloud Storage bucket. In the sample, we will use a JAR file that is preinstalled in the main cluster, so there is no need to provide `main_jar_file_uri`. Here is the [WordCount example source code](https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/WordCount.java). -To package a self-contained Hadoop MapReduceapplication from the source code, follow the [MapReduce Tutorial](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html). +To package a self-contained Hadoop MapReduce application from the source code, follow the [MapReduce Tutorial](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html). + -#### Set sample parameters +### Set sample parameters ```python @@ -101,12 +113,10 @@ The input file is a simple text file: !gsutil cat $INTPUT_GCS_PATH ``` -#### Clean up existing output files (Optional) +### Clean up the existing output files (optional) +This is needed because the sample code requires the output folder to be a clean folder. To continue to run the sample, make sure that the service account of the notebook server has access to the `OUTPUT_GCS_PATH`. -This is needed because the sample code requires the output folder to be a clean folder. -To continue to run the sample, make sure that the service account of the notebook server has access to the `OUTPUT_GCS_PATH`. - -**CAUTION**: This will remove all blob files under `OUTPUT_GCS_PATH`. +CAUTION: This will remove all blob files under `OUTPUT_GCS_PATH`. 
```python @@ -177,10 +187,19 @@ run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` -#### Inspect the outputs -The sample in the notebook will count the words in the input text and save them in sharded files. Here is the command to inspect them: +### Inspect the output +The sample in the notebook will count the words in the input text and save them in sharded files. The command to inspect the output is: ```python !gsutil cat $OUTPUT_GCS_PATH/* ``` + +## References +* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hadoop_job.py) +* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hadoop_job/sample.ipynb) +* [Dataproc HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob) + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataproc/submit_hadoop_job/sample.ipynb b/components/gcp/dataproc/submit_hadoop_job/sample.ipynb index 6fa0f822be7..dc4b1230ebe 100644 --- a/components/gcp/dataproc/submit_hadoop_job/sample.ipynb +++ b/components/gcp/dataproc/submit_hadoop_job/sample.ipynb @@ -4,24 +4,38 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a Hadoop Job to Cloud Dataproc\n", - "A Kubeflow Pipeline component to submit an Apache Hadoop MapReduce job on Apache Hadoop YARN in Google Cloud Dataproc service.\n", + "# Name\n", + "Data preparation using Hadoop MapReduce on YARN with Cloud Dataproc\n", "\n", - "## Intended Use\n", - "Use the component to run an Apache Hadoop MapReduce job as one preprocessing step in a KFP pipeline. \n", + "# Label\n", + "Cloud Dataproc, GCP, Cloud Storage, Hadoop, YARN, Apache, MapReduce\n", + "\n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component to prepare data by submitting an Apache Hadoop MapReduce job on Apache Hadoop YARN to Cloud Dataproc.\n", + "\n", + "# Details\n", + "## Intended use\n", + "Use the component to run an Apache Hadoop MapReduce job as one preprocessing step in a Kubeflow Pipeline. \n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n", - "region | The Dataproc region that handles the request. | GCPRegion | No |\n", - "cluster_name | The name of the cluster that runs the job. | String | No |\n", - "main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the JAR file containing the main class to execute. Examples: `gs://foo-bucket/analytics-binaries/extract-useful-metrics-mr.jar` `hdfs:/tmp/test-samples/custom-wordcount.jar` `file:///home/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar` | GCSPath | No |\n", - "main_class | The name of the driver's main class. The JARfile that contains the class must be in the default CLASSPATH or specified in `hadoop_job.jarFileUris`. | String | No |\n", - "args | The arguments to pass to the driver. 
Do not include arguments, such as -libjars or -Dfoo=bar, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | Yes | List | `[]`\n", - "hadoop_job | The payload of a [HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob). | Dict | Yes | `{}`\n", - "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}`\n", - "wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30`\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|----------|-------------|----------|-----------|-----------------|---------|\n", + "| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectID | | |\n", + "| region | The Dataproc region to handle the request. | No | GCPRegion | | |\n", + "| cluster_name | The name of the cluster to run the job. | No | String | | |\n", + "| main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the JAR file containing the main class to execute. | No | List | | |\n", + "| main_class | The name of the driver's main class. The JAR file that contains the class must be either in the default CLASSPATH or specified in `hadoop_job.jarFileUris`. | No | String | | |\n", + "| args | The arguments to pass to the driver. Do not include arguments, such as -libjars or -Dfoo=bar, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | Yes | List | | None |\n", + "| hadoop_job | The payload of a [HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob). | Yes | Dict | | None |\n", + "| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None |\n", + "| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 |\n", + "\n", + "Note: \n", + "`main_jar_file_uri`: The examples for the files are : \n", + "- `gs://foo-bucket/analytics-binaries/extract-useful-metrics-mr.jar` \n", + "- `hdfs:/tmp/test-samples/custom-wordcount.jarfile:///home/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar`\n", + "\n", "\n", "## Output\n", "Name | Description | Type\n", @@ -30,19 +44,22 @@ "\n", "## Cautions & requirements\n", "To use the component, you must:\n", - "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", - "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n", + "* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", + "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. 
For example:\n", + "\n", + " ```python\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.\n", + "\n", + "## Detailed description\n", "\n", - "## Detailed Description\n", "This component creates a Hadoop job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "Follow these steps to use the component in a pipeline:\n", + "\n", + "1. Install the Kubeflow Pipeline SDK:\n" ] }, { @@ -81,33 +98,23 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hadoop_job.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hadoop_job/sample.ipynb)\n", - "* [Dataproc HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob)\n", - "\n", - "### Sample\n", + "## Sample\n", + "Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", "\n", - "#### Setup a Dataproc cluster\n", + "### Setup a Dataproc cluster\n", "[Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code.\n", "\n", "\n", - "#### Prepare Hadoop job\n", - "Upload your Hadoop jar file to a Cloud Storage (GCS) bucket. In the sample, we will use a jar file that is pre-installed in the main cluster, so there is no need to provide the `main_jar_file_uri`. We only set `main_class` to be `org.apache.hadoop.examples.WordCount`.\n", + "### Prepare a Hadoop job\n", + "Upload your Hadoop JAR file to a Cloud Storage bucket. In the sample, we will use a JAR file that is preinstalled in the main cluster, so there is no need to provide `main_jar_file_uri`. \n", "\n", "Here is the [WordCount example source code](https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/WordCount.java).\n", "\n", - "To package a self-contained Hadoop MapReduceapplication from the source code, follow the [MapReduce Tutorial](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html)." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Set sample parameters" + "To package a self-contained Hadoop MapReduce application from the source code, follow the [MapReduce Tutorial](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html).\n", + "\n", + "\n", + "### Set sample parameters" ] }, { @@ -150,12 +157,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### Clean up existing output files (Optional)\n", - "\n", - "This is needed because the sample code requires the output folder to be a clean folder.\n", - "To continue to run the sample, make sure that the service account of the notebook server has access to the `OUTPUT_GCS_PATH`.\n", + "### Clean up the existing output files (optional)\n", + "This is needed because the sample code requires the output folder to be a clean folder. To continue to run the sample, make sure that the service account of the notebook server has access to the `OUTPUT_GCS_PATH`.\n", "\n", - "**CAUTION**: This will remove all blob files under `OUTPUT_GCS_PATH`." + "CAUTION: This will remove all blob files under `OUTPUT_GCS_PATH`." ] }, { @@ -262,8 +267,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### Inspect the outputs\n", - "The sample in the notebook will count the words in the input text and save them in sharded files. Here is the command to inspect them:" + "### Inspect the output\n", + "The sample in the notebook will count the words in the input text and save them in sharded files. The command to inspect the output is:" ] }, { @@ -274,6 +279,20 @@ "source": [ "!gsutil cat $OUTPUT_GCS_PATH/*" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hadoop_job.py)\n", + "* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hadoop_job/sample.ipynb)\n", + "* [Dataproc HadoopJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HadoopJob)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." + ] } ], "metadata": { @@ -292,7 +311,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/submit_hive_job/README.md b/components/gcp/dataproc/submit_hive_job/README.md index 8cd1d0b01c9..f73bc257f1c 100644 --- a/components/gcp/dataproc/submit_hive_job/README.md +++ b/components/gcp/dataproc/submit_hive_job/README.md @@ -1,22 +1,29 @@ -# Submitting a Hive Job to Cloud Dataproc -A Kubeflow Pipeline component to submit a Hive job to Google Cloud Dataproc service. +# Name +Data preparation using Apache Hive on YARN with Cloud Dataproc -## Intended Use -Use the component to run an Apache Hive job as one preprocessing step in a KFP pipeline. 
+# Label +Cloud Dataproc, GCP, Cloud Storage, YARN, Hive, Apache + +# Summary +A Kubeflow Pipeline component to prepare data by submitting an Apache Hive job on YARN to Cloud Dataproc. + +# Details +## Intended use +Use the component to run an Apache Hive job as one preprocessing step in a Kubeflow Pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No | -region | The Dataproc region that handles the request. | GCPRegion | No | -cluster_name | The name of the cluster that runs the job. | String | No | -queries | The queries to execute. You do not need to terminate a query with a semicolon. Multiple queries can be specified in one string by separating each with a semicolon. | List | Yes | `[]` -query_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the script that contains Hive queries. | GCSPath | Yes | ` ` -script_variables | Mapping of query variable names to values (equivalent to the Hive command: SET name="value";). | List | Yes | `[]` -hive_job | The payload of a [HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob). | Dict | Yes | `{}` -job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}` -wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30` +| Argument | Description | Optional | Data type | Accepted values | Default | +|----------|-------------|----------|-----------|-----------------|---------| +| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectId | | | +| region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | | +| cluster_name | The name of the cluster to run the job. | No | String | | | +| queries | The queries to execute the Hive job. Specify multiple queries in one string by separating them with semicolons. You do not need to terminate queries with semicolons. | Yes | List | | None | +| query_file_uri | The HCFS URI of the script that contains the Hive queries. | Yes | GCPPath | | None | +| script_variables | Mapping of the query’s variable names to their values (equivalent to the Hive command: SET name="value";). | Yes | Dict | | None | +| hive_job | The payload of a [HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob) | Yes | Dict | | None | +| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None | +| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 | ## Output Name | Description | Type @@ -25,19 +32,20 @@ job_id | The ID of the created job. | String ## Cautions & requirements To use the component, you must: -* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). -* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -``` -component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) -``` -* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project. 
+* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). +* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). +* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: + + ``` + component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project. -## Detailed Description +## Detailed description This component creates a Hive job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit). -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +Follow these steps to use the component in a pipeline: +1. Install the Kubeflow Pipeline SDK: @@ -59,23 +67,21 @@ dataproc_submit_hive_job_op = comp.load_component_from_url( help(dataproc_submit_hive_job_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hive_job.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hive_job/sample.ipynb) -* [Dataproc HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob) - ### Sample -Note: the sample code below works in both IPython notebook or python code directly. +Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template. + #### Setup a Dataproc cluster + [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code. -#### Prepare Hive query -Directly put your Hive queries in the `queries` list or upload your Hive queries into a file to a Cloud Storage (GCS) bucket and place the path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a public CSV file from GCS. +#### Prepare a Hive query + +Put your Hive queries in the queries list, or upload your Hive queries into a file saved in a Cloud Storage bucket and then enter the Cloud Storage bucket’s path in `query_file_uri.` In this sample, we will use a hard coded query in the queries list to select data from a public CSV file from Cloud Storage. 
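A hard-coded query list might look like the sketch below. The table name, schema, and Cloud Storage location are illustrative placeholders, not the values used in this sample.

```python
# Illustrative only: map a CSV file in Cloud Storage to an external Hive table,
# then query it. Multiple statements in one string are separated by semicolons.
QUERY = '''
DROP TABLE IF EXISTS sample_csv;
CREATE EXTERNAL TABLE sample_csv (
  id BIGINT,
  label STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'gs://my-bucket/path/to/csv/';
SELECT label, COUNT(*) AS n FROM sample_csv GROUP BY label
'''

queries = [QUERY]
```

This list is what you pass to the component's `queries` argument; to keep the same statements in a script on Cloud Storage instead, pass its path through `query_file_uri` and leave `queries` unset.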
+ +For more details, see the [Hive language manual.](https://cwiki.apache.org/confluence/display/Hive/LanguageManual) -For more details, please checkout [Hive language manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual) #### Set sample parameters @@ -166,3 +172,12 @@ experiment = client.create_experiment(EXPERIMENT_NAME) run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` + +## References +* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hive_job.py) +* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hive_job/sample.ipynb) +* [Dataproc HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob) + +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataproc/submit_hive_job/sample.ipynb b/components/gcp/dataproc/submit_hive_job/sample.ipynb index a6081328ebe..bfd32c6558a 100644 --- a/components/gcp/dataproc/submit_hive_job/sample.ipynb +++ b/components/gcp/dataproc/submit_hive_job/sample.ipynb @@ -4,24 +4,31 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a Hive Job to Cloud Dataproc\n", - "A Kubeflow Pipeline component to submit a Hive job to Google Cloud Dataproc service. \n", + "# Name\n", + "Data preparation using Apache Hive on YARN with Cloud Dataproc\n", "\n", - "## Intended Use\n", - "Use the component to run an Apache Hive job as one preprocessing step in a KFP pipeline. \n", + "# Label\n", + "Cloud Dataproc, GCP, Cloud Storage, YARN, Hive, Apache\n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component to prepare data by submitting an Apache Hive job on YARN to Cloud Dataproc.\n", + "\n", + "# Details\n", + "## Intended use\n", + "Use the component to run an Apache Hive job as one preprocessing step in a Kubeflow Pipeline.\n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n", - "region | The Dataproc region that handles the request. | GCPRegion | No |\n", - "cluster_name | The name of the cluster that runs the job. | String | No |\n", - "queries | The queries to execute. You do not need to terminate a query with a semicolon. Multiple queries can be specified in one string by separating each with a semicolon. | List | Yes | `[]`\n", - "query_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the script that contains Hive queries. | GCSPath | Yes | ` `\n", - "script_variables | Mapping of query variable names to values (equivalent to the Hive command: SET name=\"value\";). | List | Yes | `[]`\n", - "hive_job | The payload of a [HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob). | Dict | Yes | `{}`\n", - "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). 
| Dict | Yes | `{}`\n", - "wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30`\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|----------|-------------|----------|-----------|-----------------|---------|\n", + "| project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | No | GCPProjectId | | |\n", + "| region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | |\n", + "| cluster_name | The name of the cluster to run the job. | No | String | | |\n", + "| queries | The queries to execute the Hive job. Specify multiple queries in one string by separating them with semicolons. You do not need to terminate queries with semicolons. | Yes | List | | None |\n", + "| query_file_uri | The HCFS URI of the script that contains the Hive queries. | Yes | GCPPath | | None |\n", + "| script_variables | Mapping of the query’s variable names to their values (equivalent to the Hive command: SET name=\"value\";). | Yes | Dict | | None |\n", + "| hive_job | The payload of a [HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob) | Yes | Dict | | None |\n", + "| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None |\n", + "| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 |\n", "\n", "## Output\n", "Name | Description | Type\n", @@ -30,19 +37,20 @@ "\n", "## Cautions & requirements\n", "To use the component, you must:\n", - "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", - "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n", - "\n", - "## Detailed Description\n", + "* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", + "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", + "\n", + " ```\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.\n", + "\n", + "## Detailed description\n", "This component creates a Hive job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "Follow these steps to use the component in a pipeline:\n", + "1. 
Install the Kubeflow Pipeline SDK:\n" ] }, { @@ -81,29 +89,22 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hive_job.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hive_job/sample.ipynb)\n", - "* [Dataproc HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob)\n", - "\n", "### Sample\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", + "Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n", + "\n", "\n", "#### Setup a Dataproc cluster\n", + "\n", "[Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code.\n", "\n", - "#### Prepare Hive query\n", - "Directly put your Hive queries in the `queries` list or upload your Hive queries into a file to a Cloud Storage (GCS) bucket and place the path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a public CSV file from GCS.\n", + "#### Prepare a Hive query\n", + "\n", + "Put your Hive queries in the queries list, or upload your Hive queries into a file saved in a Cloud Storage bucket and then enter the Cloud Storage bucket’s path in `query_file_uri.` In this sample, we will use a hard coded query in the queries list to select data from a public CSV file from Cloud Storage.\n", + "\n", + "For more details, see the [Hive language manual.](https://cwiki.apache.org/confluence/display/Hive/LanguageManual)\n", + "\n", "\n", - "For more details, please checkout [Hive language manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ "#### Set sample parameters" ] }, @@ -229,6 +230,20 @@ "run_name = pipeline_func.__name__ + ' run'\n", "run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References\n", + "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_hive_job.py)\n", + "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_hive_job/sample.ipynb)\n", + "* [Dataproc HiveJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/HiveJob)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." 
+ ] } ], "metadata": { @@ -247,7 +262,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/submit_pig_job/README.md b/components/gcp/dataproc/submit_pig_job/README.md index 252b0cad638..70ead813b0e 100644 --- a/components/gcp/dataproc/submit_pig_job/README.md +++ b/components/gcp/dataproc/submit_pig_job/README.md @@ -1,22 +1,31 @@ -# Submitting a Pig Job to Cloud Dataproc -A Kubeflow Pipeline component to submit a Pig job to Google Cloud Dataproc service. +# Name +Data preparation using Apache Pig on YARN with Cloud Dataproc -## Intended Use -Use the component to run an Apache Pig job as one preprocessing step in a KFP pipeline. +# Label +Cloud Dataproc, GCP, Cloud Storage, YARN, Pig, Apache, Kubeflow, pipelines, components + + +# Summary +A Kubeflow Pipeline component to prepare data by submitting an Apache Pig job on YARN to Cloud Dataproc. + + +# Details +## Intended use +Use the component to run an Apache Pig job as one preprocessing step in a Kubeflow Pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No | -region | The Dataproc region that handles the request. | GCPRegion | No | -cluster_name | The name of the cluster that runs the job. | String | No | -queries | The queries to execute. You do not need to terminate a query with a semicolon. Multiple queries can be specified in one string by separating each with a semicolon. | List | Yes | `[]` -query_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the script that contains Pig queries.| GCSPath | Yes | ` ` -script_variables | Optional. Mapping of query variable names to values (equivalent to the Pig command: SET name="value";).| List | Yes | `[]` -pig_job | The payload of a [PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob). | Dict | Yes | `{}` -job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs).| Dict | Yes | `{}` -wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30` +| Argument | Description | Optional | Data type | Accepted values | Default | +|----------|-------------|----------|-----------|-----------------|---------| +| project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No | GCPProjectID | | | +| region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | | +| cluster_name | The name of the cluster to run the job. | No | String | | | +| queries | The queries to execute the Pig job. Specify multiple queries in one string by separating them with semicolons. You do not need to terminate queries with semicolons. | Yes | List | | None | +| query_file_uri | The HCFS URI of the script that contains the Pig queries. | Yes | GCSPath | | None | +| script_variables | Mapping of the query’s variable names to their values (equivalent to the Pig command: SET name="value";). | Yes | Dict | | None | +| pig_job | The payload of a [PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob). | Yes | Dict | | None | +| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). 
| Yes | Dict | | None | +| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 | ## Output Name | Description | Type @@ -24,20 +33,22 @@ Name | Description | Type job_id | The ID of the created job. | String ## Cautions & requirements + To use the component, you must: -* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). -* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -``` -component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) -``` -* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project. +* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). +* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). +* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: + + ``` + component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project. -## Detailed Description +## Detailed description This component creates a Pig job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit). -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +Follow these steps to use the component in a pipeline: +1. Install the Kubeflow Pipeline SDK: @@ -59,23 +70,21 @@ dataproc_submit_pig_job_op = comp.load_component_from_url( help(dataproc_submit_pig_job_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_pig_job.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_pig_job/sample.ipynb) -* [Dataproc PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob) - ### Sample -Note: the sample code below works in both IPython notebook or python code directly. +Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template. + #### Setup a Dataproc cluster + [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code. -#### Prepare Pig query -Directly put your Pig queries in the `queries` list or upload your Pig queries into a file to a Google Cloud Storage (GCS) bucket and place the path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a local `passwd` file. -For more details, please checkout [Pig documentation](http://pig.apache.org/docs/latest/) +#### Prepare a Pig query + +Either put your Pig queries in the `queries` list, or upload your Pig queries into a file to a Cloud Storage bucket and then enter the Cloud Storage bucket’s path in `query_file_uri`. 
In this sample, we will use a hard coded query in the `queries` list to select data from a local `passwd` file. + +For more details on Apache Pig, see the [Pig documentation.](http://pig.apache.org/docs/latest/) #### Set sample parameters @@ -154,7 +163,11 @@ run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` +## References +* [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) +* [Pig documentation](http://pig.apache.org/docs/latest/) +* [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs) +* [PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob) -```python - -``` +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataproc/submit_pig_job/sample.ipynb b/components/gcp/dataproc/submit_pig_job/sample.ipynb index 9da409b8e1d..b695b2eadaa 100644 --- a/components/gcp/dataproc/submit_pig_job/sample.ipynb +++ b/components/gcp/dataproc/submit_pig_job/sample.ipynb @@ -4,24 +4,33 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a Pig Job to Cloud Dataproc\n", - "A Kubeflow Pipeline component to submit a Pig job to Google Cloud Dataproc service. \n", + "# Name\n", + "Data preparation using Apache Pig on YARN with Cloud Dataproc\n", "\n", - "## Intended Use\n", - "Use the component to run an Apache Pig job as one preprocessing step in a KFP pipeline. \n", + "# Label\n", + "Cloud Dataproc, GCP, Cloud Storage, YARN, Pig, Apache, Kubeflow, pipelines, components\n", + "\n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component to prepare data by submitting an Apache Pig job on YARN to Cloud Dataproc.\n", + "\n", + "\n", + "# Details\n", + "## Intended use\n", + "Use the component to run an Apache Pig job as one preprocessing step in a Kubeflow Pipeline.\n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n", - "region | The Dataproc region that handles the request. | GCPRegion | No |\n", - "cluster_name | The name of the cluster that runs the job. | String | No |\n", - "queries | The queries to execute. You do not need to terminate a query with a semicolon. Multiple queries can be specified in one string by separating each with a semicolon. | List | Yes | `[]`\n", - "query_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the script that contains Pig queries.| GCSPath | Yes | ` `\n", - "script_variables | Optional. Mapping of query variable names to values (equivalent to the Pig command: SET name=\"value\";).| List | Yes | `[]`\n", - "pig_job | The payload of a [PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob). | Dict | Yes | `{}`\n", - "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs).| Dict | Yes | `{}`\n", - "wait_interval | The number of seconds to pause between polling the operation. 
| Integer | Yes | `30`\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|----------|-------------|----------|-----------|-----------------|---------|\n", + "| project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No | GCPProjectID | | |\n", + "| region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | |\n", + "| cluster_name | The name of the cluster to run the job. | No | String | | |\n", + "| queries | The queries to execute the Pig job. Specify multiple queries in one string by separating them with semicolons. You do not need to terminate queries with semicolons. | Yes | List | | None |\n", + "| query_file_uri | The HCFS URI of the script that contains the Pig queries. | Yes | GCSPath | | None |\n", + "| script_variables | Mapping of the query’s variable names to their values (equivalent to the Pig command: SET name=\"value\";). | Yes | Dict | | None |\n", + "| pig_job | The payload of a [PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob). | Yes | Dict | | None |\n", + "| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None |\n", + "| wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 |\n", "\n", "## Output\n", "Name | Description | Type\n", @@ -29,20 +38,22 @@ "job_id | The ID of the created job. | String\n", "\n", "## Cautions & requirements\n", + "\n", "To use the component, you must:\n", - "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", - "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n", - "\n", - "## Detailed Description\n", + "* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", + "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", + "\n", + " ```\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.\n", + "\n", + "## Detailed description\n", "This component creates a Pig job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "Follow these steps to use the component in a pipeline:\n", + "1. 
Install the Kubeflow Pipeline SDK:\n" ] }, { @@ -81,29 +92,22 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_pig_job.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_pig_job/sample.ipynb)\n", - "* [Dataproc PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob)\n", - "\n", "### Sample\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", + "Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n", + "\n", "\n", "#### Setup a Dataproc cluster\n", + "\n", "[Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code.\n", "\n", - "#### Prepare Pig query\n", - "Directly put your Pig queries in the `queries` list or upload your Pig queries into a file to a Google Cloud Storage (GCS) bucket and place the path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a local `passwd` file.\n", "\n", - "For more details, please checkout [Pig documentation](http://pig.apache.org/docs/latest/)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ + "#### Prepare a Pig query\n", + "\n", + "Either put your Pig queries in the `queries` list, or upload your Pig queries into a file to a Cloud Storage bucket and then enter the Cloud Storage bucket’s path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a local `passwd` file.\n", + "\n", + "For more details on Apache Pig, see the [Pig documentation.](http://pig.apache.org/docs/latest/)\n", + "\n", "#### Set sample parameters" ] }, @@ -218,11 +222,18 @@ ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], - "source": [] + "source": [ + "## References\n", + "* [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) \n", + "* [Pig documentation](http://pig.apache.org/docs/latest/)\n", + "* [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs)\n", + "* [PigJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PigJob)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." 
+ ] } ], "metadata": { @@ -241,7 +252,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/submit_pyspark_job/README.md b/components/gcp/dataproc/submit_pyspark_job/README.md index 3a5f6db5f89..7ba0533cb3e 100644 --- a/components/gcp/dataproc/submit_pyspark_job/README.md +++ b/components/gcp/dataproc/submit_pyspark_job/README.md @@ -1,21 +1,31 @@ -# Submitting a PySpark Job to Cloud Dataproc -A Kubeflow Pipeline component to submit a PySpark job to Google Cloud Dataproc service. +# Name +Data preparation using PySpark on Cloud Dataproc + + +# Label +Cloud Dataproc, GCP, Cloud Storage,PySpark, Kubeflow, pipelines, components + + +# Summary +A Kubeflow Pipeline component to prepare data by submitting a PySpark job to Cloud Dataproc. + + +# Details +## Intended use +Use the component to run an Apache PySpark job as one preprocessing step in a Kubeflow Pipeline. -## Intended Use -Use the component to run an Apache PySpark job as one preprocessing step in a KFP pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No | -region | The Dataproc region that handles the request. | GCPRegion | No | -cluster_name | The name of the cluster that runs the job. | String | No | -main_python_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the main Python file to use as the driver. Must be a .py file. | GCSPath | No | -args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | List | Yes | `[]` -pyspark_job | The payload of a [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob). | Dict | Yes | `{}` -job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}` -wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30` +| Argument | Description | Optional | Data type | Accepted values | Default | +|----------------------|------------|----------|--------------|-----------------|---------| +| project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No | GCPProjectID | | | +| region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | | +| cluster_name | The name of the cluster to run the job. | No | String | | | +| main_python_file_uri | The HCFS URI of the Python file to use as the driver. This must be a .py file. | No | GCSPath | | | +| args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | Yes | List | | None | +| pyspark_job | The payload of a [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob). | Yes | Dict | | None | +| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None | ## Output Name | Description | Type @@ -23,21 +33,24 @@ Name | Description | Type job_id | The ID of the created job. 
| String ## Cautions & requirements + To use the component, you must: -* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). -* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -``` -component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) -``` -* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project. +* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). +* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). +* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -## Detailed Description -This component creates a PySpark job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit). + ``` + component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` +* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project. -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +## Detailed description +This component creates a PySpark job from the [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit). + +Follow these steps to use the component in a pipeline: + +1. Install the Kubeflow Pipeline SDK: ```python @@ -58,21 +71,19 @@ dataproc_submit_pyspark_job_op = comp.load_component_from_url( help(dataproc_submit_pyspark_job_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_pyspark_job.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_pyspark_job/sample.ipynb) -* [Dataproc PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob) - ### Sample -Note: the sample code below works in both IPython notebook or python code directly. +Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template. + #### Setup a Dataproc cluster + [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code. -#### Prepare PySpark job -Upload your PySpark code file to a Cloud Storage bucket. For example, thisis a publicly accessible hello-world.py in Cloud Storage: + +#### Prepare a PySpark job + +Upload your PySpark code file to a Cloud Storage bucket. 
For example, this is a publicly accessible `hello-world.py` in Cloud Storage: ```python @@ -151,7 +162,11 @@ run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` +## References -```python +* [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) +* [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob) +* [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs) -``` +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataproc/submit_pyspark_job/sample.ipynb b/components/gcp/dataproc/submit_pyspark_job/sample.ipynb index 6fac3c069c3..f9f8bc09245 100644 --- a/components/gcp/dataproc/submit_pyspark_job/sample.ipynb +++ b/components/gcp/dataproc/submit_pyspark_job/sample.ipynb @@ -4,23 +4,33 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a PySpark Job to Cloud Dataproc\n", - "A Kubeflow Pipeline component to submit a PySpark job to Google Cloud Dataproc service. \n", + "# Name\n", + "Data preparation using PySpark on Cloud Dataproc\n", + "\n", + "\n", + "# Label\n", + "Cloud Dataproc, GCP, Cloud Storage,PySpark, Kubeflow, pipelines, components\n", + "\n", + "\n", + "# Summary\n", + "A Kubeflow Pipeline component to prepare data by submitting a PySpark job to Cloud Dataproc.\n", + "\n", + "\n", + "# Details\n", + "## Intended use\n", + "Use the component to run an Apache PySpark job as one preprocessing step in a Kubeflow Pipeline.\n", "\n", - "## Intended Use\n", - "Use the component to run an Apache PySpark job as one preprocessing step in a KFP pipeline. \n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n", - "region | The Dataproc region that handles the request. | GCPRegion | No |\n", - "cluster_name | The name of the cluster that runs the job. | String | No |\n", - "main_python_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the main Python file to use as the driver. Must be a .py file. | GCSPath | No |\n", - "args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | List | Yes | `[]`\n", - "pyspark_job | The payload of a [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob). | Dict | Yes | `{}`\n", - "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}`\n", - "wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30`\n", + "| Argument | Description | Optional | Data type | Accepted values | Default |\n", + "|----------------------|------------|----------|--------------|-----------------|---------|\n", + "| project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No | GCPProjectID | | |\n", + "| region | The Cloud Dataproc region to handle the request. 
| No | GCPRegion | | |\n", + "| cluster_name | The name of the cluster to run the job. | No | String | | |\n", + "| main_python_file_uri | The HCFS URI of the Python file to use as the driver. This must be a .py file. | No | GCSPath | | |\n", + "| args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | Yes | List | | None |\n", + "| pyspark_job | The payload of a [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob). | Yes | Dict | | None |\n", + "| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None |\n", "\n", "## Output\n", "Name | Description | Type\n", @@ -28,20 +38,24 @@ "job_id | The ID of the created job. | String\n", "\n", "## Cautions & requirements\n", + "\n", "To use the component, you must:\n", - "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", - "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n", - "\n", - "## Detailed Description\n", - "This component creates a PySpark job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).\n", - "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", + "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", + "\n", + " ```\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.\n", + "\n", + "## Detailed description\n", + "\n", + "This component creates a PySpark job from the [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).\n", + "\n", + "Follow these steps to use the component in a pipeline:\n", + "\n", + "1. 
Install the Kubeflow Pipeline SDK:" ] }, { @@ -80,21 +94,19 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_pyspark_job.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_pyspark_job/sample.ipynb)\n", - "* [Dataproc PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob)\n", - "\n", "### Sample\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", + "Note: The following sample code works in an IPython notebook or directly in Python code. See the sample code below to learn how to execute the template.\n", + "\n", "\n", "#### Setup a Dataproc cluster\n", + "\n", "[Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code.\n", "\n", - "#### Prepare PySpark job\n", - "Upload your PySpark code file to a Cloud Storage bucket. For example, thisis a publicly accessible hello-world.py in Cloud Storage:" + "\n", + "#### Prepare a PySpark job\n", + "\n", + "Upload your PySpark code file to a Cloud Storage bucket. For example, this is a publicly accessible `hello-world.py` in Cloud Storage:" ] }, { @@ -219,11 +231,18 @@ ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], - "source": [] + "source": [ + "## References\n", + "\n", + "* [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) \n", + "* [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob)\n", + "* [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." + ] } ], "metadata": { @@ -242,7 +261,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/submit_spark_job/README.md b/components/gcp/dataproc/submit_spark_job/README.md index 4c7ad7fcda8..5cad85794b5 100644 --- a/components/gcp/dataproc/submit_spark_job/README.md +++ b/components/gcp/dataproc/submit_spark_job/README.md @@ -1,22 +1,36 @@ -# Submitting a Spark Job to Cloud Dataproc -A Kubeflow Pipeline component to submit a Spark job to Google Cloud Dataproc service. +# Name -## Intended Use -Use the component to run an Apache Spark job as one preprocessing step in a KFP pipeline. +Data preparation using Spark on YARN with Cloud Dataproc + + +# Label + +Cloud Dataproc, GCP, Cloud Storage, Spark, Kubeflow, pipelines, components, YARN + + +# Summary + +A Kubeflow Pipeline component to prepare data by submitting a Spark job on YARN to Cloud Dataproc. + +# Details + +## Intended use + +Use the component to run an Apache Spark job as one preprocessing step in a Kubeflow Pipeline. 
## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No | -region | The Dataproc region that handles the request. | GCPRegion | No | -cluster_name | The name of the cluster that runs the job. | String | No | -main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the jar file that contains the main class. | GCSPath | No | -main_class | The name of the driver's main class. The jar file that contains the class must be in the default CLASSPATH or specified in `spark_job.jarFileUris`. | String | No | -args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | List | Yes | `[]` -spark_job | The payload of a [SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob). | Dict | Yes | `{}` -job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}` -wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30` +Argument | Description | Optional | Data type | Accepted values | Default | +:--- | :---------- | :--- | :------- | :------| :------| +project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to.|No | GCPProjectID | | | +region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | | +cluster_name | The name of the cluster to run the job. | No | String | | | +main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the JAR file that contains the main class. | No | GCSPath | | | +main_class | The name of the driver's main class. The JAR file that contains the class must be either in the default CLASSPATH or specified in `spark_job.jarFileUris`.| No | | | | +args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission.| Yes | | | | +spark_job | The payload of a [SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob).| Yes | | | | +job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | | | | +wait_interval | The number of seconds to wait between polling the operation. | Yes | | | 30 | ## Output Name | Description | Type @@ -24,22 +38,33 @@ Name | Description | Type job_id | The ID of the created job. | String ## Cautions & requirements + To use the component, you must: -* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). -* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). -* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: -``` -component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) -``` -* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project. -## Detailed Description + + +* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project). +* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). 
+* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example: + + ``` + component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa')) + ``` + + +* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project. + + +## Detailed description + This component creates a Spark job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit). -Here are the steps to use the component in a pipeline: -1. Install KFP SDK +Follow these steps to use the component in a pipeline: + +1. Install the Kubeflow Pipeline SDK: + ```python %%capture --no-stderr @@ -59,25 +84,21 @@ dataproc_submit_spark_job_op = comp.load_component_from_url( help(dataproc_submit_spark_job_op) ``` -For more information about the component, please checkout: -* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_spark_job.py) -* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) -* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_spark_job/sample.ipynb) -* [Dataproc SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob) - ### Sample +Note: The following sample code works in an IPython notebook or directly in Python code. -Note: the sample code below works in both IPython notebook or python code directly. -#### Setup a Dataproc cluster +#### Set up a Dataproc cluster [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code. -#### Prepare Spark job -Upload your Spark jar file to a Cloud Storage (GCS) bucket. In the sample, we will use a jar file that is pre-installed in the main cluster `file:///usr/lib/spark/examples/jars/spark-examples.jar`. -Here is the [Pi example source code](https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaSparkPi.java). +#### Prepare a Spark job +Upload your Spark JAR file to a Cloud Storage bucket. In the sample, we use a JAR file that is preinstalled in the main cluster: `file:///usr/lib/spark/examples/jars/spark-examples.jar`. + +Here is the [source code of the sample](https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaSparkPi.java). + +To package a self-contained Spark application, follow these [instructions](https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications). -To package a self-contained spark application, follow the [instructions](https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications). 
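+
+If you package your own application instead of using the preinstalled example JAR, upload the packaged JAR to Cloud Storage first so that it can be referenced by a `gs://` URI. The snippet below is a minimal sketch that assumes the `google-cloud-storage` client library and placeholder project, bucket, and file names:
+
+```python
+# Minimal sketch: upload a locally built Spark application JAR to Cloud Storage.
+# All names and paths below are placeholders.
+from google.cloud import storage
+
+client = storage.Client(project='<your-project-id>')
+bucket = client.bucket('<your-bucket>')
+blob = bucket.blob('spark/jars/my-spark-app.jar')
+blob.upload_from_filename('target/scala-2.12/my-spark-app_2.12-1.0.jar')
+print('Uploaded to gs://{}/{}'.format(bucket.name, blob.name))
+```
+
+The resulting `gs://` URI can then be used as the `main_jar_file_uri` argument in place of the preinstalled JAR path.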
#### Set sample parameters @@ -154,7 +175,12 @@ run_name = pipeline_func.__name__ + ' run' run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments) ``` +## References -```python +* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_spark_job.py) +* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile) +* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_spark_job/sample.ipynb) +* [Dataproc SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob) -``` +## License +By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control. diff --git a/components/gcp/dataproc/submit_spark_job/sample.ipynb b/components/gcp/dataproc/submit_spark_job/sample.ipynb index 0681629ce31..3d2b79cdc42 100644 --- a/components/gcp/dataproc/submit_spark_job/sample.ipynb +++ b/components/gcp/dataproc/submit_spark_job/sample.ipynb @@ -4,24 +4,38 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Submitting a Spark Job to Cloud Dataproc\n", - "A Kubeflow Pipeline component to submit a Spark job to Google Cloud Dataproc service. \n", + "# Name\n", "\n", - "## Intended Use\n", - "Use the component to run an Apache Spark job as one preprocessing step in a KFP pipeline. \n", + "Data preparation using Spark on YARN with Cloud Dataproc\n", + "\n", + "\n", + "# Label\n", + "\n", + "Cloud Dataproc, GCP, Cloud Storage, Spark, Kubeflow, pipelines, components, YARN\n", + "\n", + "\n", + "# Summary\n", + "\n", + "A Kubeflow Pipeline component to prepare data by submitting a Spark job on YARN to Cloud Dataproc.\n", + "\n", + "# Details\n", + "\n", + "## Intended use\n", + "\n", + "Use the component to run an Apache Spark job as one preprocessing step in a Kubeflow Pipeline.\n", "\n", "## Runtime arguments\n", - "Name | Description | Type | Optional | Default\n", - ":--- | :---------- | :--- | :------- | :------\n", - "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n", - "region | The Dataproc region that handles the request. | GCPRegion | No |\n", - "cluster_name | The name of the cluster that runs the job. | String | No |\n", - "main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the jar file that contains the main class. | GCSPath | No |\n", - "main_class | The name of the driver's main class. The jar file that contains the class must be in the default CLASSPATH or specified in `spark_job.jarFileUris`. | String | No |\n", - "args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission. | List | Yes | `[]`\n", - "spark_job | The payload of a [SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob). | Dict | Yes | `{}`\n", - "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}`\n", - "wait_interval | The number of seconds to pause between polling the operation. 
| Integer | Yes | `30`\n", + "Argument | Description | Optional | Data type | Accepted values | Default |\n", + ":--- | :---------- | :--- | :------- | :------| :------| \n", + "project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to.|No | GCPProjectID | | |\n", + "region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | | \n", + "cluster_name | The name of the cluster to run the job. | No | String | | |\n", + "main_jar_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the JAR file that contains the main class. | No | GCSPath | | |\n", + "main_class | The name of the driver's main class. The JAR file that contains the class must be either in the default CLASSPATH or specified in `spark_job.jarFileUris`.| No | | | | \n", + "args | The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission.| Yes | | | |\n", + "spark_job | The payload of a [SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob).| Yes | | | |\n", + "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | | | |\n", + "wait_interval | The number of seconds to wait between polling the operation. | Yes | | | 30 |\n", "\n", "## Output\n", "Name | Description | Type\n", @@ -29,20 +43,32 @@ "job_id | The ID of the created job. | String\n", "\n", "## Cautions & requirements\n", + "\n", "To use the component, you must:\n", - "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", - "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", - "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", - "```\n", - "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", - "```\n", - "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n", - "\n", - "## Detailed Description\n", + "\n", + "\n", + "\n", + "* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n", + "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n", + "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n", + "\n", + " ```\n", + " component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n", + " ```\n", + "\n", + "\n", + "* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.\n", + "\n", + "\n", + "## Detailed description\n", + "\n", "This component creates a Spark job from [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).\n", "\n", - "Here are the steps to use the component in a pipeline:\n", - "1. Install KFP SDK\n" + "Follow these steps to use the component in a pipeline:\n", + "\n", + "\n", + "\n", + "1. 
Install the Kubeflow Pipeline SDK:" ] }, { @@ -81,31 +107,22 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For more information about the component, please checkout:\n", - "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_spark_job.py)\n", - "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", - "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_spark_job/sample.ipynb)\n", - "* [Dataproc SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob)\n", - "\n", "### Sample\n", + "Note: The following sample code works in an IPython notebook or directly in Python code.\n", "\n", - "Note: the sample code below works in both IPython notebook or python code directly.\n", "\n", - "#### Setup a Dataproc cluster\n", + "#### Set up a Dataproc cluster\n", "[Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code.\n", "\n", - "#### Prepare Spark job\n", - "Upload your Spark jar file to a Cloud Storage (GCS) bucket. In the sample, we will use a jar file that is pre-installed in the main cluster `file:///usr/lib/spark/examples/jars/spark-examples.jar`. \n", "\n", - "Here is the [Pi example source code](https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaSparkPi.java).\n", + "#### Prepare a Spark job\n", + "Upload your Spark JAR file to a Cloud Storage bucket. In the sample, we use a JAR file that is preinstalled in the main cluster: `file:///usr/lib/spark/examples/jars/spark-examples.jar`.\n", + "\n", + "Here is the [source code of the sample](https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaSparkPi.java).\n", + "\n", + "To package a self-contained Spark application, follow these [instructions](https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications).\n", + "\n", "\n", - "To package a self-contained spark application, follow the [instructions](https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ "#### Set sample parameters" ] }, @@ -218,11 +235,19 @@ ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], - "source": [] + "source": [ + "## References\n", + "\n", + "* [Component Python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_spark_job.py)\n", + "* [Component Docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n", + "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_spark_job/sample.ipynb)\n", + "* [Dataproc SparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkJob)\n", + "\n", + "## License\n", + "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control." 
+ ] } ], "metadata": { @@ -241,7 +266,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/dataproc/submit_sparksql_job/README.md b/components/gcp/dataproc/submit_sparksql_job/README.md index 841e582a06a..4b743859ad8 100644 --- a/components/gcp/dataproc/submit_sparksql_job/README.md +++ b/components/gcp/dataproc/submit_sparksql_job/README.md @@ -1,22 +1,30 @@ -# Submitting a SparkSql Job to Cloud Dataproc -A Kubeflow Pipeline component to submit a SparkSql job to Google Cloud Dataproc service. +# Name +Data preparation using SparkSQL on YARN with Cloud Dataproc -## Intended Use -Use the component to run an Apache SparkSql job as one preprocessing step in a KFP pipeline. +# Label +Cloud Dataproc, GCP, Cloud Storage, YARN, SparkSQL, Kubeflow, pipelines, components + +# Summary +A Kubeflow Pipeline component to prepare data by submitting a SparkSql job on YARN to Cloud Dataproc. + +# Details + +## Intended use +Use the component to run an Apache SparkSql job as one preprocessing step in a Kubeflow Pipeline. ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No | -region | The Dataproc region that handles the request. | GCPRegion | No | -cluster_name | The name of the cluster that runs the job. | String | No | -queries | The queries to execute. You do not need to terminate a query with a semicolon. Multiple queries can be specified in one string by separating each with a semicolon. | List | Yes | `[]` -query_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the script that contains SQL queries.| GCSPath | Yes | ` ` -script_variables | Mapping of query variable names to values (equivalent to the Spark SQL command: SET name="value";). | List | Yes | `[]` -sparksql_job | The payload of a [SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob). | Dict | Yes | `{}` -job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}` -wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30` +Argument| Description | Optional | Data type| Accepted values| Default | +:--- | :---------- | :--- | :------- | :------ | :------ +project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No| GCPProjectID | | | +region | The Cloud Dataproc region to handle the request. | No | GCPRegion| +cluster_name | The name of the cluster to run the job. | No | String| | | +queries | The queries to execute the SparkSQL job. Specify multiple queries in one string by separating them with semicolons. You do not need to terminate queries with semicolons. | Yes | List | | None | +query_file_uri | The HCFS URI of the script that contains the SparkSQL queries.| Yes | GCSPath | | None | +script_variables | Mapping of the query’s variable names to their values (equivalent to the SparkSQL command: SET name="value";).| Yes| Dict | | None | +sparksql_job | The payload of a [SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob). | Yes | Dict | | None | +job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). 
+wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 |
 
 ## Output
 Name | Description | Type
 :--- | :---------- | :---
 job_id | The ID of the created job. | String
 
 ## Cautions & requirements
 To use the component, you must:
-* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).
+* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).
 * [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).
-* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:
+* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:
 ```
 component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))
 ```
-* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.
+* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.
 
 ## Detailed Description
 This component creates a SparkSql job from the [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).
 
-Here are the steps to use the component in a pipeline:
-1. Install KFP SDK
-
+Follow these steps to use the component in a pipeline:
+1. Install the Kubeflow Pipeline SDK:
 
 
 ```python
@@ -59,23 +66,17 @@ dataproc_submit_sparksql_job_op = comp.load_component_from_url(
 help(dataproc_submit_sparksql_job_op)
 ```
 
-For more information about the component, please checkout:
-* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_sparksql_job.py)
-* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)
-* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_sparksql_job/sample.ipynb)
-* [Dataproc SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob)
-
 ### Sample
-Note: the sample code below works in both IPython notebook or python code directly.
+Note: The following sample code works in an IPython notebook or directly in Python code.
 
-#### Setup a Dataproc cluster
+#### Set up a Dataproc cluster
 [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code.
 
-#### Prepare SparkSQL job
-Directly put your SparkSQL queries in the `queires` list or upload your SparkSQL queries into a file to a Google Cloud Storage (GCS) bucket and place the path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a public CSV file from GCS.
+#### Prepare a SparkSQL job
+Either put your SparkSQL queries in the `queries` list, or upload a file containing your SparkSQL queries to a Cloud Storage bucket and then enter the file's Cloud Storage path in `query_file_uri`. In this sample, we use a hard-coded query in the `queries` list to select data from a public CSV file in Cloud Storage.
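For illustration only, here is a minimal, hypothetical sketch of passing an inline query to the `dataproc_submit_sparksql_job_op` loaded above. The project ID, region, cluster name, and table name are placeholders, and whether list arguments must be JSON-serialized can vary by component version, so confirm the expected types with `help(dataproc_submit_sparksql_job_op)`.

```python
import json

import kfp.dsl as dsl
import kfp.gcp as gcp

# Placeholder values -- replace with your own project, region, and cluster.
PROJECT_ID = 'my-project-id'
REGION = 'us-central1'
CLUSTER_NAME = 'my-cluster'

# A single inline SparkSQL query. Multiple statements can also be combined in
# one string separated by semicolons, per the `queries` argument description.
QUERY = 'SELECT COUNT(*) FROM my_table'  # hypothetical table name

@dsl.pipeline(
    name='Dataproc submit SparkSQL job sketch',
    description='Illustrative only; mirrors the full sample pipeline below.'
)
def sparksql_sketch_pipeline():
    # Some component versions expect list arguments serialized as JSON strings,
    # so the query list is passed through json.dumps here.
    dataproc_submit_sparksql_job_op(
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        queries=json.dumps([QUERY])
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))
```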
-For more details about Spark SQL, please checkout the [programming guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)
+For more details about Spark SQL, see the [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html).
 
 #### Set sample parameters
 
@@ -167,7 +168,11 @@ run_name = pipeline_func.__name__ + ' run'
 run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)
 ```
 
+## References
+* [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)
+* [SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob)
+* [Cloud Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs)
 
-```python
-```
+## License
+By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control.
diff --git a/components/gcp/dataproc/submit_sparksql_job/sample.ipynb b/components/gcp/dataproc/submit_sparksql_job/sample.ipynb
index 7d8709fa8c7..7e1ec4b84e8 100644
--- a/components/gcp/dataproc/submit_sparksql_job/sample.ipynb
+++ b/components/gcp/dataproc/submit_sparksql_job/sample.ipynb
@@ -4,24 +4,32 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Submitting a SparkSql Job to Cloud Dataproc\n",
-    "A Kubeflow Pipeline component to submit a SparkSql job to Google Cloud Dataproc service. \n",
+    "# Name\n",
+    "Data preparation using SparkSQL on YARN with Cloud Dataproc\n",
     "\n",
-    "## Intended Use\n",
-    "Use the component to run an Apache SparkSql job as one preprocessing step in a KFP pipeline. \n",
+    "# Label\n",
+    "Cloud Dataproc, GCP, Cloud Storage, YARN, SparkSQL, Kubeflow, pipelines, components\n",
+    "\n",
+    "# Summary\n",
+    "A Kubeflow Pipeline component to prepare data by submitting a SparkSql job on YARN to Cloud Dataproc.\n",
+    "\n",
+    "# Details\n",
+    "\n",
+    "## Intended use\n",
+    "Use the component to run an Apache SparkSql job as one preprocessing step in a Kubeflow Pipeline.\n",
     "\n",
     "## Runtime arguments\n",
-    "Name | Description | Type | Optional | Default\n",
-    ":--- | :---------- | :--- | :------- | :------\n",
-    "project_id | The Google Cloud Platform (GCP) project ID that the cluster belongs to. | GCPProjectID | No |\n",
-    "region | The Dataproc region that handles the request. | GCPRegion | No |\n",
-    "cluster_name | The name of the cluster that runs the job. | String | No |\n",
-    "queries | The queries to execute. You do not need to terminate a query with a semicolon. Multiple queries can be specified in one string by separating each with a semicolon. | List | Yes | `[]`\n",
-    "query_file_uri | The Hadoop Compatible Filesystem (HCFS) URI of the script that contains SQL queries.| GCSPath | Yes | ` `\n",
-    "script_variables | Mapping of query variable names to values (equivalent to the Spark SQL command: SET name=\"value\";). | List | Yes | `[]`\n",
-    "sparksql_job | The payload of a [SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob). | Dict | Yes | `{}`\n",
-    "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Dict | Yes | `{}`\n",
-    "wait_interval | The number of seconds to pause between polling the operation. | Integer | Yes | `30`\n",
+    "Argument | Description | Optional | Data type | Accepted values | Default |\n",
+    ":--- | :---------- | :--- | :------- | :------ | :------\n",
+    "project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No | GCPProjectID | | |\n",
+    "region | The Cloud Dataproc region to handle the request. | No | GCPRegion | | |\n",
+    "cluster_name | The name of the cluster to run the job. | No | String | | |\n",
+    "queries | The queries to execute in the SparkSQL job. Specify multiple queries in one string by separating them with semicolons. You do not need to terminate queries with semicolons. | Yes | List | | None |\n",
+    "query_file_uri | The HCFS URI of the script that contains the SparkSQL queries. | Yes | GCSPath | | None |\n",
+    "script_variables | Mapping of the query’s variable names to their values (equivalent to the SparkSQL command: SET name=\"value\";). | Yes | Dict | | None |\n",
+    "sparksql_job | The payload of a [SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob). | Yes | Dict | | None |\n",
+    "job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict | | None |\n",
+    "wait_interval | The number of seconds to pause between polling the operation. | Yes | Integer | | 30 |\n",
     "\n",
     "## Output\n",
     "Name | Description | Type\n",
     ":--- | :---------- | :---\n",
     "job_id | The ID of the created job. | String\n",
     "\n",
     "## Cautions & requirements\n",
     "To use the component, you must:\n",
-    "* Setup project by following the [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n",
+    "* Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).\n",
     "* [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).\n",
-    "* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n",
+    "* Run the component under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n",
     "```\n",
     "component_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n",
     "```\n",
-    "* Grant Kubeflow user service account the `roles/dataproc.editor` role on the project.\n",
+    "* Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.\n",
     "\n",
     "## Detailed Description\n",
     "This component creates a SparkSql job from the [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).\n",
     "\n",
-    "Here are the steps to use the component in a pipeline:\n",
-    "1. Install KFP SDK\n"
+    "Follow these steps to use the component in a pipeline:\n",
+    "1. Install the Kubeflow Pipeline SDK:"
   ]
  },
  {
@@ -81,29 +89,18 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "For more information about the component, please checkout:\n",
-    "* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/dataproc/_submit_sparksql_job.py)\n",
-    "* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n",
-    "* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/dataproc/submit_sparksql_job/sample.ipynb)\n",
-    "* [Dataproc SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob)\n",
-    "\n",
     "### Sample\n",
     "\n",
-    "Note: the sample code below works in both IPython notebook or python code directly.\n",
+    "Note: The following sample code works in an IPython notebook or directly in Python code.\n",
     "\n",
     "#### Set up a Dataproc cluster\n",
     "[Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) (or reuse an existing one) before running the sample code.\n",
     "\n",
-    "#### Prepare SparkSQL job\n",
-    "Directly put your SparkSQL queries in the `queires` list or upload your SparkSQL queries into a file to a Google Cloud Storage (GCS) bucket and place the path in `query_file_uri`. In this sample, we will use a hard coded query in the `queries` list to select data from a public CSV file from GCS.\n",
+    "#### Prepare a SparkSQL job\n",
+    "Either put your SparkSQL queries in the `queries` list, or upload a file containing your SparkSQL queries to a Cloud Storage bucket and then enter the file's Cloud Storage path in `query_file_uri`. In this sample, we use a hard-coded query in the `queries` list to select data from a public CSV file in Cloud Storage.\n",
+    "\n",
+    "For more details about Spark SQL, see the [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html).\n",
     "\n",
-    "For more details about Spark SQL, please checkout the [programming guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)"
   ]
  },
  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
    "#### Set sample parameters"
   ]
  },
@@ -231,11 +228,18 @@
   ]
  },
  {
-   "cell_type": "code",
-   "execution_count": null,
+   "cell_type": "markdown",
    "metadata": {},
-   "outputs": [],
-   "source": []
+   "source": [
+    "## References\n",
+    "* [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)\n",
+    "* [SparkSqlJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/SparkSqlJob)\n",
+    "* [Cloud Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs)\n",
+    "\n",
+    "\n",
+    "## License\n",
+    "By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control."
+ ] } ], "metadata": { @@ -254,7 +258,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.6.4" } }, "nbformat": 4, diff --git a/components/gcp/ml_engine/batch_predict/README.md b/components/gcp/ml_engine/batch_predict/README.md index 1e38885b54f..c6674458606 100644 --- a/components/gcp/ml_engine/batch_predict/README.md +++ b/components/gcp/ml_engine/batch_predict/README.md @@ -1,23 +1,49 @@ -# Batch predicting using Cloud Machine Learning Engine -A Kubeflow Pipeline component to submit a batch prediction job against a trained model to Cloud ML Engine service. +# Name + +Batch prediction using Cloud Machine Learning Engine + + +# Label + +Cloud Storage, Cloud ML Engine, Kubeflow, Pipeline, Component + + +# Summary + +A Kubeflow Pipeline component to submit a batch prediction job against a deployed model on Cloud ML Engine. + + +# Details + ## Intended use -Use the component to run a batch prediction job against a deployed model in Cloud Machine Learning Engine. The prediction output will be stored in a Cloud Storage bucket. + +Use the component to run a batch prediction job against a deployed model on Cloud ML Engine. The prediction output is stored in a Cloud Storage bucket. + ## Runtime arguments -Name | Description | Type | Optional | Default -:--- | :---------- | :--- | :------- | :------ -project_id | The ID of the parent project of the job. | GCPProjectID | No | -model_path | Required. The path to the model. It can be one of the following paths: