Commit
Apply latest doc review changes to github docs (#1128)
* Apply latest doc review changes to github docs

* merge changes from tech writer

* adding missing dataproc components
hongye-sun authored and k8s-ci-robot committed Apr 18, 2019
1 parent d673a1f commit 1115fa5
Showing 28 changed files with 1,767 additions and 1,183 deletions.
105 changes: 68 additions & 37 deletions components/gcp/bigquery/query/README.md
@@ -1,49 +1,78 @@

# Submitting a query using BigQuery
A Kubeflow Pipeline component to submit a query to the Google Cloud BigQuery service and write the output to a Google Cloud Storage blob.
# Name

## Intended Use
The component is intended to export query data from the BigQuery service to Cloud Storage.
Gather training data by querying BigQuery

## Runtime arguments
Name | Description | Data type | Optional | Default
:--- | :---------- | :-------- | :------- | :------
query | The query used by the BigQuery service to fetch the results. | String | No |
project_id | The project to execute the query job. | GCPProjectID | No |
dataset_id | The ID of the persistent dataset to keep the results of the query. If the dataset does not exist, the operation will create a new one. | String | Yes | ` `
table_id | The ID of the table to keep the results of the query. If absent, the operation will generate a random id for the table. | String | Yes | ` `
output_gcs_path | The path to the Cloud Storage bucket to store the query output. | GCSPath | Yes | ` `
dataset_location | The location to create the dataset. Defaults to `US`. | String | Yes | `US`
job_config | The full config spec for the query job. See [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) for details. | Dict | Yes | ` `

# Labels

GCP, BigQuery, Kubeflow, Pipeline


# Summary

A Kubeflow Pipeline component to submit a query to BigQuery and store the result in a Cloud Storage bucket.


# Details


## Intended use

Use this Kubeflow component to:
* Select training data by submitting a query to BigQuery.
* Output the training data into a Cloud Storage bucket as CSV files.


## Runtime arguments:


| Argument | Description | Optional | Data type | Accepted values | Default |
|----------|-------------|----------|-----------|-----------------|---------|
| query | The query used by BigQuery to fetch the results. | No | String | | |
| project_id | The project ID of the Google Cloud Platform (GCP) project to use to execute the query. | No | GCPProjectID | | |
| dataset_id | The ID of the persistent BigQuery dataset to store the results of the query. If the dataset does not exist, the operation will create a new one. | Yes | String | | None |
| table_id | The ID of the BigQuery table to store the results of the query. If the table ID is absent, the operation will generate a random ID for the table. | Yes | String | | None |
| output_gcs_path | The path to the Cloud Storage bucket to store the query output. | Yes | GCSPath | | None |
| dataset_location | The location where the dataset is created. Defaults to US. | Yes | String | | US |
| job_config | The full configuration specification for the query job. See [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) for details. | Yes | Dict | A JSON object that has the same structure as [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) | None |
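
As a quick illustration of how these arguments fit together, the sketch below wires them into a single pipeline step. This is a minimal sketch, not canonical usage: the component URL, project, bucket, and query values are placeholders, and the secret usage is explained under Cautions & requirements.

```python
import kfp.components as comp
import kfp.dsl as dsl
import kfp.gcp as gcp

# Load the component as in the sample below; replace the URL placeholder with
# the released component definition.
bigquery_query_op = comp.load_component_from_url('...')

@dsl.pipeline(name='BigQuery query sketch', description='Illustrative use of the runtime arguments')
def bigquery_query_sketch():
    bigquery_query_op(
        query='SELECT ...',                       # see the example query under "Input data schema"
        project_id='my-gcp-project',              # placeholder GCP project ID
        dataset_id='my_dataset',                  # created if it does not exist
        table_id='',                              # empty: a random table ID is generated
        output_gcs_path='gs://my-bucket/bq/out',  # where the CSV output is written
        dataset_location='US',
        job_config=''                             # optionally a dict shaped like QueryJobConfig
    ).apply(gcp.use_gcp_secret('user-gcp-sa'))
```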
## Input data schema

The input data is a BigQuery job containing a query that pulls data from various sources.
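
For illustration only, the query can be any standard SQL statement that selects the training columns you need; the public dataset and columns below are placeholders and are not required by the component.

```python
# Example only: a standard SQL query that could be passed as the `query` argument.
# The public dataset and columns are illustrative placeholders.
QUERY = """
SELECT
  weight_pounds,
  mother_age,
  gestation_weeks
FROM
  `bigquery-public-data.samples.natality`
WHERE
  weight_pounds IS NOT NULL
LIMIT 10000
"""
```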


## Output:

## Outputs
Name | Description | Type
:--- | :---------- | :---
output_gcs_path | The path to the Cloud Storage bucket containing the query output in CSV format. | GCSPath

## Cautions and requirements
## Cautions & requirements

To use the component, the following requirements must be met:
* BigQuery API is enabled
* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:

```python
bigquery_query_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))
```
* The BigQuery API is enabled.
* The component is running under a secret of the [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow Pipelines cluster. For example:

```python
bigquery_query_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))
```
* The Kubeflow user service account is a member of the `roles/bigquery.admin` role of the project.
* The Kubeflow user service account is a member of the `roles/storage.objectCreator` role of the Cloud Storage output bucket.
* The Kubeflow user service account is a member of the `roles/bigquery.admin` role of the project.
* The Kubeflow user service account is also a member of the `roles/storage.objectCreator` role of the Cloud Storage output bucket.
## Detailed description
This Kubeflow Pipeline component is used to:
* Submit a query to BigQuery.
* The query results are persisted in a dataset table in BigQuery.
* An extract job is created in BigQuery to extract the data from the dataset table and output it to a Cloud Storage bucket as CSV files.
## Detailed Description
The component does several things:
1. Creates a persistent dataset and table if they do not exist.
1. Submits a query to the BigQuery service and persists the result to the table.
1. Creates an extraction job to output the table data to a Cloud Storage bucket in CSV format (see the client-library sketch below).
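
The sketch below is illustrative only: it shows roughly the same three steps performed directly with the google-cloud-bigquery client library, not the component's implementation. The project, dataset, table, and bucket names are placeholders.

```python
from google.cloud import bigquery

# Illustrative only -- not the component's implementation; all names are placeholders.
client = bigquery.Client(project='my-gcp-project')

# 1. Create the persistent dataset if it does not exist
#    (the query job below creates the destination table).
dataset = bigquery.Dataset(client.dataset('my_dataset'))
dataset.location = 'US'
client.create_dataset(dataset, exists_ok=True)  # exists_ok requires a recent client release

# 2. Submit the query and persist the result to the destination table.
table_ref = client.dataset('my_dataset').table('my_table')
job_config = bigquery.QueryJobConfig()
job_config.destination = table_ref
client.query('SELECT 1 AS x', job_config=job_config, location='US').result()

# 3. Extract the table to Cloud Storage; CSV is the default extract format.
client.extract_table(table_ref, 'gs://my-bucket/bq/out/data.csv', location='US').result()
```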
Use the code below as an example of how to run your BigQuery job.
Here are the steps to use the component in a pipeline:
1. Install KFP SDK
### Sample
Note: The following sample code works in an IPython notebook or directly in Python code.
#### Set sample parameters
```python
@@ -64,13 +93,6 @@ bigquery_query_op = comp.load_component_from_url(
help(bigquery_query_op)
```

For more information about the component, please check out:
* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/bigquery/_query.py)
* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)
* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb)
* [BigQuery query REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query)


### Sample

Note: The following sample code works in an IPython notebook or directly in Python code.
@@ -161,3 +183,12 @@ run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arg
```python
!gsutil cat OUTPUT_PATH
```
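
If you prefer to inspect the results as a DataFrame rather than raw text, a sketch like the following works, assuming the gcsfs package is installed so that pandas can read gs:// paths, and that the placeholder path points at one of the CSV shards written under output_gcs_path.

```python
import pandas as pd

# Assumes gcsfs is installed; the path is a placeholder for a CSV shard
# written under the component's output_gcs_path.
df = pd.read_csv('gs://my-bucket/bq/out/data.csv')
print(df.head())
```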

## References
* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/bigquery/_query.py)
* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)
* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb)
* [BigQuery query REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query)

## License
By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control.
113 changes: 75 additions & 38 deletions components/gcp/bigquery/query/sample.ipynb
@@ -4,50 +4,80 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Submitting a query using BigQuery \n",
"A Kubeflow Pipeline component to submit a query to Google Cloud Bigquery service and dump outputs to a Google Cloud Storage blob. \n",
"# Name\n",
"\n",
"## Intended Use\n",
"The component is intended to export query data from BiqQuery service to Cloud Storage. \n",
"Gather training data by querying BigQuery \n",
"\n",
"## Runtime arguments\n",
"Name | Description | Data type | Optional | Default\n",
":--- | :---------- | :-------- | :------- | :------\n",
"query | The query used by Bigquery service to fetch the results. | String | No |\n",
"project_id | The project to execute the query job. | GCPProjectID | No |\n",
"dataset_id | The ID of the persistent dataset to keep the results of the query. If the dataset does not exist, the operation will create a new one. | String | Yes | ` `\n",
"table_id | The ID of the table to keep the results of the query. If absent, the operation will generate a random id for the table. | String | Yes | ` `\n",
"output_gcs_path | The path to the Cloud Storage bucket to store the query output. | GCSPath | Yes | ` `\n",
"dataset_location | The location to create the dataset. Defaults to `US`. | String | Yes | `US`\n",
"job_config | The full config spec for the query job. See [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) for details. | Dict | Yes | ` `\n",
"\n",
"# Labels\n",
"\n",
"GCP, BigQuery, Kubeflow, Pipeline\n",
"\n",
"\n",
"# Summary\n",
"\n",
"A Kubeflow Pipeline component to submit a query to BigQuery and store the result in a Cloud Storage bucket.\n",
"\n",
"\n",
"# Details\n",
"\n",
"\n",
"## Intended use\n",
"\n",
"Use this Kubeflow component to:\n",
"* Select training data by submitting a query to BigQuery.\n",
"* Output the training data into a Cloud Storage bucket as CSV files.\n",
"\n",
"\n",
"## Runtime arguments:\n",
"\n",
"\n",
"| Argument | Description | Optional | Data type | Accepted values | Default |\n",
"|----------|-------------|----------|-----------|-----------------|---------|\n",
"| query | The query used by BigQuery to fetch the results. | No | String | | |\n",
"| project_id | The project ID of the Google Cloud Platform (GCP) project to use to execute the query. | No | GCPProjectID | | |\n",
"| dataset_id | The ID of the persistent BigQuery dataset to store the results of the query. If the dataset does not exist, the operation will create a new one. | Yes | String | | None |\n",
"| table_id | The ID of the BigQuery table to store the results of the query. If the table ID is absent, the operation will generate a random ID for the table. | Yes | String | | None |\n",
"| output_gcs_path | The path to the Cloud Storage bucket to store the query output. | Yes | GCSPath | | None |\n",
"| dataset_location | The location where the dataset is created. Defaults to US. | Yes | String | | US |\n",
"| job_config | The full configuration specification for the query job. See [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) for details. | Yes | Dict | A JSONobject which has the same structure as [QueryJobConfig](https://googleapis.github.io/google-cloud-python/latest/bigquery/generated/google.cloud.bigquery.job.QueryJobConfig.html#google.cloud.bigquery.job.QueryJobConfig) | None |\n",
"## Input data schema\n",
"\n",
"The input data is a BigQuery job containing a query that pulls data f rom various sources. \n",
"\n",
"\n",
"## Output:\n",
"\n",
"## Outputs\n",
"Name | Description | Type\n",
":--- | :---------- | :---\n",
"output_gcs_path | The path to the Cloud Storage bucket containing the query output in CSV format. | GCSPath\n",
"\n",
"## Cautions and requirements\n",
"## Cautions & requirements\n",
"\n",
"To use the component, the following requirements must be met:\n",
"* BigQuery API is enabled\n",
"* The component is running under a secret of [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow cluster. For example:\n",
"\n",
"```python\n",
"bigquery_query_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n",
"* The BigQuery API is enabled.\n",
"* The component is running under a secret [Kubeflow user service account](https://www.kubeflow.org/docs/started/getting-started-gke/#gcp-service-accounts) in a Kubeflow Pipeline cluster. For example:\n",
"\n",
" ```\n",
" bigquery_query_op(...).apply(gcp.use_gcp_secret('user-gcp-sa'))\n",
" ```\n",
"* The Kubeflow user service account is a member of the `roles/bigquery.admin` role of the project.\n",
"* The Kubeflow user service account is a member of the `roles/storage.objectCreator `role of the Cloud Storage output bucket.\n",
"\n",
"## Detailed description\n",
"This Kubeflow Pipeline component is used to:\n",
"* Submit a query to BigQuery.\n",
" * The query results are persisted in a dataset table in BigQuery.\n",
" * An extract job is created in BigQuery to extract the data from the dataset table and output it to a Cloud Storage bucket as CSV files.\n",
"\n",
"```\n",
" Use the code below as an example of how to run your BigQuery job.\n",
"\n",
"* The Kubeflow user service account is a member of `roles/bigquery.admin` role of the project.\n",
"* The Kubeflow user service account is also a member of `roles/storage.objectCreator` role of the Cloud Storage output bucket.\n",
"### Sample\n",
"\n",
"## Detailed Description\n",
"The component does several things:\n",
"1. Creates persistent dataset and table if they do not exist.\n",
"1. Submits a query to BigQuery service and persists the result to the table.\n",
"1. Creates an extraction job to output the table data to a Cloud Storage bucket in CSV format.\n",
"Note: The following sample code works in an IPython notebook or directly in Python code.\n",
"\n",
"Here are the steps to use the component in a pipeline:\n",
"1. Install KFP SDK\n"
"#### Set sample parameters"
]
},
{
@@ -86,13 +116,6 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For more information about the component, please checkout:\n",
"* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/bigquery/_query.py)\n",
"* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n",
"* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb)\n",
"* [BigQuery query REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query)\n",
"\n",
"\n",
"### Sample\n",
"\n",
"Note: The following sample code works in IPython notebook or directly in Python code.\n",
@@ -241,6 +264,20 @@
"source": [
"!gsutil cat OUTPUT_PATH"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"* [Component python code](https://github.com/kubeflow/pipelines/blob/master/component_sdk/python/kfp_component/google/bigquery/_query.py)\n",
"* [Component docker file](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/Dockerfile)\n",
"* [Sample notebook](https://github.com/kubeflow/pipelines/blob/master/components/gcp/bigquery/query/sample.ipynb)\n",
"* [BigQuery query REST API](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query)\n",
"\n",
"## License\n",
"By deploying or using this software you agree to comply with the [AI Hub Terms of Service](https://aihub.cloud.google.com/u/0/aihub-tos) and the [Google APIs Terms of Service](https://developers.google.com/terms/). To the extent of a direct conflict of terms, the AI Hub Terms of Service will control."
]
}
],
"metadata": {
@@ -259,7 +296,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.4"
}
},
"nbformat": 4,