Samples - Improved the TFX OSS notebook and README #922

Merged
16 changes: 12 additions & 4 deletions samples/tfx-oss/README.md
@@ -9,6 +9,8 @@ This pipeline demonstrates the TFX capabilities at scale. The pipeline uses a pub

## Setup

Enable DataFlow API for your GKE cluster: <https://console.developers.google.com/apis/api/dataflow.googleapis.com/overview>

Create a local Python 3.5 conda environment
```
conda create -n tfx-kfp pip python=3.5.3
```

@@ -29,7 +31,13 @@ git clone https://github.com/tensorflow/tfx

Upload the utility code to your storage bucket. You can modify this code if needed for a different dataset.
```
-gsutil cp tfx/examples/chicago_taxi_pipeline/taxi_utils.py gs://my-bucket/
+gsutil cp tfx/examples/chicago_taxi_pipeline/taxi_utils.py gs://my-bucket/<path>/
```

If gsutil does not work, try `tensorflow.gfile`:
```
from tensorflow import gfile
gfile.Copy('tfx/examples/chicago_taxi_pipeline/taxi_utils.py', 'gs://<my bucket>/<path>/taxi_utils.py')
```
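Either way, it is easy to mangle the destination URI when substituting your own bucket and path. A small helper for assembling a well-formed `gs://` URI before copying can help; this is an illustrative sketch (the `gcs_uri` helper is hypothetical, not part of TFX or gsutil):

```python
def gcs_uri(bucket, *parts):
    """Build a gs:// URI from a bucket name and path components,
    normalizing any stray slashes in the inputs."""
    bucket = bucket.strip("/")
    path = "/".join(p.strip("/") for p in parts if p)
    return f"gs://{bucket}/{path}" if path else f"gs://{bucket}"

# Destination for the trainer code, mirroring the copy commands above.
dest = gcs_uri("my-bucket", "tfx", "taxi_utils.py")
```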

## Configure the TFX Pipeline
@@ -39,9 +47,9 @@ Modify the pipeline configuration file at
tfx/examples/chicago_taxi_pipeline/taxi_pipeline_kubeflow_large.py
```
Configure
-- GCS storage bucket name (replace "my-bucket")
-- GCP project ID (replace "my-gcp-project")
-- Make sure the path to the taxi_utils.py is correct
+- Set `_input_bucket` to the GCS directory where you've copied taxi_utils.py, i.e. `gs://<my bucket>/<path>/`
+- Set `_output_bucket` to the GCS directory where you want the results to be written
+- Set the GCP project ID (replace my-gcp-project). Note that it should be the project ID, not the project name.
+- The original BigQuery dataset has 100M rows, which can take time to process. Modify the selection criteria (% of records) to run a sample test.
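One common way to express such a "% of records" selection criterion in BigQuery Standard SQL is a deterministic fingerprint filter. A hedged sketch of building that predicate (the column name and the 1% rate are illustrative assumptions, not taken from the sample pipeline):

```python
def sample_clause(column, percent):
    """Return a BigQuery Standard SQL predicate that deterministically
    selects roughly `percent`% of rows, by bucketing a fingerprint of
    the given column into 100 buckets."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return (f"MOD(ABS(FARM_FINGERPRINT(CAST({column} AS STRING))), 100) "
            f"< {int(percent)}")

# Predicate for a ~1% sample keyed on a hypothetical timestamp column.
where = sample_clause("pickup_datetime", 1)
```

Because the filter hashes a column value rather than using `RAND()`, repeated runs select the same sample, which makes test runs comparable.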

## Compile and run the pipeline
32 changes: 24 additions & 8 deletions samples/tfx-oss/TFX Example.ipynb
@@ -6,6 +6,8 @@
"source": [
"# TFX on KubeFlow Pipelines Example\n",
"\n",
"This notebook should be run inside a KF Pipelines cluster.\n",
"\n",
"### Install TFX and KFP packages"
]
},
@@ -19,6 +21,14 @@
"!pip3 install https://storage.googleapis.com/ml-pipeline/release/0.1.10/kfp.tar.gz --upgrade\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Enable DataFlow API for your GKE cluster\n",
"<https://console.developers.google.com/apis/api/dataflow.googleapis.com/overview>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -41,8 +51,9 @@
"metadata": {},
"outputs": [],
"source": [
-"# copy the trainer code to a storage bucket \n",
-"!gsutil cp tfx/examples/chicago_taxi_pipeline/taxi_utils.py gs://<my bucket>/"
+"# copy the trainer code to a storage bucket, as the TFX pipeline will need that code file in GCS\n",
+"from tensorflow import gfile\n",
+"gfile.Copy('tfx/examples/chicago_taxi_pipeline/taxi_utils.py', 'gs://<my bucket>/<path>/taxi_utils.py')"
]
},
{
@@ -57,9 +68,9 @@
"```\n",
"\n",
"Configure:\n",
-"- GCS storage bucket name (replace my-bucket)\n",
-"- GCP project ID (replace my-gcp-project)\n",
-"- Make sure the path to the taxi_utils.py is correct\n",
+"- Set `_input_bucket` to the GCS directory where you've copied taxi_utils.py, i.e. gs://<my bucket>/<path>/\n",
+"- Set `_output_bucket` to the GCS directory where you want the results to be written\n",
+"- Set the GCP project ID (replace my-gcp-project). Note that it should be the project ID, not the project name.\n",
"\n",
"The dataset in BigQuery has 100M rows; you can change the query parameters in the WHERE clause to limit the number of rows used.\n"
]
@@ -89,12 +100,17 @@
"# Get or create a new experiment\n",
"import kfp\n",
"client = kfp.Client()\n",
-"experiment = client.create_experiment(\"TFX Examples\")\n",
-"pipeline_filename = \"chicago_taxi_pipeline_kubeflow.tar.gz\"\n",
+"experiment_name = 'TFX Examples'\n",
+"try:\n",
+"    experiment_id = client.get_experiment(experiment_name=experiment_name).id\n",
+"except Exception:\n",
+"    experiment_id = client.create_experiment(experiment_name).id\n",
+"\n",
+"pipeline_filename = 'chicago_taxi_pipeline_kubeflow.tar.gz'\n",
"\n",
"#Submit a pipeline run\n",
"run_name = 'Run 1'\n",
-"run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, {})\n"
+"run_result = client.run_pipeline(experiment_id, run_name, pipeline_filename, {})\n"
]
}
],
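The try/except in the final cell is a get-or-create pattern: look the experiment up by name, and create it only if the lookup fails, so reruns of the notebook reuse the same experiment instead of erroring. Sketched here outside the notebook with a stand-in client (`FakeClient` is a mock for illustration only; it is not the real `kfp.Client`):

```python
class FakeExperiment:
    def __init__(self, id):
        self.id = id

class FakeClient:
    """Stand-in for kfp.Client, just enough to show the pattern."""
    def __init__(self):
        self._experiments = {}
        self._next_id = 0

    def get_experiment(self, experiment_name):
        # Raises KeyError when the experiment does not exist yet.
        return self._experiments[experiment_name]

    def create_experiment(self, name):
        self._next_id += 1
        exp = FakeExperiment(str(self._next_id))
        self._experiments[name] = exp
        return exp

def get_or_create_experiment(client, name):
    """Return the id of the named experiment, creating it if needed."""
    try:
        return client.get_experiment(experiment_name=name).id
    except Exception:
        return client.create_experiment(name).id

client = FakeClient()
first = get_or_create_experiment(client, "TFX Examples")
second = get_or_create_experiment(client, "TFX Examples")  # reuses the same id
```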