Create README.md #909

Merged: 1 commit, Mar 5, 2019
53 changes: 53 additions & 0 deletions samples/tfx-oss/README.md
@@ -0,0 +1,53 @@
# TFX Pipeline Example

This sample walks through running the [TFX](https://github.com/tensorflow/tfx) [Taxi Application Example](https://github.com/tensorflow/tfx/tree/master/examples/chicago_taxi_pipeline) on a Kubeflow Pipelines cluster.

## Overview

This pipeline demonstrates TFX capabilities at scale. It reads a public BigQuery dataset and uses GCP services to preprocess the data (Dataflow) and train the model (Cloud ML Engine). The model is then deployed to the Cloud ML Engine Prediction service.


## Setup

Create a local Python 3.5 conda environment
```
conda create -n tfx-kfp pip python=3.5.3
```
Then activate the environment.
Review comment (Contributor): Maybe we can add how to activate:

    conda activate tfx-kfp



Install TFX and the Kubeflow Pipelines SDK
```
pip3 install https://storage.googleapis.com/ml-pipeline/tfx/tfx-0.12.0rc0-py2.py3-none-any.whl
pip3 install https://storage.googleapis.com/ml-pipeline/release/0.1.10/kfp.tar.gz --upgrade
```

Review comment (Contributor): JFYI: the latest version is 0.1.11, and 0.1.12 will have improved experiment creation.

Clone the TFX GitHub repo
```
git clone https://github.com/tensorflow/tfx
```

Upload the utility code to your storage bucket. You can modify this code if needed for a different dataset.
```
gsutil cp tfx/examples/chicago_taxi_pipeline/taxi_utils.py gs://my-bucket/
```

## Configure the TFX Pipeline

Modify the pipeline configuration file at
```
tfx/examples/chicago_taxi_pipeline/taxi_pipeline_kubeflow_large.py
```
Configure
- GCS storage bucket name (replace "my-bucket")
- GCP project ID (replace "my-gcp-project")
- Make sure the path to the taxi_utils.py is correct
- Set the limit on the BigQuery query. The original dataset has 100M rows, which can take time to process. Set it to 20000 to run a sample test.
Review comment (Contributor): Right now, I changed it to use RAND() < 0.01. So we can say:

"Change the sampling rate, or alternately, replace it with a LIMIT clause to process a smaller dataset. We recommend using at least 20000 rows in your sample."

or something like this.
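
To make the two options from this discussion concrete (a sampling rate versus a LIMIT clause), here is a minimal sketch of how such a query string could be assembled. `build_sample_query` is a hypothetical helper and not part of TFX or of `taxi_pipeline_kubeflow_large.py`. The table name is assumed to be the public Chicago taxi trips table, and `SELECT *` is a simplification; the real pipeline file may select specific columns.

```python
# Hypothetical helper for illustration only; the actual query lives in
# taxi_pipeline_kubeflow_large.py and may look different.
# Assumed table: the public Chicago taxi trips dataset in BigQuery.
TAXI_TABLE = "bigquery-public-data.chicago_taxi_trips.taxi_trips"


def build_sample_query(sample_rate=None, limit=None, table=TAXI_TABLE):
    """Return a standard-SQL query restricted by a sampling rate and/or a row limit."""
    if sample_rate is not None and not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in the interval (0, 1]")
    if limit is not None and limit <= 0:
        raise ValueError("limit must be a positive integer")

    query = "SELECT * FROM `{}`".format(table)
    if sample_rate is not None:
        # Keep each row with probability sample_rate.
        query += " WHERE RAND() < {}".format(sample_rate)
    if limit is not None:
        # Cap the number of rows returned.
        query += " LIMIT {}".format(limit)
    return query


# Roughly 1% of the rows, or a fixed cap of 20000 rows.
print(build_sample_query(sample_rate=0.01))
print(build_sample_query(limit=20000))
```

Sampling with `RAND()` returns a different subset on each run, while `LIMIT` only caps the row count. Which one fits better depends on whether a representative sample or a fixed row count matters more.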



## Compile and run the pipeline
```
python tfx/examples/chicago_taxi_pipeline/taxi_pipeline_kubeflow_large.py
```
This generates a file named chicago_taxi_pipeline_kubeflow_large.tar.gz.
Upload this file to the Pipelines cluster and create a run.
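
Before uploading, it may be worth checking that the compile step produced a readable archive. The sketch below lists the files inside the package using only the Python standard library. `list_package_members` is a hypothetical helper, and the expectation that the archive contains a pipeline definition file is an assumption about the compiler's output, not something this README states.

```python
import os
import tarfile


def list_package_members(package_path):
    """Return the names of the files stored in a .tar.gz pipeline package."""
    with tarfile.open(package_path, "r:gz") as archive:
        return archive.getnames()


# File name taken from the compile step above; it is only inspected if present.
package = "chicago_taxi_pipeline_kubeflow_large.tar.gz"
if os.path.exists(package):
    print(list_package_members(package))
```

An empty list, or a `tarfile.ReadError`, would suggest the compile step did not finish cleanly and should be rerun before uploading.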