Create README.md #909

# TFX Pipeline Example

This sample walks through running the [TFX](https://github.com/tensorflow/tfx) [Taxi Application Example](https://github.com/tensorflow/tfx/tree/master/examples/chicago_taxi_pipeline) on a Kubeflow Pipelines cluster.

## Overview

This pipeline demonstrates TFX capabilities at scale. It uses a public BigQuery dataset and GCP services to preprocess the data (Dataflow) and train the model (Cloud ML Engine). The model is then deployed to the Cloud ML Engine Prediction service.

## Setup

Create a local Python 3.5 conda environment:
```
conda create -n tfx-kfp pip python=3.5.3
```
then activate the environment (for example, `conda activate tfx-kfp`; on older conda versions, `source activate tfx-kfp`).

Install TFX and the Kubeflow Pipelines SDK:
```
pip3 install https://storage.googleapis.com/ml-pipeline/tfx/tfx-0.12.0rc0-py2.py3-none-any.whl
pip3 install https://storage.googleapis.com/ml-pipeline/release/0.1.10/kfp.tar.gz --upgrade
```

> Reviewer comment: JFYI, the latest version is 0.1.11, and 0.1.12 will have improved experiment creation.

Clone the TFX GitHub repo:
```
git clone https://github.com/tensorflow/tfx
```

Upload the utility code to your storage bucket. You can modify this code if needed for a different dataset.
```
gsutil cp tfx/examples/chicago_taxi_pipeline/taxi_utils.py gs://my-bucket/
```

## Configure the TFX Pipeline

Modify the pipeline configuration file at
```
tfx/examples/chicago_taxi_pipeline/taxi_pipeline_kubeflow_large.py
```
Configure:
- the GCS storage bucket name (replace "my-bucket")
- the GCP project ID (replace "my-gcp-project")
- the path to taxi_utils.py (make sure it is correct)
- the limit on the BigQuery query: the original dataset has 100M rows, which can take a long time to process, so set it to 20000 to run a sample test (see the sketch below)

> Reviewer comment: Right now, I changed it to use RAND() < 0.01. So we can say: "Change the sampling rate, or alternatively, replace it with a LIMIT clause to process a smaller dataset. We recommend using at least 20000 rows in your sample." or something like this.
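
A minimal sketch of what those edits might look like; the variable names below are illustrative placeholders, not the actual identifiers in taxi_pipeline_kubeflow_large.py:

```
# Illustrative configuration values only; the real variable names in
# taxi_pipeline_kubeflow_large.py may differ, so adapt them to the actual file.
_gcp_project_id = 'my-gcp-project'                   # your GCP project ID
_output_bucket = 'gs://my-bucket'                    # your GCS bucket
_taxi_module_file = 'gs://my-bucket/taxi_utils.py'   # utility code uploaded above

# Keep the sample small: either sample the BigQuery rows, e.g.
#   ... WHERE RAND() < 0.01
# or append a LIMIT clause, e.g.
#   ... LIMIT 20000
```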

## Compile and run the pipeline
```
python tfx/examples/chicago_taxi_pipeline/taxi_pipeline_kubeflow_large.py
```
This will generate a file named chicago_taxi_pipeline_kubeflow_large.tar.gz.
Upload this file to the Pipelines cluster and create a run.
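
As an alternative to uploading through the UI, you could submit the compiled package with the Kubeflow Pipelines SDK. A rough sketch, assuming a placeholder endpoint URL and experiment name (the exact client API may vary between SDK versions):

```
import kfp

# Endpoint of your Kubeflow Pipelines deployment (placeholder value).
client = kfp.Client(host='http://<your-pipelines-endpoint>')

# Create an experiment and start a run from the compiled pipeline package.
experiment = client.create_experiment('tfx-taxi')
run = client.run_pipeline(
    experiment_id=experiment.id,
    job_name='chicago-taxi-run',
    pipeline_package_path='chicago_taxi_pipeline_kubeflow_large.tar.gz')
```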

> Reviewer comment: Maybe we can add how to activate: