Restructure dataproc components #542

Merged · 3 commits · Dec 18, 2018
12 changes: 6 additions & 6 deletions .cloudbuild.yaml
@@ -125,27 +125,27 @@ steps:
 # Build the Dataproc-based pipeline component images
 - name: 'gcr.io/cloud-builders/docker'
   entrypoint: '/bin/bash'
-  args: ['-c', 'cd /workspace/components/dataproc/containers/analyze && ./build.sh -p $PROJECT_ID -t $COMMIT_SHA']
+  args: ['-c', 'cd /workspace/components/dataproc/analyze && ./build_image.sh -p $PROJECT_ID -t $COMMIT_SHA']
   id: 'buildDataprocAnalyze'
 - name: 'gcr.io/cloud-builders/docker'
   entrypoint: '/bin/bash'
-  args: ['-c', 'cd /workspace/components/dataproc/containers/create_cluster && ./build.sh -p $PROJECT_ID -t $COMMIT_SHA']
+  args: ['-c', 'cd /workspace/components/dataproc/create_cluster && ./build_image.sh -p $PROJECT_ID -t $COMMIT_SHA']
   id: 'buildDataprocCreateCluster'
 - name: 'gcr.io/cloud-builders/docker'
   entrypoint: '/bin/bash'
-  args: ['-c', 'cd /workspace/components/dataproc/containers/delete_cluster && ./build.sh -p $PROJECT_ID -t $COMMIT_SHA']
+  args: ['-c', 'cd /workspace/components/dataproc/delete_cluster && ./build_image.sh -p $PROJECT_ID -t $COMMIT_SHA']
   id: 'buildDataprocDeleteCluster'
 - name: 'gcr.io/cloud-builders/docker'
   entrypoint: '/bin/bash'
-  args: ['-c', 'cd /workspace/components/dataproc/containers/predict && ./build.sh -p $PROJECT_ID -t $COMMIT_SHA']
+  args: ['-c', 'cd /workspace/components/dataproc/predict && ./build_image.sh -p $PROJECT_ID -t $COMMIT_SHA']
   id: 'buildDataprocPredict'
 - name: 'gcr.io/cloud-builders/docker'
   entrypoint: '/bin/bash'
-  args: ['-c', 'cd /workspace/components/dataproc/containers/transform && ./build.sh -p $PROJECT_ID -t $COMMIT_SHA']
+  args: ['-c', 'cd /workspace/components/dataproc/transform && ./build_image.sh -p $PROJECT_ID -t $COMMIT_SHA']
   id: 'buildDataprocTransform'
 - name: 'gcr.io/cloud-builders/docker'
   entrypoint: '/bin/bash'
-  args: ['-c', 'cd /workspace/components/dataproc/containers/train && ./build.sh -p $PROJECT_ID -t $COMMIT_SHA']
+  args: ['-c', 'cd /workspace/components/dataproc/train && ./build_image.sh -p $PROJECT_ID -t $COMMIT_SHA']
   id: 'buildDataprocTrain'
 
 # Build the ResNet-CMLE sample pipeline component images
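Note: each of these Cloud Build steps can be mirrored locally — a minimal sketch, assuming Docker and gcloud credentials for the target project are configured; the -p/-t flags come from the build args above, and the project ID and tag below are placeholders:

# Mirror the 'buildDataprocAnalyze' step from .cloudbuild.yaml locally.
cd components/dataproc/analyze
# -p: GCP project to push the image under; -t: image tag
# (Cloud Build passes $COMMIT_SHA; a local commit hash stands in here).
./build_image.sh -p my-gcp-project -t "$(git rev-parse HEAD)"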
10 changes: 5 additions & 5 deletions components/README.md
@@ -7,11 +7,11 @@ as input and may produce one or more
 
 
 **Example: XGBoost DataProc components**
-* [Set up cluster](dataproc/xgboost/create_cluster.py)
-* [Analyze](dataproc/xgboost/analyze.py)
-* [Transform](dataproc/xgboost/transform.py)
-* [Distributed train](dataproc/xgboost/train.py)
-* [Delete cluster](dataproc/xgboost/delete_cluster.py)
+* [Set up cluster](dataproc/create_cluster/src/create_cluster.py)
+* [Analyze](dataproc/analyze/src/analyze.py)
+* [Transform](dataproc/transform/src/transform.py)
+* [Distributed train](dataproc/train/src/train.py)
+* [Delete cluster](dataproc/delete_cluster/src/delete_cluster.py)
 
 Each task usually includes two parts:
 
@@ -44,7 +44,7 @@ fi
 
 # build base image
 pushd ../base
-./build.sh
+./build_image.sh
 popd
 
 docker build -t ${LOCAL_IMAGE_NAME} .
@@ -40,8 +40,7 @@ def main(argv=None):
   args = parser.parse_args()
 
   code_path = os.path.dirname(os.path.realpath(__file__))
-  dirname = os.path.basename(__file__).split('.')[0]
-  runfile_source = os.path.join(code_path, dirname, 'run.py')
+  runfile_source = os.path.join(code_path, 'analyze_run.py')
   dest_files = _utils.copy_resources_to_gcs([runfile_source], args.output)
   try:
     api = _utils.get_client()
@@ -14,11 +14,17 @@
 # limitations under the License.
 
 
-mkdir -p ./build
-rsync -arvp "../../xgboost"/ ./build/
+mkdir -p ./build/common
+rsync -arvp "../analyze/src"/ ./build/
+rsync -arvp "../train/src"/ ./build/
+rsync -arvp "../predict/src"/ ./build/
+rsync -arvp "../create_cluster/src"/ ./build/
+rsync -arvp "../delete_cluster/src"/ ./build/
+rsync -arvp "../transform/src"/ ./build/
+rsync -arvp "../common"/ ./build/common/
 
-cp ../../../license.sh ./build
-cp ../../../third_party_licenses.csv ./build
+cp ../../license.sh ./build
+cp ../../third_party_licenses.csv ./build
 
 docker build -t ml-pipeline-dataproc-base .
 rm -rf ./build
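This consolidation is also why the per-component run files now carry unique names (analyze_run.py, transform_run.py, ...): every src/ directory is rsync'ed into the single build/ context, so two files both named run.py would clobber each other. A minimal sketch of the collision the renames avoid (the ls output is illustrative):

# With the old generic names, the second rsync silently overwrites the first:
#   analyze/src/run.py   -> build/run.py
#   transform/src/run.py -> build/run.py   (replaces analyze's copy)
# With unique names, every component's run file survives in the shared context:
rsync -arvp ../analyze/src/ ./build/
rsync -arvp ../transform/src/ ./build/
ls ./build    # analyze_run.py  transform_run.py  ...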
@@ -44,7 +44,7 @@ fi
 
 # build base image
 pushd ../base
-./build.sh
+./build_image.sh
 popd
 
 docker build -t ${LOCAL_IMAGE_NAME} .
@@ -35,8 +35,7 @@ def main(argv=None):
   args = parser.parse_args()
 
   code_path = os.path.dirname(os.path.realpath(__file__))
-  dirname = os.path.basename(__file__).split('.')[0]
-  init_file_source = os.path.join(code_path, dirname, 'initialization_actions.sh')
+  init_file_source = os.path.join(code_path, 'initialization_actions.sh')
   dest_files = _utils.copy_resources_to_gcs([init_file_source], args.staging)
 
   try:
@@ -44,7 +44,7 @@ fi
 
 # build base image
 pushd ../base
-./build.sh
+./build_image.sh
 popd
 
 docker build -t ${LOCAL_IMAGE_NAME} .
@@ -44,7 +44,7 @@ fi
 
 # build base image
 pushd ../base
-./build.sh
+./build_image.sh
 popd
 
 docker build -t ${LOCAL_IMAGE_NAME} .
@@ -44,7 +44,7 @@ fi
 
 # build base image
 pushd ../base
-./build.sh
+./build_image.sh
 popd
 
 docker build -t ${LOCAL_IMAGE_NAME} .
@@ -44,7 +44,7 @@ fi
 
 # build base image
 pushd ../base
-./build.sh
+./build_image.sh
 popd
 
 docker build -t ${LOCAL_IMAGE_NAME} .
@@ -51,8 +51,7 @@ def main(argv=None):
   _utils.delete_directory_from_gcs(os.path.join(args.output, 'eval'))
 
   code_path = os.path.dirname(os.path.realpath(__file__))
-  dirname = os.path.basename(__file__).split('.')[0]
-  runfile_source = os.path.join(code_path, dirname, 'run.py')
+  runfile_source = os.path.join(code_path, 'transform_run.py')
   dest_files = _utils.copy_resources_to_gcs([runfile_source], args.output)
   try:
     api = _utils.get_client()
35 changes: 0 additions & 35 deletions components/dataproc/xgboost/setup.py

This file was deleted.

16 changes: 8 additions & 8 deletions samples/tfx/taxi-cab-classification-pipeline.py
@@ -21,7 +21,7 @@
 def dataflow_tf_data_validation_op(inference_data: 'GcsUri', validation_data: 'GcsUri', column_names: 'GcsUri[text/json]', key_columns, project: 'GcpProject', mode, validation_output: 'GcsUri[Directory]', step_name='validation'):
     return dsl.ContainerOp(
         name = step_name,
-        image = 'gcr.io/ml-pipeline/ml-pipeline-dataflow-tfdv:dev', #TODO-release: update the release tag for the next release
+        image = 'gcr.io/ml-pipeline/ml-pipeline-dataflow-tfdv:0.1.4', #TODO-release: update the release tag for the next release
         arguments = [
             '--csv-data-for-inference', inference_data,
             '--csv-data-to-validate', validation_data,
@@ -40,7 +40,7 @@ def dataflow_tf_data_validation_op(inference_data: 'GcsUri', validation_data: 'G
 def dataflow_tf_transform_op(train_data: 'GcsUri', evaluation_data: 'GcsUri', schema: 'GcsUri[text/json]', project: 'GcpProject', preprocess_mode, preprocess_module: 'GcsUri[text/code/python]', transform_output: 'GcsUri[Directory]', step_name='preprocess'):
     return dsl.ContainerOp(
         name = step_name,
-        image = 'gcr.io/ml-pipeline/ml-pipeline-dataflow-tft:0.1.3-rc.2', #TODO-release: update the release tag for the next release
+        image = 'gcr.io/ml-pipeline/ml-pipeline-dataflow-tft:0.1.4', #TODO-release: update the release tag for the next release
         arguments = [
             '--train', train_data,
             '--eval', evaluation_data,
@@ -57,7 +57,7 @@ def dataflow_tf_transform_op(train_data: 'GcsUri', evaluation_data: 'GcsUri', sc
 def tf_train_op(transformed_data_dir, schema: 'GcsUri[text/json]', learning_rate: float, hidden_layer_size: int, steps: int, target: str, preprocess_module: 'GcsUri[text/code/python]', training_output: 'GcsUri[Directory]', step_name='training'):
     return dsl.ContainerOp(
         name = step_name,
-        image = 'gcr.io/ml-pipeline/ml-pipeline-kubeflow-tf-trainer:0.1.3-rc.2', #TODO-release: update the release tag for the next release
+        image = 'gcr.io/ml-pipeline/ml-pipeline-kubeflow-tf-trainer:0.1.4', #TODO-release: update the release tag for the next release
         arguments = [
             '--transformed-data-dir', transformed_data_dir,
             '--schema', schema,
@@ -74,7 +74,7 @@ def tf_train_op(transformed_data_dir, schema: 'GcsUri[text/json]', learning_rate
 def dataflow_tf_model_analyze_op(model: 'TensorFlow model', evaluation_data: 'GcsUri', schema: 'GcsUri[text/json]', project: 'GcpProject', analyze_mode, analyze_slice_column, analysis_output: 'GcsUri', step_name='analysis'):
     return dsl.ContainerOp(
         name = step_name,
-        image = 'gcr.io/ml-pipeline/ml-pipeline-dataflow-tfma:0.1.3-rc.2', #TODO-release: update the release tag for the next release
+        image = 'gcr.io/ml-pipeline/ml-pipeline-dataflow-tfma:0.1.4', #TODO-release: update the release tag for the next release
         arguments = [
             '--model', model,
             '--eval', evaluation_data,
@@ -91,7 +91,7 @@ def dataflow_tf_model_analyze_op(model: 'TensorFlow model', evaluation_data: 'Gc
 def dataflow_tf_predict_op(evaluation_data: 'GcsUri', schema: 'GcsUri[text/json]', target: str, model: 'TensorFlow model', predict_mode, project: 'GcpProject', prediction_output: 'GcsUri', step_name='prediction'):
     return dsl.ContainerOp(
         name = step_name,
-        image = 'gcr.io/ml-pipeline/ml-pipeline-dataflow-tf-predict:0.1.3-rc.2', #TODO-release: update the release tag for the next release
+        image = 'gcr.io/ml-pipeline/ml-pipeline-dataflow-tf-predict:0.1.4', #TODO-release: update the release tag for the next release
         arguments = [
             '--data', evaluation_data,
             '--schema', schema,
@@ -108,7 +108,7 @@ def dataflow_tf_predict_op(evaluation_data: 'GcsUri', schema: 'GcsUri[text/json]
 def confusion_matrix_op(predictions: 'GcsUri', output: 'GcsUri', step_name='confusion_matrix'):
     return dsl.ContainerOp(
         name=step_name,
-        image='gcr.io/ml-pipeline/ml-pipeline-local-confusion-matrix:dev', #TODO-release: update the release tag for the next release
+        image='gcr.io/ml-pipeline/ml-pipeline-local-confusion-matrix:0.1.4', #TODO-release: update the release tag for the next release
         arguments=[
             '--output', '%s/{{workflow.name}}/confusionmatrix' % output,
             '--predictions', predictions,
@@ -119,7 +119,7 @@ def confusion_matrix_op(predictions: 'GcsUri', output: 'GcsUri', step_name='conf
 def roc_op(predictions: 'GcsUri', output: 'GcsUri', step_name='roc'):
     return dsl.ContainerOp(
         name=step_name,
-        image='gcr.io/ml-pipeline/ml-pipeline-local-roc:dev', #TODO-release: update the release tag for the next release
+        image='gcr.io/ml-pipeline/ml-pipeline-local-roc:0.1.4', #TODO-release: update the release tag for the next release
         arguments=[
             '--output', '%s/{{workflow.name}}/roc' % output,
             '--predictions', predictions,
@@ -130,7 +130,7 @@ def roc_op(predictions: 'GcsUri', output: 'GcsUri', step_name='roc'):
 def kubeflow_deploy_op(model: 'TensorFlow model', tf_server_name, step_name='deploy'):
     return dsl.ContainerOp(
         name = step_name,
-        image = 'gcr.io/ml-pipeline/ml-pipeline-kubeflow-deployer:0.1.3-rc.2', #TODO-release: update the release tag for the next release
+        image = 'gcr.io/ml-pipeline/ml-pipeline-kubeflow-deployer:0.1.4', #TODO-release: update the release tag for the next release
         arguments = [
             '--model-path', model,
             '--server-name', tf_server_name
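These edits pin every component image to the 0.1.4 release instead of floating dev/-rc.2 tags, so a compiled pipeline keeps resolving the same binaries. A quick sanity check that the pinned tags resolve — a sketch, assuming local Docker with pull access to gcr.io/ml-pipeline (image names taken from the diff above):

# Pull each pinned image; stop at the first missing tag.
for img in ml-pipeline-dataflow-tfdv ml-pipeline-dataflow-tft \
           ml-pipeline-kubeflow-tf-trainer ml-pipeline-dataflow-tfma \
           ml-pipeline-dataflow-tf-predict ml-pipeline-local-confusion-matrix \
           ml-pipeline-local-roc ml-pipeline-kubeflow-deployer; do
  docker pull "gcr.io/ml-pipeline/${img}:0.1.4" || break
done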
24 changes: 12 additions & 12 deletions samples/xgboost-spark/README.md
@@ -31,24 +31,24 @@ pipeline run results. Note that each pipeline run will create a unique directory
 ## Components source
 
 Create Cluster:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/xgboost/create_cluster)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/containers/create_cluster)
+[source code](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/create_cluster/src)
+[container](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/create_cluster)
 
 Analyze (step one for preprocessing):
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/xgboost/analyze)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/containers/analyze)
+[source code](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/analyze/src)
+[container](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/analyze)
 
 Transform (step two for preprocessing):
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/xgboost/transform)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/containers/transform)
+[source code](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/transform/src)
+[container](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/transform)
 
 Distributed Training:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/xgboost/train)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/containers/train)
+[source code](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/train/src)
+[container](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/train)
 
 Distributed Predictions:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/xgboost/predict)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/containers/predict)
+[source code](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/predict/src)
+[container](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/predict)
 
 Confusion Matrix:
 [source code](https://github.com/kubeflow/pipelines/tree/master/components/local/confusion_matrix/src)
@@ -61,7 +61,7 @@ ROC:
 
 
 Delete Cluster:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/xgboost/delete_cluster)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/containers/delete_cluster)
+[source code](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/delete_cluster/src)
+[container](https://github.com/kubeflow/pipelines/tree/master/components/dataproc/delete_cluster)
 
 

12 changes: 6 additions & 6 deletions test/sample_test_v2.yaml
@@ -131,47 +131,47 @@ spec:
         - name: image-name
           value: "{{inputs.parameters.target-image-prefix}}{{inputs.parameters.dataproc-create-cluster-image-suffix}}"
         - name: build-script
-          value: components/dataproc/containers/create_cluster/build.sh
+          value: components/dataproc/create_cluster/build_image.sh
     - name: build-dataproc-delete-cluster-image
      template: build-image-by-script
      arguments:
        parameters:
        - name: image-name
          value: "{{inputs.parameters.target-image-prefix}}{{inputs.parameters.dataproc-delete-cluster-image-suffix}}"
        - name: build-script
-         value: components/dataproc/containers/delete_cluster/build.sh
+         value: components/dataproc/delete_cluster/build_image.sh
    - name: build-dataproc-analyze-image
      template: build-image-by-script
      arguments:
        parameters:
        - name: image-name
          value: "{{inputs.parameters.target-image-prefix}}{{inputs.parameters.dataproc-analyze-image-suffix}}"
        - name: build-script
-         value: components/dataproc/containers/analyze/build.sh
+         value: components/dataproc/analyze/build_image.sh
    - name: build-dataproc-transform-image
      template: build-image-by-script
      arguments:
        parameters:
        - name: image-name
          value: "{{inputs.parameters.target-image-prefix}}{{inputs.parameters.dataproc-transform-image-suffix}}"
        - name: build-script
-         value: components/dataproc/containers/transform/build.sh
+         value: components/dataproc/transform/build_image.sh
    - name: build-dataproc-train-image
      template: build-image-by-script
      arguments:
        parameters:
        - name: image-name
          value: "{{inputs.parameters.target-image-prefix}}{{inputs.parameters.dataproc-train-image-suffix}}"
        - name: build-script
-         value: components/dataproc/containers/train/build.sh
+         value: components/dataproc/train/build_image.sh
    - name: build-dataproc-predict-image
      template: build-image-by-script
      arguments:
        parameters:
        - name: image-name
          value: "{{inputs.parameters.target-image-prefix}}{{inputs.parameters.dataproc-predict-image-suffix}}"
        - name: build-script
-         value: components/dataproc/containers/predict/build.sh
+         value: components/dataproc/predict/build_image.sh
    - name: build-kubeflow-dnntrainer-image
      template: build-image-by-script
      arguments: