Improve multi_images.py- use core image and configure sandbox.config (flyteorg#750)

* Configure sandbox.config

  Removed Dockerfile.prediction and sandbox.config present in `containerization` folder.
  Specify ``core`` image in container_image

  Signed-off-by: SmritiSatyanV <smriti@union.ai>

* Added recommended way of specifying docker image

  Signed-off-by: SmritiSatyanV <smriti@union.ai>

* Changed default to fqn

  Changed default to fqn in format of 'container_image'
  Updated sandbox.config

  Signed-off-by: SmritiSatyanV <smriti@union.ai>

* Updated sandbox.config

  Signed-off-by: SmritiSatyanV <smriti@union.ai>
eapolinario · Jun 8, 2022 · 2defe95 · 2defe95
1 parent 8941267
commit 2defe95
Showing 6 changed files with 37 additions and 91 deletions.
diff --git a/cookbook/core/containerization/Dockerfile.prediction b/cookbook/core/containerization/Dockerfile.prediction
diff --git a/cookbook/core/containerization/multi_images.py b/cookbook/core/containerization/multi_images.py
@@ -5,45 +5,17 @@
 ----------------------------------------------
 
 When working locally, it is recommended to install all requirements of your project locally (maybe in a single virtual environment). It gets complicated when you want to deploy your code to a remote
-environment. This is because most tasks in Flyte (function tasks) are deployed using a Docker Container. 
+environment since most tasks in Flyte (function tasks) are deployed using a Docker Container. 
 
-A Docker container allows you to create an expected environment for your tasks. It is also possible to build a single container image with all your dependencies, but sometimes this is complicated and impractical.
-
-Here are the reasons why it is complicated and not recommended:
-
-#. All the dependencies in one container increase the size of the container image.
-#. Some task executions like Spark, SageMaker-based Training, and deep learning use GPUs that need specific runtime configurations. For example,
-   
-   - Spark needs JavaVirtualMachine installation and Spark entrypoints to be set
-   - NVIDIA drivers and corresponding libraries need to be installed to use GPUs for deep learning. However, these are not required for a CPU
-   - SageMaker expects the ENTRYPOINT to be designed to accept its parameters
-
-#. Building a single image may increase the build time for the image itself.
-
-.. note::
-
-   Flyte (Service) by default does not require a workflow to be bound to a single container image. Flytekit offers a simple interface to easily alter the images that should be associated with every task, yet keeping the local execution simple for the user.
-
-For every :py:class:`flytekit.PythonFunctionTask` type task or simply a task that is decorated with the ``@task`` decorator, users can supply rules of how the container image should be bound. By default, flytekit binds one container image, i.e., the ``default`` image to all tasks.
+For every :py:class:`flytekit.PythonFunctionTask` type task or simply a task decorated with the ``@task`` decorator, users can supply rules of how the container image should be bound. By default, flytekit binds one container image, i.e., the ``default`` image to all tasks.
 To alter the image, use the ``container_image`` parameter available in the :py:func:`flytekit.task` decorator. Any one of the following is acceptable:
 
-#. Image reference is specified, but the version is derived from the default image version ``container_image="docker.io/redis:{{.image.default.version}},``
-#. Both the FQN and the version are derived from the default image ``container_image="{{.image.default.fqn}}:spark-{{.image.default.version}},``
-
-The images themselves are parameterizable in the config in the following format:
- ``{{.image.<name>.<attribute>}}``
-
-- ``name`` refers to the name of the image in the image configuration. The name ``default`` is a reserved keyword and will automatically apply to the default image name for this repository.
-- ``fqn`` refers to the fully qualified name of the image. For example, it includes the repository and domain url of the image. Example: ``docker.io/my_repo/xyz``.
-- ``version`` refers to the tag of the image. For example: `latest`, or `python-3.8` etc. If the `container_image` is not specified then the default configured image for the project is used.
-
-.. note::
-
-    The default image (name + version) is always ``{{.image.default.fqn}}:{{.image.default.version}}``
+#. Image reference is specified, but the version is derived from the default image version: ``container_image="docker.io/redis:{{.image.default.version}}"``
+#. Both the FQN and the version are derived from the default image: ``container_image="{{.image.default.fqn}}:spark-{{.image.default.version}}"``
 
 .. warning::
 
-   To be able to use the image, push a container image that matches the new name described.
+   To use the image, push a container image that matches the new name described.
 
 If you wish to build and push your Docker image to GHCR, follow `this <https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry>`_.
 If you wish to build and push your Docker image to Dockerhub through your account, follow the below steps:
@@ -59,7 +31,7 @@
 .. code-block::
 
    docker login
-4. It will prompt you to enter the username and the password.
+4. It prompts you to enter the username and the password.
 5. Push the Docker image to Dockerhub:
     
 .. code-block::
@@ -76,9 +48,9 @@
 
 .. tip::
    
-   Sometimes, ``docker login`` may not be successful. In such a case, execute ``docker logout`` and ``docker login``.
+   Sometimes, ``docker login`` may not be successful. In such cases, execute ``docker logout`` and ``docker login``.
 
-Let us understand how multiple images can be used within a single workflow.
+Let's dive into the example.
 """
 # %%
 # Import the necessary dependencies.
@@ -102,7 +74,7 @@
 # %%
 # Define a task that fetches data and splits the data into train and test sets.
 @task(
-    container_image="ghcr.io/flyteorg/flytecookbook:core-with-sklearn-baa17ccf39aa667c5950bd713a4366ce7d5fccaf7f85e6be8c07fe4b522f92c3"
+    container_image="{{.image.trainer.fqn}}:{{.image.trainer.version}}"
 )
 def svm_trainer() -> split_data:
     fish_data = pd.read_csv(dataset_url)
@@ -122,15 +94,25 @@ def svm_trainer() -> split_data:
 
 # %%
 # .. note::
+#
 #     To use your own Docker image, replace the value of `container_image` with the fully qualified name that identifies where the image has been pushed. 
-#     One pattern has been specified in the task itself, i.e., specifying the Docker image URI. The recommended usage is:
+#     The recommended usage (specified in the example) is:
+#
+#     ``container_image= "{{.image.default.fqn}}:{{.image.default.version}}"``
+#
+#     #. ``image`` refers to the name of the image in the image configuration. The name ``default`` is a reserved keyword and will automatically apply to the default image name for this repository.
+#     #. ``fqn`` refers to the fully qualified name of the image. For example, it includes the repository and domain URL of the image. Example: ``docker.io/my_repo/xyz``.
+#     #. ``version`` refers to the tag of the image, for example ``latest`` or ``python-3.8``. If ``container_image`` is not specified, the default configured image for the project is used.
+#
+#     The images themselves are parameterizable in the config file in the following format:
 #
-#     ``container_image="{{.image.default.fqn}}:multi-images-preprocess-{{.image.default.version}}"``
+#     ``{{.image.<name>.<attribute>}}``
+
 
 # %%
 # Define another task that trains the model on the data and computes the accuracy score.
 @task(
-    container_image="ghcr.io/flyteorg/flytecookbook:multi-image-predict-98b125fd57d20594026941c2ebe7ef662e5acb7d6423660a65f493ca2d9aa267"
+    container_image="{{.image.predictor.fqn}}:{{.image.predictor.version}}"
 )
 def svm_predictor(
     X_train: pd.DataFrame,
@@ -144,7 +126,6 @@ def svm_predictor(
     accuracy_score = float(model.score(X_test, y_test.values.ravel()))
     return accuracy_score
 
-
 # %%
 # Define a workflow.
 @workflow
@@ -158,10 +139,15 @@ def my_workflow() -> float:
     )
     return svm_accuracy
 
-
 if __name__ == "__main__":
-    print(f"Running my_workflow(), accuracy : { my_workflow() }")
+    print(f"Running my_workflow(), accuracy: {my_workflow()}")
 
 # %%
-# .. note::
-#     Notice that the two task annotators have two different `container_image`s specified.
+# Configuring sandbox.config
+# ==========================
+#
+# The container image referenced in the tasks above is specified in the sandbox.config file. Provide a name for every Docker image, and reference it in ``container_image``. In this example, we have used the ``core`` image for both tasks for illustration purposes.
+#
+# sandbox.config
+# ^^^^^^^^^^^^^^
+# .. literalinclude::  ../../../../core/sandbox.config
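The two `container_image` forms listed in the docstring above can be illustrated with a small, standalone sketch of placeholder expansion. This is not flytekit's implementation: the `expand` helper, the regular expression, and the `IMAGES` table are hypothetical, and the default image values (`docker.io/my_repo/xyz`, `latest`) are borrowed from the examples in the note purely for illustration.

```python
import re

# Hypothetical image table; in a real project these values come from the
# Flyte configuration, not from a hard-coded dictionary.
IMAGES = {
    "default": {"fqn": "docker.io/my_repo/xyz", "version": "latest"},
}

# Matches {{.image.<name>.<attribute>}}, tolerating spaces inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*\.image\.(\w+)\.(\w+)\s*\}\}")


def expand(template: str, images: dict) -> str:
    """Replace every {{.image.<name>.<attribute>}} placeholder in template."""

    def _substitute(match: re.Match) -> str:
        name, attribute = match.group(1), match.group(2)
        return images[name][attribute]

    return _PLACEHOLDER.sub(_substitute, template)


# Form 1: fixed image reference, version taken from the default image.
redis_image = expand("docker.io/redis:{{.image.default.version}}", IMAGES)

# Form 2: both FQN and version taken from the default image.
spark_image = expand(
    "{{.image.default.fqn}}:spark-{{.image.default.version}}", IMAGES
)

print(redis_image)
print(spark_image)
```

Under these assumed values, the first form keeps the `docker.io/redis` repository and only borrows the tag, while the second form reuses the default repository and prefixes the tag with `spark-`.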
diff --git a/cookbook/core/containerization/sandbox.config b/cookbook/core/containerization/sandbox.config
diff --git a/cookbook/core/requirements.txt b/cookbook/core/requirements.txt
@@ -202,4 +202,4 @@ wrapt==1.14.0
     #   deprecated
     #   flytekit
 zipp==3.8.0
-    # via importlib-metadata
+    # via importlib-metadata
diff --git a/cookbook/core/sandbox.config b/cookbook/core/sandbox.config
@@ -1,2 +1,6 @@
 [sdk]
 workflow_packages=core
+
+[images]
+trainer = ghcr.io/flyteorg/flytecookbook:core-latest
+predictor = ghcr.io/flyteorg/flytecookbook:core-latest
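As a rough sketch of how the `[images]` entries above relate to the `fqn` and `version` attributes used in `container_image`, the snippet below parses the same config text with Python's standard `configparser`. The `load_images` helper and the split on the last colon are assumptions made for illustration; flytekit's own parsing may differ, and a reference with a registry port but no tag would not split correctly this way.

```python
import configparser

# Contents of cookbook/core/sandbox.config after this commit.
SANDBOX_CONFIG = """
[sdk]
workflow_packages=core

[images]
trainer = ghcr.io/flyteorg/flytecookbook:core-latest
predictor = ghcr.io/flyteorg/flytecookbook:core-latest
"""


def load_images(config_text: str) -> dict:
    """Map each image name to its fqn and version (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    images = {}
    for name, reference in parser["images"].items():
        # Assumption: everything after the last colon is the tag.
        fqn, _, version = reference.rpartition(":")
        images[name] = {"fqn": fqn, "version": version}
    return images


images = load_images(SANDBOX_CONFIG)
for name, parts in images.items():
    print(name, parts["fqn"], parts["version"])
```

Both names point at the same `core-latest` image here, which matches the commit's note that the ``core`` image is used for both tasks for illustration purposes.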
diff --git a/...ook/integrations/flytekit_plugins/pandera_examples/validating_and_testing_ml_pipelines.py b/...ook/integrations/flytekit_plugins/pandera_examples/validating_and_testing_ml_pipelines.py
@@ -99,7 +99,7 @@
 #      - the predicted attribute
 #
 # In practice, we'd want to do a little data exploration first to get a sense of the distribution of variables.
-# A useful resource for this is the `Kaggle <https://www.kaggle.com/datasets/cherngs/heart-disease-cleveland-uci>`__ version of this dataset,
+# A useful resource for this is the `Kaggle <https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset>`__ version of this dataset,
 # which has been slightly preprocessed to be model-ready.
 #
 # .. Note::