Merge pull request #168 from zenml-io/alex/ENG-61-pathutils-refactor
Merge path_utils into fileio & refactor what was left
alex-zenml authored Nov 11, 2021
2 parents 6e845ad + e6e81ee commit fff4986
Showing 60 changed files with 1,333 additions and 795 deletions.
11 changes: 11 additions & 0 deletions README.md
@@ -36,6 +36,13 @@

## What is ZenML?


Before: Sam struggles to productionalize ML | After: Sam finds Zen in her MLOps with ZenML.
:-------------------------:|:-------------------------:
![](docs/readme/sam_frustrated.jpg) | ![](docs/readme/sam_zen_mode.jpg)



**ZenML** is an extensible, open-source MLOps framework to create production-ready machine learning pipelines. It has a simple, flexible syntax,
is cloud and tool agnostic, and has interfaces/abstractions that are catered towards ML workflows.

@@ -104,6 +111,10 @@ ZenML is managed by a [core team](https://zenml.io/team) of developers that are

We would love to receive your contributions! Check our [Contributing Guide](CONTRIBUTING.md) for more details on how best to contribute.

<br>

![Repobeats analytics image](https://repobeats.axiom.co/api/embed/635c57b743efe649cadceba6a2e6a956663f96dd.svg "Repobeats analytics image")

## Copyright

ZenML is distributed under the terms of the Apache License Version 2.0. A complete version of the license is available in the [LICENSE.md](LICENSE.md) in this repository.
26 changes: 26 additions & 0 deletions RELEASE_NOTES.md
@@ -1,3 +1,29 @@
# 0.5.2

0.5.2 brings an improved post-execution workflow and lots of minor changes and upgrades to the developer experience when creating pipelines. It also improves the Airflow orchestrator logic to accommodate more real-world scenarios. Check out the [low-level API guide](https://docs.zenml.io/guides/low-level-api) for more details!

## What's Changed
* Fix autocomplete for step and pipeline decorated functions by @schustmi in https://github.com/zenml-io/zenml/pull/144
* Add reference docs for CLI example functionality by @alex-zenml in https://github.com/zenml-io/zenml/pull/145
* Fix mypy integration by @schustmi in https://github.com/zenml-io/zenml/pull/147
* Improve Post-Execution Workflow by @schustmi in https://github.com/zenml-io/zenml/pull/146
* Fix CLI examples bug by @alex-zenml in https://github.com/zenml-io/zenml/pull/148
* Update quickstart example notebook by @alex-zenml in https://github.com/zenml-io/zenml/pull/150
* Add documentation images by @alex-zenml in https://github.com/zenml-io/zenml/pull/151
* Add prettierignore to gitignore by @alex-zenml in https://github.com/zenml-io/zenml/pull/154
* Airflow orchestrator improvements by @schustmi in https://github.com/zenml-io/zenml/pull/153
* Google colab added by @htahir1 in https://github.com/zenml-io/zenml/pull/155
* Tests for `core` and `cli` modules by @alex-zenml in https://github.com/zenml-io/zenml/pull/149
* Add Paperspace environment check by @alex-zenml in https://github.com/zenml-io/zenml/pull/156
* Step caching by @schustmi in https://github.com/zenml-io/zenml/pull/157
* Add documentation for pipeline step parameter and run name configuration by @schustmi in https://github.com/zenml-io/zenml/pull/158
* Automatically disable caching if the step function code has changed by @schustmi in https://github.com/zenml-io/zenml/pull/159


**Full Changelog**: https://github.com/zenml-io/zenml/compare/0.5.1...0.5.2

# 0.5.1
0.5.1 builds on top of the 0.5.0 release with quick bug fixes.

Binary file added docs/book/.gitbook/assets/localstack (1).png
Binary file added docs/book/.gitbook/assets/localstack.png
Binary file added docs/book/.gitbook/assets/sam_frustrated.jpg
Binary file added docs/book/.gitbook/assets/sam_zen_mode (1).jpg
Binary file added docs/book/.gitbook/assets/sam_zen_mode (2).jpg
Binary file added docs/book/.gitbook/assets/sam_zen_mode (3).jpg
Binary file added docs/book/.gitbook/assets/sam_zen_mode.jpg
68 changes: 34 additions & 34 deletions docs/book/guides/low-level-api/chapter-7.md
@@ -2,29 +2,31 @@
description: Deploy pipelines to production
---

-If you want to see the code for this chapter of the guide, head over to the [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/low_level_guide/chapter_7.py).
+# Chapter 7

-# Deploy pipelines to production
+If you want to see the code for this chapter of the guide, head over to the [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/low\_level\_guide/chapter\_7.py).

+## Deploy pipelines to production

When developing ML models, your pipelines will at first most probably live on your machine with a local [Stack](../../core-concepts.md). However, at a certain point, when you are finished with their design, you might want to transition to a more production-ready setting and deploy the pipeline to a more robust environment.

-## Install and configure Airflow
+### Install and configure Airflow

-This part is optional, and it would depend on your pre-existing production setting. For example, for this guide, Airflow will be set up from scratch and set it to work locally, however you might want to use a managed Airflow instance like [Cloud Composer](https://cloud.google.com/composer) or [Astronomer](https://astronomer.io/).
+This part is optional and depends on your pre-existing production setting. For this guide, Airflow will be set up from scratch and run locally; however, you might want to use a managed Airflow instance like [Cloud Composer](https://cloud.google.com/composer) or [Astronomer](https://astronomer.io).

For this guide, you'll want to install Airflow before continuing:

```shell
pip install apache_airflow
```

-## Creating an Airflow Stack
+### Creating an Airflow Stack

A [Stack](../../core-concepts.md) is the configuration of the surrounding infrastructure where ZenML pipelines are run and managed. For now, a `Stack` consists of:

-- A metadata store: To store metadata like parameters and artifact URIs
-- An artifact store: To store interim data step output.
-- An orchestrator: A service that actually kicks off and runs each step of the pipeline.
+* A metadata store: To store metadata like parameters and artifact URIs.
+* An artifact store: To store interim step outputs.
+* An orchestrator: A service that actually kicks off and runs each step of the pipeline.

When you did `zenml init` at the start of this guide, a default `local_stack` was created with local versions of all of these. In order to see the stack, you can check it out on the command line:
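(The listing command itself is collapsed in this diff view; the following sketch is an assumption based on the `STACKS:` output shown below.)

```shell
# Presumed command behind the output below (not shown in this diff):
zenml stack list
```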

@@ -38,10 +40,10 @@ Output:
STACKS:
key stack_type metadata_store_name artifact_store_name orchestrator_name
----------- ------------ --------------------- --------------------- -------------------
local_stack base local_metadata_store local_artifact_store local_orchestrator
```

-![Your local stack when you start](assets/localstack.png)
+![Your local stack when you start](../../.gitbook/assets/localstack.png)

Let's stick with the `local_metadata_store` and a `local_artifact_store` for now and create an Airflow orchestrator and corresponding stack.
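(The registration commands are collapsed in this diff view. Judging from the `successfully registered!` output below, they presumably resembled the following sketch; the subcommand names and flags are assumptions based on the 0.5.x-era CLI, not taken from this diff.)

```shell
# Register an Airflow orchestrator and a stack that uses it, then
# activate that stack (syntax is an assumption; see `zenml --help`).
zenml orchestrator register airflow_orchestrator airflow
zenml stack register airflow_stack \
    -m local_metadata_store \
    -a local_artifact_store \
    -o airflow_orchestrator
zenml stack set airflow_stack
```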

@@ -62,32 +64,30 @@ Stack `airflow_stack` successfully registered!
Active stack: airflow_stack
```

-![Your stack with Airflow as orchestrator](assets/localstack-with-airflow-orchestrator.png)
+![Your stack with Airflow as orchestrator](../../.gitbook/assets/localstack-with-airflow-orchestrator.png)

{% hint style="warning" %}
In the real world, we would also switch to something like a MySQL-based metadata store and an Azure/GCP/S3-based artifact store. We have skipped that part to keep everything on one machine and make this guide a bit easier to run.
{% endhint %}

-## Starting up Airflow
+### Starting up Airflow

ZenML takes care of configuring Airflow; all we need to do is run:

```bash
zenml orchestrator up
```

-This will bootstrap Airflow, start up all the necessary components and run them in the background.
-When the setup is finished, it will print username and password for the Airflow webserver to the console.
+This will bootstrap Airflow, start up all the necessary components and run them in the background. When the setup is finished, it will print the username and password for the Airflow webserver to the console.

{% hint style="warning" %}
-If you can't find the password on the console, you can navigate to the `APP_DIR / airflow / airflow_root / STACK_UUID / standalone_admin_password.txt` file.
-The username will always be `admin`.
+If you can't find the password on the console, you can navigate to the `APP_DIR / airflow / airflow_root / STACK_UUID / standalone_admin_password.txt` file. The username will always be `admin`.

-- APP_DIR will depend on your os. See which path corresponds to your OS [here](https://click.palletsprojects.com/en/8.0.x/api/#click.get_app_dir).
-- STACK_UUID will be the unique id of the airflow_stack. There will be only one folder here so you can just navigate to the one that is present.
-{% endhint %}
+* APP\_DIR will depend on your OS. See which path corresponds to your OS [here](https://click.palletsprojects.com/en/8.0.x/api/#click.get\_app\_dir).
+* STACK\_UUID will be the unique ID of the airflow\_stack. There will be only one folder here, so you can just navigate to the one that is present.
+{% endhint %}

-## Run
+### Run

The code for this chapter is the same as in the last chapter, so run:

@@ -97,38 +97,38 @@ python chapter_7.py

Even though the pipeline script is the same, the output will be quite different from last time. ZenML will detect that `airflow_stack` is the active stack, and do the following:

-- `chapter_7.py` will be copied to the Airflow `dag_dir` so Airflow can detect is as an [Airflow DAG definition file](https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html#it-s-a-dag-definition-file).
-- The Airflow DAG will show up in the Airflow UI at [http://0.0.0.0:8080](http://0.0.0.0:8080). You will have to login with the username and password generated above.
-- The DAG name will be the same as the pipeline name, so in this case `mnist_pipeline`.
-- The DAG will be scheduled to run every minute.
-- The DAG will be un-paused so you'll probably see the first run as you click through.
+* `chapter_7.py` will be copied to the Airflow `dag_dir` so Airflow can detect it as an [Airflow DAG definition file](https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html#it-s-a-dag-definition-file).
+* The Airflow DAG will show up in the Airflow UI at [http://0.0.0.0:8080](http://0.0.0.0:8080). You will have to log in with the username and password generated above.
+* The DAG name will be the same as the pipeline name, so in this case `mnist_pipeline`.
+* The DAG will be scheduled to run every minute.
+* The DAG will be un-paused, so you'll probably see the first run as you click through.

And that's it: As long as you keep Airflow running now, this script will run every minute, pull the latest data, and train a new model!

We now have a continuously training ML pipeline that trains on new data every day. All the pipelines will be tracked in your production [Stack's metadata store](../../core-concepts.md), the interim artifacts will be stored in the [Artifact Store](../../core-concepts.md), and the scheduling and orchestration is being handled by the [orchestrator](../../core-concepts.md), in this case Airflow.

-## Shutting down Airflow
+### Shutting down Airflow

Once we are done experimenting, we need to shut down Airflow by running:

```bash
zenml orchestrator down
```

-# Conclusion
+## Conclusion

If you made it this far, congratulations! You're one step closer to being production-ready with your ML workflows! Here is what we achieved in this entire guide:

-- Experimented locally and built-up a ML pipeline.
-- Transitioned to production by deploying a continuously training pipeline on newly arriving data.
-- All the while retained complete lineage and tracking over parameters, data, code, and metadata.
+* Experimented locally and built up an ML pipeline.
+* Transitioned to production by deploying a continuously training pipeline on newly arriving data.
+* All the while, retained complete lineage and tracking over parameters, data, code, and metadata.

-## Coming soon
+### Coming soon

There are lots more things you can do in production that you might consider adding to your workflows:

-- Adding a step to automatically deploy the models to a REST endpoint.
-- Setting up a drift detection and validation step to test models before deploying.
-- Creating a batch inference pipeline to get predictions.
+* Adding a step to automatically deploy the models to a REST endpoint.
+* Setting up a drift detection and validation step to test models before deploying.
+* Creating a batch inference pipeline to get predictions.

ZenML will help with all of these and more. Watch out for future releases and the next extension of this guide, coming soon!
49 changes: 49 additions & 0 deletions docs/book/guides/pipeline-configuration.md
@@ -0,0 +1,49 @@
# Pipeline Configuration

## Setting step parameters using a config file

In addition to setting parameters for your pipeline steps in code, ZenML also allows you to use a configuration [YAML](https://yaml.org/) file. This configuration file must have the following structure:
```yaml
steps:
  step_name:
    parameters:
      parameter_name: parameter_value
      some_other_parameter_name: 2
  some_other_step_name:
    ...
```
Use the configuration file by calling the pipeline method `with_config(...)`:

```python
@pipeline
def my_pipeline(...):
    ...

pipeline_instance = my_pipeline(...).with_config("path_to_config.yaml")
pipeline_instance.run()
```
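As a concrete, hypothetical pairing, suppose a step takes its parameters from a config class, assuming the `BaseStepConfig` pattern of this ZenML era; the step name and fields below are illustrative, and the import path may differ by version:

```python
from zenml.steps import step, BaseStepConfig  # import path is an assumption


class TrainerConfig(BaseStepConfig):
    """Illustrative parameters; any field declared here can be set via YAML."""

    learning_rate: float = 0.001
    epochs: int = 1


@step
def trainer(config: TrainerConfig) -> float:
    # ... train a model using config.learning_rate and config.epochs ...
    return 0.0
```

A matching `path_to_config.yaml` would then be:

```yaml
steps:
  trainer:
    parameters:
      learning_rate: 0.01
      epochs: 5
```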

## Naming a pipeline run

When running a pipeline by calling `my_pipeline.run()`, ZenML uses the current date and time as the name for the pipeline run.
In order to change the name for a run, simply pass it as a parameter to the `run()` function:

```python
my_pipeline.run(run_name="custom_pipeline_run_name")
```

{% hint style="warning" %}
Pipeline run names must be unique, so make sure to compute it dynamically if you plan to run your pipeline multiple times.
{% endhint %}
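A minimal sketch of computing such a name dynamically, e.g. from a timestamp:

```python
from datetime import datetime

# A fresh timestamp on every invocation keeps run names unique.
run_name = f"my_pipeline_run_{datetime.now().strftime('%Y_%m_%d_%H_%M_%S')}"
my_pipeline.run(run_name=run_name)
```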

Once the pipeline run is finished, we can easily access this specific run during our post-execution workflow:

```python
from zenml.core.repo import Repository

repo = Repository()
pipeline = repo.get_pipeline(pipeline_name="my_pipeline")
run = pipeline.get_run(run_name="custom_pipeline_run_name")
```
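From here, the run's artifacts can be inspected with the `steps` and `outputs` accessors described in the post-execution workflow guide:

```python
step = run.steps[0]       # first step of the fetched run
output = step.outputs[0]  # its first output artifact
value = output.read()     # reads with the original materializer
```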
110 changes: 110 additions & 0 deletions docs/book/guides/post-execution-workflow-test.md
@@ -0,0 +1,110 @@
# Post Execution Workflow

## Post-execution workflow

After executing a pipeline, the user needs to be able to fetch it from history and perform certain tasks. This page captures these workflows at a high level.

## Component Hierarchy


In the context of a post-execution workflow, there is an implied hierarchy of some basic ZenML components:

```bash
repository -> pipelines -> runs -> steps -> outputs

# where -> implies a 1-many relationship.
```
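Put in terms of the accessors used on this page, the hierarchy can be walked top-down; a small sketch, assuming a repository with at least one finished run:

```python
from zenml.core.repo import Repository

repo = Repository()
for pipeline in repo.get_pipelines():            # repository -> pipelines
    for run in pipeline.get_runs():              # pipeline   -> runs
        for step in run.steps:                   # run        -> steps
            print(step.name, len(step.outputs))  # step       -> outputs
```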

### Repository

The highest-level `repository` object is where to start from.
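It is obtained from the current ZenML repository, as shown elsewhere in these docs:

```python
from zenml.core.repo import Repository

# Points at the ZenML repository initialized via `zenml init`.
repo = Repository()
```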


### Define standard ML steps

```python
import torch

# Sketch: `@trainer` is an illustrative ZenML-style step decorator, and
# `model` is the trained module produced by the elided training code.
@trainer
def trainer(dataset: torch.utils.data.Dataset) -> torch.nn.Module:
    ...
    return model
```



### Get pipelines and runs

```python
pipelines = repo.get_pipelines() # get all pipelines from all stacks
pipeline = repo.get_pipeline(pipeline_name=..., stack_key=...)
```

```python
runs = pipeline.get_runs()  # all runs of a pipeline, chronologically ordered
run = runs[-1]              # latest run
step = run.steps[0]         # a step of that run
output = step.outputs[0]    # get its outputs
```

### Materializing outputs (or inputs)

Once an output artifact is acquired from history, one can visualize it with any chosen `Materializer`.

```python
df = output.read(materializer=PandasMaterializer)  # get data
```

### Seeing statistics and schema

```python
stats = output.read(materializer=StatisticsMaterializer) # get stats
schema = output.read(materializer=SchemaMaterializer) # get schema
```

### Retrieving Model

```python
model = output.read(materializer=KerasModelMaterializer) # get model
```



#### Pipelines

```python
# get all pipelines from all stacks
pipelines = repo.get_pipelines()

# or get one pipeline by name and/or stack key
pipeline = repo.get_pipeline(pipeline_name=..., stack_key=...)
```

#### Runs

```python
# all runs of a pipeline, chronologically ordered
runs = pipeline.get_runs()
run = runs[-1]  # latest run
```

#### Steps

```python
# at this point we switch from the `get_` paradigm to properties
steps = run.steps # all steps of a pipeline
step = steps[0]
print(step.name)
```

#### Outputs

```python
# All outputs of a step:
# - if there is one output, it is the first element in the list
# - if there are multiple outputs, they are in the order defined with `Output`
output = step.outputs[0]

# read() will get you the value from the original materializer used in the pipeline
output.read()
```

## Visuals


4 changes: 3 additions & 1 deletion docs/book/index.md
@@ -14,7 +14,9 @@ Read more about Why ZenML exists [here](why-zenml.md).

## Who is ZenML for?

-![ZenML Is For The Data Scientist](<.gitbook/assets/cover_image.png>)
+![Before: Sam is a regular data scientist struggling with productionalizing ML models.](.gitbook/assets/sam\_frustrated.jpg)

+![After: Sam finds Zen in her MLOps with ZenML.](<.gitbook/assets/sam\_zen\_mode (2).jpg>)

ZenML is created for data science / machine learning teams that are engaged in not only training models, but also putting them out in production. Production can mean many things, but examples would be:

1 change: 1 addition & 0 deletions docs/book/toc.md
@@ -21,6 +21,7 @@
* [High Level API](guides/high-level-api/README.md)
* [Chapter 1](guides/high-level-api/chapter-1.md)
* [Post Execution Workflow](guides/post-execution-workflow.md)
* [Pipeline Configuration](guides/pipeline-configuration.md)
* [Deploy Pipelines to Production](guides/deploy_to_production.md)

## Support