Merge pull request #168 from zenml-io/alex/ENG-61-pathutils-refactor
Merge path_utils into fileio & refactor what was left
alex-zenml authored Nov 11, 2021
2 parents 6e845ad + e6e81ee commit fff4986
Showing 60 changed files with 1,333 additions and 795 deletions.
11 changes: 11 additions & 0 deletions README.md
@@ -36,6 +36,13 @@

## What is ZenML?


Before: Sam struggles to productionalize ML | After: Sam finds Zen in her MLOps with ZenML.
:-------------------------:|:-------------------------:
![](docs/readme/sam_frustrated.jpg) | ![](docs/readme/sam_zen_mode.jpg)



**ZenML** is an extensible, open-source MLOps framework to create production-ready machine learning pipelines. It has a simple, flexible syntax,
is cloud and tool agnostic, and has interfaces/abstractions that are catered towards ML workflows.

@@ -104,6 +111,10 @@ ZenML is managed by a [core team](https://zenml.io/team) of developers that are

We would love to receive your contributions! Check our [Contributing Guide](CONTRIBUTING.md) for more details on how best to contribute.

<br>

![Repobeats analytics image](https://repobeats.axiom.co/api/embed/635c57b743efe649cadceba6a2e6a956663f96dd.svg "Repobeats analytics image")

## Copyright

ZenML is distributed under the terms of the Apache License Version 2.0. A complete version of the license is available in the [LICENSE.md](LICENSE.md) in this repository.
26 changes: 26 additions & 0 deletions RELEASE_NOTES.md
@@ -1,3 +1,29 @@
# 0.5.2

0.5.2 brings an improved post-execution workflow and lots of minor changes and upgrades to the developer experience when creating pipelines. It also improves the Airflow orchestrator logic to accommodate more real-world scenarios. Check out the [low-level API guide](https://docs.zenml.io/guides/low-level-api) for more details!

## What's Changed
* Fix autocomplete for step and pipeline decorated functions by @schustmi in https://github.com/zenml-io/zenml/pull/144
* Add reference docs for CLI example functionality by @alex-zenml in https://github.com/zenml-io/zenml/pull/145
* Fix mypy integration by @schustmi in https://github.com/zenml-io/zenml/pull/147
* Improve Post-Execution Workflow by @schustmi in https://github.com/zenml-io/zenml/pull/146
* Fix CLI examples bug by @alex-zenml in https://github.com/zenml-io/zenml/pull/148
* Update quickstart example notebook by @alex-zenml in https://github.com/zenml-io/zenml/pull/150
* Add documentation images by @alex-zenml in https://github.com/zenml-io/zenml/pull/151
* Add prettierignore to gitignore by @alex-zenml in https://github.com/zenml-io/zenml/pull/154
* Airflow orchestrator improvements by @schustmi in https://github.com/zenml-io/zenml/pull/153
* Google colab added by @htahir1 in https://github.com/zenml-io/zenml/pull/155
* Tests for `core` and `cli` modules by @alex-zenml in https://github.com/zenml-io/zenml/pull/149
* Add Paperspace environment check by @alex-zenml in https://github.com/zenml-io/zenml/pull/156
* Step caching by @schustmi in https://github.com/zenml-io/zenml/pull/157
* Add documentation for pipeline step parameter and run name configuration by @schustmi in https://github.com/zenml-io/zenml/pull/158
* Automatically disable caching if the step function code has changed by @schustmi in https://github.com/zenml-io/zenml/pull/159


**Full Changelog**: https://github.com/zenml-io/zenml/compare/0.5.1...0.5.2

# 0.5.1
0.5.1 builds on top of the 0.5.0 release with quick bug fixes.

Binary file added docs/book/.gitbook/assets/localstack (1).png
Binary file added docs/book/.gitbook/assets/localstack.png
Binary file added docs/book/.gitbook/assets/sam_frustrated.jpg
Binary file added docs/book/.gitbook/assets/sam_zen_mode (1).jpg
Binary file added docs/book/.gitbook/assets/sam_zen_mode (2).jpg
Binary file added docs/book/.gitbook/assets/sam_zen_mode (3).jpg
Binary file added docs/book/.gitbook/assets/sam_zen_mode.jpg
68 changes: 34 additions & 34 deletions docs/book/guides/low-level-api/chapter-7.md
@@ -2,29 +2,31 @@
description: Deploy pipelines to production
---

-If you want to see the code for this chapter of the guide, head over to the [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/low_level_guide/chapter_7.py).
+# Chapter 7

-# Deploy pipelines to production
+If you want to see the code for this chapter of the guide, head over to the [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/low\_level\_guide/chapter\_7.py).

+## Deploy pipelines to production

When developing ML models, your pipelines will at first most probably live on your machine with a local [Stack](../../core-concepts.md). However, at a certain point, when you are finished with their design, you might want to transition to a more production-ready setting and deploy the pipeline to a more robust environment.

-## Install and configure Airflow
+### Install and configure Airflow

-This part is optional, and it would depend on your pre-existing production setting. For example, for this guide, Airflow will be set up from scratch and set it to work locally, however you might want to use a managed Airflow instance like [Cloud Composer](https://cloud.google.com/composer) or [Astronomer](https://astronomer.io/).
+This part is optional and depends on your pre-existing production setting. For this guide, Airflow will be set up from scratch and run locally; however, you might want to use a managed Airflow instance like [Cloud Composer](https://cloud.google.com/composer) or [Astronomer](https://astronomer.io).

For this guide, you'll want to install Airflow before continuing:

```shell
pip install apache_airflow
```

-## Creating an Airflow Stack
+### Creating an Airflow Stack

A [Stack](../../core-concepts.md) is the configuration of the surrounding infrastructure where ZenML pipelines are run and managed. For now, a `Stack` consists of:

-- A metadata store: To store metadata like parameters and artifact URIs
-- An artifact store: To store interim data step output.
-- An orchestrator: A service that actually kicks off and runs each step of the pipeline.
+* A metadata store: To store metadata like parameters and artifact URIs.
+* An artifact store: To store interim step outputs.
+* An orchestrator: A service that actually kicks off and runs each step of the pipeline.

When you did `zenml init` at the start of this guide, a default `local_stack` was created with local versions of all of these. In order to see the stack, you can check it out on the command line:
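(The listing command itself is collapsed in this diff view; the following sketch is an assumption based on the `STACKS:` output shown below.)

```shell
# Presumed command behind the output below (not shown in this diff):
zenml stack list
```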

@@ -38,10 +40,10 @@ Output:
STACKS:
key stack_type metadata_store_name artifact_store_name orchestrator_name
----------- ------------ --------------------- --------------------- -------------------
local_stack base local_metadata_store local_artifact_store local_orchestrator
```

-![Your local stack when you start](assets/localstack.png)
+![Your local stack when you start](../../.gitbook/assets/localstack.png)

Let's stick with the `local_metadata_store` and a `local_artifact_store` for now and create an Airflow orchestrator and corresponding stack.
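(The registration commands are collapsed in this diff view. Judging from the `successfully registered!` output below, they presumably resembled the following sketch; the subcommand names and flags are assumptions based on the 0.5.x-era CLI, not taken from this diff.)

```shell
# Register an Airflow orchestrator and a stack that uses it, then
# activate that stack (syntax is an assumption; see `zenml --help`).
zenml orchestrator register airflow_orchestrator airflow
zenml stack register airflow_stack \
    -m local_metadata_store \
    -a local_artifact_store \
    -o airflow_orchestrator
zenml stack set airflow_stack
```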

@@ -62,32 +64,30 @@ Stack `airflow_stack` successfully registered!
Active stack: airflow_stack
```

-![Your stack with Airflow as orchestrator](assets/localstack-with-airflow-orchestrator.png)
+![Your stack with Airflow as orchestrator](../../.gitbook/assets/localstack-with-airflow-orchestrator.png)

{% hint style="warning" %}
In the real world, we would also switch to something like a MySQL-based metadata store and an Azure/GCP/S3-based artifact store. We have skipped that part to keep everything on one machine and make this guide a bit easier to run.
{% endhint %}

-## Starting up Airflow
+### Starting up Airflow

ZenML takes care of configuring Airflow; all we need to do is run:

```bash
zenml orchestrator up
```

-This will bootstrap Airflow, start up all the necessary components and run them in the background.
-When the setup is finished, it will print username and password for the Airflow webserver to the console.
+This will bootstrap Airflow, start up all the necessary components and run them in the background. When the setup is finished, it will print the username and password for the Airflow webserver to the console.

{% hint style="warning" %}
-If you can't find the password on the console, you can navigate to the `APP_DIR / airflow / airflow_root / STACK_UUID / standalone_admin_password.txt` file.
-The username will always be `admin`.
+If you can't find the password on the console, you can navigate to the `APP_DIR / airflow / airflow_root / STACK_UUID / standalone_admin_password.txt` file. The username will always be `admin`.

-- APP_DIR will depend on your os. See which path corresponds to your OS [here](https://click.palletsprojects.com/en/8.0.x/api/#click.get_app_dir).
-- STACK_UUID will be the unique id of the airflow_stack. There will be only one folder here so you can just navigate to the one that is present.
-{% endhint %}
+* APP\_DIR will depend on your OS. See which path corresponds to your OS [here](https://click.palletsprojects.com/en/8.0.x/api/#click.get\_app\_dir).
+* STACK\_UUID will be the unique ID of the airflow\_stack. There will be only one folder here, so you can just navigate to the one that is present.
+{% endhint %}

-## Run
+### Run

The code for this chapter is the same as in the last chapter, so run:

@@ -97,38 +97,38 @@ python chapter_7.py

Even though the pipeline script is the same, the output will be quite different from last time. ZenML will detect that `airflow_stack` is the active stack, and do the following:

-- `chapter_7.py` will be copied to the Airflow `dag_dir` so Airflow can detect is as an [Airflow DAG definition file](https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html#it-s-a-dag-definition-file).
-- The Airflow DAG will show up in the Airflow UI at [http://0.0.0.0:8080](http://0.0.0.0:8080). You will have to login with the username and password generated above.
-- The DAG name will be the same as the pipeline name, so in this case `mnist_pipeline`.
-- The DAG will be scheduled to run every minute.
-- The DAG will be un-paused so you'll probably see the first run as you click through.
+* `chapter_7.py` will be copied to the Airflow `dag_dir` so Airflow can detect it as an [Airflow DAG definition file](https://airflow.apache.org/docs/apache-airflow/stable/tutorial.html#it-s-a-dag-definition-file).
+* The Airflow DAG will show up in the Airflow UI at [http://0.0.0.0:8080](http://0.0.0.0:8080). You will have to log in with the username and password generated above.
+* The DAG name will be the same as the pipeline name, so in this case `mnist_pipeline`.
+* The DAG will be scheduled to run every minute.
+* The DAG will be un-paused, so you'll probably see the first run as you click through.

And that's it: As long as you keep Airflow running now, this script will run every minute, pull the latest data, and train a new model!

We now have a continuously training ML pipeline that trains on new data every day. All the pipelines will be tracked in your production [Stack's metadata store](../../core-concepts.md), the interim artifacts will be stored in the [Artifact Store](../../core-concepts.md), and the scheduling and orchestration is being handled by the [orchestrator](../../core-concepts.md), in this case Airflow.

-## Shutting down Airflow
+### Shutting down Airflow

Once we are done experimenting, we need to shut down Airflow by running:

```bash
zenml orchestrator down
```

-# Conclusion
+## Conclusion

If you made it this far, congratulations! You're one step closer to being production-ready with your ML workflows! Here is what we achieved in this entire guide:

-- Experimented locally and built-up a ML pipeline.
-- Transitioned to production by deploying a continuously training pipeline on newly arriving data.
-- All the while retained complete lineage and tracking over parameters, data, code, and metadata.
+* Experimented locally and built up an ML pipeline.
+* Transitioned to production by deploying a continuously training pipeline on newly arriving data.
+* All the while, retained complete lineage and tracking over parameters, data, code, and metadata.

-## Coming soon
+### Coming soon

There are lots more things you can do in production that you might consider adding to your workflows:

-- Adding a step to automatically deploy the models to a REST endpoint.
-- Setting up a drift detection and validation step to test models before deploying.
-- Creating a batch inference pipeline to get predictions.
+* Adding a step to automatically deploy the models to a REST endpoint.
+* Setting up a drift detection and validation step to test models before deploying.
+* Creating a batch inference pipeline to get predictions.

ZenML will help with all of these and more. Watch out for future releases and the next extension of this guide, coming soon!
49 changes: 49 additions & 0 deletions docs/book/guides/pipeline-configuration.md
@@ -0,0 +1,49 @@
# Pipeline Configuration

## Setting step parameters using a config file

In addition to setting parameters for your pipeline steps in code, ZenML also allows you to use a configuration [YAML](https://yaml.org/) file. This configuration file must have the following structure:
```yaml
steps:
  step_name:
    parameters:
      parameter_name: parameter_value
      some_other_parameter_name: 2
  some_other_step_name:
    ...
```
Use the configuration file by calling the pipeline method `with_config(...)`:

```python
@pipeline
def my_pipeline(...):
    ...

pipeline_instance = my_pipeline(...).with_config("path_to_config.yaml")
pipeline_instance.run()
```
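As a concrete, hypothetical pairing, suppose a step takes its parameters from a config class, assuming the `BaseStepConfig` pattern of this ZenML era; the step name and fields below are illustrative, and the import path may differ by version:

```python
from zenml.steps import step, BaseStepConfig  # import path is an assumption


class TrainerConfig(BaseStepConfig):
    """Illustrative parameters; any field declared here can be set via YAML."""

    learning_rate: float = 0.001
    epochs: int = 1


@step
def trainer(config: TrainerConfig) -> float:
    # ... train a model using config.learning_rate and config.epochs ...
    return 0.0
```

A matching `path_to_config.yaml` would then be:

```yaml
steps:
  trainer:
    parameters:
      learning_rate: 0.01
      epochs: 5
```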

## Naming a pipeline run

When running a pipeline by calling `my_pipeline.run()`, ZenML uses the current date and time as the name for the pipeline run.
In order to change the name for a run, simply pass it as a parameter to the `run()` function:

```python
my_pipeline.run(run_name="custom_pipeline_run_name")
```

{% hint style="warning" %}
Pipeline run names must be unique, so make sure to compute it dynamically if you plan to run your pipeline multiple times.
{% endhint %}
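A minimal sketch of computing such a name dynamically, e.g. from a timestamp:

```python
from datetime import datetime

# A fresh timestamp on every invocation keeps run names unique.
run_name = f"my_pipeline_run_{datetime.now().strftime('%Y_%m_%d_%H_%M_%S')}"
my_pipeline.run(run_name=run_name)
```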

Once the pipeline run is finished, we can easily access this specific run during our post-execution workflow:

```python
from zenml.core.repo import Repository

repo = Repository()
pipeline = repo.get_pipeline(pipeline_name="my_pipeline")
run = pipeline.get_run(run_name="custom_pipeline_run_name")
```
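From here, the run's artifacts can be inspected with the `steps` and `outputs` accessors described in the post-execution workflow guide:

```python
step = run.steps[0]       # first step of the fetched run
output = step.outputs[0]  # its first output artifact
value = output.read()     # reads with the original materializer
```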
110 changes: 110 additions & 0 deletions docs/book/guides/post-execution-workflow-test.md
@@ -0,0 +1,110 @@
# Post Execution Workflow

## Post-execution workflow

After executing a pipeline, the user needs to be able to fetch it from history and perform certain tasks. This page captures these workflows at a high level.

## Component Hierarchy


In the context of a post-execution workflow, there is an implied hierarchy of some basic ZenML components:

```bash
repository -> pipelines -> runs -> steps -> outputs

# where -> implies a 1-many relationship.
```
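Put in terms of the accessors used on this page, the hierarchy can be walked top-down; a small sketch, assuming a repository with at least one finished run:

```python
from zenml.core.repo import Repository

repo = Repository()
for pipeline in repo.get_pipelines():            # repository -> pipelines
    for run in pipeline.get_runs():              # pipeline   -> runs
        for step in run.steps:                   # run        -> steps
            print(step.name, len(step.outputs))  # step       -> outputs
```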

### Repository

The highest-level `repository` object is where to start from.
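It is obtained from the current ZenML repository, as shown elsewhere in these docs:

```python
from zenml.core.repo import Repository

# Points at the ZenML repository initialized via `zenml init`.
repo = Repository()
```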


### Define standard ML steps

```python
import torch

# Sketch: `@trainer` is an illustrative ZenML-style step decorator, and
# `model` is the trained module produced by the elided training code.
@trainer
def trainer(dataset: torch.utils.data.Dataset) -> torch.nn.Module:
    ...
    return model
```



### Get pipelines and runs

```python
pipelines = repo.get_pipelines() # get all pipelines from all stacks
pipeline = repo.get_pipeline(pipeline_name=..., stack_key=...)
```

```python
runs = pipeline.get_runs()  # all runs of a pipeline, chronologically ordered
run = runs[-1]              # latest run
step = run.steps[0]         # a step of that run
output = step.outputs[0]    # get its outputs
```

### Materializing outputs (or inputs)

Once an output artifact is acquired from history, one can visualize it with any chosen `Materializer`.

```python
df = output.read(materializer=PandasMaterializer)  # get data
```

### Seeing statistics and schema

```python
stats = output.read(materializer=StatisticsMaterializer) # get stats
schema = output.read(materializer=SchemaMaterializer) # get schema
```

### Retrieving Model

```python
model = output.read(materializer=KerasModelMaterializer) # get model
```



#### Pipelines

```python
# get all pipelines from all stacks
pipelines = repo.get_pipelines()

# or get one pipeline by name and/or stack key
pipeline = repo.get_pipeline(pipeline_name=..., stack_key=...)
```

#### Runs

```python
# all runs of a pipeline, chronologically ordered
runs = pipeline.get_runs()
run = runs[-1]  # latest run
```

#### Steps

```python
# at this point we switch from the `get_` paradigm to properties
steps = run.steps # all steps of a pipeline
step = steps[0]
print(step.name)
```

#### Outputs

```python
# All outputs of a step:
# - if there is one output, it is the first element in the list
# - if there are multiple outputs, they are in the order defined with `Output`
output = step.outputs[0]

# read() will get you the value from the original materializer used in the pipeline
output.read()
```

## Visuals


4 changes: 3 additions & 1 deletion docs/book/index.md
@@ -14,7 +14,9 @@ Read more about Why ZenML exists [here](why-zenml.md).

## Who is ZenML for?

-![ZenML Is For The Data Scientist](<.gitbook/assets/cover_image.png>)
+![Before: Sam is a regular data scientist struggling with productionalizing ML models.](.gitbook/assets/sam\_frustrated.jpg)

+![After: Sam finds Zen in her MLOps with ZenML.](<.gitbook/assets/sam\_zen\_mode (2).jpg>)

ZenML is created for data science / machine learning teams that are engaged in not only training models, but also putting them out in production. Production can mean many things, but examples would be:

1 change: 1 addition & 0 deletions docs/book/toc.md
@@ -21,6 +21,7 @@
* [High Level API](guides/high-level-api/README.md)
* [Chapter 1](guides/high-level-api/chapter-1.md)
* [Post Execution Workflow](guides/post-execution-workflow.md)
* [Pipeline Configuration](guides/pipeline-configuration.md)
* [Deploy Pipelines to Production](guides/deploy_to_production.md)

## Support