Add documentation images #151

Merged · 5 commits · Nov 2, 2021
Binary file added docs/book/assets/localstack.png
Binary file added docs/book/assets/quickstart-diagram.png
36 changes: 17 additions & 19 deletions docs/book/core-concepts.md
@@ -4,11 +4,9 @@ description: A good place to start before diving further into the docs.

# Core Concepts

-## Core Concepts

**ZenML** consists of the following key components:

-![ZenML Architectural Overview](<.gitbook/assets/core_concepts_zenml.png>)
+![ZenML Architectural Overview](assets/2021-11-02-architecture-overview.png)

**Repository**

@@ -83,8 +81,8 @@ def simplest_step_ever(basic_param_1: int, basic_param_2: str) -> int:

There are only a few considerations for the parameters and return types.

-* All parameters passed into the signature must be [typed](https://docs.python.org/3/library/typing.html). Similarly, if you're returning something, it must also be typed with the return operator (`->`).
-* ZenML uses [Pydantic](https://pydantic-docs.helpmanual.io/usage/types/) for type checking and serialization under the hood, so all [Pydantic types](https://pydantic-docs.helpmanual.io/usage/types/) are supported \[full list available soon].
+- All parameters passed into the signature must be [typed](https://docs.python.org/3/library/typing.html). Similarly, if you're returning something, it must also be typed with the return operator (`->`).
+- ZenML uses [Pydantic](https://pydantic-docs.helpmanual.io/usage/types/) for type checking and serialization under the hood, so all [Pydantic types](https://pydantic-docs.helpmanual.io/usage/types/) are supported \[full list available soon].

While this is just a function with a decorator, it is not super useful. ZenML steps really get powerful when you put them together with [data artifacts](broken-reference). Read more about that here!

@@ -104,7 +102,7 @@

```python
def my_step(first_artifact: int, second_artifact: torch.nn.Module) -> int:
    return 1
```

Artifacts can be serialized and deserialized (i.e. written to and read from the Artifact Store) in many different ways, like `TFRecord`s or saved model pickles, depending on what the step produces. The serialization and deserialization logic of artifacts is defined by [materializers.md](reference/zenml/materializers.md "mention").

**Materializers**

@@ -120,7 +118,7 @@ from zenml.steps.base_step_config import BaseStepConfig

class MyStepConfig(BaseStepConfig):
    basic_param_1: int = 1
    basic_param_2: str = "2"

@step
def my_step(params: MyStepConfig):
    # user params here
@@ -143,9 +141,9 @@ An orchestrator is a special kind of backend that manages the running of each step

A stack is made up of the following three core components:

-* An Artifact Store
-* A Metadata Store
-* An Orchestrator (backend)
+- An Artifact Store
+- A Metadata Store
+- An Orchestrator (backend)

A ZenML stack also happens to be a Pydantic `BaseSettings` class, which means that there are multiple ways to use it.

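To make that concrete: a Pydantic `BaseSettings` class can be populated from explicit keyword arguments or from environment variables interchangeably. A generic sketch of that behaviour, with illustrative placeholder fields rather than ZenML's actual stack definition:

```python
from pydantic import BaseSettings


class StackSettings(BaseSettings):
    # Hypothetical fields for illustration; the real ZenML stack class differs.
    artifact_store: str = "./artifacts"
    metadata_store: str = "./metadata.db"
    orchestrator: str = "local"

    class Config:
        env_prefix = "MY_STACK_"


# Configure explicitly in code...
stack = StackSettings(orchestrator="airflow")

# ...or via the environment, e.g. MY_STACK_ORCHESTRATOR=airflow
stack = StackSettings()
```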
@@ -172,9 +170,9 @@ On a high level, when data is read from an **artifact** the results are persisted

A few rules apply:

-* Every **orchestrator** (local, Google Cloud VMs, etc) can run all **pipeline steps**, including training.
-* **Orchestrators** have a selection of compatible **processing backends**.
-* **Pipelines** can be configured to utilize more powerful **processing** (e.g. distributed) and **training** (e.g. Google AI Platform) **executors**.
+- Every **orchestrator** (local, Google Cloud VMs, etc) can run all **pipeline steps**, including training.
> **Contributor:** Are these all automatic from your linter? We should standardize it, otherwise it becomes hard to review markdown. Can you share the tools you're using? Maybe we can add a pre-commit hook?

> **Contributor Author:** Yeah, sorry about that. I think it is Prettier, which is taking out extra spaces at the ends of lines. https://prettier.io/docs/en/precommit.html shows how it'd integrate with pre-commit. I imagine there will be some overlap with some things we already have? For markdown, my settings are:

    {
      "arrowParens": "always",
      "bracketSpacing": true,
      "endOfLine": "lf",
      "htmlWhitespaceSensitivity": "css",
      "insertPragma": false,
      "jsxBracketSameLine": false,
      "jsxSingleQuote": false,
      "printWidth": 80,
      "proseWrap": "preserve",
      "quoteProps": "as-needed",
      "requirePragma": false,
      "semi": true,
      "singleQuote": false,
      "tabWidth": 2,
      "trailingComma": "es5",
      "useTabs": false,
      "vueIndentScriptAndStyle": false,
      "filepath": "/Users/strickvl/coding/zenml/repos/zenml/docs/book/guides/low-level-api/chapter-7.md",
      "parser": "markdown"
    }

> **Contributor:** Would you mind adding it to the whole dev cycle? Including adding a new script and editing our pre-commit hook?

> **Contributor Author:** Added in a separate branch.

+- **Orchestrators** have a selection of compatible **processing backends**.
+- **Pipelines** can be configured to utilize more powerful **processing** (e.g. distributed) and **training** (e.g. Google AI Platform) **executors**.

A quick example for large datasets makes this clearer. By default, your experiments will run locally. Pipelines that load large datasets would be severely bottlenecked, so you can configure [Google Dataflow](https://cloud.google.com/dataflow) as a **processing executor** for distributed computation, and [Google AI Platform](https://cloud.google.com/ai-platform) as a **training executor**.

@@ -184,11 +182,11 @@ The design choices in **ZenML** follow the understanding that production-ready m…

In other words, **ZenML** runs your **ML** code while taking care of the "**Op**eration**s**" for you. It takes care of:

-* Interfacing between the individual processing **steps** (splitting, transform, training).
-* Tracking of intermediate results and metadata.
-* Caching your processing artifacts.
-* Parallelization of computing tasks.
-* Ensuring the immutability of your pipelines from data sourcing to model artifacts.
-* No matter where: cloud, on-prem, or locally.
+- Interfacing between the individual processing **steps** (splitting, transform, training).
+- Tracking of intermediate results and metadata.
+- Caching your processing artifacts.
+- Parallelization of computing tasks.
+- Ensuring the immutability of your pipelines from data sourcing to model artifacts.
+- No matter where: cloud, on-prem, or locally.

Since production scenarios often look complex, **ZenML** is built with integrations in mind. **ZenML** will support a range of integrations for processing, training, and serving, and you can always add custom integrations via our extensible interfaces.
12 changes: 7 additions & 5 deletions docs/book/guides/low-level-api/chapter-1.md
@@ -4,7 +4,7 @@ description: Create your first step.

If you want to see the code for this chapter of the guide, head over to the [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/low_level_guide/chapter_1.py).

-# Chapter 1: Create an importer step to load data
+# Create an importer step to load data
htahir1 marked this conversation as resolved.

The first thing to do is to load our data. We create a step that can load data from an external source (in this case a [Keras Dataset](https://keras.io/api/datasets/)). This can be done by creating a simple function and decorating it with the `@step` decorator.

@@ -30,8 +30,8 @@ def importer_mnist() -> Output(

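The body of the step is collapsed in this diff view. A rough sketch of its shape, where the output names and the `zenml.steps.step_output.Output` import path come from this chapter, while the `step` import location and the exact body are assumptions:

```python
import numpy as np
import tensorflow as tf

from zenml.steps import step
from zenml.steps.step_output import Output


@step
def importer_mnist() -> Output(
    X_train=np.ndarray, y_train=np.ndarray, X_test=np.ndarray, y_test=np.ndarray
):
    """Download the MNIST data and return it as four arrays."""
    (X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
    return X_train, y_train, X_test, y_test
```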
There are some things to note:

-* As this step has multiple outputs, we need to use the `zenml.steps.step_output.Output` class to indicate the names of each output. If there was only one, we would not need to do this.
-* We could have returned the `tf.keras.datasets.mnist` directly but we wanted to persist the actual data (for caching purposes), rather than the dataset object.
+- As this step has multiple outputs, we need to use the `zenml.steps.step_output.Output` class to indicate the names of each output. If there was only one, we would not need to do this.
+- We could have returned the `tf.keras.datasets.mnist` directly but we wanted to persist the actual data (for caching purposes), rather than the dataset object.

Now we can go ahead and create a pipeline with one step to make sure this step works:

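The pipeline definition itself is collapsed here; a minimal sketch, assuming the `@pipeline` decorator lives in `zenml.pipelines` in this version:

```python
from zenml.pipelines import pipeline


@pipeline
def load_mnist_pipeline(importer):
    """A pipeline with a single importer step."""
    importer()
```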
@@ -51,11 +51,13 @@ load_mnist_pipeline(importer=importer_mnist()).run()

## Run

You can run this as follows:

```bash
python chapter_1.py
```

The output will look as follows (note: this is filtered to highlight the most important logs):

@@ -66,7 +68,7 @@

```bash
Step `importer_mnist` has started.
Step `importer_mnist` has finished in 1.726s.
```

## Inspect

You can add the following code to fetch the pipeline:

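That code is collapsed in this view. A sketch of the idea, assuming the ZenML 0.5.x `Repository` API; the exact method names may differ from the chapter's actual code:

```python
from zenml.core.repo import Repository

# Fetch the pipeline we just ran from the local repository.
repo = Repository()
pipeline = repo.get_pipeline(pipeline_name="load_mnist_pipeline")

# Grab the latest run and read each output artifact back into memory.
run = pipeline.runs[-1]
step = run.get_step(name="importer")
for name, output in step.outputs.items():
    arr = output.read()
    print(f"Output '{name}' is an array with shape: {arr.shape}")
```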
@@ -98,4 +100,4 @@

```bash
Output 'y_train' is an array with shape: (60000,)
Output 'X_train' is an array with shape: (60000, 28, 28)
```

So now we have successfully confirmed that the data is loaded with the right shape and we can fetch it again from the artifact store.
10 changes: 5 additions & 5 deletions docs/book/guides/low-level-api/chapter-2.md
@@ -4,11 +4,10 @@ description: Add some normalization

If you want to see the code for this chapter of the guide, head over to the [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/low_level_guide/chapter_2.py).

-# Chapter 2: Normalize the data.
+# Normalize the data.

Now before writing any trainers, we can normalize our data to make sure we get better results. To do this, let's add another step and make the pipeline a bit more complex.


## Create steps

We can think of this as a `normalizer` step that takes data from the importer and normalizes it:
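The step body is collapsed in the diff; a sketch of what such a step can look like. The output names `X_train_normed`/`X_test_normed` come from this chapter's inspect output, while the exact normalization used in chapter_2.py is an assumption:

```python
import numpy as np

from zenml.steps import step
from zenml.steps.step_output import Output


@step
def normalizer(
    X_train: np.ndarray, X_test: np.ndarray
) -> Output(X_train_normed=np.ndarray, X_test_normed=np.ndarray):
    """Scale the image pixel values into the [0, 1] range."""
    return X_train / 255.0, X_test / 255.0
```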
@@ -38,13 +37,14 @@ def load_and_normalize_pipeline(
    normalizer(X_train=X_train, X_test=X_test)
```


## Run

You can run this as follows:

```bash
python chapter_2.py
```

The output will look as follows (note: this is filtered to highlight the most important logs):

@@ -57,7 +57,7 @@

```bash
Step `normalize_mnist` has started.
Step `normalize_mnist` has finished in 1.848s.
```

## Inspect

You can add the following code to fetch the pipeline:

@@ -87,4 +87,4 @@

```bash
Output 'X_train_normed' is an array with shape: (60000, 28, 28)
Output 'X_test_normed' is an array with shape: (10000, 28, 28)
```

Which confirms again that the data is stored properly! Now we are ready to create some trainers...
19 changes: 11 additions & 8 deletions docs/book/guides/low-level-api/chapter-3.md
@@ -4,9 +4,10 @@ description: Train some models.

If you want to see the code for this chapter of the guide, head over to the [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/low_level_guide/chapter_3.py).

-# Chapter 3: Train and evaluate the model.
+# Train and evaluate the model.

Finally we can train and evaluate our model.

## Create steps

For this we decide to add two steps, a `trainer` and an `evaluator` step. We also keep using TensorFlow to help with these.
@@ -26,10 +27,10 @@ class TrainerConfig(BaseStepConfig):

class TrainerConfig(BaseStepConfig):
    epochs: int = 1
    gamma: float = 0.7
    lr: float = 0.001

@step
def tf_trainer(
    config: TrainerConfig,  # not an artifact; passed in at run-time
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> tf.keras.Model:

@@ -61,10 +62,11 @@ def tf_trainer(

A few things of note:

-* This is our first instance of `parameterizing` a step with a `BaseStepConfig`. This allows us to specify some parameters at run-time rather than via data artifacts between steps.
-* This time the trainer returns a `tf.keras.Model`, which ZenML takes care of storing in the artifact store. We will talk about how to 'take over' this storing via `Materializers` in a later chapter.
+- This is our first instance of `parameterizing` a step with a `BaseStepConfig`. This allows us to specify some parameters at run-time rather than via data artifacts between steps.
+- This time the trainer returns a `tf.keras.Model`, which ZenML takes care of storing in the artifact store. We will talk about how to 'take over' this storing via `Materializers` in a later chapter.

### Evaluator

We also add a simple evaluator:

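The evaluator's body is collapsed below; a sketch under the assumption that it computes test accuracy with Keras' `evaluate` (the step name `tf_evaluator` comes from the run logs later in this chapter):

```python
import numpy as np
import tensorflow as tf

from zenml.steps import step


@step
def tf_evaluator(
    X_test: np.ndarray,
    y_test: np.ndarray,
    model: tf.keras.Model,
) -> float:
    """Calculate the accuracy of the model on the test set."""
    # Assumes the model was compiled with accuracy as its single metric.
    _, test_acc = model.evaluate(X_test, y_test, verbose=2)
    return test_acc
```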
@@ -116,6 +118,7 @@ mnist_pipeline(
Beautiful, now the pipeline is truly doing something. Let's run it!

## Run

You can run this as follows:

```bash
python chapter_3.py
```

The output will look as follows (note: this is filtered to highlight the most important logs):

@@ -138,7 +141,7 @@

```bash
Step `tf_evaluator` has started.
`tf_evaluator` has finished in 0.742s.
```

## Inspect

If you add the following code to fetch the pipeline:

@@ -165,4 +168,4 @@

```bash
The first run has 4 steps.
The `tf_evaluator` step returned an accuracy: 0.9100000262260437
```

Wow, we just trained our first model! But we have not stopped yet. What if we did not want to use TensorFlow? Let's swap out our trainers and evaluators for different libraries.
10 changes: 6 additions & 4 deletions docs/book/guides/low-level-api/chapter-4.md
@@ -4,7 +4,7 @@ description: Leverage caching.

If you want to see the code for this chapter of the guide, head over to the [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/low_level_guide/chapter_4.py).

-# Chapter 4: Swap out implementations of individual steps and see caching in action
+# Swap out implementations of individual steps and see caching in action

What if we don't want to use TensorFlow but rather a [scikit-learn](https://scikit-learn.org/) model? This is easy to do.

@@ -35,6 +35,7 @@ def sklearn_trainer(
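The trainer body is collapsed here; a sketch of the shape, where the specific classifier is an assumption (any sklearn `ClassifierMixin` would work):

```python
import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.linear_model import LogisticRegression

from zenml.steps import step


@step
def sklearn_trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train a simple sklearn classifier on flattened MNIST images."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train.reshape((X_train.shape[0], -1)), y_train)
    return clf
```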
A simple enough step using a sklearn `ClassifierMixin` model. ZenML also knows how to store all primitive sklearn model types.

### Evaluator

We also add a simple evaluator:

@@ -65,6 +66,7 @@ mnist_pipeline(

## Run

You can run this as follows:

```bash
python chapter_4.py
```

The output will look as follows (note: this is filtered to highlight the most important logs):

@@ -89,7 +91,7 @@

```bash
Step `sklearn_evaluator` has finished in 0.191s.
```

Note that the `importer` and `mnist` steps are now **100x** faster. This is because we have not changed the pipeline at all, and just made another run with different functions. So ZenML caches these steps and skips straight to the new trainer and evaluator.

## Inspect

If you add the following code to fetch the pipeline:

@@ -115,6 +117,6 @@

```bash
For tf_evaluator, the accuracy is: 0.91
For sklearn_evaluator, the accuracy is: 0.92
```

Looks like sklearn narrowly beat TensorFlow in this one. If we want, we can keep extending this and add a PyTorch example (as we have done in the `not_so_quickstart` [example](https://github.com/zenml-io/zenml/tree/main/examples/not_so_quickstart)).

Combining different complex steps with standard pipeline interfaces is a powerful tool in any MLOps setup. You can now organize, track, and manage your codebase as it grows with your use-cases.
10 changes: 6 additions & 4 deletions docs/book/guides/low-level-api/chapter-5.md
@@ -4,12 +4,13 @@ description: Materialize artifacts as you want.

If you want to see the code for this chapter of the guide, head over to the [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/low_level_guide/chapter_5.py).

-# Chapter 5: Materialize artifacts the way you want to consume them.
+# Materialize artifacts the way you want to consume them.

At this point, the precise way that data passes between the steps has been a bit of a mystery to us. There is, of course, a mechanism to serialize and deserialize the data flowing between steps. If we require further control, we can take over this mechanism ourselves.

## Create custom materializer

Data that flows through steps is stored in `Artifact Stores`. The logic that governs the reading and writing of data to and from the `Artifact Stores` lives in the `Materializers`.

Suppose we wanted to write the output of our `evaluator` step and store it in a SQLite table in the Artifact Store, rather than whatever the default mechanism is to store the float. Well, that should be easy. Let's create a custom materializer:

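The materializer code is collapsed in this view. A minimal sketch of the idea, assuming the ZenML 0.5.x materializer interface (`handle_input`/`handle_return` and `self.artifact.uri`); the names here are reconstructions, not necessarily the PR's exact code:

```python
import os
import sqlite3

from zenml.materializers.base_materializer import BaseMaterializer


class SQLiteMaterializer(BaseMaterializer):
    """Stores and retrieves a float via a SQLite table in the artifact store."""

    ASSOCIATED_TYPES = [float]

    def _db_path(self) -> str:
        return os.path.join(self.artifact.uri, "data.db")

    def handle_return(self, data: float) -> None:
        """Write the float into a SQLite table inside the artifact store."""
        with sqlite3.connect(self._db_path()) as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS results (value REAL)")
            conn.execute("INSERT INTO results VALUES (?)", (data,))

    def handle_input(self, data_type) -> float:
        """Read the float back from the SQLite table."""
        with sqlite3.connect(self._db_path()) as conn:
            return conn.execute("SELECT value FROM results").fetchone()[0]
```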
Expand Down Expand Up @@ -96,13 +97,14 @@ scikit_p = mnist_pipeline(
```

## Run

You can run this as follows:

```bash
python chapter_5.py
```

## Inspect

We can also now read data from the SQLite table with our custom materializer:

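That code is collapsed here; a sketch, assuming a 0.5.x-style inspection flow where an output artifact can be read back through an explicit materializer (the `read()` signature and the `SQLiteMaterializer` name from the sketch above are assumptions):

```python
from zenml.core.repo import Repository

repo = Repository()
pipeline = repo.get_pipeline(pipeline_name="mnist_pipeline")
print(f"Pipeline `mnist_pipeline` has {len(pipeline.runs)} run(s)")

# Read the evaluator's float back through the custom SQLite materializer.
evaluator_step = pipeline.runs[-1].get_step(name="evaluator")
value = evaluator_step.output.read(float, SQLiteMaterializer)
print(f"The evaluator stored the value: {value} in a SQLite database!")
```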
Expand All @@ -120,4 +122,4 @@ Which returns:
```bash
Pipeline `mnist_pipeline` has 1 run(s)
The evaluator stored the value: 0.9238 in a SQLite database!
```
13 changes: 7 additions & 6 deletions docs/book/guides/low-level-api/chapter-6.md
@@ -4,15 +4,15 @@ description: Reading from a continuously changing datasource

If you want to see the code for this chapter of the guide, head over to the [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/low_level_guide/chapter_6.py).

-# Chapter 6: Import data from a dynamic data source
+# Import data from a dynamic data source

Until now, we've been reading from a static data importer step because we are at the experimentation phase of the ML workflow. Now, as we head towards production, we want to switch over to a non-static, dynamic data importer step.

This could be anything like:

-* A database/data warehouse that updates regularly (SQL databases, BigQuery, Snowflake)
-* A data lake (S3 Buckets/Azure Blob Storage/GCP Storage)
-* An API which allows you to query the latest data.
+- A database/data warehouse that updates regularly (SQL databases, BigQuery, Snowflake)
+- A data lake (S3 Buckets/Azure Blob Storage/GCP Storage)
+- An API which allows you to query the latest data.

## Read from a dynamic datasource

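The importer code is collapsed in this view. A deliberately hypothetical sketch of what a dynamic importer step can look like; the endpoint, the payload format, and the `enable_cache` flag below are illustrative assumptions:

```python
import numpy as np
import requests

from zenml.steps import step
from zenml.steps.step_output import Output

# Hypothetical endpoint, for illustration only.
DATA_URL = "https://example.com/api/latest-mnist"


# enable_cache=False is an assumption: a dynamic source should re-run
# on every pipeline run instead of being served from the cache.
@step(enable_cache=False)
def dynamic_importer() -> Output(X_train=np.ndarray, y_train=np.ndarray):
    """Fetch the latest snapshot of the dataset from an external API."""
    payload = requests.get(DATA_URL).json()
    X_train = np.array(payload["images"], dtype=np.float32)
    y_train = np.array(payload["labels"], dtype=np.int64)
    return X_train, y_train
```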
@@ -71,13 +71,14 @@ scikit_p = mnist_pipeline(

## Run

You can run this as follows:

```bash
python chapter_6.py
```

## Inspect

Even if our data originally lives in an external API, we have now downloaded and versioned it locally as we ran this pipeline. So we can fetch it and inspect it:

@@ -103,4 +104,4 @@ Now we are loading data dynamically from a continuously changing data source!

{% hint style="info" %}
In the near future, ZenML will help you automatically detect drift and schema changes across pipeline runs, to make your pipelines even more robust! Keep an eye out on this space and future releases!
{% endhint %}