Add concepts back into glossary #425

Merged (1 commit) on Feb 23, 2022
`docs/book/reference/glossary.md`: 165 changes (151 additions, 14 deletions)
# Glossary

## Artifact

Artifacts are the data that power your experimentation and model training. It
is the steps that produce artifacts, which are then stored in the artifact
store.

Artifacts can be serialized and deserialized (i.e. written to and read from the
Artifact Store) in different ways, like `TFRecord`s or saved model pickles,
depending on what the step produces. The serialization and deserialization
logic of artifacts is defined by Materializers.
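
For example, a step that returns a pandas `DataFrame` will have that output
serialized into the artifact store by the matching built-in materializer. A
minimal sketch, using the `@step` decorator described under Step below (the
step name and data here are made up):

```python
import pandas as pd

from zenml.steps import step


@step
def importer() -> pd.DataFrame:
    # The returned DataFrame becomes an artifact: ZenML serializes it into
    # the artifact store using the materializer registered for pd.DataFrame.
    return pd.DataFrame({"feature": [1.0, 2.0, 3.0], "label": [0, 1, 0]})
```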

## Artifact Store

An artifact store is a place where artifacts are stored. These artifacts may
have been produced by the pipeline steps, or they may be the data first ingested
into a pipeline via an ingestion step.

## CLI

Our command-line tool is your entry point into ZenML. You install this tool and
use it to set up and configure your repository to work with ZenML. A simple
`init` command serves to get you started, and then you can provision the
infrastructure that you wish to work with easily using a simple `stack register`
command with the relevant arguments passed in.

## Container Registry

A container registry is a store for (Docker) containers. A ZenML workflow
involving a container registry would see you spinning up a Kubernetes cluster
and then deploying a pipeline to be run on Kubeflow Pipelines. As part of the
deployment to the cluster, the ZenML base image would be downloaded (from a
cloud container registry) and used as the basis for the deployed 'run'. When you
are running a local Kubeflow stack, you would therefore have a local container
registry which stores the container images you create that bundle up your
pipeline code. These images would in turn be built on top of a base image or
custom image of your choice.

## DAG

Pipelines are traditionally represented as DAGs. DAG is an acronym for Directed
Acyclic Graph.

- Directed, because the nodes of the graph (i.e. the steps of a pipeline) have
a sequence. Nodes do not exist as free-standing entities in this way.
- Acyclic, because there must be one or more straight paths through the graph
from the beginning to the end. It is acyclic because the graph doesn't loop
back on itself at any point.
- Graph, because the steps of the pipeline are represented as nodes in a graph.

ZenML follows this paradigm and it is a useful mental model to have in your head
when thinking about how the pieces of your pipeline get executed and how
dependencies between the different stages are managed.
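
As a rough sketch of how the DAG emerges in code (the step names here are
hypothetical, and the `@pipeline` decorator is covered under Pipeline below),
the edges of the graph are simply the data dependencies between steps:

```python
from zenml.pipelines import pipeline


@pipeline
def branching_pipeline(load_users, load_events, merge):
    # Two nodes with no dependency between them can run in any order ...
    users = load_users()
    events = load_events()
    # ... but both are directed towards the step that consumes their
    # outputs, and nothing ever points back, so the graph stays acyclic.
    merge(users, events)
```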

## Materializers

A materializer defines how and where artifacts live in between steps. It is
used to convert a ZenML artifact into a specific format, and is most often
used to handle the inputs and outputs of ZenML steps. Materializers can be
extended by building on the `BaseMaterializer` class. We care about this
because steps are not just isolated pieces of work; they are linked together,
and the outputs of one step might well be the inputs of the next.

We have some built-in ways to serialize and deserialize the data flowing between
steps. Of course, if you are using some library or tool which doesn't work with
our built-in options, you can write
[your own custom materializer](https://docs.zenml.io/guides/functional-api/materialize-artifacts)
to ensure that your data can be passed from step to step in this way. We use our
[`fileio` utilities](https://apidocs.zenml.io/api_reference/zenml.io.fileio.html)
to do the disk operations without needing to be concerned with whether we're
operating on a local or cloud machine.
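
As a sketch of what extending `BaseMaterializer` can look like (the `MyObj`
type and the `data.txt` file name are made up for illustration; `handle_input`
and `handle_return` are the read and write hooks of the interface at the time
of writing):

```python
import os
from typing import Type

from zenml.io import fileio
from zenml.materializers.base_materializer import BaseMaterializer


class MyObj:
    def __init__(self, name: str):
        self.name = name


class MyMaterializer(BaseMaterializer):
    """Stores MyObj as a small text file inside the artifact store."""

    ASSOCIATED_TYPES = [MyObj]

    def handle_input(self, data_type: Type) -> MyObj:
        # Deserialize: read the artifact back from the artifact store.
        super().handle_input(data_type)
        with fileio.open(os.path.join(self.artifact.uri, "data.txt"), "r") as f:
            return MyObj(name=f.read())

    def handle_return(self, my_obj: MyObj) -> None:
        # Serialize: write the artifact into the artifact store.
        super().handle_return(my_obj)
        with fileio.open(os.path.join(self.artifact.uri, "data.txt"), "w") as f:
            f.write(my_obj.name)
```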

## Metadata

Metadata are the pieces of information tracked about the pipelines, experiments
and configurations that you are running with ZenML. Metadata are stored inside
the metadata store.

## Metadata Store

The configurations of pipelines, steps, and produced artifacts are all tracked
within the metadata store. The metadata store is an SQL database, and can be
`sqlite` or `mysql`.

## Orchestrator

An orchestrator manages the running of each step of the pipeline, administering
the actual pipeline runs. You can think of it as the 'root' of any pipeline job
that you run during your experimentation.

## Parameter

When we think about steps as functions, we know they receive input in the form
of artifacts. We also know that they produce output (also in the form of
artifacts, stored in the artifact store). But steps also take parameters. The
parameters that you pass into the steps are also (helpfully!) stored in the
metadata store. This helps freeze the iterations of your experimentation
workflow in time, so you can return to them exactly as you ran them.
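
A minimal sketch of how a parameter reaches a step, assuming the
`BaseStepConfig` pattern (the names and values here are illustrative):

```python
from zenml.steps import BaseStepConfig, step


class TrainerConfig(BaseStepConfig):
    """Step parameters; their values are recorded in the metadata store
    for every run."""

    learning_rate: float = 0.001
    epochs: int = 10


@step
def trainer(config: TrainerConfig) -> float:
    # The tracked parameter values arrive on the config object, so a past
    # run can be reproduced with exactly the same inputs.
    return config.learning_rate * config.epochs
```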

## Pipeline

Pipelines are designed as simple functions. They are created by using decorators
appropriate to the specific use case you have. The moment it is `run`, a
pipeline is compiled and passed directly to the orchestrator, to be run in the
orchestrator environment.

Within your repository, you will have one or more pipelines as part of your
experimentation workflow. A ZenML pipeline is a sequence of tasks that execute
in a specific order and yield artifacts. The artifacts are stored within the
artifact store and indexed via the metadata store. Each individual task within a
pipeline is known as a step. The standard pipelines (like `TrainingPipeline`)
within ZenML are designed to have easy interfaces to add pre-decided steps, with
the order also pre-decided. Other sorts of pipelines can also be created from
scratch.
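
A minimal sketch of such a from-scratch pipeline (the step bodies here are
placeholders):

```python
from zenml.pipelines import pipeline
from zenml.steps import step


@step
def importer() -> int:
    return 42


@step
def trainer(data: int) -> float:
    return data * 0.1


@pipeline
def my_pipeline(importer, trainer):
    # The data dependency between the two steps fixes their order.
    data = importer()
    trainer(data)


# Calling run() compiles the pipeline and hands it to the orchestrator of
# the active stack.
my_pipeline(importer=importer(), trainer=trainer()).run()
```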

## Repository

Every ZenML project starts inside a ZenML repository, and it is at the core of
all ZenML activity. Every action that can be executed within ZenML must take
place within such a repository. ZenML repositories are denoted by a local
`.zen` folder in your project root, where various information about your local
configuration lives, e.g., the active
[Stack](../guides/functional-api/deploy-to-production.md) that you are using
to run pipelines.

## Stack

A stack is made up of the following three core components:

- An Artifact Store
- A Metadata Store
- An Orchestrator

A ZenML stack also happens to be a Pydantic `BaseSettings` class, which means
that there are multiple ways to use it.

```bash
zenml stack register STACK_NAME \
    -m METADATA_STORE_NAME \
    -a ARTIFACT_STORE_NAME \
    -o ORCHESTRATOR_NAME
```

## Step

A step is a single piece or stage of a ZenML pipeline. Think of each step as
being one of the nodes of the DAG. Steps are responsible for one aspect of
processing or interacting with the data / artifacts in the pipeline. ZenML
currently implements a basic `step` interface, but there will be other, more
customized interfaces (layered in a hierarchy) for specialized
implementations, for example broad steps like `@trainer`, `@split`, and so on.
Conceptually, a
`Step` is a discrete and independent part of a pipeline that is responsible for
one particular aspect of data manipulation inside a ZenML pipeline.
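
A minimal sketch of the basic `step` interface (a made-up splitting step,
using the `Output` helper for named outputs):

```python
from zenml.steps import Output, step


@step
def split(n: int) -> Output(train=int, test=int):
    # One node of the DAG, responsible for a single piece of data
    # manipulation: dividing a count into train and test portions.
    train = n * 8 // 10
    return train, n - train
```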

## Visualizer

A visualizer contains logic to create visualizations within the ZenML ecosystem.