An opinionated docker image and helm chart for a simple docker/kubernetes
airflow deploy.
- A docker image for airflow.
- A helm chart for airflow.
See examples here.
- Python 3.8
- Airflow 2.1.1
- Kubernetes/Local executors (celery executor is supported on docker compose only at this time)
- KubernetesJobOperator (built in)
- Database Logger (built in, AirflowDBLogger) - airflow logs are saved to the database using SQLAlchemy.
- dags and plugins synchronization with a git repo (per branch/tag).
- Default configuration for pools, variables and connections.
- Default configuration for airflow webserver (admin allow all).
- linux/arm64 devices. (Tested on linux/arm64/v8 raspberry pi 4)
The zairflow image is published to dockerhub, and the helm chart is hosted on a github release:
- lamaani/zairflow:[major].[minor].[patch]
- lamaani/zairflow:[major].[minor]
- lamaani/zairflow:latest
- https://github.com/LamaAni/zairflow/releases/download/[release_tag, eg. 0.5.2]/helm.tar.gz
The image is tagged per release. Version definition,
[major].[minor].[patch]
Changes to the default config:
- [Core] logging_config_class = airflow_db_logger.LOGGING_CONFIG - log to the database instead of files.
- [Kubernetes] dags_in_image = True - expect kubernetes worker dags in the image.
- [Kubernetes] kube_client_request_args = "" - changed due to a bug in the core airflow config; the json is not parsed properly.
It is recommended to control the airflow configuration using environment variables, like so,
export AIRFLOW__[section]__[property]=[value]
For more info on setting airflow environment variables see here.
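For example, in a docker compose file the same pattern might look like this (a minimal sketch; the config keys and values shown are only illustrations):

```yaml
# Sketch: setting airflow config through environment variables in a
# docker compose service (the keys and values are illustrative only).
services:
  webserver:
    image: lamaani/zairflow:latest
    environment:
      AIRFLOW__CORE__PARALLELISM: "16"
      AIRFLOW__WEBSERVER__WORKERS: "2"
```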
name | description | type/values | default |
---|---|---|---|
ZAIRFLOW_RUN_INIT_ENVIRONMENT | Initialize the zairflow environment (should be called once) | boolean | False |
ZAIRFLOW_DB_HOST | The host for the airflow database; this value is required in order to validate the db | string | localhost |
ZAIRFLOW_DB_PORT | The port for the airflow database | 1-65535 | 5432 |
ZAIRFLOW_SKIP_DB_CHECK | If true then skip the db check. | | |
ZAIRFLOW_CONTAINER_TYPE | The type of the container to execute | scheduler, worker, webserver, flower, init_environment, command | None/Empty - will cause an error |
... ZAIRFLOW_CONTAINER_TYPE | Run `airflow [type]`, after preparing the env | scheduler, worker, webserver, flower, init_environment | |
... ZAIRFLOW_CONTAINER_TYPE | Run `"$@"`, after preparing the env | command | |
GIT_AUTOSYNC_REPO_URL | A uri to the git repo to sync. If set, the git sync process will start. If a git repo already exists on the image at the location of the dags folder, use "internal" (remember to set the correct airflow dag folder path). See example and notes below on autosync. | string | None |
GIT_AUTOSYNC_REPO_BRANCH | The autosync branch name; if it does not exist, the default branch is used. See example and notes below on autosync. | string | None |
ZAIRFLOW_WEBSERVER_CONFIG_PATH | The path to the flask_appbuilder webserver_config.py, which allows for security configuration. Will be auto linked and will override the airflow home webserver_config | string | None |
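A minimal docker compose sketch that wires these variables together (the service names, port mapping and credentials below are assumptions, not defaults of the image):

```yaml
# Sketch: a standalone webserver container pointed at a postgres database.
# Service names and credentials are illustrative assumptions.
services:
  airflow-db:
    image: postgres:12.2
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
  webserver:
    image: lamaani/zairflow:latest
    depends_on:
      - airflow-db
    environment:
      ZAIRFLOW_CONTAINER_TYPE: webserver
      ZAIRFLOW_DB_HOST: airflow-db
      ZAIRFLOW_DB_PORT: "5432"
    ports:
      - "8080:8080"
```

Depending on your setup you will likely also need to point airflow's own database connection (for example AIRFLOW__CORE__SQL_ALCHEMY_CONN) at the same host.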
name | description | type/values | default |
---|---|---|---|
ZAIRFLOW_WAIT_FOR | A list of uris, including port (example: localhost:8888), to wait for until open on TCP | string | None |
ZAIRFLOW_ENTRYPOINT_INIT_HOOK | A bash script/command to run before the airflow environment (init_environment + command) starts | string | None |
ZAIRFLOW_ENTRYPOINT_RUN_HOOK | A bash script/command to run before airflow runs (after init_environment) | string | None |
ZAIRFLOW_ENTRYPOINT_DESTROY_HOOK | A bash script/command to run after the airflow environment exits | string | None |
ZAIRFLOW_POST_LOAD_USER_CODE | While calling init_environment, the INIT HOOK and the RUN HOOK, points airflow to load dags and plugins from an empty folder. Allows for initialization without plugin/dag errors and proper initialization of airflow variables. | boolean | False |
ZAIRFLOW_AUTO_DETECT_CLUSTER | Auto detect the cluster config when running in a kubernetes cluster | boolean | true |
ZARIFLOW_DB_WAIT_TRIES | The number of attempts when waiting for db tables to be ready | int | 60 |
ZARIFLOW_DB_WAIT_INTERVAL | The number of seconds to wait between each db tables test | int | 1 |
ZARIFLOW_CONNECTION_WAIT_TRIES | The number of attempts when waiting for a connection | int | 60 |
ZARIFLOW_CONNECTION_WAIT_TIMEOUT | The connection wait timeout | int | 1 |
ZARIFLOW_CONNECTION_WAIT_INTERVAL | The number of seconds to wait between connection attempts | int | 1 |
ZAIRFLOW_INIT_ENV_YAML | An env enabled yaml configuration for variables, connections and pools to be loaded | string | None |
ZAIRFLOW_INIT_ENV_YAML_FILEPATH | An env enabled yaml configuration filepath for variables, connections and pools to be loaded | string | None |
GIT_AUTOSYNC_REPO_LOCAL_PATH | Overrides the /app directory. The path where the git repo will sync to (remember to set the correct airflow dags/plugins folder path). See notes below on autosync. | string | None |
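For example, the wait and hook variables can be combined so the container blocks until the database port is open and runs a custom script before airflow starts (a sketch, as it would appear under a docker compose service; the hostname and script path are illustrative assumptions):

```yaml
# Sketch: an environment block combining TCP wait and entrypoint hooks.
# The hostname and script path are illustrative assumptions.
environment:
  ZAIRFLOW_WAIT_FOR: "airflow-db:5432"
  ZAIRFLOW_ENTRYPOINT_INIT_HOOK: "/scripts/my_init_hook.sh"
  ZAIRFLOW_ENTRYPOINT_RUN_HOOK: "echo 'starting airflow'"
  ZAIRFLOW_POST_LOAD_USER_CODE: "true"
```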
Write log data to the database instead of files (see the AirflowDBLogger package) by applying,
[CORE]
logging_config_class = airflow_db_logger.LOGGING_CONFIG
This package is highly recommended for multi-pod deployments, and is enabled by default.
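The same setting can be applied through an environment variable, following the pattern shown earlier (a minimal sketch):

```yaml
# Sketch: the env-var form of the [CORE] logging_config_class setting above.
environment:
  AIRFLOW__CORE__LOGGING_CONFIG_CLASS: airflow_db_logger.LOGGING_CONFIG
```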
The auto-sync feature runs a background script inside the airflow pod which periodically checks the git repo and pulls any change that is detected. See the script's github repo and details here.
Auto-sync is recommended for development mode.
First, tell zairflow where the repo is by setting the environment variables:
GIT_AUTOSYNC_REPO_URL: [my-repo-uri]
GIT_AUTOSYNC_REPO_BRANCH: [my-repo-branch] # Optional, default = default branch.
Then, if in your repo the paths to the airflow dags and plugins are:
[repo root]/deployment/airflow/dags
[repo root]/deployment/airflow/plugins
you need to set the following airflow environment variables (or the equivalent entries in the airflow config file):
AIRFLOW__CORE__DAGS_FOLDER: /app/deployment/airflow/dags
AIRFLOW__CORE__PLUGINS_FOLDER: /app/deployment/airflow/plugins
NOTE: If your image already contains dags/plugins, you must copy them into the appropriate dags and plugins paths.
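Putting the autosync settings together, an env block (as it would appear under a docker compose service) might look like this; the repo uri and branch are placeholders:

```yaml
# Sketch: autosync environment for a repo that keeps dags/plugins under
# deployment/airflow. The repo uri and branch are placeholders.
environment:
  GIT_AUTOSYNC_REPO_URL: "git@github.com:my-org/my-airflow-dags.git"
  GIT_AUTOSYNC_REPO_BRANCH: "master"
  AIRFLOW__CORE__DAGS_FOLDER: /app/deployment/airflow/dags
  AIRFLOW__CORE__PLUGINS_FOLDER: /app/deployment/airflow/plugins
```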
To configure the defaults, you can either use a yaml file or send the yaml directly to the image, via,
ZAIRFLOW_INIT_ENV_YAML_FILEPATH='/my/file/path'
ZAIRFLOW_INIT_ENV_YAML='raw yaml'
The yamls are env enabled via the {{ENV_NAME}} python format. Example,
pools:
  pool1: 30
  pool2:
    description: 'nna'
    slots: 122
variables:
  a_string_from_env: '{{VERSION}}'
  pased_to_json_with_env:
    this: "is my value"
    version: '{{VERSION}}'
connections:
  testconn:
    conn_type: test
    host: ttt.kkk.mmm
    port: 4242
    extra:
      this: val
      is: extra
      json: value
      version: '{{VERSION}}'
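To load such a yaml at startup, point the container at it (a sketch; the /configs mount path is an assumption):

```yaml
# Sketch: loading the init yaml from a mounted file.
# The /configs mount path is an illustrative assumption.
environment:
  ZAIRFLOW_INIT_ENV_YAML_FILEPATH: /configs/init_env.yaml
```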
A template based deployment chart using helm. To learn more about helm please see helm and helmfile. This introduction is also a good read.
In order to simplify the chart, only the following executors are implemented,
- LocalExecutor
- KubernetesExecutor
- SequentialExecutor (Debug)
Note: The celery executor was not implemented due to instabilities in task execution during testing. Currently, it is under consideration, but may not be implemented in future releases.
See helmfile example
The definition [a].[b]=value should be translated in the yaml values file as,
a:
  b: value
name | description | type/values | default |
---|---|---|---|
nameOverride | Override the name of the chart | string | None |
fullnameOverride | Override the name of the chart and the suffixes | string | None |
envs | Global env collection, added to the config map | yaml | None |
overrideEnvs | Global env collection, added to the config map, that will override any internal env values that were produced by the chart | yaml | None |
image.pullPolicy | The pull policy | IfNotPresent, Never, Always | IfNotPresent |
image.repository | The image repo | string | lamaani/zairflow |
image.tag | The image tag | string | latest |
executor.type | The executor to be used by airflow | SequentialExecutor, LocalExecutor, KubernetesExecutor | LocalExecutor |
executor.workerImagePullPolicy | The pull policy for the worker image | IfNotPresent, Never, Always | image.pullPolicy |
executor.workerImageRepository | The worker image repo | string | image.repository |
executor.workerImageTag | The worker image tag | string | image.tag |
init_environment.enabled | Enable the init_environment job | boolean | true |
webserver.port | The webserver port to use | int | 8080 |
webserver.terminationGracePeriodSeconds | The number of seconds before forced pod termination | int | 10 |
webserver.replicas | The number of webserver replicas | int | 1 |
webserver.envs | Environment variables to add to the webserver pods | yaml | None |
webserver.resources | Pod resources | yaml | None |
scheduler.terminationGracePeriodSeconds | The number of seconds before forced pod termination | int | 10 |
scheduler.replicas | The number of scheduler replicas | int | 1 |
scheduler.envs | Environment variables to add to the scheduler pods | yaml | None |
scheduler.resources | Pod resources | yaml | None |
postgres.enabled | If true, create a postgres database | boolean | true |
postgres.image | The postgres image, with tag | string | postgres:12.2 |
postgres.port | The database port to use | int | 5432 |
postgres.terminationGracePeriodSeconds | The number of seconds before forced pod termination | int | 10 |
postgres.envs | Environment variables to add to the postgres pods | yaml | None |
postgres.resources | Pod resources | yaml | None |
postgres.maxConnections | The maximal number of database connections | int | 10000 |
postgres.persist | If true, persist the database data through db pod restarts | bool | true |
postgres.pvc | Add a kubernetes PVC to the database, allowing it to persist through db pod restarts | yaml | see here |
postgres.db | The default db | string | airflow |
postgres.credentials.user | The db username | string | airflow |
postgres.credentials.password | The db password | string | airflow |
serviceAccount.enabled | If true, creates a service account | boolean | false |
serviceAccount.name | The name of the service account to use | string | chart full name |
serviceAccount.annotations | More service account info | yaml | None |
serviceAccount.role | The name of the role to use in the role binding; role not created if None | string | None |
serviceAccount.roleKind | The kind of the role to bind | string | Role |
serviceAccount.roleBindingKind | The kind of the role binding. Must use ClusterRole in serviceAccount.roleKind for ClusterRoleBinding | string | RoleBinding |
serviceAccount.allowKubernetesAccess | If true, generates the kubernetes access role binding | boolean | true |
serviceAccount.allowKubernetesAccessRules | The rules for the zairflow worker kubernetes access | yaml | |
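A minimal values.yaml sketch that combines a few of these values (every value shown is an illustrative choice, not a chart requirement, and the repo uri is a placeholder):

```yaml
# Sketch of a values.yaml for the zairflow chart; all values are illustrative.
image:
  repository: lamaani/zairflow
  tag: latest
executor:
  type: KubernetesExecutor
webserver:
  replicas: 1
  port: 8080
postgres:
  enabled: true
  credentials:
    user: airflow
    password: airflow
envs:
  GIT_AUTOSYNC_REPO_URL: "git@github.com:my-org/my-airflow-dags.git"
  AIRFLOW__CORE__DAGS_FOLDER: /app/deployment/airflow/dags
serviceAccount:
  enabled: true
  allowKubernetesAccess: true
```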
Yaml injection, use with care,
name | description | type/values | applies to types |
---|---|---|---|
[type].injectContainerYaml | yaml inject | yaml | webserver, scheduler, postgres, init_environment |
[type].injectTemplateSpecYaml | yaml inject | yaml | webserver, scheduler, postgres, init_environment |
[type].injectSpecYaml | yaml inject | yaml | webserver, scheduler, postgres, init_environment |
[type].injectYamlMetadata | yaml inject | yaml | serviceAccount |
[type].injectYaml | yaml inject | yaml | serviceAccount |
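For example, injecting extra pod template spec yaml into the webserver might look like this (a sketch; the nodeSelector content is only an illustration, and you should verify how the chart merges injected yaml before relying on it):

```yaml
# Sketch: injecting pod template spec yaml into the webserver pods.
# The nodeSelector content is illustrative; verify the chart's merge behavior.
webserver:
  injectTemplateSpecYaml:
    nodeSelector:
      disktype: ssd
```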
If you are creating a derived image and you are installing airflow using pip, or in some other way overriding /usr/local/bin/airflow or the airflow cli with a new airflow install, then for a KubernetesExecutor deployment you must override the airflow cli command, as root, with,
ln -sf /scripts/image/invoke_airflow /usr/local/bin/airflow
so that the remote environment sync will work.
Copyright ©
Zav Shotan
and other contributors.
It is free software, released under the MIT licence, and may be redistributed under the terms specified in LICENSE.