cov() and corr() - finalization #3550

Closed. Wants to merge 42 commits.

Changes shown from 36 of 42 commits.
ec58dfa
added da.corr() and da.cov() to dataarray.py. Test added in test_data…
hrishikeshac Jan 4, 2019
ec1dd72
Made the code PEP8 compatible
hrishikeshac Jan 4, 2019
184e4ab
further merges and minor changes
Nov 20, 2019
ef1e25c
minor pep8 changes
Nov 20, 2019
fe3fe06
formatted using black
Nov 20, 2019
d57eb7c
added decorator '@requires_scipy_or_netCDF4'
Nov 20, 2019
13a3925
minor pep8 changes
Nov 20, 2019
e9a8869
added `@requires_scipy_or_netCDF4` to test_corr()
Nov 20, 2019
617fd43
added TODO tag to test_corr()
r-beer Nov 20, 2019
29b3e43
Update xarray/core/dataarray.py
r-beer Nov 20, 2019
27980c8
Update xarray/core/dataarray.py
r-beer Nov 20, 2019
5e4f002
Update xarray/core/dataarray.py
r-beer Nov 20, 2019
28b1229
Update xarray/core/dataarray.py
r-beer Nov 20, 2019
98a5f82
Update xarray/core/dataarray.py
r-beer Nov 20, 2019
ea8f6eb
Update xarray/core/dataarray.py
r-beer Nov 20, 2019
a4d0446
minor docstring correction
r-beer Nov 20, 2019
5a4b97c
modified test_corr():
r-beer Nov 20, 2019
1a6d939
added examples to corr and cov docstrings
r-beer Nov 20, 2019
93dfeeb
Merge branch 'corr' of https://github.com/r-beer/xarray into corr
r-beer Nov 20, 2019
9a04ae3
minor pep8 change
r-beer Nov 20, 2019
363e238
added documentation for da.corr and da.cov
r-beer Nov 20, 2019
48e4ad4
minor changes
r-beer Nov 20, 2019
ac46a43
moved cov and corr to computation
r-beer Nov 22, 2019
bc33c99
black and flake8 formatting
r-beer Nov 22, 2019
60831ff
fixed bug in da.cov and da.corr
r-beer Nov 22, 2019
3945ba1
add assert statement to np.allclose
r-beer Nov 23, 2019
bfe5b2b
minor fixes
r-beer Nov 23, 2019
24e9484
added test_cov()
r-beer Nov 23, 2019
8f2a118
Merge branch 'corr' of https://github.com/r-beer/xarray into corr
r-beer Nov 23, 2019
e09a607
removed da.corr and da.cov
r-beer Nov 23, 2019
2055962
manually deleted trailing whitespaces
r-beer Nov 23, 2019
a1063b5
added whitespace after comma manually
r-beer Nov 23, 2019
b694b5f
black reformatted
r-beer Nov 23, 2019
b54487e
minor improvements & documentation
r-beer Nov 23, 2019
5e7b32d
implemented cov calculation with N-1 normalization
r-beer Nov 23, 2019
bc0b100
black formatting
r-beer Nov 23, 2019
4ac0320
added Pearson to xr.corr() docstring
r-beer Nov 23, 2019
3a7120c
refactored test_cov and test_corr to test_func
r-beer Nov 23, 2019
ac00070
added ddof to xr.cov()
r-beer Nov 24, 2019
52ae54a
added dim parametrization to test_corr & test_cov
r-beer Nov 24, 2019
4f09263
black and flake8
r-beer Nov 24, 2019
4ab4af4
changed to assert_allclose and xr.DataArray
r-beer Nov 25, 2019
doc/api.rst: 2 changes (2 additions, 0 deletions)
@@ -30,6 +30,8 @@ Top-level functions
zeros_like
ones_like
dot
cov
corr
map_blocks

Dataset
doc/whats-new.rst: 2 changes (2 additions, 0 deletions)
@@ -25,6 +25,8 @@ Breaking changes

New Features
~~~~~~~~~~~~
- Added :py:func:`xarray.cov` and :py:func:`xarray.corr` (:pull:`3550`).
By `Robin Beer <https://github.com/r-beer>`_.


Bug fixes
xarray/__init__.py: 2 changes (1 addition, 1 deletion)
@@ -10,7 +10,7 @@
from .core.common import full_like, zeros_like, ones_like
from .core.concat import concat
from .core.combine import combine_by_coords, combine_nested, auto_combine
-from .core.computation import apply_ufunc, dot, where
+from .core.computation import apply_ufunc, dot, where, cov, corr
from .core.extensions import register_dataarray_accessor, register_dataset_accessor
from .core.variable import as_variable, Variable, IndexVariable, Coordinate
from .core.dataset import Dataset
xarray/core/computation.py: 171 changes (169 additions, 2 deletions)
@@ -22,9 +22,8 @@
)

import numpy as np

from . import duck_array_ops, utils
-from .alignment import deep_align
+from .alignment import broadcast, deep_align
from .merge import merge_coordinates_without_align
from .pycompat import dask_array_type
from .utils import is_dict_like
@@ -1047,6 +1046,174 @@ def earth_mover_distance(first_samples,
return apply_array_ufunc(func, *args, dask=dask)


def cov(da_a, da_b, dim=None):
"""Compute covariance between two DataArray objects along a shared dimension.

Parameters
----------
da_a: DataArray (or Variable) object
Array over which to compute the covariance.
da_b: DataArray (or Variable) object
Array over which to compute the covariance.
dim : str, optional
The dimension along which the covariance will be computed

Returns
-------
covariance: DataArray

See also
--------
pandas.Series.cov: corresponding pandas function
xr.corr: corresponding function to compute correlation

Examples
--------

>>> da_a = DataArray(np.random.random((3, 5)),
... dims=("space", "time"),
... coords=[('space', ['IA', 'IL', 'IN']),
... ('time', pd.date_range("2000-01-01", freq="1D", periods=5))])
>>> da_a
<xarray.DataArray (space: 3, time: 5)>
array([[0.04356841, 0.11479286, 0.70359101, 0.59072561, 0.16601438],
[0.81552383, 0.72304926, 0.77644406, 0.05788198, 0.74065536],
[0.96252519, 0.36877741, 0.22248412, 0.55185954, 0.23547536]])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-05

>>> da_b = DataArray(np.random.random((3, 5)),
... dims=("space", "time"),
... coords=[('space', ['IA', 'IL', 'IN']),
... ('time', pd.date_range("2000-01-01", freq="1D", periods=5))])
>>> da_b
<xarray.DataArray (space: 3, time: 5)>
array([[0.41505599, 0.43002193, 0.45250454, 0.57701084, 0.5327754 ],
[0.0998048 , 0.67225522, 0.4234324 , 0.13514615, 0.4399088 ],
[0.24675048, 0.58555283, 0.1942955 , 0.86128908, 0.05068975]])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-05
>>> xr.cov(da_a, da_b)
<xarray.DataArray ()>
array(0.03823054)
>>> xr.cov(da_a, da_b, dim='time')
<xarray.DataArray (space: 3)>
array([0.00207952, 0.01024296, 0.08214707])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
"""

# 1. Broadcast the two arrays
da_a, da_b = broadcast(da_a, da_b)

# 2. Ignore the nans
valid_values = da_a.notnull() & da_b.notnull()
da_a = da_a.where(
valid_values, drop=True
) # TODO: avoid drop as explained in https://github.com/pydata/xarray/pull/2652#discussion_r245492002
Collaborator comment:
👍

Author comment:
This is fine as preparation for a future PR, right? If there is an easy way to drop the `drop`, I can implement it, of course!

Member comment:
I don't think drop=True is needed here at all, and I think it actually introduces a bug. In particular, consider the following case:

>>> array = xarray.DataArray([[1, 2], [np.nan, np.nan]], dims=['x', 'y'])
>>> xarray.corr(array, array, dim='y')
<xarray.DataArray (x: 1)>
array([2.])
Dimensions without coordinates: x

This doesn't look right to me at all. It has the wrong shape (size 1 instead of size 2), and somehow the correlation is greater than 1.

da_b = da_b.where(valid_values, drop=True)
valid_count = (
valid_values.sum(dim) - 1
) # as in pandas.Series.cov, default to unbiased "N - 1" normalization
# TODO: add parameter "bias" to decide whether or not N-1 normalization should be used

# 3. Compute mean and standard deviation along the given dim
demeaned_da_a = da_a - da_a.mean(dim=dim)
demeaned_da_b = da_b - da_b.mean(dim=dim)
Contributor comment:
Why not demean in place, and avoid additional copies of the full arrays? Aren't da_a and da_b already copies?


# 4. Compute covariance along the given dim
cov = (demeaned_da_a * demeaned_da_b).sum(dim=dim) / (valid_count)

return cov
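
As an aside on the TODO above and the member comment about drop=True: below is a minimal sketch, not part of this PR's diff, of how the masking could work without drop=True, keeping the original shape and relying on xarray's NaN-skipping reductions. The toy arrays are made up for illustration.

import numpy as np
import xarray as xr

da_a = xr.DataArray([[1.0, 2.0, 3.0, 4.0], [np.nan, 5.0, 6.0, 7.0]],
                    dims=["x", "time"])
da_b = xr.DataArray([[2.0, 4.0, 6.0, 8.0], [1.0, 2.0, np.nan, 4.0]],
                    dims=["x", "time"])

valid_values = da_a.notnull() & da_b.notnull()
# Mask invalid entries with NaN but keep the original shape (no drop=True);
# mean() and sum() skip NaN by default, so the result keeps its full size.
da_a_masked = da_a.where(valid_values)
da_b_masked = da_b.where(valid_values)
valid_count = valid_values.sum("time") - 1

demeaned_a = da_a_masked - da_a_masked.mean("time")
demeaned_b = da_b_masked - da_b_masked.mean("time")
cov = (demeaned_a * demeaned_b).sum("time") / valid_count
# Each entry matches pandas.Series.cov on that row's pairwise-complete values.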


def corr(da_a, da_b, dim=None):
"""Compute correlation between two DataArray objects along a shared dimension.

Parameters
----------
da_a: DataArray (or Variable) object
Array over which to compute the correlation.
da_b: DataArray (or Variable) object
Array over which to compute the correlation.
dim: str, optional
The dimension along which the correlation will be computed

Returns
-------
correlation: DataArray

See also
--------
pandas.Series.corr: corresponding pandas function
xr.cov: underlying covariance function

Examples
--------

>>> da_a = DataArray(np.random.random((3, 5)),
... dims=("space", "time"),
... coords=[('space', ['IA', 'IL', 'IN']),
... ('time', pd.date_range("2000-01-01", freq="1D", periods=5))])
>>> da_a
<xarray.DataArray (space: 3, time: 5)>
array([[0.04356841, 0.11479286, 0.70359101, 0.59072561, 0.16601438],
[0.81552383, 0.72304926, 0.77644406, 0.05788198, 0.74065536],
[0.96252519, 0.36877741, 0.22248412, 0.55185954, 0.23547536]])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-05

>>> da_b = DataArray(np.random.random((3, 5)),
... dims=("space", "time"),
... coords=[('space', ['IA', 'IL', 'IN']),
... ('time', pd.date_range("2000-01-01", freq="1D", periods=5))])
>>> da_b
<xarray.DataArray (space: 3, time: 5)>
array([[0.41505599, 0.43002193, 0.45250454, 0.57701084, 0.5327754 ],
[0.0998048 , 0.67225522, 0.4234324 , 0.13514615, 0.4399088 ],
[0.24675048, 0.58555283, 0.1942955 , 0.86128908, 0.05068975]])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-05
>>> xr.corr(da_a, da_b)
<xarray.DataArray ()>
array(0.67407116)
>>> xr.corr(da_a, da_b, dim='time')
<xarray.DataArray (space: 3)>
array([0.23150267, 0.24900968, 0.9061562 ])
Coordinates:
* space (space) <U2 'IA' 'IL' 'IN'
"""
from .dataarray import DataArray

if any(not isinstance(arr, (Variable, DataArray)) for arr in [da_a, da_b]):
raise TypeError(
"Only xr.DataArray and xr.Variable are supported."
"Given {}.".format([type(arr) for arr in [da_a, da_b]])
)

# 1. Broadcast the two arrays
da_a, da_b = broadcast(da_a, da_b)

# 2. Ignore the nans
valid_values = da_a.notnull() & da_b.notnull()
da_a = da_a.where(valid_values, drop=True)
da_b = da_b.where(
valid_values, drop=True
) # TODO: avoid drop as explained in https://github.com/pydata/xarray/pull/2652#discussion_r245492002

# 3. Compute correlation based on standard deviations and cov()
da_a_std = da_a.std(dim=dim)
da_b_std = da_b.std(dim=dim)

corr = cov(da_a, da_b, dim=dim) / (da_a_std * da_b_std)

return corr
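
A normalization note, independent of this PR: the Pearson formula cov / (std_a * std_b) only reproduces the textbook correlation coefficient when cov and std use the same ddof. The cov() above divides by N - 1, while DataArray.std() defaults to ddof=0, so the two normalizations need to be kept consistent. A quick NumPy sanity check of the matched-ddof identity:

import numpy as np

rng = np.random.default_rng(42)
a = rng.random(20)
b = 0.5 * a + rng.random(20)

# With matching ddof the normalization factors cancel, and covariance divided
# by the product of standard deviations equals the Pearson correlation.
r = np.cov(a, b, ddof=1)[0, 1] / (a.std(ddof=1) * b.std(ddof=1))
np.testing.assert_allclose(r, np.corrcoef(a, b)[0, 1])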


def dot(*arrays, dims=None, **kwargs):
"""Generalized dot product for xarray objects. Like np.einsum, but
provides a simpler interface based on array dimensions.
xarray/core/dataarray.py: 1 change (0 additions, 1 deletion)
@@ -2834,7 +2834,6 @@ def dot(
>>> da = DataArray(da_vals, dims=['x', 'y', 'z'])
>>> dm_vals = np.arange(4)
>>> dm = DataArray(dm_vals, dims=['z'])

>>> dm.dims
('z')
>>> da.dims
xarray/tests/test_backends.py: 1 change (1 addition, 0 deletions)
@@ -2439,6 +2439,7 @@ def skip_if_not_engine(engine):
pytest.importorskip(engine)


@requires_scipy_or_netCDF4
@requires_dask
@pytest.mark.filterwarnings("ignore:use make_scale(name) instead")
def test_open_mfdataset_manyfiles(
xarray/tests/test_computation.py: 100 changes (100 additions, 0 deletions)
@@ -23,6 +23,28 @@
from . import has_dask, raises_regex, requires_dask


@pytest.fixture(params=[1])
def da(request):
if request.param == 1:
times = pd.date_range("2000-01-01", freq="1D", periods=21)
values = np.random.random((3, 21, 4))
da = xr.DataArray(values, dims=("a", "time", "x"))
da["time"] = times
return da

if request.param == 2:
return xr.DataArray(
[0, np.nan, 1, 2, np.nan, 3, 4, 5, np.nan, 6, 7], dims="time"
)

if request.param == "repeating_ints":
return xr.DataArray(
np.tile(np.arange(12), 5).reshape(5, 4, 3),
coords={"x": list("abc"), "y": list("defg")},
dims=list("zyx"),
)
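
Since the fixture defaults to params=[1], the NaN-series and repeating-ints variants only run when a test opts in. A hypothetical example, not part of this diff, of requesting other variants through pytest's indirect parametrization:

@pytest.mark.parametrize("da", [1, 2], indirect=True)
def test_da_variants(da):
    # Runs once with the 3-D random array and once with the 1-D NaN series;
    # both fixture variants carry a "time" dimension.
    assert "time" in da.dims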


def assert_identical(a, b):
if hasattr(a, "identical"):
msg = f"not identical:\n{a!r}\n{b!r}"
@@ -789,6 +811,84 @@ def func(x):
assert_identical(expected, actual)


def test_corr(da):

# other: select misaligned data and smooth it to dampen the correlation with self.
da_smooth = da.isel(time=range(2, 20)).rolling(time=3, center=True).mean(dim="time")

da = da.isel(time=range(0, 18))

def select_pts(array):
return array.sel(a=1, x=2)

# Test #1: Misaligned 1-D dataarrays with missing values
ts1 = select_pts(da.copy())
ts2 = select_pts(da_smooth.copy())

def pd_corr(ts1, ts2):
"""Ensure the ts are aligned and missing values ignored"""
ts1, ts2 = xr.align(ts1, ts2)
valid_values = ts1.notnull() & ts2.notnull()

ts1 = ts1.where(valid_values, drop=True)
ts2 = ts2.where(valid_values, drop=True)
Collaborator comment:
I would have thought that we should be testing that we get a similar result from pandas' cov and xarray's cov without much preparation. Can you help me understand why we need to drop values here? What's the test case that fails when we don't have these lines?


return ts1.to_series().corr(ts2.to_series())

expected = pd_corr(ts1, ts2)
actual = xr.corr(ts1, ts2)
assert np.allclose(expected, actual)

# Test #2: Misaligned N-D dataarrays with missing values
actual_ND = xr.corr(da, da_smooth, dim="time")
actual = select_pts(actual_ND)
assert np.allclose(expected, actual)

# Test #3: One 1-D dataarray and another N-D dataarray; misaligned and having missing values
actual_ND = xr.corr(da_smooth, ts1, dim="time")
actual = select_pts(actual_ND)
assert np.allclose(expected, actual)


def test_cov(da):

# other: select misaligned data and smooth it to dampen the correlation with self.
da_smooth = da.isel(time=range(2, 20)).rolling(time=3, center=True).mean(dim="time")

da = da.isel(time=range(0, 18))

def select_pts(array):
return array.sel(a=1, x=2)

# Test #1: Misaligned 1-D dataarrays with missing values
ts1 = select_pts(da.copy())
ts2 = select_pts(da_smooth.copy())

def pd_cov(ts1, ts2):
"""Ensure the ts are aligned and missing values ignored"""
ts1, ts2 = xr.align(ts1, ts2)
valid_values = ts1.notnull() & ts2.notnull()

ts1 = ts1.where(valid_values, drop=True)
ts2 = ts2.where(valid_values, drop=True)

return ts1.to_series().cov(ts2.to_series())

expected = pd_cov(ts1, ts2)
actual = xr.cov(ts1, ts2)
assert np.allclose(expected, actual)

# Test #2: Misaligned N-D dataarrays with missing values
actual_ND = xr.cov(da, da_smooth, dim="time")
actual = select_pts(actual_ND)
assert np.allclose(expected, actual)

# Test #3: One 1-D dataarray and another N-D dataarray; misaligned and having missing values
actual_ND = xr.cov(ts1, da_smooth, dim="time")
actual = select_pts(actual_ND)
assert np.allclose(expected, actual)
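
test_corr and test_cov above are near-duplicates, and commit 3a7120c ("refactored test_cov and test_corr to test_func") indicates they were later merged; that commit falls outside this 36-commit view, so the following is only a sketch of what such a parametrization could look like, not the PR's actual refactor:

@pytest.mark.parametrize("func", ["cov", "corr"])
def test_func(da, func):
    # Smooth a misaligned copy to dampen the correlation with the original.
    da_smooth = da.isel(time=range(2, 20)).rolling(time=3, center=True).mean()
    da = da.isel(time=range(0, 18))

    ts1 = da.isel(a=1, x=2)
    ts2 = da_smooth.isel(a=1, x=2)

    # Align, mask shared NaNs, and compare against the pandas equivalent.
    s1, s2 = xr.align(ts1, ts2)
    valid = s1.notnull() & s2.notnull()
    s1 = s1.where(valid, drop=True).to_series()
    s2 = s2.where(valid, drop=True).to_series()

    expected = getattr(s1, func)(s2)      # pandas.Series.cov / .corr
    actual = getattr(xr, func)(ts1, ts2)  # xr.cov / xr.corr
    assert np.allclose(expected, actual)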


def pandas_median(x):
return pd.Series(x).median()
