
cov() and corr() - finalization #3550

Closed
wants to merge 42 commits into from

Conversation

r-beer

@r-beer r-beer commented Nov 20, 2019

Dear maintainers,

this is my first PR and contribution to xarray. I have read the contributing guidelines and did my best to conform to your code of conduct. However, I would be happy if you could take a closer look at whether everything is correct and give feedback for improvement!

As I don't have push access to @hrishikeshac's fork, I created this PR to add the final changes for PR #2652.

  • Closes cov() and corr() #2652
  • Tests had been added in cov() and corr() #2652
  • Passes black . && mypy . && flake8
    • black
    • flake8
    • mypy
  • Fully documented, including whats-new.rst for all changes and api.rst for new API
    • whats-new.rst
    • api.rst
  • example notebook for usage of corr() and cov()
    • where and how to write the examples?
    • covariance
    • correlation
  • moved implementation of cov() and corr() to computation
  • add test_cov()
  • remove da.cov and da.corr

@pep8speaks

pep8speaks commented Nov 20, 2019

Hello @r-beer! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 817:1: E302 expected 2 blank lines, found 1

Comment last updated at 2019-11-25 20:47:29 UTC
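For reference, E302 is the pycodestyle check that top-level definitions are preceded by two blank lines. A minimal passing layout (function names are hypothetical, just for illustration):

```python
def first_function():
    return 1


def second_function():  # two blank lines before each top-level def satisfy E302
    return 2
```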

@r-beer
Author

r-beer commented Nov 20, 2019

@max-sixty, all three checks fail because test_open_mfdataset_manyfiles requires either netcdf4 or scipy.

Shall I, therefore, add the @requires_scipy_or_netCDF4 decorator?

@keewis
Collaborator

keewis commented Nov 20, 2019

There seems to be a problem with your email address (github can't match it to your account). Does it match the one you used for signing up on github?

@r-beer
Author

r-beer commented Nov 20, 2019

There seems to be a problem with your email address (github can't match it to your account). Does it match the one you used for signing up on github?

Dear @keewis, I just set my real email address also as public email address and set my local git user name and email address accordingly. Does this solve the problem?

@keewis
Collaborator

keewis commented Nov 20, 2019

for the new commit, yes. The older commits could be rewritten if you don't like the other email address being public (I can guide you through the process if you want) but since PRs get squashed during the merge you don't have to.

@r-beer
Author

r-beer commented Nov 20, 2019

for the new commit, yes. The older commits could be rewritten if you don't like the other email address being public (I guide you through the process if you want) but since PRs get squashed after the merge you don't have to.

OK, no need for the extra effort at this point, the squashing at the end of PR is fine for me. Thank you for the offer, anyway! 🙂

Are any other changes necessary for the PR?

ideas:

  • examples for corr and cov usage
  • whats-new.rst

@keewis
Collaborator

keewis commented Nov 20, 2019

yes, whats-new.rst and api.rst will have to eventually be updated before the merge. Also, examples would be really good.

@keewis
Collaborator

keewis commented Nov 20, 2019

since the APIs of DataArray and Dataset are usually kept in sync as much as possible, it might be good to add the new methods to Dataset, too. This doesn't have to be in this PR, though, and I'd wait for confirmation first.

@r-beer
Author

r-beer commented Nov 20, 2019

since the APIs of DataArray and Dataset are usually kept in sync as much as possible, it might be good to add the new methods to Dataset, too. This doesn't have to be in this PR, though, and I'd wait for confirmation first.

Yes, the functionality I would be most interested in is calculating corr() between different variables of a dataset, as in pandas.DataFrame.corr, as well as correlations between different locations within the same DataArray (e.g. correlations of time series at different latitudes).
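For comparison, pandas.DataFrame.corr computes the pairwise correlation matrix between all columns, which is the Dataset-level behavior being described (illustrative data only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [1.5, 3.1, 2.9, 5.0]})

corr_matrix = df.corr()  # pairwise Pearson correlation between all columns

# the result is symmetric with a unit diagonal
assert corr_matrix.shape == (2, 2)
assert np.isclose(corr_matrix.loc["a", "a"], 1.0)
assert np.isclose(corr_matrix.loc["a", "b"], corr_matrix.loc["b", "a"])
```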

@@ -4079,6 +4080,49 @@ def test_rank(self):
y = DataArray([0.75, 0.25, np.nan, 0.5, 1.0], dims=("z",))
assert_equal(y.rank("z", pct=True), y)

# TODO: use fixture or randomly created array instead of load_dataset
Collaborator


👍

@max-sixty
Collaborator

Great start!

Thanks for the review @keewis

I think it's fine to leave Dataset for another PR; it'd be great to get this in, and given it's your first PR (welcome!) let's not bloat it

Test looks good. Let's make those adjustments you put in the TODO and add one for the other function? We could also add some variants which test along different dimensions / all NaNs / other special cases.

@r-beer
Author

r-beer commented Nov 20, 2019

Great start!

Thanks for the review @keewis

I think it's fine to leave Dataset for another PR; it'd be great to get this in, and given it's your first PR (welcome!) let's not bloat it

Test looks good. Let's make those adjustments you put in the TODO and add one for the other function? We could also add some variants which test along different dimensions / all NaNs / other special cases.

Thank you two, @max-sixty and @keewis, for the constructive feedback that made the start easier!
I would also be happy to go through the whole process once and then implement the PR for the Dataset separately. This will probably shorten the learning curve and accelerate the process.

Additionally, I wonder whether I have to configure black for the spaces between colons?

I had autoformatting running with autopep8 previously, but something broke in one of the last VS Code updates, so I will set this up properly again on Friday or tomorrow.

r-beer and others added 5 commits November 20, 2019 16:22
Co-Authored-By: keewis <keewis@users.noreply.github.com>
Co-Authored-By: keewis <keewis@users.noreply.github.com>
Co-Authored-By: keewis <keewis@users.noreply.github.com>
Co-Authored-By: keewis <keewis@users.noreply.github.com>
Co-Authored-By: keewis <keewis@users.noreply.github.com>
@max-sixty
Collaborator

@r-beer great list of future PRs!

@r-beer
Author

r-beer commented Nov 23, 2019

Turns out the assert np.allclose statements actually failed for cov, but not for corr:

        actual = xr.cov(ts1, ts2)
>       assert np.allclose(expected, actual)
E       assert False
E        +  where False = <function allclose at 0x7ff5dbaf0e18>(0.004632253406505826, <xarray.DataArray ()>\narray(0.004323))
E        +    where <function allclose at 0x7ff5dbaf0e18> = np.allclose

0.00463 vs. 0.00432

Author

@r-beer r-beer left a comment


[screenshot: pandas.Series.cov documentation]

pandas.Series.cov uses np.cov, which defaults to bias=False.

Should we also default to bias=False for normalization?
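To make the default concrete, here is a small numpy check of what bias=False means (the unbiased estimator with divisor n - 1); the data is illustrative only:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])
n = len(x)

unbiased = np.cov(x, y)[0, 1]           # default: bias=False, divisor n - 1
biased = np.cov(x, y, bias=True)[0, 1]  # bias=True: divisor n

assert np.isclose(unbiased, ((x - x.mean()) * (y - y.mean())).sum() / (n - 1))
assert np.isclose(biased, unbiased * (n - 1) / n)
```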

valid_values = da_a.notnull() & da_b.notnull()
da_a = da_a.where(
valid_values, drop=True
) # TODO: avoid drop as explained in https://github.com/pydata/xarray/pull/2652#discussion_r245492002
Member


I don't think drop=True is needed here at all, and I think it actually introduces a bug. In particular, consider the following case:

>>> array = xarray.DataArray([[1, 2], [np.nan, np.nan]], dims=['x', 'y'])
>>> xarray.corr(array, array, dim='y')
<xarray.DataArray (x: 1)>
array([2.])
Dimensions without coordinates: x

This doesn't look right to me at all. It has the wrong shape, size 1 instead of size 2, and also somehow the correlation is greater than 1.

@max-sixty
Collaborator

@shoyer good spot, thanks

@r-beer we could add that as a test case

@r-beer
Author

r-beer commented Nov 23, 2019

@shoyer good spot, thanks

@r-beer we could add that as a test case

Yes, I will add it.

Thinking about test arrays in general:
Wouldn't it be good to define some data array fixtures in the conftest.py which can then be used in all different test_function.py tests?

In view of the additional test, I started restructuring test_cov and test_corr into a single test_func. I am now asking myself how much of the data preprocessing should go into the parametrization and how much should remain in the test function...

Any guidelines or suggestions?

Either way, it would be good to have a clear list of da_a and da_b and respective expected results. I am a bit confused about what we actually would expect for xarray.DataArray([[1, 2], [np.nan, np.nan]], dims=['x', 'y']).

@r-beer
Author

r-beer commented Nov 23, 2019

Either way, it would be good to have a clear list of da_a and da_b and respective expected results. I am a bit confused about what we actually would expect for xarray.DataArray([[1, 2], [np.nan, np.nan]], dims=['x', 'y']).

Maybe, as a first step, some type checking would be enough to identify special cases like the one above and raise a ValueError or NotImplementedError?

@max-sixty
Collaborator

@shoyer good spot, thanks
@r-beer we could add that as a test case

Yes, I will add it.

Thinking about test arrays in general:
Wouldn't it be good to define some data array fixtures in the conftest.py which can then be used in all different test_function.py tests?

I think we could do more of this, yes. It's a tradeoff between (a) better sharing & performance vs. (b) customization for each function & localization of effects.
We have def create_test_data in a few files from before we used pytest, so we could convert some of those usages to fixtures.
@shoyer and I discussed this briefly with @keewis 's work on tests for units. I think @shoyer is keener on having fixture-like code close to the test functions.

In view of the additional test I started to restructure the test_cov and test_corr to a single test_func. I am now asking myself how much of the data preprocessing shall go to the parametrization and how much remains in the test function...

Any guidelines or suggestions?

I would start with having the code in the function; i.e., don't abstract too early unless we're really confident in the eventual state. When we find ourselves writing something multiple times, then we can start pulling it out into more abstractions (e.g. fixtures).

@max-sixty
Collaborator

Either way, it would be good to have a clear list of da_a and da_b and respective expected results. I am a bit confused about what we actually would expect for xarray.DataArray([[1, 2], [np.nan, np.nan]], dims=['x', 'y']).

For xarray.corr(array, array, dim='y'), I think we'd expect an array of the same dimensionality as below, with one point as the correlation of 1.0 (because {1,2} is 100% correlated to {1,2}) and another NaN. Does that make sense?

In [17]: xr.dot(da,da, dims='y')
Out[17]:
<xarray.DataArray (x: 2)>
array([ 5., nan])
Dimensions without coordinates: x
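The expectation can be checked with a plain-numpy sketch of a row-wise Pearson correlation (the helper name is hypothetical), where NaNs propagate instead of collapsing the dimension:

```python
import numpy as np


def rowwise_corr(a, b):
    # Pearson correlation of matching rows, reducing over the last axis;
    # rows containing NaN produce NaN instead of being dropped
    da = a - a.mean(axis=-1, keepdims=True)
    db = b - b.mean(axis=-1, keepdims=True)
    cov = (da * db).mean(axis=-1)
    return cov / (a.std(axis=-1) * b.std(axis=-1))


arr = np.array([[1.0, 2.0], [np.nan, np.nan]])
result = rowwise_corr(arr, arr)

assert result.shape == (2,)        # the x dimension keeps its size
assert np.isclose(result[0], 1.0)  # {1, 2} is perfectly correlated with itself
assert np.isnan(result[1])         # the all-NaN row stays NaN
```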

@r-beer
Author

r-beer commented Nov 24, 2019

I would start with having the code in the function; i.e. don't abstract too early unless we're really confident on the eventual state. When we find ourselves something multiple times, then we can start pulling it out into more abstractions (e.g. fixtures).

Alright! I found myself writing very similar tests and therefore abstracted them into test_func. I completely agree with not abstracting too early, but rather doing so once the same structure appears several times.

@r-beer
Author

r-beer commented Nov 24, 2019

For xarray.corr(array, array, dim='y'), I think we'd expect an array of the same dimensionality as below, with one point as the correlation of 1.0 (because {1,2} is 100% correlated to {1,2}) and another NaN. Does that make sense?

On the one hand, I am with you in terms of "What You Put In Is What You Get Out"; on the other hand, pandas.Series.cov and np.cov have other behaviors that both seem plausible:

[screenshot: pandas.Series.cov behavior]

[screenshot: np.cov behavior]

So for the moment, we might stick with pandas.Series.cov?
Or, if we rather want the np.cov behavior (future PR?), that would require more effort, and the repr should probably be an xarray object instead of an np.array, right?
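The two behaviors under discussion, sketched side by side: pandas.Series.cov returns a single pairwise value, while np.cov returns the full covariance matrix (illustrative data only):

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 4.0])
y = pd.Series([1.0, 3.0, 4.0])

scalar = x.cov(y)      # pandas: one pairwise covariance (ddof=1)
matrix = np.cov(x, y)  # numpy: the 2x2 covariance matrix

assert isinstance(scalar, float)
assert matrix.shape == (2, 2)
assert np.isclose(matrix[0, 1], scalar)  # the off-diagonal entry matches pandas
```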

refactored test functions back to test_cov and test_corr
@r-beer
Author

r-beer commented Nov 24, 2019

I have added several test cases, and almost all pass. The ones that don't are related to dim="time" and to the incompletely implemented pandas_cov and pandas_corr, which currently don't have a dim parameter.

I had a look at np.cov, np.corrcoef, pd.Series.cov, pd.frame.cov, pd.Series.corr, and pd.frame.corr: Computation of cov and corr is implemented in many different ways.

xr.cov and xr.corr now imitate the behavior of the respective pd.Series functions.

I actually prefer the numpy behavior, resulting in covariance and correlation matrices. However I feel that efficient implementation of this behavior is above my current understanding of xarray. So, I would highly appreciate your support on this implementation!

Otherwise, I added the ddof parameter to xr.cov to imitate the behavior of the respective pd.Series functions, and I now get the results that pd.Series produces. However, I had to set ddof=1 for xr.cov and ddof=0 for xr.corr. Does that make sense?
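One way to see why the ddof choice only matters for cov: in the correlation, the (n - ddof) factors in the covariance and the two standard deviations cancel, so any consistent ddof gives the same corr. A quick numeric check (helper names hypothetical, data illustrative):

```python
import numpy as np

x = np.array([0.0, 1.0, 3.0, 4.0, 7.0])
y = np.array([1.0, 2.0, 2.0, 5.0, 8.0])


def cov_(a, b, ddof):
    return ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - ddof)


def corr_(a, b, ddof):
    # the (n - ddof) factors cancel between cov and the two stds
    return cov_(a, b, ddof) / (a.std(ddof=ddof) * b.std(ddof=ddof))


assert np.isclose(corr_(x, y, ddof=0), corr_(x, y, ddof=1))
assert np.isclose(corr_(x, y, ddof=0), np.corrcoef(x, y)[0, 1])
```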

Collaborator

@max-sixty max-sixty left a comment


Almost there!

valid_values = ts1.notnull() & ts2.notnull()

ts1 = ts1.where(valid_values, drop=True)
ts2 = ts2.where(valid_values, drop=True)
Collaborator


I would have thought that we should be testing that we get a similar result from pandas' cov and xarray cov without much preparation. Can you help me understand why we need to drop values here? What's the test case that fails when we don't have these lines?

dims=("a", "time", "x"),
)

arrays = [
Collaborator


The issue with putting these at the module level (and one of the benefits of fixtures) is that each time this file is imported—even when no relevant test is being run—this will execute, and so it'll slow down rapid test collection. Could we put this in a fixture? I can help show you how if it's not clear. Thanks @r-beer

Author


Yes, I planned to make a fixture out of it. (How) Can I then easily use this fixture within parametrize?

The following code does not work, as the function is not iterable:

@pytest.fixture()
def array_tuples():
    da = xr.DataArray(np.random.random((3, 21, 4)),
                      coords={"time": pd.date_range("2000-01-01", freq="1D", periods=21)},
                      dims=("a", "time", "x"),)

    arrays = [
        da.isel(time=range(0, 18)),
        da.isel(time=range(2, 20)).rolling(time=3, center=True).mean(dim="time"),
        xr.DataArray([0, np.nan, 1, 2, np.nan, 3, 4, 5, np.nan, 6, 7], dims="time"),
        xr.DataArray([1, 1, np.nan, 2, np.nan, 3, 5, 4, 6, np.nan, 7], dims="time"),
        xr.DataArray([[1, 2], [1, np.nan]], dims=["x", "time"]),
        xr.DataArray([[1, 2], [np.nan, np.nan]], dims=["x", "time"]),
    ]

    array_tuples = [
        (arrays[0], arrays[0]),
        (arrays[0], arrays[1]),
        (arrays[1], arrays[1]),
        (arrays[2], arrays[2]),
        (arrays[2], arrays[3]),
        (arrays[3], arrays[3]),
        (arrays[4], arrays[4]),
        (arrays[4], arrays[5]),
        (arrays[5], arrays[5]),
    ]

    return array_tuples


@pytest.mark.parametrize("da_a, da_b", array_tuples)
@pytest.mark.parametrize("dim", [None, "time", "x"])
def test_cov(da_a, da_b, dim):
    def pandas_cov(ts1, ts2):

Errors in:

E   TypeError: 'function' object is not iterable

Collaborator


If you want to use array_tuples as a fixture, the test function needs to take it as an argument; you can't pass it as an argument to parametrize. (I know the multiple ways of parametrizing can be confusing.)

5P2 combinations is fine for performance, particularly if the inputs are differentiated and so each test has marginal value. It does limit you to tests with a known correct implementation, though (you're not going to want to write out the results for all of those!)

Author

@r-beer r-beer Nov 26, 2019


If you want to use array_tuples as a fixture, you need for the test function to take it as an argument, not to pass it as a argument to parametrize. (I know the multiple ways of parametrizing can be confusing)

Does that mean that all of the permutation is done within the fixture? Is it not possible to put the fixture into the parametrize at all?

5P2 combinations is fine for performance, particularly if the inputs are differentiated and so each test has marginal value. It does limit you to tests with a known correct implementation, though (you're not going to want to write out the results for all of those!)

Haha, yeah, don't want to write all those by hand. 🙂
I did not completely get the part about "differentiation" and "marginal test value". Could you write one or two more sentences about it?

Thanks a lot for all the explanations!

Comment on lines +807 to +817
array_tuples = [
(arrays[0], arrays[0]),
(arrays[0], arrays[1]),
(arrays[1], arrays[1]),
(arrays[2], arrays[2]),
(arrays[2], arrays[3]),
(arrays[3], arrays[3]),
(arrays[4], arrays[4]),
(arrays[4], arrays[5]),
(arrays[5], arrays[5]),
]
Collaborator

@keewis keewis Nov 24, 2019


This looks like a good use case for itertools.permutations. So this could be

Suggested change
array_tuples = [
(arrays[0], arrays[0]),
(arrays[0], arrays[1]),
(arrays[1], arrays[1]),
(arrays[2], arrays[2]),
(arrays[2], arrays[3]),
(arrays[3], arrays[3]),
(arrays[4], arrays[4]),
(arrays[4], arrays[5]),
(arrays[5], arrays[5]),
]
array_tuples = [
(arrays[first], arrays[second])
for first, second in itertools.permutations(range(len(arrays)), 2)
]

which means that you don't have to update array_tuples if you decide you have to add another item to arrays

Edit: GitHub apparently mishandles multi-line suggestions; you'd have to edit this by hand.

Collaborator


If you want all permutations of the two variables, just pass them as separate parameters with pytest

Collaborator

@keewis keewis Nov 24, 2019


sorry, I mistook combinations for permutations. Looking at it again, it seems that the pattern is some sort of "combine with the same index and the previous one", which is not what combinations would yield.

So never mind.
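For what it's worth, the "same index and the previous one" pattern can still be generated rather than written out, e.g. with itertools.combinations_with_replacement over consecutive index pairs (a sketch, not a suggestion from the thread):

```python
import itertools

arrays = ["a0", "a1", "a2", "a3", "a4", "a5"]  # stand-ins for the six test arrays

# within each consecutive pair of indices {0,1}, {2,3}, {4,5}, take all
# combinations with replacement: (i, i), (i, i+1), (i+1, i+1)
array_tuples = [
    (arrays[i], arrays[j])
    for start in range(0, len(arrays), 2)
    for i, j in itertools.combinations_with_replacement((start, start + 1), 2)
]

assert [(arrays.index(a), arrays.index(b)) for a, b in array_tuples] == [
    (0, 0), (0, 1), (1, 1), (2, 2), (2, 3), (3, 3), (4, 4), (4, 5), (5, 5)
]
```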

Author


Thanks for the feedback. I have a question:
Should all possible data arrays be allowed?

If so, instead of using array_tuples we can use what I previously tested (but didn't commit):

@pytest.mark.parametrize("da_b", arrays)
@pytest.mark.parametrize("da_a", arrays)
@pytest.mark.parametrize("dim", [None, "time", "x"])
def test_cov(da_a, da_b, dim):

This is, what @max-sixty means, right?

However, wouldn't using all combinations blow up the number of test functions unnecessarily? On the other hand, not using array_tuples would be much more readable.

@@ -1047,6 +1047,175 @@ def earth_mover_distance(first_samples,
return apply_array_ufunc(func, *args, dask=dask)


def cov(da_a, da_b, dim=None, ddof=1):
Contributor


skipna=None?


# 3. Compute mean and standard deviation along the given dim
demeaned_da_a = da_a - da_a.mean(dim=dim)
demeaned_da_b = da_b - da_b.mean(dim=dim)
Contributor


Why not demean in place, and avoid additional copies of the full arrays? Aren't da_a and da_b already copies?

da_a_std = da_a.std(dim=dim)
da_b_std = da_b.std(dim=dim)

corr = cov(da_a, da_b, dim=dim, ddof=ddof) / (da_a_std * da_b_std)
Contributor


This is making a zillion copies of the full arrays. Please try to avoid unnecessary copies.

Collaborator


I think we're more likely to end up with a working, performant feature by making it work first, and then making it performant...

@AndrewILWilliams AndrewILWilliams mentioned this pull request May 23, 2020
5 tasks
max-sixty pushed a commit that referenced this pull request May 25, 2020
* Added chunks='auto' option in dataset.py

* reverted accidental changes in dataset.chunk()

* Added corr and cov to computation.py. Taken from r-beer:xarray/corr

* Added r-beer's tests to test_computation.py

Still issues I think

* trying to fix github.com//pull/3550#discussion_r349935731

* Removing drop=True from the `.where()` calls in `computation.py`+test.py

* api.rst and whats-new.rst

* Updated `xarray/__init__.py` and added `broadcast` import to computation

* added DataArray import to corr, cov

* assert_allclose added to test_computation.py

* removed whitespace in test_dask...oops

* Added to init

* format changes

* Fiddling around with cov/corr tests in `test_computation.py`

* PEP8 changes

* pep

* remove old todo and comments

* isort

* Added consistency check between corr() and cov(), ensure they give same

* added `skipna=False` to `computation.py`. made consistency+autocov tests

* formatting

* Added numpy-based tests.

* format

* formatting again

* Update doc/whats-new.rst

Co-authored-by: keewis <keewis@users.noreply.github.com>

* refactored corr/cov so there is one internal method for calculating both

* formatting

* updating docstrings and code suggestions from PR

* paramterize ddof in tests

* removed extraneous test arrays

* formatting + adding deterministic docstring

* added test for TypeError

* formatting

* tidying up docstring

* formatting and tidying up `_cov_corr()` so that the logic is more clear

* flake8 ...

Co-authored-by: keewis <keewis@users.noreply.github.com>
@max-sixty
Collaborator

@r-beer in case you come back to see this: thank you for taking it so far; your code was helpful to eventually getting this feature in. And we'd of course appreciate any additional contributions.

@max-sixty max-sixty closed this May 25, 2020
9 participants