
cov() and corr() - finalization #3550

Closed
wants to merge 42 commits into from

Conversation

r-beer

@r-beer r-beer commented Nov 20, 2019

Dear maintainers,

this is my first PR and contribution to xarray. I have read the contributing guidelines and did my best to conform to your code of conduct. However, I would be happy if you could take a closer look at whether everything is correct and give feedback for improvement!

As I don't have push access to @hrishikeshac's fork, I created this PR to add the final changes for PR #2652.

  • Closes cov() and corr() #2652
  • Tests had been added in cov() and corr() #2652
  • Passes black . && mypy . && flake8
    • black
    • flake8
    • mypy
  • Fully documented, including whats-new.rst for all changes and api.rst for new API
    • whats-new.rst
    • api.rst
  • example notebook for usage of corr() and cov()
    • where and how to write the examples?
    • covariance
    • correlation
  • moved implementation of cov() and corr() to computation
  • add test_cov()
  • remove da.cov and da.corr

@pep8speaks

pep8speaks commented Nov 20, 2019

Hello @r-beer! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 817:1: E302 expected 2 blank lines, found 1

Comment last updated at 2019-11-25 20:47:29 UTC
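For reference, E302 is the pycodestyle check that top-level definitions are preceded by two blank lines. A minimal passing layout (function names are hypothetical, just for illustration):

```python
def first_function():
    return 1


def second_function():  # two blank lines before each top-level def satisfy E302
    return 2
```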

@r-beer
Author

r-beer commented Nov 20, 2019

@max-sixty, all three checks fail because test_open_mfdataset_manyfiles requires either netcdf4 or scipy.

Shall I, therefore, add the @requires_scipy_or_netCDF4 decorator?

@keewis
Collaborator

keewis commented Nov 20, 2019

There seems to be a problem with your email address (github can't match it to your account). Does it match the one you used for signing up on github?

@r-beer
Author

r-beer commented Nov 20, 2019

There seems to be a problem with your email address (github can't match it to your account). Does it match the one you used for signing up on github?

Dear @keewis, I just set my real email address also as public email address and set my local git user name and email address accordingly. Does this solve the problem?

@keewis
Collaborator

keewis commented Nov 20, 2019

for the new commit, yes. The older commits could be rewritten if you don't like the other email address being public (I can guide you through the process if you want) but since PRs get squashed during the merge you don't have to.

@r-beer
Author

r-beer commented Nov 20, 2019

for the new commit, yes. The older commits could be rewritten if you don't like the other email address being public (I guide you through the process if you want) but since PRs get squashed after the merge you don't have to.

OK, no need for the extra effort at this point, the squashing at the end of PR is fine for me. Thank you for the offer, anyway! 🙂

Are any other changes necessary for the PR?

ideas:

  • examples for corr and cov usage
  • whats-new.rst

@keewis
Collaborator

keewis commented Nov 20, 2019

yes, whats-new.rst and api.rst will have to eventually be updated before the merge. Also, examples would be really good.

@keewis
Collaborator

keewis commented Nov 20, 2019

since the APIs of DataArray and Dataset are usually kept in sync as much as possible, it might be good to add the new methods to Dataset, too. This doesn't have to be in this PR, though, and I'd wait for confirmation first.

@r-beer
Author

r-beer commented Nov 20, 2019

since the APIs of DataArray and Dataset are usually kept in sync as much as possible, it might be good to add the new methods to Dataset, too. This doesn't have to be in this PR, though, and I'd wait for confirmation first.

Yes, the functionality I would be most interested in is calculating corr() between different variables of a dataset, as in pandas.DataFrame.corr, as well as correlations between different locations within the same DataArray (e.g. correlations of time series at different latitudes).
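For comparison, pandas.DataFrame.corr computes the pairwise correlation matrix between all columns, which is the Dataset-level behavior being described (illustrative data only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [1.5, 3.1, 2.9, 5.0]})

corr_matrix = df.corr()  # pairwise Pearson correlation between all columns

# the result is symmetric with a unit diagonal
assert corr_matrix.shape == (2, 2)
assert np.isclose(corr_matrix.loc["a", "a"], 1.0)
assert np.isclose(corr_matrix.loc["a", "b"], corr_matrix.loc["b", "a"])
```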

@@ -4079,6 +4080,49 @@ def test_rank(self):
y = DataArray([0.75, 0.25, np.nan, 0.5, 1.0], dims=("z",))
assert_equal(y.rank("z", pct=True), y)

# TODO: use fixture or randomly created array instead of load_dataset
Collaborator


👍

@max-sixty
Collaborator

Great start!

Thanks for the review @keewis

I think it's fine to leave Dataset for another PR; it'd be great to get this in, and given it's your first PR (welcome!) let's not bloat it

Test looks good. Let's make those adjustments you put in the TODO and add one for the other function? We could also add some variants which test along different dimensions / all NaNs / other special cases.

@r-beer
Author

r-beer commented Nov 20, 2019

Great start!

Thanks for the review @keewis

I think it's fine to leave Dataset for another PR; it'd be great to get this in, and given it's your first PR (welcome!) let's not bloat it

Test looks good. Let's make those adjustments you put in the TODO and add one for the other function? We could also add some variants which test along different dimensions / all NaNs / other special cases.

Thank you two, @max-sixty and @keewis, for the constructive feedback that made the start easier!
I would also be happy to go through the whole process once and then implement the PR for the Dataset separately. This will probably shorten the learning curve and accelerate the process.

Additionally, I wonder whether I have to configure black for the spaces between colons?

I had autoformatting running with autopep8 previously, but something broke in one of the last VS Code updates, so I will set this up properly again on Friday or tomorrow.

r-beer and others added 5 commits November 20, 2019 16:22
Co-Authored-By: keewis <keewis@users.noreply.github.com>
Co-Authored-By: keewis <keewis@users.noreply.github.com>
Co-Authored-By: keewis <keewis@users.noreply.github.com>
Co-Authored-By: keewis <keewis@users.noreply.github.com>
Co-Authored-By: keewis <keewis@users.noreply.github.com>
@max-sixty
Collaborator

@r-beer great list of future PRs!

@r-beer
Author

r-beer commented Nov 23, 2019

Turns out the assert np.allclose statements actually failed for cov, but not for corr:

        actual = xr.cov(ts1, ts2)
>       assert np.allclose(expected, actual)
E       assert False
E        +  where False = <function allclose at 0x7ff5dbaf0e18>(0.004632253406505826, <xarray.DataArray ()>\narray(0.004323))
E        +    where <function allclose at 0x7ff5dbaf0e18> = np.allclose

0.00463 vs. 0.00432

Author

@r-beer r-beer left a comment


[screenshot: pandas.Series.cov documentation]

pandas.Series.cov uses np.cov, which defaults to bias=False.

Should we also default to bias=False for normalization?
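To make the default concrete, here is a small numpy check of what bias=False means (the unbiased estimator with divisor n - 1); the data is illustrative only:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 9.0])
n = len(x)

unbiased = np.cov(x, y)[0, 1]           # default: bias=False, divisor n - 1
biased = np.cov(x, y, bias=True)[0, 1]  # bias=True: divisor n

assert np.isclose(unbiased, ((x - x.mean()) * (y - y.mean())).sum() / (n - 1))
assert np.isclose(biased, unbiased * (n - 1) / n)
```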

valid_values = da_a.notnull() & da_b.notnull()
da_a = da_a.where(
valid_values, drop=True
) # TODO: avoid drop as explained in https://github.com/pydata/xarray/pull/2652#discussion_r245492002
Member


I don't think drop=True is needed here at all, and I think it actually introduces a bug. In particular, consider the following case:

>>> array = xarray.DataArray([[1, 2], [np.nan, np.nan]], dims=['x', 'y'])
>>> xarray.corr(array, array, dim='y')
<xarray.DataArray (x: 1)>
array([2.])
Dimensions without coordinates: x

This doesn't look right to me at all. It has the wrong shape, size 1 instead of size 2, and also somehow the correlation is greater than 1.

@max-sixty
Collaborator

@shoyer good spot, thanks

@r-beer we could add that as a test case

@r-beer
Author

r-beer commented Nov 23, 2019

@shoyer good spot, thanks

@r-beer we could add that as a test case

Yes, I will add it.

Thinking about test arrays in general:
Wouldn't it be good to define some data array fixtures in the conftest.py which can then be used in all different test_function.py tests?

In view of the additional test, I started restructuring test_cov and test_corr into a single test_func. I am now asking myself how much of the data preprocessing should go into the parametrization and how much should remain in the test function...

Any guidelines or suggestions?

Either way, it would be good to have a clear list of da_a and da_b and respective expected results. I am a bit confused about what we actually would expect for xarray.DataArray([[1, 2], [np.nan, np.nan]], dims=['x', 'y']).

@r-beer
Author

r-beer commented Nov 23, 2019

Either way, it would be good to have a clear list of da_a and da_b and respective expected results. I am a bit confused about what we actually would expect for xarray.DataArray([[1, 2], [np.nan, np.nan]], dims=['x', 'y']).

Maybe, as a first step, some type checking would be enough to identify special cases like the one above and raise a ValueError or NotImplementedError?

@max-sixty
Collaborator

@shoyer good spot, thanks
@r-beer we could add that as a test case

Yes, I will add it.

Thinking about test arrays in general:
Wouldn't it be good to define some data array fixtures in the conftest.py which can then be used in all different test_function.py tests?

I think we could do more of this, yes. It's a tradeoff between (a) better sharing & performance vs. (b) customization for each function & localization of effects.
We have def create_test_data in a few files from before we used pytest, so we could convert some of those usages to fixtures.
@shoyer and I discussed this briefly with @keewis 's work on tests for units. I think @shoyer is keener on having fixture-like code close to the test functions.

In view of the additional test I started to restructure the test_cov and test_corr to a single test_func. I am now asking myself how much of the data preprocessing shall go to the parametrization and how much remains in the test function...

Any guidelines or suggestions?

I would start with having the code in the function; i.e., don't abstract too early unless we're really confident in the eventual state. When we find ourselves writing something multiple times, then we can start pulling it out into more abstractions (e.g. fixtures).

@max-sixty
Collaborator

Either way, it would be good to have a clear list of da_a and da_b and respective expected results. I am a bit confused about what we actually would expect for xarray.DataArray([[1, 2], [np.nan, np.nan]], dims=['x', 'y']).

For xarray.corr(array, array, dim='y'), I think we'd expect an array of the same dimensionality as below, with one point as the correlation of 1.0 (because {1,2} is 100% correlated to {1,2}) and another NaN. Does that make sense?

In [17]: xr.dot(da,da, dims='y')
Out[17]:
<xarray.DataArray (x: 2)>
array([ 5., nan])
Dimensions without coordinates: x
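The expectation can be checked with a plain-numpy sketch of a row-wise Pearson correlation (the helper name is hypothetical), where NaNs propagate instead of collapsing the dimension:

```python
import numpy as np


def rowwise_corr(a, b):
    # Pearson correlation of matching rows, reducing over the last axis;
    # rows containing NaN produce NaN instead of being dropped
    da = a - a.mean(axis=-1, keepdims=True)
    db = b - b.mean(axis=-1, keepdims=True)
    cov = (da * db).mean(axis=-1)
    return cov / (a.std(axis=-1) * b.std(axis=-1))


arr = np.array([[1.0, 2.0], [np.nan, np.nan]])
result = rowwise_corr(arr, arr)

assert result.shape == (2,)        # the x dimension keeps its size
assert np.isclose(result[0], 1.0)  # {1, 2} is perfectly correlated with itself
assert np.isnan(result[1])         # the all-NaN row stays NaN
```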

@r-beer
Author

r-beer commented Nov 24, 2019

I would start with having the code in the function; i.e. don't abstract too early unless we're really confident on the eventual state. When we find ourselves something multiple times, then we can start pulling it out into more abstractions (e.g. fixtures).

Alright! I found myself writing very similar tests and therefore abstracted them into test_func. I completely agree with not abstracting too early, but rather doing so once the same structure appears several times.

@r-beer
Author

r-beer commented Nov 24, 2019

For xarray.corr(array, array, dim='y'), I think we'd expect an array of the same dimensionality as below, with one point as the correlation of 1.0 (because {1,2} is 100% correlated to {1,2}) and another NaN. Does that make sense?

On the one hand, I am with you in terms of "What You Put In Is What You Get Out"; on the other hand, pandas.Series.cov and np.cov have other behaviors that both seem plausible:

[screenshot: pandas.Series.cov behavior]

[screenshot: np.cov behavior]

So for the moment, we might stick with pandas.Series.cov?
Or, if we rather want the np.cov behavior (future PR?), that would require more effort, and the repr should probably be an xarray object instead of an np.array, right?
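The two behaviors under discussion, sketched side by side: pandas.Series.cov returns a single pairwise value, while np.cov returns the full covariance matrix (illustrative data only):

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 4.0])
y = pd.Series([1.0, 3.0, 4.0])

scalar = x.cov(y)      # pandas: one pairwise covariance (ddof=1)
matrix = np.cov(x, y)  # numpy: the 2x2 covariance matrix

assert isinstance(scalar, float)
assert matrix.shape == (2, 2)
assert np.isclose(matrix[0, 1], scalar)  # the off-diagonal entry matches pandas
```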

refactored test functions back to test_cov and test_corr
@r-beer
Author

r-beer commented Nov 24, 2019

I have added several test cases, and almost all pass. The ones that don't are related to dim="time" and to the incompletely implemented pandas_cov and pandas_corr, which currently don't have a dim parameter.

I had a look at np.cov, np.corrcoef, pd.Series.cov, pd.frame.cov, pd.Series.corr, and pd.frame.corr: Computation of cov and corr is implemented in many different ways.

xr.cov and xr.corr now imitate the behavior of the respective pd.Series functions.

I actually prefer the numpy behavior, resulting in covariance and correlation matrices. However I feel that efficient implementation of this behavior is above my current understanding of xarray. So, I would highly appreciate your support on this implementation!

Otherwise, I added the ddof parameter to xr.cov to imitate the behavior of the respective pd.Series functions, and I now get the results that pd.Series produces. However, I had to set ddof=1 for xr.cov and ddof=0 for xr.corr. Does that make sense?
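One way to see why the ddof choice only matters for cov: in the correlation, the (n - ddof) factors in the covariance and the two standard deviations cancel, so any consistent ddof gives the same corr. A quick numeric check (helper names hypothetical, data illustrative):

```python
import numpy as np

x = np.array([0.0, 1.0, 3.0, 4.0, 7.0])
y = np.array([1.0, 2.0, 2.0, 5.0, 8.0])


def cov_(a, b, ddof):
    return ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - ddof)


def corr_(a, b, ddof):
    # the (n - ddof) factors cancel between cov and the two stds
    return cov_(a, b, ddof) / (a.std(ddof=ddof) * b.std(ddof=ddof))


assert np.isclose(corr_(x, y, ddof=0), corr_(x, y, ddof=1))
assert np.isclose(corr_(x, y, ddof=0), np.corrcoef(x, y)[0, 1])
```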

Collaborator

@max-sixty max-sixty left a comment


Almost there!

valid_values = ts1.notnull() & ts2.notnull()

ts1 = ts1.where(valid_values, drop=True)
ts2 = ts2.where(valid_values, drop=True)
Collaborator


I would have thought that we should be testing that we get a similar result from pandas' cov and xarray cov without much preparation. Can you help me understand why we need to drop values here? What's the test case that fails when we don't have these lines?

dims=("a", "time", "x"),
)

arrays = [
Collaborator


The issue with putting these at the module level (and one of the benefits of fixtures) is that each time this file is imported—even when no relevant test is being run—this will execute, and so it'll slow down rapid test collection. Could we put this in a fixture? I can help show you how if it's not clear. Thanks @r-beer

Author


Yes, I planned to make a fixture out of it. (How) Can I then easily use this fixture within parametrize?

The following code does not work, as the function is not iterable:

@pytest.fixture()
def array_tuples():
    da = xr.DataArray(np.random.random((3, 21, 4)),
                      coords={"time": pd.date_range("2000-01-01", freq="1D", periods=21)},
                      dims=("a", "time", "x"),)

    arrays = [
        da.isel(time=range(0, 18)),
        da.isel(time=range(2, 20)).rolling(time=3, center=True).mean(dim="time"),
        xr.DataArray([0, np.nan, 1, 2, np.nan, 3, 4, 5, np.nan, 6, 7], dims="time"),
        xr.DataArray([1, 1, np.nan, 2, np.nan, 3, 5, 4, 6, np.nan, 7], dims="time"),
        xr.DataArray([[1, 2], [1, np.nan]], dims=["x", "time"]),
        xr.DataArray([[1, 2], [np.nan, np.nan]], dims=["x", "time"]),
    ]

    array_tuples = [
        (arrays[0], arrays[0]),
        (arrays[0], arrays[1]),
        (arrays[1], arrays[1]),
        (arrays[2], arrays[2]),
        (arrays[2], arrays[3]),
        (arrays[3], arrays[3]),
        (arrays[4], arrays[4]),
        (arrays[4], arrays[5]),
        (arrays[5], arrays[5]),
    ]

    return array_tuples


@pytest.mark.parametrize("da_a, da_b", array_tuples)
@pytest.mark.parametrize("dim", [None, "time", "x"])
def test_cov(da_a, da_b, dim):
    def pandas_cov(ts1, ts2):

Errors in:

E   TypeError: 'function' object is not iterable

Collaborator


If you want to use array_tuples as a fixture, the test function needs to take it as an argument; you can't pass it as an argument to parametrize. (I know the multiple ways of parametrizing can be confusing.)

5P2 combinations is fine for performance, particularly if the inputs are differentiated and so each test has marginal value. It does limit you to tests with a known correct implementation, though (you're not going to want to write out the results for all of those!)

Author

@r-beer r-beer Nov 26, 2019


If you want to use array_tuples as a fixture, you need for the test function to take it as an argument, not to pass it as a argument to parametrize. (I know the multiple ways of parametrizing can be confusing)

Does that mean that all of the permutation is done within the fixture? Is it not possible to put the fixture into the parametrize at all?

5P2 combinations is fine for performance, particularly if the inputs are differentiated and so each test has marginal value. It does limit you to tests with a known correct implementation, though (you're not going to want to write out the results for all of those!)

Haha, yeah, don't want to write all those by hand. 🙂
I did not completely get the part about "differentiation" and "marginal test value". Could you write one or two more sentences about it?

Thanks a lot for all the explanations!

Comment on lines +807 to +817
array_tuples = [
(arrays[0], arrays[0]),
(arrays[0], arrays[1]),
(arrays[1], arrays[1]),
(arrays[2], arrays[2]),
(arrays[2], arrays[3]),
(arrays[3], arrays[3]),
(arrays[4], arrays[4]),
(arrays[4], arrays[5]),
(arrays[5], arrays[5]),
]
Collaborator

@keewis keewis Nov 24, 2019


This looks like a good use case for itertools.permutations. So this could be

Suggested change
array_tuples = [
(arrays[0], arrays[0]),
(arrays[0], arrays[1]),
(arrays[1], arrays[1]),
(arrays[2], arrays[2]),
(arrays[2], arrays[3]),
(arrays[3], arrays[3]),
(arrays[4], arrays[4]),
(arrays[4], arrays[5]),
(arrays[5], arrays[5]),
]
array_tuples = [
(arrays[first], arrays[second])
for first, second in itertools.permutations(range(len(arrays)), 2)
]

which means that you don't have to update array_tuples if you decide you have to add another item to arrays

Edit: GitHub apparently mishandles multi-line suggestions; you'd have to edit this by hand.

Collaborator


If you want all permutations of the two variables, just pass them as separate parameters with pytest

Collaborator

@keewis keewis Nov 24, 2019


sorry, I mistook combinations for permutations. Looking at it again, it seems that the pattern is some sort of "combine with the same index and the previous one", which is not what combinations would yield.

So never mind.
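For what it's worth, the "same index and the previous one" pattern can still be generated rather than written out, e.g. with itertools.combinations_with_replacement over consecutive index pairs (a sketch, not a suggestion from the thread):

```python
import itertools

arrays = ["a0", "a1", "a2", "a3", "a4", "a5"]  # stand-ins for the six test arrays

# within each consecutive pair of indices {0,1}, {2,3}, {4,5}, take all
# combinations with replacement: (i, i), (i, i+1), (i+1, i+1)
array_tuples = [
    (arrays[i], arrays[j])
    for start in range(0, len(arrays), 2)
    for i, j in itertools.combinations_with_replacement((start, start + 1), 2)
]

assert [(arrays.index(a), arrays.index(b)) for a, b in array_tuples] == [
    (0, 0), (0, 1), (1, 1), (2, 2), (2, 3), (3, 3), (4, 4), (4, 5), (5, 5)
]
```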

Author


Thanks for the feedback. I have a question:
Should all possible data arrays be allowed?

If so, instead of using array_tuples we can use what I previously tested (but didn't commit):

@pytest.mark.parametrize("da_b", arrays)
@pytest.mark.parametrize("da_a", arrays)
@pytest.mark.parametrize("dim", [None, "time", "x"])
def test_cov(da_a, da_b, dim):

This is, what @max-sixty means, right?

However, wouldn't using all combinations blow up the number of test functions unnecessarily? On the other hand, not using array_tuples would be much more readable.

@@ -1047,6 +1047,175 @@ def earth_mover_distance(first_samples,
return apply_array_ufunc(func, *args, dask=dask)


def cov(da_a, da_b, dim=None, ddof=1):
Contributor


skipna=None?


# 3. Compute mean and standard deviation along the given dim
demeaned_da_a = da_a - da_a.mean(dim=dim)
demeaned_da_b = da_b - da_b.mean(dim=dim)
Contributor


Why not demean in place, and avoid additional copies of the full arrays? Aren't da_a and da_b already copies?

da_a_std = da_a.std(dim=dim)
da_b_std = da_b.std(dim=dim)

corr = cov(da_a, da_b, dim=dim, ddof=ddof) / (da_a_std * da_b_std)
Contributor


This is making a zillion copies of the full arrays. Please try to avoid unnecessary copies.

Collaborator


I think we're more likely to end up with a working, performant feature by making it work first, and then making it performant...

@AndrewILWilliams AndrewILWilliams mentioned this pull request May 23, 2020
5 tasks
max-sixty pushed a commit that referenced this pull request May 25, 2020
* Added chunks='auto' option in dataset.py

* reverted accidental changes in dataset.chunk()

* Added corr and cov to computation.py. Taken from r-beer:xarray/corr

* Added r-beer's tests to test_computation.py

Still issues I think

* trying to fix github.com//pull/3550#discussion_r349935731

* Removing drop=True from the `.where()` calls in `computation.py`+test.py

* api.rst and whats-new.rst

* Updated `xarray/__init__.py` and added `broadcast` import to computation

* added DataArray import to corr, cov

* assert_allclose added to test_computation.py

* removed whitespace in test_dask...oops

* Added to init

* format changes

* Fiddling around with cov/corr tests in `test_computation.py`

* PEP8 changes

* pep

* remove old todo and comments

* isort

* Added consistency check between corr() and cov(), ensure they give same

* added `skipna=False` to `computation.py`. made consistency+autocov tests

* formatting

* Added numpy-based tests.

* format

* formatting again

* Update doc/whats-new.rst

Co-authored-by: keewis <keewis@users.noreply.github.com>

* refactored corr/cov so there is one internal method for calculating both

* formatting

* updating docstrings and code suggestions from PR

* paramterize ddof in tests

* removed extraneous test arrays

* formatting + adding deterministic docstring

* added test for TypeError

* formatting

* tidying up docstring

* formatting and tidying up `_cov_corr()` so that the logic is more clear

* flake8 ...

Co-authored-by: keewis <keewis@users.noreply.github.com>
@max-sixty
Collaborator

@r-beer in case you come back to see this: thank you for taking it so far; your code was helpful to eventually getting this feature in. And we'd of course appreciate any additional contributions.

@max-sixty max-sixty closed this May 25, 2020
9 participants