cov() and corr() - finalization #3550
Conversation
…array.py, and tested using pytest.
Hello @r-beer! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2019-11-25 20:47:29 UTC
@max-sixty, all three checks fail because […]. Shall I, therefore, add the […]?
There seems to be a problem with your email address (github can't match it to your account). Does it match the one you used for signing up on github?
Dear @keewis, I just set my real email address as my public email address and set my local git user name and email address accordingly. Does this solve the problem?
For the new commit, yes. The older commits could be rewritten if you don't like the other email address being public (I can guide you through the process if you want), but since PRs get squashed during the merge you don't have to.
OK, no need for the extra effort at this point; the squashing at the end of the PR is fine for me. Thank you for the offer anyway! 🙂 Are any other changes necessary for the PR? Ideas: […]
yes, […]
since the APIs of […]
Yes, the functionality I would be most interested in is calculating […]
xarray/tests/test_dataarray.py
@@ -4079,6 +4080,49 @@ def test_rank(self):
        y = DataArray([0.75, 0.25, np.nan, 0.5, 1.0], dims=("z",))
        assert_equal(y.rank("z", pct=True), y)
# TODO: use fixture or randomly created array instead of load_dataset
👍
Great start! Thanks for the review @keewis. I think it's fine to leave […]. Test looks good. Let's make those adjustments you put in the […].
Thank you two, @max-sixty and @keewis, for the constructive feedback that made the start easier! Additionally, I wonder whether I have to configure black for the spaces between colons? I had autoformatting running with autopep8 previously, but something broke after one of the recent VS Code updates, so I will set this up properly again tomorrow or on Friday.
Co-Authored-By: keewis <keewis@users.noreply.github.com>
@r-beer great list of future PRs!
Turns out the […]:

actual = xr.cov(ts1, ts2)
>       assert np.allclose(expected, actual)
E       assert False
E        +  where False = <function allclose at 0x7ff5dbaf0e18>(0.004632253406505826, <xarray.DataArray ()>\narray(0.004323))
E        +  where <function allclose at 0x7ff5dbaf0e18> = np.allclose

0.00463 vs. 0.00432
xarray/core/computation.py
valid_values = da_a.notnull() & da_b.notnull()
da_a = da_a.where(
    valid_values, drop=True
)  # TODO: avoid drop as explained in https://github.com/pydata/xarray/pull/2652#discussion_r245492002
I don't think drop=True is needed here at all, and I think it actually introduces a bug. In particular, consider the following case:
>>> array = xarray.DataArray([[1, 2], [np.nan, np.nan]], dims=['x', 'y'])
>>> xarray.corr(array, array, dim='y')
<xarray.DataArray (x: 1)>
array([2.])
Dimensions without coordinates: x
This doesn't look right to me at all. It has the wrong shape, size 1 instead of size 2, and also somehow the correlation is greater than 1.
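A minimal sketch of the behavior difference at issue (variable names are illustrative, not from the PR): where() without drop=True masks invalid positions with NaN while preserving the array's shape, whereas drop=True shrinks the dimensions, which is what produces the size-1 result above.

```python
import numpy as np
import xarray as xr

array = xr.DataArray([[1, 2], [np.nan, np.nan]], dims=["x", "y"])
valid_values = array.notnull()  # pairwise mask; trivially array.notnull() here

masked = array.where(valid_values)              # same shape, NaN where invalid
dropped = array.where(valid_values, drop=True)  # trims labels along each dim

print(masked.sizes)   # x stays size 2, so reductions along y keep both rows
print(dropped.sizes)  # x shrinks to size 1: the all-NaN row disappears
```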
Yes, I will add it. Thinking about test arrays in general: in view of the additional test, I started to restructure the […]. Any guidelines or suggestions? Either way, it would be good to have a clear list of […].
started to move test data creation to parametrize
Maybe, as a first step, some type checking would be enough to identify such special cases as the one above and give a […].
I think we could do more of this, yes. It's a tradeoff between (a) better sharing & performance vs. (b) customization for each function & localization of effects.
I would start with having the code in the function; i.e. don't abstract too early unless we're really confident in the eventual state. When we find ourselves repeating something multiple times, then we can start pulling it out into more abstractions (e.g. fixtures).
For […]
Alright! I found myself having very similar tests and therefore abstracted them into test_func. I completely agree with not abstracting too early, but only once the same structure appears several times.
On the one hand, I am with you in terms of "What You Put In Is What You Get Out"; on the other hand, […]. So for the moment, we might stick with […].
refactored test functions back to test_cov and test_corr
minor changes
I have added several test cases; almost all pass. The ones that don't pass are related to […]. I had a look at […].
I actually prefer the numpy behavior, resulting in covariance and correlation matrices. However, I feel that an efficient implementation of this behavior is above my current understanding of xarray, so I would highly appreciate your support on this implementation! Otherwise, I added the […].
Almost there!
valid_values = ts1.notnull() & ts2.notnull()

ts1 = ts1.where(valid_values, drop=True)
ts2 = ts2.where(valid_values, drop=True)
I would have thought that we should be testing that we get a similar result from pandas' cov and xarray's cov without much preparation. Can you help me understand why we need to drop values here? What's the test case that fails when we don't have these lines?
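A minimal sketch of the comparison being asked for, assuming the final xr.cov behavior and illustrative data: pandas' Series.cov already excludes pairs with missing values, so it can serve as the reference without any pre-dropping.

```python
import numpy as np
import xarray as xr

ts1 = xr.DataArray([1.0, np.nan, 3.0, 4.0, 5.0], dims="time")
ts2 = xr.DataArray([2.0, 1.0, np.nan, 5.0, 4.0], dims="time")

# Series.cov skips pairs where either value is NaN, so no preparation is needed
expected = ts1.to_series().cov(ts2.to_series())
actual = xr.cov(ts1, ts2)

np.testing.assert_allclose(actual, expected)
```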
    dims=("a", "time", "x"),
)

arrays = [
The issue with putting these at the module level (and one of the benefits of fixtures) is that each time this file is imported—even when no relevant test is being run—this will execute, and so it'll slow down rapid test collection. Could we put this in a fixture? I can help show you how if it's not clear. Thanks @r-beer
Yes, I planned to make a fixture out of it. (How) Can I then easily use this fixture within parametrize?
The following code does not work, as the function is not iterable:
@pytest.fixture()
def array_tuples():
    da = xr.DataArray(
        np.random.random((3, 21, 4)),
        coords={"time": pd.date_range("2000-01-01", freq="1D", periods=21)},
        dims=("a", "time", "x"),
    )
    arrays = [
        da.isel(time=range(0, 18)),
        da.isel(time=range(2, 20)).rolling(time=3, center=True).mean(dim="time"),
        xr.DataArray([0, np.nan, 1, 2, np.nan, 3, 4, 5, np.nan, 6, 7], dims="time"),
        xr.DataArray([1, 1, np.nan, 2, np.nan, 3, 5, 4, 6, np.nan, 7], dims="time"),
        xr.DataArray([[1, 2], [1, np.nan]], dims=["x", "time"]),
        xr.DataArray([[1, 2], [np.nan, np.nan]], dims=["x", "time"]),
    ]
    array_tuples = [
        (arrays[0], arrays[0]),
        (arrays[0], arrays[1]),
        (arrays[1], arrays[1]),
        (arrays[2], arrays[2]),
        (arrays[2], arrays[3]),
        (arrays[3], arrays[3]),
        (arrays[4], arrays[4]),
        (arrays[4], arrays[5]),
        (arrays[5], arrays[5]),
    ]
    return array_tuples


@pytest.mark.parametrize("da_a, da_b", array_tuples)
@pytest.mark.parametrize("dim", [None, "time", "x"])
def test_cov(da_a, da_b, dim):
    def pandas_cov(ts1, ts2):
Errors with:
E TypeError: 'function' object is not iterable
If you want to use array_tuples as a fixture, you need the test function to take it as an argument, not to pass it as an argument to parametrize. (I know the multiple ways of parametrizing can be confusing.)
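A minimal sketch of the pattern keewis describes, with simplified illustrative arrays: the fixture is requested by naming it as a test argument, while parametrize only carries plain values such as dim.

```python
import numpy as np
import pytest
import xarray as xr


@pytest.fixture()
def array_tuples():
    arrays = [
        xr.DataArray([0.0, np.nan, 1.0, 2.0], dims="time"),
        xr.DataArray([1.0, 1.0, np.nan, 2.0], dims="time"),
    ]
    return [(a, b) for a in arrays for b in arrays]


@pytest.mark.parametrize("dim", [None, "time"])
def test_cov(array_tuples, dim):
    # the fixture arrives as an argument; pytest evaluates it for us
    for da_a, da_b in array_tuples:
        actual = xr.cov(da_a, da_b, dim=dim)
        assert actual.notnull().any()  # placeholder check for the sketch
```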
5P2 combinations is fine for performance, particularly if the inputs are differentiated and so each test has marginal value. It does limit you to tests with a known correct implementation, though (you're not going to want to write out the results for all of those!)
If you want to use array_tuples as a fixture, you need the test function to take it as an argument, not to pass it as an argument to parametrize. (I know the multiple ways of parametrizing can be confusing.)
That means that all the permutation is done within the fixture? It's not possible to put the fixture into parametrize at all?
5P2 combinations is fine for performance, particularly if the inputs are differentiated and so each test has marginal value. It does limit you to tests with a known correct implementation, though (you're not going to want to write out the results for all of those!)
Haha, yeah, don't want to write all those by hand. 🙂
I did not completely get the part about "differentiation" and "marginal test value". Could you write one or two more sentences about it?
Thanks a lot for all the explanations!
array_tuples = [
    (arrays[0], arrays[0]),
    (arrays[0], arrays[1]),
    (arrays[1], arrays[1]),
    (arrays[2], arrays[2]),
    (arrays[2], arrays[3]),
    (arrays[3], arrays[3]),
    (arrays[4], arrays[4]),
    (arrays[4], arrays[5]),
    (arrays[5], arrays[5]),
]
This looks like a good use case for itertools.permutations. So this could be
array_tuples = [
    (arrays[first], arrays[second])
    for first, second in itertools.permutations(range(len(arrays)), 2)
]
which means that you don't have to update array_tuples if you decide you have to add another item to arrays.
Edit: github apparently misunderstands multi-line suggestions, you'd have to edit by hand.
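For reference, a quick illustration of the two itertools helpers that come up in the next few comments; note that neither produces the (i, i) self-pairs present in the original list, which is part of why the suggestion is withdrawn below.

```python
import itertools

print(list(itertools.permutations(range(3), 2)))
# [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]  -- ordered pairs
print(list(itertools.combinations(range(3), 2)))
# [(0, 1), (0, 2), (1, 2)]  -- unordered pairs
```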
If you want all permutations of the two variables, just pass them as separate parameters with pytest
sorry, I mistook combinations with permutations. Looking at it again, it seems that the pattern is some sort of "combine with the same index and the previous one", which is not what combinations would yield. So never mind.
Thanks for the feedback. I have a question: should all possible data arrays be allowed? If so, instead of using array_tuples we can use what I previously tested (but didn't commit):
@pytest.mark.parametrize("da_b", arrays)
@pytest.mark.parametrize("da_a", arrays)
@pytest.mark.parametrize("dim", [None, "time", "x"])
def test_cov(da_a, da_b, dim):
This is what @max-sixty means, right?
However, wouldn't using all combinations blow up the number of test functions unnecessarily? On the other hand, not using array_tuples would be much more readable.
@@ -1047,6 +1047,175 @@ def earth_mover_distance(first_samples,
    return apply_array_ufunc(func, *args, dask=dask)


def cov(da_a, da_b, dim=None, ddof=1):
skipna=None?
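Presumably the reviewer is suggesting a skipna argument like other xarray reductions have. A hedged sketch of such a signature (the merged version, per the squashed commit log near the end, fixed skipna=False internally instead, so this is illustrative only):

```python
def cov(da_a, da_b, dim=None, ddof=1, skipna=None):
    """Hypothetical variant of cov with a skipna argument.

    skipna=None would follow the usual xarray reduction convention:
    skip missing values for float dtypes, keep them otherwise.
    """
    valid_values = da_a.notnull() & da_b.notnull()
    da_a = da_a.where(valid_values)
    da_b = da_b.where(valid_values)
    demeaned = (da_a - da_a.mean(dim=dim, skipna=skipna)) * (
        da_b - da_b.mean(dim=dim, skipna=skipna)
    )
    return demeaned.sum(dim=dim, skipna=skipna) / (valid_values.sum(dim) - ddof)
```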
# 3. Compute mean and standard deviation along the given dim
demeaned_da_a = da_a - da_a.mean(dim=dim)
demeaned_da_b = da_b - da_b.mean(dim=dim)
Why not demean in place, and avoid additional copies of the full arrays? Aren't da_a and da_b already copies?
da_a_std = da_a.std(dim=dim)
da_b_std = da_b.std(dim=dim)

corr = cov(da_a, da_b, dim=dim, ddof=ddof) / (da_a_std * da_b_std)
This is making a zillion copies of the full arrays. Please try to avoid unnecessary copies.
I think we're more likely to have a working performant feature by making it work first, and then making it performant...
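The squashed commit log below mentions that cov and corr were eventually refactored into a single internal helper. A hedged sketch of that shape (names and details assumed, not the merged code), which also addresses the copy concern by computing the masked and demeaned intermediates once and reusing them:

```python
import xarray as xr


def _cov_corr(da_a, da_b, dim=None, ddof=0, method="cov"):
    # 1. Broadcast and mask, so both arrays share a shape and missing pairs
    da_a, da_b = xr.broadcast(da_a, da_b)
    valid_values = da_a.notnull() & da_b.notnull()
    da_a = da_a.where(valid_values)
    da_b = da_b.where(valid_values)
    valid_count = valid_values.sum(dim) - ddof

    # 2. Demean once; cov and corr share these intermediates
    demeaned_da_a = da_a - da_a.mean(dim=dim)
    demeaned_da_b = da_b - da_b.mean(dim=dim)
    cov = (demeaned_da_a * demeaned_da_b).sum(dim=dim) / valid_count
    if method == "cov":
        return cov

    # 3. For corr, divide by the standard deviations of the masked arrays;
    # ddof=0 keeps this consistent with std's default (a public cov wrapper
    # would forward ddof=1 to this helper)
    return cov / (da_a.std(dim=dim) * da_b.std(dim=dim))
```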
* Added chunks='auto' option in dataset.py
* reverted accidental changes in dataset.chunk()
* Added corr and cov to computation.py. Taken from r-beer:xarray/corr
* Added r-beer's tests to test_computation.py. Still issues I think
* trying to fix github.com//pull/3550#discussion_r349935731
* Removing drop=True from the `.where()` calls in `computation.py`+test.py
* api.rst and whats-new.rst
* Updated `xarray/__init__.py` and added `broadcast` import to computation
* added DataArray import to corr, cov
* assert_allclose added to test_computation.py
* removed whitespace in test_dask...oops
* Added to init
* format changes
* Fiddling around with cov/corr tests in `test_computation.py`
* PEP8 changes
* pep
* remove old todo and comments
* isort
* Added consistency check between corr() and cov(), ensure they give same
* added `skipna=False` to `computation.py`. made consistency+autocov tests
* formatting
* Added numpy-based tests.
* format
* formatting again
* Update doc/whats-new.rst (Co-authored-by: keewis <keewis@users.noreply.github.com>)
* refactored corr/cov so there is one internal method for calculating both
* formatting
* updating docstrings and code suggestions from PR
* paramterize ddof in tests
* removed extraneous test arrays
* formatting + adding deterministic docstring
* added test for TypeError
* formatting
* tidying up docstring
* formatting and tidying up `_cov_corr()` so that the logic is more clear
* flake8 ...

Co-authored-by: keewis <keewis@users.noreply.github.com>
@r-beer in case you come back to see this: thank you for taking it so far; your code was helpful in eventually getting this feature in. And we'd of course appreciate any additional contributions.
Dear maintainers,
this is my first PR and contribution to xarray. I have read the contributing guidelines and did my best to conform to your code of conduct. However, I would be happy if you could take a closer look at whether everything is correct and give feedback for improvement!
As I don't have push access to @hrishikeshac's fork, I created this PR to add the final changes for PR #2652.
- Passes black . && mypy . && flake8
- Fully documented, including whats-new.rst for all changes and api.rst for new API
- Updated whats-new.rst and api.rst for corr() and cov()