Add standard deviation as a function #566
Conversation
skfda/exploratory/stats/_stats.py
Outdated
@std.register(FDataBasis)
def std_fdatabasis(X: FDataBasis, ddof: int = 1) -> FDataBasis:
[pep8] reported by reviewdog 🐶
WPS473 Found too many empty lines in def: 8 > 7
Please follow the pep8 advice.
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##           develop     #566      +/-   ##
===========================================
+ Coverage    86.03%   86.10%   +0.06%
===========================================
  Files          148      149       +1
  Lines        11719    11789      +70
===========================================
+ Hits         10083    10151      +68
- Misses        1636     1638       +2

☔ View full report in Codecov by Sentry.
* function_to_fdatabasis to _utils
* docstring in each std function
* use pytest instead of unittest
…rn multivalued (vector) outputs. Fix some pep8 issues.
skfda/_utils/_utils.py
Outdated
FDataBasis: FDataBasis with calculated coefficients and the new
    basis.
"""
from .. import FDataBasis
[pep8] reported by reviewdog 🐶
WPS442 Found outer scope names shadowing: FDataBasis
if isinstance(f, FDataBasis) and f.basis == new_basis:
    return f

inner_prod = inner_product_matrix(
🚫 [mypy] reported by reviewdog 🐶
Value of type variable "Vector" of "inner_product_matrix" cannot be "object" [type-var]
This is because the typing of inner_product and inner_product_matrix is wrong. It should accept callables too. This should be fixed in another PR.
Don't inner_product and inner_product_matrix accept callables already?
scikit-fda/skfda/misc/_math.py
Lines 214 to 221 in 0457e39
def inner_product(
    arg1: Vector,
    arg2: Vector,
    *,
    _matrix: bool = False,
    _domain_range: Optional[DomainRange] = None,
    **kwargs: Any,
) -> NDArrayFloat:
scikit-fda/skfda/misc/_math.py
Lines 506 to 510 in 0457e39
def inner_product_matrix(
    arg1: Vector,
    arg2: Optional[Vector] = None,
    **kwargs: Any,
) -> NDArrayFloat:
And
scikit-fda/skfda/misc/_math.py
Lines 22 to 25 in 0457e39
Vector = TypeVar(
    "Vector",
    bound=Union[NDArrayFloat, Basis, Callable[[NDArrayFloat], NDArrayFloat]],
)
They accept them at runtime. However, they are typed as accepting Vector, a protocol which defines sum and scalar product. Normal callables do not define these operations (FData are an exception).
skfda/_utils/_utils.py
Outdated
from ..misc._math import inner_product_matrix

if isinstance(f, FDataBasis) and f.basis == new_basis:
    return f
I am not sure if we should return a copy here or not. I am worried that not copying can cause subtle bugs in the future, as the semantics of this function imply that a new FDataBasis is returned.
I think returning a copy is safer. Should it be a deep copy?
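For reference, a hedged sketch of the copy-on-early-return alternative being discussed (the function name below is hypothetical; only the early-return branch is shown):

import numpy as np  # noqa: F401  (kept for parity with the surrounding code)
from skfda import FDataBasis

def to_basis_early_return(f, new_basis):
    # Hypothetical illustration: return a copy instead of the input object,
    # so the caller never receives an alias of f.
    if isinstance(f, FDataBasis) and f.basis == new_basis:
        return f.copy()  # FData.copy returns a new FDataBasis object
    raise NotImplementedError  # the basis-projection branch is omitted here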
skfda/tests/test_stats_std.py
Outdated
grid = np.linspace(start, stop, n_features)
almost_std_fd = std(fd.to_grid(grid)).to_basis(fourier_basis)

inner_grid_limit = n_features // 10
Honestly, I do not get what you are trying to do in this test.
I think it was not well designed and that the newer tests are clearer. I removed this one.
skfda/exploratory/stats/_stats.py
Outdated
def std_function(t_points: NDArrayFloat) -> NDArrayFloat:  # noqa: WPS430
    basis_evaluation = basis(t_points).reshape((basis.n_basis, -1))
    std_values = np.sqrt(
        np.diag(basis_evaluation.T @ coeff_cov_matrix @ basis_evaluation),
Is there a better way that does not need to compute the whole matrix just to keep the diagonal?
There are three alternatives that I can think of:
- diag:
  std_values = np.sqrt(
      np.diag(basis_evaluation.T @ coeff_cov_matrix @ basis_evaluation),
  )
- einsum:
  std_values = np.sqrt(
      np.einsum("bi,bd,di->i", basis_evaluation, coeff_cov_matrix, basis_evaluation),
  )
- sum:
  std_values = np.sqrt(np.sum(
      basis_evaluation * (coeff_cov_matrix @ basis_evaluation), axis=0,
  ))
After doing some quick profiling, I cannot see any difference between the three in terms of time. I think the 'diag' version is the clearest by being the most explicit.
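For what it's worth, a small self-contained check that the three expressions agree (toy shapes and random data below are assumptions, not taken from the PR):

import numpy as np

rng = np.random.default_rng(0)
n_basis, n_points = 7, 100
basis_evaluation = rng.normal(size=(n_basis, n_points))
A = rng.normal(size=(n_basis, n_basis))
coeff_cov_matrix = A @ A.T  # symmetric positive semi-definite toy "covariance"

diag_version = np.diag(basis_evaluation.T @ coeff_cov_matrix @ basis_evaluation)
einsum_version = np.einsum(
    "bi,bd,di->i", basis_evaluation, coeff_cov_matrix, basis_evaluation,
)
sum_version = np.sum(basis_evaluation * (coeff_cov_matrix @ basis_evaluation), axis=0)

assert np.allclose(diag_version, einsum_version)
assert np.allclose(diag_version, sum_version)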
Hello,
I pre-post from my phone in order to speak before a potential merge :)
I just wanted to express a small disagreement with @pcuestas.

After doing some quick profiling, I cannot see any difference between the three in terms of time.

I personally found a consistent 60~80% performance difference (diag_time / sum_time) on my single cpu for NxN matrices with N > 64.
Before showing the results yesterday, I wanted to render nicer graphs with more samples, uncertainty, etc… and ended up not having the time to post. And I probably won't have time for another 5 days (hence the teaser comment 🤣).
I will definitely come back to you with a proper visual in a few days!
One thing to note: if A, B, C are C-contiguous matrices, then
np.sum(A @ B * C.T, axis=-1)
is 10%~20% slower than
np.sum(A * (C.T @ B.T), axis=-1)
for the same output. I included this gain in the first mentioned 60%~80% overall gain.
I'll be back :)
Cheers!
My "quick profiling" was definitely not specific enough. I was measuring the performance of the std_fdatabasis function and not the actual numpy operations. I'll share more accurate measurements when I have them ready.
The sum option seems to be faster than diag. Einsum is the slowest by a factor of ~10.
I have compared the execution time of the following functions (which output the same result):
# b is ndarray of shape (B, n_points)
# C is ndarray of shape (B, B)
# m_result is ndarray of shape (n_points,)

def m_diag(b, C):
    return np.diag(b.T @ C @ b)

def m_sum(b, C):
    return np.sum(b * (C @ b), axis=0)

def m_sum_T(b, C):
    return np.sum(b.T * (b.T @ C.T), axis=1)

def m_einsum(b, C):  # discarded due to being ~10 times slower than the others
    return np.einsum('bi,bd,di->i', b, C, b)
by doing:
def time_func(func, *args, n_times):
    start = time.time()
    for _ in range(n_times):
        func(*args)
    end = time.time()
    return end - start
Full code I have written for this is in https://gist.github.com/pcuestas/46be0f88aa4a14d481845cc5d646c8e0
Times and speedups that I got (time data also present in previously linked gist):
It seems that for big enough matrix sizes the speedup of sum over diag stabilizes around 1.5 (sum is 50% faster than diag) on my machine.
I am not sure whether the procedure I followed was good enough; I do not have much experience with this.
Hello, I can see that you profiled with time.time, which is generally considered bad practice for profiling. People tend to use timeit or time.perf_counter directly (which is what timeit uses by default), for more precision. Also, I think it would be beneficial to plot the results on a log scale to interpret them more easily.
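As a minimal sketch of that suggestion (matrix size, repetition counts, and the function names below are arbitrary assumptions), something along these lines could be used:

import timeit

import numpy as np

rng = np.random.default_rng(0)
N = 256
b = rng.normal(size=(N, N))
C = rng.normal(size=(N, N))

def m_diag():
    return np.diag(b.T @ C @ b)

def m_sum():
    return np.sum(b * (C @ b), axis=0)

for name, func in [("diag", m_diag), ("sum", m_sum)]:
    # timeit uses time.perf_counter by default; keep the best of several runs.
    runs = timeit.repeat(func, number=100, repeat=5)
    print(name, min(runs))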
For my part, testing these:
def diag(A, B, C):
    return np.diag(A @ B @ C)

def sum1(A, B, C):
    return np.sum(A @ B * C.T, axis=-1)

def sum2(A, B, C):
    return np.sum(A * (C.T @ B.T), axis=-1)
over multiple runs (cf graph), I got the following results with barely noticeable uncertainty (Q1 ↔ Q3) areas:
We can observe that, depending on the size of the matrices, it is not always obvious what will happen. What is sure is that the C-contiguity can indeed be leveraged, no matter the size of the matrices.
On a final note, it may be relevant to play with rectangular matrices if yours don't tend to be square (idk?)
You should probably report the minimum instead of the median, according to the following comment in the timeit.repeat documentation:
It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.
It probably won't change much, though.
As for the methods compared, another approach would be to compute the Cholesky decomposition of the coefficient covariance matrix beforehand.
The suggested computation by precalculating the Cholesky decomposition has basically the same performance as the sum version. The reason why this happens is that both require the same operations (once the Cholesky decomposition has been precomputed):
def m_sum(b, C):
    return np.sum(b * (C @ b), axis=0)

def m_cholesky(b, L_T):
    """C = L @ L.T and L_T = L.T"""
    return np.sum((L_T @ b)**2, axis=0)
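A quick sanity check of this equivalence on toy data (the shapes and the positive-definite construction below are assumptions):

import numpy as np

rng = np.random.default_rng(0)
B, n_points = 7, 50
b = rng.normal(size=(B, n_points))
A = rng.normal(size=(B, B))
C = A @ A.T + B * np.eye(B)  # symmetric positive definite, so Cholesky exists

L = np.linalg.cholesky(C)  # C = L @ L.T
sum_version = np.sum(b * (C @ b), axis=0)
cholesky_version = np.sum((L.T @ b) ** 2, axis=0)

assert np.allclose(sum_version, cholesky_version)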
As the m_cholesky version is less explicit (and requires the Cholesky decomposition of the coefficient covariance matrix), it was decided at the meeting of 2023-10-10 to use the sum version:
std_values = np.sqrt(np.sum(
    basis_evaluation * (coeff_cov_matrix @ basis_evaluation), axis=0,
))
LGTM.
Added std as a function using single dispatch in the skfda.exploratory.stats module.

The function takes two arguments: a FData object and a ddof (Delta Degrees Of Freedom) integer. By default, ddof=1. The rationale behind this decision has been that these functions (stats.cov, stats.var, stats.std) are meant to be used for statistical (exploratory) purposes, and not for Maximum Likelihood algorithms. The default is different from numpy.std (which uses ddof=0). pandas.DataFrame.std and scipy.stats.variation use ddof=1.

The approach taken to calculate the std of a FDataBasis object is the more direct one that we discussed, which does not use the covariance object. Instead, a matrix multiplication is performed to calculate the pointwise variance at an arbitrary point of the domain:

scikit-fda/skfda/exploratory/stats/_stats.py
Lines 153 to 156 in 6eada72

Tests are included to cover the general $\mathbb{R}^p \to \mathbb{R}^q$ case for both FDataGrid and FDataBasis objects.

Closes #541.
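A minimal usage sketch of the new function (the toy dataset below is an assumption, not taken from the PR):

import numpy as np

import skfda
from skfda.exploratory.stats import std

# Toy sample: five noisy sine curves observed on a common grid.
grid_points = np.linspace(0, 1, 50)
rng = np.random.default_rng(0)
data_matrix = (
    np.sin(2 * np.pi * grid_points)
    + 0.1 * rng.normal(size=(5, grid_points.size))
)
fd = skfda.FDataGrid(data_matrix=data_matrix, grid_points=grid_points)

std_fd = std(fd)  # ddof=1 by default, as described above
std_fd_mle = std(fd, ddof=0)  # population-style (ddof=0) standard deviation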