Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add standard deviation as a function #566

Merged
merged 33 commits into from
Oct 13, 2023

Conversation

pcuestas
Copy link
Contributor

@pcuestas pcuestas commented Aug 17, 2023

Added std as a function using single dispatch in the skfda.exploratory.stats module.

The function takes two arguments: a FData object and a ddof (Delta Degrees Of Freedom) integer. By default, ddof=1. Rationale behind this decision has been that these functions (stats.cov, stats.var, stats.std) are meant to be used for statistical (exploratory) purposes, and not for Maximum Likelihood algorithms. The default is different from numpy.std (which uses ddof=1). pandas.DataFrame.std and scipy.stats.variation use ddof=1.

The approach taken to calculate the std of a FDataBasis object is the more direct one that we discussed which does not use the covariance object. Instead, a matrix multiplication is performed to calculate the pointwise variance on an arbitrary point of the domain:

basis_evaluation = basis(t_points).reshape((-1, 1))
return np.sqrt(
basis_evaluation.T @ coeff_matrix @ basis_evaluation
).reshape((1, -1, 1))

Tests included to cover the general $\mathbb{R}^p \to \mathbb{R}^q$ case for both FDataGrid and FDataBasis objects.

Closes #541.

@pcuestas pcuestas requested a review from vnmabus August 17, 2023 17:01
@pcuestas pcuestas linked an issue Aug 17, 2023 that may be closed by this pull request
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved


@std.register(FDataBasis)
def std_fdatabasis(X: FDataBasis, ddof: int = 1) -> FDataBasis:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[pep8] reported by reviewdog 🐶
WPS473 Found too many empty lines in def: 8 > 7

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the pep8 advice.

skfda/tests/test_stats.py Outdated Show resolved Hide resolved
skfda/tests/test_stats.py Outdated Show resolved Hide resolved
skfda/tests/test_stats.py Outdated Show resolved Hide resolved
skfda/tests/test_stats.py Outdated Show resolved Hide resolved
skfda/tests/test_stats.py Outdated Show resolved Hide resolved
skfda/misc/_math.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Aug 17, 2023

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (4fe909c) 86.03% compared to head (3115a6d) 86.10%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #566      +/-   ##
===========================================
+ Coverage    86.03%   86.10%   +0.06%     
===========================================
  Files          148      149       +1     
  Lines        11719    11789      +70     
===========================================
+ Hits         10083    10151      +68     
- Misses        1636     1638       +2     
Files Coverage Δ
skfda/tests/test_stats_std.py 100.00% <100.00%> (ø)
skfda/exploratory/stats/_stats.py 87.17% <94.44%> (+1.93%) ⬆️
skfda/_utils/_utils.py 82.29% <80.00%> (+0.22%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

skfda/tests/test_stats.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved


@std.register(FDataBasis)
def std_fdatabasis(X: FDataBasis, ddof: int = 1) -> FDataBasis:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the pep8 advice.

skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
skfda/misc/_math.py Outdated Show resolved Hide resolved
skfda/tests/test_stats.py Outdated Show resolved Hide resolved
skfda/tests/test_stats.py Outdated Show resolved Hide resolved
* function_to_fdatabasis to _utils
* docstring in each std function
* use pytest instead of unittest
skfda/_utils/_utils.py Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
skfda/tests/test_stats.py Outdated Show resolved Hide resolved
skfda/tests/test_stats.py Outdated Show resolved Hide resolved
skfda/tests/test_stats.py Outdated Show resolved Hide resolved
skfda/tests/test_stats.py Outdated Show resolved Hide resolved
skfda/tests/test_stats.py Outdated Show resolved Hide resolved
skfda/tests/test_stats.py Outdated Show resolved Hide resolved
skfda/tests/test_stats.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Show resolved Hide resolved
skfda/_utils/_utils.py Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
FDataBasis: FDataBasis with calculated coefficients and the new
basis.
"""
from .. import FDataBasis
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[pep8] reported by reviewdog 🐶
WPS442 Found outer scope names shadowing: FDataBasis

if isinstance(f, FDataBasis) and f.basis == new_basis:
return f

inner_prod = inner_product_matrix(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [mypy] reported by reviewdog 🐶
Value of type variable "Vector" of "inner_product_matrix" cannot be "object" [type-var]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because the typing of inner_product and inner_product_matrix is wrong. It should accept callables too. This should be fixed in another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't inner_product and inner_product_matrix accept callables already?

def inner_product(
arg1: Vector,
arg2: Vector,
*,
_matrix: bool = False,
_domain_range: Optional[DomainRange] = None,
**kwargs: Any,
) -> NDArrayFloat:

def inner_product_matrix(
arg1: Vector,
arg2: Optional[Vector] = None,
**kwargs: Any,
) -> NDArrayFloat:

And

Vector = TypeVar(
"Vector",
bound=Union[NDArrayFloat, Basis, Callable[[NDArrayFloat], NDArrayFloat]],
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They accept them at runtime. However, they are typed as accepting Vector, a protocol which defines sum and scalar product. Normal callables do not define these operations (FData are an exception).

skfda/_utils/_utils.py Outdated Show resolved Hide resolved
from ..misc._math import inner_product_matrix

if isinstance(f, FDataBasis) and f.basis == new_basis:
return f
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we should return a copy here or not. I am worried that not copying can cause subtle bugs in the future, as the semantics of this function imply that a new FDataBasis is returned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think returning a copy is safer. Should it be a deep copy?

if isinstance(f, FDataBasis) and f.basis == new_basis:
return f

inner_prod = inner_product_matrix(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because the typing of inner_product and inner_product_matrix is wrong. It should accept callables too. This should be fixed in another PR.

skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
grid = np.linspace(start, stop, n_features)
almost_std_fd = std(fd.to_grid(grid)).to_basis(fourier_basis)

inner_grid_limit = n_features // 10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I do not get what you are trying to do in this test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was not well designed and that the newer tests are clearer. I removed this one.

skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/exploratory/stats/_stats.py Outdated Show resolved Hide resolved
def std_function(t_points: NDArrayFloat) -> NDArrayFloat: # noqa: WPS430
basis_evaluation = basis(t_points).reshape((basis.n_basis, -1))
std_values = np.sqrt(
np.diag(basis_evaluation.T @ coeff_cov_matrix @ basis_evaluation),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better way that does not need to compute the whole matrix just to keep the diagonal?

Copy link
Contributor Author

@pcuestas pcuestas Oct 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are three alternatives that I can think of:

  • diag:
        std_values = np.sqrt(
            np.diag(basis_evaluation.T @ coeff_cov_matrix @ basis_evaluation),
        )
  • einsum:
        std_values = np.sqrt(
            np.einsum("bi,bd,di->i", basis_evaluation, coeff_cov_matrix, basis_evaluation),
        )
  • sum:
        std_values = np.sqrt(np.sum(
            basis_evaluation * (coeff_cov_matrix @ basis_evaluation), axis=0,
        ))

After doing some quick profiling, I cannot see any difference between the three in terms of time. I think the 'diag' version is the clearest by being the most explicit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello,

I pre-post from my phone in order to speak before a potential a merge :)

I just wanted to express a small disagreement with @pcuestas.

After doing some quick profiling, I cannot see any difference between the three in terms of time.

I personally found a consistent 60~80% performance difference (diag_time / sum_time) on my single cpu for NxN matrices with N > 64.

Before showing the results yesterday, I wanted to render nicer graphs with more samples, uncertainty, etc… and ended up not having the time to post. And I probably won’t have enough time before another 5 days now (hence the teaser comment 🤣).

I will definitely come back to you with a proper visual in a few days!

One thing to note: if A, B, C are C-contiguous matrices, then

np.sum(A @ B * C.T, axis=-1)

is 10%~20% slower than

np.sum(A * C.T @ B.T, axis=-1)

For the same output. I included this gain in the first mentioned 60%~80% overall gain.

I'll be back :)
Cheers!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My "quick profiling" was definitely not specific enough. I was measuring the performance of the std_fdatabasis function and not the actual numpy operations. I'll share more accurate measurements when I have them ready.

The sum option seems to be faster than diag. Einsum is the slowest by a factor of ~10.

Copy link
Contributor Author

@pcuestas pcuestas Oct 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have compared the execution time of the following functions (which output the same result):

# b is ndarray of shape (B, n_points)
# C is ndarray of shape (B, B)
# m_result is ndarray of shape (n_points,)

def m_diag(b, C):
    return np.diag(b.T @ C @ b)


def m_sum(b, C):
    return np.sum(b * (C @ b), axis=0)


def m_sum_T(b, C):
    return np.sum(b.T * (b.T @ C.T), axis=1)


def m_einsum(b, C):   # discarded due to being ~10 times slower than the others
    return np.einsum('bi,bd,di->i', b, C, b)

by doing:

def time_func(func, *args, n_times):
    start = time.time()
    for _ in range(n_times):
        func(*args)
    end = time.time()
    return end - start

Full code I have written for this is in https://gist.github.com/pcuestas/46be0f88aa4a14d481845cc5d646c8e0

Times and speedups that I got (time data also present in previously linked gist):

Figure_1

It seems that for big enough matrix sizes the speedup of sum over diag stabilizes around 1.5 (sum is 50% faster than diag) on my machine.

I am not sure whether the procedure I followed was good enough, I do not have much experience with this.

Copy link
Contributor

@eliegoudout eliegoudout Oct 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello, I can see thaty you profiled with time.time, which is generally considered bad practice for profiling. People tend to use timeit or time.perf_counter directly (which is used by timeit by default), for more precision. Also, it would be beneficial to plot the results on a logscale to interpret them more easily I think.

For my part, testing these:

def diag(A, B, C):
    return np.diag(A @ B @ C)

def sum1(A, B, C):
    return np.sum(A @ B * C.T, axis=-1)

def sum2(A, B, C):
    return np.sum(A * (C.T @ B.T), axis=-1)

over multiple runs (cf graph), I got the following results with barely noticeable uncertainty (Q1 ↔ Q3) areas:
profiling

We can observe that depending on the size of the matrices, it is not always obvious what will happen. But what is sure, is that indeed, the C-contiguousity can be leveraged, no matter the size of the matrices.

On a final note, it may be relevant to play with rectangle matrices if yours don't tend to be square (idk?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should probably report the minimum instead of median, according to the following comment in timeit.repeat:

It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.

It probably won't change much, though.

As for the methods compared, another approach would be to compute the Cholesky decomposition of $B = LL^T$ outside the inner function, given that it is a covariance matrix. Then we could do $C = L^T A$ and finally compute the sum of the rows of the elementwise square $C^2$ (the squared norm of C rows). This uses the fact that $A^TBA = C^T C$ and the fact that the diagonal of that is the vector of squared norms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggested computation by precalculating the Cholesky decomposition has basically the same performance as the sum version. The reason why this happens is that both require the same operations (once the Cholesky decomposition has been precomputed):

def m_sum(b, C):
    return np.sum(b * (C @ b), axis=0)

def m_cholesky(b, L_T):
    """C = L @ L.T and L_T = L.T"""
    return np.sum((L_T @ b)**2, axis=0)

As the m_cholesky version is less explicit (and requires the Cholesky decomposition of the coefficient covariances matrix), it was decided on the meeting from 2023-10-10 to use the sum version:

std_values = np.sqrt(np.sum(
    basis_evaluation * (coeff_cov_matrix @ basis_evaluation), axis=0,
))

skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
skfda/tests/test_stats_std.py Show resolved Hide resolved
skfda/tests/test_stats_std.py Outdated Show resolved Hide resolved
@pcuestas pcuestas requested a review from vnmabus October 12, 2023 10:06
Copy link
Member

@vnmabus vnmabus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@vnmabus vnmabus merged commit 14b1416 into develop Oct 13, 2023
15 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add standard deviation
3 participants