Add standard deviation as a function #566
Conversation
skfda/exploratory/stats/_stats.py
Outdated
@std.register(FDataBasis)
def std_fdatabasis(X: FDataBasis, ddof: int = 1) -> FDataBasis:
[pep8] reported by reviewdog 🐶
WPS473 Found too many empty lines in def: 8 > 7
Please follow the pep8 advice.
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##           develop     #566      +/-   ##
===========================================
+ Coverage    86.03%   86.10%   +0.06%
===========================================
  Files          148      149       +1
  Lines        11719    11789      +70
===========================================
+ Hits         10083    10151      +68
- Misses        1636     1638       +2

☔ View full report in Codecov by Sentry.
* function_to_fdatabasis to _utils
* docstring in each std function
* use pytest instead of unittest
…rn multivalued (vector) outputs. Fix some pep8 issues.
skfda/_utils/_utils.py
Outdated
FDataBasis: FDataBasis with calculated coefficients and the new
    basis.
"""
from .. import FDataBasis
[pep8] reported by reviewdog 🐶
WPS442 Found outer scope names shadowing: FDataBasis
if isinstance(f, FDataBasis) and f.basis == new_basis:
    return f

inner_prod = inner_product_matrix(
🚫 [mypy] reported by reviewdog 🐶
Value of type variable "Vector" of "inner_product_matrix" cannot be "object" [type-var]
This is because the typing of inner_product and inner_product_matrix is wrong. It should accept callables too. This should be fixed in another PR.
Don't inner_product and inner_product_matrix accept callables already?
scikit-fda/skfda/misc/_math.py
Lines 214 to 221 in 0457e39
def inner_product(
    arg1: Vector,
    arg2: Vector,
    *,
    _matrix: bool = False,
    _domain_range: Optional[DomainRange] = None,
    **kwargs: Any,
) -> NDArrayFloat:
scikit-fda/skfda/misc/_math.py
Lines 506 to 510 in 0457e39
def inner_product_matrix(
    arg1: Vector,
    arg2: Optional[Vector] = None,
    **kwargs: Any,
) -> NDArrayFloat:
And
scikit-fda/skfda/misc/_math.py
Lines 22 to 25 in 0457e39
Vector = TypeVar(
    "Vector",
    bound=Union[NDArrayFloat, Basis, Callable[[NDArrayFloat], NDArrayFloat]],
)
They accept them at runtime. However, they are typed as accepting Vector, a protocol which defines sum and scalar product. Normal callables do not define these operations (FData are an exception).
skfda/_utils/_utils.py
Outdated
from ..misc._math import inner_product_matrix

if isinstance(f, FDataBasis) and f.basis == new_basis:
    return f
I am not sure if we should return a copy here or not. I am worried that not copying can cause subtle bugs in the future, as the semantics of this function imply that a new FDataBasis is returned.
I think returning a copy is safer. Should it be a deep copy?
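For reference, a hedged sketch of the copy-on-early-return alternative being discussed (the function name below is hypothetical; only the early-return branch is shown):

import numpy as np  # noqa: F401  (kept for parity with the surrounding code)
from skfda import FDataBasis

def to_basis_early_return(f, new_basis):
    # Hypothetical illustration: return a copy instead of the input object,
    # so the caller never receives an alias of f.
    if isinstance(f, FDataBasis) and f.basis == new_basis:
        return f.copy()  # FData.copy returns a new FDataBasis object
    raise NotImplementedError  # the basis-projection branch is omitted here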
skfda/tests/test_stats_std.py
Outdated
grid = np.linspace(start, stop, n_features)
almost_std_fd = std(fd.to_grid(grid)).to_basis(fourier_basis)

inner_grid_limit = n_features // 10
Honestly, I do not get what you are trying to do in this test.
I think it was not well designed and that the newer tests are clearer. I removed this one.
skfda/exploratory/stats/_stats.py
Outdated
def std_function(t_points: NDArrayFloat) -> NDArrayFloat:  # noqa: WPS430
    basis_evaluation = basis(t_points).reshape((basis.n_basis, -1))
    std_values = np.sqrt(
        np.diag(basis_evaluation.T @ coeff_cov_matrix @ basis_evaluation),
Is there a better way that does not need to compute the whole matrix just to keep the diagonal?
There are three alternatives that I can think of:
- diag:
  std_values = np.sqrt(
      np.diag(basis_evaluation.T @ coeff_cov_matrix @ basis_evaluation),
  )
- einsum:
  std_values = np.sqrt(
      np.einsum("bi,bd,di->i", basis_evaluation, coeff_cov_matrix, basis_evaluation),
  )
- sum:
  std_values = np.sqrt(np.sum(
      basis_evaluation * (coeff_cov_matrix @ basis_evaluation), axis=0,
  ))
After doing some quick profiling, I cannot see any difference between the three in terms of time. I think the 'diag' version is the clearest by being the most explicit.
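For what it's worth, a small self-contained check that the three expressions agree (toy shapes and random data below are assumptions, not taken from the PR):

import numpy as np

rng = np.random.default_rng(0)
n_basis, n_points = 7, 100
basis_evaluation = rng.normal(size=(n_basis, n_points))
A = rng.normal(size=(n_basis, n_basis))
coeff_cov_matrix = A @ A.T  # symmetric positive semi-definite toy "covariance"

diag_version = np.diag(basis_evaluation.T @ coeff_cov_matrix @ basis_evaluation)
einsum_version = np.einsum(
    "bi,bd,di->i", basis_evaluation, coeff_cov_matrix, basis_evaluation,
)
sum_version = np.sum(basis_evaluation * (coeff_cov_matrix @ basis_evaluation), axis=0)

assert np.allclose(diag_version, einsum_version)
assert np.allclose(diag_version, sum_version)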
Hello,
I pre-post from my phone in order to speak before a potential merge :)
I just wanted to express a small disagreement with @pcuestas.

After doing some quick profiling, I cannot see any difference between the three in terms of time.

I personally found a consistent 60~80% performance difference (diag_time / sum_time) on my single cpu for NxN matrices with N > 64.
Before showing the results yesterday, I wanted to render nicer graphs with more samples, uncertainty, etc… and ended up not having the time to post. And I probably won't have time for another 5 days (hence the teaser comment 🤣).
I will definitely come back to you with a proper visual in a few days!
One thing to note: if A, B, C are C-contiguous matrices, then
np.sum(A @ B * C.T, axis=-1)
is 10%~20% slower than
np.sum(A * (C.T @ B.T), axis=-1)
for the same output. I included this gain in the first mentioned 60%~80% overall gain.
I'll be back :)
Cheers!
My "quick profiling" was definitely not specific enough. I was measuring the performance of the std_fdatabasis function and not the actual numpy operations. I'll share more accurate measurements when I have them ready.
The sum option seems to be faster than diag. Einsum is the slowest by a factor of ~10.
I have compared the execution time of the following functions (which output the same result):
# b is ndarray of shape (B, n_points)
# C is ndarray of shape (B, B)
# m_result is ndarray of shape (n_points,)

def m_diag(b, C):
    return np.diag(b.T @ C @ b)

def m_sum(b, C):
    return np.sum(b * (C @ b), axis=0)

def m_sum_T(b, C):
    return np.sum(b.T * (b.T @ C.T), axis=1)

def m_einsum(b, C):  # discarded due to being ~10 times slower than the others
    return np.einsum('bi,bd,di->i', b, C, b)
by doing:
def time_func(func, *args, n_times):
    start = time.time()
    for _ in range(n_times):
        func(*args)
    end = time.time()
    return end - start
Full code I have written for this is in https://gist.github.com/pcuestas/46be0f88aa4a14d481845cc5d646c8e0
Times and speedups that I got (time data also present in previously linked gist):
It seems that for big enough matrix sizes the speedup of sum over diag stabilizes around 1.5 (sum is 50% faster than diag) on my machine.
I am not sure whether the procedure I followed was good enough; I do not have much experience with this.
Hello, I can see that you profiled with time.time, which is generally considered bad practice for profiling. People tend to use timeit or time.perf_counter directly (which is what timeit uses by default), for more precision. Also, I think it would be beneficial to plot the results on a log scale to interpret them more easily.
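As a minimal sketch of that suggestion (matrix size, repetition counts, and the function names below are arbitrary assumptions), something along these lines could be used:

import timeit

import numpy as np

rng = np.random.default_rng(0)
N = 256
b = rng.normal(size=(N, N))
C = rng.normal(size=(N, N))

def m_diag():
    return np.diag(b.T @ C @ b)

def m_sum():
    return np.sum(b * (C @ b), axis=0)

for name, func in [("diag", m_diag), ("sum", m_sum)]:
    # timeit uses time.perf_counter by default; keep the best of several runs.
    runs = timeit.repeat(func, number=100, repeat=5)
    print(name, min(runs))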
For my part, testing these:
def diag(A, B, C):
    return np.diag(A @ B @ C)

def sum1(A, B, C):
    return np.sum(A @ B * C.T, axis=-1)

def sum2(A, B, C):
    return np.sum(A * (C.T @ B.T), axis=-1)
over multiple runs (cf graph), I got the following results with barely noticeable uncertainty (Q1 ↔ Q3) areas:
We can observe that, depending on the size of the matrices, it is not always obvious what will happen. What is sure is that the C-contiguity can indeed be leveraged, no matter the size of the matrices.
On a final note, it may be relevant to play with rectangular matrices if yours don't tend to be square (idk?)
You should probably report the minimum instead of the median, according to the following comment in the timeit.repeat documentation:
It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.
It probably won't change much, though.
As for the methods compared, another approach would be to compute the Cholesky decomposition of the coefficient covariance matrix beforehand.
The suggested computation by precalculating the Cholesky decomposition has basically the same performance as the sum version. The reason why this happens is that both require the same operations (once the Cholesky decomposition has been precomputed):
def m_sum(b, C):
    return np.sum(b * (C @ b), axis=0)

def m_cholesky(b, L_T):
    """C = L @ L.T and L_T = L.T"""
    return np.sum((L_T @ b)**2, axis=0)
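A quick sanity check of this equivalence on toy data (the shapes and the positive-definite construction below are assumptions):

import numpy as np

rng = np.random.default_rng(0)
B, n_points = 7, 50
b = rng.normal(size=(B, n_points))
A = rng.normal(size=(B, B))
C = A @ A.T + B * np.eye(B)  # symmetric positive definite, so Cholesky exists

L = np.linalg.cholesky(C)  # C = L @ L.T
sum_version = np.sum(b * (C @ b), axis=0)
cholesky_version = np.sum((L.T @ b) ** 2, axis=0)

assert np.allclose(sum_version, cholesky_version)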
As the m_cholesky version is less explicit (and requires the Cholesky decomposition of the coefficient covariance matrix), it was decided at the meeting of 2023-10-10 to use the sum version:
std_values = np.sqrt(np.sum(
    basis_evaluation * (coeff_cov_matrix @ basis_evaluation), axis=0,
))
LGTM.
Added std as a function using single dispatch in the skfda.exploratory.stats module.

The function takes two arguments: a FData object and a ddof (Delta Degrees Of Freedom) integer. By default, ddof=1. The rationale behind this decision has been that these functions (stats.cov, stats.var, stats.std) are meant to be used for statistical (exploratory) purposes, and not for Maximum Likelihood algorithms. The default is different from numpy.std (which uses ddof=0). pandas.DataFrame.std and scipy.stats.variation use ddof=1.

The approach taken to calculate the std of a FDataBasis object is the more direct one that we discussed, which does not use the covariance object. Instead, a matrix multiplication is performed to calculate the pointwise variance at an arbitrary point of the domain:

scikit-fda/skfda/exploratory/stats/_stats.py
Lines 153 to 156 in 6eada72

Tests are included to cover the general $\mathbb{R}^p \to \mathbb{R}^q$ case for both FDataGrid and FDataBasis objects.

Closes #541.
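A minimal usage sketch of the new function (the toy dataset below is an assumption, not taken from the PR):

import numpy as np

import skfda
from skfda.exploratory.stats import std

# Toy sample: five noisy sine curves observed on a common grid.
grid_points = np.linspace(0, 1, 50)
rng = np.random.default_rng(0)
data_matrix = (
    np.sin(2 * np.pi * grid_points)
    + 0.1 * rng.normal(size=(5, grid_points.size))
)
fd = skfda.FDataGrid(data_matrix=data_matrix, grid_points=grid_points)

std_fd = std(fd)  # ddof=1 by default, as described above
std_fd_mle = std(fd, ddof=0)  # population-style (ddof=0) standard deviation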