Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MeanStdScaling transform #17

Merged
merged 65 commits into from
Feb 19, 2021
Merged

Add MeanStdScaling transform #17

merged 65 commits into from
Feb 19, 2021

Conversation

bencottier
Copy link
Contributor

@bencottier bencottier commented Feb 12, 2021

Closes #3

Note: the notion of dims is now consistent with mapslices and Statistics.mean/std rather than eachslice. To do that, I inverted the dims passed to the general apply! method for arrays. Statistics.mean/std was used in FeatureEngineering. Related issue: #18

@codecov
Copy link

codecov bot commented Feb 12, 2021

Codecov Report

Merging #17 (e8970b0) into main (9fd201a) will increase coverage by 4.58%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #17      +/-   ##
==========================================
+ Coverage   94.50%   99.09%   +4.58%     
==========================================
  Files           8        9       +1     
  Lines          91      110      +19     
==========================================
+ Hits           86      109      +23     
+ Misses          5        1       -4     
Impacted Files Coverage Δ
src/Transforms.jl 100.00% <ø> (ø)
src/scaling.jl 100.00% <100.00%> (ø)
src/temporal.jl 100.00% <100.00%> (ø)
src/transformers.jl 100.00% <100.00%> (+4.16%) ⬆️
src/utils.jl 87.50% <100.00%> (+87.50%) ⬆️
src/power.jl 100.00% <0.00%> (ø)
src/periodic.jl 100.00% <0.00%> (ø)
src/one_hot_encoding.jl 100.00% <0.00%> (ø)
src/linear_combination.jl 100.00% <0.00%> (ø)
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9fd201a...8730f3a. Read the comment docs.

@nicoleepp
Copy link
Contributor

nicoleepp commented Feb 12, 2021

I noticed you've been merging branches into this one, can you try rebasing instead in the future? It keeps the git history much more clean.

@bencottier
Copy link
Contributor Author

bencottier commented Feb 15, 2021

I have done some refactoring to always populate the mean and std at construction time (either precomputed or computed). I added the name keyword argument to _apply! inside the general transformers.jl methods for compatibility.

Overall it seems better - much less overloading. One thing is dims or cols has to be specified both at construction and apply if used.

@nicoleepp
Copy link
Contributor

Note: the notion of dims is consistent with eachslice. Statistics.mean/std and mapslices have the opposite notion of dims (e.g. dims=1 ~ dims=2 and dims=3 ~ dims=[1, 2]). Statistics.mean/std was used in FeatureEngineering. Related issue: #18

Why did you decide to go with dims being consistent with eachslice instead of what we are already doing in FeatureEngineering?

Copy link
Contributor

@nicoleepp nicoleepp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looking a lot cleaner now I think

@bencottier
Copy link
Contributor Author

Note: the notion of dims is consistent with eachslice. Statistics.mean/std and mapslices have the opposite notion of dims (e.g. dims=1 ~ dims=2 and dims=3 ~ dims=[1, 2]). Statistics.mean/std was used in FeatureEngineering. Related issue: #18

Why did you decide to go with dims being consistent with eachslice instead of what we are already doing in FeatureEngineering?

Because the other transforms so far use dims consistent with eachslice. But I think this is the first transform where dims actually changes the results. So given that, and the FeatureEngineering convention, I'm now leaning towards switching over to mapslices and its convention. Shall I go ahead with that?

@nicoleepp
Copy link
Contributor

Note: the notion of dims is consistent with eachslice. Statistics.mean/std and mapslices have the opposite notion of dims (e.g. dims=1 ~ dims=2 and dims=3 ~ dims=[1, 2]). Statistics.mean/std was used in FeatureEngineering. Related issue: #18

Why did you decide to go with dims being consistent with eachslice instead of what we are already doing in FeatureEngineering?

Because the other transforms so far use dims consistent with eachslice. But I think this is the first transform where dims actually changes the results. So given that, and the FeatureEngineering convention, I'm now leaning towards switching over to mapslices and its convention. Shall I go ahead with that?

The other mutating transforms follow eachslice, but the non mutating follow the mapslices convention. So yeah I think we should move everything to follow the mapslices convention for dims

Also mutating test for Vector
- Re-applying scaling to a different array
- AxisArray, KeyedArray, NamedTuple
- Other minor additions and improvements
- mean, std are either passed in or computed from data upon construction
- general apply methods pass `name` keyword to `_apply`
for compatibility
- add `kwargs...` to general `apply` for tables
and `_apply` in HoD transform, for compatibility
- add missing re-apply test for NamedTuple
- mean, std are either passed in or computed from data upon construction
- general apply methods pass `name` keyword to `_apply`
for compatibility
- add `kwargs...` to general `apply` for tables
and `_apply` in HoD transform, for compatibility
- add missing re-apply test for NamedTuple
Required changing the convention for dims from eachslice to mapslices
Fix dims test names after change in convention
missed in rebase
@bencottier
Copy link
Contributor Author

Sorry, I made a mess of commits trying to rebase after #19 . I will squash before merging.

@nicoleepp nicoleepp merged commit 05c9355 into main Feb 19, 2021
@nicoleepp nicoleepp deleted the bc/scaling branch February 19, 2021 19:32
@nicoleepp nicoleepp mentioned this pull request Feb 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Normalisation / Scaling transform
3 participants