memory impact of the functional approach #12
You will find me in full agreement here. The original version of #9 had mutating updates, but it seemed #9 was not going to be considered at all with mutation, and I want explicit state in the picture as we add optimizer-related features to Flux (like scheduling). I just think the implicit state interface is too confusing to extend and too error-prone. We can have all the advantages outlined in the comment you linked without giving up mutation, and that's what #9 was meant to do.
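For readers following along, the two interfaces being contrasted look roughly like this. This is a minimal sketch with made-up rule names and signatures, not the actual Flux or Optimisers.jl code:

```julia
# Implicit state: the rule hides per-parameter state in an IdDict keyed
# by the parameter array itself (roughly how Flux's optimisers work today).
struct ImplicitMomentum
    eta::Float64
    rho::Float64
    velocity::IdDict{Any,Any}
end
ImplicitMomentum(eta, rho) = ImplicitMomentum(eta, rho, IdDict())

function apply!(o::ImplicitMomentum, x, dx)
    v = get!(() -> zero(x), o.velocity, x)
    @. v = o.rho * v + o.eta * dx
    return v                        # the caller then does x .-= v
end

# Explicit state: the state is created once, passed in, and returned, so
# composing rules or scheduling hyperparameters is just data flow.
struct ExplicitMomentum
    eta::Float64
    rho::Float64
end

init(o::ExplicitMomentum, x) = zero(x)

function apply(o::ExplicitMomentum, state, x, dx)
    v = @. o.rho * state + o.eta * dx   # fresh buffer; the old state is untouched
    return v, v                         # (new state, transformed gradient)
end
```

The point above is that the explicit-state interface by itself does not force immutability: the rule could just as well write into `state` in place and still return it.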
Personally I think a path that starts where Flux's optimizers currently are and slowly adds explicit state, then immutability, is a much cleaner development path. It is less likely to break user code or cause performance regressions. Instead, we're starting with something that appears at face value to be a performance non-starter and trying to work in the other direction.
I have been considering it as well, and for the most part things seem to be stable. With the IdDict approach, all those references stay alive, whereas with the explicit-state approach they do not. I'd also started with some checks in place in this branch: https://github.com/FluxML/Optimisers.jl/tree/dg/mut
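To make the liveness point concrete, here is an illustrative Base Julia snippet (not package code): state keyed by the parameter array in an IdDict keeps that array reachable even after the user drops the model, whereas an explicitly returned state tree can be collected together with it.

```julia
state = IdDict{Any,Any}()          # implicit per-parameter state
x = randn(Float32, 1000, 1000)     # a "parameter"
state[x] = zero(x)                 # its optimizer state
x = nothing                        # x is still reachable as a key of `state`,
                                   # so neither it nor its buffer can be GC'd
```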
I just realized we sometimes unconditionally mutate state as well, e.g. https://github.com/FluxML/Optimisers.jl/blob/master/src/rules.jl#L47. So in addition to #13, we should probably change the non-mutating versions to be truly non-mutating.
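Something like the following (hypothetical code, not the linked `rules.jl` line) shows the pitfall and the fix:

```julia
o = (eta = 0.01, rho = 0.9)        # stand-in hyperparameters

# A nominally "non-mutating" rule that still overwrites the state it was handed.
function apply_leaky(o, state, x, dx)
    @. state = o.rho * state + (1 - o.rho) * dx   # mutates the caller's state!
    return state, o.eta .* state
end

# Truly non-mutating: allocate the new state instead of writing into the old one.
function apply_pure(o, state, x, dx)
    v = @. o.rho * state + (1 - o.rho) * dx       # fresh array, old state untouched
    return v, o.eta .* v
end

s = zeros(4); apply_leaky(o, s, ones(4), ones(4)); s   # s has been overwritten
s = zeros(4); apply_pure(o, s, ones(4), ones(4)); s    # s is still all zeros
```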
Closed by #31
I'm a bit concerned about the approach here on master and in #9, where we create a bunch of intermediate values (e.g. optimizers do not mutate gradients but transform them), and one also ends up with a new set of weights.
Compared to the current approach in Flux, where in-place mutation is used as much as possible, one might end up with multiple copies of a ResNet before GC kicks in. Has this been discussed and benchmarked already? I'd like some reassurance here, because the advantages outlined in #9 (comment) don't seem juicy enough to counterbalance this problem.
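For concreteness, the allocation pattern being compared looks roughly like this, with hypothetical helper functions operating on a flat parameter vector (not the package API):

```julia
# Functional style: every step allocates a new state buffer and a new set
# of weights, so several model-sized arrays can be live at once until the
# GC reclaims the old ones.
function step_functional(x, dx, v; eta = 0.01, rho = 0.9)
    v = @. rho * v + eta * dx      # new state buffer
    x = x .- v                     # new weights; the old x becomes garbage
    return x, v
end

# In-place style (what Flux's optimisers currently do): buffers are
# overwritten, so memory stays at one copy of the weights plus state.
function step_inplace!(x, dx, v; eta = 0.01, rho = 0.9)
    @. v = rho * v + eta * dx
    @. x -= v
    return x, v
end
```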