muP and Deep Double Descent #44
mathbloodprince started this conversation in General
Replies: 1 comment
-
Hi,
I've recently been interested in your Tensor Programs (TP) papers and had a question about muP and its implications for the deep double descent phenomenon.
As I understand it, muP is a parameterization scheme that lets wider networks always perform better throughout the entire training process when we scale up the model width under the "optimal" base hyperparameters. It seems as though this overcomes deep double descent, where wider models do worse up to a certain point and then do better. Do you have any intuition on DDD and its relationship to muP and SP (standard parameterization)?
I was considering investigating the details further for a university project, but wondered whether you had already encountered this comparison in your research or had any general thoughts.
Thanks!
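For concreteness, the width-scaling workflow referred to above goes roughly like the following. This is only a minimal sketch around the microsoft/mup package; the MLP class, the widths, and the learning rate are illustrative placeholders, not anything prescribed by this thread or the papers.

```python
import torch.nn as nn
import torch.nn.functional as F
from mup import MuReadout, set_base_shapes, MuAdam

class MLP(nn.Module):
    """Toy model in which the hidden width is the only thing being scaled."""
    def __init__(self, width, d_in=32, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, width)
        self.fc2 = nn.Linear(width, width)
        # The output layer is a MuReadout so it gets the muP output scaling.
        self.readout = MuReadout(width, d_out)

    def forward(self, x):
        return self.readout(F.relu(self.fc2(F.relu(self.fc1(x)))))

# A base model and a delta model at two small widths tell mup which
# dimensions grow with width (the "infinite" dimensions).
base, delta = MLP(width=64), MLP(width=128)

# The target model is the wide one that actually gets trained.
model = MLP(width=1024)
set_base_shapes(model, base, delta=delta)

# mup's optimizer wrappers apply muP's per-layer learning-rate scaling,
# so a learning rate tuned at a small width can be reused at a large width.
optimizer = MuAdam(model.parameters(), lr=1e-3)
```

The point of the setup is that the base hyperparameters tuned at a small width are meant to stay near-optimal as `width` grows, which is the "wider is always better throughout training" behaviour described above.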
-
Hi,
muP is really orthogonal to double descent. The "larger is better" phenomenon under muP is with regard to the training loss, which also translates to the test loss when training on massive amounts of data, as with large language models. Double descent, on the other hand, concerns the test loss and arises from the interplay of regularization and model capacity. You could potentially still see double descent with muP if you train on a small dataset with suboptimal regularization, but I have not tried that.
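A minimal sketch of the kind of check suggested above (small dataset, label noise, no explicit regularization, width swept under muP). Everything here, including the synthetic teacher data, the widths, the step count, and the learning rate, is a hypothetical placeholder rather than an experiment from this thread.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from mup import MuReadout, set_base_shapes, MuAdam

class MLP(nn.Module):
    def __init__(self, width, d_in=32, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, width)
        self.readout = MuReadout(width, d_out)

    def forward(self, x):
        return self.readout(F.relu(self.fc1(x)))

torch.manual_seed(0)

# Small training set, large test set, labels from a random linear "teacher",
# with 20% of the training labels flipped to random classes.
teacher = nn.Linear(32, 10)
Xtr, Xte = torch.randn(512, 32), torch.randn(8192, 32)
with torch.no_grad():
    ytr, yte = teacher(Xtr).argmax(-1), teacher(Xte).argmax(-1)
noisy = torch.rand(512) < 0.2
ytr[noisy] = torch.randint(0, 10, (int(noisy.sum()),))

test_err = {}
for width in [8, 16, 32, 64, 128, 256, 512, 1024]:
    model = MLP(width)
    # Same base/delta pair for every width so tuned hyperparameters transfer.
    set_base_shapes(model, MLP(8), delta=MLP(16))
    opt = MuAdam(model.parameters(), lr=1e-2)  # no weight decay, i.e. no explicit regularization
    for _ in range(2000):  # train to (near) interpolation of the noisy labels
        opt.zero_grad()
        F.cross_entropy(model(Xtr), ytr).backward()
        opt.step()
    with torch.no_grad():
        test_err[width] = (model(Xte).argmax(-1) != yte).float().mean().item()

print(test_err)  # a bump in test error at intermediate widths would indicate double descent
```

The small, noisy training set and the absence of weight decay are the common ingredients for making the test-error bump visible; whether muP's hyperparameter transfer changes that bump is exactly the open question raised above.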