muP and Deep Double Descent #44
mathbloodprince started this conversation in General
Replies: 1 comment
-
Hi,
I've recently been interested in your Tensor Programs (TP) papers and had a question about muP and its implications for the deep double descent phenomenon.
As I understand it, muP is a parameterization scheme that lets wider networks always perform better throughout the entire training process when we scale up the model width under the "optimal" base hyperparameters. It seems as though this overcomes deep double descent, where wider models do worse up to a certain point and then do better. Do you have any intuition on DDD and its relationship to muP and SP (standard parameterization)?
I was considering investigating the details further for a university project, but wondered whether you had already encountered this comparison in your research or had any general thoughts.
Thanks!
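For concreteness, the width-scaling workflow referred to above goes roughly like the following. This is only a minimal sketch around the microsoft/mup package; the MLP class, the widths, and the learning rate are illustrative placeholders, not anything prescribed by this thread or the papers.

```python
import torch.nn as nn
import torch.nn.functional as F
from mup import MuReadout, set_base_shapes, MuAdam

class MLP(nn.Module):
    """Toy model in which the hidden width is the only thing being scaled."""
    def __init__(self, width, d_in=32, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, width)
        self.fc2 = nn.Linear(width, width)
        # The output layer is a MuReadout so it gets the muP output scaling.
        self.readout = MuReadout(width, d_out)

    def forward(self, x):
        return self.readout(F.relu(self.fc2(F.relu(self.fc1(x)))))

# A base model and a delta model at two small widths tell mup which
# dimensions grow with width (the "infinite" dimensions).
base, delta = MLP(width=64), MLP(width=128)

# The target model is the wide one that actually gets trained.
model = MLP(width=1024)
set_base_shapes(model, base, delta=delta)

# mup's optimizer wrappers apply muP's per-layer learning-rate scaling,
# so a learning rate tuned at a small width can be reused at a large width.
optimizer = MuAdam(model.parameters(), lr=1e-3)
```

The point of the setup is that the base hyperparameters tuned at a small width are meant to stay near-optimal as `width` grows, which is the "wider is always better throughout training" behaviour described above.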
-
Hi,
muP is really orthogonal to double descent. The "larger is better" phenomenon under muP is with regard to the training loss, which also translates to the test loss when training on massive amounts of data, as with large language models. Double descent, on the other hand, concerns the test loss and arises from the interplay of regularization and model capacity. You could potentially still see double descent with muP if you train on a small dataset with suboptimal regularization, but I have not tried that.
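A minimal sketch of the kind of check suggested above (small dataset, label noise, no explicit regularization, width swept under muP). Everything here, including the synthetic teacher data, the widths, the step count, and the learning rate, is a hypothetical placeholder rather than an experiment from this thread.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from mup import MuReadout, set_base_shapes, MuAdam

class MLP(nn.Module):
    def __init__(self, width, d_in=32, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, width)
        self.readout = MuReadout(width, d_out)

    def forward(self, x):
        return self.readout(F.relu(self.fc1(x)))

torch.manual_seed(0)

# Small training set, large test set, labels from a random linear "teacher",
# with 20% of the training labels flipped to random classes.
teacher = nn.Linear(32, 10)
Xtr, Xte = torch.randn(512, 32), torch.randn(8192, 32)
with torch.no_grad():
    ytr, yte = teacher(Xtr).argmax(-1), teacher(Xte).argmax(-1)
noisy = torch.rand(512) < 0.2
ytr[noisy] = torch.randint(0, 10, (int(noisy.sum()),))

test_err = {}
for width in [8, 16, 32, 64, 128, 256, 512, 1024]:
    model = MLP(width)
    # Same base/delta pair for every width so tuned hyperparameters transfer.
    set_base_shapes(model, MLP(8), delta=MLP(16))
    opt = MuAdam(model.parameters(), lr=1e-2)  # no weight decay, i.e. no explicit regularization
    for _ in range(2000):  # train to (near) interpolation of the noisy labels
        opt.zero_grad()
        F.cross_entropy(model(Xtr), ytr).backward()
        opt.step()
    with torch.no_grad():
        test_err[width] = (model(Xte).argmax(-1) != yte).float().mean().item()

print(test_err)  # a bump in test error at intermediate widths would indicate double descent
```

The small, noisy training set and the absence of weight decay are the common ingredients for making the test-error bump visible; whether muP's hyperparameter transfer changes that bump is exactly the open question raised above.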