-
Notifications
You must be signed in to change notification settings - Fork 433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatic Stochastic depth on residual blocks #1253
Conversation
c07ae9f
to
3e9bf11
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! It LGTM from the stochastic depth side. A few nits up to your discretion
I'm a lot less confident on the fx graph surgery implementation. It looks reasonable to me, but not sure if there are alternative implementations? It would be great to get someone from eng to review this as well.
One hesitation before approving: would it be possible to create a test for apply_stochastic_residual
? It be nice to check that the graph surgery went as expected for say ResNet18 and ResNet50 e.g. counting the number of stochastic blocks after surgery and checking it aligns with the number of residual blocks before surgery
Oh, you just pushed a test 😂 Is this still being updated? |
Just the test was missing so added it a little while ago. |
3e9bf11
to
9aea8e9
Compare
Added a test for resnet18 that counts residual blocks as well as runs it with 0.0 drop_rate to make sure pre and post surgery outputs are the same. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to defer the correctness review to @Landanjs / @bandish-shah, but see comments re: code styling.
9aea8e9
to
e9b7418
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, a few style nits, and a main few questions in the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e9b7418
to
bebe805
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Dismissing my review; don't have much to add)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one small nit. Defer to @Landanjs to review for correctness.
Right now apply_stochastic_residual
is not actually used in the stochastic depth
algorithm. I assume this will be deferred to follow-on PR? Should we do a training run with this to see if matches the performance of the old apply?
bebe805
to
71bbd74
Compare
It will be applied in a followup PR along with experiments. However, I did a training run with r50 baseline (77.32 Top-1 in 132.8 mins) and r50 baseline with fx-traced module (77.25 in 125.4 mins). We shouldn't read too much into the times here as it could be due to streaming datasets. I don't see much issues when this is applied with stochastic depth algorithm but, as always, composition with others might pose some issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stochastic depth implementation LGTM. Like Hanlin mentioned the best test would be before and after experiments, but those will happen in a follow up PR 🙂
Detect residual blocks automatically and replace them with a stochastic version. See attached graphs for how transformation happens on ResNet50.
TODO: Add tests.DoneOrig:

Split into bottleneck and downsample:

Converted with a call to block_stochastic_module:

Closes https://mosaicml.atlassian.net/browse/CO-303