update memory monitor #1940

Merged on Feb 8, 2023 (11 commits into mosaicml:dev)

@mvpatel2000 mvpatel2000 commented Feb 3, 2023

What does this PR do?

Updates the memory monitor to report values in GB, which is much more readable. Also fixes it to track currently allocated memory instead of the cumulative total (which isn't really useful).
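As a rough illustration of the change described above (a sketch, not the code from this PR): the function below takes a stats dictionary shaped like the one `torch.cuda.memory_stats()` returns, picks the `.current` counters rather than the cumulative `.allocated` ones, and converts byte values to GB. The `STAT_NAMES` mapping, the `summarize_memory` helper, the decimal GB divisor, and the 4-digit rounding are all assumptions made for this example.

```python
# Hypothetical mapping from torch.cuda.memory_stats() keys to short log names.
# The ".current" suffix selects memory held right now; ".allocated" would be
# the cumulative total since the start of the program.
STAT_NAMES = {
    "allocated_bytes.all.current": "allocated_mem",
    "active_bytes.all.current": "active_mem",
    "inactive_split_bytes.all.current": "inactive_mem",
    "reserved_bytes.all.current": "reserved_mem",
    "num_alloc_retries": "alloc_retries",
}

BYTES_PER_GB = 1e9  # assumption: decimal gigabytes, not GiB (2**30)


def summarize_memory(stats):
    """Return a {"memory/<name>": value} dict with byte counters in GB."""
    report = {}
    for key, name in STAT_NAMES.items():
        if key not in stats:
            continue  # skip counters the allocator did not report
        value = stats[key]
        if "bytes" in key:
            # Byte counters become GB; plain counts (retries) stay as-is.
            value = round(value / BYTES_PER_GB, 4)
        report[f"memory/{name}"] = value
    return report
```

On a machine with a CUDA device this could be called as `summarize_memory(torch.cuda.memory_stats())`; any plain dictionary with the same keys works for testing.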

What issue(s) does this change relate to?

CO-1740

@bandish-shah

Could we post an example of the new output in the description before we merge this PR?

@dakinggg dakinggg self-requested a review February 3, 2023 21:24

@dakinggg dakinggg left a comment

Blocking to look at the output and approve

@mvpatel2000

[image]

@mvpatel2000 mvpatel2000 requested a review from dakinggg February 8, 2023 03:29

@dakinggg dakinggg left a comment

Could you also paste in what this looks like when you log to console? Otherwise LGTM and will approve after that.

composer/callbacks/memory_monitor.py (outdated review thread, resolved)
Co-authored-by: Daniel King <43149077+dakinggg@users.noreply.github.com>
@mvpatel2000

```
	 Train trainer/global_step: 95
	 Train trainer/batch_idx: 95
	 Train memory/allocated_mem: 22.9840
	 Train memory/active_mem: 22.9840
	 Train memory/inactive_mem: 5.3172
	 Train memory/reserved_mem: 38.9250
	 Train memory/alloc_retries: 0
	 Train trainer/device_train_microbatch_size: 10
	 Train loss/train/total: 0.1528
	 Train wall_clock/train: 627.0995
	 Train wall_clock/val: 0.0000
	 Train wall_clock/total: 627.0995
```


@dakinggg dakinggg left a comment

Thanks!

@mvpatel2000 mvpatel2000 merged commit 6a9d088 into mosaicml:dev Feb 8, 2023
@mvpatel2000 mvpatel2000 deleted the mvpatel2000/memory branch February 8, 2023 22:46