Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connect training progress to csv #185

Merged
merged 21 commits into from
Feb 24, 2024

Conversation

saeliddp
Copy link
Collaborator

Context

When running training, our progress bar currently just revolves infinitely, which is not useful for gauging how long the process will continue. This PR connects the progress bar to the CSV metrics file, which is created and updated by cyto-dl during training. When the metrics file is modified, we inspect the CSV file for the latest epoch and update the progress bar with latest_epoch_in_csv / total_number_of_epochs as the percent complete.

Changes

  1. core/view.py: create a new type of QThread that emits signals containing task progress (UI thread will pick these up and make UI changes)
  2. core/progress_tracker.py: superclass for 'progress trackers'
  3. training/metrics_csv_progress_tracker.py: implementation of progress tracker for metrics CSV files
  4. training/metrics_csv_event_handler.py: implementation of a watchdog file event handler specific to metrics CSV files

Thread Diagram

This change relies on a mixture of python Threads and PyQt QThreads. The visualization here makes it a bit easier to understand what's going on. Effectively, the MetricsCSVProgressTracker serves as a shared resource between the threads, and updates to the UI are triggered by updates to the tracker (which are in turn triggered by updates to the CSV file). Note that originally, I had my watchdog thread try to call a function to update the UI directly, but this strategy failed since the UI thread will not accept any UI-changing calls from non-QThreads.

diagram

Testing

Added unit tests and supporting files in _tests. Manually tested that the progress bar updated with 2, 3, and 4 epochs of training.

Further Issues

  1. The last epoch appears to take much longer than the others on the progress bar (I think because cyto-dl is running evaluation pipelines once the last epoch has finished).
  2. Not sure how to effectively test the startLongTaskWithProgressBar function in core/view.py even though it has important thread cleanup logic.

@saeliddp
Copy link
Collaborator Author

saeliddp commented Feb 20, 2024

After I made this PR, I had another thought that maybe we could implement this as a pub/sub pattern and remove the need for the awkward progress thread with the sleep function. The idea is that watchdog thread would have a callback to progress tracker, which would publish an event; then view would subscribe to that event and change progress bar when the event happens.

Update: ended up running into the same problem where dispatching an event that ends up changing the UI from a non-UI thread causes an exception. This makes sense after looking into our implementation of publisher.

@yrkim98
Copy link
Collaborator

yrkim98 commented Feb 21, 2024

A couple comments from me, but awesome work on your first ml-segmenter contribution! Thanks :)

@hughes036
Copy link
Collaborator

The last epoch appears to take much longer than the others on the progress bar (I think because cyto-dl is running evaluation pipelines once the last epoch has finished).

Sounds like something to bring up with @benjijamorris . Maybe there is another file which gets updated by the post-final epoch processes (testing), and the watch could be moved to that ?

@hughes036
Copy link
Collaborator

Update: ended up running into the same problem where dispatching an event that ends up changing the UI from a non-UI thread causes an exception. This makes sense after looking into our implementation of publisher.

Yeah, our homegrown pub/sub frame work is really just for decoupling our UI from business logic, making testing easier.

@hughes036
Copy link
Collaborator

The thread diagram is great - can add it to a new doc under /docs?

Copy link
Collaborator

@hughes036 hughes036 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great - nice work.

Im interested to go over a few things, mainly for my own understanding.
Also, a few nits and small requests, and want to make sure that an import you commented out is intentional (i left a comment).

@saeliddp
Copy link
Collaborator Author

The thread diagram is great - can add it to a new doc under /docs?
Added here

Copy link
Collaborator

@hughes036 hughes036 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, the responses to every comment and for including links to commits. All very helpful, as a reviewer.

Awesome work on this pr. Good balance of quality code, while not going overboard with design, and getting it done quickly.

Ship it!

@saeliddp saeliddp merged commit 048f5f0 into main Feb 24, 2024
12 checks passed
@saeliddp saeliddp deleted the feature/connect_training_progress_to_csv branch February 24, 2024 00:14
@saeliddp saeliddp mentioned this pull request Mar 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants