refactor download tests #7546

Merged (3 commits) on May 2, 2023

@pmeier pmeier commented May 1, 2023

Our current download tests for the datasets are overcomplicated in some parts:

  1. We can select where to patch the download functionality: in torchvision.datasets.utils or in any other module (usually the module in which the dataset is defined). This is problematic because some datasets mix download_url and download_and_extract_archive, so we have to patch the download functionality in multiple places.
  2. Instead of mocking the download functionality, we can also just spy on it: the call is still logged, but the data is actually downloaded. This is never used anywhere.
  3. We can disable mocking of auxiliary functions like extract_archive. This is never used.
  4. In addition to the URLs, we also collect the MD5 checksums. This could be used to check whether the MD5 checksums on record match the downloaded files, but the test that would do so is empty.

This PR does the following to resolve the points above:

  1. Always try to patch the download functionality in both torchvision.datasets.utils and the dataset module.
  2. Remove the option.
  3. Remove the option.
  4. Remove the logging of MD5 checksums and the test that would require them.
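
For illustration only (not code from this PR): a minimal sketch of why patching a single module is not enough when a dataset mixes both helpers. The modules fake_utils and fake_dataset and the function download_dataset are hypothetical stand-ins for torchvision.datasets.utils and a dataset module; only the name-binding behavior of unittest.mock.patch is the point here.

```python
import contextlib
import sys
import types
from unittest import mock

# Hypothetical stand-in for torchvision.datasets.utils.
UTILS_SOURCE = '''
def download_url(url):
    raise RuntimeError("real download of " + url)

def download_and_extract_archive(url):
    # Looks up download_url in this module's globals at call time.
    download_url(url)
'''

# Hypothetical dataset module that mixes both helpers.
DATASET_SOURCE = '''
from fake_utils import download_and_extract_archive, download_url

def download_dataset():
    download_and_extract_archive("https://example.com/images.tar")
    download_url("https://example.com/annotations.txt")
'''


def make_module(name, source):
    """Create an importable module object from a source string."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    sys.modules[name] = module
    return module


make_module("fake_utils", UTILS_SOURCE)
dataset = make_module("fake_dataset", DATASET_SOURCE)


def attempt(*targets):
    """Patch the given targets, then report whether a real download slipped through."""
    with contextlib.ExitStack() as stack:
        for target in targets:
            stack.enter_context(mock.patch(target))
        try:
            dataset.download_dataset()
        except RuntimeError:
            return "real download attempted"
    return "fully mocked"


print(attempt("fake_utils.download_url"))
print(attempt("fake_dataset.download_url"))
print(attempt("fake_utils.download_url", "fake_dataset.download_url"))
```

Patching only one of the two locations leaves the other binding of download_url untouched, so one of the two calls still reaches the real function. Only patching both covers the mixed usage.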

Since the diff is pretty large, I'll highlight the important parts inline.

pytorch-bot bot commented May 1, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/7546

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

❌ 32 New Failures, 2 Unrelated Failures

As of commit 72d04b3:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base 6381f7b:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@@ -84,47 +78,45 @@ def inner_wrapper(request, *args, **kwargs):

@contextlib.contextmanager
def log_download_attempts(
    urls_and_md5s=None,
    file="utils",

See 1. in top comment.

@@ -84,47 +78,45 @@ def inner_wrapper(request, *args, **kwargs):

@contextlib.contextmanager
def log_download_attempts(
    urls_and_md5s=None,
    file="utils",
    patch=True,

See 2. in top comment.
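
As a generic illustration of the mock-versus-spy distinction from point 2 (this uses plain unittest.mock and is not the removed torchvision code; download_url below is a hypothetical stand-in):

```python
from unittest import mock


def download_url(url):
    # Hypothetical stand-in for a function that really downloads something.
    return f"downloaded {url}"


# Mock: the call is recorded, but the real function never runs.
mocked = mock.Mock()
mocked_result = mocked("https://example.com/file")

# Spy: the call is recorded AND forwarded to the real function.
spied = mock.Mock(wraps=download_url)
spied_result = spied("https://example.com/file")

mocked.assert_called_once_with("https://example.com/file")
spied.assert_called_once_with("https://example.com/file")
print(type(mocked_result).__name__, "|", spied_result)
```

In a test suite whose purpose is to avoid real downloads, the spy variant defeats that purpose, which is consistent with the option never having been used.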

    urls_and_md5s=None,
    file="utils",
    patch=True,
    mock_auxiliaries=None,

See 3. in top comment.

Comment on lines +97 to +107
download_url_mocks = []
download_file_from_google_drive_mocks = []
for module in [dataset_module, "utils"]:
    maybe_add_mock(module=module, name="download_url", stack=stack, lst=download_url_mocks)
    maybe_add_mock(
        module=module,
        name="download_file_from_google_drive",
        stack=stack,
        lst=download_file_from_google_drive_mocks,
    )
    maybe_add_mock(module=module, name="extract_archive", stack=stack)

Resolution for 1. in top comment
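
The body of maybe_add_mock is not shown in this excerpt. A minimal sketch of how such a helper could behave, assuming it patches a name only if the module actually has it and otherwise skips silently. The modules fake_utils and fake_dataset are hypothetical, and the real helper may differ:

```python
import contextlib
import sys
import types
from unittest import mock

# Hypothetical stand-in modules; the dataset module only imported download_url.
utils = types.ModuleType("fake_utils")
utils.download_url = lambda url: None
utils.extract_archive = lambda path: None
dataset = types.ModuleType("fake_dataset")
dataset.download_url = utils.download_url
sys.modules.update(fake_utils=utils, fake_dataset=dataset)


def maybe_add_mock(*, module, name, stack, lst=None):
    """Patch `module.name` for the lifetime of `stack`, skipping missing names."""
    try:
        patched = stack.enter_context(mock.patch(f"{module}.{name}"))
    except AttributeError:
        # mock.patch raises AttributeError if the module lacks the attribute.
        return
    if lst is not None:
        lst.append(patched)


download_url_mocks = []
with contextlib.ExitStack() as stack:
    for module in ["fake_dataset", "fake_utils"]:
        maybe_add_mock(module=module, name="download_url", stack=stack, lst=download_url_mocks)
        # Missing in fake_dataset, present in fake_utils: patched only where it exists.
        maybe_add_mock(module=module, name="extract_archive", stack=stack)

    dataset.download_url("https://example.com/file")
    calls = [m.call_args for m in download_url_mocks if m.called]

print(len(download_url_mocks), calls)
```

Using an ExitStack means every patch that was successfully entered is undone together when the block exits, regardless of how many names were skipped.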

Comment on lines -502 to -504
@pytest.mark.parametrize(**make_parametrize_kwargs(itertools.chain()))
def test_file_downloads_correctly(url, md5):
    retry(lambda: assert_file_downloads_correctly(url, md5))

Was never run due to the empty parametrization

lambda: datasets.Places365(ROOT, split=split, small=small, download=True),
name=f"Places365, {split}, {'small' if small else 'large'}",
file="places365",
return itertools.chain.from_iterable(

Just a style nit: instead of itertools.chain(*[]), we now use itertools.chain.from_iterable([]), which is made for this very use case.
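
A small sketch of the equivalence, with a hypothetical configs_for helper standing in for the per-split config collection:

```python
import itertools


def configs_for(split):
    # Hypothetical stand-in for collecting the download configs of one split.
    return [f"{split}-small", f"{split}-large"]


groups = [configs_for(split) for split in ("train", "val")]

# Unpacking requires the whole list to exist before chain is called.
unpacked = list(itertools.chain(*groups))

# from_iterable takes the outer iterable directly and can consume it lazily.
flattened = list(itertools.chain.from_iterable(groups))
lazy = list(itertools.chain.from_iterable(configs_for(s) for s in ("train", "val")))

print(flattened)
```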

return itertools.chain(
    *[
        collect_download_configs(
            lambda: datasets.Places365(ROOT, split=split, small=small, download=True),

We only ever used the callable to parametrize the dataset class call. Thus, we can simplify this by accepting the class as well as the args and kwargs. This also lets us automatically detect the dataset module, since it is available through the class. Note that we can't use functools.partial here for this reason.
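
To illustrate the comment above, a rough sketch of what accepting the class buys us. FakePlaces365 and describe_dataset_call are hypothetical names and not the helpers from this PR; the assumption is that the module and display name are derived from the class's __module__ and __name__:

```python
class FakePlaces365:
    # Pretend this class lives in torchvision/datasets/places365.py.
    __module__ = "torchvision.datasets.places365"

    def __init__(self, root, split="train", small=False, download=False):
        self.root, self.split, self.small = root, split, small


def describe_dataset_call(dataset_cls, *args, **kwargs):
    """Derive the module to patch and a display name from the class itself."""
    return {
        "dataset_module": dataset_cls.__module__.split(".")[-1],
        "name": dataset_cls.__name__,
        "call": lambda: dataset_cls(*args, **kwargs),
    }


info = describe_dataset_call(FakePlaces365, "root", split="val", small=True, download=True)
print(info["dataset_module"], info["name"])

# An opaque lambda carries none of this information.
factory = lambda: FakePlaces365("root", split="val", small=True, download=True)
print(factory.__name__)
```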

Comment on lines -230 to -231
name=f"Places365, {split}, {'small' if small else 'large'}",
file="places365",

Both of them can now be auto-detected

kitti(),
places365(),
)
def stanford_cars():

Driveby for #7545 (comment). Issue was not detected since we didn't test for it before. I'm wondering if we should add a test that checks if all datasets that have a download parameter are present here. We could also just check this manually now and enforce it through the review if a new dataset is added. However, this is what we were supposed to do before and it seems we didn't do it.
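
One possible shape for the check floated above, purely as a sketch. The dataset classes and the find_untested helper are hypothetical, and nothing like this is added by the PR:

```python
import inspect


class WithDownload:
    def __init__(self, root, download=False):
        pass


class AlsoWithDownload:
    def __init__(self, root, split="train", download=False):
        pass


class WithoutDownload:
    def __init__(self, root):
        pass


def has_download_parameter(cls):
    """True if the constructor accepts a parameter named `download`."""
    return "download" in inspect.signature(cls.__init__).parameters


def find_untested(all_datasets, tested):
    """Names of downloadable datasets that are missing from the download tests."""
    return sorted(
        cls.__name__
        for cls in all_datasets
        if has_download_parameter(cls) and cls not in tested
    )


missing = find_untested(
    [WithDownload, AlsoWithDownload, WithoutDownload],
    tested={WithDownload},
)
print(missing)
```

A test asserting that this list is empty would fail as soon as a new downloadable dataset is added without a matching download test.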

@pytest.mark.parametrize(
    **make_parametrize_kwargs(
        itertools.chain(
            sbu(),  # https://github.com/pytorch/vision/issues/7005

Due to "mixed download usage" (see 1. in top comment), we never detected that this download works again through this test. The issue is long closed, but the test never failed.

@NicolasHug NicolasHug left a comment

Thanks @pmeier, this is mostly a stamp. I only looked at your comments and didn't check the code in detail.

pmeier commented May 2, 2023

CI is toasted in general, but the download tests are green: https://github.com/pytorch/vision/actions/runs/4860402390/jobs/8664179225

@pmeier pmeier merged commit e946e87 into pytorch:main May 2, 2023
@pmeier pmeier deleted the refactor-download-tests branch May 2, 2023 11:08

github-actions bot commented May 2, 2023

Hey @pmeier!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

facebook-github-bot pushed a commit that referenced this pull request May 15, 2023
Reviewed By: vmoens

Differential Revision: D45522828

fbshipit-source-id: 8aaef29d49b656035c8c859036c252499a6145c4