refactor download tests #7546

Merged

pmeier merged 3 commits into pytorch:main from pmeier:refactor-download-tests

May 2, 2023

Collaborator

pmeier commented May 1, 2023

Our current download tests for the datasets are overcomplicated in some parts:

1. We can select where to patch the download functionality: in torchvision.datasets.utils or in any other module (usually the module in which the dataset is defined). This is problematic because some datasets mix download_url and download_and_extract_archive, meaning we have to patch download functionality in multiple places.
2. Instead of mocking the download functionality, we can merely spy on it, meaning the call is still logged but the data is actually downloaded. This is never used anywhere.
3. We can disable mocking of auxiliary functions like extract_archive. This is never used.
4. In addition to the URLs, we also collect the MD5 checksums. This could be used to check whether our MD5 checksums on record match the downloaded file, but the test is empty.
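To illustrate the difference between mocking and spying described above, here is a minimal sketch using only unittest.mock. The fake_utils module and its download_url function are hypothetical stand-ins, not the torchvision implementation.

```python
import types
from unittest import mock

# Hypothetical stand-in for a module that exposes a download helper.
fake_utils = types.ModuleType("fake_utils")


def download_url(url, root):
    # Pretend to download; a real helper would hit the network here.
    return f"downloaded {url} to {root}"


fake_utils.download_url = download_url

# Mocking: the call is recorded, but the real function never runs.
with mock.patch.object(fake_utils, "download_url") as mocked:
    mocked_result = fake_utils.download_url("https://example.com/a.zip", "/tmp")
mocked_calls = mocked.call_args_list

# Spying: the call is recorded and also forwarded to the real function.
with mock.patch.object(fake_utils, "download_url", wraps=download_url) as spy:
    spied_result = fake_utils.download_url("https://example.com/a.zip", "/tmp")
spied_calls = spy.call_args_list
```

Both variants log the same call; only the spy returns the real function's result.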

This PR does the following to resolve the points above:

1. Always try to patch download functionalities in both the utilities module and the dataset module.
2. Remove the option to spy instead of mock.
3. Remove the option to disable mocking of auxiliary functions.
4. Remove the logging of MD5 checksums and the test that would require them.

Since the diff is pretty large, I'll highlight the important parts inline.


          refactor download tests

9ea6dd2

pmeier added module: datasets module: tests labels

pmeier requested a review from NicolasHug

May 1, 2023 08:37

pytorch-bot bot commented May 1, 2023 •

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/7546

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

❌ 32 New Failures, 2 Unrelated Failures

As of commit 72d04b3:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base 6381f7b:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the cla signed label

pmeier commented

test/test_datasets_download.py

@@ -84,47 +78,45 @@ def inner_wrapper(request, *args, **kwargs):
               @contextlib.contextmanager
               def log_download_attempts(
-                  urls_and_md5s=None,
-                  file="utils",

Collaborator Author

pmeier May 1, 2023

See 1. in top comment.

test/test_datasets_download.py

@@ -84,47 +78,45 @@ def inner_wrapper(request, *args, **kwargs):
               @contextlib.contextmanager
               def log_download_attempts(
-                  urls_and_md5s=None,
-                  file="utils",
-                  patch=True,

Collaborator Author

pmeier May 1, 2023

See 2. in top comment.

test/test_datasets_download.py

-                  urls_and_md5s=None,
-                  file="utils",
-                  patch=True,
-                  mock_auxiliaries=None,

Collaborator Author

pmeier May 1, 2023

See 3. in top comment.

test/test_datasets_download.py

Comment on lines +97 to +107

+                      download_url_mocks = []
+                      download_file_from_google_drive_mocks = []
+                      for module in [dataset_module, "utils"]:
+                          maybe_add_mock(module=module, name="download_url", stack=stack, lst=download_url_mocks)
+                          maybe_add_mock(
+                              module=module,
+                              name="download_file_from_google_drive",
+                              stack=stack,
+                              lst=download_file_from_google_drive_mocks,
+                          )
+                          maybe_add_mock(module=module, name="extract_archive", stack=stack)

Collaborator Author

pmeier May 1, 2023

Resolution for 1. in top comment
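As a rough illustration of the idea in the snippet above, the following sketch patches a function in every module that exposes it and collects the resulting mocks. The body of maybe_add_mock is an assumption (the diff excerpt only shows its call sites), and the modules here are hypothetical stand-ins passed as objects rather than names.

```python
import contextlib
import types
from unittest import mock

# Hypothetical stand-ins for torchvision.datasets.utils and a dataset module.
utils = types.ModuleType("utils")
utils.download_url = lambda url, root: None
utils.extract_archive = lambda path: None

dataset_module = types.ModuleType("some_dataset")
# The dataset module imported download_url by name, so it holds its own reference
# that is unaffected by patching utils.download_url alone.
dataset_module.download_url = utils.download_url


def maybe_add_mock(*, module, name, stack, lst=None):
    """Patch module.name if it exists; optionally remember the mock in lst."""
    if not hasattr(module, name):
        return
    mocked = stack.enter_context(mock.patch.object(module, name))
    if lst is not None:
        lst.append(mocked)


with contextlib.ExitStack() as stack:
    download_url_mocks = []
    for module in [dataset_module, utils]:
        maybe_add_mock(module=module, name="download_url", stack=stack, lst=download_url_mocks)
        maybe_add_mock(module=module, name="extract_archive", stack=stack)

    # One call through each module; both are caught regardless of entry point.
    dataset_module.download_url("https://example.com/a.zip", "root")
    utils.download_url("https://example.com/b.zip", "root")

    urls = [call.args[0] for m in download_url_mocks for call in m.call_args_list]
```

Because the patches are registered on an ExitStack, they are all undone together when the with block exits.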

test/test_datasets_download.py

Comment on lines -502 to -504

-              @pytest.mark.parametrize(**make_parametrize_kwargs(itertools.chain()))
-              def test_file_downloads_correctly(url, md5):
-                  retry(lambda: assert_file_downloads_correctly(url, md5))

Collaborator Author

pmeier May 1, 2023

Was never run due to the empty parametrization

test/test_datasets_download.py

-                              lambda: datasets.Places365(ROOT, split=split, small=small, download=True),
-                              name=f"Places365, {split}, {'small' if small else 'large'}",
-                              file="places365",
+                  return itertools.chain.from_iterable(

Collaborator Author

pmeier May 1, 2023

Just a style nit: Instead of itertools.chain(*[]) we now use itertools.chain.from_iterable([]) which is made for this very specific use case.
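A small sketch of the equivalence, with made-up placeholder data in place of the real download configs:

```python
import itertools

# Placeholder per-dataset config lists; the real ones hold download configs.
configs_per_dataset = [["a1", "a2"], ["b1"], []]

# Unpacking requires the outer iterable to be fully materialized first.
unpacked = list(itertools.chain(*configs_per_dataset))

# from_iterable consumes the outer iterable lazily and reads more directly.
flattened = list(itertools.chain.from_iterable(configs_per_dataset))

# It also accepts a generator without building an intermediate list.
from_generator = list(
    itertools.chain.from_iterable([f"{name}{i}" for i in range(2)] for name in "xy")
)
```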

test/test_datasets_download.py

-                  return itertools.chain(
-                      *[
-                          collect_download_configs(
-                              lambda: datasets.Places365(ROOT, split=split, small=small, download=True),

Collaborator Author

pmeier May 1, 2023

We only ever used the callable to parametrize the dataset class call. Thus, we can simplify this by accepting the class as well as the args and kwargs. This also allows us to automatically detect the dataset module, since it is available through the class. Note that we can't use functools.partial here for this reason.
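A minimal sketch of why passing the class plus its arguments helps: the name and defining module can be read off the class directly. FakePlaces365 and describe_config are hypothetical names for illustration, not the helpers used in the test file.

```python
import functools


class FakePlaces365:
    """Hypothetical stand-in for a dataset class."""

    def __init__(self, root, split="train", small=False, download=False):
        self.root, self.split, self.small = root, split, small


def describe_config(dataset_cls, *args, **kwargs):
    """Derive a readable id and the defining module from the class itself."""
    params = ", ".join([*map(repr, args), *(f"{k}={v!r}" for k, v in kwargs.items())])
    return {
        "id": f"{dataset_cls.__name__}({params})",
        "module": dataset_cls.__module__,
        "factory": lambda: dataset_cls(*args, **kwargs),
    }


config = describe_config(FakePlaces365, "root", split="val", small=True)
dataset = config["factory"]()

# A partial object wraps the class and does not carry the class's __name__.
partial_has_name = hasattr(functools.partial(FakePlaces365, "root"), "__name__")
```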

test/test_datasets_download.py

Comment on lines -230 to -231

-                              name=f"Places365, {split}, {'small' if small else 'large'}",
-                              file="places365",

Collaborator Author

pmeier May 1, 2023

Both of them can now be auto-detected

test/test_datasets_download.py

-                          kitti(),
-                          places365(),
-                      )
+              def stanford_cars():

Collaborator Author

pmeier May 1, 2023

Drive-by fix for #7545 (comment). The issue was not detected since we didn't test for it before. I'm wondering if we should add a test that checks whether all datasets that have a download parameter are present here. We could also just check this manually now and enforce it through review whenever a new dataset is added. However, that is what we were supposed to do before, and it seems we didn't do it.
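One possible shape for such a check, sketched with hypothetical dataset classes and helper names; how the real test would enumerate the datasets and the tested set is left open here.

```python
import inspect


# Hypothetical dataset classes standing in for the real ones.
class WithDownload:
    def __init__(self, root, download=False): ...


class WithoutDownload:
    def __init__(self, root): ...


class AlsoWithDownload:
    def __init__(self, root, split="train", download=False): ...


def has_download_parameter(cls):
    """True if the constructor accepts a parameter named 'download'."""
    return "download" in inspect.signature(cls.__init__).parameters


def missing_from_download_tests(all_datasets, tested_datasets):
    """Names of downloadable datasets that the download tests do not cover."""
    return sorted(
        cls.__name__
        for cls in all_datasets
        if has_download_parameter(cls) and cls not in tested_datasets
    )


missing = missing_from_download_tests(
    all_datasets=[WithDownload, WithoutDownload, AlsoWithDownload],
    tested_datasets={WithDownload},
)
```

A test built on this idea would assert that the missing list is empty.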

test/test_datasets_download.py

-              @pytest.mark.parametrize(
-                  **make_parametrize_kwargs(
-                      itertools.chain(
-                          sbu(),  # https://github.com/pytorch/vision/issues/7005

Collaborator Author

pmeier May 1, 2023

Due to "mixed download usage" (see 1. in top comment), we never detected that this download works again through this test. The issue is long closed, but the test never failed.

NicolasHug approved these changes

Member

NicolasHug left a comment

Thanks @pmeier, this is mostly a stamp; I only looked at your comments and didn't check the code in detail.


          Merge branch 'main' into refactor-download-tests

b6b2ad4

Collaborator Author

pmeier commented May 2, 2023

CI is toasted in general, but the download tests are green: https://github.com/pytorch/vision/actions/runs/4860402390/jobs/8664179225


          Merge branch 'main' into refactor-download-tests

72d04b3

pmeier merged commit e946e87 into pytorch:main

pmeier deleted the refactor-download-tests branch

May 2, 2023 11:08

github-actions bot commented May 2, 2023

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

NicolasHug added the enhancement label

facebook-github-bot pushed a commit that referenced this pull request


          [fbsync] refactor download tests (#7546)

a0dae30

Reviewed By: vmoens

Differential Revision: D45522828

fbshipit-source-id: 8aaef29d49b656035c8c859036c252499a6145c4

Labels

cla signed enhancement module: datasets module: tests