Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GeoDataset: ignore other bands for separate files #2222

Merged
merged 6 commits into from
Aug 19, 2024

Conversation

adamjstewart
Copy link
Collaborator

If the user passes a list of files including all bands, ignore everything other than the primary band so that the file doesn't end up in the index multiple times.

Pros

The dataset length matches what you would expect, and each bounding box is not sampled multiple times. This is especially important when using a dataset splitter to avoid the same image ending up in both train and test.

Cons

Requires the users to ensure that the bands parameter includes the bands in paths.

Fixes #2221

@sfalkena @adriantre can you review?

@adamjstewart adamjstewart added this to the 0.6.0 milestone Aug 13, 2024
@github-actions github-actions bot added datasets Geospatial or benchmark datasets testing Continuous integration testing labels Aug 13, 2024
@adamjstewart adamjstewart force-pushed the datasets/geo-separate-files branch from fc88eef to ce88ca2 Compare August 13, 2024 09:57
elif os.path.isfile(path) or path_is_vsi(path):
elif (
os.path.isfile(path)
and fnmatch.fnmatch(str(path), os.path.join('*', self.filename_glob))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative: use fnmatch.filter at the end. This won't result in a warning though. This could be a good thing or a bad thing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adamjstewart and others added 2 commits August 13, 2024 15:53
Co-authored-by: Sieger Falkena <siegerfalkena@hotmail.com>
Copy link
Contributor

@sfalkena sfalkena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving the PR, but just noting that it might be good to add a testcase with the URLs for future reference?

@adamjstewart
Copy link
Collaborator Author

@adriantre is adding a bunch of new VSI tests in #1399, I'll let him test this in his PR.

@adriantre
Copy link
Contributor

Works as expected for zip archives too. #1399 is conflicting with this line, but I say we merge this and I incorporate it to my proposed changes.

import os
import fnmatch
from pathlib import Path

archive = 'tests/data/cms_mangrove_canopy/CMS_Global_Map_Mangrove_Canopy_1665.zip!/CMS_Global_Map_Mangrove_Canopy_1665/data'

filename_glob = '**/Mangrove_agb_Angola*'

path1 = Path('zip://{archive}/Mangrove_agb_Angola.tif')
fnmatch.fnmatch(str(path1), os.path.join('*', filename_glob))
> True

path2 = Path('zip://{archive}!/Mangrove_hba95_Angola.tif')
fnmatch.fnmatch(str(path2), os.path.join('*', filename_glob))
> False

@sfalkena
Copy link
Contributor

I had another try at this and I think I have found another example where this doesn't work and that would pledge in favour of the sampler-based approach. If I pass a list of files, only the band name is replaced from the filename in hits. However, I have the case where different bands come in different resolutions, the dataset fails to find the files. e.g where the files are something like:

["T43REP_20240511T053639_B01_60m.tif",
"T43REP_20240511T053639_B02_10m.tif"]

(for example only band 1 is inserted in the index, but in the __get_item__ it tries to load T43REP_20240511T053639_B02_60m.tif which does not exist.

Right now I have created a custom dataset that puts all files in the index (for both bands in time) and a custom sampler that filters the hits based on the bounds of the hits. I will see if I can create a prototype within the geosampler in the coming days

@adamjstewart
Copy link
Collaborator Author

Note that adding 13x as many files to the R-tree also means it will be much slower. And that not all bands may have the exact same bounds.

@adamjstewart
Copy link
Collaborator Author

@sfalkena Sentinel-2 is a constant thorn in my side. Can you open a separate issue/PR for that? I genuinely don't know how to handle it because the same band may appear in multiple resolutions. We'll likely need to completely override Sentinel2.__getitem__ to search 10, 20, 60 in that order.

@adamjstewart adamjstewart merged commit c26512e into microsoft:main Aug 19, 2024
19 checks passed
@adamjstewart adamjstewart deleted the datasets/geo-separate-files branch August 19, 2024 10:05
@adamjstewart
Copy link
Collaborator Author

Can you open a separate issue/PR for that?

Nevermind, we already have one: #1758

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GridGeoSampler resamples same image repeatedly with separate_files and multiple dates
3 participants