-
Notifications
You must be signed in to change notification settings - Fork 405
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GeoDataset: ignore other bands for separate files #2222
GeoDataset: ignore other bands for separate files #2222
Conversation
fc88eef
to
ce88ca2
Compare
torchgeo/datasets/geo.py
Outdated
elif os.path.isfile(path) or path_is_vsi(path): | ||
elif ( | ||
os.path.isfile(path) | ||
and fnmatch.fnmatch(str(path), os.path.join('*', self.filename_glob)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternative: use fnmatch.filter
at the end. This won't result in a warning though. This could be a good thing or a bad thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Sieger Falkena <siegerfalkena@hotmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving the PR, but just noting that it might be good to add a testcase with the URLs for future reference?
@adriantre is adding a bunch of new VSI tests in #1399, I'll let him test this in his PR. |
Works as expected for zip archives too. #1399 is conflicting with this line, but I say we merge this and I incorporate it to my proposed changes. import os
import fnmatch
from pathlib import Path
archive = 'tests/data/cms_mangrove_canopy/CMS_Global_Map_Mangrove_Canopy_1665.zip!/CMS_Global_Map_Mangrove_Canopy_1665/data'
filename_glob = '**/Mangrove_agb_Angola*'
path1 = Path('zip://{archive}/Mangrove_agb_Angola.tif')
fnmatch.fnmatch(str(path1), os.path.join('*', filename_glob))
> True
path2 = Path('zip://{archive}!/Mangrove_hba95_Angola.tif')
fnmatch.fnmatch(str(path2), os.path.join('*', filename_glob))
> False |
I had another try at this and I think I have found another example where this doesn't work and that would pledge in favour of the sampler-based approach. If I pass a list of files, only the band name is replaced from the filename in hits. However, I have the case where different bands come in different resolutions, the dataset fails to find the files. e.g where the files are something like:
(for example only band 1 is inserted in the index, but in the Right now I have created a custom dataset that puts all files in the index (for both bands in time) and a custom sampler that filters the hits based on the bounds of the hits. I will see if I can create a prototype within the geosampler in the coming days |
Note that adding 13x as many files to the R-tree also means it will be much slower. And that not all bands may have the exact same bounds. |
@sfalkena Sentinel-2 is a constant thorn in my side. Can you open a separate issue/PR for that? I genuinely don't know how to handle it because the same band may appear in multiple resolutions. We'll likely need to completely override |
Nevermind, we already have one: #1758 |
If the user passes a list of files including all bands, ignore everything other than the primary band so that the file doesn't end up in the index multiple times.
Pros
The dataset length matches what you would expect, and each bounding box is not sampled multiple times. This is especially important when using a dataset splitter to avoid the same image ending up in both train and test.
Cons
Requires the users to ensure that the
bands
parameter includes the bands inpaths
.Fixes #2221
@sfalkena @adriantre can you review?