-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Support extracting DICOMs from ZIP files (and possibly other archives) by switching to use shutil.unpack_archive instead of tarfile module functionality #471
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## master #471 +/- ##
==========================================
+ Coverage 76.13% 81.52% +5.39%
==========================================
Files 37 43 +6
Lines 2962 3930 +968
==========================================
+ Hits 2255 3204 +949
- Misses 707 726 +19
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report in Codecov by Sentry. |
I just discovered that this works even without the change I uploaded. Sorry, my bad :-) |
interesting -- stock heudiconv doesn't work for me: $> heudiconv -f reproin --bids -o /tmp/h --files bids_anat-T1w.zip
INFO: Running heudiconv version 0.9.0 latest 0.9.0
INFO: Analyzing 1 dicoms
INFO: Filtering out 0 dicoms based on their filename
WARNING: Ignoring bids_anat-T1w.zip since not quite a "normal" DICOM: 'FileDataset' object has no attribute 'SeriesNumber'
INFO: Generated sequence info for 0 studies with 0 entries total
INFO: Need to process 0 study sessions so may be there is still a need in this/such PR. Do you remember with which version it has worked? Haven't reviewed changes but tried with this PR and above worked... so we still need it seems to me. |
From cursory review looks ok but needs a unittest to ensure that it works. Test could reuse existing files, zip them, and do conversion. |
Interesting, maybe some zip files work with I agree that a unit test would be important. I am currently quite busy, but I will have time to write one next week, so please stay tuned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure yet how it could have worked on zip, most likely some fluke (e.g. .tar.gz
was named with .zip
extension or alike). So I think this enhancement has value so I will keep it live.
It will also need a unittest (produce .zip in the test code, give to function, see that does the right thing).
Meanwhile I will convert to draft
. please advise if you are going to work on it further @HippocampusGirl . Cheers
Hi @yarikoptic, I think I got confused when I closed the pull request. I also agree that this is a valuable feature and have been using my changes for multiple projects. I want to work on this some more, but cannot prioritize this right now. If you or anyone else is willing to step in, then things may be faster. |
Hello @HippocampusGirl, is it still the case that this feature is on hold? I would be interested in the ability to work with zip files and would be happy to bring this pull request over the finish line. |
Hi @psadil, That would be wonderful! |
Great! I wonder what's going to be a good way to contribute code. I have this potential solution that builds on the pull request's initial commit, d076a07. Do I submit a pull request into HippocampusGirl:add-zipfile-support so that it shows up in this #471 request? Can I directly add on to #471? Should I open a new pull request for the main nipy/heudiconv repo? @yarikoptic perhaps you have guidance? Thank you |
Hi @psadil, I have given you write access to my repository. You should be able to push into my branch now, so that it shows up as part of the pull request :-) |
Enh/add zipfile support
@yarikoptic (tagging you as the original reviewer) - I think this may be ready to go. As you requested in the original review, the unpacking is now done with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- major concern -- seems to be a change in behavior in case of a mixed list of files/archives. I would prefer to keep original behavior unless there is really a strong need
- some minor comments on styling etc.
heudiconv/parser.py
Outdated
import tarfile | ||
from tempfile import mkdtemp | ||
import shutil | ||
import typing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think frequent typing.
makes code too cluttered. I am ok with imports of the constructs we use as we already do elsewhere
heudiconv/dicoms.py:from typing import List, Optional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note -- ATM we do not check typing information at all. Filed #653
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Looking forward to newer versions that allow the use of builtins like list
or |
:)
|
||
Returns | ||
------- | ||
ItemsView[int | None, list[str]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately on this aspect there even were not discussion so I initiated one numpy/numpydoc#196 (comment) . I guess we are doomed to keep it as this for now here.
heudiconv/parser.py
Outdated
input_list_of_unpacked or | ||
all(t.endswith(_UNPACK_FORMATS) for t in fl) | ||
): | ||
raise ValueError("Some but not all input files are archives.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think prior code was handling that just fine -- and add non-archive to None
session. Is there a reason to introduce this breaking behavior? I think it might have really bad manifestations (didn't check) e.g. by pointing to a folder with some tarballs and some files to just be ignored which we ignore if not dicom later (?) in the code, etc.
heudiconv/parser.py
Outdated
typing.List[str] | ||
] = defaultdict(list) | ||
|
||
if not isinstance(fl, list): | ||
fl = list(fl) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we even need this "casting"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll check. There was a similar cast before
Lines 85 to 86 in c83e06c
if not isinstance(fl, (list, tuple)): | |
fl = list(fl) |
but perhaps that can be dropped altogether
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah -- that is what I wondered on why it was needed before. I wish smarter earlier us had left a comment ;)
heudiconv/parser.py
Outdated
# contained within each archive. So, files are temporarily | ||
# extracted into unique tempdirs | ||
# cannot use TempDirs since will trigger cleanup with __del__ | ||
tmpdir = tempdirs(prefix="heudiconvDCM") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make that tmpdir 0o700
right away here so no files extracted would have any moment in time when they are accessible to non-authorized user (i.e. until chmoding below)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! This seems especially relevant considering that the files are unpacked into a generic tmp directory, which could be a surprising risk to privacy.
Ah, sorry. I thought that I was making something explicit that was previously only implicit. I didn't have a strong need. I'll update it to match the current behavior
Great! I'm looking through these and will revise accordingly |
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Co-authored-by: Yaroslav Halchenko <debian@onerussian.com>
Okay, now revised |
Checking in on this -- the code is ready for another round of review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code styling is a bit odd to me, but without black I cannot insist really ;)
Otherwise I think it looks great and let's give it a try! Thank you @HippocampusGirl and @psadil !!!
🚀 PR was released in |
Dear heudiconv team,
I was recently working on converting a DICOM dataset to BIDS with heudiconv. Our imaging facility sends us the DICOM files as ZIP archives.
While I could extract the ZIP files manually before running heudiconv, I thought it may be useful to have on-the-fly extraction just like for tarballs. I implemented this via the built-in Python module "zipfile".