RF: improve support for queue args #328
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #328      +/-   ##
==========================================
+ Coverage   74.08%   74.31%   +0.23%
==========================================
  Files          35       35
  Lines        2643     2667      +24
==========================================
+ Hits         1958     1982      +24
  Misses        685      685
Continue to review full report at Codecov.
hm... so what is the benefit of moving to absolute imports exactly, or what problem(s) would it solve?
I left a comment on #302, but most likely I didn't fully comprehend the situation, since it most likely relates to SLURM execution... in either case, it must not be a problem of relative-vs-absolute imports but most likely an incorrect invocation. I really do not think we should replace relative imports with absolute ones to address whatever the problem is.
@mgxd - when you say "behave better across different environments", what do you mean? i do have a personal preference for relative imports in these relatively shallow projects, and in the past i have gotten bitten by using absolute imports with an installed project vs a dev project.
@satra if anything, it's good to clearly show where files and functions lie within the project, especially to new contributors. It's also recommended by PEP 8, and would fit in well given how basic our structure is (I think our max depth currently is 3).
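For illustration, a minimal sketch of the two styles under discussion, as they might look inside a module such as heudiconv/cli/run.py (the imported helper name here is hypothetical):

```python
# hypothetical lines inside heudiconv/cli/run.py

# relative import: the frame of reference is the current package
from ..utils import load_heuristic

# absolute import: spells out where the helper lives within the project
from heudiconv.utils import load_heuristic
```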
@mgxd the latest #328 does the trick, a.k.a. it does not crash but executes with arguments being properly passed to SLURM. Now, I understand that each subject will be submitted as a separate "job" to the cluster. Yet, is there any information on how to optimize the execution? For example, does it make sense to specify multiple CPUs per job or is heudiconv a single task and hence operates on one cpu per node? Likewise, would it help to increase the memory per CPU? Apologies for all these questions. I do have a 65 subject dataset for which these are all very helpful questions. I am also keeping track of all the steps and would be more than happy to contribute a working example at the end (I am also not using docker, and I have noticed an increased demand for non-docker examples). Thanks! |
@mgxd - here is a use case for relative imports (although a developer should be more careful):
In this scenario all the absolute imports would get picked up from the installed location. Personally, i like relative imports because of the frame of reference. If there are many project cross-links i can easily see how relative imports could get just as complicated (go up two and down three). i think they are both clear in shallow trees with only a few branches at the top level. here are the two selling points for me:
Please let's stick with relative imports, and concentrate on addressing the problem at hand, which smells like just an "incorrect" invocation specification (directly of a submodule instead of the
to that extent the culprit is https://github.com/nipy/heudiconv/blob/master/heudiconv/cli/run.py#L293: pyscript = op.abspath(inspect.getfile(inspect.currentframe())), from the "good old" times of heudiconv being a single script, so that is how
can't pyscript simply be
in principle it could be, I guess. I just dislike hardcoding anything ;)
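To make the exchange above concrete, here is a small sketch of the line in question and the kind of "hardcoded" alternative being floated; the alternative is illustrative only, not a quote of what the PR actually does:

```python
import inspect
import os.path as op

# current approach in heudiconv/cli/run.py: ask the interpreter, at run time,
# which file is executing and hand that absolute path to the queue submission
pyscript = op.abspath(inspect.getfile(inspect.currentframe()))
print(pyscript)

# the alternative discussed above would skip the introspection and invoke the
# CLI module by name instead, e.g. building a command such as:
#   python -m heudiconv.cli.run <arguments>
```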
Ok, I've reverted imports due to popular demand 😆 @fhopp on our cluster, I run heudiconv across a hundred subjects concurrently; heudiconv is a single-core application (currently), and I don't think I have encountered a time where I needed more than 2GB of memory.
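As a rough illustration of those numbers (one core and about 2GB per subject), a per-subject SLURM submission could look something like the sketch below; the sbatch flags are standard, but the heudiconv command line is abbreviated and purely illustrative:

```python
# hypothetical sketch: one SLURM job per subject, each requesting a single
# core and 2GB of memory; adjust the heudiconv arguments (heuristic, dicom
# template, output dir, ...) to your own setup
subjects = ["sub-001", "sub-002"]
for subj in subjects:
    cmd = [
        "sbatch", "--cpus-per-task=1", "--mem=2G",
        "--wrap", f"heudiconv -f heuristic.py -s {subj} -c dcm2niix -b",
    ]
    print(" ".join(cmd))  # print only; pass cmd to subprocess.run() on the cluster
```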
@mgxd Just out of curiosity, how long does generating the heuristic file across all subjects take?
I was also wondering whether it would be better to split the subjects into chunks of five rather than submitting the full list as done above. Finally, it appears that the above command submits just one job to each node. While I understand that heudiconv runs on a single CPU, is it possible to have each node work on multiple jobs concurrently? Say one node has 8 cores, is it possible to submit 8 jobs to that node?
@fhopp actually I think it's on our end; I see a bottleneck at heudiconv/cli/run.py lines 251 to 253 (commit 3467341).
I'm working on a fix now
@fhopp - hopefully you'll see a drastic speed-up after the last few commits. |
@mgxd Awesome, trying this now. From what I see so far, this increases the processing time before the batch job(s) are submitted. Will check and send an update on the speed-up. I still think it would be best if these job submissions could run on single cores on each node in the cluster.
@mgxd Confirmed! Job submission is almost instant. |
If you specify
@mgxd short answer: yes, please merge. I tried with your latest fix and job submission is instant. As for execution, it was in fact an issue with my SLURM config: it was set to a non-shared, exclusive mode in which CPUs and memory in the cluster were not treated as "consumable resources": https://slurm.schedmd.com/cons_res.html
Yes, that would be awesome. We have a small collection of user tutorials in our docs that would be a great place for it.
Closes #302
Changes
- `queue-args` to submission script
- `queue-args` from `queue` submission
- imports changed from relative to absolute, to behave better across different environments
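For orientation, a minimal sketch of what passing queue args through to the submission script could look like; the variable names and the whitespace splitting of the extra arguments are assumptions for illustration, not the PR's exact implementation:

```python
# hypothetical illustration: extra scheduler arguments supplied by the user
# are appended to the sbatch call before the job is submitted
queue_args = "--time=01:00:00 --partition=short"  # e.g. the value given for queue-args
cmd = ["sbatch"] + queue_args.split() + ["--wrap", "heudiconv <arguments>"]
print(cmd)
```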