Fix s3 output #73

Merged
ellisrichardj merged 4 commits into master from Fixs3Output
May 9, 2022

ellisrichardj

If the input data directory was set as an s3 URI, the final CombineOutput process would fail, so the results table was never generated. This is linked to the issue I raised with nextflow here: nextflow-io/nextflow#2502 (comment).

After a bit of digging and lots of testing, I found the cause: the source data parent directory was being recognised as an s3 object when collected as the parameter params.DataDir in line 78 of bTB-WGS_process.nf. The process couldn't be initiated with this for some reason, but could if it was a local path. I still think this is a nextflow bug, but I have fixed the issue in our pipeline by defining a variable from params.DataDir (turning what was perceived as a path into a string), which can then be used as an alternate input for the CombineOutput process.
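
The idea of handling the run directory as a plain string rather than as a path object can be illustrated outside Nextflow. The following is a minimal Python sketch, not the pipeline's code: the function name `run_name_from_reads` is hypothetical, and it assumes the run identifier is simply the name of the directory that contains the reads.

```python
def run_name_from_reads(reads_pattern: str) -> str:
    """Return the name of the directory containing the reads, as a plain string.

    Only the text of the pattern is inspected, so an s3:// URI is never
    turned into a filesystem or S3 path object.
    """
    # Split off the file glob (last component) and keep the directory before it.
    parts = reads_pattern.rstrip("/").rsplit("/", 2)
    if len(parts) < 2 or not parts[-2]:
        raise ValueError(f"no parent directory in pattern: {reads_pattern!r}")
    return parts[-2]


# Hypothetical usage with the reads pattern used for testing in this PR.
reads = "s3://s3-csu-001/SB4030/M02410_5271/*_{S*_R1,S*_R2}*.fastq.gz"
print(run_name_from_reads(reads))  # prints M02410_5271
```

Because the result is an ordinary string, it can be passed on as a value without anything downstream trying to resolve it as a remote object.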

This now fixes things so that an EC2 instance can run the pipeline without the need for separate copying of input files to local storage, or the output files back to s3. We can now run on a naïve instance with a simple command (tested on my ranch machine):

~/nextflow run APHA-CSU/btb-seq -with-docker aphacsubot/btb-seq -r Fixs3Output --reads='s3://s3-csu-001/SB4030/M02410_5271/*_{S*_R1,S*_R2}*.fastq.gz' --outdir='s3://s3-staging-area/RichardEllis/'

The only requirements are to have nextflow and docker installed. Nextflow will pull the repo (I have defined the specific branch with -r) and use the docker image specified. I think this should simplify the reprocess and hopefully make batch implementation much simpler. This is now much closer to nextflow's ideal modus operandi.

Member

@nick-pestell left a comment

Fantastic fix, this will make the workflow much easier. Is it worth updating the readme so that it details how to run like this (~/nextflow run APHA-CSU/btb-seq -with-docker aphacsubot/btb-seq -r Fixs3Output --reads='s3://s3-csu-001/SB4030/M02410_5271/*_{S*_R1,S*_R2}*.fastq.gz' --outdir='s3://s3-staging-area/RichardEllis/')?

@nick-pestell
Member

I'm actually getting an error in combineCsv.py:

[b6/d45d52] process > CombineOutput (1)                  [100%] 1 of 1, failed: 1 ✘
Error executing process > 'CombineOutput (1)'

Caused by:
  Process `CombineOutput (1)` terminated with an error exit status (1)

Command executed:

  combineCsv.py assigned_csv qbovis_csv M02410_5271

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File "/home/nickpestell/.nextflow/assets/APHA-CSU/btb-seq/bin/combineCsv.py", line 67, in <module>
      combine(**vars(args))
    File "/home/nickpestell/.nextflow/assets/APHA-CSU/btb-seq/bin/combineCsv.py", line 19, in combine
      commit = repo.head.object.__str__()
    File "/usr/local/lib/python3.8/dist-packages/git/refs/symbolic.py", line 210, in _get_object
      return Object.new_from_sha(self.repo, hex_to_bin(self.dereference_recursive(self.repo, self.path)))
    File "/usr/local/lib/python3.8/dist-packages/git/objects/base.py", line 85, in new_from_sha
      oinfo = repo.odb.info(sha1)
    File "/usr/local/lib/python3.8/dist-packages/git/db.py", line 43, in info
      hexsha, typename, size = self._git.get_object_header(bin_to_hex(binsha))
    File "/usr/local/lib/python3.8/dist-packages/git/cmd.py", line 1253, in get_object_header
      return self.__get_object_header(cmd, ref)
    File "/usr/local/lib/python3.8/dist-packages/git/cmd.py", line 1240, in __get_object_header
      return self._parse_object_header(cmd.stdout.readline())
    File "/usr/local/lib/python3.8/dist-packages/git/cmd.py", line 1198, in _parse_object_header
      raise ValueError("SHA could not be resolved, git returned: %r" % (header_line.strip()))
  ValueError: SHA could not be resolved, git returned: b''

Work dir:
  /home/nickpestell/work/b6/d45d5288e9ba1a38733ed3d6967bbd

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

Looks like the error is coming from lines 17/18/19. Perhaps it has something to do with running this inside docker rather than natively?

@ellisrichardj
Author

I think the error is due to a python/git issue that is mentioned here: gitpython-developers/GitPython#1016 (comment). I have a new branch which uses the Nextflow workflow property to capture the commit ID rather than the python script. There will be a PR for that soon.
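
One way to picture this approach: the commit ID is resolved by Nextflow (for example from its `workflow.commitId` property) and handed to the script as an ordinary argument, so the script does not have to query git itself. The sketch below is illustrative only and is not taken from `combineCsv.py`; the `parse_args` helper and all argument names, including `commit_id`, are hypothetical.

```python
import argparse


def parse_args(argv):
    """Parse the script's arguments, including a commit ID supplied by the caller."""
    parser = argparse.ArgumentParser(description="Combine per-sample result tables")
    parser.add_argument("assigned_csv")
    parser.add_argument("qbovis_csv")
    parser.add_argument("seq_run")
    # Hypothetical extra argument: the caller (e.g. the Nextflow process)
    # passes the commit ID instead of the script asking git for it.
    parser.add_argument("commit_id")
    return parser.parse_args(argv)


# Hypothetical usage; "abc1234" is a placeholder, not a real commit.
args = parse_args(["assigned_csv", "qbovis_csv", "M02410_5271", "abc1234"])
print(args.seq_run, args.commit_id)
```

With this shape, the script has no dependency on GitPython, and the value recorded in the results table is whatever revision Nextflow reports for the run.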

@nick-pestell
Member

Ah ok, good to have a handle on that. Perhaps we shouldn't merge this in until that bug is fixed? Probably best to only have working code in the main branch. Or does this error only occur with the workflow you define in your original PR comment? Either way, probably best to fix before merging, no?

Use nextflow workflow property for commitID
@ellisrichardj
Author

Now added the fix for commit ID as well. I think that the readme likely needs a bigger overhaul so will save changes for a future PR.


date_out = date.today().strftime('%d%b%y')
user = getpass.getuser()
scriptpath = os.path.dirname(os.path.abspath(__file__))
repo = git.Repo(scriptpath, search_parent_directories=True)
commit = repo.head.object.__str__()
#commit = repo.head.object.__str__()
Member

remove the commented line?

Member

@nick-pestell left a comment

Looks good. If it works, I'm happy with it. Perhaps just remove the commented line (L19).

@ellisrichardj
Author

Good spot - I had forgotten to remove it - I don't like to delete anything until it's fully tested.

@ellisrichardj merged commit 5c529b2 into master on May 9, 2022
@ellisrichardj deleted the Fixs3Output branch on September 29, 2022 11:52