Use RefGenie for pipeline template genomes.conf config #1084

Open
ewels opened this issue May 18, 2021 · 4 comments · May be fixed by #2139
Labels
automation template nf-core pipeline/component template

Comments

@ewels
Member

ewels commented May 18, 2021

Split from #592 - specifically #592 (comment)


I wanted to announce an update that would allow using refgenie for unarchived cloud assets directly. You first need a refgenie digest for the genome of interest, which you can get at an endpoint like this: http://rg.databio.org/genomes/genome_digest/hg38.

Then, you can use that with the assets/file_path endpoint to return either an http or s3 URL to the file of interest. For example:

You can also use individual seek keys, just like in the CLI, to get individual items within an asset:
http://rg.databio.org/assets/file_path/2230c535660fb4774114bfa966a62f823fdb6d21acf138d4/fasta/chrom_sizes?remoteClass=s3
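As a rough illustration of how the two endpoints above fit together, here is a minimal Python sketch that only builds the URLs. The function and parameter names are made up for this example; the URL shapes are copied from the examples in this thread, and "s3" is the only remoteClass value shown here, so anything else is an assumption.

```python
from urllib.parse import urlencode

# Server taken from the example URLs in this thread.
REFGENIE_SERVER = "http://rg.databio.org"


def genome_digest_url(genome_alias: str, server: str = REFGENIE_SERVER) -> str:
    """Build the URL of the endpoint that returns the digest for a genome alias."""
    return f"{server}/genomes/genome_digest/{genome_alias}"


def asset_file_path_url(
    digest: str,
    asset: str,
    seek_key: str,
    remote_class: str = "s3",  # only "s3" appears in the thread
    server: str = REFGENIE_SERVER,
) -> str:
    """Build the assets/file_path URL for one seek key within an asset."""
    path = "/".join([server, "assets", "file_path", digest, asset, seek_key])
    return f"{path}?{urlencode({'remoteClass': remote_class})}"


if __name__ == "__main__":
    print(genome_digest_url("hg38"))
    print(
        asset_file_path_url(
            "2230c535660fb4774114bfa966a62f823fdb6d21acf138d4",
            "fasta",
            "chrom_sizes",
        )
    )
```

The sketch deliberately stops at URL construction: the response format of either endpoint is not shown in this thread, so fetching and parsing is left out.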

One way to auto-generate a config file is to use the new refgenie populate function.

You would create a template like:

params {
  // illumina iGenomes reference file paths
  genomes {
    'GRCh38' {
      fasta       = "refgenie://hg38/fasta"
      bwa         = "refgenie://hg38/bwa_index"
      bowtie2     = "refgenie://hg38/bowtie2_index"
    }
  }
}

Then you just run some flavor of refgenie populate on file.tpl and you'd get the populated config, with each refgenie:// placeholder replaced by the then-current URI.
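To picture what the substitution step does to such a template, here is a small Python sketch. It is not refgenie's implementation and does not call refgenie at all: populate_template, fake_resolve and the example-bucket URL are hypothetical names invented for illustration, and the placeholder pattern is inferred from the refgenie://genome/asset strings in the template above.

```python
import re
from typing import Callable

# Matches placeholders of the form refgenie://<genome>/<asset>.
PLACEHOLDER = re.compile(r'refgenie://(?P<genome>[^/"\s]+)/(?P<asset>[^/"\s]+)')


def populate_template(template: str, resolve: Callable[[str, str], str]) -> str:
    """Replace every refgenie:// placeholder with whatever resolve() returns."""
    return PLACEHOLDER.sub(lambda m: resolve(m["genome"], m["asset"]), template)


def fake_resolve(genome: str, asset: str) -> str:
    """Hypothetical resolver; a real one would ask refgenie for the current URI."""
    return f"s3://example-bucket/{genome}/{asset}"


TEMPLATE = '''params {
  genomes {
    'GRCh38' {
      fasta       = "refgenie://hg38/fasta"
      bwa         = "refgenie://hg38/bwa_index"
    }
  }
}
'''

POPULATED = populate_template(TEMPLATE, fake_resolve)

if __name__ == "__main__":
    print(POPULATED)
```

The point of the sketch is only that the template stays stable (asset names) while the resolved locations can change between runs.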


So the idea would be to copy the template's conf/igenomes.conf file (which needs renaming to just genomes.conf) to a new file called conf/igenomes.conf.tpl, then fill that copy with the asset identifiers as described above.

Once that's done, we run refgenie populate on that file to create the conf/genomes.conf file with the absolute asset paths. The generated file should never be edited directly.

To ensure this, we should have a CI test that regenerates the file and checks that it matches what is committed to the repo (e.g. git diff doesn't return anything).
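The comparison part of such a CI test could look something like the Python sketch below. It assumes the regeneration step has already produced the new file contents as a string; config_drift is a hypothetical helper name, and the comparison uses Python's difflib rather than git diff so that it can run on plain strings.

```python
import difflib
from typing import List


def config_drift(
    committed: str, regenerated: str, name: str = "conf/genomes.conf"
) -> List[str]:
    """Return unified-diff lines between the committed and regenerated config.

    An empty list means the committed file is up to date.
    """
    return list(
        difflib.unified_diff(
            committed.splitlines(),
            regenerated.splitlines(),
            fromfile=f"{name} (committed)",
            tofile=f"{name} (regenerated)",
            lineterm="",
        )
    )


if __name__ == "__main__":
    committed = 'fasta = "s3://example-bucket/hg38/fasta"\n'
    regenerated = 'fasta = "s3://example-bucket/hg38/fasta.v2"\n'
    drift = config_drift(committed, regenerated)
    print("\n".join(drift))
    # A CI job would fail the build when the file has drifted.
    raise SystemExit(1 if drift else 0)
```

In CI the two strings would come from reading the committed conf/genomes.conf and the freshly regenerated output; a non-empty result means someone edited the generated file by hand or forgot to regenerate it.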

Many / most of the reference genome assets that we currently have will be missing for now I guess, so this ties in with issue #1086 to create those. But we can leave them commented out and in a branch for now until they are ready.

@ewels ewels added template nf-core pipeline/component template automation labels May 18, 2021
@ewels
Member Author

ewels commented May 18, 2021

Note that the RefGenie asset identifiers will be new, but hopefully we can keep the top level --genome keys the same in this file. This will save a lot of headache as many pipelines have custom code tied to these keys.
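One way to guard against accidentally dropping a --genome key during the migration is sketched below in Python. This is only an illustration: genome_keys and missing_genome_keys are hypothetical helpers, and the regex assumes keys are written as single-quoted names followed by an opening brace, as in the template example earlier in this thread.

```python
import re
from typing import Set

# Matches lines such as:   'GRCh37' {
GENOME_KEY = re.compile(r"^\s*'([^']+)'\s*\{", re.MULTILINE)


def genome_keys(config_text: str) -> Set[str]:
    """Collect the single-quoted top-level genome keys from a config file."""
    return set(GENOME_KEY.findall(config_text))


def missing_genome_keys(old_config: str, new_config: str) -> Set[str]:
    """Keys present in the old config but absent from the new one."""
    return genome_keys(old_config) - genome_keys(new_config)


OLD = """params {
  genomes {
    'GRCh37' {
      fasta = "old/path/a"
    }
    'GRCm38' {
      fasta = "old/path/b"
    }
  }
}
"""

NEW = """params {
  genomes {
    'GRCh37' {
      fasta = "refgenie://hg19/fasta"
    }
  }
}
"""

if __name__ == "__main__":
    print(sorted(missing_genome_keys(OLD, NEW)))
```

A check like this could sit next to the CI test so that pipelines with custom code tied to these keys keep working.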

@KevinMenden
Contributor

Okay so we want to generate the genomes.conf file automatically using refgenie. If I understand this correctly, we basically have to add all the assets to the refgenie servers first, and then just use refgenie populate, which will get the links.
The advantage is that we don't have to keep track of the links anymore: refgenie will do so automatically, and we can just keep the asset names.

Okay but for this to be useful we first have to add all (or at least most) of the genomes & assets we already have to refgenie. So I will probably prioritize #1086 for now and see how easy that is.

@ewels
Member Author

ewels commented May 28, 2021

Yes exactly, I think that this makes most sense 👍🏻 I had thought that if #1086 took a while then we could get started with this code for just a few that are there with the rest commented out temporarily. But priority is definitely #1086 whilst there is still stuff there to be done.

I guess getting the CI test written is the main task here really, as the template file should be fairly trivial as all of the hard work is done in RefGenie.

@KevinMenden
Contributor

Yup agree, and hopefully even the CI test won't be too difficult. Once all the assets are on the refgenie server, it should be straightforward
