pip install dtool-overlay
Get a dataset to play with:
LOCAL_DS_URI=$(dtool cp -q http://bit.ly/Ecoli-ref-genome .)
Show the existing overlays:
$ dtool overlays show $LOCAL_DS_URI identifiers,relpaths 23ebd7cd21a905d5f255919ca1d0491901cb8718,reference.4.bt2 37e2d68bb38271036d96b6979d24666e0d4fd814,reference.rev.1.bt2 41fb9ae5d4f6c37226ff324c701b84bc3110709e,reference.1.bt2 828ebf503926b7c1b8b07c1995b4ca818814b404,reference.rev.2.bt2 b445ff5a1e468ab48628a00a944cac2e007fb9bc,U00096.3.fasta d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf,reference.3.bt2 dda8452b346d51b9cf60f0662ef3d6e3b6da2e74,reference.2.bt2
The output above show that there are no overlays on this dataset. (The "identifiers" and "relpaths" columns are there for bookkeeping).
Create a "is_fasta" boolean overlay template by using a glob pattern:
$ dtool overlays template glob $LOCAL_DS_URI is_fasta '*.fasta' > is_fasta.csv $ cat is_fasta.csv identifiers,is_fasta,relpaths 23ebd7cd21a905d5f255919ca1d0491901cb8718,False,reference.4.bt2 37e2d68bb38271036d96b6979d24666e0d4fd814,False,reference.rev.1.bt2 41fb9ae5d4f6c37226ff324c701b84bc3110709e,False,reference.1.bt2 828ebf503926b7c1b8b07c1995b4ca818814b404,False,reference.rev.2.bt2 b445ff5a1e468ab48628a00a944cac2e007fb9bc,True,U00096.3.fasta d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf,False,reference.3.bt2 dda8452b346d51b9cf60f0662ef3d6e3b6da2e74,False,reference.2.bt2
Write the overlay template to the dataset:
$ dtool overlays write $LOCAL_DS_URI is_fasta.csv
Show the newly created overlay:
$ dtool overlays show $LOCAL_DS_URI identifiers,is_fasta,relpaths 23ebd7cd21a905d5f255919ca1d0491901cb8718,False,reference.4.bt2 37e2d68bb38271036d96b6979d24666e0d4fd814,False,reference.rev.1.bt2 41fb9ae5d4f6c37226ff324c701b84bc3110709e,False,reference.1.bt2 828ebf503926b7c1b8b07c1995b4ca818814b404,False,reference.rev.2.bt2 b445ff5a1e468ab48628a00a944cac2e007fb9bc,True,U00096.3.fasta d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf,False,reference.3.bt2 dda8452b346d51b9cf60f0662ef3d6e3b6da2e74,False,reference.2.bt2
To extract multiple pieces of metadata from the items' relpath one can use the
dtool overlays template parse
command. This takes as input a dataset URI, a
parse rule (see https://pypi.org/project/parse/ for more details) and a glob
rule. The latter decides which relpaths to apply the parsing to.
Consider for example the dataset below:
$ dtool ls http://bit.ly/Ecoli-reads-minified 8bda245a8cd526673aab775f90206c8b67d196af ERR022075_2.fastq.gz 9760280dc6313d3bb598fa03c5931a7f037d7ffc ERR022075_1.fastq.gz
The command below could be used to generate a template for the overlays "useful_name" and "read":
$ dtool overlays template parse \ http://bit.ly/Ecoli-reads-minified \ '{useful_name}_{read:d}.fastq.gz'
Results in the CSV output below:
identifiers,read,useful_name,relpaths 8bda245a8cd526673aab775f90206c8b67d196af,2,ERR022075,ERR022075_2.fastq.gz 9760280dc6313d3bb598fa03c5931a7f037d7ffc,1,ERR022075,ERR022075_1.fastq.gz
To ignore a variable element when parsing one can use unnamed curly braces. The command below for example only generates the overlay "useful_name":
$ dtool overlays template parse \ http://bit.ly/Ecoli-reads-minified \ '{useful_name}_{:d}.fastq.gz' identifiers,useful_name,relpaths 8bda245a8cd526673aab775f90206c8b67d196af,ERR022075,ERR022075_2.fastq.gz 9760280dc6313d3bb598fa03c5931a7f037d7ffc,ERR022075,ERR022075_1.fastq.gz
Sometimes it is useful to be able to find pairs of items. For example when dealing with genomic sequencing data that has forward and reverse reads.
One can create a "pair_id" overlay CSV template for this dataset using the command below:
$ dtool overlays template pairs http://bit.ly/Ecoli-reads-minified .fastq.gz identifiers,pair_id,relpaths 8bda245a8cd526673aab775f90206c8b67d196af,9760280dc6313d3bb598fa03c5931a7f037d7ffc,ERR022075_2.fastq.gz 9760280dc6313d3bb598fa03c5931a7f037d7ffc,8bda245a8cd526673aab775f90206c8b67d196af,ERR022075_1.fastq.gz
In the above the suffix ".fastq.gz" is used to extract the prefix ERR022075_
that is used to find matching pairs.