dtool CLI commands for working with per-item metadata

Installation

pip install dtool-overlay

Example usage

Get a dataset to play with:

LOCAL_DS_URI=$(dtool cp -q http://bit.ly/Ecoli-ref-genome .)

Show the existing overlays:

$ dtool overlays show $LOCAL_DS_URI
identifiers,relpaths
23ebd7cd21a905d5f255919ca1d0491901cb8718,reference.4.bt2
37e2d68bb38271036d96b6979d24666e0d4fd814,reference.rev.1.bt2
41fb9ae5d4f6c37226ff324c701b84bc3110709e,reference.1.bt2
828ebf503926b7c1b8b07c1995b4ca818814b404,reference.rev.2.bt2
b445ff5a1e468ab48628a00a944cac2e007fb9bc,U00096.3.fasta
d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf,reference.3.bt2
dda8452b346d51b9cf60f0662ef3d6e3b6da2e74,reference.2.bt2

The output above shows that there are no overlays on this dataset. (The "identifiers" and "relpaths" columns are there for bookkeeping.)

Create an "is_fasta" boolean overlay template by using a glob pattern:

$ dtool overlays template glob $LOCAL_DS_URI is_fasta '*.fasta' > is_fasta.csv
$ cat is_fasta.csv
identifiers,is_fasta,relpaths
23ebd7cd21a905d5f255919ca1d0491901cb8718,False,reference.4.bt2
37e2d68bb38271036d96b6979d24666e0d4fd814,False,reference.rev.1.bt2
41fb9ae5d4f6c37226ff324c701b84bc3110709e,False,reference.1.bt2
828ebf503926b7c1b8b07c1995b4ca818814b404,False,reference.rev.2.bt2
b445ff5a1e468ab48628a00a944cac2e007fb9bc,True,U00096.3.fasta
d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf,False,reference.3.bt2
dda8452b346d51b9cf60f0662ef3d6e3b6da2e74,False,reference.2.bt2
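
The mapping from glob pattern to boolean values can be sketched in Python. This is only an illustration of the idea, not the dtool-overlay implementation: the helper name glob_overlay is hypothetical, and using fnmatchcase from the standard library as the matcher is an assumption.

```python
from fnmatch import fnmatchcase

# Relpaths copied from the dataset listing above.
relpaths = [
    "reference.4.bt2",
    "reference.rev.1.bt2",
    "reference.1.bt2",
    "reference.rev.2.bt2",
    "U00096.3.fasta",
    "reference.3.bt2",
    "reference.2.bt2",
]


def glob_overlay(relpaths, pattern):
    """Hypothetical helper: map each relpath to True if it matches the glob."""
    # Assumption: case-sensitive, shell-style matching on the whole relpath.
    return {relpath: fnmatchcase(relpath, pattern) for relpath in relpaths}


is_fasta = glob_overlay(relpaths, "*.fasta")
for relpath, value in is_fasta.items():
    print(relpath, value)
```

Only U00096.3.fasta matches '*.fasta', which agrees with the single True row in the template above.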

Write the overlay template to the dataset:

$ dtool overlays write $LOCAL_DS_URI is_fasta.csv

Show the newly created overlay:

$ dtool overlays show $LOCAL_DS_URI
identifiers,is_fasta,relpaths
23ebd7cd21a905d5f255919ca1d0491901cb8718,False,reference.4.bt2
37e2d68bb38271036d96b6979d24666e0d4fd814,False,reference.rev.1.bt2
41fb9ae5d4f6c37226ff324c701b84bc3110709e,False,reference.1.bt2
828ebf503926b7c1b8b07c1995b4ca818814b404,False,reference.rev.2.bt2
b445ff5a1e468ab48628a00a944cac2e007fb9bc,True,U00096.3.fasta
d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf,False,reference.3.bt2
dda8452b346d51b9cf60f0662ef3d6e3b6da2e74,False,reference.2.bt2
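
Because the overlay is printed as CSV, it is easy to consume from a script. The sketch below parses the output shown above with Python's csv module and picks out the identifiers flagged as FASTA files. The helper name read_overlay is hypothetical, and treating the literal text "True" as the boolean marker is an assumption based on the output above.

```python
import csv
import io

# CSV text copied from the "dtool overlays show" output above.
overlay_csv = """\
identifiers,is_fasta,relpaths
23ebd7cd21a905d5f255919ca1d0491901cb8718,False,reference.4.bt2
37e2d68bb38271036d96b6979d24666e0d4fd814,False,reference.rev.1.bt2
41fb9ae5d4f6c37226ff324c701b84bc3110709e,False,reference.1.bt2
828ebf503926b7c1b8b07c1995b4ca818814b404,False,reference.rev.2.bt2
b445ff5a1e468ab48628a00a944cac2e007fb9bc,True,U00096.3.fasta
d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf,False,reference.3.bt2
dda8452b346d51b9cf60f0662ef3d6e3b6da2e74,False,reference.2.bt2
"""


def read_overlay(csv_text, overlay_name):
    """Hypothetical helper: return {identifier: bool} for one overlay column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: boolean overlays are serialised as the text "True"/"False".
    return {row["identifiers"]: row[overlay_name] == "True" for row in reader}


is_fasta = read_overlay(overlay_csv, "is_fasta")
fasta_ids = [identifier for identifier, flag in is_fasta.items() if flag]
print(fasta_ids)
```

In practice the CSV text would come from redirecting the command's output to a file rather than from an inline string.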

To extract multiple pieces of metadata from each item's relpath, one can use the dtool overlays template parse command. This takes as input a dataset URI, a parse rule (see https://pypi.org/project/parse/ for more details) and a glob rule. The latter decides which relpaths to apply the parsing to.

Consider for example the dataset below:

$ dtool ls http://bit.ly/Ecoli-reads-minified
8bda245a8cd526673aab775f90206c8b67d196af  ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc  ERR022075_1.fastq.gz

The command below could be used to generate a template for the overlays "useful_name" and "read":

$ dtool overlays template parse  \
    http://bit.ly/Ecoli-reads-minified  \
    '{useful_name}_{read:d}.fastq.gz'

This results in the CSV output below:

identifiers,read,useful_name,relpaths
8bda245a8cd526673aab775f90206c8b67d196af,2,ERR022075,ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc,1,ERR022075,ERR022075_1.fastq.gz

To ignore a variable element when parsing, one can use unnamed curly braces. The command below, for example, only generates the overlay "useful_name":

$ dtool overlays template parse  \
    http://bit.ly/Ecoli-reads-minified  \
    '{useful_name}_{:d}.fastq.gz'
identifiers,useful_name,relpaths
8bda245a8cd526673aab775f90206c8b67d196af,ERR022075,ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc,ERR022075,ERR022075_1.fastq.gz
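
The difference between named and unnamed fields can be illustrated with a small Python sketch. To keep it free of third-party dependencies it does not use the parse package itself. Instead, hand-written regular expressions stand in for the two parse rules, which is only an approximation of how parse interprets them. The names NAMED, UNNAMED and parse_relpath are hypothetical.

```python
import re

# Relpaths copied from the dataset listing above.
relpaths = ["ERR022075_2.fastq.gz", "ERR022075_1.fastq.gz"]

# Regex approximation of the rule '{useful_name}_{read:d}.fastq.gz'.
NAMED = re.compile(r"(?P<useful_name>.+?)_(?P<read>\d+)\.fastq\.gz")
# Regex approximation of '{useful_name}_{:d}.fastq.gz': the digits must
# still be present, but they are not captured under a name.
UNNAMED = re.compile(r"(?P<useful_name>.+?)_\d+\.fastq\.gz")


def parse_relpath(pattern, relpath):
    """Hypothetical helper: return the named fields, or None if no match."""
    match = pattern.fullmatch(relpath)
    if match is None:
        return None
    fields = match.groupdict()
    # Mimic the integer conversion implied by the ':d' format specifier.
    if "read" in fields:
        fields["read"] = int(fields["read"])
    return fields


for relpath in relpaths:
    print(relpath, parse_relpath(NAMED, relpath), parse_relpath(UNNAMED, relpath))
```

Each named field becomes an overlay column, while the unnamed field only constrains which relpaths match.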

Sometimes it is useful to be able to find pairs of items, for example when dealing with genomic sequencing data that has forward and reverse reads.

One can create a "pair_id" overlay CSV template for this dataset using the command below:

$ dtool overlays template pairs http://bit.ly/Ecoli-reads-minified .fastq.gz
identifiers,pair_id,relpaths
8bda245a8cd526673aab775f90206c8b67d196af,9760280dc6313d3bb598fa03c5931a7f037d7ffc,ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc,8bda245a8cd526673aab775f90206c8b67d196af,ERR022075_1.fastq.gz

In the example above, the suffix ".fastq.gz" is used to extract the prefix ERR022075_ that is used to find matching pairs.
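
One way to picture the pairing is the Python sketch below. It is not taken from the dtool-overlay source: the helper name pair_overlay is hypothetical, and it rests on the assumption that the single character immediately before the suffix is the read number, so that everything before it is the shared prefix.

```python
from collections import defaultdict

# Identifiers and relpaths copied from the dataset listing above.
items = {
    "8bda245a8cd526673aab775f90206c8b67d196af": "ERR022075_2.fastq.gz",
    "9760280dc6313d3bb598fa03c5931a7f037d7ffc": "ERR022075_1.fastq.gz",
}


def pair_overlay(items, suffix):
    """Hypothetical helper: map each identifier to its partner, or None."""
    groups = defaultdict(list)
    for identifier, relpath in items.items():
        if relpath.endswith(suffix):
            stem = relpath[: -len(suffix)]  # e.g. "ERR022075_2"
            # Assumption: the last character of the stem is the read number.
            prefix = stem[:-1]  # e.g. "ERR022075_"
            groups[prefix].append(identifier)

    overlay = {identifier: None for identifier in items}
    for members in groups.values():
        # Only unambiguous groups of exactly two items are treated as pairs.
        if len(members) == 2:
            first, second = members
            overlay[first] = second
            overlay[second] = first
    return overlay


pair_id = pair_overlay(items, ".fastq.gz")
for identifier, partner in pair_id.items():
    print(identifier, partner)
```

Each item's "pair_id" value is the identifier of the other item sharing its prefix, which matches the CSV output above.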
