Getting Started

This document explains the basics of using MOODS in your own workflows. It covers both the MOODS command-line utility moods-dna.py and using the MOODS Python library.

(The documentation is still somewhat sparse at the moment, but it will be continuously expanded as the various paper deadlines permit.)

Getting started with the command-line tool

MOODS 1.9.1 introduced a Python command-line script moods-dna.py for standalone PWM analysis using the MOODS algorithms. This section explains the basics of using moods-dna.py.

You can find moods-dna.py from the scripts/ directory if you downloaded the source package, or from your Python executable directory if you installed from PyPi. See Installation for more details.

Input formats

Matrices. The matrices are expected in the JASPAR .pfm format or the .adm format for first-order models. A JASPAR .pfm file consist of four rows, specifying the counts or frequencies for nucleotides A, C, G and T, respectively:

 0  3 79 40 66 48 65 11 65  0
94 75  4  3  1  2  5  2  3  3
 1  0  3  4  1  0  5  3 28 88
 2 19 11 50 29 47 22 81  1  6

A .adm should similarly give counts or frequencies for the 16 possible combinations of nucleotides (16 rows with m-1 columns), along with the 0-order counts (4 rows with m columns). Annotations at the end of the rows are optional and will be ignored:

0.806   0.815   0.249   0.169   0.088   0.312   0.289   ADM_DI  AA
0.018   0.018   0.021   0.192   0.138   0.156   0.157   ADM_DI  AC
0.016   0.028   0.151   0.578   0.692   0.062   0.325   ADM_DI  AG
0.16    0.139   0.578   0.061   0.082   0.469   0.229   ADM_DI  AT
0.261   0.228   0.216   0.243   0.033   0.345   0.002   ADM_DI  CA
0.388   0.383   0.319   0.38    0.117   0.379   0.001   ADM_DI  CC
0.129   0.137   0.147   0.156   0.814   0       0.997   ADM_DI  CG
0.221   0.251   0.318   0.22    0.037   0.276   0.001   ADM_DI  CT
0.27    0.283   0.112   0.078   0       0.001   0.275   ADM_DI  GA
0.158   0.151   0.079   0.375   0       0.997   0       ADM_DI  GC
0.314   0.218   0.553   0.324   0.999   0.001   0.525   ADM_DI  GG
0.257   0.348   0.256   0.223   0.001   0.002   0.2     ADM_DI  GT
0.039   0.04    0.032   0.004   0.034   0.203   0.189   ADM_DI  TA
0.01    0.009   0.009   0.004   0.035   0.304   0.117   ADM_DI  TC
0.01    0.036   0.026   0.98    0.881   0.089   0.514   ADM_DI  TG
0.941   0.915   0.934   0.012   0.05    0.405   0.18    ADM_DI  TT
0.349   0.319   0.298   0.098   0.01    0.002   0.003   0.004   ADM_MONO_A
0.026   0.027   0.025   0.014   0.034   0.001   0.993   0.001   ADM_MONO_C
0.029   0.024   0.041   0.077   0.933   0.995   0.001   0.993   ADM_MONO_G
0.596   0.63    0.636   0.811   0.023   0.002   0.003   0.002   ADM_MONO_T

The MOODS standard expectation is that the given matrices are count or frequency matrices and they should be converted to PWMs (i.e. log-likelihood ratios), but it is also possible to input matrices already in PWM format (see below).

Sequences. The sequences can be given either as plain text files, or as fasta files possibly containing multiple sequences. Newlines and leading and trailing whitespace will be ignored, but other characters not encoding nucleotides will be treated as non-matchable positions (i.e., they should not appear outside fasta headers). IUPAC nucleotide ambiguity codes (WSMKRYBDHV) will be treated as coding SNPs, and MOODS will match the matrices versus all possible combinations.

Basic usage

The basic usage pattern of moods-dna.py is the following:

python moods-dna.py -m example-data/matrices/*.{pfm,adm} -s example-data/seq/chr1-5k-55k.fa -p 0.0001

Here, -m specifies the count/frequency matrix files, -s specifies the sequence files and -p gives the p-value threshold for reporting matches. For matrix files already in PWM format, use -S instead of -m.

Refer to python moods-dna.py -h for full list of options. However, one specific option you should be aware of is --batch. In case you have lots of short input sequences (either in separate files or in a single fasta file), using --batch will significantly speed up things, as preprocessing will only be done once, but p-value based threshold computation will then only be done once, which may cause variance in the number of hits produced between different sequences.

Output format

The output from moods-dna.py consists of lines looking like this:

seq_1,MA0001.pfm,13329,+,5.23551915867,CCATAATTGC,
seq_1,MA0001.pfm,15494,+,9.69547780506,CCATWTATAG,ccatTtatag

The comma-separated fields are:

Sequence name (either file name or fasta header).
Matrix file name.
Hit position (first position of sequence is position 0).
Indicates whether the match is versus the input strand (+) or the reverse complement (-).
Match score.
Hit site in the original sequence (note that if the match is -, then the match is versus the reverse complement of this).
Hit sequence with SNPs applied, i.e. shows how the ambiguity codes are resolved to get this score (there can be multiple hits at the same position with different "real" hit sequences).

The output can be interpreted as .csv. The separator can be changed with --sep parameter.

Getting started with Python libraries

The example scripts (scripts/ex-*.py) provided with the MOODS package show how to use and combine the various MOODS functions. This sections explains some of the basic functions in more detail.

In general, you'll probably want to start by importing all the MOODS submodules (make sure that the MOODS/ directory is in sys.path):

import MOODS.parsers
import MOODS.tools
import MOODS.scan

Reading input files

Matrices. The MOODS.parsers module offers basic facilities for reading matrices from .pfm and .adm files, as discussed above. For count or frequency matrices, you can automatically convert them to PWMs:

bg = MOODS.tools.flat_bg(4)
pseudocount = 0.0001
m1 = MOODS.parsers.pfm_to_log_odds("example-data/matrices/MA0001.pfm", bg, pseudocount)
m2 = MOODS.parsers.adm_to_log_odds("example-data/matrices/E2F3.pfm", bg, pseudocount)

The PWMs will be returned as Python list of lists that you can pass directly to other MOODS functions. If something goes wrong, an empty list will be returned.

Note that you need to specify the background distribution and pseudo-counts used in the log-odds conversion; if you have no specific requirements, the values above are a good starting point.

If your matrix files are already in PWM format, use MOODS.parsers.pfm(filename), MOODS.parsers.adm_1o_terms(filename) and MOODS.parsers.adm_0o_terms(filename).

Sequences. MOODS does not currently have dedicated parsers for sequences, as reading the sequences is generally more convenient to do in Python instead of C++. All MOODS functions accept input sequences as Python strings.

The following snippets can be useful for reading sequences. For plain text files, use:

with open(filename, "r") as f:    
    seq = "".join([line.strip() for line in f])

For reading fasta files, you can construct an iterator that will yield (header, sequence) pairs:

from itertools import groupby

def iter_fasta(filename):
    with open(filename, "r") as f:
        iterator = groupby(f, lambda line: line[0] == ">")
        for is_header, group in iterator:
            if is_header:
                header = group.next()[1:].strip()
            else: 
                yield header, "".join(s.strip() for s in group)

# call iter_fasta:
for header, seq in iter_fasta(filename):
    # do stuff with sequences:
    print header, len(seq)

Preprocessing data

(Under construction)

pvalue = 0.0001
matrices = [MOODS.parsers.pfm_to_log_odds(f, bg, pseudocount) for f in filenames]
thresholds = [MOODS.tools.threshold_from_p(m, bg, pvalue) for m in matrices]

Scanning