-
Notifications
You must be signed in to change notification settings - Fork 12
Getting Started
This document explains the basics of using MOODS in your own workflows. It covers both the MOODS command-line utility moods-dna.py
and using the MOODS Python library.
(The documentation is still somewhat sparse at the moment, but it will be continuously expanded as the various paper deadlines permit.)
MOODS 1.9.1 introduced a Python command-line script moods-dna.py
for standalone PWM analysis using the MOODS algorithms. This section explains the basics of using moods-dna.py
.
You can find moods-dna.py
from the scripts/
directory if you downloaded the source package, or from your Python executable directory if you installed from PyPi. See Installation for more details.
Matrices. The matrices are expected in the JASPAR .pfm
format or the .adm
format for first-order models. A JASPAR .pfm
file consist of four rows, specifying the counts or frequencies for nucleotides A, C, G and T, respectively:
0 3 79 40 66 48 65 11 65 0
94 75 4 3 1 2 5 2 3 3
1 0 3 4 1 0 5 3 28 88
2 19 11 50 29 47 22 81 1 6
A .adm
should similarly give counts or frequencies for the 16 possible combinations of nucleotides (16 rows with m-1 columns), along with the 0-order counts (4 rows with m columns). Annotations at the end of the rows are optional and will be ignored:
0.806 0.815 0.249 0.169 0.088 0.312 0.289 ADM_DI AA
0.018 0.018 0.021 0.192 0.138 0.156 0.157 ADM_DI AC
0.016 0.028 0.151 0.578 0.692 0.062 0.325 ADM_DI AG
0.16 0.139 0.578 0.061 0.082 0.469 0.229 ADM_DI AT
0.261 0.228 0.216 0.243 0.033 0.345 0.002 ADM_DI CA
0.388 0.383 0.319 0.38 0.117 0.379 0.001 ADM_DI CC
0.129 0.137 0.147 0.156 0.814 0 0.997 ADM_DI CG
0.221 0.251 0.318 0.22 0.037 0.276 0.001 ADM_DI CT
0.27 0.283 0.112 0.078 0 0.001 0.275 ADM_DI GA
0.158 0.151 0.079 0.375 0 0.997 0 ADM_DI GC
0.314 0.218 0.553 0.324 0.999 0.001 0.525 ADM_DI GG
0.257 0.348 0.256 0.223 0.001 0.002 0.2 ADM_DI GT
0.039 0.04 0.032 0.004 0.034 0.203 0.189 ADM_DI TA
0.01 0.009 0.009 0.004 0.035 0.304 0.117 ADM_DI TC
0.01 0.036 0.026 0.98 0.881 0.089 0.514 ADM_DI TG
0.941 0.915 0.934 0.012 0.05 0.405 0.18 ADM_DI TT
0.349 0.319 0.298 0.098 0.01 0.002 0.003 0.004 ADM_MONO_A
0.026 0.027 0.025 0.014 0.034 0.001 0.993 0.001 ADM_MONO_C
0.029 0.024 0.041 0.077 0.933 0.995 0.001 0.993 ADM_MONO_G
0.596 0.63 0.636 0.811 0.023 0.002 0.003 0.002 ADM_MONO_T
The MOODS standard expectation is that the given matrices are count or frequency matrices and they should be converted to PWMs (i.e. log-likelihood ratios), but it is also possible to input matrices already in PWM format (see below).
Sequences. The sequences can be given either as plain text files, or as fasta files possibly containing multiple sequences. Newlines and leading and trailing whitespace will be ignored, but other characters not encoding nucleotides will be treated as non-matchable positions (i.e., they should not appear outside fasta headers). IUPAC nucleotide ambiguity codes (WSMKRYBDHV
) will be treated as coding SNPs, and MOODS will match the matrices versus all possible combinations.
The basic usage pattern of moods-dna.py
is the following:
python moods-dna.py -m example-data/matrices/*.{pfm,adm} -s example-data/seq/chr1-5k-55k.fa -p 0.0001
Here, -m
specifies the count/frequency matrix files, -s
specifies the sequence files and -p
gives the p-value threshold for reporting matches. For matrix files already in PWM format, use -S
instead of -m
.
Refer to python moods-dna.py -h
for full list of options. However, one specific option you should be aware of is --batch
. In case you have lots of short input sequences (either in separate files or in a single fasta file), using --batch
will significantly speed up things, as preprocessing will only be done once, but p-value based threshold computation will then only be done once, which may cause variance in the number of hits produced between different sequences.
The output from moods-dna.py
consists of lines looking like this:
seq_1,MA0001.pfm,13329,+,5.23551915867,CCATAATTGC,
seq_1,MA0001.pfm,15494,+,9.69547780506,CCATWTATAG,ccatTtatag
The comma-separated fields are:
- Sequence name (either file name or fasta header).
- Matrix file name.
- Hit position (first position of sequence is position 0).
- Indicates whether the match is versus the input strand (+) or the reverse complement (-).
- Match score.
- Hit site in the original sequence (note that if the match is -, then the match is versus the reverse complement of this).
- Hit sequence with SNPs applied, i.e. shows how the ambiguity codes are resolved to get this score (there can be multiple hits at the same position with different "real" hit sequences).
The output can be interpreted as .csv
. The separator can be changed with --sep
parameter.
The example scripts (scripts/ex-*.py
) provided with the MOODS package show how to use and combine the various MOODS functions. This sections explains some of the basic functions in more detail.
In general, you'll probably want to start by importing all the MOODS submodules (make sure that the MOODS/
directory is in sys.path
):
import MOODS.parsers
import MOODS.tools
import MOODS.scan
Matrices. The MOODS.parsers
module offers basic facilities for reading matrices from .pfm
and .adm
files, as discussed above. For count or frequency matrices, you can automatically convert them to PWMs:
bg = MOODS.tools.flat_bg(4)
pseudocount = 0.0001
m1 = MOODS.parsers.pfm_to_log_odds("example-data/matrices/MA0001.pfm", bg, pseudocount)
m2 = MOODS.parsers.adm_to_log_odds("example-data/matrices/E2F3.pfm", bg, pseudocount)
The PWMs will be returned as Python list of lists that you can pass directly to other MOODS functions. If something goes wrong, an empty list will be returned.
Note that you need to specify the background distribution and pseudo-counts used in the log-odds conversion; if you have no specific requirements, the values above are a good starting point.
If your matrix files are already in PWM format, use MOODS.parsers.pfm(filename)
, MOODS.parsers.adm_1o_terms(filename)
and MOODS.parsers.adm_0o_terms(filename)
.
Sequences. MOODS does not currently have dedicated parsers for sequences, as reading the sequences is generally more convenient to do in Python instead of C++. All MOODS functions accept input sequences as Python strings.
The following snippets can be useful for reading sequences. For plain text files, use:
with open(filename, "r") as f:
seq = "".join([line.strip() for line in f])
For reading fasta files, you can construct an iterator that will yield (header, sequence) pairs:
from itertools import groupby
def iter_fasta(filename):
with open(filename, "r") as f:
iterator = groupby(f, lambda line: line[0] == ">")
for is_header, group in iterator:
if is_header:
header = group.next()[1:].strip()
else:
yield header, "".join(s.strip() for s in group)
# call iter_fasta:
for header, seq in iter_fasta(filename):
# do stuff with sequences:
print header, len(seq)
(Under construction)
pvalue = 0.0001
matrices = [MOODS.parsers.pfm_to_log_odds(f, bg, pseudocount) for f in filenames]
thresholds = [MOODS.tools.threshold_from_p(m, bg, pvalue) for m in matrices]
(Under construction)
results = MOODS.scan.scan_dna(seq, matrices, bg, thresholds, 7)
for rs in results:
for r in rs:
print r.pos, r.score
scanner = MOODS.scan.Scanner(7)
scanner.set_motifs(matrices, bg, thresholds, )
results = scanner.scan(seq)
for rs in results:
for r in rs:
print r.pos, r.score