-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdetails.Rmd
126 lines (93 loc) · 5.35 KB
/
details.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
### Please note: This user interface responds dynamically to your input prompts.
There is no "Submit" button. If it seems slow to update, it is busy
processing your data. Please be patient.
## Step 1: Select Sites
This phase of analysis requires:
1. A protein alignment, provisionally assumed to be in FASTA format.
1. A way to recognize how the longitudinal samples are labeled.
By default, sequence names are assumed to be dot-delimited,
with the timepoint label in the first (left-most) field.
1. An indication of which sequence is the TF sequence.
This is currently taken to be the first sequence in the alignment.
If HXB2 is included with the alignment, it is used to number alignment columns
then removed.
#### Specify a Protein Alignment
You can choose either to use the example data from CH505 or your own alignment.
If you select your own alignment file, consider using the Advanced Options.
#### Choose a Cutoff TF Loss
You must choose a setting for the cutoff parameter value. The number
of sites selected depends on the TF loss cutoff threshold. Higher
cutoffs include fewer sites. This is set high by default, but can be
adjusted using the slider bar.
#### Review Results
A list of selected sites will appear when defined. The table is
interactive. You can sort rows by clicking on the column name and
page through results. Searching is also possible.
Things to look for when reviewing results are the number of sites
and accurately parsed timepoint labels.
Timepoint labels can be in any units and are assumed to indicate time
elapsed since infection. A mixture of different units is not advised.
The values are used primarily when reordering sites. After parsing
the field indicated, any characters (e.g. "d123" or "56dpi") will be
stripped and the remaining digits converted to a numeric value (123
and 56).
Other informative columns for review are when_up and tf_area. The
former indicates when the site first exceeds 10% TF loss, or another
value if given by the "Order sites by when they first reach..."
slider. The tf_area is computed as the polygon area under the TF
line, which may help resolve timing between sites that revert or
change at different rates. The columns labeled 'l' and 'r' are used
internally to label sites where indels occur relative to the reference
sequence, and can be disregarded. (They will eventually be removed
from this table.) For rows where these values equal, either value
gives the reference sequence numbering. Rows where 'l' and 'r' differ
indicate inserted sites, relative to the reference sequence, as is
common for hypervariable loops of the HIV-1 Envelope glycoprotein.
When you are satisfied with the selected sites, check the "Send
selected sites to Step 2" box to proceed, then click on "Step 2:
Select Sequences".
#### Advanced Options
Check the box to enable several more choices:
1. Specify the format of your alignment file.
1. Transform asparagines (N) in PNG motifs to another character (O).
The PNG motif is Nx[ST], where x is any amino acid except proline.
1. Define options to parse the sample timepoint label from each sequence.
Identifying timepoint labels is central to the longitudinal analysis.
Each sequence should use a standard naming scheme, delimited by one of
the characters listed, and in the same field. Parse errors will
appear in place of the table of selected sites.
Not yet done:
1. Force inclusion/exclusion of specific sites.
1. Add option to specify TF sequence name.
1. Add option to specify reference sequence name.
## Step 2: Select Sequences
Information on this tab depends on information provided in the
preceding step. When results are available, a summary of the number
of selected sequences will appear on the top left, and a list of
concatamers will appear in a large panel.
Not yet done (future "advanced options"):
1. Generate/download time-series plots per site.
1. Force inclusion/exclusion of specific sequences by name.
1. Specify maximum tolerated deletion & insertion length.
#### Concatamer List
Concatamers are the selected sites from Step 1, strung together. One
concatamer is listed for each of the selected sequences. The order of
sites from left to right is intended to capture their relative
strength of selection, sorted first by when each site first reaches at
least 10% TF loss, then by the cumulative (non-) TF area.
#### Download Results
Each link under "Download Results" provides a different summary of the
analysis, including plots of sequence logos for selected sites among
the selected sequences.
#### Advanced Options
1. The "Minimum variant count" setting specifies that each mutation
should occur in the alignment at least this many times for inclusion
among selected sequences, e.g. 2 omits singleton mutations.
Increasing this setting, or reducing the number of selected sites in
Step 1, will give fewer sets of selected sequences.
1. To do: specify indel lengths to be tolerated for inclusion.
1. Add support to include and exclude sequences by name.
1. Add ability to download refseq lookup table.
#### Sequence Logos
The code to generate sequence logos is a standalone python program. If you see an error like this "Could not find Ghostscript on path. There should be either a gs executable or a gswin32c.exe on your system's path" then please install gs or gswin32c.exe per instructions for "Dependencies" provided at http://weblogo.threeplusone.com/manual.html#download
***