-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
230 lines (139 loc) · 9.18 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# phyloR
<!-- badges: start -->
```{r , echo=FALSE , results='asis' , message=FALSE}
cat(
badger::badge_devel("cparsania/tidywrappers" , color = "blue"),
badger::badge_lifecycle()
)
```
<!-- badges: end -->
```{r, echo=TRUE, message = FALSE, warning = FALSE , include=FALSE}
library(magrittr)
library(dplyr)
```
`phyloR` is an R package to deal with NCBI-BLAST output. It helps to pre-process BLAST tabular output for downstream phylogenetic analysis.
## Install
```{r, echo=TRUE, message = FALSE, warning = FALSE,message=FALSE , eval=FALSE}
if(require("devtools")){
devtools::install_github("cparsania/phyloR")
} else{
install.packages("devtools")
devtools::install_github("cparsania/phyloR")
}
```
## Assign column names to blast tabular output
```{r, echo=TRUE, message = FALSE, warning = FALSE,message=FALSE}
blast_out_file <- system.file("extdata" ,"blast_output_01.txt", package = "phyloR")
blast_out_tbl <- readr::read_delim(blast_out_file , delim = "\t" , comment = "#" ,col_names = F)
colnames(blast_out_tbl) <- phyloR::get_blast_outformat_7_colnames()
colnames(blast_out_tbl)
```
## Filter blast hits
One can filter blast hits by individual or combinations of these variables.
+ e-value
+ bitscore
+ query coverage
+ percent identity
```{r, echo=TRUE, message = FALSE, warning = FALSE,message=FALSE}
## evalue <= 1e-5
blast_out_tbl %>% phyloR::filter_blast_hits(evalue = 1e-5)
## Filter blast hits : bitscore >= 200
blast_out_tbl %>% phyloR::filter_blast_hits(bit_score = 200)
## Filter blast hits : query coverage >= 90
blast_out_tbl %>% phyloR::filter_blast_hits(query_cov = 90, query_length = 253)
## Filter blast hits : percent identity >= 90
blast_out_tbl %>% phyloR::filter_blast_hits(identity = 90)
## Filter blast hits : evalue <= 1e-5, bitscore >= 200, query coverage >= 90, percent identity >= 60,
blast_out_tbl %>% phyloR::filter_blast_hits(evalue = 1e-5, bit_score = 200, query_cov = 90, query_length = 253, identity = 60)
```
## Format fasta headers
Fasta file downloaded as an output of blast result has fasta headers of two types
1) Headers with alignment co-ordinates - if only aligned sequences downloaded.
+ E.g. `KAE8371401.1:1-253 trypsin-like cysteine/serine peptidase domain-containing protein [Aspergillus bertholletius]`
2) Headers without alignment co-ordinates - if full sequences downloaded.
+ E.g. `KAE8371401.1 trypsin-like cysteine/serine peptidase domain-containing protein [Aspergillus bertholletius]`
By default, `phyloR::format_fasta_headers()` divides headers in three (header type 2) or four (header type 1) groups delimited by 2 underscores (`__`).
1) subject id
2) alignment coordinates (Optional for header type 1)
3) subject description
4) subject species
One can drop alignment coordinates from sequence headers by keeping argument `keep_alignemnt_coord = FALSE`
```{r, echo=TRUE, message = FALSE, warning = FALSE,message=FALSE}
fa_file <- system.file("extdata" ,"blast_output_01.fasta", package = "phyloR")
fa <- Biostrings::readBStringSet(fa_file)
## existing headers
names(fa) %>% head()
## new headers
fa_file %>% phyloR::format_fasta_headers() %>% names() %>% head()
## drop alignment co-ordinates
fa_file %>% phyloR::format_fasta_headers(keep_alignemnt_coord = FALSE) %>% names() %>% head()
```
## Remove redundant hits from blast tabular ouput
BLAST reports same subject hit multiple time with different alignment co-ordinates. For a same subjet id, `phyloR::remove_redundant_hits()` select longest subject hit out of given multiples.
```{r, echo=TRUE, message = FALSE, warning = FALSE}
blast_out_tbl %>% phyloR::remove_redundant_hits()
```
## Subset fasta sequences
After filtering hits from tabular blast output, next obvious question is to filter same sequences from a fasta file and write them to new file. `phyloR::subset_bstringset()` subset the sequences from a fasta using a vector of fasta headers.
```{r}
fa_file <- system.file("extdata" ,"blast_output_01.fasta", package = "phyloR")
fa <- Biostrings::readBStringSet(fa_file)
query_headers <- fa %>% names() %>% sample(100)
seq_filtered <- phyloR::subset_bstringset(x = query_headers , y = fa, partial_match = F)
# One can use function `Biostrings::writeXStringSet()` to write filterd sequences to new fasta file.
```
## Assign taxonomy columns to NCBI protein acession
phyloR allows you to assign NCBI taxonomy levels to NCBI protein accession using function `phyloR::add_taxonomy_columns()`. One of the applications of this function is to assign the NCBI taxonomy levels to blast output, which is required for downstream phylogenetic analysis. For example, a phylogenetic tree generated from blast hits often required to color resultant tree branches either by kingdom, family or any other NCBI taxonomy level. One can easily identify these taxonomy levels using function `phyloR::add_taxonomy_columns()`
```{r, echo=TRUE, message = TRUE, warning = FALSE}
## add kingdom
with_kingdom <- blast_out_tbl %>% slice(1:10) %>%
phyloR::add_taxonomy_columns(ncbi_accession_colname = "subject_acc_ver", taxonomy_level = "kingdom")
with_kingdom %>% dplyr::select(query_acc_ver, subject_acc_ver, kingdom)
## add family
with_kingdom_and_family <- with_kingdom %>% phyloR::add_taxonomy_columns(ncbi_accession_colname = "subject_acc_ver" ,
taxonomy_level = "family")
with_kingdom_and_family %>% dplyr::select(query_acc_ver, subject_acc_ver, kingdom, family )
## add species
with_kingdom_family_and_species <- with_kingdom_and_family %>% phyloR::add_taxonomy_columns(ncbi_accession_colname = "subject_acc_ver" ,
taxonomy_level = "species")
with_kingdom_family_and_species %>% dplyr::select(query_acc_ver, subject_acc_ver, kingdom, family ,species)
```
## Use NCBI taxonomy levels to color the phyloegenetic tree
An object of class `phylo` is widely used object to visualize the phylogenetic tree in R. To gain more insights from the resultant phylogenetic tree, often tree tips required to color by different levels of taxonomy, for instance species, kingdom or family. Obtaining taxonomy data to use them for tree annotations is not straight forward task and require lots of data wrangling in R. phyloR provides handy way to map such a taxonomy data to the phylogenetic tree. For example, one can convert an object of class `phylo` to an object of class `tibble` and vice versa. Once the tibble obtained, phyloR allows to add columns of ncbi taxonomy levels. Resultant tibble can be passed to `ggtree::ggtree()` to visualize the tree and taxonomy levels can be used, for instance, to color the tree tips.
```{r , message=FALSE,warning=FALSE , fig.width=70, fig.height=20, dpi=272}
tree_string <- "((XP_005187699_1__Musca_domestica:0.070627277,(XP_019893806_1__Musca_domestica:0.071069674,((XP_013113221_1__Stomoxys_calcitrans:0.1494662042,ACB98719_1__Glossina_morsitans_morsitans:0.3489851076)67.4/100:0.0470213767,XP_013102958_1__Stomoxys_calcitrans:0.1794878827)98.1/100:0.0959227604)88.2/99:0.0323598861)93/99:0.0435291148,((XP_017472861_1__Rhagoletis_zephyria:0.0049337059,XP_017472862_1__Rhagoletis_zephyria:0.0112391294)97.3/100:0.0860969479,(XP_020713236_1__Ceratitis_capitata:0.2642805176,(XP_014102010_1__Bactrocera_oleae:0.1183517872,XP_018784523_1__Bactrocera_latifrons:0.1137567198)29.6/88:0.0758551876)99.9/100:0.247740081)92/100:0.0716529011)34.3/66:2.487103817;"
tree_objct <- treeio::read.tree(text = tree_string)
tree_tbl <- tree_objct %>%
ggtree::fortify()
tree_tbl <- tree_tbl %>%
dplyr::mutate( seqid = dplyr::case_when(isTip ~ stringr::str_replace(label , pattern = "__.*","" ) %>% ## split by '__'
stringr::str_replace(pattern = "_\\d$" , ""), ## remove trailing digits from seqid
TRUE ~ label
)
)
## add taxonomy
tree_tbl_with_taxonomy <- tree_tbl %>%
phyloR::tidy_taxonomy_tree(ncbi_accession_colname = "seqid",taxonomy_levels = c("species" ,"kingdom","family"))
## visualize tree
# tips colored by species
tree_tbl_with_taxonomy %>% ggtree::ggtree() + ggtree::geom_tiplab(ggplot2::aes(color = species, label = seqid),size = 20) +
ggplot2::theme(legend.position = "bottom" , legend.text = ggplot2::element_text(size= 50) , legend.title = ggplot2::element_text(size= 50))
# tips colored by family
tree_tbl_with_taxonomy %>% ggtree::ggtree() + ggtree::geom_tiplab(ggplot2::aes(color = family,label = seqid),size = 20) +
ggplot2::theme(legend.position = "bottom" , legend.text = ggplot2::element_text(size= 50) , legend.title = ggplot2::element_text(size= 50))
# tips colored by kingdom
tree_tbl_with_taxonomy %>% ggtree::ggtree() + ggtree::geom_tiplab(ggplot2::aes(color = kingdom,label = seqid ),size =20) +
ggplot2::theme(legend.position = "bottom" , legend.text = ggplot2::element_text(size= 50) , legend.title = ggplot2::element_text(size= 50))
```