forked from amboseli/ramboseli
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
242 lines (166 loc) · 8.86 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
---
title: "An R package for the ARBP"
output: github_document
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "plots/README-"
)
```

The `ramboseli` package provides a set of small functions and utilities for R that can be shared among members of the Amboseli Baboon Research Project. There are instructions for using the package below.
<br>
<br>
# Installation
To install this package, you first need to install the `devtools` package:
```{r, eval = FALSE}
install.packages("devtools")
```
Then, you can install the latest development version of `ramboseli` from github:
```{r, eval = FALSE}
devtools::install_github("amboseli/ramboseli")
```
After you have installed the package once, you can simply load it in the future using:
```{r, message = FALSE, warning = FALSE}
library(ramboseli)
```
<br>
<hr>
<br>
# Current Functionality
- [Sociality Indices](documentation/sociality-indices.md)
- [Plotting helpers and functions](documentation/plotting.md)
<br>
<hr>
<br>
# Instructions for creating an SSH tunnel
Some (but not all!) functions in this package query the babase database directly. This allows you to pull "live" data from babase directly into R without going through the intermediate steps of writing queries in SQL, saving the results, and reading that data into R from static CSV files.
To use these functions, you must create an SSH tunnel to the database server. The instructions below explain how to do this.
<br>
### Getting a login
First, you must have a username and password for the babase database. If you're here and you're affiliated with the ABRP, you probably have this already. To create the tunnel, you must ALSO request a login to SSH into the server, papio.biology.duke.edu. __Most people do NOT have this—it must be requested from Jake.__
<br>
### Creating a Tunnel on Mac/Linux
The simplest way to create the tunnel on a Mac/Linux is to open a terminal window and type the following, substituting your actual username for "YourUsername":
```
ssh -f YourUsername@papio.biology.duke.edu -L 2222:localhost:5432 -N
```
Your Terminal window should then prompt you for your password for papio.biology.duke.edu. After that's entered, a message appears indicating that the connection has been made:
```
###############################################################################
# You are about to access a Duke University computer network that is intended #
# for authorized users only. You should have no expectation of privacy in #
# your use of this network. Use of this network constitutes consent to #
# monitoring, retrieval, and disclosure of any information stored within the #
# network for any purpose including criminal prosecution. #
###############################################################################
```
Once this is done, the tunnel has been created. Keep the Terminal window open and return to R for your analysis.
<br>
### Creating a Tunnel on Windows
This is a little more complicated. I don't have a Windows machine to test it on, but I _think_ this should work. Please let me know ( [camposfa@gmail.com](mailto:camposfa@gmail.com) ) if you try and can't get it to work properly.
- Download and install PuTTY, a free SSH client for Windows: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
a. Scroll down the page and find the link for the appropriate Windows Installer. It should be called something like `putty-0.70-installer.msi` (version number might differ).
b. Download and install the software.
c. Once the installation is complete, note the location where PuTTY was installed. It should be in:
i. `C:\Program Files (x86)\PuTTY\` or
ii. `C:\Program Files\PuTTY\`
- Create a batch file that will open the SSH tunnel
a. Open Notepad or any other text editor.
b. Type in the following text, all on one line, _including the quotation marks,_ replacing "YourUsername" with your username, and replacing "YourPassword" with your password. Note that you may need to replace "Program Files" with "Program Files (x86)"
```
"C:\Program Files\PuTTY\plink.exe" ssh -f YourUsername@papio.biology.duke.edu -L 2222:localhost:5432 -N -pw YourPassword
```
Alternatively, if you don't want to save your password in a plain text file, use the following text that will prompt you for your password each time you run the batch file.
```
"C:\Program Files\PuTTY\plink.exe" ssh -f YourUsername@papio.biology.duke.edu -L 2222:localhost:5432 -N
```
- Save the batch file somewhere convenient with a name like "babase_tunnel.bat"
- Open the SSH tunnel
a. Run the batch file that you just created by double clicking on it.
b. If there is an error, the window will close on its own, you will have no tunnel, and you will need to correct the errors!
c. If you are successful, a command prompt window will display the text in the batch file, along with a message.
d. You can leave this window open or minimize it, but don't close it until you want to close the tunnel!
<br>
### Additional Notes
There are shortcuts that can make this process more streamlined, including:
- Using an SSH key and a .pgpass file rather than a password so that you don't have to type the password each time.
- Setting up an alias in your .bash_profile for the tunnel commands.
Fernando can help with this if you're interested.
Every so often, when the connection is left idle, the tunnel seems to get corrupted. Often, you can simply reconnect using the same steps as above. Sometimes reconnecting doesn't work. In those cases, you can "kill" the corrupted tunnel by typing the following into Terminal:
```
ps aux | grep ssh
```
A list of processes should appear in the window, similar to this
```
fac13 23128 0.0 0.0 2445836 260 ?? S 25Jun18 0:00.02 /usr/bin/ssh-agent -l
fac13 14419 0.0 0.0 2432804 776 s000 S+ 12:30PM 0:00.00 grep ssh
fac13 14205 0.0 0.0 2462536 904 ?? Ss 12:27PM 0:00.00 ssh -f fac13@papio.biology.duke.edu -L 2222:localhost:5432 -N
```
The last process is the corrupted tunnel. Kill it by typing `kill` followed by the process ID number:
```
kill 14205
```
<br>
<hr>
<br>
# Reading data from babase
With the tunnel created, we can now connect to babase.
```{r hidden-connection, echo=FALSE, message=FALSE, warning=FALSE}
library(ramboseli)
library(tidyverse)
library(dbplyr)
# You will need to change user to your personal babase login AND get your password
babase <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
host = "localhost",
port = 2222,
user = "fac13",
dbname = "babase")
```
```{r connect-to-babase, eval=FALSE}
library(ramboseli)
library(tidyverse)
library(dbplyr)
# You will need to change user to your personal babase login AND get your password
# One approach to doing that is through Rstudio
# You could also just type it in your script, but that's not recommended for security.
babase <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
host = "localhost",
port = 2222,
user = "fac13",
dbname = "babase",
password = rstudioapi::askForPassword("Database password"))
```
Now we can create a connection to any table or view in babase. Here's an example for the biograph table.
```{r read-biograph}
biograph <- tbl(babase, "biograph")
```
You now have a live connection to the biograph table, stored as an object in R. It's important to note that this hasn't pulled in all the data, only the first few rows. Let's see what this objects looks like:
```{r print-biograph}
biograph
```
We can perform basic queries and joins on the the table using the `dplyr` and `dbplyr` packages. There's a lot more information [here](https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html).
If we want to pull in all the data into our current R environment as a normal data frame, we can use the function `collect`.
```{r collect-biograph}
biograph_l <- collect(biograph)
biograph_l
```
Need a table that in a different schema? Use the `in_schema` function:
```{r schema}
all_ranks <- tbl(babase, in_schema("babase_pending", "ALL_RANKS"))
collect(all_ranks)
```
<br>
<hr>
<br>
# Best practices for using `ramboseli` functions
1. Supply a database connection and no tables to obtain data directly through queries
* Uses most up-to-date data
* Use when carrying out exploratory analysis
* Not reproducible
2. Supply tables from the R environment and no database connection
* Uses static input files (a "snapshot" of specific babase tables)
* Reqires that user queries database first to produce static files and loads them into R environment
* Use for final analysis if you need reproducibility