-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathinformatics_challenge_2023.Rmd
563 lines (415 loc) · 20.8 KB
/
informatics_challenge_2023.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
---
title: "Informatics Challenge! 2023"
date: "`r format(Sys.time(), '%m/%d/%Y')`"
output:
html_document:
toc: true
toc_float: true
toc_depth: 3
number_sections: true
code_folding: hide
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# How the informatics challenge will work
Welcome to the 2023 annual informatics challenge!
The purpose of this is to have some fun trying out things that you might not
have previously tried, and to learn something new. It's also a contest, so you
will have the chance to win fabulous* prizes**!
Below are a series of questions of varying difficulty. Each question will be
worth varying points. Some of these questions are language specific, while
others can be completed however you'd like. Either way,
**please show your work/code**.
I'm writing the questions to be either a task or specific question, such that
it is unambiguous if the answer is right. If you do the task or answer the
question correctly you get full points. No partial credit.
I am asking participants to specify their experience level to split folks
into two categories. I'll be awarding two trophies, one for each category.
- Beginners are those that are just getting started and have been doing
informatics regularly for a year or less.
- You should likely submit as Advanced if you have been doing this for a while
and if all the beginner questions seem really easy and the harder questions
don't seem too bad.
I'm giving two weeks to complete everything. You don't have to complete all the
questions, I'll be defining winners of the two categories based on the highest
total points. If you're the only one who enters, you can win by answering just
one question! Last year there was only one person in the beginner category, so
they won by default!
Feel free to do internet searches to figure things out, though I'm asking that
you **don't use chat-GPT or any AI methods** to generate your answers as you
won't learn anything from that (and that's half the point of all this).
You can put your answers in-line in the Rmd file or you can provide your answers
another way if you prefer.
**The due date will be April 18th at 11:59 PM**
Feel free to ask for clarification on anything!
------------------------------------------------------------------------
## Are you submitting in the beginner or advanced category?
------------------------------------------------------------------------
# Git and GitHub - 5 points per question
1. For your submission, fork my repository (https://github.com/MVesuviusC/2023_informatics_challenge) and send me a link to your fork with your results in a public GitHub repository
2. Make sure your repo includes at least two branches and two commits, with one of your new branches merged back into your main branch
# Beginner questions - 1 point per question
## Basic R data and variable types
3. Make a variable containing a vector of five names
4. Make a variable containing a boolean vector where two elements are true and three are false
5. Use the varible with the boolean values to subset the names variable down to two names inside a new variable
6. Create a varible containing a numeric matrix (10 rows and 10 columns) of random numbers
7. Add 1 to every element in your numeric matrix in a single line of code
## Sequencing basics
8. When doing Illumina sequencing, what is the purpose of the "index" read?
9. What is the difference between single-ended and paired-ended sequencing?
## Make basic plots
You can access a dataset in R using the following code (you'll need the dplyr or tidyverse package installed):
```{r}
storms <- dplyr::storms
```
10. Using the storms data, make a scatterplot of wind vs pressure
11. Using the storms data again, make a histogram of pressure
## Statistics
12. Using the storms data, perform a linear regression on the pressure to determine if wind speed is correlated with pressure
13. When performing PCA, how do you know how many principal components to use for further analysis?
14. What is the purpose of performing a variance stabilization transformation of gene expression values in RNAseq?
15. In statistical terms, what is alpha-level ($\alpha$)?
16. If a given event has a probability of success of 0.1, what is the probability of at least one success given three attempts, assuming that each trial is independant?
## Basic unix commands
17. Which single command would you use to change directories from `~/` to `~/data/expt1/pcrs/`?
18. How do you change permissions on a file to disallow all other users from reading it?
19. How do you rename a file?
20. What command creates a new empty file?
21. How do you make a varible in bash?
## Basic file/data formats
22. What file/data format is this?
```{verbatim, max.height='100px'}
chr10 18595 . A . 285 PASS DP=48;MQSB=1;MQ0F=0;AN=2;DP4=24,24,0,0;MQ=60 GT:DP 0/0:48
chr10 18596 . C . 285 PASS DP=48;MQSB=1;MQ0F=0;AN=2;DP4=23,24,0,0;MQ=60 GT:DP 0/0:47
chr10 18597 . C . 285 PASS DP=49;MQSB=1;MQ0F=0;AN=2;DP4=25,24,0,0;MQ=60 GT:DP 0/0:49
chr10 18598 . T . 285 PASS DP=49;MQSB=1;MQ0F=0;AN=2;DP4=25,24,0,0;MQ=60 GT:DP 0/0:49
chr10 18599 . G . 285 PASS DP=53;MQSB=1;MQ0F=0;AN=2;DP4=25,27,0,0;MQ=60 GT:DP 0/0:52
chr10 18600 . C . 285 PASS DP=53;MQSB=1;MQ0F=0;AN=2;DP4=25,27,0,0;MQ=60 GT:DP 0/0:52
chr10 18601 . A . 285 PASS DP=53;MQSB=1;MQ0F=0;AN=2;DP4=25,27,0,0;MQ=60 GT:DP 0/0:52
chr10 18602 . G . 285 PASS DP=53;MQSB=1;MQ0F=0;AN=2;DP4=26,27,0,0;MQ=60 GT:DP 0/0:53
```
23. What file/data format is this?
```{verbatim, max.height='100px'}
chr1_KZ114997v1_alt 4592 N 10 AAAAAaAAAA II1HIIHIDB
chr1_KZ114997v1_alt 4593 N 10 GGGGGgGGGG IICFIIIIDD
chr1_KZ114997v1_alt 4594 N 10 CCCCCcCCCC IIHHIIIIII
chr1_KZ114997v1_alt 4595 N 10 TTTTTtTTTT IIHIIGII@I
chr1_KZ114997v1_alt 4596 N 11 GGGGGgGGGG^]G IIHHIIIICID
chr1_KZ114997v1_alt 4597 N 11 CCCCCcCCCCC IIEGIIII@IA
chr1_KZ114997v1_alt 4598 N 11 AAAAAaAAAAA II@HIIIICHB
chr1_KZ114997v1_alt 4599 N 11 GGGGGgGGGGG IIGHIHIIDI@
chr1_KZ114997v1_alt 4600 N 11 GGGGGgGGGGG IIEHIHIIHHD
chr1_KZ114997v1_alt 4601 N 11 AAAAAaAAAAA II=HIHIIHI=
chr1_KZ114997v1_alt 4602 N 11 T$TTTTtTTTTT IIGEIHIIHIH
```
24. What file/data format is this?
```{verbatim, max.height='100px'}
@A00498:356:H53CMDRXY:1:2101:1859:1016 1:N:0:CTGACGCG
CTGACGCG
+
FFFFFFFF
@A00498:356:H53CMDRXY:1:2101:2184:1016 1:N:0:CTGACGCG
CTGACGCG
+
FFF:FFFF
@A00498:356:H53CMDRXY:1:2101:2220:1016 1:N:0:CTGACGCG
CTGACGCG
+
FFFFFFFF
@A00498:356:H53CMDRXY:1:2101:2745:1016 1:N:0:CTGACGCG
CTGACGCG
+
FFFFFFFF
```
25. These sequences are all from the same file. What is a possible reason that these sequences are not all the same length?
```{verbatim, max.height='100px'}
@A00498:586:HWWJ3DRX2:1:2101:7238:1063 1:N:0:ATTACTCG+AGGCTATA
TNTGGCTACACTCCAGATCGCAATATTTGTTTTCATGATGATCTGGGTACTGCCAGCAAGGCTCAGTAGAGTAAATGGTATCAAAAGCAGGATCCTCCAGTTACTTTTCCCGATCTCCATCCAAATATTTTCAGGGGTCAACCTGTACTTC
+
F#F:FF,F:FFFF::FFFFFF:FFFF,FF:FFFFF:,FFFFFFFF:FFF:F,F:FFFF:F,F,FFF,FFF,:FF,,F,FF:FFFFF,FF::F,:FFFFF:,F,FF,FF:,FF,FFF:,FFFF,FFFFFFF,F:,F,FF:FF,F:FFF:FF:
@A00498:586:HWWJ3DRX2:1:2101:7274:1063 1:N:0:ATTACTCG+AGGCTATA
CNGTCTTTCTGAGAAAGCAGCTGCAGCCTCCGTAAGGACAGAGGTCAGGCCTGATGCATCCTAGAATCTTC
+
F#F:FFFFFFFFFFFFFFFFFFFFF,F,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00498:586:HWWJ3DRX2:1:2101:7419:1063 1:N:0:ATTACTCG+CGGCTATA
TNTTTTTTGAGACGGAGTTTCGCTCTTATTGCCCAGGCTGGAGTGCCATGGCGTGATCTCGGCTCACCACAACCTCCGCCTCCCAGGTTCAAGTGATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGGATTACAGGCATTCACCACCATGCC
+
F#FFFFFFFFFFFFFFFF:FFFFFF:F,FFFFFFF:F,FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFF:F,FFFFFFFFFFFFFF:FFFFFF:F:FFFFFFF,FFFF:FFFF
@A00498:586:HWWJ3DRX2:1:2101:7618:1063 1:N:0:ATTACTCG+CGGCTATA
TNAAAGAACGCATTTGCACAGCATTACAGATCGCAGGCGGTAGCATGACTTAT
+
F#FFFFFFFFFFFFFFFFFFFFFFF:F,FFFFFFFFFFFFFFFFFFFFFFFFF
```
26. In the previous question, what information does "1:N:0:ATTACTCG+AGGCTATA" represent in the header line?
## Basic troubleshooting
27. I started a new R session and ran the following code and got the following error:
Fix my mistake and explain what I did wrong.
```{r}
ggplot(mtcars, aes(x = cyl, y = mpg)) +
geom_point()
Error in ggplot(mtcars, aes(x = cyl, y = mpg)) :
could not find function "ggplot"
```
28. I started a new R session and ran the following code:
```{r}
summary(mtcars$cy1)
```
I didn't get an error, but this output doesn't look right. What is wrong with the code and how can I fix it?
```{verbatim, max.height='100px'}
Length Class Mode
0 NULL NULL
```
29. I ran the following code and got this error:
Fix my mistake and explain what I did wrong.
```{r}
cutoff_val <- 0.05
pval <- 0.0001
if (pval < cutoff_val & pval >= 0.001) {
print("Significant")
} else if (< 0.001) {
print("Very significant")
} else {
print("Not significant")
}
Error: unexpected '<' in:
" print("Significant")
} else if (<"
> print("Very significant")
[1] "Very significant"
> } else {
Error: unexpected '}' in "}"
> print("Not significant")
[1] "Not significant"
> }
Error: unexpected '}' in "}"
```
------------------------------------------------------------------------
# Intermediate - 2 points per question
## Sequences
30. TRUE or FALSE. When aligning fastq files to a reference, each pair of R1 and R2 reads (reads that have the same header) must be the same sequence length.
## Make fancy plots
31. Using the storms data, make a jitter plot where x is a categorical variable of the wind speed (in bins of 10 knots per bin) and y is pressure. Be sure the colors used are color-blind friendly.
32. Use plotly to create an interactive plot of the dplyr::starwars data where x is homeworld and y = height. On mouseover, name, height and mass should be displayed. Save the plot as an html file.
33. Make a panel of four plots in a single png file where the plots use geom_point(), geom_histogram(), geom_density(), and geom_boxplot() using the storms dataset. The plots should be arranged in a 2x2 grid.
## Alignment
34. When aligning reads using HiSat2, when would you use the --phred64 option?
35. What does this CIGAR string from a bam file tell you about the alignment?
124M2D25M1S
36. What does the bitwise flag 163 found in a bam file mean?
## Bash
37. If you had 1,337 files in a directory and you wanted to change the name of each of them to remove "file_" from their file names, how would you do it in a single line of code (using no semicolons)?
## Intermediate troubleshooting
38. I ran the following code and got the following error:
Fix my mistake and explain what I did wrong.
```{r}
summary(paste(mtcars$mpg, mtcars$disp "mpg vs disp"))
Error: unexpected string constant in "summary(paste(mtcars$mpg, mtcars$disp "mpg vs disp""
```
39. I ran the following code and got the following warning:
Should I be concerned about this warning?
If so, fix my mistake and explain what I did wrong.
```{r}
col_names <- colnames(mtcars)
for (i in 1:12) {
print(max(mtcars[[col_names[i]]]))
}
[1] 33.9
[1] 8
[1] 472
[1] 335
[1] 4.93
[1] 5.424
[1] 22.9
[1] 1
[1] 1
[1] 5
[1] 8
[1] -Inf
Warning message:
In max(mtcars[[col_names[i]]]) :
no non-missing arguments to max; returning -Inf
```
40. I wrote a loop to make a bunch of plots, but when I run it, nothing is plotted.
Fix my code and explain what I did wrong.
```{r}
for (col_name in c("mpg", "disp", "hp")) {
ggplot(mtcars, aes(x = col_name, y = wt)) +
geom_point()
}
```
# Hard - 3 points per question
## R stats
41. When doing a statistical test in R, if you see a p-value of 2.2e-16, what is the special importance of this value compared to any other p-value you might see?
42. TRUE or FALSE. When doing a statistical test in R, if you see a p-value of 0.05, this means that there is a 5% chance that the null hypothesis is true
## 10x single-cell sequencing
43. When sequencing a library created using the 3' 10X kit, what are the recommended sequence lengths for R1, R2, I1 and I2?
## R objects
44. Create a S4 R object to hold the data contained in a single sequence in fastq format
45. Create a method for your S4 object to trim the fastq sequence and quality fields to a length provided as an argument
46. Create another class that has a slot to contain a list of your S4 fastq objects, and has a method to read in a fastq file provided as an argument. Create other slots in the object to: 1) contain the number of fastq sequences in the object and 2) store the min/max sequence length found across all fastq sequences (single min/max across the whole dataset, not min/max per sequence).
## Harder plotting
47. Use the storms data to make a plot with a single line for each named storm, where x is the latitude and y is the longitude. The line should trace the storm's path across time. Color the path by the pressure.
## Get data from NCBI
48. Get the data from SRA #SRR23690179, align to the mouse genome and call SNPs.
49. Get the data from GEO #GSE200744, do PCA on the normalized data and make a single plot showing the first ten principle components, but don't make the traditional PC1 vs PC2 scatterplots. Color each point by normal, met or premet status.
## R loops
50. I have a table of information saved in a variable called `ref_info` (below). Loop over each row of this table and make plots using the storms dataset with the relevant parameters. For instance the code first plot should be something like this:
```{r}
ggplot(storms,
aes(x = year, y = pressure)) +
geom_point() +
scale_y_log10() +
labs(title = "Pressure for Hurricane Irma",
y = "Pressure (log10)",
x = "Year")
```
```{r}
ref_info <-
tribble(~ y_column_name,
~ x_column_name,
~ y_data_transform,
~ plot_title,
~ y_axis_label,
~ x_axis_label,
"pressure",
"year",
"log10",
"Pressure by Year",
"Pressure (log10)",
"Year",
"pressure",
"month",
"log10",
"Pressure by Month",
"Pressure (log10)",
"Year",
"wind",
"pressure",
"none",
"Wind speed vs Pressure",
"Wind speed (knots)",
"Pressure",
"wind",
"status",
"Multiply by 1.15078",
"Wind Speed by Storm Type",
"Wind Speed (mph)",
"Storm Type")
```
## Harder troubleshooting
51. I ran the following code and got the following error:
Fix my mistake so that the plot is generated and explain what I did wrong.
```{r}
library(tidyverse)
library(GenomicFeatures)
library(EnsDb.Hsapiens.v86)
transcript_data <-
transcripts(EnsDb.Hsapiens.v86) %>%
as.data.frame()
mutate(transcript_data,
seqnames = droplevels(seqnames)) %>%
mutate(chrom = paste0("chr_", seqnames)) %>%
rename(Chromosome = chrom) %>%
filter(grepl("^[0-9XYM]", test$seqname, perl = TRUE)) %>%
ggplot(aes(y = Chromosome,
x = start,
color = log10(width))) +
geom_point() +
theme_bw()
Error in rename(., Chromosome = chrom) : object 'chrom' not found
```
## Bash variable expansion
52. Create a bash array with made up file names, then loop over the array and print out both the file name and the file name without the extension (.txt, .gz, etc.) for each.
53. Provide code to locate all uncompressed ".txt", ".sam", ".tsv" and ".csv" files in your home directory (and all downstream subdirectories) and compress them. Do this only for files larger than 5 MB.
------------------------------------------------------------------------
# Too hard - 4 points per question
## PCA interpretation
54. I've attached an image of a PCA plot that I made (odd_pca.png). How would you interpret the shape of the distribution of points in this plot? What does it tell you about the data?
## Super fancy plot
55. Remake your plot of storm paths above, but put a map of the world in the background, with accurate lat/long locations
## Very hard troubleshooting
56. I wrote a loop to make plots, but got this error:
```{r}
[1] "1978 wind range: 10 160"
[1] "1978 pressure range: 882 1022"
Error in 2023 - year : non-numeric argument to binary operator
```
Once you fix that error, your plots will all come out the same, even though they shouldn't.
What is wrong with my code? Explain and fix it so the four plots show the proper distributions, while still using the loops to create the plots.
Note:
- For each year, the two histograms should look different
- The distribution in all four histograms should be different
- Pay attention to the ranges printed out inside the loop compared to your plots
```{r}
library(patchwork)
years <- factor(c("1978", "2020"))
plot_list <- list()
for (year in years) {
print(paste(year,
"wind range:",
min(storms$wind),
max(storms$wind)))
print(paste(year,
"pressure range:",
min(storms$pressure),
max(storms$pressure)))
print(2023 - year)
for (limit in c("pressure", "wind")) {
title_text <- paste("Pressure from Storms in ", year)
plot_list[[paste(year, limit)]] <-
storms %>%
filter(year == year) %>%
ggplot(aes(x = get(limit))) +
geom_histogram(bins = 100) +
labs(title = title_text,
y = "Count",
x = str_to_title(limit))
}
}
(plot_list[[1]] + plot_list[[2]]) / (plot_list[[3]] + plot_list[[4]])
# Why are they all the same!?!?!?!?
message("You hear maniacal laughter in the distance...")
```
------------------------------------------------------------------------
# I hate myself - 10 point per question
## Animate a plot
57. Remake your plot of the storm paths, with the map of the world in the background, but animate it so that each frame of the animation is a day of the year starting at Jan 1st and going through the full year. All years should be plotted simultaneously. Save your results as a gif.
## Identify the mystery data type
58. On the Franklin cluster is a folder /gpfs0/scratch/mvc002/. Inside this
folder are two files:
- mysteryData_R1.bam
- mysteryData_R2.bam
These files are alignments for R1 and R2 from a single dataset. What type of data are these data derived from?
59. There is another file named mysteryData_otherData.bam in the same folder. What type of data is this file derived from?
## Harder programming
In /gpfs0/scratch/mvc002/ there is a bcf file: 'info_challenge.bcf'. This file has SNP calls for 29 samples. I want to calculate average pairwise distances between all samples, counting the number of differing alleles at each position covered in both samples across the whole genome, then dividing by the number of comparisons for each pair of samples.
For instance:
given the last four columns from a single line in the file:
1/1:189,255,0:168:127 1/0:208,255,0:178:127 0/1:183,255,0:140:127 ./.:186,255,0:105:127
The first sample differs from the second at one allele, from the third at two alleles and the fourth at unknown alleles.
For each position, ignore any comparison where one of the two samples isn't covered ("./.").
Since the data aren't phased, treat 0/1 and 1/0 as identical.
Remember that for each position, we are comparing two alleles, so the number of comparisons will be the number of shared sites (for each pair of samples) times two.
This means the matrix of the distance for this position would be:
0 1 1 NaN
1 0 0 NaN
1 0 0 NaN
NaN NaN NaN NaN
And the number of compared alleles would be:
2 2 2 NaN
2 2 2 NaN
2 2 2 NaN
NaN NaN NaN NaN
Your results should be a 29x29 matrix with the average pairwise distances between all samples across all positions. The correct output from this file is in /gpfs0/scratch/mvc002/info_challenge.distances.txt for reference.
I wrote a script to do this on the provided file and my best time running it on the cluster was just over 2 minutes.
(py3_10) [mvc002@r1pl-hpcf-h03 bioinfoTools]$ time python exec/vcfToMatrix.py --vcf test_sc_data.bcf -p 48
real 2m2.622s
user 89m19.035s
sys 1m10.783s
60. Write code that does the same thing
61. Beat my time
62. AI-generated code is becoming more and more common. It is a very powerful tool, but what is one way that using AI-generated code can be disadvantageous? (This question was written with the assistance of GitHub Copilot.)
* a plastic
** trophy, though it may be a bit delayed since I'm in the process of fixing my 3D printer. It shouldn't take too long though.