---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
dpi = 100,
warning = FALSE,
message = FALSE
)
```
# tidytof: A user-friendly framework for interactive and reproducible cytometry data analysis <a href='https://keyes-timothy.github.io/tidytof/index.html'><img src='man/figures/tidytof_logo.png' align="right" height="139" /></a>
<!-- badges: start -->
[![R-CMD-check](https://github.com/keyes-timothy/tidytof/workflows/R-CMD-check/badge.svg)](https://github.com/keyes-timothy/tidytof/actions)
[![R-CMD-check-bioc](https://github.com/keyes-timothy/tidytof/workflows/R-CMD-check-bioc/badge.svg)](https://github.com/keyes-timothy/tidytof/actions)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![Codecov test coverage](https://codecov.io/gh/keyes-timothy/tidytof/branch/main/graph/badge.svg)](https://app.codecov.io/gh/keyes-timothy/tidytof?branch=main)
<!-- badges: end -->
`{tidytof}` is an R package that implements an open-source, integrated "grammar" of single-cell data analysis for high-dimensional cytometry data (i.e. [mass cytometry](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4860251/), full-spectrum flow cytometry, and sequence-based cytometry). Specifically, `{tidytof}` provides an easy-to-use pipeline for handling high-dimensional cytometry data at multiple levels of observation - the single-cell level, the cell subpopulation (or cluster) level, and the whole-sample level - by automating many common data-processing tasks under a common ["tidy data"](https://r4ds.had.co.nz/tidy-data.html) interface.
As an extension of the `tidyverse` ecosystem of data manipulation tools in R, all of `{tidytof}`'s functions have been developed with an internally consistent, human-centered set of design principles. This means that using `{tidytof}` should be equally intuitive among scientists with a wide range of coding experience (including beginners).
## Getting started
### Prerequisites
`{tidytof}` makes heavy use of two concepts that R beginners may be unfamiliar with. The first is the pipe (`|>`), which you can read about [here](https://r4ds.had.co.nz/pipes.html). The second is "grouping" data in a `data.frame` or `tibble` using `dplyr::group_by`, which you can read about [here](https://dplyr.tidyverse.org/articles/grouping.html).
Everything else should be self-explanatory for beginner and advanced R users, though if you have *zero* background in running R code, you should read [this chapter](https://r4ds.had.co.nz/workflow-basics.html) of [R for Data Science](https://r4ds.had.co.nz/index.html) by Hadley Wickham.
### Package structure
Broadly speaking, `{tidytof}`'s functionality is organized to support 3 levels of analysis inherent in single-cell data:
1. Reading, writing, preprocessing, and visualizing data at the level of **single cells**
2. Identifying and describing cell **subpopulations** or **clusters**
3. Building models (for inference or prediction) at the level of **patients** or **samples**
How to use `{tidytof}` at each of these levels of cytometry data analysis is detailed in the "Usage" section below.
## Installation
You can install the development version of tidytof from GitHub with the following command:
```{r, eval=FALSE}
if (!require(devtools)) install.packages("devtools")
devtools::install_github("keyes-timothy/tidytof")
```
Once `{tidytof}` is installed, you can attach it to your current R session using the following code:
```{r, message = FALSE, warning = FALSE}
library(tidytof)
```
In addition, we can install and load the other packages we need for this vignette:
```{r}
if (!require(FlowSOM)) BiocManager::install("FlowSOM")
library(FlowSOM)
if (!require(tidyverse)) install.packages("tidyverse")
library(tidyverse)
```
## Usage
### Analyzing data at the single-cell level
#### Reading data with `tof_read_data`
`{tidytof}` comes bundled with several example mass cytometry datasets. To access the raw .fcs and .csv files containing these data, use the `tidytof_example_data` function. When called with no arguments, `tidytof_example_data` will return a character vector naming the datasets contained in `{tidytof}`:
```{r}
tidytof_example_data()
```
To obtain the file path for the directory containing each dataset, call `tidytof_example_data` with one of these dataset names as its argument. For example, to obtain the directory for the phenograph data, we would use the following command:
```{r}
tidytof_example_data("phenograph")
```
Using one of these directories (or any other directory containing cytometry data on your local machine), we can use `tof_read_data` to read cytometry data from raw files. Acceptable formats include .fcs files and .csv files. Importantly, `tof_read_data` is smart enough to read single .fcs/.csv files or multiple .fcs/.csv files depending on whether its first argument (`path`) leads to a single file or to a directory of files.
Here, we can use `tof_read_data` to read in all of the .fcs files in the "phenograph" example dataset bundled into `{tidytof}` and store the result in the `phenograph` variable.
```{r}
phenograph <-
tidytof_example_data("phenograph") |>
tof_read_data()
phenograph |>
class()
```
Regardless of its input format, `{tidytof}` reads data into an extended `tibble` called a `tof_tbl` (pronounced "tof tibble"), an S3 class identical to `tbl_df`, but with one additional attribute ("panel"). `{tidytof}` stores this additional attribute in `tof_tbl`s because, in addition to analyzing cytometry data from individual experiments, cytometry users often want to compare panels between experiments to find common markers or to compare which metals are associated with particular markers across panels.
A few notes about `tof_tbl`s:
- `tof_tbl`s contain one cell per row and one cytometry channel per column (to provide the data in its "tidy" format).
- `tof_read_data` adds an additional column to the output `tof_tbl` encoding the name of the file from which each cell was read (the "file_name" column).
- Because `tof_tbl`s inherit from the `tbl_df` class, all methods available to tibbles are also available to `tof_tbl`s. For example, `{dplyr}`'s useful `mutate` method can be applied to our `tof_tbl` named `phenograph` above to convert the columns encoding the phenograph cluster ID and stimulation condition to which each cell belongs into character vectors (instead of their original numeric codes in the uncleaned dataset).
```{r}
phenograph <-
phenograph |>
# mutate the input tof_tbl
mutate(
PhenoGraph = as.character(PhenoGraph),
Condition = as.character(Condition)
)
phenograph |>
# use dplyr's select method to show that the columns have been changed
select(where(is.character)) |>
head()
```
The `tof_tbl` class is preserved even after these transformations.
```{r}
phenograph |>
class()
```
Finally, to retrieve panel information from a `tof_tbl`, use `tof_get_panel`:
```{r}
phenograph |>
tof_get_panel() |>
head()
```
Importantly, `tof_read_data` uses an opinionated heuristic to mine different keyword slots of the input .fcs file(s) and guess which metals and antigens were used during data collection. Thus, when .csv files are being read using `tof_read_data`, it is recommended to use the `panel_info` argument to provide the panel manually (as .csv files, unlike .fcs files, do not provide built-in metadata about the columns they contain).
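As a minimal sketch of supplying a panel manually while reading .csv files (the metal/antigen pairings, directory path, and `panel_info` column names shown here are illustrative assumptions; check `?tof_read_data` for the expected format):

```{r, eval = FALSE}
# a hand-made panel: one row per channel
# (column names are an assumption; see ?tof_read_data)
my_panel <-
    tibble(
        metals = c("Sm154", "Nd148", "Er167"),
        antigens = c("CD45", "CD34", "CD38")
    )

# read .csv files from a directory of your choosing, attaching the panel
tof_read_data(path = "path/to/my_csv_files", panel_info = my_panel)
```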
#### Pre-processing with `tof_preprocess`
Generally, the raw ion counts for each analyte measured on a mass cytometer need to be transformed before cytometry data analysis. Common preprocessing steps may include variance-stabilizing transformations - such as the hyperbolic arcsine (arcsinh) transformation or a log transformation - scaling/centering, and/or denoising.
To perform standard preprocessing tasks with `{tidytof}`, use `tof_preprocess`. `tof_preprocess`'s default behavior is to apply the arcsinh transformation (with a cofactor of 5) to each numeric column in the input `tof_tibble`, as well as to remove the Gaussian noise that Fluidigm software adds to each ion count (this noise is added for visualization purposes, but removing it is recommended for most analyses).
As an example, we can preprocess our `phenograph` `tof_tibble` above and see how our first few measurements change before and after.
```{r}
# before preprocessing
phenograph |>
select(`CD45|Sm154`, `CD34|Nd148`, `CD38|Er167`) |>
head()
```
```{r}
# perform preprocessing
phenograph <-
phenograph |>
tof_preprocess()
# inspect new values
phenograph |>
select(`CD45|Sm154`, `CD34|Nd148`, `CD38|Er167`) |>
head()
```
To alter `tof_preprocess`'s default behavior, change the `channel_cols` argument (to specify which columns of `tof_tibble` should be transformed) and the `transform_fun` argument (to specify which vector-valued function should be used to transform each of the `channel_cols`). To keep the Gaussian noise added by Fluidigm software (or if you are working with a dataset that does not have this noise), set the `undo_noise` argument to `FALSE`.
Finally, note that the built-in function `tof_postprocess` works nearly identically to `tof_preprocess`, but provides different default behavior (namely, applying the reverse arcsinh transformation with a cofactor of 5 to all numeric columns; see `?tof_postprocess` for details).
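As a sketch of this customization (the choice of a log transformation and of these particular channels is purely illustrative):

```{r, eval = FALSE}
phenograph |>
    tof_preprocess(
        # only transform these columns
        channel_cols = c(`CD45|Sm154`, `CD34|Nd148`, `CD38|Er167`),
        # use a log transformation instead of the default arcsinh
        transform_fun = function(x) log(x + 1),
        # keep the Gaussian noise added by the acquisition software
        undo_noise = FALSE
    )
```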
#### Downsampling with `tof_downsample`
Often, cytometry experiments collect tens or even hundreds of millions of cells in total, and it can be useful to downsample to a smaller, more computationally tractable number of cells - either for a final analysis or while developing code. To do this, `{tidytof}` implements the `tof_downsample` verb, which allows downsampling using 3 methods.
Using `{tidytof}`'s built-in dataset `phenograph_data` (which is a smaller version of the dataset we read in ourselves above), we can see that the original size of the dataset is 1000 cells per cluster, or 3000 cells in total:
```{r}
data(phenograph_data)
phenograph_data |>
count(phenograph_cluster)
```
To randomly sample 200 cells per cluster, we can use `tof_downsample` using the "constant" `method`:
```{r}
phenograph_data |>
# downsample
tof_downsample(
method = "constant",
group_cols = phenograph_cluster,
num_cells = 200
) |>
# count the number of downsampled cells in each cluster
count(phenograph_cluster)
```
Alternatively, if we wanted to sample 50% of the cells in each cluster, we could use the "prop" `method`:
```{r}
phenograph_data |>
# downsample
tof_downsample(
method = "prop",
group_cols = phenograph_cluster,
prop_cells = 0.5
) |>
# count the number of downsampled cells in each cluster
count(phenograph_cluster)
```
And finally, you might be interested in a slightly different approach: downsampling not to a fixed number or proportion of cells, but to a fixed *density* in phenotypic space. For example, the following scatterplot demonstrates that certain regions of phenotypic space in `phenograph_data` contain more cells than others along the `cd34`/`cd38` axes:
```{r, warning = FALSE, message = FALSE}
phenograph_data |>
# preprocess all numeric columns in the dataset
tof_preprocess(undo_noise = FALSE) |>
# make a scatterplot
ggplot(aes(x = cd34, y = cd38)) +
geom_point(alpha = 0.5) +
scale_x_continuous(limits = c(NA, 1.5)) +
scale_y_continuous(limits = c(NA, 4)) +
theme_bw()
```
To reduce the number of cells in our dataset until the local density around each cell in our dataset is relatively constant, we can use the "density" `method` of `tof_downsample`:
```{r, warning = FALSE, message = FALSE}
phenograph_data |>
tof_preprocess(undo_noise = FALSE) |>
tof_downsample(
density_cols = c(cd34, cd38),
target_prop_cells = 0.25,
method = "density"
) |>
ggplot(aes(x = cd34, y = cd38)) +
geom_point(alpha = 0.5) +
scale_x_continuous(limits = c(NA, 1.5)) +
scale_y_continuous(limits = c(NA, 4)) +
theme_bw()
```
For more details, check out the documentation for the 3 underlying members of the `tof_downsample_*` function family (which are wrapped by `tof_downsample`):
- `tof_downsample_constant`
- `tof_downsample_prop`
- `tof_downsample_density`
#### Writing data with `tof_write_data`
Finally, users may wish to store single-cell data as .fcs or .csv files after transformation, concatenation, filtering, or other data processing steps such as dimensionality reduction and/or clustering (see below). To write single-cell data from a `tof_tbl` into .fcs or .csv files, use `tof_write_data`.
```{r, eval = FALSE}
# when copying and pasting this code, feel free to change this path
# to wherever you'd like to save your output files
my_path <- file.path("~", "Desktop", "tidytof_vignette_files")
phenograph_data |>
tof_write_data(
group_cols = phenograph_cluster,
out_path = my_path,
format = "fcs"
)
```
`tof_write_data`'s trickiest argument is `group_cols`, the argument used to specify which columns in `tof_tibble` should be used to group cells (i.e. the rows of `tof_tibble`) into separate .fcs or .csv files. Simply put, this argument allows `tof_write_data` to create a single .fcs or .csv file for each unique combination of values in the columns specified by the user. In the example above, cells are grouped into 3 output .fcs files - one for each of the 3 clusters encoded by the `phenograph_cluster` column in `phenograph_data`. These files should have the following names (derived from the values in the `phenograph_cluster` column):
- cluster1.fcs
- cluster2.fcs
- cluster3.fcs
However, suppose we wanted to write multiple files for each cluster by breaking cells into two groups: those that express high levels of `pstat5` and those that express low levels of `pstat5`. We can use `dplyr::mutate` to create a new column in `phenograph_data` that breaks cells into high- and low-`pstat5` expression groups, then add this column to our `group_cols` specification:
```{r, eval = FALSE}
phenograph_data |>
# create a variable representing if a cell is above or below the median
# expression level of pstat5
mutate(expression_group = if_else(pstat5 > median(pstat5), "high", "low")) |>
tof_write_data(
group_cols = c(phenograph_cluster, expression_group),
out_path = my_path,
format = "fcs"
)
```
This will write 6 files with the following names (derived from the values in `phenograph_cluster` and `expression_group`).
- cluster1_low.fcs
- cluster1_high.fcs
- cluster2_low.fcs
- cluster2_high.fcs
- cluster3_low.fcs
- cluster3_high.fcs
A useful feature of `tof_write_data` is that it will automatically concatenate cells into single .fcs or .csv files based on the specified `group_cols` *regardless of how many unique files those cells came from*, allowing for easy concatenation of .fcs or .csv files containing data from a single sample acquired over multiple cytometry runs.
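For instance, a sketch of concatenating cells acquired over multiple runs into one file per sample (the `sample_id` derivation below is hypothetical and assumes file names like "sample1_run1.fcs"; `my_path` is the output directory defined earlier):

```{r, eval = FALSE}
tidytof_example_data("phenograph") |>
    tof_read_data() |>
    # derive a sample identifier from each cell's file of origin
    # (this naming pattern is an assumption for illustration)
    mutate(sample_id = str_remove(file_name, "_run[:digit:]+\\.fcs")) |>
    # cells sharing a sample_id are written to a single .fcs file,
    # regardless of which run they came from
    tof_write_data(
        group_cols = sample_id,
        out_path = my_path,
        format = "fcs"
    )
```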
### Analyzing data at the cluster-level
#### Identifying clusters with `tof_cluster`
Once input files are read into a tabular format and preprocessed/downsampled, we might be interested in clustering our data to define communities of cells with shared characteristics.
To do so, we can use the `tof_cluster` verb. Several clustering methods are implemented in `{tidytof}`, including [FlowSOM](https://pubmed.ncbi.nlm.nih.gov/25573116/), [PhenoGraph](https://pubmed.ncbi.nlm.nih.gov/26095251/), k-means, and others.
To demonstrate, we can apply the FlowSOM clustering algorithm to our `phenograph_data` from above. Note that `phenograph_data` contains 3000 total cells (1000 each from 3 clusters identified in the [original PhenoGraph publication](https://pubmed.ncbi.nlm.nih.gov/26095251/)).
```{r}
phenograph_clusters <-
phenograph_data |>
tof_preprocess() |>
tof_cluster(method = "flowsom", cluster_cols = contains("cd"))
phenograph_clusters |>
select(sample_name, .flowsom_metacluster, everything()) |>
head()
```
The output of `tof_cluster` is a `tof_tbl` identical to the input tibble, but with one additional column (".flowsom_metacluster") that encodes the cluster id for each cell in the input `tof_tbl`. Note that all output columns added to a tibble or `tof_tbl` by `{tidytof}` begin with a full-stop (".") to reduce the likelihood of collisions with existing column names.
Because the output of `tof_cluster` is a `tof_tbl`, we can use `dplyr`'s `count` method to assess the accuracy of the FlowSOM clustering compared to the original clustering from the PhenoGraph paper.
```{r}
phenograph_clusters |>
count(phenograph_cluster, .flowsom_metacluster, sort = TRUE)
```
Here, we can see that the FlowSOM algorithm groups most cells from the same PhenoGraph cluster with one another (with a small number of mistakes per PhenoGraph cluster).
To change which clustering algorithm `tof_cluster` uses, alter the `method` flag; to change the columns used to compute the clusters, change the `cluster_cols` flag. And finally, if you want to return a `tibble` that only includes the cluster labels (not the cluster labels added as a new column to the input `tof_tbl`), set `augment` to `FALSE`.
```{r}
# will result in a tibble with only 1 column (the cluster labels)
phenograph_data |>
tof_preprocess() |>
tof_cluster(method = "flowsom", cluster_cols = contains("cd"), augment = FALSE) |>
head()
```
#### Dimensionality reduction with `tof_reduce_dimensions()`
After clusters are identified, a useful tool for visualizing them is dimensionality reduction, a form of unsupervised machine learning used to represent high-dimensional datasets in a smaller, easier-to-visualize number of dimensions.
`{tidytof}` includes several algorithms commonly used by biologists for dimensionality reduction: Principal component analysis (PCA), t-distributed stochastic neighbor embedding (tSNE), and uniform manifold approximation and projection (UMAP). To apply these to a dataset, use `tof_reduce_dimensions`:
```{r}
# perform the dimensionality reduction
phenograph_tsne <-
phenograph_clusters |>
tof_reduce_dimensions(method = "tsne")
# select only the tsne embedding columns using a tidyselect helper (contains)
phenograph_tsne |>
select(contains("tsne")) |>
head()
```
By default, `tof_reduce_dimensions` will add reduced-dimension feature embeddings to the input `tof_tbl` and return the augmented `tof_tbl` (that is, a `tof_tbl` with new columns for each embedding dimension) as its result. To return only the feature embeddings themselves, set `augment` to `FALSE` (as in `tof_cluster`).
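For example, to obtain only the tSNE coordinates as a standalone tibble (reusing `phenograph_clusters` from above):

```{r, eval = FALSE}
# returns a tibble containing only the embedding columns
phenograph_clusters |>
    tof_reduce_dimensions(method = "tsne", augment = FALSE) |>
    head()
```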
Regardless of the method used, reduced-dimension feature embeddings can be visualized using `{ggplot2}` (or any graphics package):
```{r}
# plot the tsne embeddings using color to distinguish between clusters
phenograph_tsne |>
ggplot(aes(x = .tsne1, y = .tsne2, fill = phenograph_cluster)) +
geom_point(shape = 21) +
theme_bw() +
labs(fill = NULL)
# plot the tsne embeddings using color to represent CD11b expression
phenograph_tsne |>
ggplot(aes(x = .tsne1, y = .tsne2, fill = cd11b)) +
geom_point(shape = 21) +
scale_fill_viridis_c() +
theme_bw() +
labs(fill = "CD11b expression")
```
Such visualizations can be helpful in qualitatively describing the phenotypic differences between the clusters in a dataset. For example, in the example above, we can see that one of the clusters has high CD11b expression, whereas the others have lower CD11b expression.
#### Differential discovery analysis with `tof_analyze_abundance` and `tof_analyze_expression`
While dimensionality reduction can be used to visualize a clustering result, many cytometry users also want to use statistical tools to rigorously quantify which cluster(s) in their dataset associate with a particular experimental or clinical variable.
Such analyses are often grouped under the umbrella term **differential discovery analysis** and include both comparing the relative *size* of clusters between experimental conditions (**differential abundance analysis; DAA**) as well as comparing marker expression patterns of clusters between experimental conditions (**differential expression analysis; DEA**). `{tidytof}` provides the `tof_analyze_abundance` and `tof_analyze_expression` verbs for differential abundance and differential expression analyses, respectively.
To demonstrate how to use these verbs, we'll first download a dataset originally collected for the development of the [CITRUS](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4084463/) algorithm. These data are available in the `{HDCytoData}` package, which is available on Bioconductor and can be downloaded with the following command:
```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("HDCytoData")
```
To load the CITRUS data into our current R session, we can call a function from the `{HDCytoData}` package, which will provide it to us in a format from the `{flowCore}` package (called a "flowSet"). To convert this into a tidy tibble, we can use `{tidytof}`'s built-in method for converting `{flowCore}` objects into `tof_tbl`s.
```{r, message = FALSE, warning = FALSE}
citrus_raw <- HDCytoData::Bodenmiller_BCR_XL_flowSet()
citrus_data <-
citrus_raw |>
as_tof_tbl(sep = "_")
```
Thus, we can see that `citrus_data` is a `tof_tbl` with `r nrow(citrus_data)` cells (one in each row) and `r ncol(citrus_data)` pieces of information about each cell (one in each column).
We can also extract some metadata from the raw data and join it with our single-cell data using some functions from the `tidyverse`:
```{r}
citrus_metadata <-
tibble(
file_name = as.character(flowCore::pData(citrus_raw)[[1]]),
sample_id = seq_along(file_name),
patient = str_extract(file_name, "patient[:digit:]"),
stimulation = str_extract(file_name, "(BCR-XL)|Reference")
) |>
mutate(
stimulation = if_else(stimulation == "Reference", "Basal", stimulation)
)
citrus_metadata |>
head()
```
Thus, we now have sample-level information about which patient each sample was collected from and which stimulation condition ("Basal" or "BCR-XL") each sample was exposed to before data acquisition.
Finally, we can join this metadata with our single-cell `tof_tbl` to obtain the cleaned dataset.
```{r}
citrus_data <-
citrus_data |>
left_join(citrus_metadata, by = "sample_id")
```
After these data cleaning steps, we now have `citrus_data`, a `tof_tbl` containing cells collected from 8 patients. Specifically, 2 samples were taken from each patient: one in which the cells' B-cell receptors were stimulated (BCR-XL) and one in which they were not (Basal). In `citrus_data`, each cell's patient of origin is stored in the `patient` column, and each cell's stimulation condition is stored in the `stimulation` column. In addition, the `population_id` column stores information about cluster labels that were applied to each cell using a combination of FlowSOM clustering and manual merging (for details, run `?HDCytoData::Bodenmiller_BCR_XL` in the R console).
We might wonder if there are certain clusters that expand or deplete within patients between the two stimulation conditions described above - this is a question that requires differential abundance analysis (DAA). `{tidytof}`'s `tof_analyze_abundance` verb supports the use of 3 statistical approaches for performing DAA: diffcyt, generalized-linear mixed modeling (GLMMs), and simple t-tests. Because the setup described above uses a paired design and only has 2 experimental conditions of interest (Basal vs. BCR-XL), we can use the paired t-test method:
```{r}
daa_result <-
citrus_data |>
tof_analyze_abundance(
cluster_col = population_id,
effect_col = stimulation,
group_cols = patient,
test_type = "paired",
method = "ttest"
)
daa_result
```
Based on this output, we can see that 6 of our 8 clusters have statistically different abundance in our two stimulation conditions. Using `{tidytof}`'s easy integration with `{tidyverse}` packages, we can use this result to visualize the fold-changes of each cluster (within each patient) in the BCR-XL condition compared to the Basal condition using `{ggplot2}`:
```{r, message = FALSE}
plot_data <-
citrus_data |>
mutate(population_id = as.character(population_id)) |>
left_join(
select(daa_result, population_id, significant, mean_fc),
by = "population_id"
) |>
dplyr::count(patient, stimulation, population_id, significant, mean_fc, name = "n") |>
group_by(patient, stimulation) |>
mutate(prop = n / sum(n)) |>
ungroup() |>
pivot_wider(
names_from = stimulation,
values_from = c(prop, n),
) |>
mutate(
diff = `prop_BCR-XL` - prop_Basal,
fc = `prop_BCR-XL` / prop_Basal,
population_id = fct_reorder(population_id, diff),
direction =
case_when(
mean_fc > 1 & significant == "*" ~ "increase",
mean_fc < 1 & significant == "*" ~ "decrease",
TRUE ~ NA_character_
)
)
significance_data <-
plot_data |>
group_by(population_id, significant, direction) |>
summarize(diff = max(diff), fc = max(fc)) |>
ungroup()
plot_data |>
ggplot(aes(x = population_id, y = fc, fill = direction)) +
geom_violin(trim = FALSE) +
geom_hline(yintercept = 1, color = "red", linetype = "dotted", size = 0.5) +
geom_point() +
geom_text(
aes(x = population_id, y = fc, label = significant),
data = significance_data,
size = 8,
nudge_x = 0.2,
nudge_y = 0.06
) +
scale_x_discrete(labels = function(x) str_c("cluster ", x)) +
scale_fill_manual(
values = c("decrease" = "#cd5241", "increase" = "#207394"),
na.translate = FALSE
) +
labs(
x = NULL,
y = "Abundance fold-change (stimulated / basal)",
fill = "Effect",
caption = "Asterisks indicate significance at an adjusted p-value of 0.05"
)
```
Importantly, the output of `tof_analyze_abundance` depends slightly on the underlying statistical method being used, and details can be found in the documentation for each `tof_analyze_abundance_*` function family member:
- `tof_analyze_abundance_diffcyt`
- `tof_analyze_abundance_glmm`
- `tof_analyze_abundance_ttest`
Similarly, suppose we're interested in how intracellular signaling proteins change their expression levels between our two stimulation conditions in each of our clusters. This is a differential expression analysis (DEA) and can be performed using `{tidytof}`'s `tof_analyze_expression` verb. As above, we can use paired t-tests with multiple-hypothesis correction to test for significant differences in each cluster's expression of our signaling markers between stimulation conditions.
```{r}
signaling_markers <-
c(
"pNFkB_Nd142", "pStat5_Nd150", "pAkt_Sm152", "pStat1_Eu153", "pStat3_Gd158",
"pSlp76_Dy164", "pBtk_Er166", "pErk_Er168", "pS6_Yb172", "pZap70_Gd156"
)
dea_result <-
citrus_data |>
tof_preprocess(channel_cols = any_of(signaling_markers)) |>
tof_analyze_expression(
cluster_col = population_id,
marker_cols = any_of(signaling_markers),
effect_col = stimulation,
group_cols = patient,
test_type = "paired",
method = "ttest"
)
dea_result |>
head()
```
While the output of `tof_analyze_expression()` also depends on the underlying test being used, we can see that the result above looks relatively similar to the output from `tof_analyze_abundance()`. Above, the output is a tibble in which each row represents the differential expression results from a single cluster-marker pair - for example, the first row represents the difference in expression of pS6 in cluster 1 between the BCR-XL and Basal conditions. Each row includes the raw p-value and multiple-hypothesis-corrected p-value for each cluster-marker pair.
This result can be used to make a volcano plot to visualize the results for all cluster-marker pairs:
```{r}
volcano_plot <-
dea_result |>
tof_plot_clusters_volcano(
use_ggrepel = TRUE
)
volcano_plot
```
### Analyzing data at the patient- and sample-level
In addition to its verbs that operate on single-cell data directly, `{tidytof}` implements functions for aggregating single-cell measurements into cluster- and sample-level summary statistics that can be analyzed using a variety of statistical models.
#### Feature extraction with `tof_extract_features`
In addition to its functions for analyzing and visualizing cytometry data at the single-cell and cluster levels, `{tidytof}`'s `tof_extract_features` verb allows users to aggregate single-cell and cluster-level information in order to summarize the whole samples (or whole patients) from which cells were collected. These features can be useful for visualizing the differences between patients and samples in different experimental conditions or for building machine learning models.
To understand how the `tof_extract_features` verb works, it's easiest to look at each of its subroutines (the members of the `tof_extract_*` function family) independently.
First, we have `tof_extract_proportion`, which extracts the proportion of cells in each cluster within each sample (with samples defined using the `group_cols` argument):
```{r}
# preprocess the numeric columns in the citrus dataset
citrus_data <-
citrus_data |>
mutate(cluster = str_c("cluster", population_id)) |>
tof_preprocess()
citrus_data |>
tof_extract_proportion(
cluster_col = cluster,
group_cols = c(patient, stimulation)
) |>
head()
```
Like all members of the `tof_extract_*` function family, `tof_extract_proportion()` returns one row for each sample (defined as a unique combination of values of the `group_cols`) and one column for each extracted feature (above, one column for the proportion of each of the 8 clusters in `citrus_data`). These values can also be returned in "long" format by changing the `format` argument:
```{r}
citrus_data |>
tof_extract_proportion(
cluster_col = cluster,
group_cols = c(patient, stimulation),
format = "long"
) |>
head()
```
Another member of the same function family, `tof_extract_central_tendency`, computes the central tendency (e.g. mean or median) of user-specified markers in each cluster:
```{r}
citrus_data |>
tof_extract_central_tendency(
cluster_col = cluster,
group_cols = c(patient, stimulation),
marker_cols = any_of(c("CD45_In115", "CD4_Nd145", "CD20_Sm147")),
central_tendency_function = mean
) |>
head()
```
`tof_extract_threshold` is similar to `tof_extract_central_tendency`, but calculates the proportion of cells above a user-specified expression value for each marker instead of a measure of central tendency:
```{r}
citrus_data |>
tof_extract_threshold(
cluster_col = cluster,
group_cols = c(patient, stimulation),
marker_cols = any_of(c("CD45_In115", "CD4_Nd145", "CD20_Sm147")),
threshold = 5
) |>
head()
```
The two final members of the `tof_extract_*` function family -- `tof_extract_emd` and `tof_extract_jsd` -- are designed specifically for comparing distributions of marker expression between stimulation conditions. As such, they must be given a column (`emd_col` or `jsd_col`, respectively) that identifies the stimulation condition each cell is in, and a `reference_level` that specifies the reference (i.e. unstimulated) condition within that column. With these additional arguments, `tof_extract_emd` computes the Earth-mover's distance between each marker's distribution in each stimulation condition (within each cluster) and its distribution in the reference condition; similarly, `tof_extract_jsd` computes the Jensen-Shannon divergence between the same distributions. Both metrics quantify how different two distributions are from one another, and both are more computationally expensive (but also higher-resolution) than simply comparing measures of central tendency.
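For intuition about what the Earth-mover's distance measures, note that for one-dimensional distributions it equals the area between the two empirical CDFs. The base-R sketch below illustrates this idea on simulated data; it is independent of `{tidytof}`'s actual implementation, which may bin or scale values differently:

```{r, eval = FALSE}
# 1-D earth-mover's distance approximated as the integral of
# |ECDF_x - ECDF_y| over a shared grid of evaluation points
emd_1d <- function(x, y, n_grid = 512) {
    grid <- seq(min(c(x, y)), max(c(x, y)), length.out = n_grid)
    step <- grid[2] - grid[1]
    sum(abs(ecdf(x)(grid) - ecdf(y)(grid))) * step
}

# two normal samples whose means differ by 1 should have an EMD near 1
emd_1d(rnorm(1e4), rnorm(1e4, mean = 1))
```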
```{r}
# Earth-mover's distance
citrus_data |>
tof_extract_emd(
cluster_col = cluster,
group_cols = patient,
marker_cols = any_of(c("CD45_In115", "CD4_Nd145", "CD20_Sm147")),
emd_col = stimulation,
reference_level = "Basal"
) |>
head()
```
```{r}
# Jensen-Shannon Divergence
citrus_data |>
tof_extract_jsd(
cluster_col = cluster,
group_cols = patient,
marker_cols = any_of(c("CD45_In115", "CD4_Nd145", "CD20_Sm147")),
jsd_col = stimulation,
reference_level = "Basal"
) |>
head()
```
Finally, the `tof_extract_features` verb provides a wrapper around each member of its function family, allowing users to extract multiple feature types at once. For example, the following code extracts the proportion of cells in each cluster, the median of several markers in each cluster, and the EMD between the basal and stimulated conditions in each cluster for all patients in `citrus_data`.
```{r, eval = FALSE}
citrus_data |>
tof_extract_features(
cluster_col = cluster,
group_cols = patient,
stimulation_col = stimulation,
lineage_cols = any_of(c("CD45_In115", "CD20_Sm147", "CD33_Nd148")),
signaling_cols = any_of(signaling_markers),
signaling_method = "emd",
basal_level = "Basal"
) |>
head()
```
#### Outcomes modeling with `tof_model`
After extracting sample- or patient-level features, we can use them to build predictive models of clinical outcomes (such as relapse or survival) -- both to forecast how new patients are likely to fare and to identify which cellular features carry prognostic information.
`{tidytof}` implements several functions for building predictive models using sample- or patient-level data. To illustrate how they work, we first download some patient-level data from [this paper](https://pubmed.ncbi.nlm.nih.gov/29505032/) and combine it with sample-level clinical annotations stored in one of `{tidytof}`'s built-in data objects (`ddpr_metadata`).
```{r}
data(ddpr_metadata)
# link for downloading the sample-level data from the Nature Medicine website
data_link <-
"https://static-content.springer.com/esm/art%3A10.1038%2Fnm.4505/MediaObjects/41591_2018_BFnm4505_MOESM3_ESM.csv"
# downloading the data and combining it with clinical annotations
ddpr_patients <-
readr::read_csv(data_link, skip = 2L, n_max = 78L, show_col_types = FALSE) |>
dplyr::rename(patient_id = Patient_ID) |>
left_join(ddpr_metadata, by = "patient_id") |>
dplyr::filter(!str_detect(patient_id, "Healthy"))
ddpr_patients |>
select(where(~ !is.numeric(.x))) |>
head()
```
The data processing steps above result in the `ddpr_patients` tibble. The numeric columns in `ddpr_patients` represent aggregated cell population features for each sample (see [Supplementary Table 5 in this paper](https://www.nature.com/articles/nm.4505#MOESM3) for details). The non-numeric columns represent clinical metadata about each sample (run `?ddpr_metadata` for more information).
There are also a few preprocessing steps that we might want to perform now to save us some headaches when we're fitting models later.
```{r}
ddpr_patients <-
ddpr_patients |>
# convert the relapse_status variable to a factor first,
# which is something we'll want for fitting the model later
# and create the time_to_event and event columns for survival modeling
mutate(
relapse_status = as.factor(relapse_status),
time_to_event = if_else(relapse_status == "Yes", time_to_relapse, ccr),
event = if_else(relapse_status == "Yes", 1, 0)
)
```
##### Separating the training and validation cohorts
In the original DDPR paper, some patients were used to fit the model and the rest were used to assess the model after it was tuned. We can separate our training and validation cohorts using the `cohort` variable in `ddpr_patients`:
```{r}
ddpr_training <-
ddpr_patients |>
dplyr::filter(cohort == "Training")
ddpr_validation <-
ddpr_patients |>
dplyr::filter(cohort == "Validation")
```
```{r}
nrow(ddpr_training)
nrow(ddpr_validation)
```
##### Building a classifier using logistic regression
First, we can build an elastic net classifier to predict which patients will relapse and which patients won't (ignoring time-to-event data for now). For this, we can use the `relapse_status` column in `ddpr_training` as the outcome variable:
```{r}
# find how many of each outcome we have in our cohort
ddpr_training |>
dplyr::count(relapse_status)
```
Specifically, we can use the `tof_split_data` function to split our cohort into a training and test set either once (a "simple" split) or multiple times (using either k-fold cross-validation or bootstrapping). In this case, we use 5-fold cross-validation, but reading the documentation of `tof_split_data` demonstrates how to use other methods.
```{r}
training_split <-
ddpr_training |>
tof_split_data(
split_method = "k-fold",
num_cv_folds = 5,
strata = relapse_status
)
training_split
```
The output of `tof_split_data` varies depending on which `split_method` is used. For cross-validation, the result is a `rset` object from the [rsample](https://rsample.tidymodels.org/) package. `rset` objects are a type of tibble with two columns:
* `splits` - a column in which each entry is an `rsplit` object (which contains a single resample of the full dataset)
* `id` - a character column in which each entry represents the name of the fold that each entry in `splits` belongs to.
We can inspect one of the resamples in the `splits` column to see what they contain:
```{r}
my_resample <- training_split$splits[[1]]
print(my_resample)
class(my_resample)
```
Note that you can use `rsample::training` and `rsample::testing` to return the training and test observations from each resample:
```{r}
my_resample |>
rsample::training() |>
head()
my_resample |>
rsample::testing() |>
head()
```
From here, we can feed `training_split` into the `tof_train_model` function to [tune](https://www.tmwr.org/tuning.html) a logistic regression model that predicts the `relapse_status` of a leukemia patient. Be sure to check out the `tof_create_grid` documentation to learn how to make a hyperparameter search grid for model tuning (in this case, we limit the mixture parameter to a value of 1, which fits a sparse lasso model). Also note that, for demonstration purposes, we include only the features that come from one cell population ("Population 2") in the original dataset, which means that we probably shouldn't expect our model to perform as well as the one in the original paper (which selected from many more features).
```{r, warning = FALSE}
class_mod <-
training_split |>
tof_train_model(
predictor_cols = contains("Pop2"),
response_col = relapse_status,
model_type = "two-class",
hyperparameter_grid = tof_create_grid(mixture_values = 1),
impute_missing_predictors = TRUE,
remove_zv_predictors = TRUE # often a smart decision
)
```
The output of `tof_train_model` is a `tof_model`, an object containing information about the trained model (which can be passed to the `tof_predict` and `tof_assess_model` verbs). Printing a `tof_model` displays information about the optimal hyperparameters along with a table of the model's nonzero coefficients.
```{r}
print(class_mod)
```
We can then use the trained model to make predictions on the validation data that we set aside earlier:
```{r}
class_predictions <-
class_mod |>
tof_predict(new_data = ddpr_validation, prediction_type = "class")
class_predictions |>
dplyr::mutate(
truth = ddpr_validation$relapse_status
)
```
And we can see that our model gets some (but not all!) predictions correct in the validation set we set aside.
We can also assess the model directly using `tof_assess_model`:
```{r}
# calling the function with no new_data evaluates
# the model using its training data
training_assessment <-
class_mod |>
tof_assess_model()
training_assessment
```
And we can make an ROC curve using our metrics:
```{r}
class_mod |>
tof_plot_model() +
labs(subtitle = "ROC Curve (Training data)")
```
We can then assess the model on the validation data:
```{r}
validation_assessment <-
class_mod |>
tof_assess_model(new_data = ddpr_validation)
validation_assessment
```
```{r}
class_mod |>
tof_plot_model(new_data = ddpr_validation) +
labs(subtitle = "ROC Curve (Validation data)")
```
## `{tidytof}`'s Design Principles (and some tips)
`{tidytof}` was designed by a multidisciplinary team of wet-lab biologists, bioinformaticians, and physician-scientists who analyze cytometry and other kinds of single-cell data to solve a variety of problems. As a result, `{tidytof}`'s high-level API was designed with great care to mirror that of the `{tidyverse}` itself - that is, to be [human-centered, consistent, composable, and inclusive](https://design.tidyverse.org/unifying-principles.html) for a wide userbase.
In this section, we describe some miscellaneous design decisions and tips for using `{tidytof}` that may help the enthusiastic user.
### 1. Use the `tof_` prefix to your advantage.
You may notice that most `{tidytof}` functions begin with the prefix `tof_`. This is intentional, as it allows you to use your development environment's code-completion tools to find functions easily (even if you don't remember their exact names). For this reason, we recommend using `{tidytof}` within the RStudio development environment; however, many other code editors have predictive-text functionality that serves a similar purpose.
In general, `{tidytof}` verbs are organized in such a way that your IDE's code-completion tools should also allow you to search for (and compare) related functions with relative ease. (For instance, the `tof_cluster_` prefix is used for all clustering functions, and the `tof_downsample_` prefix is used for all downsampling functions).
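For instance, once `{tidytof}` is attached, base R's `apropos()` can list every function that shares a given prefix:

```{r, eval = FALSE}
library(tidytof)

# list all clustering verbs (anything whose name starts with "tof_cluster")
apropos("^tof_cluster")
```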
### 2. `{tidytof}` functions use 2 kinds of arguments
`{tidytof}` functions are optimized for working with "tidy" data in the form of `tibbles` or `data.frames`. This means that most `{tidytof}` functions share some basic design principles in terms of how their arguments work. For more details about these design principles, check out the [Getting Started with `tidytof` vignette](https://keyes-timothy.github.io/tidytof/articles/tidytof.html).
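Briefly, most arguments fall into one of two kinds: arguments that select columns of the input tibble (unquoted column names or tidyselect helpers, conventionally named with a `_col`/`_cols` suffix) and arguments that control the computation itself (ordinary values such as `method`). A minimal sketch using `tof_cluster` (consult `?tof_cluster` for the exact argument names, which are assumed here):

```{r, eval = FALSE}
citrus_data |>
    tof_cluster(
        # column-selection argument: tidyselect helpers or unquoted names
        cluster_cols = starts_with("CD"),
        # behavior-controlling argument: an ordinary value
        method = "kmeans"
    )
```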
### 3. Use `{tidytof}` to write human-readable pipelines
The real "magic" of `{tidytof}` derives from its ability to simplify multistep data-processing tasks into a simple and readable chunk of code. For example, suppose we just acquired some .fcs files from a mass cytometer and want to perform the following analysis:
1. Read the .fcs files into our R session
2. Arcsinh-transform each column of protein measurements
3. Cluster our cells based on the surface markers in our panel
4. Downsample the dataset such that 100 random cells are picked from each cluster
5. Perform dimensionality reduction on the downsampled dataset using tSNE
6. Visualize the clusters using the low-dimensional tSNE embedding
By using the appropriate `{tidytof}` verbs for each step of our analysis, we can easily write code in which each function call corresponds to exactly one step of our pipeline:
```{r}
input_path <- tidytof_example_data("phenograph")
set.seed(0012)
input_path |>
# step 1
tof_read_data() |>
# step 2
tof_preprocess() |>
# step 3
tof_cluster(method = "phenograph") |>
# step 4
tof_downsample(
group_cols = .phenograph_cluster,
num_cells = 100,
method = "constant"
) |>
# step 5
tof_reduce_dimensions(perplexity = 50, method = "tsne") |>
# step 6
tof_plot_cells_embedding(
embedding_cols = starts_with(".tsne"),
color_col = .phenograph_cluster
)
```
As shown above, stringing together `{tidytof}` verbs creates a pipeline that can be read easily from left-to-right and top-to-bottom -- this means that it will be relatively easy for you to return to this code later (to modify it, or to write a methods section for your next high-impact manuscript!) or, perhaps more importantly, for one of your colleagues to return to it later when they want to recreate your analysis.
### 4. Additional resources
`{tidytof}` is built on top of the `tidyverse` family of R packages. As a result, most users of `{tidytof}` will benefit substantially from spending a few hours with the `{dplyr}`, `{ggplot2}`, and `{tidyr}` package vignettes to learn about some of the many useful functions these packages provide.
To access our recommended list of package vignettes, run the following lines of R code in the console:
```{r, eval = FALSE}
# dplyr
vignette(topic = "dplyr", package = "dplyr")
vignette(topic = "grouping", package = "dplyr")
vignette(topic = "colwise", package = "dplyr")
# ggplot2
vignette(topic = "ggplot2-specs", package = "ggplot2")
# tidyr
vignette(topic = "tidy-data", package = "tidyr")
vignette(topic = "nest", package = "tidyr")
```