Modify Read10X for GEO compatibility #4101
Conversation
Hi @samuel-marsh, thanks for the PR! We will be adding a new function to enable more generic functionality for reading 10x-like files. I am going to close this one. We appreciate all the effort you put in answering questions and submitting PRs!
Just to note: the PR shows as merged (instead of closed) because your other PR #4149 was based off the same branch.
Hi @saketkc, no worries, I knew it was more of a band-aid fix than a real bona fide solution to the GEO issue, especially when it's a large GEO repository where you want to read all the files and they're in the same directory. Happy to help when I can by answering questions and PRs. You and the rest of the Seurat/@satijalab team do so much already; it's really fantastic for the single-cell community! No worries about the weird merge either; I'll branch off first next time so it's not overlapped. Best,
Hi @saketkc & Seurat Team, I think I've got everything working in a better function that might fit the bill, depending on what you have in mind in terms of a new function to read 10X-like files from GEO (or other repos that add a file prefix). At its heart it's (intentionally) just the same function as Read10X. If you think this would be good, or is something you and the Seurat team would want to work off of, just let me know and I'll start a new PR and push this new function. Thanks!
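Purely to illustrate the "file prefix" problem being discussed (this helper is not part of the PR, and the accession regex is an assumption), stripping a GEO-style prefix from 10x file names might look like this sketch:

```r
# Hypothetical helper (not in the PR): strip a GEO-style accession prefix
# such as "GSM123456_" so the standard 10x file names remain.
strip_geo_prefix <- function(filenames) {
  gsub(pattern = "^GS[EM][0-9]+_", replacement = "", x = filenames)
}

strip_geo_prefix(c("GSM123456_barcodes.tsv.gz", "GSM123456_matrix.mtx.gz"))
#> "barcodes.tsv.gz" "matrix.mtx.gz"
```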
Thanks Sam! This looks great! What I have in mind is slightly more generic. This (I think) handles all the use cases you mentioned above, but requires some processing at the user end. There is no requirement for the files to be located in any particular directory; they can even be read from a URL (including the matrix file). Happy to hear your and others' thoughts on this! I will also discuss your suggestion with the Seurat team.

#' Load in data from remote or local mtx files
#' Adapted and inspired from Seurat's Read10X
#'
#' Enables easy loading of sparse data matrices
#'
#' @param mtx Name or remote URL of the mtx file
#' @param cells Name or remote URL of the cells/barcodes file
#' @param features Name or remote URL of the features/genes file
#' @param feature.column Specify which column of features files to use for feature/gene names; default is 2
#' @param cell.column Specify which column of cells file to use for cell names; default is 1
#' @param unique.features Make feature names unique (default TRUE)
#' @param strip.suffix Remove trailing "-1" if present in all cell barcodes.
#'
#' @return A sparse matrix containing the expression data.
#'
#' @importFrom Matrix readMM
#' @importFrom utils read.delim
#' @importFrom httr build_url parse_url
#' @importFrom tools file_ext
#'
#'
#' @export
#' @concept preprocessing
#'
#' @examples
#' \dontrun{
#' # For local files:
#'
#' expression_matrix <- ReadMtx(mtx = "count_matrix.mtx.gz", features = "features.tsv.gz", cells = "barcodes.tsv.gz")
#' seurat_object <- CreateSeuratObject(counts = expression_matrix)
#'
#' # For remote files:
#'
#' expression_matrix <- ReadMtx(
#' mtx = "http://localhost/matrix.mtx",
#' cells = "http://localhost/barcodes.tsv",
#' features = "http://localhost/genes.tsv"
#' )
#' seurat_object <- CreateSeuratObject(counts = expression_matrix)
#' }
#'
ReadMtx <- function(mtx,
cells,
features,
cell.column = 1,
feature.column = 2,
unique.features = TRUE,
strip.suffix = FALSE) {
mtx <- build_url(url = parse_url(url = mtx))
cells <- build_url(url = parse_url(url = cells))
features <- build_url(url = parse_url(url = features))
all_files <- list("Expression matrix" = mtx, "Barcode" = cells, "Gene name" = features)
check_file_exists <- function(filetype, filepath) {
if (grepl(pattern = "^:///", x = filepath)) {
filepath <- gsub(pattern = ":///", replacement = "", x = filepath)
if (!file.exists(paths = filepath)) {
stop(paste(filetype, "file missing. Expecting", filepath), call. = FALSE)
}
}
}
# check if all files exist
lapply(seq_along(all_files), function(y, n, i) {
check_file_exists(n[[i]], y[[i]])
}, y = all_files, n = names(all_files))
# convenience function to read local or remote tab-delimited files
readTableUri <- function(uri) {
if (grepl(pattern = "^:///", x = uri)) {
uri <- gsub(pattern = ":///", replacement = "", x = uri)
textcontent <- read.table(file = uri, header = FALSE, sep = "\t", row.names = NULL)
} else {
if (file_ext(uri) == "gz") {
textcontent <- read.table(
file = gzcon(url(uri), text = TRUE),
header = FALSE, sep = "\t", row.names = NULL
)
} else {
textcontent <- read.table(
file = uri, header = FALSE,
sep = "\t", row.names = NULL
)
}
}
return(textcontent)
}
# read barcodes
cell.barcodes <- readTableUri(uri = cells)
bcols <- ncol(x = cell.barcodes)
if (bcols < cell.column) {
stop(paste0(
"cell.column was set to ", cell.column,
" but ", cells, " only has ", bcols, " columns.",
" Try setting the cell.column argument to a value <= to ", bcols, "."
))
}
cell.names <- cell.barcodes[, cell.column]
if (all(grepl(pattern = "\\-1$", x = cell.names)) & strip.suffix) {
cell.names <- as.vector(x = as.character(x = sapply(
X = cell.names,
FUN = ExtractField,
field = 1,
delim = "-"
)))
}
# read features
feature.names <- readTableUri(uri = features)
fcols <- ncol(x = feature.names)
if (fcols < feature.column) {
stop(paste0(
"feature.column was set to ", feature.column,
" but ", features, " only has ", fcols, " column(s).",
" Try setting the feature.column argument to a value <= to ", fcols, "."
))
}
if (any(is.na(x = feature.names[, feature.column]))) {
na.features <- which(x = is.na(x = feature.names[, feature.column]))
replacement.column <- ifelse(test = feature.column == 2, yes = 1, no = 2)
if (replacement.column > fcols) {
stop(
paste0(
"Some feature names are NA in column ", feature.column,
". Try specifying a different column."
),
call. = FALSE
)
} else {
warning(
paste0(
"Some features names are NA in column ", feature.column,
". Replacing NA names with ID from column ", replacement.column, "."
),
call. = FALSE,
immediate. = TRUE
)
}
feature.names[na.features, feature.column] <- feature.names[na.features, replacement.column]
}
feature.names <- feature.names[, feature.column]
if (unique.features) {
feature.names <- make.unique(names = feature.names)
}
# read mtx
if (grepl(pattern = "^:///", x = mtx)) {
mtx <- gsub(pattern = ":///", replacement = "", x = mtx)
data <- readMM(mtx)
} else {
if (file_ext(mtx) == "gz") {
data <- readMM(gzcon(url(mtx)))
} else {
data <- readMM(mtx)
}
}
colnames(x = data) <- cell.names
rownames(x = data) <- feature.names
return(data)
}

This enables reading GEO files both remotely and locally. Some examples:

Example: GEO link

mtx <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE132044&format=file&file=GSE132044%5Fmixture%5Fhg19%5Fmm10%5Fcount%5Fmatrix%2Emtx%2Egz"
cells <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE132044&format=file&file=GSE132044%5Fmixture%5Fhg19%5Fmm10%5Fcell%2Etsv%2Egz"
features <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE132044&format=file&file=GSE132044%5Fmixture%5Fhg19%5Fmm10%5Fgene%2Etsv%2Egz"
counts <- ReadMtx(mtx = mtx, cells = cells, features = features, feature.column = 1)

Local:

mtx <- "~/data/pbmc10k/filtered_feature_bc_matrix/matrix.mtx.gz"
cells <- "~/data/pbmc10k/filtered_feature_bc_matrix/barcodes.tsv.gz"
features <- "~/data/pbmc10k/filtered_feature_bc_matrix/features.tsv.gz"
counts <- ReadMtx(mtx = mtx, cells = cells, features = features)
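As a side note on two behaviours of the function above, shown with base R only: strip.suffix drops the trailing "-1" only when every barcode carries it, and unique.features relies on make.unique. A minimal standalone sketch (the barcodes and gene symbols are made-up examples):

```r
# strip.suffix behaviour: remove the trailing "-1" only if ALL barcodes end with it
barcodes <- c("AAACCTGAGC-1", "AAACGGGTCA-1")
if (all(grepl(pattern = "-1$", x = barcodes))) {
  barcodes <- sub(pattern = "-1$", replacement = "", x = barcodes)
}
barcodes
#> "AAACCTGAGC" "AAACGGGTCA"

# unique.features behaviour: deduplicate repeated gene symbols
make.unique(c("TP53", "TP53", "ACTB"))
#> "TP53" "TP53.1" "ACTB"
```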
Just to add, the links to supplementary GEO files (GSE/GSM/GPL) can be obtained using the following function (it also allows you to download the files). These can then be passed on to the ReadMtx function above.

#' Fetch supplementary files from GEO
#' @importFrom XML htmlParse xpathSApply
#' @importFrom httr GET content
#' @export
FetchGEOFiles <- function(geo, download.dir = getwd(), download.files = FALSE, ...) {
geo <- trimws(toupper(geo))
geo_type <- substr(geo, 1, 3)
url.prefix <- "https://ftp.ncbi.nlm.nih.gov/geo/"
if (geo_type == "GSE") {
url.prefix <- paste0(url.prefix, "series/")
} else if (geo_type == "GSM") {
url.prefix <- paste0(url.prefix, "samples/")
} else if (geo_type == "GPL") {
url.prefix <- paste0(url.prefix, "platform/")
}
geo_prefix <- paste0(substr(x = geo, start = 1, stop = nchar(geo) - 3), "nnn")
url <- paste0(url.prefix, geo_prefix, "/", geo, "/", "suppl", "/")
response <- GET(url = url)
html_parsed <- htmlParse(file = content(x = response, as = "text"), asText = TRUE)
links <- xpathSApply(doc = html_parsed, path = "//a/@href")
suppl_files <- as.character(grep(pattern = "^G", x = links, value = TRUE))
if (length(suppl_files) == 0) {
return(NULL)
}
file.url <- paste0(url, suppl_files)
file_list <- data.frame(filename = suppl_files, url = file.url)
if (download.files) {
names(file.url) <- suppl_files
download_file <- function(url, filename, ...) {
message(paste0("Downloading ", filename, " to ", download.dir))
download.file(url = url, destfile = file.path(download.dir, filename), mode = "wb", ...)
message("Done!")
}
lapply(seq_along(file.url), function(y, n, i) {
download_file(y[[i]], n[[i]], ...)
},
y = file.url, n = names(file.url)
)
}
return(file_list)
}

Example:

FetchGEOFiles("GSE132044")
filename url
1 GSE132044_HEK293_PBMC_TPM_bulk.tsv.gz https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_HEK293_PBMC_TPM_bulk.tsv.gz
2 GSE132044_NIH3T3_cortex_TPM_bulk.tsv.gz https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_NIH3T3_cortex_TPM_bulk.tsv.gz
3 GSE132044_cortex_mm10_cell.tsv.gz https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_cortex_mm10_cell.tsv.gz
4 GSE132044_cortex_mm10_count_matrix.mtx.gz https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_cortex_mm10_count_matrix.mtx.gz
5 GSE132044_cortex_mm10_gene.tsv.gz https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_cortex_mm10_gene.tsv.gz
6 GSE132044_mixture_hg19_mm10_cell.tsv.gz https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_mixture_hg19_mm10_cell.tsv.gz
7 GSE132044_mixture_hg19_mm10_count_matrix.mtx.gz https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_mixture_hg19_mm10_count_matrix.mtx.gz
8 GSE132044_mixture_hg19_mm10_gene.tsv.gz https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_mixture_hg19_mm10_gene.tsv.gz
9 GSE132044_pbmc_hg38_cell.tsv.gz https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_pbmc_hg38_cell.tsv.gz
10 GSE132044_pbmc_hg38_count_matrix.mtx.gz https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_pbmc_hg38_count_matrix.mtx.gz
11 GSE132044_pbmc_hg38_gene.tsv.gz https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_pbmc_hg38_gene.tsv.gz
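For reference, the URL construction inside FetchGEOFiles can be reproduced on its own. This sketch mirrors the geo_prefix logic above for a series accession (last three digits replaced by "nnn"):

```r
# Rebuild the GEO FTP "suppl" directory URL for a series accession,
# mirroring the geo_prefix logic in FetchGEOFiles above.
geo <- "GSE132044"
geo_prefix <- paste0(substr(x = geo, start = 1, stop = nchar(geo) - 3), "nnn")  # "GSE132nnn"
url <- paste0("https://ftp.ncbi.nlm.nih.gov/geo/series/", geo_prefix, "/", geo, "/suppl/")
url
#> "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/"
```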
Hi @saketkc, OK nice! I love the idea of enabling loading directly from a URL (and the FetchGEO function is great)!! In terms of a comparison between the functions, that is obviously for you and the rest of the team to decide, but I think I favor a few aspects of my function vs. your solution. These are things I'm sure could be implemented in various ways in either function, but I think they are pluses overall for the end user.
Happy to discuss more, but those are just my first thoughts! Thanks!!
Also realizing that the second progress bar in my function (for when the user specifies a single output matrix) is faulty, but I will work on that.
Hi all, I tried to use the code to get my barcode, features, and matrix files into Seurat but have had no luck. Right now my error is that the barcode file is missing, even though it is in the folder. Thanks for all your hard work on this!
Hi Seurat Team,
This PR is in response to #4096, which should simplify importing scRNA-seq data downloaded directly from GEO. It should be fully compatible both with the import of a single dataset and with supplying a named vector of directories to the data.dir parameter. Thanks again for all your hard work!!
Best,
Sam