
Modify Read10X for GEO compatibility #4101

Merged
merged 1 commit into from
Feb 27, 2021
Conversation

samuel-marsh
Collaborator

Hi Seurat Team,

This PR is in response to #4096 and should simplify importing scRNA-seq data downloaded directly from GEO. It should be fully compatible both with importing a single dataset and with supplying a named vector of directories to the data.dir parameter.

Thanks again for all your hard work!!
Best,
Sam

@saketkc
Collaborator

saketkc commented Feb 27, 2021

Hi @samuel-marsh, thanks for the PR! We will be adding a new function to enable more generic functionality for reading 10X-like files, so I am going to close this one. We appreciate all the effort you put into answering questions and submitting PRs!

@saketkc
Collaborator

saketkc commented Feb 27, 2021

Just to note: the PR shows as merged (instead of closed) because your other PR #4149 was based off the same branch.

@samuel-marsh
Collaborator Author

Hi @saketkc, no worries; I knew it was more of a band-aid fix than a real, bona fide solution to the GEO issue, especially for a large GEO repository where you want to read all the files from the same directory.
I've been working on a different solution that doesn't require a named input to read all the files from a single directory, but it's not currently in a package-implementable form. It will be great to have this kind of function in Seurat, though; it will really simplify the file import process for public datasets!

Happy I can be helpful answering questions and submitting PRs. You and the rest of the Seurat/@satijalab team do so much already; it's really fantastic for the single-cell community! No worries about the weird merge either; I'll branch off first next time so they don't overlap.

Best,
Sam

@samuel-marsh
Collaborator Author

Hi @saketkc & Seurat Team,

I think I've got everything working in a better function that might fit the bill, depending on what you have in mind for the new function to read 10X-like files from GEO (or other repos that add a file prefix). At its heart it's (intentionally) just the same function as Read10X, with a couple of modifications. I'll post it below with a brief description of the changes.

If you think this would be useful, or is something the Seurat team would want to work from, just let me know and I'll open a new PR with this function.

Thanks!
Sam

Removed:
--Multiple-directory support: running the function in a loop over multiple directories (or running it multiple times) should typically be sufficient, or files can be moved into one directory while keeping their prefixes.
--The Cell Ranger V3 file-name modification: files from GEO are gzipped, and the function detects the file names directly.

Changed:
--Now returns a list of lists of matrices when single.matrix = FALSE (the default; see next point). Each top-level list (one per sample in the GEO series) is named by its sample name/prefix; sub-lists are named by the feature type present (e.g. Gene Expression).
--Combining all matrices into a single output matrix is now controlled by a single.matrix parameter, which defaults to FALSE. Per the comments in the original code, that combining relied on the features file being identical across all samples, which is not safely true across all repos/datasets. Combining therefore remains available when it is safe; otherwise the function returns lists, which can be merged via Seurat objects (or by the user before object creation).
--Added a sample.names parameter so the user can specify which samples from the record to import, allowing part of a series to be read without importing the entire record.

Read10X_GEO <- function(
  data.dir = NULL,
  sample.names = NULL,
  gene.column = 2,
  cell.column = 1,
  unique.features = TRUE,
  strip.suffix = FALSE,
  single.matrix = FALSE
) {
  if (length(x = data.dir) > 1) {
    stop("Read10X_GEO only supports reading from a single data directory at a time.")
  }
  if (!dir.exists(paths = data.dir)) {
    stop("Directory provided does not exist")
  }
  if (is.null(x = sample.names)) {
    # pull all file names from directory
    file.list <- list.files(path = data.dir, pattern = "barcodes.tsv", full.names = FALSE)
    # Remove "barcodes.tsv.gz" file suffix
    sample.names <- gsub(pattern = "barcodes.tsv.gz", x = file.list, replacement = "")
  }

  full.data <- list()
  message("Reading 10X files from directory")
  pb <- txtProgressBar(min = 0, max = length(sample.names), style = 3, file = stderr())
  for (i in seq_along(along.with = sample.names)) {
    barcode.loc <- file.path(data.dir, paste0(sample.names[i], 'barcodes.tsv.gz'))
    gene.loc <- file.path(data.dir, paste0(sample.names[i], 'genes.tsv.gz'))
    features.loc <- file.path(data.dir, paste0(sample.names[i], 'features.tsv.gz'))
    matrix.loc <- file.path(data.dir, paste0(sample.names[i], 'matrix.mtx.gz'))
    # Flag to indicate if this data is from CellRanger >= 3.0
    pre_ver_3 <- file.exists(gene.loc)
    if (!file.exists(barcode.loc)) {
      stop("Barcode file missing. Expecting ", basename(path = barcode.loc))
    }
    if (!pre_ver_3 && !file.exists(features.loc) ) {
      stop("Gene name or features file missing. Expecting ", basename(path = features.loc))
    }
    if (!file.exists(matrix.loc)) {
      stop("Expression matrix file missing. Expecting ", basename(path = matrix.loc))
    }
    data <- readMM(file = matrix.loc)
    cell.barcodes <- read.table(file = barcode.loc, header = FALSE, sep = '\t', row.names = NULL)
    if (ncol(x = cell.barcodes) > 1) {
      cell.names <- cell.barcodes[, cell.column]
    } else {
      cell.names <- readLines(con = barcode.loc)
    }
    if (all(grepl(pattern = "\\-1$", x = cell.names)) & strip.suffix) {
      cell.names <- as.vector(x = as.character(x = sapply(
        X = cell.names,
        FUN = ExtractField,
        field = 1,
        delim = "-"
      )))
    }
    if (is.null(x = names(x = data.dir))) {
      if (i < 2) {
        colnames(x = data) <- cell.names
      } else {
        colnames(x = data) <- paste0(i, "_", cell.names)
      }
    } else {
      colnames(x = data) <- paste0(names(x = data.dir)[i], "_", cell.names)
    }
    feature.names <- read.delim(
      file = ifelse(test = pre_ver_3, yes = gene.loc, no = features.loc),
      header = FALSE,
      stringsAsFactors = FALSE
    )
    if (any(is.na(x = feature.names[, gene.column]))) {
      warning(
        'Some feature names are NA. Replacing NA names with the ID from the other column',
        call. = FALSE,
        immediate. = TRUE
      )
      na.features <- which(x = is.na(x = feature.names[, gene.column]))
      replacement.column <- ifelse(test = gene.column == 2, yes = 1, no = 2)
      feature.names[na.features, gene.column] <- feature.names[na.features, replacement.column]
    }
    if (unique.features) {
      fcols = ncol(x = feature.names)
      if (fcols < gene.column) {
        stop(paste0("gene.column was set to ", gene.column,
                    " but features.tsv.gz (or genes.tsv.gz) only has ", fcols, " columns.",
                    " Try setting the gene.column argument to a value <= ", fcols, "."))
      }
      rownames(x = data) <- make.unique(names = feature.names[, gene.column])
    }
    # In cell ranger 3.0, a third column specifying the type of data was added
    # and we will return each type of data as a separate matrix
    if (ncol(x = feature.names) > 2) {
      data_types <- factor(x = feature.names$V3)
      lvls <- levels(x = data_types)
      if (length(x = lvls) > 1 && length(x = full.data) == 0) {
        message("10X data contains more than one type and is being returned as a list containing matrices of each type.")
      }
      expr_name <- "Gene Expression"
      if (expr_name %in% lvls) { # Return Gene Expression first
        lvls <- c(expr_name, lvls[-which(x = lvls == expr_name)])
      }
      data <- lapply(
        X = lvls,
        FUN = function(l) {
          return(data[data_types == l, , drop = FALSE])
        }
      )
      names(x = data) <- lvls
    } else {
      data <- list(data)
    }
    full.data[[length(x = full.data) + 1]] <- data
    setTxtProgressBar(pb = pb, value = i)
  }
  close(con = pb)
  # Combine all the data from different directories into one big matrix, note this
  # assumes that all data directories essentially have the same features files
  if (single.matrix) {
    list_of_data <- list()
    pb <- txtProgressBar(min = 0, max = length(x = full.data[[1]]), style = 3, file = stderr())
    for (j in seq_along(along.with = full.data[[1]])) {
      list_of_data[[j]] <- do.call(cbind, lapply(X = full.data, FUN = `[[`, j))
      # Fix for Issue #913
      list_of_data[[j]] <- as(object = list_of_data[[j]], Class = "dgCMatrix")
      setTxtProgressBar(pb = pb, value = j)
    }
    close(con = pb)
    names(x = list_of_data) <- names(x = full.data[[1]])
    # If multiple features, will return a list, otherwise
    # a matrix.
    if (length(x = list_of_data) == 1) {
      return(list_of_data[[1]])
    } else {
      return(list_of_data)
    }
  }
  names(full.data) <- sample.names
  return(full.data)
}
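For illustration, here is how I'd expect it to be called; the directory path and sample prefixes below are placeholders, purely to show the expected GEO-style layout:

```r
# Hypothetical layout: a GEO series unpacked into one directory, e.g.
#   GSM0000001_A_barcodes.tsv.gz
#   GSM0000001_A_features.tsv.gz
#   GSM0000001_A_matrix.mtx.gz
# (and likewise for the other samples)

# Read every sample in the directory; prefixes are inferred from the file names
data.list <- Read10X_GEO(data.dir = "~/Downloads/GSE_example/")

# Or read only a subset of samples by passing their prefixes explicitly
data.subset <- Read10X_GEO(
  data.dir = "~/Downloads/GSE_example/",
  sample.names = c("GSM0000001_A_", "GSM0000002_B_")
)

# With the default single.matrix = FALSE, each element is a per-sample list of
# matrices keyed by feature type (for Cell Ranger >= 3 files); one Seurat
# object can then be created per sample and merged afterwards
objects <- lapply(X = names(x = data.list), FUN = function(sample) {
  CreateSeuratObject(counts = data.list[[sample]][["Gene Expression"]], project = sample)
})
merged <- merge(x = objects[[1]], y = objects[-1])
```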

@saketkc
Collaborator

saketkc commented Mar 2, 2021

Thanks Sam! This looks great! What I have in mind is slightly more generic. This (I think) handles all the use cases you mentioned above, but requires some processing on the user end. There is no requirement for the files to be located in any particular directory; they can even be read from a URL (including the matrix file).

Happy to hear your and others' thoughts on this! I will also discuss your suggestion with the Seurat team.

#' Load in data from remote or local mtx files
#' Adapted and inspired from Seurat's Read10X
#'
#' Enables easy loading of sparse data matrices
#'
#' @param mtx Name or remote URL of the mtx file
#' @param cells Name or remote URL of the cells/barcodes file
#' @param features Name or remote URL of the features/genes file
#' @param feature.column Specify which column of features files to use for feature/gene names; default is 2
#' @param cell.column Specify which column of cells file to use for cell names; default is 1
#' @param unique.features Make feature names unique (default TRUE)
#' @param strip.suffix Remove trailing "-1" if present in all cell barcodes.
#'
#' @return A sparse matrix containing the expression data.
#'
#' @importFrom Matrix readMM
#' @importFrom utils read.delim
#' @importFrom httr build_url parse_url
#' @importFrom tools file_ext
#'
#'
#' @export
#' @concept preprocessing
#'
#' @examples
#' \dontrun{
#' # For local files:
#'
#' expression_matrix <- ReadMtx(mtx = "count_matrix.mtx.gz", features = "features.tsv.gz", cells = "barcodes.tsv.gz")
#' seurat_object <- CreateSeuratObject(counts = expression_matrix)
#'
#' # For remote files:
#'
#' expression_matrix <- ReadMtx(
#'   mtx = "http://localhost/matrix.mtx",
#'   cells = "http://localhost/barcodes.tsv",
#'   features = "http://localhost/genes.tsv"
#' )
#' seurat_object <- CreateSeuratObject(counts = expression_matrix)
#' }
#'
ReadMtx <- function(mtx,
                    cells,
                    features,
                    cell.column = 1,
                    feature.column = 2,
                    unique.features = TRUE,
                    strip.suffix = FALSE) {
  mtx <- build_url(url = parse_url(url = mtx))
  cells <- build_url(url = parse_url(url = cells))
  features <- build_url(url = parse_url(url = features))
  all_files <- list("Expression matrix" = mtx, "Barcode" = cells, "Gene name" = features)

  check_file_exists <- function(filetype, filepath) {
    if (grepl(pattern = "^:///", x = filepath)) {
      filepath <- gsub(pattern = ":///", replacement = "", x = filepath)
      if (!file.exists(filepath)) {
        stop(paste(filetype, "file missing. Expecting", filepath), call. = FALSE)
      }
    }
  }

  # check if all files exist
  lapply(seq_along(all_files), function(y, n, i) {
    check_file_exists(n[[i]], y[[i]])
  }, y = all_files, n = names(all_files))

  # convenience function to read local or remote tab-delimited files
  readTableUri <- function(uri) {
    if (grepl(pattern = "^:///", x = uri)) {
      uri <- gsub(pattern = ":///", replacement = "", x = uri)
      textcontent <- read.table(file = uri, header = FALSE, sep = "\t", row.names = NULL)
    } else {
      if (file_ext(uri) == "gz") {
        textcontent <- read.table(
          file = gzcon(url(uri), text = TRUE),
          header = FALSE, sep = "\t", row.names = NULL
        )
      } else {
        textcontent <- read.table(
          file = uri, header = FALSE,
          sep = "\t", row.names = NULL
        )
      }
    }
    return(textcontent)
  }

  # read barcodes
  cell.barcodes <- readTableUri(uri = cells)
  bcols <- ncol(x = cell.barcodes)
  if (bcols < cell.column) {
    stop(paste0(
      "cell.column was set to ", cell.column,
      " but ", cells, " only has ", bcols, " columns.",
      " Try setting the cell.column argument to a value <= ", bcols, "."
    ))
  }
  cell.names <- cell.barcodes[, cell.column]

  if (all(grepl(pattern = "\\-1$", x = cell.names)) & strip.suffix) {
    cell.names <- as.vector(x = as.character(x = sapply(
      X = cell.names,
      FUN = ExtractField,
      field = 1,
      delim = "-"
    )))
  }

  # read features
  feature.names <- readTableUri(uri = features)
  fcols <- ncol(x = feature.names)
  if (fcols < feature.column) {
    stop(paste0(
      "feature.column was set to ", feature.column,
      " but ", features, " only has ", fcols, " column(s).",
      " Try setting the feature.column argument to a value <= ", fcols, "."
    ))
  }
  if (any(is.na(x = feature.names[, feature.column]))) {
    na.features <- which(x = is.na(x = feature.names[, feature.column]))
    replacement.column <- ifelse(test = feature.column == 2, yes = 1, no = 2)
    if (replacement.column > fcols) {
      stop(
        paste0(
          "Some feature names are NA in column ", feature.column,
          ". Try specifying a different column."
        ),
        call. = FALSE
      )
    } else {
      warning(
        paste0(
          "Some feature names are NA in column ", feature.column,
          ". Replacing NA names with ID from column ", replacement.column, "."
        ),
        call. = FALSE,
        immediate. = TRUE
      )
    }
    feature.names[na.features, feature.column] <- feature.names[na.features, replacement.column]
  }

  feature.names <- feature.names[, feature.column]
  if (unique.features) {
    feature.names <- make.unique(names = feature.names)
  }

  # read mtx
  if (grepl(pattern = "^:///", x = mtx)) {
    mtx <- gsub(pattern = ":///", replacement = "", x = mtx)
    data <- readMM(mtx)
  } else {
    if (file_ext(mtx) == "gz") {
      data <- readMM(gzcon(url(mtx)))
    } else {
      data <- readMM(mtx)
    }
  }

  colnames(x = data) <- cell.names
  rownames(x = data) <- feature.names

  return(data)
}

This enables reading GEO files both remotely and locally. Some examples:

Example: GEO link

mtx <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE132044&format=file&file=GSE132044%5Fmixture%5Fhg19%5Fmm10%5Fcount%5Fmatrix%2Emtx%2Egz"
cells <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE132044&format=file&file=GSE132044%5Fmixture%5Fhg19%5Fmm10%5Fcell%2Etsv%2Egz" 
features <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE132044&format=file&file=GSE132044%5Fmixture%5Fhg19%5Fmm10%5Fgene%2Etsv%2Egz"

counts <- ReadMtx(mtx = mtx, cells = cells, features = features, feature.column = 1)

Local:

mtx <- "~/data/pbmc10k/filtered_feature_bc_matrix/matrix.mtx.gz"
cells <- "~/data/pbmc10k/filtered_feature_bc_matrix/barcodes.tsv.gz"
features <- "~/data/pbmc10k/filtered_feature_bc_matrix/features.tsv.gz"
counts <- ReadMtx(mtx = mtx, cells = cells, features = features)

@saketkc
Collaborator

saketkc commented Mar 2, 2021

Just to add: the links to supplementary GEO files (GSE/GSM/GPL) can be obtained using the following function, which can also download the files. These can then be passed on to the ReadMtx() function.

#' Fetch supplementary files from GEO
#' @importFrom XML htmlParse xpathSApply
#' @importFrom httr GET parse_url
#' @export
FetchGEOFiles <- function(geo, download.dir = getwd(), download.files = FALSE, ...) {
  geo <- trimws(toupper(geo))
  geo_type <- substr(geo, 1, 3)
  url.prefix <- "https://ftp.ncbi.nlm.nih.gov/geo/"
  if (geo_type == "GSE") {
    url.prefix <- paste0(url.prefix, "series/")
  } else if (geo_type == "GSM") {
    url.prefix <- paste0(url.prefix, "samples/")
  } else if (geo_type == "GPL") {
    url.prefix <- paste0(url.prefix, "platform/")
  } else {
    stop("geo must be a GSE, GSM, or GPL accession")
  }

  geo_prefix <- paste0(substr(x = geo, start = 1, stop = nchar(geo) - 3), "nnn")
  url <- paste0(url.prefix, geo_prefix, "/", geo, "/", "suppl", "/")

  response <- GET(url = url)
  html_parsed <- htmlParse(file = response)
  links <- xpathSApply(doc = html_parsed, path = "//a/@href")
  suppl_files <- as.character(grep(pattern = "^G", x = links, value = TRUE))
  if (length(suppl_files) == 0) {
    return(NULL)
  }

  file.url <- paste0(url, suppl_files)
  file_list <- data.frame(filename = suppl_files, url = file.url)

  if (download.files) {
    names(file.url) <- suppl_files
    download_file <- function(url, filename, ...) {
      message(paste0("Downloading ", filename, " to ", download.dir))
      download.file(url = url, destfile = file.path(download.dir, filename), mode = "wb", ...)
      message("Done!")
    }
    lapply(seq_along(file.url), function(y, n, i) {
      download_file(y[[i]], n[[i]], ...)
    },
    y = file.url, n = names(file.url)
    )
  }

  return(file_list)
}

Example:

 FetchGEOFiles("GSE132044")
                                          filename                                                                                                               url
1            GSE132044_HEK293_PBMC_TPM_bulk.tsv.gz           https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_HEK293_PBMC_TPM_bulk.tsv.gz
2          GSE132044_NIH3T3_cortex_TPM_bulk.tsv.gz         https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_NIH3T3_cortex_TPM_bulk.tsv.gz
3                GSE132044_cortex_mm10_cell.tsv.gz               https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_cortex_mm10_cell.tsv.gz
4        GSE132044_cortex_mm10_count_matrix.mtx.gz       https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_cortex_mm10_count_matrix.mtx.gz
5                GSE132044_cortex_mm10_gene.tsv.gz               https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_cortex_mm10_gene.tsv.gz
6          GSE132044_mixture_hg19_mm10_cell.tsv.gz         https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_mixture_hg19_mm10_cell.tsv.gz
7  GSE132044_mixture_hg19_mm10_count_matrix.mtx.gz https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_mixture_hg19_mm10_count_matrix.mtx.gz
8          GSE132044_mixture_hg19_mm10_gene.tsv.gz         https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_mixture_hg19_mm10_gene.tsv.gz
9                  GSE132044_pbmc_hg38_cell.tsv.gz                 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_pbmc_hg38_cell.tsv.gz
10         GSE132044_pbmc_hg38_count_matrix.mtx.gz         https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_pbmc_hg38_count_matrix.mtx.gz
11                 GSE132044_pbmc_hg38_gene.tsv.gz                 https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_pbmc_hg38_gene.tsv.gz
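The two functions compose naturally. A sketch using the GSE132044 listing above; note that feature.column = 1 here follows the earlier mixture example (I haven't verified the pbmc gene file specifically, so adjust if needed):

```r
# List the supplementary files for the series (no download needed)
files <- FetchGEOFiles("GSE132044")

# Helper to pull the first URL matching a file-name pattern
get_url <- function(pattern) {
  as.character(files$url[grep(pattern = pattern, x = files$filename, fixed = TRUE)][1])
}

counts <- ReadMtx(
  mtx = get_url("pbmc_hg38_count_matrix.mtx.gz"),
  cells = get_url("pbmc_hg38_cell.tsv.gz"),
  features = get_url("pbmc_hg38_gene.tsv.gz"),
  feature.column = 1  # single-column gene file, as in the mixture example above
)
seurat_object <- CreateSeuratObject(counts = counts)
```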

@samuel-marsh
Collaborator Author

Hi @saketkc,

Ok, nice! I love the idea of enabling loading directly from a URL (and the FetchGEO function is great)!!

In terms of comparing the functions, that is obviously for you and the rest of the team to decide, but I think I favor a few aspects of my function over this solution. These are things that could be implemented in various ways in either function, but I think they are pluses overall for the end user.

  1. Minimizing user input.
    --Since my function reads the file prefixes in a given directory, it removes the need to specify multiple files (whether full paths or URLs); only a data.dir parameter is required.
    --I think this is especially advantageous when you want to read an entire directory/repo that may contain 10, 20, 30+ samples, where manual input becomes cumbersome. (Using the output of your FetchGEO function, or a user-created list of the files in a directory, this could obviously be looped/simplified, but it requires more work on the user's end overall.)

  2. Input/output control.
    --Letting the user choose either a single output matrix across all input samples or a list of matrices on a per-sample basis allows for greater flexibility.
    --Letting the user either specify a subset of samples to import or import all files in the directory that follow the naming convention.

Happy to discuss more but those are just my first thoughts!

Thanks!!
Sam

@samuel-marsh
Collaborator Author

I'm also realizing that the second progress bar in my function (used when a single output matrix is requested) is faulty, but I will work on that.

@lichenlady94

Hi all, I tried to use the code to get my barcodes, features, and matrix into Seurat but have had no luck. Right now my error is that the barcode file is missing, even though it is in the folder. Thanks for all your hard work on this!
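For anyone hitting the same error: Read10X_GEO builds each path by pasting the derived sample prefix onto the standard 10X file names, so the files on disk must match that pattern exactly, including the .gz extension. A quick check, with data_dir standing in for your folder (a hypothetical path):

```r
data_dir <- "~/my_geo_download/"  # replace with the actual folder

# The barcode files the function will find, and the prefixes it will derive
found <- list.files(path = data_dir, pattern = "barcodes.tsv")
prefixes <- gsub(pattern = "barcodes.tsv.gz", replacement = "", x = found)
print(found)

# For each prefix, confirm that all expected companion files exist
# (pre-v3 data uses genes.tsv.gz instead of features.tsv.gz, which the
# function also detects)
for (p in prefixes) {
  expected <- file.path(data_dir, paste0(p, c("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")))
  print(data.frame(file = basename(expected), exists = file.exists(expected)))
}
```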
