The purpose of this package is to add tools that make life easier when performing dataset prep.
- this repo was initially hosted at https://github.com/Benjamin-Vincent-Lab/datasetprep and was moved here on 2/13/2024
Restart R Session
In R:
devtools::install_github("Benjamin-Vincent-Lab/datasetprep")
Or for a specific version:
devtools::install_github("Benjamin-Vincent-Lab/datasetprep", ref = "0.4.8")
Use the package documentation for help:
??datasetprep
##Configuration methods
###get_run_only_columns Returns character vector of column names for data only relevant to a sample or sequencing run as opposed to a patient.
###get_patient_only_columns Returns character vector of column name relevant to the patient and should be true of all samples coming from that patient
###get_all_columns Returns character vector concatenation of get_patient_columns and get_run_only_columns
##Initializaton methods
###init_paths Sets up the post processing directory and prepares temp library folder if necessary.
Converts data's original sex characters to standard Female / Male.
Takes data's original race values and cleans them up for consistency, converts unknown races to Other, and sets the individual boolean race fields ( Caucasian, Asian, etc. ) based on updated Race.
This method standardizes the Best Response data to full names in CamelCase and then sets fields that are dependent on those values ( Progression, Clinical_Benefit and Responder ).
Data's original patient identifiers are often more complex than we want. This method takes those names ( within a limited subset of formats ) and reformats them as "p001" where the number portion is unique across patients.
Concatenates dataset with patient_names.
Combines normal_tissue, analyte and file_prefix values into Run_Names of format {a|n}{d|r|m|p}{unique part of file_prefix}.
Concatenates patient_ids with run_names.
Typically somatic workflow matches are when a given patient has samples of tumor data in RNA and DNA form along with normal tissue in DNA form. This method determines for which patients this is true.
DEPRICATED :: Takes a data.frame with a Drug column and returns formatted variables based on the drug names
Replaces drug name aliases with preferred name values
Looks up properties for drugs and sets itemized and boolean fields with appropriate values
Outputs summary of variables in data.frame