Cancer Mutation Analysis and Visualization Suite.

The CancerMutAnalyzer is an R package designed to streamline and enhance the workflow for analyzing mutation data in cancer research. The package focuses on processing, visualizing, and analyzing mutation frequencies and genomic contexts, particularly targeting single nucleotide polymorphisms (SNPs) and other mutation types within cancer datasets. By providing functions for data extraction, filtering, sequence analysis, and visualizations (such as heatmaps of mutation frequencies), the package enables researchers to quickly identify mutation patterns and genomic signatures associated with different cancer types or tumor samples.

This package adds to the current bioinformatics workflow by simplifying mutation analysis tasks that typically require multiple steps across various software. By centralizing these steps in one package, researchers can conduct analyses more efficiently, reduce the risk of errors from file conversions, and gain insights through streamlined visualizations. One unique feature of this package is the ability to examine the local nucleotide context surrounding mutations, which can be used to detect mutational hotspots or enrich mutation data with GC content or sequence-based patterns. Additionally, it addresses common issues such as data formatting, base filtering, and customized visualization for specific mutation characteristics, making it a comprehensive and user-friendly tool for cancer genomics.

The CancerMutAnalyzer package was developed using R version 4.4.1 (2024-06-14 ucrt) on the x86_64-w64-mingw32/x64 platform, running under Windows 11 x64 (build 22631).


To install the latest version of the package:

# install.packages("devtools")  # run this first if devtools is not yet installed
devtools::install_github("martien2kk/CancerMutAnalyzer", build_vignettes = TRUE)

To run the Shiny app:

library("CancerMutAnalyzer")
runCancerMutAnalyzer() # not for Assessment 4; only for Assessment 5


To list the datasets included in the package:

data(package = "CancerMutAnalyzer")

CancerMutAnalyzer contains 6 functions.

  1. extractMutationData: allows users to specify and extract specific columns from a mutation dataset in Mutation Annotation Format (MAF) or tabular format. It has default settings for commonly used columns but is customizable, so users can focus on relevant mutation details.

  2. filterMutations: provides flexible filtering based on user-specified conditions for any column. This allows users to subset mutation data based on exact values or numeric ranges for certain columns, such as specific genes or chromosomal regions.

  3. extractMutationSequences: retrieves nucleotide sequences surrounding mutation sites from the hg19 genome based on specified genomic coordinates in a data frame. Users can customize the length of the sequence extracted by adjusting the padding parameter, with the default setting providing a trinucleotide sequence centered around each mutation.

  4. visualizeMutationFrequencyBar: generates a bar plot to display the frequency of mutations based on a specified column in a mutation dataset. This function is ideal for visualizing the distribution of mutations by a single categorical variable, such as chromosome or variant type.

  5. visualizeMutationFrequencyHeatmap: creates a heatmap that visualizes mutation frequency based on two specified columns, such as Reference_Allele and Tumor_Seq_Allele2. It only includes rows where both columns contain nucleotide bases (A, T, C, G). This function is particularly useful for examining mutation patterns between pairs of alleles, highlighting high-frequency mutations across different nucleotide pairs.

  6. calculateGCContent: calculates the GC content (the percentage of G and C bases) for each sequence around a mutation site.
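The exact-value and numeric-range filtering described in item 2 can be sketched in base R. This is only an illustration of the idea, not the package's implementation: the toy data frame, the helper name `filter_sketch`, and its `exact` and `ranges` arguments are assumptions made for this example, and the column names simply follow MAF conventions.

```r
# Toy mutation table with MAF-style column names (illustrative values only).
toy_maf <- data.frame(
  Hugo_Symbol = c("TP53", "KRAS", "TP53", "PTEN"),
  Chromosome = c("17", "12", "17", "10"),
  Start_Position = c(7577120, 25398284, 7578406, 89692905),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of CancerMutAnalyzer):
#   exact  - named list of allowed values per column
#   ranges - named list of c(min, max) pairs for numeric columns
filter_sketch <- function(df, exact = list(), ranges = list()) {
  keep <- rep(TRUE, nrow(df))
  for (col in names(exact)) {
    # Keep rows whose value is one of the allowed values.
    keep <- keep & df[[col]] %in% exact[[col]]
  }
  for (col in names(ranges)) {
    # Keep rows whose value lies inside the closed interval [min, max].
    bounds <- ranges[[col]]
    keep <- keep & df[[col]] >= bounds[1] & df[[col]] <= bounds[2]
  }
  df[keep, , drop = FALSE]
}

# Rows for TP53 whose start position falls in a chosen window.
tp53_region <- filter_sketch(
  toy_maf,
  exact = list(Hugo_Symbol = "TP53"),
  ranges = list(Start_Position = c(7577000, 7578000))
)
print(tp53_region)
```

Combining all conditions with a logical AND keeps the sketch simple; the actual arguments accepted by filterMutations are described in the package documentation.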

The package also contains two cancer mutation datasets, called UCS.mutations and filteredUCSFirst100SNP. Refer to package vignettes for more details. An overview of the package is illustrated below.


The author of the package is Keren Zhang. The author wrote the extractMutationData function, which extracts specified columns from a mutation dataset, returning key mutation details such as chromosome, start and end positions, variant type, and allele information. This function relies on dplyr functions for data manipulation and selection.

The extractMutationSequences function was also developed by the author and extracts nucleotide sequences surrounding mutation sites based on genomic coordinates. This function utilizes the BSgenome.Hsapiens.UCSC.hg19 and GenomicRanges packages to access the hg19 genome and create genomic ranges for accurate sequence extraction.
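The effect of the padding parameter can be illustrated without the hg19 genome. The sketch below uses a short made-up reference string and a hypothetical helper, `context_sketch`, which is an assumption for this example and not the package's code; the package itself reads sequence from BSgenome.Hsapiens.UCSC.hg19.

```r
# Toy reference sequence standing in for a chromosome (not real hg19 data).
toy_reference <- "ACGTTGCAAGGCT"

# Hypothetical helper: return the bases from (pos - padding) to
# (pos + padding), using 1-based coordinates. padding = 1 yields a
# trinucleotide centered on the mutated position.
context_sketch <- function(reference, pos, padding = 1) {
  start <- pos - padding
  end <- pos + padding
  if (any(start < 1 | end > nchar(reference))) {
    stop("Requested window extends past the reference sequence.")
  }
  # substring() is vectorized over the start and end positions.
  substring(reference, start, end)
}

positions <- c(3, 6, 10)
tri <- context_sketch(toy_reference, positions)                 # width 3
penta <- context_sketch(toy_reference, positions, padding = 2)  # width 5
print(tri)
print(penta)
```

Each window has width 2 * padding + 1, so increasing the padding by one adds one base on each side of the mutation.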

Additionally, the calculateGCContent function, written by the author, calculates the GC content of extracted sequences to assess the percentage of G and C bases. This function uses the Biostrings package for efficient sequence handling and base frequency calculation, allowing for streamlined GC content analysis.
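GC content is the number of G and C bases divided by the sequence length, expressed as a percentage. A minimal base R sketch of that calculation follows. The helper name `gc_content_sketch` is hypothetical, and the choice to count every character in the denominator is an assumption of this example; the package performs the calculation with Biostrings.

```r
# Hypothetical helper: percentage of G and C bases in each sequence.
gc_content_sketch <- function(sequences) {
  sequences <- toupper(sequences)
  # Remove everything that is not G or C, then count what remains.
  gc_count <- nchar(gsub("[^GC]", "", sequences))
  # The denominator is the full sequence length (an assumption of this sketch).
  100 * gc_count / nchar(sequences)
}

gc_percent <- gc_content_sketch(c("ACG", "TTA", "GCGC", "atgc"))
print(round(gc_percent, 1))
```

An empty string would give NaN here because its length is zero, so real input should be checked for empty sequences first.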

The visualizeMutationFrequencyBar and visualizeMutationFrequencyHeatmap functions generate visualizations to explore mutation data. The bar plot function visualizes mutation frequency by a single column, while the heatmap function visualizes mutation frequency by two columns (e.g., Reference_Allele and Tumor_Seq_Allele2). Both functions use ggplot2 for visualization and rely on dplyr for data preparation.
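The data preparation behind a two-column frequency heatmap can be sketched in base R: keep only rows where both alleles are single nucleotide bases, then count each reference/tumor pair. The toy data frame and variable names below are assumptions for illustration, and the package's own functions use dplyr and ggplot2 for these steps.

```r
# Toy allele columns with MAF-style names (illustrative values only).
toy_alleles <- data.frame(
  Reference_Allele = c("C", "C", "G", "A", "-", "C", "T"),
  Tumor_Seq_Allele2 = c("T", "T", "A", "G", "A", "TT", "C"),
  stringsAsFactors = FALSE
)

bases <- c("A", "C", "G", "T")

# Keep rows where both columns hold exactly one of A, C, G, T;
# this drops the deletion ("-") and the two-base allele ("TT").
is_base <- toy_alleles$Reference_Allele %in% bases &
  toy_alleles$Tumor_Seq_Allele2 %in% bases
snv <- toy_alleles[is_base, ]

# Cross-tabulate the pairs; fixed factor levels keep the 4 x 4 layout
# even when some bases never occur.
freq <- table(
  Reference = factor(snv$Reference_Allele, levels = bases),
  Tumor = factor(snv$Tumor_Seq_Allele2, levels = bases)
)
print(freq)
```

Calling `as.data.frame(freq)` converts the counts to a long table with one row per base pair, which is the shape a tile-based heatmap layer such as `ggplot2::geom_tile()` typically expects.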

The generative AI tool ChatGPT (OpenAI) was used to provide suggestions for brainstorming, debugging, and structuring the Roxygen documentation.



This package was developed as part of an assessment for the 2024 BCB410H: Applied Bioinformatics course at the University of Toronto, Toronto, Canada. CancerMutAnalyzer welcomes issues, enhancement requests, and other contributions. To submit an issue, use GitHub issues on the package repository. Many thanks to those who provided feedback to improve this package.


