hgb-bin-proteomics/MSAnnika_FDR

MS Annika FDR

A script and functions to group and validate MS Annika results. The main use case is re-validating results after filtering, or after merging results from different MS Annika runs.

Usage

  • Install Python 3.7+: https://www.python.org/downloads/
  • Install requirements: pip install -r requirements.txt
  • Export MS Annika results from Proteome Discoverer to Microsoft Excel format.
  • Run python msannika_fdr.py filename.xlsx -fdr 0.01 (see below for more examples).
  • The script may take a few minutes, depending on the number of CSMs/crosslinks to process.
  • Done!

Examples

msannika_fdr.py takes one positional and one optional argument. The positional argument is the filename(s) of the MS Annika result file(s) and must always be given. You may specify any number of result files; keep in mind, however, that msannika_fdr.py will process these files separately. If you want to merge several result files, check out MS Annika Combine Results. For demonstration purposes we will use the files supplied in the /data folder:

  • DSSO_Crosslinks.xlsx contains unvalidated crosslinks from an MS Annika search.
  • DSSO_CSMs.xlsx contains unvalidated CSMs from an MS Annika search.

The following is a valid msannika_fdr.py call:

python msannika_fdr.py DSSO_Crosslinks.xlsx

This will not do anything because no FDR was given. You should see in the output that the script skipped the file. However, doing the same with a CSM file results in a different output:

python msannika_fdr.py DSSO_CSMs.xlsx

This will group the CSMs by sequence and position into crosslinks, and you should see a generated file named DSSO_CSMs_crosslinks.xlsx.
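To illustrate what grouping by sequence and position means, here is a minimal sketch in plain Python. It is not the script's implementation: the record fields (seq_a, pos_a, seq_b, pos_b, score) and the helper group_csms are hypothetical, and the real MS Annika export uses different column names. The sketch also ignores details such as swapped peptide order.

```python
from collections import defaultdict

# Hypothetical, simplified CSM records. Real MS Annika exports use
# different column names and many more columns.
csms = [
    {"seq_a": "KAAR", "pos_a": 1, "seq_b": "LKGR", "pos_b": 2, "score": 150.0},
    {"seq_a": "KAAR", "pos_a": 1, "seq_b": "LKGR", "pos_b": 2, "score": 98.5},
    {"seq_a": "MKTR", "pos_a": 2, "seq_b": "LKGR", "pos_b": 2, "score": 77.0},
]


def group_csms(records):
    """Collect CSMs that share both peptide sequences and both link positions."""
    groups = defaultdict(list)
    for csm in records:
        # Assumed grouping key: sequences and crosslinker positions of both peptides.
        key = (csm["seq_a"], csm["pos_a"], csm["seq_b"], csm["pos_b"])
        groups[key].append(csm)
    return dict(groups)


crosslinks = group_csms(csms)
for key, members in crosslinks.items():
    print(key, "supported by", len(members), "CSM(s)")
```

In this toy input the first two CSMs collapse into one crosslink and the third forms its own.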

If you supply the optional argument -fdr or --false_discovery_rate and the desired FDR as a floating point number, the results will be validated:

python msannika_fdr.py DSSO_Crosslinks.xlsx -fdr 0.01

This will validate the input crosslinks for an estimated 1% FDR and generate a file called DSSO_Crosslinks_validated.xlsx containing only crosslinks above the estimated 1% FDR threshold. Note that the following command will produce the same output (FDR values >= 1 will automatically be divided by 100):

python msannika_fdr.py DSSO_Crosslinks.xlsx -fdr 1
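The percentage handling described above can be sketched as follows. The function name normalize_fdr is hypothetical and only mirrors the stated rule (values >= 1 are divided by 100); the script's internal code may differ.

```python
def normalize_fdr(fdr: float) -> float:
    """Interpret values >= 1 as percentages, otherwise as fractions."""
    # Assumption: this mirrors the rule stated in this README, nothing more.
    return fdr / 100 if fdr >= 1 else fdr


print(normalize_fdr(1))     # percentage input
print(normalize_fdr(0.01))  # fraction input
```

Under this rule, -fdr 1 and -fdr 0.01 both request a 1% FDR, and a literal 100% FDR has to be written as 100 rather than 1.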

Validating a CSMs file works the same way:

python msannika_fdr.py DSSO_CSMs.xlsx -fdr 0.01

This will validate the input CSMs for an estimated 1% FDR and generate a file DSSO_CSMs_validated.xlsx containing only CSMs above the estimated 1% FDR threshold. Furthermore, it will group the input CSMs into crosslinks and write them to the file DSSO_CSMs_crosslinks.xlsx, then validate those crosslinks for an estimated 1% FDR and store the result in DSSO_CSMs_crosslinks_validated.xlsx.

You can also supply several files to the script like this:

python msannika_fdr.py DSSO_CSMs.xlsx DSSO_Crosslinks.xlsx -fdr 0.01

This will process the input files separately and sequentially and produce the files mentioned above:

  • DSSO_Crosslinks_validated.xlsx
  • DSSO_CSMs_validated.xlsx
  • DSSO_CSMs_crosslinks.xlsx
  • DSSO_CSMs_crosslinks_validated.xlsx
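The naming pattern in the examples above can be summarized in a small sketch. The helper expected_outputs and its is_csm_file flag are hypothetical: how the script itself tells CSM files from crosslink files is not described here, so the caller states it explicitly. Only file names are derived, not output directories.

```python
from pathlib import Path
from typing import List, Optional


def expected_outputs(filename: str, is_csm_file: bool,
                     fdr: Optional[float] = None) -> List[str]:
    """List the output file names the examples in this README describe."""
    stem = Path(filename).stem
    outputs = []
    if is_csm_file:
        if fdr is not None:
            outputs.append(f"{stem}_validated.xlsx")
        # CSM files are always grouped to crosslinks, with or without an FDR.
        outputs.append(f"{stem}_crosslinks.xlsx")
        if fdr is not None:
            outputs.append(f"{stem}_crosslinks_validated.xlsx")
    elif fdr is not None:
        # Crosslink files are only processed when an FDR is given.
        outputs.append(f"{stem}_validated.xlsx")
    return outputs


print(expected_outputs("DSSO_CSMs.xlsx", is_csm_file=True, fdr=0.01))
print(expected_outputs("DSSO_Crosslinks.xlsx", is_csm_file=False))
```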

Parameters

"""
DESCRIPTION:
A script to group and validate results from MS Annika searches.
USAGE:
msannika_fdr.py f [f ...]
                  [-fdr FDR][--false_discovery_rate FDR]
                  [-h][--help]
                  [--version]
positional arguments:
  f                     MS Annika result files in Microsoft Excel format (.xlsx)
                        to process.
optional arguments:
  -fdr FDR, --false_discovery_rate FDR
                        False discovery rate to validate results for. Supports
                        both percentage input (e.g. 1) or fraction input (e.g.
                        0.01). By default not set and the input results will
                        just be grouped to crosslinks (if CSMs as input) or
                        nothing will be done (if crosslinks as input).
                        Default: None
  -h, --help            show this help message and exit
  --version             show program's version number and exit
"""

Function Documentation

If you want to integrate the MS Annika FDR calculation into your own scripts, you can import the following functions as given:

import pandas as pd

crosslinks = pd.read_excel("DSSO_Crosslinks.xlsx")
csms = pd.read_excel("DSSO_CSMs.xlsx")

# Grouping CSMs to crosslinks
from msannika_fdr import MSAnnika_CSM_Grouper
Crosslinks_grouped_from_CSMs = MSAnnika_CSM_Grouper.group(csms)

# The function signature of MSAnnika_CSM_Grouper.group is:
def group(data: pd.DataFrame) -> pd.DataFrame:
    """code omitted"""
    return

# Validating CSMs for 0.01 FDR
from msannika_fdr import MSAnnika_CSM_Validator
Validated_CSMs = MSAnnika_CSM_Validator.validate(csms, 0.01)

# The function signature of MSAnnika_CSM_Validator.validate is:
def validate(data: pd.DataFrame, fdr: float) -> pd.DataFrame:
    """code omitted"""
    return

# Validating Crosslinks for 0.01 FDR
from msannika_fdr import MSAnnika_Crosslink_Validator
Validated_Crosslinks = MSAnnika_Crosslink_Validator.validate(crosslinks, 0.01)

# The function signature of MSAnnika_Crosslink_Validator.validate is:
def validate(data: pd.DataFrame, fdr: float) -> pd.DataFrame:
    """code omitted"""
    return

Known Issues

List of known issues

Citing

If you are using the MS Annika FDR script please cite:

MS Annika 2.0 Identifies Cross-Linked Peptides in MS2–MS3-Based Workflows at High Sensitivity and Specificity
Micha J. Birklbauer, Manuel Matzinger, Fränze Müller, Karl Mechtler, and Viktoria Dorfer
Journal of Proteome Research 2023 22 (9), 3009-3021
DOI: 10.1021/acs.jproteome.3c00325

If you are using MS Annika please cite as described here.

License

Contact