Handle stop codons, COSMIC download errors, and ENSEMBL releases #31

Merged: 44 commits, Feb 3, 2021
Commits (44)
204115e
handle the stop codons into split proteins instead of removing the st…
ypriverol Jan 26, 2021
7334999
commandline ordered
ypriverol Jan 26, 2021
bf93f1a
better error handling for cosmic downloads
ypriverol Jan 26, 2021
3715155
change command config path
ypriverol Jan 26, 2021
d8fcd94
new downloader for previous versions of ensembl
ypriverol Jan 26, 2021
87faff7
code clean
ypriverol Jan 26, 2021
1811e21
increase package version
ypriverol Jan 26, 2021
d4de5da
ensembl downloader updated
ypriverol Jan 26, 2021
4eb3b49
change command config path
ypriverol Jan 26, 2021
8b0bea8
add changelog.md
ypriverol Jan 26, 2021
e3155f4
added changelog
ypriverol Jan 26, 2021
95e3bf8
changelog actions added
ypriverol Jan 26, 2021
22a7607
changelog added
ypriverol Jan 26, 2021
b74013b
change command config path
ypriverol Jan 26, 2021
38c017b
change command config path
ypriverol Jan 26, 2021
c7579ad
change command config path
ypriverol Jan 26, 2021
894f3a5
Updates version info for
ypriverol Jan 26, 2021
e0f22f4
changelog done
ypriverol Jan 26, 2021
f03350d
Updates version info for
ypriverol Jan 26, 2021
ed03fd1
add changelog actions
ypriverol Jan 26, 2021
6c468b7
Merge remote-tracking branch 'origin/dev' into dev
ypriverol Jan 26, 2021
3c26fc7
Updates version info for
ypriverol Jan 26, 2021
853d672
update README
ypriverol Jan 26, 2021
fbc4a64
README.md updated
ypriverol Jan 26, 2021
fd67ccf
code coverage
ypriverol Jan 26, 2021
c9dfe7d
code coverage
ypriverol Jan 26, 2021
e772df8
added unit tests
ypriverol Jan 26, 2021
2917989
change command config path
ypriverol Jan 26, 2021
010fbce
coverage updated
ypriverol Jan 26, 2021
5effa00
change command config path
ypriverol Jan 26, 2021
fa84cdf
change command config path
ypriverol Jan 26, 2021
376790e
minor examples
ypriverol Jan 26, 2021
3a2da7b
still there
ypriverol Jan 26, 2021
de40c74
change command config path
ypriverol Jan 26, 2021
48ab4e7
change command config path
ypriverol Jan 27, 2021
7895bd8
code clean step
ypriverol Jan 27, 2021
a31dc39
code clean step
ypriverol Jan 27, 2021
78d4e5d
code clean step
ypriverol Jan 27, 2021
c9fdd9a
code clean step
ypriverol Jan 27, 2021
409f4af
code clean step
ypriverol Jan 27, 2021
6a0730b
Merge multiple clean code commits code clean step (+14 squashed commits)
ypriverol Jan 26, 2021
d57521a
Merge remote-tracking branch 'origin/dev' into dev
ypriverol Jan 27, 2021
74a7b9c
added badges
ypriverol Jan 28, 2021
58f30f7
error file open
ypriverol Feb 3, 2021
23 changes: 23 additions & 0 deletions .github/workflows/pythonapp.yml
@@ -36,3 +36,26 @@ jobs:
python setup.py install
cd pypgatk
python tests/pypgatk_tests.py
- name: Update changelog
uses: heinrichreimer/github-changelog-generator-action@v2.2
with:
token: ${{ secrets.GITHUB_TOKEN }}

- name: Update repo with new version info
uses: EndBug/add-and-commit@v7
with:
add: CHANGELOG.md
author_name: Release Auto-Commit
author_email: ypriverol@gmail.com
branch: dev
message: Updates version info for ${{github.event.release.name}}

- name: Generate coverage report
run: |
pip install pytest
pip install pytest-cov
python setup.py install
cd pypgatk
pytest --cov=./ --cov-report=xml tests/*


21 changes: 21 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,21 @@
# Changelog

## [v0.0.8](https://github.com/bigbio/py-pgatk/tree/v0.0.8) (2021-01-19)

**Implemented enhancements:**

- Cosmic mutations for celllines [\#28](https://github.com/bigbio/py-pgatk/issues/28)

## [v0.0.7](https://github.com/bigbio/py-pgatk/tree/v0.0.7) (2021-01-11)

## [v0.0.6](https://github.com/bigbio/py-pgatk/tree/v0.0.6) (2020-05-02)

**Implemented enhancements:**

- Add the decoy Sanger tool to the library [\#19](https://github.com/bigbio/py-pgatk/issues/19)
- Major refactoring of the library to be more package oriented design [\#13](https://github.com/bigbio/py-pgatk/issues/13)
- Create / Download Canonical protein fasta file from ENSEMBL [\#1](https://github.com/bigbio/py-pgatk/issues/1)



\* *This Changelog was automatically generated by [github_changelog_generator](https://github.com/github-changelog-generator/github-changelog-generator)*
8 changes: 6 additions & 2 deletions README.md
@@ -3,7 +3,9 @@

![Python application](https://github.com/bigbio/py-pgatk/workflows/Python%20application/badge.svg)
[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/pypgatk/README.html)

[![Codacy Badge](https://app.codacy.com/project/badge/Grade/f6d030fd7d69413987f7265a01193324)](https://www.codacy.com/gh/bigbio/py-pgatk/dashboard?utm_source=github.com&utm_medium=referral&utm_content=bigbio/py-pgatk&utm_campaign=Badge_Grade)
[![PyPI version](https://badge.fury.io/py/pypgatk.svg)](https://badge.fury.io/py/pypgatk)
![PyPI - Downloads](https://img.shields.io/pypi/dm/pypgatk)

**pypgatk** is a Python library part of the [ProteoGenomics Analysis Toolkit](https://pgatk.readthedocs.io/en/latest). It provides different bioinformatics tools for proteogenomics data analysis.

@@ -99,4 +101,6 @@ Commands:

The library provides multiple commands to download, translate and generate protein sequence databases from reference and mutation genome databases.

### Please read full docs here: <https://pgatk.readthedocs.io/en/latest/pypgatk.html>
# Full Documentation

[https://pgatk.readthedocs.io/en/latest/pypgatk.html](https://pgatk.readthedocs.io/en/latest/pypgatk.html)
72 changes: 36 additions & 36 deletions pypgatk/cgenomes/cgenomes_proteindb.py
@@ -17,7 +17,7 @@ class CancerGenomesService(ParameterConfiguration):
ACCEPTED_VALUES = "accepted_values"
SPLIT_BY_FILTER_COLUMN = "split_by_filter_column"
CLINICAL_SAMPLE_FILE = 'clinical_sample_file'

def __init__(self, config_file, pipeline_arguments):
"""
Init the class with the specific parameters.
@@ -34,7 +34,7 @@ def __init__(self, config_file, pipeline_arguments):

if self.CONFIG_OUTPUT_FILE in self.get_pipeline_parameters():
self._local_output_file = self.get_pipeline_parameters()[self.CONFIG_OUTPUT_FILE]

self._filter_column = self.get_default_parameters()[self.CONFIG_KEY_DATA][self.CONFIG_FILTER_INFO][self.FILTER_COLUMN]
if self.FILTER_COLUMN in self.get_pipeline_parameters():
self._filter_column = self.get_pipeline_parameters()[self.FILTER_COLUMN]
@@ -47,11 +47,11 @@ def __init__(self, config_file, pipeline_arguments):
self._split_by_filter_column = self.get_default_parameters()[self.CONFIG_KEY_DATA][self.CONFIG_FILTER_INFO][self.SPLIT_BY_FILTER_COLUMN]
if self.SPLIT_BY_FILTER_COLUMN in self.get_pipeline_parameters():
self._split_by_filter_column = self.get_pipeline_parameters()[self.SPLIT_BY_FILTER_COLUMN]

self._local_clinical_sample_file = self.get_default_parameters()[self.CONFIG_KEY_DATA][self.CONFIG_FILTER_INFO][self.CLINICAL_SAMPLE_FILE]
if self.CLINICAL_SAMPLE_FILE in self.get_pipeline_parameters():
self._local_clinical_sample_file = self.get_pipeline_parameters()[self.CLINICAL_SAMPLE_FILE]

@staticmethod
def get_multiple_options(options_str: str):
"""
@@ -129,7 +129,7 @@ def get_mut_pro_seq(snp, seq):
del_index1 = int(positions[0]) - 1
del_index2 = int(positions[1])
mut_pro_seq = protein_seq[:del_index1] + mut_aa + protein_seq[del_index2:]

elif "insertion" in snp.type:
ins_index1 = int(positions[0]) - 1
mut_pro_seq = protein_seq[:ins_index1] + mut_aa + protein_seq[ins_index1 + 1:]
@@ -141,7 +141,7 @@ def get_mut_pro_seq(snp, seq):
else:
del_index1 = int(positions[0]) - 1
mut_pro_seq = protein_seq[:del_index1] + mut_aa.replace("*", "")

return mut_pro_seq
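The deletion and insertion branches above splice 1-based mutation coordinates into a 0-based Python string. A minimal standalone sketch of that arithmetic (the helper names are illustrative, not pypgatk API; the insertion `mut_aa` is assumed to carry the replaced residue plus the inserted ones, as in the slicing above):

```python
# Splice a protein-level mutation into a sequence using 1-based
# coordinates, mirroring the slicing in get_mut_pro_seq above.
# apply_deletion / apply_insertion are illustrative names, not library API.

def apply_deletion(protein_seq: str, start: int, end: int, mut_aa: str) -> str:
    """Replace residues start..end (1-based, inclusive) with mut_aa."""
    return protein_seq[:start - 1] + mut_aa + protein_seq[end:]

def apply_insertion(protein_seq: str, pos: int, mut_aa: str) -> str:
    """Replace the residue at pos (1-based) with the mutant block."""
    return protein_seq[:pos - 1] + mut_aa + protein_seq[pos:]

if __name__ == "__main__":
    seq = "MKTAYIAKQR"
    print(apply_deletion(seq, 3, 5, "G"))   # MKGIAKQR
    print(apply_insertion(seq, 4, "AQ"))    # MKTAQYIAKQR
```

Because the mutation positions are 1-based and inclusive, the left slice ends at `start - 1` while the right slice resumes at `end`; off-by-one errors show up immediately on a short test sequence like the one above.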


@@ -158,7 +158,7 @@ def cosmic_to_proteindb(self):
COSMIC_CDS_DB[record.id].append(record)
except KeyError:
COSMIC_CDS_DB[record.id] = [record]

cosmic_input = open(self._local_mutation_file, encoding="latin-1") # CosmicMutantExport.tsv

header = cosmic_input.readline().split("\t")
@@ -171,7 +171,7 @@ def cosmic_to_proteindb(self):
filter_col = None
if self._filter_column:
filter_col = header.index(self._filter_column)

output = open(self._local_output_file, 'w')

mutation_dic = {}
@@ -188,7 +188,7 @@ def cosmic_to_proteindb(self):
if filter_col:
if row[filter_col] not in self._accepted_values and self._accepted_values!=['all']:
continue

if "coding silent" in row[muttype_col]:
continue

@@ -200,39 +200,39 @@ def cosmic_to_proteindb(self):
seqs = []
for record in this_gene_records:
seqs.append(record.seq)

except KeyError: # geneID is not in All_COSMIC_Genes.fasta
continue

for seq in seqs:
try:
mut_pro_seq = self.get_mut_pro_seq(snp, seq)
except IndexError:
continue
if mut_pro_seq:
break

if mut_pro_seq:
entry = ">%s\n%s\n" % (header, mut_pro_seq)
if header not in mutation_dic:
output.write(entry)
mutation_dic[header] = 1

if self._split_by_filter_column and filter_col:
try:
groups_mutations_dict[row[filter_col]][header] = entry
except KeyError:
groups_mutations_dict[row[filter_col]] = {header: entry}

for group_name in groups_mutations_dict.keys():
with open(self._local_output_file.replace('.fa', '')+ '_' + regex.sub('', group_name) +'.fa', 'w') as fn:
for header in groups_mutations_dict[group_name].keys():
fn.write(groups_mutations_dict[group_name][header])

self.get_logger().debug("COSMIC contains in total {} non redundant mutations".format(len(mutation_dic)))
cosmic_input.close()
output.close()
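The `mutation_dic` bookkeeping above writes each mutant header only once, so the output FASTA stays non-redundant. The core of that pattern can be sketched on its own (`write_nonredundant_fasta` is an illustrative helper, not pypgatk API):

```python
import io

# Header-keyed deduplication when writing mutant protein sequences
# to FASTA, mirroring the mutation_dic bookkeeping above.
# write_nonredundant_fasta is an illustrative helper, not pypgatk API.

def write_nonredundant_fasta(entries, out_handle):
    """entries: iterable of (header, sequence) pairs; each header
    is written at most once. Returns the non-redundant count."""
    seen = {}
    for header, seq in entries:
        if header in seen:
            continue
        out_handle.write(">%s\n%s\n" % (header, seq))
        seen[header] = 1
    return len(seen)

if __name__ == "__main__":
    buf = io.StringIO()
    n = write_nonredundant_fasta(
        [("mut1", "MKT"), ("mut2", "MKA"), ("mut1", "MKT")], buf)
    print(n)  # 2
```

Keying on the full FASTA header (which encodes transcript, gene, and mutation) is what makes two identical mutation calls from different COSMIC rows collapse into a single database entry.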

@staticmethod
def get_sample_headers(header_line):
try:
@@ -246,7 +246,7 @@ def get_sample_headers(header_line):
print('SAMPLE_ID was not found in the header row:', header_line)
return None, None
return tissue_type_col, sample_id_col

def get_tissue_type_per_sample(self, local_clinical_sample_file):
sample_tissue_type = {}
if local_clinical_sample_file:
@@ -262,24 +262,24 @@ def get_tissue_type_per_sample(self, local_clinical_sample_file):
if tissue_type_col and sample_id_col:
sample_tissue_type[sl[sample_id_col]] = sl[tissue_type_col].strip().replace(' ','_')
return sample_tissue_type
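`get_tissue_type_per_sample` above builds a SAMPLE_ID-to-tissue-type map from a tab-separated clinical file. A self-contained sketch of that parsing (the `PRIMARY_SITE` column name is an assumption for illustration; the library resolves the actual tissue column via `get_sample_headers`):

```python
import csv
import io

def tissue_type_per_sample(handle):
    """Map SAMPLE_ID -> tissue type from a tab-separated clinical file,
    normalizing spaces to underscores as the library does.
    PRIMARY_SITE is an assumed column name for this sketch."""
    reader = csv.DictReader(handle, delimiter="\t")
    mapping = {}
    for row in reader:
        sample = row.get("SAMPLE_ID")
        tissue = row.get("PRIMARY_SITE")
        if sample and tissue:
            mapping[sample] = tissue.strip().replace(" ", "_")
    return mapping

if __name__ == "__main__":
    clinical = io.StringIO(
        "SAMPLE_ID\tPRIMARY_SITE\nS1\tlarge intestine\nS2\tlung\n")
    print(tissue_type_per_sample(clinical))
    # {'S1': 'large_intestine', 'S2': 'lung'}
```

Normalizing spaces to underscores matters downstream: the tissue-type string later becomes part of a per-group output file name.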

@staticmethod
def get_mut_header_cols(header_cols, row, filter_column, accepted_values, split_by_filter_column):
for col in header_cols.keys():
header_cols[col] = row.index(col)

#check if (filter_column) tissue type should be considered
if accepted_values!=['all'] or split_by_filter_column:
try:
header_cols[filter_column] = row.index(filter_column)
except ValueError:
print("{} was not found in the header {} of mutations file".format(filter_column, row))
header_cols[filter_column] = None

return header_cols

def cbioportal_to_proteindb(self):
"""cBioportal studies have a data_clinical_sample.txt file
"""cBioportal studies have a data_clinical_sample.txt file
that shows the Primary Tumor Site per Sample Identifier
it matches the (filter_column) Tumor_Sample_Barcode column in the mutations file.
"""
@@ -289,19 +289,19 @@ def cbioportal_to_proteindb(self):
output = open(self._local_output_file, "w")
sample_groups_dict = {}
group_mutations_dict = {}

seq_dic = {}
for record in fafile:
newacc = record.id.split(".")[0]
if newacc not in seq_dic:
seq_dic[newacc] = record.seq
header_cols = {"HGVSc": None, "Transcript_ID": None, "Variant_Classification": None,

header_cols = {"HGVSc": None, "Transcript_ID": None, "Variant_Classification": None,
"Variant_Type": None, "HGVSp_Short": None}
nucleotide = ["A", "T", "C", "G"]
mutclass = ["Frame_Shift_Del", "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins", "Missense_Mutation",
"Nonsense_Mutation"]

# check if sample id and clinical files are given, if not and tissue type is required then exit
if self._accepted_values!=['all'] or self._split_by_filter_column:
if self._local_clinical_sample_file:
@@ -311,7 +311,7 @@ def cbioportal_to_proteindb(self):
else:
print('No clinical sample file is given therefore no tissue type can be detected.')
return

for i,line in enumerate(mutfile):
row = line.strip().split("\t")
if row[0]=='#':
@@ -337,7 +337,7 @@ def cbioportal_to_proteindb(self):
print("No sampleID was found in (line {}): {}".format(i, row))
if group not in self._accepted_values and self._accepted_values != ['all']:
continue

gene = row[0]
try:
pos = row[header_cols["HGVSc"]]
@@ -353,13 +353,13 @@ def cbioportal_to_proteindb(self):
continue
if varclass not in mutclass:
continue

try:
seq = seq_dic[enst]
except KeyError:
print("%s not found:" % enst)
continue

if ":" in pos:
cdna_pos = pos.split(":")[1]
else:
@@ -411,30 +411,30 @@ def cbioportal_to_proteindb(self):
enst_pos = int(re.findall(r'\d+', cdna_pos.split("_")[1])[0])
else:
print("unexpected insertion format")
continue;
continue

seq_mut = seq[:enst_pos] + ins_dna + seq[enst_pos:]

if seq_mut == "":
continue;
continue

mut_pro_seq = seq_mut.translate(to_stop=True)
if len(mut_pro_seq) > 6:
header = "cbiomut:%s:%s:%s:%s" % (enst, gene, aa_mut, varclass)
output.write(">%s\n%s\n" % (header, mut_pro_seq))

if self._split_by_filter_column:
try:
group_mutations_dict[group][header] = mut_pro_seq
except KeyError:
group_mutations_dict[group] = {header: mut_pro_seq}

output.close()
mutfile.close()
fafile.close()

for group in group_mutations_dict.keys():
with open(self._local_output_file.replace('.fa', '')+ '_' + regex.sub('', group) +'.fa', 'w') as fn:
for header in group_mutations_dict[group].keys():
fn.write(">{}\n{}\n".format(header, group_mutations_dict[group][header]))
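Both converters finish by splitting entries into per-group FASTA files, deriving each name from the main output path plus a sanitized group name. A sketch of that naming scheme (the sanitizing pattern here is an assumption; the library's actual `regex` object is defined elsewhere in the module):

```python
import re

# Assumed sanitizer: drop characters unsafe in file names. The
# library's actual `regex` object is defined elsewhere in the module.
_SANITIZE = re.compile(r"[^a-zA-Z0-9_]")

def group_output_file(output_file: str, group_name: str) -> str:
    """Derive '<output>_<group>.fa' from '<output>.fa', mirroring the
    per-group file naming used by both converters above."""
    return (output_file.replace(".fa", "")
            + "_" + _SANITIZE.sub("", group_name) + ".fa")

if __name__ == "__main__":
    print(group_output_file("mutations.fa", "large intestine"))
    # mutations_largeintestine.fa
```

Note that `str.replace(".fa", "")` strips every occurrence of `.fa` in the path, not just a trailing extension; a path such as `my.fasta.dir/out.fa` would be mangled, so a suffix-aware strip would be safer.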
