Metagenomics Profiling

Given a subset of ENA project identifiers related to metagenomics, can we explore the sample attributes, profile the different types and curate them? All code is written in Python 3. Note that you also need samples.csv, a large file that exceeds the 100MB limit on git and so is not included in this repository. It can be generated using the BioSamples API with make_input.py (see https://github.com/EBIBioSamples/curami for more details). Otherwise please get in touch and I can send you the latest dump (hewgreen@ebi.ac.uk).

Getting Started

Have a look at the attribute counts in metagenome_profile.csv. These cover the whole subset of 38120 samples.
Look at the raw co-occurrence data in samples_subset_coocurences.csv or the weighted version in coexistencesProb.csv.
Download Gephi and open session_file.gephi or coexistences.gexf.

Data Processing (`ID_converter.py`)

input.txt contains a list of ENA project identifiers. ID_converter.py expands these project IDs into individual sample IDs using the ENA API and XML parsing. It then converts the sample IDs into BioSamples IDs, again using the ENA API and XML parsing. The script writes three files: expanded_ENAIDs.json, missing_BioSampleIDs.json and fetched_BioSampleIDs.json. The last of these is required by the next script.
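
To illustrate the second conversion step, here is a minimal sketch of how sample records could be split into those with and without a BioSamples ID. It is not the code in ID_converter.py. The XML below is hand-written for the example, the element layout (an EXTERNAL_ID element with namespace="BioSample" under IDENTIFIERS) is an assumption about what the ENA API returns, and the function name split_biosample_ids and the shape of the two outputs are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, hand-written XML in the style of an ENA sample record.
# The real response from the ENA API may be laid out differently.
SAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE accession="ERS000001">
    <IDENTIFIERS>
      <PRIMARY_ID>ERS000001</PRIMARY_ID>
      <EXTERNAL_ID namespace="BioSample">SAMEA0000001</EXTERNAL_ID>
    </IDENTIFIERS>
  </SAMPLE>
  <SAMPLE accession="ERS000002">
    <IDENTIFIERS>
      <PRIMARY_ID>ERS000002</PRIMARY_ID>
    </IDENTIFIERS>
  </SAMPLE>
</SAMPLE_SET>"""


def split_biosample_ids(xml_text):
    """Return (fetched, missing).

    fetched maps each ENA sample accession to its BioSamples ID.
    missing lists the ENA sample accessions that have no BioSamples ID.
    """
    fetched, missing = {}, []
    root = ET.fromstring(xml_text)
    for sample in root.iter("SAMPLE"):
        ena_id = sample.get("accession")
        # Assumed location of the BioSamples cross-reference.
        external = sample.find("IDENTIFIERS/EXTERNAL_ID[@namespace='BioSample']")
        if external is not None and external.text:
            fetched[ena_id] = external.text.strip()
        else:
            missing.append(ena_id)
    return fetched, missing


fetched, missing = split_biosample_ids(SAMPLE_XML)
print(json.dumps(fetched))
print(json.dumps(missing))
```

Keeping the parsing separate from the HTTP request means the same function can be run on saved responses without calling the API again.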

Attribute Analysis

profiler.py requires fetched_BioSampleIDs.json and samples.csv. It returns the following:

profile_dict.json - counts of attributes in the samples
metagenome_profile.csv - counts of attributes in the samples
samples_subset.csv - each sample and the attributes associated with it (extracted from samples.csv)
samples_subset_coocurences.json - nested dictionary of counted co-occurrences
samples_subset_coocurences.csv - counted co-occurrences in a 4 column csv (index, attribute 1, attribute 2 and count respectively)
coexistencesProb.csv - weighted co-occurrences that take attribute popularity into account. This is used as the edge weight in Gephi.
coexistences.gexf - a file readable by Gephi (a free graph layout tool, https://gephi.org) to explore the co-occurrences. There is also a session file in the repo that has been colored and laid out.
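
The counting behind these outputs can be sketched as follows. This is a minimal illustration on toy data, not the code in profiler.py. The sample IDs and attribute names are made up, the function names are hypothetical, and the weighting shown is a Jaccard index, swapped in as one common way to take popularity into account; the formula profiler.py actually uses for coexistencesProb.csv is not described here.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Hypothetical toy data: sample ID -> set of attribute names.
samples = {
    "SAMEA1": {"organism", "depth", "collection date"},
    "SAMEA2": {"organism", "depth"},
    "SAMEA3": {"organism"},
}


def count_attributes(samples):
    """Count how many samples carry each attribute."""
    counts = Counter()
    for attrs in samples.values():
        counts.update(attrs)
    return counts


def count_cooccurrences(samples):
    """Count how many samples carry each unordered pair of attributes."""
    pairs = Counter()
    for attrs in samples.values():
        # Sorting makes (a, b) and (b, a) the same key.
        pairs.update(combinations(sorted(attrs), 2))
    return pairs


def jaccard_weights(attr_counts, pair_counts):
    """Assumed weighting: samples with both / samples with either."""
    return {
        (a, b): n / (attr_counts[a] + attr_counts[b] - n)
        for (a, b), n in pair_counts.items()
    }


attr_counts = count_attributes(samples)
pair_counts = count_cooccurrences(samples)
weights = jaccard_weights(attr_counts, pair_counts)

# Write the counts as a 4 column csv: index, attribute 1, attribute 2, count.
buffer = io.StringIO()
writer = csv.writer(buffer)
for index, ((a, b), n) in enumerate(sorted(pair_counts.items())):
    writer.writerow([index, a, b, n])
print(buffer.getvalue())
```

With a weighting of this kind, a pair of rare attributes that always appear together gets a higher edge weight than a pair of very common attributes that overlap only because both are widespread.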

Future

Dimension reduction of the samples and clustering to define different subsets
Building slicing capability into the curation app curami so these attributes can be refined
