pairwise/clustering downstream-analysis research-driven thoughts #252

mr-eyes · 2024-02-27T22:12:23Z

Graph construction was implemented using rustworkx in MRG: Add graph-based clustering #234. I want to mention that the rustworkx Python interface is remarkably well optimized. Using Python, we can build all downstream graph analyses after the initial creation of the graph (an undirected graph, I suppose).
BiPartite: As visualized in DBRetina bipartite, this can be useful in so many applications (maybe like):
- Metagenome compositional analysis.
- Relations between pangenome-like signatures and genomes.
- Host-Pathogen Interactions.
- Strain-level analysis.
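As a rough illustration of the bipartite idea, here is a minimal stdlib-only sketch, assuming one node set of genomes and one node set of signatures. The function names and toy data are hypothetical, not part of the plugin or of DBRetina. It stores the bipartite graph as two adjacency maps and projects it onto the genome side, so that each genome pair is weighted by the number of signatures it shares.

```python
from collections import defaultdict
from itertools import combinations


def build_bipartite(memberships):
    """Build both sides of a bipartite graph from (genome, signature) pairs."""
    genome_to_sigs = defaultdict(set)
    sig_to_genomes = defaultdict(set)
    for genome, sig in memberships:
        genome_to_sigs[genome].add(sig)
        sig_to_genomes[sig].add(genome)
    return genome_to_sigs, sig_to_genomes


def project_onto_genomes(sig_to_genomes):
    """One-mode projection: weight of (g1, g2) = number of shared signatures."""
    weights = defaultdict(int)
    for genomes in sig_to_genomes.values():
        for g1, g2 in combinations(sorted(genomes), 2):
            weights[(g1, g2)] += 1
    return dict(weights)


# Hypothetical toy data: which signature was found in which genome.
pairs = [
    ("genomeA", "sig1"), ("genomeA", "sig2"),
    ("genomeB", "sig1"), ("genomeB", "sig2"),
    ("genomeC", "sig2"), ("genomeC", "sig3"),
]
genome_to_sigs, sig_to_genomes = build_bipartite(pairs)
projection = project_onto_genomes(sig_to_genomes)
```

The same two maps could be loaded into a rustworkx or NetworkX graph for the actual analyses; the plain dictionaries are only meant to show the data shape.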
Community Detection: The current clustering algorithm is weakly_connected_component. @bluegenes tried it before with kSpider and, as far as I remember, it did a great job in the ANI-based clustering of GTDB-207. Here, I propose adopting community detection methods, which have proven very useful in DBRetina, although I haven't tried them on DNA data.
- Note: rustworkx currently offers less variety in graph algorithms than NetworkX.
- Suggested algorithms to explore:
  - infomap: paper, implementation
  - Leiden: an extension of the popular Louvain algorithm; its authors provide an excellent Python/C++ package.
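To make the motivation concrete, here is a minimal sketch of threshold-based connected-component clustering, written with a stdlib union-find rather than the rustworkx call the plugin actually uses. The function name, the edge tuples, and the thresholds are assumptions for illustration. It shows the chaining behaviour that community detection methods are meant to soften: a single bridging edge is enough to merge two otherwise separate groups.

```python
def cluster_by_components(edges, threshold):
    """Cluster nodes by connected components over edges with sim >= threshold.

    edges: iterable of (node_a, node_b, similarity) tuples.
    Returns a sorted list of sorted clusters.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, sim in edges:
        root_a, root_b = find(a), find(b)  # registers both nodes either way
        if sim >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise similarities; g3-g4 is a single weak bridge.
toy_edges = [
    ("g1", "g2", 0.98),
    ("g2", "g3", 0.97),
    ("g4", "g5", 0.99),
    ("g3", "g4", 0.95),
]
strict = cluster_by_components(toy_edges, threshold=0.96)
loose = cluster_by_components(toy_edges, threshold=0.95)
```

With the stricter threshold the two groups stay apart; lowering it by 0.01 collapses everything into one component, which is the kind of sensitivity that modularity- or flow-based methods such as Leiden or Infomap might handle more gracefully.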
k-mer graph 🌟: Here the graph will consist of k-mer hashes as nodes and genomes/metagenomes/etc. as edges, with abundance as the edge weight. This can also be useful for God knows how many applications (maybe like):
- Biomarkers detection
- Evolutionary and taxonomy analysis
- Low-complexity k-mers detection and removal
- and more ...
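One possible reading of the k-mer graph, sketched with the stdlib only: an inverted index from each k-mer hash to the samples that contain it, with abundance stored per sample. The data layout, function names, and toy hashes are assumptions, not an existing format. Hashes found in every sample are candidates for low-information removal, and hashes found in exactly one sample are candidate biomarkers.

```python
from collections import defaultdict


def build_kmer_index(samples):
    """Invert {sample: {kmer_hash: abundance}} into {kmer_hash: {sample: abundance}}."""
    index = defaultdict(dict)
    for name, hashes in samples.items():
        for kmer_hash, abundance in hashes.items():
            index[kmer_hash][name] = abundance
    return dict(index)


def hashes_by_prevalence(index, n_samples, min_fraction):
    """Hashes present in at least min_fraction of the samples."""
    return sorted(
        h for h, owners in index.items() if len(owners) / n_samples >= min_fraction
    )


def sample_specific_hashes(index):
    """Hashes that occur in exactly one sample, grouped by that sample."""
    specific = defaultdict(list)
    for kmer_hash, owners in index.items():
        if len(owners) == 1:
            (name,) = owners  # the single sample holding this hash
            specific[name].append(kmer_hash)
    return {name: sorted(hs) for name, hs in specific.items()}


# Hypothetical toy sketches: small integers stand in for real k-mer hashes.
toy_samples = {
    "s1": {11: 3, 22: 1, 33: 5},
    "s2": {11: 2, 33: 1, 44: 7},
    "s3": {11: 9, 55: 1},
}
kmer_index = build_kmer_index(toy_samples)
ubiquitous = hashes_by_prevalence(kmer_index, n_samples=3, min_fraction=1.0)
specific = sample_specific_hashes(kmer_index)
```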
Interactive Dashboard: In DBRetina, I implemented a JS-based dashboard that loads the graphs and allows interactive exploration by filtering/querying the graph on many features, thresholds, etc. It was super helpful. Previously, this was done by exporting the graph to a graph database like Neo4j or Memgraph, but that does not help software users.
(maybe odds ratio & p-value): In the pairwise script, we could add an optional calculation of similarity significance by computing the odds ratio and p-value. But I will need to think more about it in this context.
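As a starting point for that thinking, here is a minimal sketch of one way the statistic could be computed: a 2x2 contingency table over two k-mer hash sets and a one-sided Fisher's exact test, using only the stdlib. The choice of test, the function name, and especially the `universe_size` parameter (how many distinct hashes count as the background) are assumptions; picking a sensible universe is likely the hard part in this context.

```python
from math import comb


def overlap_significance(set_a, set_b, universe_size):
    """Odds ratio and one-sided Fisher p-value for the overlap of two sets.

    universe_size is the assumed number of distinct hashes in the background.
    """
    a = len(set_a & set_b)            # in both
    b = len(set_a - set_b)            # only in A
    c = len(set_b - set_a)            # only in B
    d = universe_size - (a + b + c)   # in neither
    if d < 0:
        raise ValueError("universe_size is smaller than the union of the sets")

    # Sample odds ratio; reported as infinity when the denominator is zero.
    odds_ratio = (a * d) / (b * c) if b * c > 0 else float("inf")

    # P(overlap >= a) under the hypergeometric null of random overlap.
    n_a, n_b = a + b, a + c
    total = comb(universe_size, n_b)
    tail = sum(
        comb(n_a, k) * comb(universe_size - n_a, n_b - k)
        for k in range(a, min(n_a, n_b) + 1)
    )
    return odds_ratio, tail / total


# Hypothetical toy sets: 3 shared hashes, 1 unique to each, 3 in neither.
odds, p_value = overlap_significance({1, 2, 3, 4}, {2, 3, 4, 5}, universe_size=8)
```

For real sketches the exact binomial sums get large, so a library routine or an approximation would probably be needed; this version is only meant to pin down what the table and the null hypothesis would look like.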


mr-eyes · 2024-02-27T22:20:26Z

Notes regarding visualization and clustering:

UMAP, t-SNE, and other MDS algorithms usually require tweaking the parameters many times to get the expected output.
Constructing an MDS embedding and then running k-means or another clustering algorithm on it can be super useful.

Examples for MDS visualizations done by kSpider: https://farm.cse.ucdavis.edu/~mhussien/hmp_bacterial_plots/

ref #248

mr-eyes mentioned this issue Mar 7, 2024

updated usage example dib-lab/kSpider#39

Open

ctb mentioned this issue May 14, 2024

sourmash compare runs out of memory on large comparisons sourmash-bio/sourmash#3134

Open
