Adding some experimental code directly to master. I'm considering the compositional analysis a "hotfix" because it's essential to the suite's functionality. The cublas-gemm code was not a success: the module still doesn't compile because of multiple outstanding Cythonization issues. So, while I'm figuring out how to properly implement Strassen, I can work with the regression.pyx file and continue to commit to main.

Squash commit:

Author: Matt Ralston <professional.bio.coder@gmail.com>
Date:   Tue Aug 27 13:58:53 2024 -0400
Date:   Mon Aug 26 22:40:36 2024 -0400
Date:   Fri Aug 16 20:26:20 2024 -0400

    Strassen assignment. Struggling with some portions. New to Cython and its documentation. The syntax seems good and agreeable. Some allocate-then-mutate patterns are bothersome, but nothing is too confusing.

    Adds CUDA code for delegation of regression multiplications to cublas-gemm. The scaffolding is from ChatGPT, so it is not entirely valid Cython and is still producing errors. Thanks in advance.

    Closes #154. Patches parse.py to remove deprecated parameter 'b' from legacy/deprecated code. Pushing to main directly.
MatthewRalston committed Aug 28, 2024
1 parent 82de159 commit 2774ec6
Showing 15 changed files with 737 additions and 23 deletions.
16 changes: 15 additions & 1 deletion .gitignore
@@ -6,7 +6,21 @@
__pycache__
kdb/__pycache__
**.kdb
docs/generated/
examples/example_report/*.png
examples/example_report
test/data
pypi_token.foo
kmerdb.egg-info/
kmerdb.log
test.*
build
dist
.reinstall.sh
examples/*/*.png
examples/*/*.jpg






90 changes: 86 additions & 4 deletions TODO.org
@@ -6,6 +6,78 @@
# .kdb files should be debrujin graph databases
# The final prototype would be .bgzf format from biopython

* 8/14/24 compositional ideas

** Notes:
It's a regression problem. The coefficients should sum to unity, and the sum of the k-mer count vectors is the total k-mer count.


the count vector is the 'coefficient' times the aggregate vector (.kdb file)

therefore, the 'coefficients' (the proportions attributable to each species), along with regression statistics, can be obtained by performing least-squares on a count matrix, Ax = b, where b is the aggregate/collated/composite count-vector profile of multiple species, from which the decomposition proceeds.

A is the count matrix obtained by collating suspected species of the 'metagenome' into one k-mer profile.

b is the 'observed' or artificially generated metagenome k-mer profile.

I want to do this in Cython, Numba, and/or CUDA.

Then do the D2 statistics.
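
A minimal sketch of the regression described in these notes, assuming NumPy is available. The function name `decompose_profile`, the conversion of counts to frequencies, and the clip-and-renormalize step are illustrative assumptions, not kmerdb's implementation; a non-negative or equality-constrained solver would be a more rigorous way to force the proportions to sum to unity.

```python
import numpy as np


def decompose_profile(A, b):
    """Estimate mixing proportions x such that A x ~= b (hypothetical helper).

    A: (n_kmers, n_species) count matrix, one column per suspected species.
    b: (n_kmers,) composite ("metagenome") count vector.
    """
    A = np.asarray(A, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)

    # Convert counts to frequencies so every column of A, and b, sums to 1.
    A_freq = A / A.sum(axis=0, keepdims=True)
    b_freq = b / b.sum()

    # Ordinary (unconstrained) least squares.
    x, _residuals, _rank, _sv = np.linalg.lstsq(A_freq, b_freq, rcond=None)

    # Crude constraint handling (an assumption of this sketch):
    # drop negative coefficients, then renormalize to sum to unity.
    x = np.clip(x, 0.0, None)
    total = x.sum()
    if total == 0.0:
        raise ValueError("all estimated coefficients were non-positive")
    return x / total


# Artificial metagenome: three random 4-mer profiles mixed at known ratios.
rng = np.random.default_rng(0)
A = rng.integers(1, 100, size=(256, 3))
true_props = np.array([0.6, 0.3, 0.1])
b = (A / A.sum(axis=0, keepdims=True)) @ true_props * 1_000_000

est = decompose_profile(A, b)
print(est)
```

Because the artificial composite is an exact mixture, the estimate should match the known ratios up to floating-point error; real profiles would carry noise and would need the regression statistics mentioned above.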
**

**



* TODO 8/13/24 priority: compositional analysis

** Lots of priorities, currently. Comp. analysis, custom SVD implementation for the matrix command? A = UΣVᵀ

** Markov chain submodules? Log-likelihood ratio

** TODO [/] T O D O list

*** IN-PROGRESS [ x ] Quarto journals elsevier preprint for biorxiv
:LOGBOOK:
- State "IN-PROGRESS" from "NEXT" [2024-08-16 Fri 20:33]
:END:

*** IN-PROGRESS [ x ] compositional analysis (p0)
:LOGBOOK:
- State "IN-PROGRESS" from "NEXT" [2024-08-16 Fri 20:33]
:END:

*** DONE [ x ]Are all species downloaded
CLOSED: [2024-08-16 Fri 20:33]
:LOGBOOK:
- State "DONE" from "WAITING" [2024-08-16 Fri 20:33]
- State "WAITING" from "DONE" [2024-08-14 Wed 16:36]
- State "DONE" from "WAITING" [2024-08-14 Wed 16:36]
- State "WAITING" from "IN-PROGRESS" [2024-08-14 Wed 16:36]
- State "IN-PROGRESS" from "NEXT" [2024-08-14 Wed 16:36]
:END:

** Description from Xiong et al Mouse IBD.

Once GF status was confirmed as described, GF NOD mice were colonized by co-housing with gnotobiotic mice
colonized with defined cultured bacterial species (Altered Schaedler's flora; ASF - [35]) which were prepared
in the laboratory from cloned bacteria using sterile technique. The ASF consists of Lactobacillus acidophilus
(ASF 360), Lactobacillus murinus (ASF 361), Bacteroides distasonis, (ASF 519), Mucispirillum schaedleri (ASF 457),
Eubacterium plexicaudatum (ASF 492), a Fusiform-shaped bacterium (ASF 356) and two Clostridium species (ASF 500, ASF 502).
These ASF-colonized gnotobiotic mice were then bred in isolators to ensure no additional species were introduced.
The presence of the ASF species was confirmed by species-specific bacterial qPCR [58].

**

**

* 8/9/24 profile decomposition/recomposition(simulation) problem
** Merge profiles with ratios (0.25% B. bifidum)

**

* 8/8/24 Taking Notes on Xuejiang Xiong Mouse model IBD study

** SRA Accession id
@@ -145,18 +217,23 @@ Kolmogorov complexity comes in two flavors: prefix-free (K(x)) and simple comple


** TODO core species choices
*** chicken farm estuary system changes (algination, asphyxia, microbiological changes
*** chicken farm runoff - estuary system changes (algination, asphyxia, microbiological changes)
*** anti-human leaky gut syndrome changes.
**** i.e. looking at the human leaky gut syndrome, but in reverse. What are bioprotective species and niches that provide resilience to leaky-gut syndrome
**** TODO chemophore SMILES and gastrotoxic footprints
**** mouse model (SRA051354) currently being studied from Xuejiang Xiong
**** looking to assess the Altered Schaedler flora/formula changes in irritable bowel syndrome.
**** Currently, only have the accession and brief notes, still reading as of 8/12/24
****


*** pathology of lupus or auto-immune skin condition microbiome/metagenomic changes.
*** vaginal microbiome changes
***
** Perspective 1 from review on distance metrics
**
* IN PROGRESS 7/10/24 - [IMPORTANT] Needs a choice [cython d2 x graph algorithm features ]:
** [Key choice needed]: 1 [ 2 reviews + cython D2 metrics ] path 2 [ 2 reviews + graph algorithm ]

** cython d2 metrics including the delta distance : |pab(A)-pab(B)| (Karlin et al, tetra,tri,di- nucleotide frequencies)
** (describe Karlin delta, algorithm to calculate)
*** Karlin delta first requires the least ambiguous k-mer (4-mer) frequency, i.e. the frequency of self
@@ -165,13 +242,18 @@ Kolmogorov complexity comes in two flavors: prefix-free (K(x)) and simple comple
*** this specifies the numerator for the tetranucleotide frequency (lowercase tau)
*** the denominator is only the most specific tetra and 1-neighboring trinucleotide frequencies, and the mononucleotide frequencies. [ f(acc) f(accg) f(ccg) f(a) f(c) f(t) f(g) ]
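
A minimal sketch of the simplest (dinucleotide) case of the Karlin delta distance, using only the standard library. It assumes the usual definition rho_xy = f(xy) / (f(x) f(y)) and delta = (1/16) * sum over xy of |rho_xy(A) - rho_xy(B)|. The function names are hypothetical, sequences are assumed to contain only A, C, G, and T, and the strand symmetrization used in Karlin's work (counting the sequence together with its reverse complement) is deliberately left out. The tetranucleotide (tau) version described above needs the extra tri- and tetranucleotide terms and is not shown.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def relative_abundance(seq):
    """Dinucleotide relative abundances rho_xy = f(xy) / (f(x) * f(y))."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")

    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(n - 1))

    rho = {}
    for x, y in product(BASES, repeat=2):
        fx = mono[x] / n
        fy = mono[y] / n
        fxy = di[x + y] / (n - 1)
        # Define rho as 0 when a base is absent, to avoid division by zero.
        rho[x + y] = fxy / (fx * fy) if fx > 0 and fy > 0 else 0.0
    return rho


def karlin_delta(seq_a, seq_b):
    """Mean absolute difference of the 16 dinucleotide relative abundances."""
    rho_a = relative_abundance(seq_a)
    rho_b = relative_abundance(seq_b)
    return sum(abs(rho_a[k] - rho_b[k]) for k in rho_a) / 16


seq_a = "ACGTACGTAGCTAGCTAGGCTA"
seq_b = "AAAATTTTAAAATTTTCCGG"
print(karlin_delta(seq_a, seq_b))
```

In practice the mono- and dinucleotide frequencies would come from k=1 and k=2 profiles rather than from raw strings; the string-based counting here just keeps the sketch self-contained.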
**
** new graph file format specification ( walk,path is a subclass of unlabeled graph, where node labels can be visited, path order, and progressive or retro in the walk.
** new graph file format specification (walk, path is a subclass of unlabeled graph, where node labels can be visited, path order, and progressive or retro in the walk.
** contig generator method, and contig boundary definition specification
**
**
**
**
* 6/28/24 - [ ...whoops, forgot the date by 3 x24hr blocxz. ] okay, so the 0.8.4 release should have the graph labeling done.
* TODO 6/28/24 - [ ...whoops, forgot the date by 3 x24hr blocxz. ] okay, so the 0.8.4 release should have the graph labeling done.
:LOGBOOK:
- State "DELEGATED" from "CANCELED" [2024-08-12 Mon 17:02]
- State "CANCELED" from "DELEGATED" [2024-08-12 Mon 17:02]
- State "DELEGATED" from [2024-08-12 Mon 17:02]
:END:

** graph node labeling and classification, and walk strategy

20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
35 changes: 35 additions & 0 deletions docs/make.bat
@@ -0,0 +1,35 @@
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)

if "%1" == "" goto help

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd
18 changes: 18 additions & 0 deletions kmerdb/__init__.py
@@ -207,6 +207,13 @@ def expanded_help(arguments):

    sys.stderr.write("\n\nUse --help for expanded usage\n")


def composition(arguments):


    from kmerdb.regression import linear_regression


def distances(arguments):
    """
    An end-user function to provide CLI access to certain distances
@@ -2419,6 +2426,17 @@ def cli():
    dist_parser.set_defaults(func=distances)


    comp_parser = subparsers.add_parser("composition", help=appmap.command_14_description)

    comp_parser.add_argument("-v", "--verbose", help="Prints warnings to the console by default", default=0, action="count")
    comp_parser.add_argument("--debug", action="store_true", default=False, help="Debug mode. Do not format errors and condense log")
    comp_parser.add_argument("-l", "--log-file", type=str, default="kmerdb.log", help="Destination path to log file")
    comp_parser.add_argument("-nl", "--num-log-lines", type=int, choices=config.default_logline_choices, default=50, help=argparse.SUPPRESS)
    comp_parser.add_argument("kdb", metavar="composite.$K.kdb", type=str, help="Composite/collated metagenomic k-mer profile (as .kdb) to decompose into constituents")
    # sys.STDIN does not exist (AttributeError); the standard input stream is sys.stdin
    comp_parser.add_argument("table", metavar="input.tsv", default=sys.stdin, type=str, help="Input count matrix to use for compositional analysis")



    index_parser = subparsers.add_parser("index", help=appmap.command_9_description)

    index_parser.add_argument("--force", action="store_true", help="Force index creation (if previous index exists)")