add get_model to _datasets #87

Merged
merged 12 commits into from
Jan 15, 2025
1 change: 1 addition & 0 deletions docs/api/datasets.md
Original file line number Diff line number Diff line change
@@ -12,6 +12,7 @@ Downloading of use case datasets which are explored in the example analyses.

get_dataset
get_motif_db
get_model
Genome
register_genome
```
1 change: 1 addition & 0 deletions docs/index.md
@@ -9,6 +9,7 @@
installation
tutorials/index
api/index
models/index
changelog
contributing
references
45 changes: 45 additions & 0 deletions docs/models/biccn.rst
@@ -0,0 +1,45 @@
BICCN
============

.. sidebar:: Model Features

- **Genome**: *mm10*
- **Type**: Peak Regression
- **Parameters**: 6.3M
- **Size**: 23MB
- **Input shape**: (2114, 4)
- **Output shape**: (19,)

The **BICCN** model is a peak regression model fine-tuned to cell type-specific regions for cell types in the mouse cortex. It was used in the BICCN Challenge to predict the in vivo activity of a large set of validated enhancers, where it ranked highest among all submitted sequence models.

After pretraining on all consensus peaks, the model was fine-tuned to specific peaks. Specific peaks were determined through the ratio of the highest to the second-highest peak, and the ratio of the second- to the third-highest peak. These sets of regions were then used as input to the model, where 2114 bp one-hot encoded DNA sequences were used to predict, per cell type, the mean peak accessibility over the center 1000 bp of the peak.
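
The ratio-based selection described above can be sketched in plain Python. This is only an illustration, not the procedure used for the published model: the function name ``is_specific``, both thresholds, and the rule that either ratio may qualify a region are assumptions made for the example.

```python
def is_specific(peak_heights, min_top_ratio=1.5, min_second_ratio=1.5, eps=1e-9):
    """Return True if a region looks cell type-specific.

    peak_heights: per-cell-type peak heights for a single region.
    The thresholds are hypothetical placeholders, not published values.
    """
    if len(peak_heights) < 3:
        raise ValueError("need peak heights for at least three cell types")
    ranked = sorted(peak_heights, reverse=True)
    # ratio of the highest to the second-highest peak
    top_ratio = ranked[0] / (ranked[1] + eps)
    # ratio of the second- to the third-highest peak
    second_ratio = ranked[1] / (ranked[2] + eps)
    # assumption: a region qualifies if either ratio is large enough
    return top_ratio >= min_top_ratio or second_ratio >= min_second_ratio


print(is_specific([10.0, 2.0, 1.0]))  # one dominant cell type
print(is_specific([3.0, 3.0, 3.0]))   # equally accessible everywhere
```

Under this assumed rule, a region with one dominant cell type, or with two cell types that clearly stand out from the rest, is kept for fine-tuning.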

The model is a CNN multiclass regression model using the :func:`~crested.tl.zoo.chrombpnet` architecture.

Details of the data and the model can be found in the original publication.

-------------------

.. admonition:: Citation

Johansen, N.J., Kempynck, N. et al. Evaluating Methods for the Prediction of Cell Type-Specific Enhancers in the Mammalian Cortex. bioRxiv (2024). https://doi.org/10.1101/2024.08.21.609075

Usage
-------------------

.. code-block:: python
:linenos:

import crested
import keras

# download model
model_path, output_names = crested.get_model("BICCN")

# load model
model = keras.models.load_model(model_path)

# make predictions
sequence = "A" * 2114  # length must match the model's input shape (2114, 4)
predictions = crested.tl.predict(sequence, model)
print(predictions.shape)
45 changes: 45 additions & 0 deletions docs/models/deepchickenbrain1.rst
@@ -0,0 +1,45 @@
DeepChickenBrain1
=================

.. sidebar:: Model Features

- **Genome**: *galGal6*
- **Type**: Topic Classification
- **Parameters**: 11.1M
- **Size**: 38MB
- **Input shape**: (500, 4)
- **Output shape**: (20,)

The **DeepChickenBrain1** model is a topic classification model, fine-tuned with differentially accessible regions (DARs) to make cell type level predictions for cell types in the chicken telencephalon.

After pretraining on topics, obtained through `pycistopic <https://pycistopic.readthedocs.io/en/latest/>`_, DARs were calculated per cell type and used as cell type representation. These sets of regions were then used as input to the model, where 500bp one-hot encoded DNA sequences were used to predict the cell type(s) to which the regions belong.
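
As a rough illustration of what "500bp one-hot encoded DNA sequences" means, the sketch below turns a DNA string into a ``(500, 4)`` nested list. The helper ``one_hot_encode`` is hypothetical and not part of CREsted, and the A, C, G, T channel order and the all-zero encoding of unknown bases are assumptions.

```python
ALPHABET = "ACGT"  # assumed channel order


def one_hot_encode(sequence):
    """Encode a DNA string as rows of [A, C, G, T]; unknown bases become all zeros."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


encoded = one_hot_encode("ACGTN" * 100)
print(len(encoded), len(encoded[0]))  # 500 4
```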

The model is a CNN multiclass classifier which uses the :func:`~crested.tl.zoo.deeptopic_cnn` architecture.

Details of the data and the model can be found in the original publication.

-------------------

.. admonition:: Citation

Hecker, N., Kempynck, N. et al. Enhancer-driven cell type comparison reveals similarities between the mammalian and bird pallium. bioRxiv (2024). https://doi.org/10.1101/2024.04.17.589795

Usage
-------------------

.. code-block:: python
:linenos:

import crested
import keras

# download model
model_path, output_names = crested.get_model("DeepChickenBrain1")

# load model
model = keras.models.load_model(model_path)

# make predictions
sequence = "A" * 500
predictions = crested.tl.predict(sequence, model)
print(predictions.shape)
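
The ``output_names`` returned by ``crested.get_model`` are not used in the snippet above. A minimal sketch of pairing one row of predictions with those names follows, assuming each row holds one score per output name in the same order; the helper ``top_outputs`` and the toy names and scores are hypothetical.

```python
def top_outputs(prediction_row, output_names, k=3):
    """Return the k (name, score) pairs with the highest scores."""
    if len(prediction_row) != len(output_names):
        raise ValueError("expected one score per output name")
    ranked = sorted(zip(output_names, prediction_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# toy values standing in for one row of model predictions
names = ["class_a", "class_b", "class_c"]
scores = [0.1, 0.7, 0.2]
print(top_outputs(scores, names, k=2))  # [('class_b', 0.7), ('class_c', 0.2)]
```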
48 changes: 48 additions & 0 deletions docs/models/deepchickenbrain2.rst
@@ -0,0 +1,48 @@
DeepChickenBrain2
=================

.. sidebar:: Model Features

- **Genome**: *galGal6*
- **Type**: Peak Regression
- **Parameters**: 6.3M
- **Size**: 23MB
- **Input shape**: (2114, 4)
- **Output shape**: (20,)


The **DeepChickenBrain2** model is a peak regression model fine-tuned to cell type-specific regions for cell types in the chicken telencephalon.

After pretraining on all consensus peaks, the model was fine-tuned to specific peaks obtained with the :func:`~crested.pp.filter_regions_on_specificity` function. These sets of regions were then used as input to the model, where 2114 bp one-hot encoded DNA sequences were used to predict, per cell type, the mean peak accessibility over the center 1000 bp of the peak.
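
To make the regression target concrete, here is a small sketch of taking the mean over the center 1000 bp of a 2114 bp window. It assumes a per-base coverage track for one cell type is available as a list; the helper ``center_mean`` and the toy track are hypothetical and not part of CREsted.

```python
def center_mean(coverage, center_width=1000):
    """Mean of the central `center_width` positions of a per-base coverage track."""
    if center_width > len(coverage):
        raise ValueError("center_width exceeds the track length")
    start = (len(coverage) - center_width) // 2
    window = coverage[start:start + center_width]
    return sum(window) / center_width


# toy track: 2114 bp with a signal of 2.0 in the central 1000 bp and 0 elsewhere
track = [0.0] * 557 + [2.0] * 1000 + [0.0] * 557
print(center_mean(track))  # 2.0
```

The flanking 557 bp on each side give the model sequence context but do not contribute to the target value in this sketch.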

Peak heights were normalized across cell types with the :func:`~crested.pp.normalize_peaks` function.

The model is a CNN multiclass regression model that uses the :func:`~crested.tl.zoo.chrombpnet` architecture.

Details of the data and the model can be found in the original publication.

-------------------

.. admonition:: Citation

Hecker, N., Kempynck, N. et al. Enhancer-driven cell type comparison reveals similarities between the mammalian and bird pallium. bioRxiv (2024). https://doi.org/10.1101/2024.04.17.589795

Usage
-------------------

.. code-block:: python
:linenos:

import crested
import keras

# download model
model_path, output_names = crested.get_model("DeepChickenBrain2")

# load model
model = keras.models.load_model(model_path)

# make predictions
sequence = "A" * 2114  # length must match the model's input shape (2114, 4)
predictions = crested.tl.predict(sequence, model)
print(predictions.shape)
53 changes: 53 additions & 0 deletions docs/models/deepflybrain.rst
@@ -0,0 +1,53 @@
DeepFlyBrain
============

.. sidebar:: Model Features

- **Genome**: *dm6*
- **Type**: Topic Classification
- **Parameters**: 3.2M
- **Size**: 12MB
- **Input shape**: (500, 4)
- **Output shape**: (81,)

The **DeepFlyBrain** model is a topic classification model trained on KCs, T-Neurons, and Glia cells from the adult fly brain (17K cells total).

Using `pycistopic <https://pycistopic.readthedocs.io/en/latest/>`_, binarized topics per region were extracted for 81 target topics. These sets of regions were then used as input for a DL model, where 500bp one-hot encoded (ACGT) DNA sequences were used to predict the topic set to which the region belongs.

The model is a hybrid CNN-RNN multiclass classifier which is very similar to :func:`~crested.tl.zoo.deeptopic_lstm`, with the addition of a reverse complement layer as the first layer of the model.
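
The operation behind a reverse complement layer can be sketched on a one-hot encoded sequence. This is not the layer used in the model, only the underlying transformation, and it assumes channels ordered A, C, G, T; the function name is hypothetical.

```python
def reverse_complement(one_hot):
    """Reverse complement of a one-hot sequence with channels ordered A, C, G, T.

    Reversing the rows reverses the sequence; reversing each row swaps
    A with T and C with G.
    """
    return [row[::-1] for row in reversed(one_hot)]


# "AAC" has the reverse complement "GTT"
aac = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]]
gtt = [[0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 1]]
print(reverse_complement(aac) == gtt)  # True
```

Because regulatory sequences can be read from either strand, feeding both orientations through the same filters is a common way to make predictions strand-independent.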

Details of the data and model can be found in the original publication.

-------------------

.. admonition:: Citation

Janssens, J., Aibar, S., Taskiran, I.I. et al. Decoding gene regulation in the fly brain. Nature 601, 630–636 (2022). https://doi.org/10.1038/s41586-021-04262-z

Usage
-------------------

.. code-block:: python
:linenos:

import crested
import keras

# download model
model_path, output_names = crested.get_model("DeepFlyBrain")

# load model
model = keras.models.load_model(model_path)

# make predictions
sequence = "A" * 500
predictions = crested.tl.predict(sequence, model)
print(predictions.shape)

-------------------

.. warning::

DeepFlyBrain was originally trained using TensorFlow 1 as the backend.
Even though the model architecture and weights are exactly the same, there will be slight differences in the output compared to the original model due to backend changes between TensorFlow 1 and 2.
Overall, the correlation between the original and the Keras 3 model is very high (0.99+), but if you want the exact same outputs and contribution plots as in the original publication, you should use an older, compatible environment, which you can find in `kipoi <https://kipoi.org/models/DeepFlyBrain/>`_.
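
If you have predictions from both environments for the same sequences, a correlation check along these lines is one way to compare them. This sketch assumes the metric is the Pearson correlation over flattened predictions; the numbers below are made-up placeholders, not real model outputs.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / norm


# hypothetical flattened predictions from the original and the Keras 3 model
original_preds = [0.10, 0.80, 0.05, 0.60]
keras3_preds = [0.11, 0.79, 0.06, 0.61]
print(round(pearson(original_preds, keras3_preds), 3))
```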
48 changes: 48 additions & 0 deletions docs/models/deephumanbrain.rst
@@ -0,0 +1,48 @@
DeepHumanBrain
==============

.. sidebar:: Model Features

- **Genome**: *hg38*
- **Type**: Peak Regression
- **Parameters**: 25.3M
- **Size**: 91MB
- **Input shape**: (2114, 4)
- **Output shape**: (76,)


The **DeepHumanBrain** model is a peak regression model fine-tuned to cell type-specific regions for cell types in the whole human brain. The dataset was obtained from Li et al., 2023 (Science).

After pretraining on all consensus peaks, the model was fine-tuned to specific peaks obtained with the :func:`~crested.pp.filter_regions_on_specificity` function. These sets of regions were then used as input to the model, where 2114 bp one-hot encoded DNA sequences were used to predict, per cell type, the mean peak accessibility over the center 1000 bp of the peak.

Peak heights were normalized across cell types with the :func:`~crested.pp.normalize_peaks` function.

The model is a CNN multiclass regression model that uses the :func:`~crested.tl.zoo.chrombpnet` architecture. It has 1024 convolutional filters per layer instead of the default 512.

Details of the data and the model can be found in the original publication.

-------------------

.. admonition:: Citation

Hecker, N., Kempynck, N. et al. Enhancer-driven cell type comparison reveals similarities between the mammalian and bird pallium. bioRxiv (2024). https://doi.org/10.1101/2024.04.17.589795

Usage
-------------------

.. code-block:: python
:linenos:

import crested
import keras

# download model
model_path, output_names = crested.get_model("DeepHumanBrain")

# load model
model = keras.models.load_model(model_path)

# make predictions
sequence = "A" * 2114  # length must match the model's input shape (2114, 4)
predictions = crested.tl.predict(sequence, model)
print(predictions.shape)
46 changes: 46 additions & 0 deletions docs/models/deephumancortex1.rst
@@ -0,0 +1,46 @@
DeepHumanCortex1
================

.. sidebar:: Model Features

- **Genome**: *hg38*
- **Type**: Topic Classification
- **Parameters**: 11.1M
- **Size**: 37MB
- **Input shape**: (500, 4)
- **Output shape**: (13,)


The **DeepHumanCortex1** model is a topic classification model, fine-tuned with differentially accessible regions (DARs) to make cell type level predictions for cell types in the human motor cortex. The dataset was obtained from Ma et al., 2022 (Science).

After pretraining on topics, obtained through `pycistopic <https://pycistopic.readthedocs.io/en/latest/>`_, DARs were calculated per cell type and used as cell type representation. These sets of regions were then used as input to the model, where 500bp one-hot encoded DNA sequences were used to predict the cell type(s) to which the regions belong.

The model is a CNN multiclass classifier which uses the :func:`~crested.tl.zoo.deeptopic_cnn` architecture.

Details of the data and the model can be found in the original publication.

-------------------

.. admonition:: Citation

Hecker, N., Kempynck, N. et al. Enhancer-driven cell type comparison reveals similarities between the mammalian and bird pallium. bioRxiv (2024). https://doi.org/10.1101/2024.04.17.589795

Usage
-------------------

.. code-block:: python
:linenos:

import crested
import keras

# download model
model_path, output_names = crested.get_model("DeepHumanCortex1")

# load model
model = keras.models.load_model(model_path)

# make predictions
sequence = "A" * 500
predictions = crested.tl.predict(sequence, model)
print(predictions.shape)
46 changes: 46 additions & 0 deletions docs/models/deephumancortex2.rst
@@ -0,0 +1,46 @@
DeepHumanCortex2
================

.. sidebar:: Model Features

- **Genome**: *hg38*
- **Type**: Topic Classification
- **Parameters**: 13.9M
- **Size**: 47MB
- **Input shape**: (500, 4)
- **Output shape**: (14,)


The **DeepHumanCortex2** model is a topic classification model, fine-tuned with differentially accessible regions (DARs) to make cell type level predictions for cell types in the human motor cortex. The dataset was obtained from Bakken et al., 2021 (Science).

After pretraining on topics, obtained through `pycistopic <https://pycistopic.readthedocs.io/en/latest/>`_, DARs were calculated per cell type and used as cell type representation. These sets of regions were then used as input to the model, where 500bp one-hot encoded DNA sequences were used to predict the cell type(s) to which the regions belong.

The model is a CNN multiclass classifier which uses the :func:`~crested.tl.zoo.deeptopic_cnn` architecture.

Details of the data and the model can be found in the original publication.

-------------------

.. admonition:: Citation

Hecker, N., Kempynck, N. et al. Enhancer-driven cell type comparison reveals similarities between the mammalian and bird pallium. bioRxiv (2024). https://doi.org/10.1101/2024.04.17.589795

Usage
-------------------

.. code-block:: python
:linenos:

import crested
import keras

# download model
model_path, output_names = crested.get_model("DeepHumanCortex2")

# load model
model = keras.models.load_model(model_path)

# make predictions
sequence = "A" * 500
predictions = crested.tl.predict(sequence, model)
print(predictions.shape)