In this work, we construct Weighted Cell-Specific Networks (WCSNs) from highly variable genes, capturing both gene expression patterns and gene-gene interaction strengths. A graph neural network then extracts features from each WCSN, enabling accurate cell type annotation. We term our model WCSGNet.
- ubuntu18.04
- RTX 4080 SUPER (16G)
Requirements | Release |
---|---|
CUDA | 12.7 |
Python | 3.8.18 |
torch | 1.12.1 |
torch_geometric | 2.5.1 |
numpy | 1.24.1 |
scikit-learn | 1.3.1 |
tqdm | 4.66.2 |
pandas | 2.0.3 |
matplotlib | 3.7.5 |
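To compare a local environment against the table above, a small check along the following lines may help. It is only a sketch: the function name `check_versions` is hypothetical, and the dictionary keys are assumed to match the installed distribution names.

```python
from importlib import metadata

# Versions copied from the requirements table above.
# Keys are assumed to be the installed distribution names.
EXPECTED = {
    "torch": "1.12.1",
    "torch_geometric": "2.5.1",
    "numpy": "1.24.1",
    "scikit-learn": "1.3.1",
    "tqdm": "4.66.2",
    "pandas": "2.0.3",
    "matplotlib": "3.7.5",
}


def check_versions(expected=EXPECTED):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Local build tags such as "1.12.1+cu113" are compared without the suffix.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

An empty result means every listed package is installed at the listed version; other versions may still work, but they are not the ones this table reports.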
This folder stores the code files.
-
DataPreprocessing
Jupyter notebooks for preprocessing individual datasets. The processed data generated by these notebooks is saved in the
dataset/pre_data/scRNAseq_datasets
directory.
data_to_csv.ipynb
: Produces the gene expression matrix after cell filtering and highly variable gene selection, along with the data splits for five-fold cross-validation, and saves them as CSV files for network construction with the WGCNA and PCA-PMI methods. The processed data is saved in the
dataset/pre_data/scRNAseq_datasets_hvgs
directory. -
Figures
This folder contains code for drawing.
-
data_partitioning.py
This file stores the five-fold cross-validation splits for the corresponding dataset. The generated files will be stored in the
dataset/5fold_data
folder. -
gene_filter.py
This file stores the indices of the highly variable genes (e.g., 2000) in the gene expression matrix for each dataset.
-
up_sample.py
This step performs up-sampling on cell types with fewer cells in the training set and generates the cell indices of the up-sampled training set. The result is saved as
*_train_index_imputed.npy
in the corresponding scRNA-seq dataset directory under
dataset/5fold_data/
. -
wcsn_constr_train.py
This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold training set.
-
wcsn_constr_test.py
This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold testing set.
-
model.py
This file contains the code for the WCSGNet model.
-
datasets_wcsn.py
This file defines a custom PyTorch Geometric dataset class
MyDataset
, which is designed to handle graph data. It reads the WCSN data directly; no additional processing of the WCSN is required. -
datasets_LT.py
This file defines a custom PyTorch Geometric dataset class
MyDataset2
, which is designed to handle graph data. Its main functionalities include applying a logarithmic transformation to the WCSN weights. -
datasets_BT.py
This file defines a custom PyTorch Geometric dataset class
MyDataset2
, which is designed to handle graph data. Its main function is to apply a binary transformation that assigns a weight of 1 to every edge. -
wcsn_classify_train.py
This step generates the 5-fold training set models using WCSN and saves them in
result/models
. -
wcsn_classify_test.py
This step generates the predicted results for the testing sets using WCSN and saves them in
result/preds
. -
LT_wcsn_classify_train.py
This step generates the 5-fold training set models using WCSN(logarithmic transformation) and saves them in
result/models_LT
. -
wgcna_classify_train.py
This step generates the 5-fold training set models using WCSN(WGCNA) and saves them in
result/wgcna_models
. -
grnboost2_classify_train.py
This step generates the 5-fold training set models using WCSN(GRNBoost2) and saves them in
result/grnboost2_models
. -
pmi_classify_train.py
This step generates the 5-fold training set models using WCSN(PCA-PMI) and saves them in
result/pca_pmi_models
. -
LT_wcsn_classify_test.py
This step generates the predicted results for the testing sets using WCSN(logarithmic transformation) and saves them in
result/preds_LT
. -
BT_wcsn_classify_train.py
This step generates the 5-fold training set models using WCSN(binary transformation) and saves them in
result/models_BT
. -
BT_wcsn_classify_test.py
This step generates the predicted results for the testing sets using WCSN(binary transformation) and saves them in
result/preds_BT
. -
wgcna_classify_test.py
This step generates the predicted results for the testing sets using WCSN(WGCNA) and saves them in
result/wgcna_preds
. -
grnboost2_classify_test.py
This step generates the predicted results for the testing sets using WCSN(GRNBoost2) and saves them in
result/grnboost2_preds
. -
pmi_classify_test.py
This step generates the predicted results for the testing sets using WCSN(PCA-PMI) and saves them in
result/pca_pmi_preds
.
Storage of Downloaded Raw Data
-
scRNAseq_Benchmark_datasets
The downloaded scRNA-seq datasets include: Muraro, Segerstolpe, Zheng 68k, Zhang_T, Kang, Baron, AMB, and TM.
Storing preprocessed data, five-fold splits of the running data, Entrez Gene IDs of genes, selected highly variable genes, generated high-confidence interaction subnetworks, WCSN data, etc.
-
pre_data
scRNAseq_datasets: Preprocessed scRNA-seq datasets generated using the
.ipynb
files located in the src/DataPreprocessing
directory.
scRNAseq_datasets_hvgs: Preprocessed scRNA-seq datasets after cell filtering and highly variable gene selection, along with the data splits for five-fold cross-validation. Generated by
src/DataPreprocessing/data_to_csv.ipynb
-
5fold_data
Stores the data generated during the processing of each scRNA-seq dataset. This includes the five-fold splits of the dataset, the filtered list of highly variable genes, the indices obtained from up-sampling the training set, and the WCSNs generated for each fold of the training and testing sets.
-
Figures
Stores the result figures.
-
models
This folder contains the trained models and the models obtained from each fold of the cross validations by
src/wcsn_classify_train.py
-
models_LT
This folder contains the trained models and the models obtained from each fold of the cross validations by
src/LT_wcsn_classify_train.py
-
models_BT
This folder contains the trained models and the models obtained from each fold of the cross validations by
src/BT_wcsn_classify_train.py
-
wgcna_models
This folder contains the trained models and the models obtained from each fold of the cross validations by
src/wgcna_classify_train.py
-
grnboost2_models
This folder contains the trained models and the models obtained from each fold of the cross validations by
src/grnboost2_classify_train.py
-
pca_pmi_models
This folder contains the trained models and the models obtained from each fold of the cross validations by
src/pmi_classify_train.py
-
preds
This folder contains the predicted results generated by
src/wcsn_classify_test.py
-
preds_LT
This folder contains the predicted results generated by
src/LT_wcsn_classify_test.py
. -
preds_BT
This folder contains the predicted results generated by
src/BT_wcsn_classify_test.py
. -
wgcna_preds
This folder contains the predicted results generated by
src/wgcna_classify_test.py
. -
grnboost2_preds
This folder contains the predicted results generated by
src/grnboost2_classify_test.py
. -
pca_pmi_preds
This folder contains the predicted results generated by
src/pmi_classify_test.py
.
Muraro, Segerstolpe, Zheng 68k, Baron, AMB, and TM: Available for direct download from Zenodo.
Zhang T: Accessible via GEO under accession number GSE108989.
Kang: Accessible via GEO under accession number GSE96583.
Save the downloaded files to the data/scRNAseq_Benchmark_datasets
directory.
-
Initial preprocessing
The scRNA-seq datasets downloaded in section 3.1.1 of the GitHub project require initial preprocessing using the
.ipynb
files in the src/DataPreprocessing
directory. The preprocessing steps include:
- Filtering out cell types with fewer than 10 cells and cells with unclear annotations.
- Filtering out genes expressed in fewer than 10 cells.
After preprocessing, the resulting data should be saved in the
dataset/pre_data/scRNAseq_datasets
directory.
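The two count-based filters above can be sketched in plain Python as follows. This is an illustration rather than the notebooks' actual code: the function name `filter_cells_and_genes`, the order of the two steps, and the "expressed means count > 0" rule are assumptions, and the removal of cells with unclear annotations is dataset-specific and not shown.

```python
from collections import Counter

MIN_CELLS = 10  # threshold stated in the preprocessing steps


def filter_cells_and_genes(expr, labels, min_cells=MIN_CELLS):
    """expr: list of per-cell rows (cells x genes); labels: one cell type per row.

    Returns (filtered_expr, filtered_labels, kept_gene_indices).
    """
    # 1. Drop cells whose cell type has fewer than `min_cells` members.
    type_counts = Counter(labels)
    keep_cells = [i for i, lab in enumerate(labels) if type_counts[lab] >= min_cells]

    # 2. Among the remaining cells, keep genes expressed (> 0) in at least `min_cells` cells.
    n_genes = len(expr[0]) if expr else 0
    keep_genes = [
        g for g in range(n_genes)
        if sum(1 for i in keep_cells if expr[i][g] > 0) >= min_cells
    ]

    filtered = [[expr[i][g] for g in keep_genes] for i in keep_cells]
    return filtered, [labels[i] for i in keep_cells], keep_genes
```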
-
Five-Fold Cross-Validation Splits
! python src/data_partitioning.py
Optional parameters
- -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz', Specify the scRNA-seq dataset.
- -outdir: Default='dataset/5fold_data/', Specify the output directory.
- --n_splits: Default=5, Indicates Five-fold cross-validation.
This step generates a seq_dict.npz
file for each dataset located in the dataset/5fold_data/
directory. These files are used to store the five-fold cross-validation splits for the corresponding dataset, ensuring consistent and reproducible training and evaluation.
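For readers who want a feel for what such a split contains, here is a minimal standard-library sketch of a stratified five-fold partition. Whether data_partitioning.py stratifies by cell type, and how it seeds the shuffle, is not stated here, so both are assumptions; the function name `stratified_kfold` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=0):
    """Return a list of (train_idx, test_idx) pairs, one per fold."""
    rng = random.Random(seed)  # fixed seed so the same splits are regenerated
    by_type = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_type[lab].append(idx)

    folds = [[] for _ in range(n_splits)]
    for lab in sorted(by_type):
        idxs = by_type[lab]
        rng.shuffle(idxs)
        # Deal the shuffled cells of this type round-robin across the folds.
        for pos, idx in enumerate(idxs):
            folds[pos % n_splits].append(idx)

    splits = []
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        splits.append((train_idx, test_idx))
    return splits
```

Each cell appears in exactly one test fold, and every fold receives a near-equal share of each cell type.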
-
Up-sampling
! python src/up_sample.py
Optional parameters
-
-expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz', Specify the scRNA-seq dataset.
-
-outdir: Default='dataset/5fold_data/', Specify the output directory.
-
--n_splits: Default=5, Indicates Five-fold cross-validation.
This step performs up-sampling on cell types with fewer cells in the training set and generates the cell indices of the up-sampled training set. The result is saved as
*_train_index_imputed.npy
in the corresponding scRNA-seq dataset directory under
dataset/5fold_data/
. -
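The idea of producing up-sampled training indices can be illustrated as follows. The exact rule used by up_sample.py is not described here, so this sketch assumes random sampling with replacement until each cell type reaches a hypothetical minimum count `min_per_type`; the function name is also hypothetical.

```python
import random
from collections import defaultdict


def upsample_indices(train_idx, labels, min_per_type, seed=0):
    """Return train indices in which every cell type has at least `min_per_type` entries."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for idx in train_idx:
        by_type[labels[idx]].append(idx)

    result = list(train_idx)  # original cells are always kept
    for lab in sorted(by_type):
        idxs = by_type[lab]
        shortfall = min_per_type - len(idxs)
        if shortfall > 0:
            # Sample with replacement from the existing cells of this rare type.
            result.extend(rng.choices(idxs, k=shortfall))
    return result
```

Only indices are duplicated, so the expression matrix itself stays unchanged, which matches the index-file output described above.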
-
Selection of highly variable genes (HVGs)
! python src/gene_filter.py
Optional parameters
-
-expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz', Specify the scRNA-seq dataset.
-
-outdir: Default='dataset/5fold_data/', Specify the output directory.
-
-hvgs: Default=2000, Specify the number of HVGs.
This step generates a
.npy
file for each dataset, containing 2000 HVGs. The file stores the indices of the highly variable genes in the gene expression matrix. -
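As an illustration of what "indices of the highly variable genes" means, the sketch below ranks genes by plain variance and returns the column indices of the top ones. The selection criterion used by gene_filter.py is not specified here, so variance ranking is a stand-in assumption, and `top_variable_gene_indices` is a hypothetical name.

```python
from statistics import pvariance


def top_variable_gene_indices(expr, n_hvgs=2000):
    """expr: cells x genes as a list of rows. Returns sorted indices of the top-variance genes."""
    n_genes = len(expr[0])
    variances = [pvariance([row[g] for row in expr]) for g in range(n_genes)]
    # Rank genes by variance (descending); ties keep the lower index first.
    ranked = sorted(range(n_genes), key=lambda g: (-variances[g], g))
    return sorted(ranked[:n_hvgs])
```

The returned indices can be used to slice the columns of the expression matrix before network construction.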
- WGCNA: R package "WGCNA"
- PCA-PMI: https://github.com/Pantrick/PCA-PMI
- GRNBoost2: http://arboreto.readthedocs.io.
! python src/wcsn_constr_train.py
Optional parameters
- -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
- -outdir: Default='dataset/5fold_data/'. Specify the output directory.
- -cuda: Default=True.
- -hvgs: Default=2000, The number of HVGs.
- -ca: Default=0.01, Significance level.
- --n_splits: Default=5, Indicates Five-fold cross-validation.
This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold training set. The graph for each training set cell is saved as a .pt
file in the processed
folder of the corresponding fold (e.g., train_f1
) within the WCSN_a0.01_hvgs2000
folder, which is located under the corresponding dataset folder in dataset/5fold_data/
.
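The output location described above can be assembled from the -ca and -hvgs values. The helper below is a hypothetical sketch: the name `wcsn_processed_dir` is invented, the dataset folder is assumed to be named after the dataset (e.g., Muraro), and 1-based fold numbering is inferred from the train_f1 example.

```python
from pathlib import PurePosixPath


def wcsn_processed_dir(dataset, fold, split="train", ca=0.01, hvgs=2000,
                       outdir="dataset/5fold_data/"):
    """Build the folder expected to hold the per-cell .pt graphs for one fold."""
    net_dir = f"WCSN_a{ca}_hvgs{hvgs}"   # e.g. WCSN_a0.01_hvgs2000
    fold_dir = f"{split}_f{fold}"        # e.g. train_f1 or test_f1
    return PurePosixPath(outdir) / dataset / net_dir / fold_dir / "processed"


print(wcsn_processed_dir("Muraro", 1))
```

Passing `split="test"` gives the corresponding testing-set folder.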
! python src/wcsn_constr_test.py
Optional parameters
- -expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
- -outdir: Default='dataset/5fold_data/'. Specify the output directory.
- -cuda: Default=True.
- -hvgs: Default=2000, The number of HVGs.
- -ca: Default=0.01, Significance level.
- --n_splits: Default=5, Indicates Five-fold cross-validation.
This step constructs WCSNs based on highly variable genes for each scRNA-seq dataset's 5-fold testing set. The graph for each testing set cell is saved as a .pt
file in the processed
folder of the corresponding fold (e.g., test_f1
) within the WCSN_a0.01_hvgs2000
folder, which is located under the corresponding dataset folder in dataset/5fold_data/
.
Training
! python src/wcsn_classify_train.py
Optional parameters
-
-expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
-
-outdir: Default='result/models'. Specify the output directory.
-
-ca: Default=0.01, Significance level.
-
-hvgs: Default=2000, The number of HVGs.
-
-bs: Default=32, The batch size of this training.
This step generates the 5-fold training set models using WCSN and saves them in result/models
.
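To make the flag defaults above concrete, the following sketch shows how such options could be declared with argparse. It mirrors only the names and defaults listed in this README; the real script may define them differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the training options listed in this README."""
    p = argparse.ArgumentParser(description="Train WCSGNet on WCSNs (sketch)")
    p.add_argument("-expr", default="dataset/pre_data/scRNAseq_datasets/Muraro.npz",
                   help="input scRNA-seq dataset")
    p.add_argument("-outdir", default="result/models", help="output directory")
    p.add_argument("-ca", type=float, default=0.01, help="significance level")
    p.add_argument("-hvgs", type=int, default=2000, help="number of HVGs")
    p.add_argument("-bs", type=int, default=32, help="batch size")
    return p


# Equivalent to: python src/wcsn_classify_train.py -hvgs 1000 -bs 64
args = build_parser().parse_args(["-hvgs", "1000", "-bs", "64"])
print(args.expr, args.outdir, args.ca, args.hvgs, args.bs)
```

The -ca and -hvgs values should match those used when the WCSNs were constructed, since they determine which WCSN folder is read.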
Testing
! python src/wcsn_classify_test.py
Optional parameters
-
-expr: Default='dataset/pre_data/scRNAseq_datasets/Muraro.npz'. Specify the input scRNA-seq dataset.
-
-outdir: Default='result/models'. Specify the output directory.
-
-ca: Default=0.01, Significance level.
-
-hvgs: Default=2000, The number of HVGs.
-
-bs: Default=32.
This step generates the predicted results for the testing sets and saves them in result/preds
.
The results include:
*_Prediction.h5
: Contains the true labels and predicted labels for the test set cells, the probability matrix for each predicted cell type, and the cell embeddings for each cell.
*_F1.csv
: Includes the accuracy, Mean F1-Score, and the F1-Score for each cell type.
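The metrics reported in the F1 file can be recomputed from the true and predicted labels. The sketch below assumes that "Mean F1-Score" is the unweighted mean of the per-cell-type F1 scores over the cell types present in the true labels; the function name `score_predictions` is hypothetical.

```python
def score_predictions(y_true, y_pred):
    """Return (accuracy, mean_f1, {cell_type: f1}) for two equal-length label lists."""
    assert len(y_true) == len(y_pred) and y_true, "need non-empty, aligned labels"
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    per_type = {}
    for lab in sorted(set(y_true)):
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        per_type[lab] = 2 * tp / denom if denom else 0.0

    mean_f1 = sum(per_type.values()) / len(per_type)
    return accuracy, mean_f1, per_type
```

Because the mean is unweighted, rare cell types influence it as much as abundant ones, which is why it can differ noticeably from accuracy.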
Training
! python src/LT_wcsn_classify_train.py
The optional parameters are mostly the same as those in wcsn_classify_train.py.
This step generates the 5-fold training set models using WCSN(logarithmic transformation) and saves them in result/models_LT
.
Testing
! python src/LT_wcsn_classify_test.py
The optional parameters are mostly the same as those in wcsn_classify_test.py.
This step generates the predicted results for the testing sets using WCSN(logarithmic transformation) and saves them in result/preds_LT
. The results include *_Prediction.h5
and *_F1.csv
.
Training
! python src/BT_wcsn_classify_train.py
The optional parameters are mostly the same as those in wcsn_classify_train.py.
This step generates the 5-fold training set models using WCSN(binary transformation) and saves them in result/models_BT
.
Testing
! python src/BT_wcsn_classify_test.py
The optional parameters are mostly the same as those in wcsn_classify_test.py.
This step generates the predicted results for the testing sets using WCSN(binary transformation) and saves them in result/preds_BT
. The results include *_Prediction.h5
and *_F1.csv
.
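The logarithmic (LT) and binary (BT) variants change how WCSN edge weights are presented to the model. A minimal sketch of the two transformations follows; the exact logarithm used in datasets_LT.py is not given here, so log(1 + w) is an assumption, and both function names are hypothetical.

```python
import math


def log_transform(weights):
    """Compress edge weights with log(1 + w); assumes non-negative weights."""
    return [math.log1p(w) for w in weights]


def binary_transform(weights):
    """Assign weight 1 to every existing edge, discarding the original strength."""
    return [1.0 for _ in weights]
```

The binary form keeps only the network topology, while the logarithmic form keeps the ordering of edge strengths but narrows their range.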
Network Construction
Rscript src/WGCNA.R
Training
! python src/wgcna_classify_train.py
The optional parameters are mostly the same as those in wcsn_classify_train.py.
different optional parameters
- -netname: default='wgcna', network construction method
This step generates the 5-fold training set models using WCSN(WGCNA) and saves them in result/wgcna_models
.
Testing
! python src/wgcna_classify_test.py
The optional parameters are mostly the same as those in wgcna_classify_train.py.
This step generates the predicted results for the testing sets using WCSN(WGCNA) and saves them in result/wgcna_preds
. The results include *_Prediction.h5
and *_F1.csv
.
Network Construction
src/GRNBoost2.ipynb
Training
! python src/grnboost2_classify_train.py
The optional parameters are mostly the same as those in wcsn_classify_train.py.
different optional parameters
- -netname: default='grnboost2', network construction method
This step generates the 5-fold training set models using WCSN(GRNBoost2) and saves them in result/grnboost2_models
.
Testing
! python src/grnboost2_classify_test.py
The optional parameters are mostly the same as those in grnboost2_classify_train.py.
This step generates the predicted results for the testing sets using WCSN(GRNBoost2) and saves them in result/grnboost2_preds
. The results include *_Prediction.h5
and *_F1.csv
.
Network Construction
src/pmi_run.m
Training
! python src/pmi_classify_train.py
The optional parameters are mostly the same as those in wcsn_classify_train.py.
different optional parameters
- -netname: default='pca_pmi', network construction method
This step generates the 5-fold training set models using WCSN(PCA-PMI) and saves them in result/pca_pmi_models
.
Testing
! python src/pmi_classify_test.py
The optional parameters are mostly the same as those in pmi_classify_train.py.
This step generates the predicted results for the testing sets using WCSN(PCA-PMI) and saves them in result/pca_pmi_preds
. The results include *_Prediction.h5
and *_F1.csv
.
All plotting code is in
src/Figures/
-
Figure 2
src/Figures/Figure2.ipynb
Sankey diagram of the different datasets under WCSGNet's 5-fold cross-validation.
-
Figure 3
src/Figures/Figure-rare-cell-type.ipynb
Comparison of rare cell type identification performance across nine scRNA-seq datasets. Each panel presents a bar chart showing the mean F1-score for WCSGNet and eight baseline methods.
-
Figure 4
src/Figures/Figure-diff-net.py
Comparison of WCSGNet performance using different gene association networks across nine scRNA-seq datasets.
-
Figure 5(A) and (B)
src/Figures/Figure5AB.ipynb
Performance of WCSGNet with different edge weight representation methods, including the original method, logarithmic transformation, and binary transformation.
-
Figure 6(A-N)
BaronHuman_analysis.ipynb
src/Figures/R/Figure-hub-genes.R
Top degree gene analysis of WCSN for different cell types on the Baron Human dataset
-
Figure 7(A-N)
BaronHuman_analysis.ipynb
src/Figures/R/Figure-high-weight.R
Top high-weight edges analysis of WCSNs for different cell types in the Baron Human dataset.
-
Figure 8(A-N)
src/Figures/Figure-tsne.py
t-SNE visualization and feature analysis of the Baron Human dataset using WCSGNet.
-
Figure 9(A-H)
src/Figures/AMB_analysis.ipynb
src/Figures/R/Figure-AMB-gene.R
src/Figures/R/Figure-AMB-edge.R
Analysis of top-degree genes and high-weight edges in WCSNs for different cell types on the AMB dataset.
-
Figure S1
src/Figures/Figure-log.py
Distribution of edge weights before and after logarithmic transformation for the training sets in five-fold cross-validation across all datasets including Zhang T, Kang, Zheng 68k, Baron Human, Muraro, Segerstolpe, AMB, TM and Baron Mouse.
The following factors may result in slight differences in the Mean F1-score and Accuracy for cell type classification when reproducing the results, compared to those reported in the paper.
- The DataLoader applies a shuffle operation on the training dataset during model training, leading to some randomness in the input sequence of the training data.
- The use of the Dropout mechanism in the model introduces variability in the trained models across different runs.
- Parameter initialization also produces some randomness.
However, these differences do not have a disruptive impact on the conclusions of the paper.
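Readers who want to narrow this run-to-run variation can try fixing the random seeds before training. The helper below is a hedged sketch and not part of the repository: `seed_everything` is a hypothetical name, and seeding alone may not remove all nondeterminism (for example, some GPU kernels are not deterministic).

```python
import random


def seed_everything(seed=0):
    """Seed the Python, NumPy and PyTorch RNGs that are importable in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)           # affects parameter initialization and dropout
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    except ImportError:
        pass  # PyTorch not installed; nothing to seed
```

Calling `seed_everything(0)` before the model and the DataLoader are created covers the three sources listed above: shuffling, dropout, and parameter initialization.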