Generalizing dataloader and loading multiple species #88

Draft: mtvector wants to merge 3 commits into main

Conversation

@mtvector commented on Jan 2, 2025

Hi all,

I wanted to start this pull request as a discussion about some extensions to CREsted that I've been considering and need, and that I've built out a preliminary version of in my fork. I haven't written tests yet, and the changes here almost certainly break other parts of CREsted that I haven't checked.

I'm working on building models that train on data from multiple species and that also take additional information, like gene expression vectors, as inputs.

To support this, my fork includes the following changes:

  • Altering AnnDataSet and AnnDataLoader so that you can load multiple fields from obs, obsm, var, and varp and deliver them to your trainer as a dict of tensors.
  • Adding a MetaAnnDataset and MetaSampler so that you can collate AnnData objects from multiple species and sample from them randomly in minibatches. (A rough illustrative sketch of both ideas follows this list.)
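
To make that concrete, here is a minimal, hypothetical sketch of the two ideas written against plain PyTorch. This is not the code in my fork: DictAnnDataset and SimpleMetaSampler are illustrative names, the real AnnDataSet also handles one-hot sequence encoding, stochastic shifting, and reverse complementing, and details like size-weighted dataset selection are assumptions for the sketch only.

# Hypothetical sketch of the two ideas above (illustration only, not the fork's code).
import numpy as np
import torch
from torch.utils.data import Dataset, Sampler


class DictAnnDataset(Dataset):
    """Return each region as a dict of tensors, including extra obs/obsm fields."""

    def __init__(self, adata, obs_columns=(), obsm_keys=()):
        self.adata = adata
        self.obs_columns = list(obs_columns)
        self.obsm_keys = list(obsm_keys)

    def __len__(self):
        # In CREsted, regions live in var and cell types in obs.
        return self.adata.n_vars

    def __getitem__(self, region_idx):
        targets = self.adata.X[:, region_idx]
        if hasattr(targets, "toarray"):  # handle sparse X
            targets = targets.toarray()
        item = {
            # The real AnnDataSet would also return the one-hot encoded sequence here.
            "y": torch.as_tensor(np.asarray(targets, dtype=np.float32).ravel()),
        }
        for col in self.obs_columns:
            item[col] = torch.as_tensor(self.adata.obs[col].to_numpy(dtype=np.float32))
        for key in self.obsm_keys:
            item[key] = torch.as_tensor(np.asarray(self.adata.obsm[key], dtype=np.float32))
        return item


class SimpleMetaSampler(Sampler):
    """Yield (dataset_idx, region_idx) pairs so minibatches mix regions across species."""

    def __init__(self, dataset_sizes, epoch_size):
        self.sizes = np.asarray(dataset_sizes)
        self.probs = self.sizes / self.sizes.sum()
        self.epoch_size = epoch_size

    def __iter__(self):
        for _ in range(self.epoch_size):
            d = int(np.random.choice(len(self.sizes), p=self.probs))
            yield d, int(np.random.randint(self.sizes[d]))

    def __len__(self):
        return self.epoch_size

The point is just that each __getitem__ returns a dict keyed by field name, and the sampler decides which species' AnnData each draw comes from.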

I'm still working out some bugs, but I figured I shouldn't go any further before reaching out to see what you already have in the works, and whether this code is useful to you and the kind of thing you'd consider merging once it matures. Otherwise, I'll continue developing it as an independent extension to CREsted.

Thanks so much for developing this great package!

Matthew

P.S. Low-level usage of the extended classes looks like this:

# `atac_bigwig_dir`, `bin_path`, `chromsizes_file`, and `genome` are placeholders defined
# elsewhere; MetaAnnDataModule is the new class from my fork.
import numpy as np
import crested

# Two AnnData objects standing in for two species (imported from the same bigwigs here
# purely for illustration).
adata = crested.import_bigwigs(
    bigwigs_folder=atac_bigwig_dir,
    regions_file=bin_path,
    chromsizes_file=chromsizes_file,
    target='raw',
)
bdata = crested.import_bigwigs(
    bigwigs_folder=atac_bigwig_dir,
    regions_file=bin_path,
    chromsizes_file=chromsizes_file,
    target='raw',
)

# Attach some dummy per-cell-type metadata to obs and obsm.
adata.obs['imaginary'] = np.random.randint(0, 10, adata.shape[0])
adata.obsm['test'] = np.random.randn(adata.shape[0], 3)
bdata.obs['imaginary'] = np.random.randint(-5, -1, bdata.shape[0])
bdata.obsm['test'] = np.random.randn(bdata.shape[0], 3) - 2

# Per-region sampling probabilities: give the first region 0.5 and spread the rest evenly.
p = np.full(adata.n_vars, 0.5 / (adata.n_vars - 1), dtype=float)
p[0] = 0.5
adata.var["sample_prob"] = p
bdata.var["sample_prob"] = p

crested.pp.train_val_test_split(
    adata, strategy="chr", val_chroms=["chr8"], test_chroms=["chr9"]
)
crested.pp.train_val_test_split(
    bdata, strategy="chr", val_chroms=["chr8"], test_chroms=["chr9"]
)

datamodule = MetaAnnDataModule(
    adatas=[adata, bdata],
    genomes=[genome, genome],
    batch_size=32,
    epoch_size=5000,
    max_stochastic_shift=3,
    always_reverse_complement=True,
    obs_columns=['imaginary'],
    obsm_keys=['test'],
)

datamodule.setup('fit')

# Each batch is a dict of tensors; print the keys and shapes as a quick check.
for x in datamodule.train_dataloader.data:
    print(x)
    for k, v in x.items():
        print(k, v.shape)
