advice for using catvae in a new context #74
Comments
Hi @taylorreiter, thank you for reaching out - it sounds like a cool problem.
The count table would be samples x microbes, where each sample is either a contaminated genome or a contaminated RNA-seq run. I'm doing the taxonomic profiling using This is flexible to a certain extent...I could keep the abundances, summarize to a higher lineage, or not summarize lineages at all.
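To make the "summarize to a higher lineage" option concrete, here is a minimal sketch of collapsing per-sample counts to a chosen rank. It assumes semicolon-delimited lineage strings with GTDB-style rank prefixes (`o__`, `s__`, and so on); the taxon names, the `toy_counts` data, and the helper names are all hypothetical and are not output from any real profiler.

```python
from collections import defaultdict

# Rank prefixes in the GTDB style (assumed input format).
RANK_PREFIXES = {
    "domain": "d__", "phylum": "p__", "class": "c__", "order": "o__",
    "family": "f__", "genus": "g__", "species": "s__",
}


def lineage_at_rank(lineage, rank):
    """Return the field of a ';'-delimited lineage string at the given rank."""
    prefix = RANK_PREFIXES[rank]
    for field in lineage.split(";"):
        field = field.strip()
        if field.startswith(prefix):
            return field
    return "unclassified"


def collapse_counts(sample_counts, rank):
    """Sum counts per sample over all lineages sharing the same taxon at `rank`.

    sample_counts: {sample_id: {lineage_string: count}}
    returns:       {sample_id: {taxon_at_rank: summed_count}}
    """
    collapsed = {}
    for sample_id, counts in sample_counts.items():
        totals = defaultdict(int)
        for lineage, count in counts.items():
            totals[lineage_at_rank(lineage, rank)] += count
        collapsed[sample_id] = dict(totals)
    return collapsed


# Toy data with placeholder taxon names (hypothetical).
toy_counts = {
    "sample1": {
        "d__D;o__OrderA;s__Species1": 5,
        "d__D;o__OrderA;s__Species2": 3,
        "d__D;o__OrderB;s__Species3": 2,
    },
    "sample2": {"d__D;o__OrderB;s__Species3": 7},
}

order_counts = collapse_counts(toy_counts, "order")
print(order_counts)
```

Collapsing before training shrinks the number of columns (microbes), which is the quantity that drives the size of the correlation problem.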
Ok great, thank you! I haven't worked with the biom format before, but I'm sure I could convert my data into that format pretty easily. What would be the best data to train on? Sorry, I know that's a naive question. But I could have counts from all of the GTDB rs202 genomes in e.g. 10k SRA metagenomes; would that be reasonable to use? It's not really the same application (metagenome composition vs. contamination of samples), but I would expect many of the same species-species relationships to exist across all of these data types.
Ok thanks. By dimensionality, does that mean rows x columns? If I stick with order-level contamination detection in genomes, I would have 1,312 orders (rows) x 258,406 genomes (columns). For the RNA-seq problem, I'm guessing I'll have 3,000 species (rows) x 50,000 RNA samples (columns). That would be running everything all at once. While I think that would be preferable, I could break it down so that I'm only looking within one order or one species at a time for a given set of samples. In that case, the number of columns would shrink substantially and would max out around 10k or 15k. That's less desirable, as it carves the data up in a way that I think could mask some correlations, but it is preferable to not being able to look for these correlations at all :)
Right. When I refer to dimensionality, I typically mean samples (rows) x microbes (columns), where the goal is to infer microbe-microbe correlations (aka column-column correlations). In order for the VAEs to work, there need to be more samples (rows) than microbes (columns); otherwise we need fancy regularization (aka stronger biological priors), which requires redesigning the core algorithm. Our approach to avoid this was just to bump up the number of samples. Otherwise all bets are off: you'd need big machines to fit your giant covariance matrix in memory, on top of strong priors; your estimates will almost certainly be biased, and it won't be clear what exactly could be learned. If your first case is to learn pairwise interactions between 258k x 258k microbes, that will probably not be tractable. But if you are trying to estimate 1.3k x 1.3k order interactions, that's easy and will take a few days to run on a GPU. Similarly, if your second case is to estimate pairwise interactions between 3k x 3k microbes, that's even easier.
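As a rough back-of-the-envelope check on the memory point, the sketch below computes the size of a dense feature-by-feature matrix for the three feature counts mentioned in this thread. It assumes a dense float32 matrix (4 bytes per entry), which is an illustrative assumption and may not match how catvae actually stores its parameters; the function name is hypothetical.

```python
def dense_matrix_bytes(n_features, bytes_per_value=4):
    """Bytes needed for a dense n_features x n_features matrix (float32 by default)."""
    return n_features * n_features * bytes_per_value


# Feature counts taken from the discussion above.
cases = {
    "orders (genome use case)": 1_312,
    "species (RNA-seq use case)": 3_000,
    "genomes treated as features": 258_406,
}

for label, n_features in cases.items():
    gigabytes = dense_matrix_bytes(n_features) / 1e9
    print(f"{label}: {n_features} x {n_features} -> {gigabytes:.3f} GB")
```

Under these assumptions the order-level and species-level matrices need roughly 7 MB and 36 MB, while a 258,406 x 258,406 matrix needs about 267 GB, before counting any gradients or optimizer state.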
Hello! I saw your recent preprint, "Scalable estimation of microbial co-occurrence networks with Variational Autoencoders" and I'm hopeful your method may solve my issues, but I wanted to touch base to see if you think this method is appropriate for my use cases/scale of problem I'm hoping to address.
Use cases
Detecting contaminant pairs that co-occur in microbial genomes: We built a tool to detect and remove contamination from genomes and metagenome-assembled genomes. We're running this tool on the ~350k genomes in GTDB rs202. We detect contamination at the order level in about 15% of genomes. We want to use the order-level lineages detected in each sample to determine whether any contaminants co-occur more often than would be expected by chance.
Detecting contaminant pairs that co-occur in "isolate" RNA-seq data sets: We're creating a compendium of bacterial and archaeal isolate RNA-seq data. There are ~60k isolate data sets on the SRA. As part of this, we're looking for contamination in these isolates, and we generally find that there is some (usually 2-10 species detected in each sample). We want to know if any species co-occur (e.g. is Faecalibacterium prausnitzii likely to be contaminated with its friend Roseburia inulinivorans?).
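For reference, "co-occur more than expected by chance" can be checked for a single pair with a simple presence/absence baseline. The sketch below is a one-sided hypergeometric tail test (equivalent to a one-sided Fisher's exact test); it is not the catvae method, it ignores abundances and compositional effects, and the function name and toy numbers are hypothetical. Testing many pairs this way would also need a multiple-testing correction.

```python
from math import comb  # requires Python 3.8+


def cooccurrence_pvalue(n_samples, n_with_a, n_with_b, n_with_both):
    """P(overlap >= n_with_both) if taxa A and B were present independently.

    Null model: the n_with_b samples containing B are drawn at random from
    all n_samples, of which n_with_a contain A (hypergeometric distribution).
    """
    max_overlap = min(n_with_a, n_with_b)
    total = comb(n_samples, n_with_b)
    tail = sum(
        comb(n_with_a, k) * comb(n_samples - n_with_a, n_with_b - k)
        for k in range(n_with_both, max_overlap + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 10 samples, A in 5, B in 5, both in all 5.
p_value = cooccurrence_pvalue(n_samples=10, n_with_a=5, n_with_b=5, n_with_both=5)
print(p_value)
```

A small p-value suggests the pair appears together more often than independent presence would predict; it says nothing about why.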
Questions
I'd appreciate any insights/feedback you'd be willing to give!