Check why a relation such as SubClassOf(ObjectIntersectionOf(<> ObjectSomeValuesFrom(<> <>)) <>)
has not been inserted in the database. Does it work when we redo the insertion? Check indirect relations not reached by a chain of direct relations.
Use the strain mapping file in the pipeline for inserting conditions
- Add expression rank and expression score info
- Add propagation information:
- See email 19.11.19 02:04, Tom Conlin
- We could add the values from
- Get rid of all the "xxxExperimentExpression" tables and the related code that inserts data into them. For Bgee 15, confidence levels will be based on corrected p-values, not on the number of experiments.
- Modify the globalExpression table and related code accordingly.
- Rerun Affymetrix analyses to be able to store p-values (Sara, for new FDR correction)
- Use cdna.all.fa files from the Ensembl FTP instead of the BioMart cDNA extraction, which appears to have size limits and to return truncated results
- Use a tool more sensitive than BLAST to map ESTs (such as CD-HIT)
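To illustrate what "corrected p-values" could look like in practice, here is a minimal Python sketch of a Benjamini-Hochberg adjustment. These notes do not say which FDR procedure will be used for Bgee 15, so the choice of Benjamini-Hochberg and the function name `bh_adjust` are assumptions made for illustration only.

```python
def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Hypothetical helper: the actual correction used by the pipeline
    is not specified in these notes.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest, so that
    # adjusted values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Example with made-up p-values (not real Bgee data).
example_adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

A call could then be given a confidence level by comparing its adjusted p-value to a chosen threshold; the threshold itself would still need to be decided.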
Do not produce absent calls for some gene biotypes, depending on the library type
The same applies to the ranks: for now, we consider that every gene that has received at least one read in any library is accessible to rank computation in all libraries.
Assign different call qualities depending on the intergenic/genes threshold
Check discarded libraries, see which ones should be recovered
Globin reduction on blood samples: we need a test to determine whether blood samples underwent globin reduction or not. Let's implement the test and look at the distribution of samples with/without reduction. See the notes about that in the Bgee meeting minutes from 2020-04-07
- for all blood samples, we will run a test to check the globin depletion status
- insert the information in the database, maybe as a specific column, or as the same kind of information as the type of targeting of the library (miRNA, lncRNA, etc.)
- either the depletion will be known from annotation, provided by the data providers, or from the test.
- add the result of the test in the rnaSeqInfo file already used by the pipeline.
- check all files created and put their names in the Makefile.common (with variable parts like SPECIES, LIBRARY_ID that will be updated on the fly)
- parallelize kallisto_bus (other rules could be parallelized too, but they take less than 2 days to run as of Bgee 15.0)
- recheck all created files to make sure data cannot be duplicated inside them
- clean directory_names and homogenize variable/path/script names to keep a single naming convention (camelCase or snake_case)
- generate target-based calls using BgeeCall
- post-processing to remove genes never seen expressed anywhere. Note: this filter already exists for Affymetrix and RNA-Seq data independently, and EST data only produce present calls. Such a situation should therefore only happen with in situ data, when only absence of expression of a gene was reported and no other data type produced present calls. => Do we really need a post-processing filtering step for this?
- discuss filtering of calls based on expressionFlag (present/absent). In Bgee 15.0, calls for a gene never present for a given data type are not used to generate an expression (they are not associated with an expressionId)
- delete columns that are no longer used from the RDB schema (e.g., rnaSeqResult.)
- update single cell pipelines (e.g., parallelize steps, never use hardcoded names in scripts, etc.)
- disable autocommit whenever possible during insertion steps (not possible for insertion of conditions, maybe not possible for expression)
- update README files
- create insertion scripts for target-based data
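
As a rough illustration of the "disable autocommit" item above, the sketch below groups inserts into explicit transactions and commits once per batch instead of once per row. It uses Python's built-in `sqlite3` module as a stand-in so the example is self-contained; the actual insertion scripts and database driver differ, and the table `demoResult`, its columns, and the function `insert_rows` are hypothetical names chosen for this example.

```python
import sqlite3


def insert_rows(conn, rows, batch_size=1000):
    """Insert rows in explicit transactions, one commit per batch.

    Hypothetical helper: 'demoResult' is a made-up table, not part
    of the real schema.
    """
    cur = conn.cursor()
    inserted = 0
    try:
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            # Open the transaction ourselves rather than letting
            # every statement be committed on its own.
            cur.execute("BEGIN")
            cur.executemany(
                "INSERT INTO demoResult (geneId, score) VALUES (?, ?)",
                batch,
            )
            cur.execute("COMMIT")
            inserted += len(batch)
    except sqlite3.Error:
        # Undo the partially inserted batch, if a transaction is open.
        conn.rollback()
        raise
    return inserted


# isolation_level=None stops the sqlite3 module from managing
# transactions implicitly, so BEGIN/COMMIT above are in full control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE demoResult (geneId TEXT, score REAL)")
rows = [(f"gene{i}", float(i)) for i in range(2500)]
n_inserted = insert_rows(conn, rows)
```

Committing per batch rather than once at the very end keeps each transaction bounded; steps that need freshly inserted IDs to be visible right away (such as condition insertion) would not fit this pattern.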