Refactor vector generation (lazily?) #1014

Open
tdudgeon opened this issue Nov 28, 2022 · 4 comments

Comments

@tdudgeon
Collaborator

The viewer_compound table needs to be de-duplicated.
The tables that refer to this table also need to be updated accordingly, which will require a series of SQL statements to be generated. The only change to Django should be to add a unique constraint to the SMILES.

@tdudgeon tdudgeon self-assigned this Nov 28, 2022
@tdudgeon tdudgeon added the ASAP label Nov 28, 2022
@tdudgeon
Collaborator Author

tdudgeon commented Nov 28, 2022

Initial analysis on de-duplicating compounds has been done and can be found on the asap_migrate branch.
See here for details: https://github.com/xchem/fragalysis-backend/tree/asap_migrate/asap_migrate

In short, the compound de-duplication is going to be a bit more complex than expected.
Before the viewer_compound table can be de-duplicated, all tables referencing it also need to be updated.
More details can be found in the above link, but in summary:

viewer_molecule table: no particular problems with updating the compound IDs, but Django code will need to be investigated to ensure that compound IDs are unique

viewer_computedmolecule table: no particular problems with updating the compound IDs, but the computed set upload functionality needs to be investigated to ensure that compound IDs are unique

viewer_compound_project_id table: this is a join table and it should be relatively easy to de-duplicate, but Django code will need to be investigated to ensure that new rows use unique compound IDs

hypothesis_vector table: this defines the fragment network vectors and is the most complex to address. De-duplicating seems possible, but the way this data is generated also needs to be investigated. This table is also referenced by the hypothesis_vector3d table that will also need de-duplicating.

scoring_cmpdchoice, viewer_activitypoint, viewer_compound_inspirations, viewer_designset_compounds tables: these currently contain no data. We need to investigate whether any Django code needs updating should data come into existence (or whether to drop these tables completely if they are unused and just add to the confusion).
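To make the ordering above concrete, here is a minimal sketch of the "repoint references first, then delete duplicates, then add the unique constraint" sequence. It is only an illustration run against an in-memory SQLite database, not the actual migration: the column names `smiles` and `cmpd_id` are assumptions, only one referencing table is shown, and it assumes that identical SMILES strings mean identical compounds (i.e. the SMILES are already canonical).

```python
import sqlite3


def dedupe_compounds(conn):
    """Keep the lowest id per SMILES and repoint references to it.

    Returns a dict mapping each removed compound id to the id that was kept.
    Column names (smiles, cmpd_id) are assumptions made for this sketch.
    """
    rows = conn.execute(
        "SELECT id, smiles FROM viewer_compound ORDER BY id"
    ).fetchall()

    keep = {}   # smiles -> id of the first compound seen with that SMILES
    remap = {}  # duplicate id -> id to keep
    for compound_id, smiles in rows:
        if smiles in keep:
            remap[compound_id] = keep[smiles]
        else:
            keep[smiles] = compound_id

    # Step 1: update the referencing table before touching viewer_compound.
    conn.executemany(
        "UPDATE viewer_molecule SET cmpd_id = ? WHERE cmpd_id = ?",
        [(new_id, old_id) for old_id, new_id in remap.items()],
    )
    # Step 2: delete the duplicates, which are no longer referenced.
    conn.executemany(
        "DELETE FROM viewer_compound WHERE id = ?",
        [(old_id,) for old_id in remap],
    )
    # Step 3: only now can uniqueness be enforced.
    conn.execute(
        "CREATE UNIQUE INDEX uniq_compound_smiles ON viewer_compound (smiles)"
    )
    conn.commit()
    return remap


# Tiny demonstration with made-up rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE viewer_compound (id INTEGER PRIMARY KEY, smiles TEXT);
    CREATE TABLE viewer_molecule (
        id INTEGER PRIMARY KEY,
        cmpd_id INTEGER REFERENCES viewer_compound(id)
    );
    INSERT INTO viewer_compound VALUES (1, 'CCO'), (2, 'c1ccccc1'), (3, 'CCO');
    INSERT INTO viewer_molecule VALUES (10, 1), (11, 3), (12, 2);
    """
)
remap = dedupe_compounds(conn)
print(remap)
```

The same mapping would have to be applied to every other referencing table (viewer_computedmolecule, the join table, hypothesis_vector and so on) before the delete step, which is where most of the complexity described above comes from.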

@phraenquex
Collaborator

@tdudgeon points to multiple different things:

  • deduplicating the compounds
  • a vector generator job squonk-side (says @phraenquex, because users must be able to trigger it, and it may take a long time to run; @tdudgeon wonders if it can't stay backend-side.)
  • a vector loader job backend-side (to slurp up the output of squonk)
  • a way to trigger the vector generator (though presumably already dealt with by ALC2 work, specifically Implement generic squonk job configuration as backend feature #944).

Scope out later...?

@phraenquex
Collaborator

phraenquex commented Dec 1, 2022

@tdudgeon updates:

  • what gets calculated at upload time is the set of vectors for the uploaded compounds
  • (we still don't understand that algorithm completely - Anthony Bradley took it from the Astex paper)
  • what queries the graph database is the click-on-the-vector, and does not need updating

What may need updating is:

  • the algorithm that generates vectors
  • the vectors for the RHS compounds - because vector calculation is not triggered when they are inserted.

Also, four tables with "scoring" in their name need to be checked out.

@phraenquex phraenquex changed the title from "De-duplicate compounds table" to "Fix dependencies (ids, vectors, etc) for duplicate compounds" Dec 1, 2022
@phraenquex phraenquex changed the title from "Fix dependencies (ids, vectors, etc) for duplicate compounds" to "Refactor vector generation - likely lazy" Jan 12, 2023
@phraenquex phraenquex changed the title from "Refactor vector generation - likely lazy" to "Refactor vector generation (lazily?)" Jan 12, 2023
@phraenquex
Collaborator

Scope has moved away from IDs: they don't need curation, now that we (@phraenquex) are happy for them to be duplicated.

The implication is that the vectors are probably / definitely broken, in two ways:

  • many compounds won't have vectors stored
  • we cannot easily change/update vector definitions.

Solution is (probably) to generate them lazily - is that feasible?
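For illustration, "lazily" could look roughly like the sketch below: nothing is stored at upload time, vectors are computed the first time a compound is asked for and then served from an in-memory cache. `generate_vectors` is a hypothetical placeholder, not the existing algorithm, and the cache size is arbitrary.

```python
from functools import lru_cache

# Counts how often the expensive step really runs (for the demo only).
stats = {"calls": 0}


def generate_vectors(smiles):
    """Hypothetical placeholder, NOT the real vector algorithm."""
    stats["calls"] += 1
    return tuple(f"{smiles}:vector{i}" for i in range(3))


@lru_cache(maxsize=4096)
def vectors_for(smiles):
    """Compute vectors on first request, then reuse the cached result."""
    return generate_vectors(smiles)


first = vectors_for("CCO")    # computed now
second = vectors_for("CCO")   # served from the cache
calls_before_clear = stats["calls"]

# If the vector definition changes, drop the cache and recompute on demand.
vectors_for.cache_clear()
third = vectors_for("CCO")
```

With this shape both problems listed above go away in principle: a compound can never be "missing" vectors, and changing the definition only requires clearing the cache rather than rewriting stored rows. Whether it is fast enough is the open question.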

Action:

  • benchmark vector generation: can it happen on the fly? (Chances are yes.)
  • isolate the code, and document it properly (I think that's needed, because we cannot easily answer "how are they generated")
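A rough harness for the benchmarking action might look like the following. It times any callable over a list of inputs and reports per-call figures; the `placeholder_generate_vectors` function and the SMILES list are made up for the sketch and would be replaced by the isolated vector code and real compounds.

```python
import statistics
import time


def benchmark(func, inputs, repeats=5):
    """Time func over all inputs, `repeats` times; report seconds per call."""
    per_call = []
    for _ in range(repeats):
        start = time.perf_counter()
        for item in inputs:
            func(item)
        elapsed = time.perf_counter() - start
        per_call.append(elapsed / len(inputs))
    return {
        "median_s": statistics.median(per_call),
        "worst_s": max(per_call),
    }


def placeholder_generate_vectors(smiles):
    """Hypothetical stand-in; swap in the real vector generation here."""
    return [smiles[i:] + smiles[:i] for i in range(len(smiles))]


sample_smiles = ["CCO", "c1ccccc1", "CC(=O)O"] * 100
result = benchmark(placeholder_generate_vectors, sample_smiles)
print(result)
```

The per-call median is the number to compare against whatever latency is acceptable for a click in the front end; the worst case gives a first hint of how much slower large compounds are.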

@mwinokan mwinokan moved this to Ada Lovelace 3 (flotsam) in Fragalysis May 29, 2024