New fragmentation run #1486

Open
Waztom opened this issue Aug 1, 2024 · 60 comments

@Waztom
Collaborator

Waztom commented Aug 1, 2024

Matteo is providing a new set of molecules that need to be added to the graph database. The expectation is that there will be about 200M molecules.
To process this we need to re-instate the fragmentation machinery, and as the database will no longer be co-located with the compute cluster we need to make minor changes to the process to copy files between clusters.

We aim to process this data as a new dataset (e.g. starting with an empty database) and then use the combine play to combine it with the old data. Hence the process will look like this:

  1. standardize the new molecules (cluster)
  2. load standardized molecules into the database (database)
  3. extract out the molecules to be fragmented (database)
  4. fragment (cluster)
  5. load fragmented data into database (database)
  6. extract out molecules needing generation of additional info (e.g. InChi) (database)
  7. generate additional info (cluster)
  8. load additional info into database (database)
  9. generate nodes and edges CSV files (database)
  10. combine with existing data (cluster)

We plan to run a small test dataset through this process to make sure that it's running properly.
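
As an illustration of step 1, a minimal standardisation pass might look like the sketch below. This is RDKit-based and purely illustrative: the actual standardisation rules live in the fragmentor codebase, and the input layout and cleanup steps here are assumptions.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardise(smiles: str):
    """Return a canonical, desalted, neutralised SMILES, or None on failure."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.Cleanup(mol)           # sanitise / normalise
    mol = rdMolStandardize.FragmentParent(mol)    # keep the largest fragment (desalt)
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    return Chem.MolToSmiles(mol)

with open("input.smi") as fin, open("standardised.smi", "w") as fout:
    for line in fin:
        fields = line.split()
        if len(fields) < 2:
            continue
        std = standardise(fields[0])
        if std:
            fout.write(f"{std}\t{fields[1]}\n")   # assumed "SMILES ID" layout
```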

@Waztom Waztom converted this from a draft issue Aug 1, 2024
@Waztom Waztom moved this from Backlog to ASAP critical in Fragalysis Aug 1, 2024
@Waztom Waztom added stack 2024-06-14 mint Data dissemination 2 labels Aug 1, 2024
@mwinokan
Collaborator

mwinokan commented Aug 22, 2024

Chris Reynolds is able to assist after shutdown (after week 1 of Sept).

@Waztom asks if we can move this job to the DLS cluster (not STFC). Either way @tdudgeon says we need significant resources, which are available at DLS (150TB of GPFS).

@tdudgeon assumes that the functionality of the DLS cluster will be the same as STFC's, as both use SLURM. @Waztom says we can leverage @ConorFWild's experience rather than relying on Chris.

Object storage for IRIS/STFC has not been investigated by @Waztom or @mwinokan.

The CPU requirement is around 2000 cores. @mwinokan has briefly checked and there are around 56 idle nodes with 64 cores and 500 GB of RAM each; both the gpfs03 and gpfs04 filesystems are mounted.

The Postgres server will need to be available as well. Importing the Postgres volume into the DLS cluster will need Diamond IT/SC assistance.

@mwinokan mwinokan added 2024-04-25 pink Stack maintenance/monitoring and removed 2024-04-25 pink Stack maintenance/monitoring labels Aug 22, 2024
@alanbchristie
Collaborator

A simple diagram illustrating non-cluster elements (like DB and NFS server) along with the "expected" shared filesystem: -

[Image: architecture diagram showing the DB, NFS server and shared filesystem]

@mwinokan
Collaborator

mwinokan commented Aug 29, 2024

@ConorFWild says that obtaining 2000 concurrently available cores will need SC to increase the job limits. Graham is the SC contact for this. It's estimated that 2000 cores will take a week or two

Conor suggests that many jobs with fewer cores each will be friendlier to beamline processes (i.e. less likely to interrupt them): 2000 single-core jobs instead of one job with 2000 cores.
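
A minimal sketch of that splitting, assuming a plain SMILES input file and a hypothetical single-core SLURM array job (file names, chunk size and the submission script are illustrative, not the actual fragmentor configuration):

```python
import itertools
from pathlib import Path

def split_smiles(infile: str, outdir: str, chunk_size: int = 100_000) -> int:
    """Split a SMILES file into fixed-size chunks, one per single-core task."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    n = 0
    with open(infile) as fh:
        while True:
            chunk = list(itertools.islice(fh, chunk_size))
            if not chunk:
                break
            (out / f"chunk_{n:05d}.smi").write_text("".join(chunk))
            n += 1
    return n

if __name__ == "__main__":
    n_chunks = split_smiles("molecules.smi", "chunks")
    # each chunk then maps to one SLURM array task, e.g.:
    #   sbatch --array=0-<n_chunks-1> fragment_chunk.sh
    print(f"wrote {n_chunks} chunks")
```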

@alanbchristie to create a SC request ticket. SChelpdesk@diamond.ac.uk

@alanbchristie
Collaborator

SC Request ID: SCHD-5779

@alanbchristie
Collaborator

The SC Request has been shut down and closed as "Won't Do".

I have forwarded the email, but clearly there is no desire to support execution outside of Iris. I will step aside on this topic as there is nothing more I can do.

@alanbchristie alanbchristie removed their assignment Aug 30, 2024
@mwinokan
Collaborator

mwinokan commented Sep 12, 2024

@alanbchristie has made some progress; it will be done on the development cluster. Six new machines with 380 have been created, and playbook work has been initialised. Alan is optimistic and will do a dry run tomorrow. Even including the existing resources, the number of CPUs will be about 3x lower than for the previous run, which took around 3 weeks (on the galaxy cluster).

Matteo is back online tomorrow and apparently the data is ready, somewhere on /dls (~250M compounds). The compound selection process will need to be documented as well.

@phraenquex phraenquex added 2024-04-26 orange Design (RHS) dissemination and removed 2024-06-14 mint Data dissemination 2 labels Sep 17, 2024
@mwinokan
Collaborator

mwinokan commented Sep 24, 2024

@alanbchristie has been testing with a smaller (1M) dataset and optimising parameters. We are ready to fragment a bigger dataset.

There is some confusion regarding the number of molecules; @phraenquex says that there should be a 220M compound file. @matteoferla can provide the path.

There are also duplicate molecules: the same SMILES but different identifiers (within the same supplier). Ideally we would keep all the identifiers, but for our uses it is not a big loss to drop one of them. @tdudgeon will write a pre-processing script to remove duplicates.

@tdudgeon says he can start the 220M compound run tomorrow, and by Friday there should be an ETA.

@mwinokan
Collaborator

To weed out highly fragmentable molecules, @tdudgeon please set a stringent filter (e.g. a maximum of 10 fragments per molecule) and output a list of molecules that have been excluded by that filter, so we can review them and add them back in later.
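
A rough sketch of such a pre-filter is below. The real "num frags" value is computed by the fragment network code itself; the BRICS decomposition here is only a crude stand-in, and the file names are assumptions. The thresholds are the ones discussed in this thread (HAC 36 in the follow-up comment, 10 fragments here).

```python
from rdkit import Chem
from rdkit.Chem import BRICS

MAX_HAC = 36      # heavy atom count cut-off mentioned in the follow-up comment
MAX_FRAGS = 10    # stringent fragment-count cut-off suggested here

with open("standardised.smi") as fin, \
     open("to_fragment.smi", "w") as keep, \
     open("excluded.smi", "w") as rejected:
    for line in fin:
        fields = line.split()
        if not fields:
            continue
        mol = Chem.MolFromSmiles(fields[0])
        if mol is None:
            continue
        hac = mol.GetNumHeavyAtoms()
        n_frags = len(set(BRICS.BRICSDecompose(mol)))   # crude proxy for "num frags"
        target = rejected if (hac > MAX_HAC or n_frags > MAX_FRAGS) else keep
        target.write(line)
```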

@tdudgeon
Collaborator

Here are some example molecules that are currently excluded from fragmentation (heavy atom count > 36 or num frags > 12)

[Image: examples of molecules excluded by the filter]

The good news is that the file is already being created; the bad news is that the IDs are not being written, so we need a small tweak to allow it to be used as an input.

@phraenquex
Collaborator

@tdudgeon - great, we definitely don't need that shit for our scaffolds.

(@mwinokan @Waztom do say if you disagree.)

@matteoferla
Collaborator

matteoferla commented Sep 25, 2024

Sorry all, I should have nudged my query email ages back to get the details straight at the onset.
@tdudgeon, what does num frags > 12 mean? Are those BRICS decompositions?

Re the screenshot: there was a penalty for excessive methylene groups, but not for methyl groups.
Triazoles and tetrazoles are uncommon in Enamine, but triazoles (Huisgen cycloaddition, both 1,4 = Cu and 1,5 = Ru) were enriched by the synthon enrichment, which used custom SMARTS patterns for common late-stage functionalisation reaction products.
Re morpholino groups: hetero-alicyclics such as these were marginally favoured by a filter aimed at limiting polyphenyl chains.
More importantly, the molecules are rich in H-bond donors and have high TPSA. While these properties are usually criticised as bad for permeability, the compounds that bind frequently in XChem have high TPSA and many enthalpic interactions, whereas entropic / hydrophobic / greasy interactions underperform both in the screen and as follow-ups, reflecting the difference between what is best for the crystallographic screen (= well-placed compounds) and what is best for cell-based assays. Consequently, the strongly tailed lower quartile of the greasy Enamine compounds was excluded, but not an upper quantile of high TPSA, which is already filtered by Enamine.
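
For reference, the properties discussed above can be computed straightforwardly with RDKit; the snippet below is purely illustrative (arbitrary example molecule) and does not reproduce Matteo's actual weights or cut-offs.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Arbitrary example molecule; not one of the compounds in the screenshot.
mol = Chem.MolFromSmiles("OCC1CCN(CC1)C(=O)c1cn[nH]c1")

print("TPSA :", Descriptors.TPSA(mol))        # topological polar surface area
print("HBD  :", Descriptors.NumHDonors(mol))  # hydrogen-bond donors
print("cLogP:", Descriptors.MolLogP(mol))     # greasiness proxy
```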

@tdudgeon
Collaborator

num frags is the number of fragments that would be generated by the fragment network

@phraenquex
Collaborator

Thanks @matteoferla - the relevant question at this point is whether they are likely to make "good" scaffolds, i.e. very productive input for Syndirella.

And to summarise what we arrived at yesterday: if the answer is only "maybe" as well as "but they'll take awfully long to fragment and slow down Max getting a functional Syndirnet", then they get thrown onto a "don't do now" pile, and we'll deal with them later, e.g. as a slow-running low-priority fragmentation queue.

@phraenquex
Collaborator

Transferring an email thread for the record - especially as it documents (a bit of) the algorithm.

@matteoferla Sept 14:

I filtered down the 7B molecule catalogues in three steps, followed by a final sort.
6B ==first and second pass==> 220M ==third pass==> 20M ==sorting==> ordered 20M.
The number of compounds that can fit in the network is, I am told, variable in the 10M-20M range, hence the ordered part.

(The order comes from a sum of weighted Winsorised Z-scores, accounting for pharmacophoric trio distance uniqueness, presence of reaction product moieties, presence/absence of certain moieties (e.g. an excessive number of benzenes and alkanes) and regular filters, all with a bonus for those that were a partial superstructure of an XChem library compound, in accordance with Frank's request.)

I sent you a link to a MS OneDrive folder with the ordered 20M compounds. This was not my ideal choice, but GitHub was being problematic —hence my delay.

Warren tells me you might have access to the NFS already, in which case:

The data is in the ‘shared XChem drive’, namely the NFS folder mounted from 192.168.212.119:/mnt/xchem-fragalysis-2 as /opt/xchem-fragalysis-2 in some locations. Let’s say this is $DATA. The folder $DATA/mferla/library_making/selected_final contains the ordered SMILES as mentioned.

The folders in $DATA/mferla/library_making/third_pass contain the conformers. These are clustered by the sorting metric into Z1 (top), Z1-08 (high), Z05-08 (mid) and Z0-05 (low). Within these folders they are not sorted and are split into the files where they came from. Namely, the original Enamine files were split into chunks (e.g. Enamine_REAL_HAC_29_38_1.3B_Part_2_chunk66.sdf.bz2 is the 66th 200M chunk of REAL HAC 29-38).

$DATA/mferla/library_making/second_pass contains SMILES with only basic filters applied, without sorting for pharmacophoric and synthetic properties. It is not needed (it’s the 220M), but I for one would be curious as to what is in ‘2’ after ‘3’ gets mentioned.

Unfortunately, just now I noticed that the HAC27 subset of REAL is missing. I will run it over the weekend and it will be in the third_pass folder.

But the ordered set on OneDrive will need deleting and remaking, which is a problem as I will be travelling on Monday.
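
A minimal sketch of the weighted, Winsorised Z-score ranking described above; the column names, weights and Winsorisation limits here are placeholders, and only the mechanics (winsorise, Z-score, weighted sum, sort) follow the description.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("candidates.csv")            # hypothetical: one numeric column per scored property

weights = {                                   # placeholder weights, not the actual values
    "pharmacophore_uniqueness": 2.0,
    "reaction_product_score": 1.0,
    "undesirable_moiety_count": -1.0,
    "xchem_superstructure_bonus": 0.5,
}

score = np.zeros(len(df))
for col, w in weights.items():
    x = df[col].to_numpy(dtype=float)
    lo, hi = np.percentile(x, [5, 95])
    x = np.clip(x, lo, hi)                    # Winsorise at the 5th/95th percentiles
    score += w * (x - x.mean()) / x.std()     # Z-score each property, then apply the weight

df.assign(score=score).sort_values("score", ascending=False).to_csv("ordered.csv", index=False)
```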

@phraenquex
Collaborator

@alanbchristie 19 Sept:

It’s taken more time than I expected to get to initial processing of the data. One thing that does stand out is the dramatic change in identifier style throughout the file. In the early part of the file it’s a pleasant-looking MCULE-2832428593. Deeper into the file I see things like m_274552____23520642____24972820____13454082. Is the latter identifier OK for users?

@phraenquex
Collaborator

@tdudgeon 24 Sept:

I just wanted to clarify which set of molecules we should be fragmenting.

The files in /opt/xchem-fragalysis-2/mferla/library_making/selected_final seem to add up to about 20M molecules (but the file shorlist0017.1M.cxsmiles.bz2 seems to be corrupt).

We were under the impression that we were to be fragmenting about 200M molecules. Is it those in /opt/xchem-fragalysis-2/mferla/library_making/second_pass that we should be using?

Also, as already discussed, there are a significant number of duplicates (same SMILES but with different IDs). Is this because you are doing a desalting operation or something similar? If so, it looks like the original molecule has been lost.

Not really a problem, just need to know what's going on.

@phraenquex
Collaborator

@tdudgeon 25 Sept:

Further to this, the duplicates DO cause a problem.

I'm assuming it's impractical for Matteo to regenerate without duplicates, so we should generate a simple preprocessing script that removes them from the data.

Presumably it's OK to just keep the first (based on SMILES) and discard any subsequent ones?
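
Something like the sketch below would do for that pre-processing step, assuming a whitespace-separated "SMILES ID" file; for ~200M rows the in-memory set gets large, so the real script may prefer sorting or an on-disk approach instead.

```python
# Keep the first record seen for each SMILES and drop later duplicates.
seen = set()

with open("input.smi") as fin, open("deduplicated.smi", "w") as fout:
    for line in fin:
        if not line.strip():
            continue
        smiles = line.split()[0]
        if smiles in seen:
            continue                          # discard subsequent duplicates
        seen.add(smiles)
        fout.write(line)
```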

@mwinokan mwinokan added the blocked An issue blocked by another internal or 3rd party issue label Oct 24, 2024
@mwinokan
Collaborator

@tdudgeon reports that @alanbchristie has restarted the run

@mwinokan mwinokan removed the blocked An issue blocked by another internal or 3rd party issue label Oct 31, 2024
@alanbchristie
Collaborator

alanbchristie commented Nov 8, 2024

The first 50M molecules (the first chunk of the 250M set) have been fragmented and extracted to Echo/S3. Here's a summary of execution stats for the 250M-C1 run: -

  1. Standardisation: 2h 15m
  2. Fragmentation: 75h 30m
  3. Inchi generation: 15h 00m
  4. Extract CSV: 78h 15m (to dls-echo:/im-fragnet/extract/enamine_mferla/250M-C1)

A total run-time of about 171h (7 days 21 hours) - i.e. approximately 8 days with an extracted set of CSV files totalling 72.688 GiB

It is important to realise that the total run-time may be longer (or shorter) than 8 days because a) each molecule set is different, and b) a human has to start each step once the previous has completed. So if a run completes at 1am the next step may not be started until maybe 8 or 9am that day (adding 8 hours to the run). So, to be conservative, add at least 8 hours to each step, e.g. 24 hours for the above, making the total run-time a more realistic 9 days

@mwinokan
Collaborator

@tdudgeon says the first chunk of 50M (250M-C1) has finished in about 8 days.

@alanbchristie says there was an issue on Sunday night causing the second run to need restarting, but it is now running.

@mwinokan
Collaborator

@alanbchristie suspects a problem and will investigate. Please update in this ticket if anything noteworthy comes up

@mwinokan
Collaborator

mwinokan commented Nov 19, 2024

@alanbchristie reports that the runs stall; a few things have been tried, but Alan has not found a fix yet. There may be an issue with the second set due to the size of the batch / longer execution time (5-6 hours).

@mwinokan
Collaborator

@alanbchristie reports that the fragmentation problem has been resolved and the second 50M batch is nearing completion.

@alanbchristie will then move the fragmentation job to the new cluster to free up hardware for #1451

@alanbchristie
Collaborator

Fragmentation has been successfully moved to the new (k8s 1.30) cluster, and appears to be running as expected. This leaves the old (k8s 1.23) cluster running the extract of the C2 run. When this finishes the old cluster will be free to be removed (after backing up the moldb database and old AWX data).

The C2 extract is expected to complete sometime tomorrow.

The new cluster is fragmenting the 50M molecules from chunk C3. This has already failed 3 times after many hours of execution due to unknown cluster/networking issues.

C3 failed at 00:33:31 on Sunday night and at exactly the same time last night (25th Nov). I have moved to a new Nextflow version which supports parameters that might help with the overnight (network) failures. A 4th execution of the C3 fragment run is now underway.

@alanbchristie
Collaborator

The extract of the 2nd 50 million molecules (C2) is complete, here's a brief summary: -

  1. Standardisation: 1h 55m
  2. Fragmentation: 76h 30m
  3. Inchi generation: 22h 20m
  4. Extract CSV: 71h 30m (to dls-echo:/im-fragnet/extract/enamine_mferla/250M-C2)

The extract actually failed at the end of the run, but the extraction had completed.

A total run-time similar to the C1 run of about 172h (7 days 22 hours) - i.e. approximately 8 days with an extracted set of CSV files totalling 74 GiB

@phraenquex
Collaborator

Thanks @alanbchristie for the super informative breakdown.

How well could these long jobs parallelise? Would the time shrink once you engineer use of Diamond's (much more) infinite-ish CPUs?

@alanbchristie
Collaborator

The actual 'fragment' generation in fragmentation is executed in parallel, so that really just depends on cores, but a significant majority of the work involves database interactions, some of which run in parallel. Others might benefit from a re-write.

There's also the 'quality' of the hardware. For example, I have had to start the third (C3) fragmentation run four times since last Friday, after each attempt was affected by some sort of underlying cloud/network issue.

With regard to the 200M, I am hoping we have enough time left to run the C3, C4 and C5 "chunks". C3 is already a day into its 3-day fragment run. But if we continue to be disrupted by the hardware we will not have enough time. I think we're looking at 24 days still to go, and it's nearly December.

@phraenquex
Collaborator

Thanks. Run as much as you can, and whatever remains just doesn't make it in - no disaster.

(Scientifically, it's precisely equivalent to Matteo having run some filter more stringently, or the vendors having a slightly smaller catalogue. Whatever.)

When you get to reengineering it (for future iterations):

  • How can DB interactions be scaled? (DB on a larger node?)
  • More robust handling of failures by the dispatcher mechanism.

@mwinokan
Collaborator

@alanbchristie had to restart the run after the NFS server outage. The third batch is running again now

@mwinokan
Collaborator

mwinokan commented Dec 3, 2024

@tdudgeon says that the database itself is probably at the peak of its performance, even if allocated more resources. The parallel data manipulation in Postgres is the limiting factor.

Tim suggests a refactor using Kafka message queues to improve fragmentation performance beyond the estimated 2x gain from allocating more resources to the DB, but this is a large body of work.

@mwinokan
Collaborator

mwinokan commented Dec 5, 2024

@alanbchristie says that we are hitting networking issues but the job is running again

@alanbchristie
Collaborator

alanbchristie commented Dec 10, 2024

The extract of the 3rd 'chunk' of 50 million molecules (C3) is nearly complete (with just the extract play to run), here's a brief summary: -

  1. Standardisation: 1h 52m
  2. Fragmentation: 61h 28m
  3. Inchi generation: 17h 05m
  4. Extract CSV: 28h 43m (to dls-echo:/im-fragnet/extract/enamine_mferla/250M-C3)

A total run-time of about 110h (4.5 days) - with an extracted set of CSV files totalling 73 GiB

The improvement in execution time is potentially due to the revised database configuration used on this run.

@mwinokan
Collaborator

mwinokan commented Dec 17, 2024

@alanbchristie reports that a new failure happened over the weekend, but the playbooks have been patched to allow for faster/easier restarts.

We're now working on batch 4

@Waztom asks about documentation (see #1619) and @alanbchristie says that the READMEs are up to date. @alanbchristie please link to the repositories here.

@alanbchristie
Collaborator

alanbchristie commented Dec 18, 2024

We're running fragmentation using machines in the Development cluster. It is written in Ansible and uses two container images (the actual fragmentor and a player that orchestrates each play).

The container images are built from the fragmentor repository and Kubernetes orchestration takes place from within the fragmentor-k8s-orchestration repository.

@alanbchristie
Collaborator

The extract of the 4th 'chunk' of 50 million molecules (C4) is complete. Here's a brief summary: -

  1. Standardisation: (about) 1h 45m
  2. Fragmentation: (about) 60h
  3. Inchi generation: 22h 28m
  4. Extract CSV: 54h 18m (to dls-echo:/im-fragnet/extract/enamine_mferla/250M-C4)

A total run-time of about 138h (5 days) - with an extracted set of CSV files totalling 92 GiB

@alanbchristie
Collaborator

alanbchristie commented Jan 7, 2025

The extract of the 5th 'chunk' of 50 million molecules (C5) is complete. Here's a brief summary: -

  1. Standardisation: (about) 1h 45m
  2. Fragmentation: (about) 80h 26m
  3. Inchi generation: 24h 23m
  4. Extract CSV: 58h 54m (to dls-echo:/im-fragnet/extract/enamine_mferla/250M-C5)

A total run-time of about 166h (7 days) - with an extracted set of CSV files totalling 96 GiB

@alanbchristie
Collaborator

alanbchristie commented Jan 7, 2025

  • Approximate (average) processing time (per 50 million chunk): 146h (6 days)
  • Compressed extracted output (all 5 chunks): 407 GiB

@mwinokan
Collaborator

@alanbchristie's bandwidth is limited until after the workshop. @alanbchristie will need to work with @Waztom to come up with a strategy to de-duplicate hundreds of GB of data.

@mwinokan
Collaborator

mwinokan commented Jan 28, 2025

The fragmented molecules from the five separate runs, and the existing graph database, have to be combined into a single graph. The combine process already exists and includes de-duplication of molecules. The combination previously took weeks on many cores, and these jobs currently cannot be interrupted.

@alanbchristie is working on a dry-run of the combination step (ETA today) to get an estimate of the resources required on IRIS

@Waztom suggests aggregating on molecular weight, but @alanbchristie says that it would need changes to the playbooks and @tdudgeon says it may duplicate the existing hashing

@phraenquex says that we are unable to change the IRIS resources so more work may be needed to introduce checkpointing

@tdudgeon says that molecular weight, heavy atom count, and/or molecular formula could be used to split the combination into chunks in which duplicates may exist. Checkpointing will be difficult, but if a small job fails it could be re-run. @Waztom says this will help because you won't have to check all nodes, but only a subset with the same molecular weight.

@mwinokan
Collaborator

mwinokan commented Jan 30, 2025

@tdudgeon has been exploring the molecular properties binning for de-duplication.

Tim says that using molecular weight concerns him: the distribution will be roughly normal, so the bins around the median molecular weight will be very large.

@phraenquex says that the key thing is that you cluster fragments with the smallest predictable sets of other fragments that could possibly be duplicates (e.g. they must have the same molecular properties)

@tdudgeon says that hashing would also achieve this partitioning, i.e. group fragments based on a hash of their SMILES, and you will only get very few collisions.
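
A minimal sketch of that hash-based partitioning (bucket count, file names and column order are assumptions): route each record to one of N buckets by a stable hash of its SMILES, so duplicates can only ever collide within the same bucket, and each bucket can be de-duplicated or re-run independently.

```python
import hashlib

N_BUCKETS = 256                               # assumed bucket count

def bucket_for(smiles: str) -> int:
    """Stable bucket index derived from the SMILES string."""
    digest = hashlib.sha1(smiles.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % N_BUCKETS

# Route each node record into its bucket file.
buckets = [open(f"nodes_bucket_{i:03d}.csv", "w") for i in range(N_BUCKETS)]
try:
    with open("nodes.csv") as fin:
        for line in fin:
            smiles = line.split(",")[0]       # assumed: SMILES in the first column
            buckets[bucket_for(smiles)].write(line)
finally:
    for fh in buckets:
        fh.close()
```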

@mwinokan
Collaborator

mwinokan commented Feb 4, 2025

@alanbchristie is waiting for discussions with @tdudgeon on the molecule hash binning

@mwinokan
Collaborator

mwinokan commented Feb 11, 2025

@tdudgeon is prototyping the new combine step and will work with @alanbchristie on the implementation tomorrow.

(@tdudgeon to check confluence access, and then @Waztom can kick off the contractor onboarding for @alanbchristie)

@phraenquex
Collaborator

Code written to handle the new combine algorithm; now integrating into NextFlow.
