New fragmentation run #1486

Open
Waztom opened this issue Aug 1, 2024 · 60 comments

@Waztom
Collaborator

Waztom commented Aug 1, 2024

Matteo is providing a new set of molecules that need to be added to the graph database. The expectation is that there will be about 200M molecules.
To process this we need to re-instate the fragmentation machinery, and as the database will no longer be co-located with the compute cluster we need to make minor changes to the process to copy files between clusters.

We aim to process this data as a new dataset (e.g. starting with an empty database) and then use the combine play to combine it with the old data. Hence the process will look like this:

  1. standardize the new molecules (cluster)
  2. load standardized molecules into the database (database)
  3. extract out the molecules to be fragmented (database)
  4. fragment (cluster)
  5. load fragmented data into database (database)
  6. extract out molecules needing generation of additional info (e.g. InChi) (database)
  7. generate additional info (cluster)
  8. load additional info into database (database)
  9. generate nodes and edges CSV files (database)
  10. combine with existing data (cluster)

We plan to run a small test dataset through this process to make sure that it's running properly.
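
As an illustration of step 1, a minimal standardisation pass might look like the sketch below. This is RDKit-based and purely illustrative: the actual standardisation rules live in the fragmentor codebase, and the input layout and cleanup steps here are assumptions.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardise(smiles: str):
    """Return a canonical, desalted, neutralised SMILES, or None on failure."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.Cleanup(mol)           # sanitise / normalise
    mol = rdMolStandardize.FragmentParent(mol)    # keep the largest fragment (desalt)
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    return Chem.MolToSmiles(mol)

with open("input.smi") as fin, open("standardised.smi", "w") as fout:
    for line in fin:
        fields = line.split()
        if len(fields) < 2:
            continue
        std = standardise(fields[0])
        if std:
            fout.write(f"{std}\t{fields[1]}\n")   # assumed "SMILES ID" layout
```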

@Waztom Waztom converted this from a draft issue Aug 1, 2024
@Waztom Waztom moved this from Backlog to ASAP critical in Fragalysis Aug 1, 2024
@Waztom Waztom added stack 2024-06-14 mint Data dissemination 2 labels Aug 1, 2024
@mwinokan
Collaborator

mwinokan commented Aug 22, 2024

Chris Reynolds is able to assist after shutdown (after week 1 of Sept).

@Waztom asks if we can move this job to the DLS cluster (not STFC). Either way @tdudgeon says we need significant resources, which are available at DLS (150TB of GPFS).

@tdudgeon assumes that the functionality of the DLS cluster will be the same as STFC's, as both use SLURM. @Waztom says we can leverage @ConorFWild's experience rather than relying on Chris.

Object storage for IRIS/STFC has not been investigated by @Waztom or @mwinokan.

The CPU requirement is around 2000 cores. @mwinokan has briefly checked and there are around 56 idle nodes with 64 cores and 500 GB of RAM each; both the gpfs03 and gpfs04 filesystems are mounted.

The Postgres server will need to be available as well. Importing the Postgres volume into the DLS cluster will need Diamond IT/SC assistance.

@mwinokan mwinokan added 2024-04-25 pink Stack maintenance/monitoring and removed 2024-04-25 pink Stack maintenance/monitoring labels Aug 22, 2024
@alanbchristie
Collaborator

A simple diagram illustrating non-cluster elements (like DB and NFS server) along with the "expected" shared filesystem: -

[Image: architecture diagram showing the DB, NFS server and shared filesystem]

@mwinokan
Collaborator

mwinokan commented Aug 29, 2024

@ConorFWild says that obtaining 2000 concurrently available cores will need SC to increase the job limits. Graham is the SC contact for this. It's estimated that 2000 cores will take a week or two

Conor suggests that many jobs with fewer cores each will be friendlier to beamline processes (i.e. less likely to interrupt them): 2000 single-core jobs instead of one job with 2000 cores.
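
A minimal sketch of that splitting, assuming a plain SMILES input file and a hypothetical single-core SLURM array job (file names, chunk size and the submission script are illustrative, not the actual fragmentor configuration):

```python
import itertools
from pathlib import Path

def split_smiles(infile: str, outdir: str, chunk_size: int = 100_000) -> int:
    """Split a SMILES file into fixed-size chunks, one per single-core task."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    n = 0
    with open(infile) as fh:
        while True:
            chunk = list(itertools.islice(fh, chunk_size))
            if not chunk:
                break
            (out / f"chunk_{n:05d}.smi").write_text("".join(chunk))
            n += 1
    return n

if __name__ == "__main__":
    n_chunks = split_smiles("molecules.smi", "chunks")
    # each chunk then maps to one SLURM array task, e.g.:
    #   sbatch --array=0-<n_chunks-1> fragment_chunk.sh
    print(f"wrote {n_chunks} chunks")
```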

@alanbchristie to create a SC request ticket. SChelpdesk@diamond.ac.uk

@alanbchristie
Collaborator

SC Request ID: SCHD-5779

@alanbchristie
Collaborator

The SC Request has been shut down and closed as "Won't Do".

I have forwarded the email, but clearly there is no desire to support execution outside of Iris. I will step aside on this topic as there is nothing more I can do.

@alanbchristie alanbchristie removed their assignment Aug 30, 2024
@mwinokan
Collaborator

mwinokan commented Sep 12, 2024

@alanbchristie has made some progress; it will be done on the development cluster. Six new machines with 380 have been created, and playbook work has been initialised. Alan is optimistic and will do a dry run tomorrow. Even including the existing resources, the number of CPUs will be about 3x lower than for the previous run, which took around 3 weeks (on the galaxy cluster).

Matteo is back online tomorrow and apparently the data is ready, somewhere on /dls (~250M compounds). The compound selection process will need to be documented as well.

@phraenquex phraenquex added 2024-04-26 orange Design (RHS) dissemination and removed 2024-06-14 mint Data dissemination 2 labels Sep 17, 2024
@mwinokan
Collaborator

mwinokan commented Sep 24, 2024

@alanbchristie has been testing with a smaller (1M) dataset and optimising parameters. We are ready to fragment a bigger dataset.

There is some confusion regarding the number of molecules; @phraenquex says that there should be a 220M compound file. @matteoferla can provide the path.

There are also duplicate molecules: the same SMILES but different identifiers (within the same supplier). Ideally we would keep all the identifiers, but for our uses it is not a big loss to drop one of them. @tdudgeon will write a pre-processing script to remove duplicates.

@tdudgeon says he can start the 220M compound run tomorrow, and by Friday there should be an ETA.

@mwinokan
Collaborator

To weed out highly fragmentable molecules, @tdudgeon please set a stringent filter (e.g. a maximum of 10 fragments per molecule) and output a list of molecules that have been excluded by that filter, so we can review them and add them back in later.
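
A rough sketch of such a pre-filter is below. The real "num frags" value is computed by the fragment network code itself; the BRICS decomposition here is only a crude stand-in, and the file names are assumptions. The thresholds are the ones discussed in this thread (HAC 36 in the follow-up comment, 10 fragments here).

```python
from rdkit import Chem
from rdkit.Chem import BRICS

MAX_HAC = 36      # heavy atom count cut-off mentioned in the follow-up comment
MAX_FRAGS = 10    # stringent fragment-count cut-off suggested here

with open("standardised.smi") as fin, \
     open("to_fragment.smi", "w") as keep, \
     open("excluded.smi", "w") as rejected:
    for line in fin:
        fields = line.split()
        if not fields:
            continue
        mol = Chem.MolFromSmiles(fields[0])
        if mol is None:
            continue
        hac = mol.GetNumHeavyAtoms()
        n_frags = len(set(BRICS.BRICSDecompose(mol)))   # crude proxy for "num frags"
        target = rejected if (hac > MAX_HAC or n_frags > MAX_FRAGS) else keep
        target.write(line)
```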

@tdudgeon
Collaborator

Here are some example molecules that are currently excluded from fragmentation (heavy atom count > 36 or num frags > 12)

[Image: examples of molecules excluded by the filter]

The good news is that the file is already being created; the bad news is that the IDs are not being written, so we need a small tweak to allow it to be used as an input.

@phraenquex
Collaborator

@tdudgeon - great, we definitely don't need that shit for our scaffolds.

(@mwinokan @Waztom do say if you disagree.)

@matteoferla
Collaborator

matteoferla commented Sep 25, 2024

Sorry all, I should have nudged my query email ages back to get the details straight at the onset.
@tdudgeon, what does num frags > 12 mean? Are those BRICS decompositions?

Re the screenshot: there was a penalty for excessive methylene groups, but not for methyl groups.
Triazoles and tetrazoles are uncommon in Enamine, but triazoles (Huisgen cycloaddition, both 1,4 = Cu and 1,5 = Ru) were enriched by the synthon enrichment, which used custom SMARTS patterns for common late-stage functionalisation reaction products.
Re morpholino groups: hetero-alicyclics such as these were marginally favoured by a filter aimed at limiting polyphenyl chains.
More importantly, the molecules are rich in H-bond donors and have high TPSA. While these properties are usually criticised as bad for permeability, the compounds that bind frequently in XChem have high TPSA and many enthalpic interactions, whereas entropic / hydrophobic / greasy interactions underperform both in the screen and as follow-ups, reflecting the difference between what is best for the crystallographic screen (= well-placed compounds) and what is best for cell-based assays. Consequently, the strongly tailed lower quartile of the greasy Enamine compounds was excluded, but not an upper quantile of high TPSA, which is already filtered by Enamine.
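
For reference, the properties discussed above can be computed straightforwardly with RDKit; the snippet below is purely illustrative (arbitrary example molecule) and does not reproduce Matteo's actual weights or cut-offs.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Arbitrary example molecule; not one of the compounds in the screenshot.
mol = Chem.MolFromSmiles("OCC1CCN(CC1)C(=O)c1cn[nH]c1")

print("TPSA :", Descriptors.TPSA(mol))        # topological polar surface area
print("HBD  :", Descriptors.NumHDonors(mol))  # hydrogen-bond donors
print("cLogP:", Descriptors.MolLogP(mol))     # greasiness proxy
```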

@tdudgeon
Collaborator

num frags is the number of fragments that would be generated by the fragment network

@phraenquex
Collaborator

Thanks @matteoferla - the relevant question at this point is whether they are likely to make "good" scaffolds, i.e. very productive input for Syndirella.

And to summarise what we arrived at yesterday: if the answer is only "maybe" as well as "but they'll take awfully long to fragment and slow down Max getting a functional Syndirnet", then they get thrown onto a "don't do now" pile, and we'll deal with them later, e.g. as a slow-running low-priority fragmentation queue.

@phraenquex
Collaborator

Transferring an email thread for the record - especially as it documents (a bit of) the algorithm.

@matteoferla Sept 14:

I filtered down the 7B molecule catalogues in three steps, followed by a final sort.
6B ==first and second pass==> 220M ==third pass==> 20M ==sorting==> ordered 20M.
The number of compounds that can fit in the network is, I am told, variable in the 10M-20M range, hence the ordered part.

(The order comes from a sum of weighted Winsorised Z-scores, accounting for pharmacophoric trio distance uniqueness, presence of reaction product moieties, presence/absence of certain moieties (e.g. an excessive number of benzenes and alkanes) and regular filters, all with a bonus for those that were a partial superstructure of an XChem library compound, in accordance with Frank's request.)

I sent you a link to a MS OneDrive folder with the ordered 20M compounds. This was not my ideal choice, but GitHub was being problematic —hence my delay.

Warren tells me you might have access to the NFS already, in which case:

The data is in the ‘shared XChem drive’, namely the NFS folder mounted from 192.168.212.119:/mnt/xchem-fragalysis-2 as /opt/xchem-fragalysis-2 in some locations. Let’s say this is $DATA. The folder $DATA/mferla/library_making/selected_final contains the ordered SMILES as mentioned.

The folders in $DATA/mferla/library_making/third_pass contain the conformers. These are clustered by the sorting metric into Z1 (top), Z1-08 (high), Z05-08 (mid) and Z0-05 (low). Within these folders they are not sorted and are split into the files where they came from. Namely, the original Enamine files were split into chunks (e.g. Enamine_REAL_HAC_29_38_1.3B_Part_2_chunk66.sdf.bz2 is the 66th 200M chunk of REAL HAC 29-38).

$DATA/mferla/library_making/second_pass contains SMILES with only basic filters applied, without sorting for pharmacophoric and synthetic properties. It is not needed (it’s the 220M), but I for one would be curious as to what is in ‘2’ after ‘3’ gets mentioned.

Unfortunately, just now I noticed that the HAC27 subset of REAL is missing. I will run it over the weekend and it will be in the third_pass folder.

But the ordered set on OneDrive will need deleting and remaking, which is a problem as I will be travelling on Monday.
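
A minimal sketch of the weighted, Winsorised Z-score ranking described above; the column names, weights and Winsorisation limits here are placeholders, and only the mechanics (winsorise, Z-score, weighted sum, sort) follow the description.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("candidates.csv")            # hypothetical: one numeric column per scored property

weights = {                                   # placeholder weights, not the actual values
    "pharmacophore_uniqueness": 2.0,
    "reaction_product_score": 1.0,
    "undesirable_moiety_count": -1.0,
    "xchem_superstructure_bonus": 0.5,
}

score = np.zeros(len(df))
for col, w in weights.items():
    x = df[col].to_numpy(dtype=float)
    lo, hi = np.percentile(x, [5, 95])
    x = np.clip(x, lo, hi)                    # Winsorise at the 5th/95th percentiles
    score += w * (x - x.mean()) / x.std()     # Z-score each property, then apply the weight

df.assign(score=score).sort_values("score", ascending=False).to_csv("ordered.csv", index=False)
```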

@phraenquex
Collaborator

@alanbchristie 19 Sept:

It’s taken more time than I expected to get to initial processing of the data. One thing that does stand out is the dramatic change in identifier style throughout the file. In the early part of the file it’s a pleasant-looking MCULE-2832428593. Deeper into the file I see things like m_274552____23520642____24972820____13454082. Is the latter identifier OK for users?

@phraenquex
Collaborator

@tdudgeon 24 Sept:

I just wanted to clarify which set of molecules we should be fragmenting.

The files in /opt/xchem-fragalysis-2/mferla/library_making/selected_final seem to add up to about 20M molecules (but the file shorlist0017.1M.cxsmiles.bz2 seems to be corrupt).

We were under the impression that we were to be fragmenting about 200M molecules. Is it those in /opt/xchem-fragalysis-2/mferla/library_making/second_pass that we should be using?

Also, as already discussed, there are a significant number of duplicates (same SMILES but with different IDs). Is this because you are doing a desalting operation or something similar? If so, it looks like the original molecule has been lost.

Not really a problem, just need to know what's going on.

@phraenquex
Collaborator

@tdudgeon 25 Sept:

Further to this, the duplicates DO cause a problem.

I'm assuming it's impractical for Matteo to regenerate without duplicates, so we should generate a simple preprocessing script that removes them from the data.

Presumably it's OK to just keep the first (based on SMILES) and discard any subsequent ones?
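
Something like the sketch below would do for that pre-processing step, assuming a whitespace-separated "SMILES ID" file; for ~200M rows the in-memory set gets large, so the real script may prefer sorting or an on-disk approach instead.

```python
# Keep the first record seen for each SMILES and drop later duplicates.
seen = set()

with open("input.smi") as fin, open("deduplicated.smi", "w") as fout:
    for line in fin:
        if not line.strip():
            continue
        smiles = line.split()[0]
        if smiles in seen:
            continue                          # discard subsequent duplicates
        seen.add(smiles)
        fout.write(line)
```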

@mwinokan mwinokan added the blocked An issue blocked by another internal or 3rd party issue label Oct 24, 2024
@mwinokan
Collaborator

@tdudgeon reports that @alanbchristie has restarted the run

@mwinokan mwinokan removed the blocked An issue blocked by another internal or 3rd party issue label Oct 31, 2024
@alanbchristie
Collaborator

alanbchristie commented Nov 8, 2024

The first 50M molecules (the first chunk of the 250M set) have been fragmented and extracted to Echo/S3. Here's a summary of execution stats for the 250M-C1 run: -

  1. Standardisation: 2h 15m
  2. Fragmentation: 75h 30m
  3. Inchi generation: 15h 00m
  4. Extract CSV: 78h 15m (to dls-echo:/im-fragnet/extract/enamine_mferla/250M-C1)

A total run-time of about 171h (7 days 21 hours) - i.e. approximately 8 days with an extracted set of CSV files totalling 72.688 GiB

It is important to realise that the total run-time may be longer (or shorter) than 8 days because a) each molecule set is different, and b) a human has to start each step once the previous has completed. So if a run completes at 1am the next step may not be started until maybe 8 or 9am that day (adding 8 hours to the run). So, to be conservative, add at least 8 hours to each step, e.g. 24 hours for the above, making the total run-time a more realistic 9 days

@mwinokan
Collaborator

@tdudgeon says the first chunk of 50M (250M-C1) has finished in about 8 days.

@alanbchristie says there was an issue on Sunday night causing the second run to need restarting, but it is now running.

@mwinokan
Collaborator

@alanbchristie suspects a problem and will investigate. Please update in this ticket if anything noteworthy comes up

@mwinokan
Collaborator

mwinokan commented Nov 19, 2024

@alanbchristie reports that the runs stall; a few things have been tried, but Alan has not found a fix yet. There may be an issue with the second set due to the size of the batch / longer execution time (5-6 hours).

@mwinokan
Collaborator

@alanbchristie reports that the fragmentation problem has been resolved and the second 50M batch is nearing completion.

@alanbchristie will then move the fragmentation job to the new cluster to free up hardware for #1451

@alanbchristie
Collaborator

Fragmentation has been successfully moved to the new (k8s 1.30) cluster, and appears to be running as expected. This leaves the old (k8s 1.23) cluster running the extract of the C2 run. When this finishes the old cluster will be free to be removed (after backing up the moldb database and old AWX data).

The C2 extract is expected to complete sometime tomorrow.

The new cluster is fragmenting the 50M molecules from chunk C3. This has already failed 3 times after many hours of execution due to unknown cluster/networking issues.

C3 failed at 00:33:31 on Sunday night and at exactly the same time last night (25th Nov). I have moved to a new Nextflow version which supports parameters that might help with the overnight (network) failures. A 4th execution of the C3 fragment run is now underway.

@alanbchristie
Collaborator

The extract of the 2nd 50 million molecules (C2) is complete, here's a brief summary: -

  1. Standardisation: 1h 55m
  2. Fragmentation: 76h 30m
  3. Inchi generation: 22h 20m
  4. Extract CSV: 71h 30m (to dls-echo:/im-fragnet/extract/enamine_mferla/250M-C2)

The extract actually failed at the end of the run, but the extraction had completed.

A total run-time similar to the C1 run of about 172h (7 days 22 hours) - i.e. approximately 8 days with an extracted set of CSV files totalling 74 GiB

@phraenquex
Collaborator

Thanks @alanbchristie for the super informative breakdown.

How well could these long jobs parallelise? Would the time shrink once you engineer use of Diamond's (much more) infinite-ish CPUs?

@alanbchristie
Collaborator

The actual 'fragment' generation in fragmentation is executed in parallel, so that really just depends on cores, but a significant majority of the work involves database interactions, some of which run in parallel. Others might benefit from a re-write.

There's also the 'quality' of the hardware. For example, I have had to start the third (C3) fragmentation run four times since last Friday, after each attempt was affected by some sort of underlying cloud/network issue.

With regard to the 200M, I am hoping we have enough time left to run the C3, C4 and C5 "chunks". C3 is already a day into its 3-day fragment run. But if we continue to be disrupted by the hardware we will not have enough time. I think we're looking at 24 days still to go, and it's nearly December.

@phraenquex
Collaborator

Thanks. Run as much as you can, and whatever remains just doesn't make it in - no disaster.

(Scientifically, it's precisely equivalent to Matteo having run some filter more stringently, or the vendors having a slightly smaller catalogue. Whatever.)

When you get to reengineering it (for future iterations):

  • How can DB interactions be scaled? (DB on a larger node?)
  • More robust handling of failures by the dispatcher mechanism.

@mwinokan
Collaborator

@alanbchristie had to restart the run after the NFS server outage. The third batch is running again now

@mwinokan
Collaborator

mwinokan commented Dec 3, 2024

@tdudgeon says that the database itself is probably at the peak of its performance, even if allocated more resources. The parallel data manipulation in Postgres is the limiting factor.

Tim suggests a refactor using Kafka message queues to improve fragmentation performance beyond the estimated 2x gain from allocating more resources to the DB, but this is a large body of work.

@mwinokan
Collaborator

mwinokan commented Dec 5, 2024

@alanbchristie says that we are hitting networking issues but the job is running again

@alanbchristie
Collaborator

alanbchristie commented Dec 10, 2024

The extract of the 3rd 'chunk' of 50 million molecules (C3) is nearly complete (with just the extract play to run), here's a brief summary: -

  1. Standardisation: 1h 52m
  2. Fragmentation: 61h 28m
  3. Inchi generation: 17h 05m
  4. Extract CSV: 28h 43m (to dls-echo:/im-fragnet/extract/enamine_mferla/250M-C3)

A total run-time of about 110h (4.5 days) - with an extracted set of CSV files totalling 73 GiB

The improvement in execution time is potentially due to the revised database configuration used on this run.

@mwinokan
Collaborator

mwinokan commented Dec 17, 2024

@alanbchristie reports that a new failure happened over the weekend, but the playbooks have been patched to allow for faster/easier restarts.

We're now working on batch 4

@Waztom asks about documentation (see #1619) and @alanbchristie says that the READMEs are up to date. @alanbchristie please link to the repositories here.

@alanbchristie
Collaborator

alanbchristie commented Dec 18, 2024

We're running fragmentation using machines in the Development cluster. It is written in Ansible and uses two container images (the actual fragmentor and a player that orchestrates each play).

The container images are built from the fragmentor repository and Kubernetes orchestration takes place from within the fragmentor-k8s-orchestration repository.

@alanbchristie
Collaborator

The extract of the 4th 'chunk' of 50 million molecules (C4) is complete. Here's a brief summary: -

  1. Standardisation: (about) 1h 45m
  2. Fragmentation: (about) 60h
  3. Inchi generation: 22h 28m
  4. Extract CSV: 54h 18m (to dls-echo:/im-fragnet/extract/enamine_mferla/250M-C4)

A total run-time of about 138h (5 days) - with an extracted set of CSV files totalling 92 GiB

@alanbchristie
Collaborator

alanbchristie commented Jan 7, 2025

The extract of the 5th 'chunk' of 50 million molecules (C5) is complete. Here's a brief summary: -

  1. Standardisation: (about) 1h 45m
  2. Fragmentation: (about) 80h 26m
  3. Inchi generation: 24h 23m
  4. Extract CSV: 58h 54m (to dls-echo:/im-fragnet/extract/enamine_mferla/250M-C5)

A total run-time of about 166h (7 days) - with an extracted set of CSV files totalling 96 GiB

@alanbchristie
Collaborator

alanbchristie commented Jan 7, 2025

  • Approximate (average) processing time (per 50 million chunk): 146h (6 days)
  • Compressed extracted output (all 5 chunks): 407 GiB

@mwinokan
Collaborator

@alanbchristie's bandwidth is limited until after the workshop. @alanbchristie will need to work with @Waztom to come up with a strategy to de-duplicate hundreds of GB of data.

@mwinokan
Collaborator

mwinokan commented Jan 28, 2025

The fragmented molecules from the five separate runs, and the existing graph database, have to be combined into a single graph. The combine process already exists and includes de-duplication of molecules. The combination previously took weeks on many cores, and these jobs currently cannot be interrupted.

@alanbchristie is working on a dry-run of the combination step (ETA today) to get an estimate of the resources required on IRIS

@Waztom suggests aggregating on molecular weight, but @alanbchristie says that it would need changes to the playbooks and @tdudgeon says it may duplicate the existing hashing

@phraenquex says that we are unable to change the IRIS resources so more work may be needed to introduce checkpointing

@tdudgeon says that molecular weight, heavy atom count, and/or molecular formula could be used to split the combination into chunks in which duplicates may exist. Checkpointing will be difficult, but if a small job fails it could be re-run. @Waztom says this will help because you won't have to check all nodes, but only a subset with the same molecular weight.

@mwinokan
Collaborator

mwinokan commented Jan 30, 2025

@tdudgeon has been exploring the molecular properties binning for de-duplication.

Tim says that using molecular weight concerns him: the distribution will be roughly normal, so the bins around the median molecular weight will be very large.

@phraenquex says that the key thing is that you cluster fragments with the smallest predictable sets of other fragments that could possibly be duplicates (e.g. they must have the same molecular properties)

@tdudgeon says that hashing would also achieve this partitioning, i.e. group fragments based on a hash of their SMILES, and you will only get very few collisions.
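
A minimal sketch of that hash-based partitioning (bucket count, file names and column order are assumptions): route each record to one of N buckets by a stable hash of its SMILES, so duplicates can only ever collide within the same bucket, and each bucket can be de-duplicated or re-run independently.

```python
import hashlib

N_BUCKETS = 256                               # assumed bucket count

def bucket_for(smiles: str) -> int:
    """Stable bucket index derived from the SMILES string."""
    digest = hashlib.sha1(smiles.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % N_BUCKETS

# Route each node record into its bucket file.
buckets = [open(f"nodes_bucket_{i:03d}.csv", "w") for i in range(N_BUCKETS)]
try:
    with open("nodes.csv") as fin:
        for line in fin:
            smiles = line.split(",")[0]       # assumed: SMILES in the first column
            buckets[bucket_for(smiles)].write(line)
finally:
    for fh in buckets:
        fh.close()
```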

@mwinokan
Collaborator

mwinokan commented Feb 4, 2025

@alanbchristie is waiting for discussions with @tdudgeon on the molecule hash binning

@mwinokan
Collaborator

mwinokan commented Feb 11, 2025

@tdudgeon is prototyping the new combine step and will work with @alanbchristie on the implementation tomorrow.

(@tdudgeon to check confluence access, and then @Waztom can kick off the contractor onboarding for @alanbchristie)

@phraenquex
Collaborator

Code written to handle the new combine algorithm; now integrating into NextFlow.
