New fragmentation run #1486
Chris Reynolds is able to assist after shutdown (after week 1 of Sept). @Waztom asks if we can move this job to the DLS cluster (not STFC). Either way @tdudgeon says we need significant resources, which are available at DLS (150TB of GPFS). @tdudgeon assumes that the functionality of the DLS cluster will be the same as STFC's, as both use SLURM. @Waztom says we can leverage @ConorFWild's experience rather than relying on Chris. Object storage for IRIS/STFC has not been investigated by @Waztom or @mwinokan. The CPU requirement is around 2000 cores. @mwinokan has briefly checked and there are around 56 nodes with 64 cores and 500GB of RAM idle. The Postgres server will need to be available as well. Importing the Postgres volume into the DLS cluster will need Diamond IT/SC assistance.
@ConorFWild says that obtaining 2000 concurrently available cores will need SC to increase the job limits. Graham is the SC contact for this. It's estimated that the work will take a week or two with 2000 cores. Conor suggests that many jobs with fewer cores each will be friendlier to beamline processes, i.e. 2000 jobs with one core each instead of one job with 2000 cores. @alanbchristie to create an SC request ticket (SChelpdesk@diamond.ac.uk).
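For illustration only, a minimal sketch of the "many small jobs" approach Conor describes: chunk the input and submit one single-core SLURM job per chunk. The file names and the `fragment_chunk.py` command are hypothetical, not part of the real machinery.

```python
"""Illustrative only: split the input into chunks and submit one single-core
SLURM job per chunk, rather than one large multi-core allocation.
'molecules.smi' and 'fragment_chunk.py' are hypothetical names."""
import subprocess
from pathlib import Path

SMILES_FILE = Path("molecules.smi")   # hypothetical input file
CHUNK_SIZE = 100_000                  # molecules handled by each single-core job
OUT_DIR = Path("chunks")
OUT_DIR.mkdir(exist_ok=True)


def submit(chunk_path: Path) -> None:
    # One core per job, so SLURM can schedule these around beamline workloads.
    subprocess.run(
        ["sbatch", "--cpus-per-task=1", "--mem=4G",
         f"--job-name=frag-{chunk_path.stem}",
         f"--wrap=python fragment_chunk.py {chunk_path}"],
        check=True,
    )


lines, chunk_id = [], 0
with SMILES_FILE.open() as fh:
    for line in fh:
        lines.append(line)
        if len(lines) >= CHUNK_SIZE:
            path = OUT_DIR / f"chunk_{chunk_id:05d}.smi"
            path.write_text("".join(lines))
            submit(path)
            lines, chunk_id = [], chunk_id + 1
if lines:
    path = OUT_DIR / f"chunk_{chunk_id:05d}.smi"
    path.write_text("".join(lines))
    submit(path)
```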
SC Request ID: SCHD-5779
The SC Request has been shut down and closed as "Won't Do". I have forwarded the email, but clearly there is no desire to support execution outside of IRIS. I will step aside on this topic as there is nothing more I can do.
@alanbchristie has made some progress; it will be done on the development cluster. 6 new machines with 380 have been created, and playbook work has been initialised. Alan is optimistic and will do a dry-run tomorrow. Even including the existing resources, the number of CPUs will be about 3x less than for the previous run, which took around 3 weeks (on the Galaxy cluster). Matteo is back online tomorrow and apparently the data is ready and somewhere on
@alanbchristie has been testing with a smaller (1M) dataset and optimising parameters. We are ready to fragment a bigger dataset. There is some confusion regarding the number of molecules; @phraenquex says that there should be a 220M compound file. @matteoferla can provide the path. There are also duplicate molecules: the same SMILES but different identifiers (within the same supplier). Ideally we would keep all the identifiers, but for our uses it is not a big loss to lose one of them. @tdudgeon will write a pre-processing script to remove duplicates. @tdudgeon says he can start the 220M compound run tomorrow, and then by Friday there should be an ETA.
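For illustration only, a minimal sketch of what such a de-duplication pass could look like: keep the first identifier seen for each SMILES and record the ones that are dropped. The `SMILES<TAB>identifier` layout and file names are assumptions, not the actual script.

```python
"""Illustrative de-duplication pass: keep the first identifier seen for each
SMILES and write dropped records to a separate file for the record.
The SMILES<TAB>identifier layout and file names are assumptions."""
import csv

seen = set()
dropped = 0
with open("input.smi") as src, \
        open("deduplicated.smi", "w", newline="") as dst, \
        open("dropped_duplicates.smi", "w", newline="") as dup:
    reader = csv.reader(src, delimiter="\t")
    keep = csv.writer(dst, delimiter="\t")
    lose = csv.writer(dup, delimiter="\t")
    for smiles, identifier in reader:
        if smiles in seen:
            lose.writerow([smiles, identifier])  # the identifier we lose
            dropped += 1
        else:
            seen.add(smiles)
            keep.writerow([smiles, identifier])
print(f"Dropped {dropped} duplicate records")
```

For 220M records the in-memory set becomes large, so in practice a sort-based pass or sharding by a hash of the SMILES may be preferable.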
To weed out highly fragmentable molecules, @tdudgeon please set a stringent filter (e.g. 10 fragments per molecule maximum) and output a list of the molecules that have been excluded by that filter, so we can review and add them back in later.
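As an illustration of the requested book-keeping (not the fragmentor's actual mechanism), a sketch that splits a hypothetical per-molecule fragment-count report into kept and excluded lists:

```python
"""Illustrative book-keeping for the stringent filter: split a hypothetical
per-molecule fragment-count report into 'kept' and 'excluded' lists, so the
excluded molecules can be reviewed and added back later."""
import csv

MAX_FRAGMENTS = 10  # the stringent cut-off suggested above

with open("fragment_counts.csv") as src, \
        open("kept.csv", "w", newline="") as kept_file, \
        open("excluded.csv", "w", newline="") as excluded_file:
    reader = csv.DictReader(src)          # assumed columns: smiles,n_fragments
    kept = csv.writer(kept_file)
    excluded = csv.writer(excluded_file)
    for row in reader:
        writer = kept if int(row["n_fragments"]) <= MAX_FRAGMENTS else excluded
        writer.writerow([row["smiles"], row["n_fragments"]])
```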
Sorry all, I should have nudged my query email ages back to get the details straight at the onset. Re the screenshot: there was a penalty for excessive methylene groups, but not for methyl groups.
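For anyone following along, counting the two groups is straightforward with RDKit SMARTS; this is just an illustration of the distinction, not the actual scoring code.

```python
"""Illustration of the distinction (not the actual scoring code): count
methylene ([CH2]) groups, which were penalised, and methyl ([CH3]) groups,
which were not, using RDKit SMARTS matching."""
from rdkit import Chem

METHYLENE = Chem.MolFromSmarts("[CH2]")
METHYL = Chem.MolFromSmarts("[CH3]")


def methylene_and_methyl_counts(smiles: str) -> tuple[int, int]:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Unparseable SMILES: {smiles}")
    return (len(mol.GetSubstructMatches(METHYLENE)),
            len(mol.GetSubstructMatches(METHYL)))


# A long aliphatic chain racks up methylenes and would attract the penalty.
print(methylene_and_methyl_counts("CCCCCCCCCC"))  # decane -> (8, 2)
```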
Thanks @matteoferla - the relevant question at this point is whether they are likely to make "good" scaffolds, i.e. very productive input for Syndirella. And to summarise what we arrived at yesterday: if the answer is only "maybe", as well as "but they'll take awfully long to fragment and slow down Max getting a functional Syndirnet", then they get thrown onto a "don't do now" pile and we'll deal with them later, e.g. as a slow-running low-priority fragmentation queue.
Transferring an email thread for the record - especially as it documents (a bit) the algorithm. @matteoferla Sept 14:
@alanbchristie 19 Sept:
@tdudgeon 24 Sept:
@tdudgeon 25 Sept:
@tdudgeon reports that @alanbchristie has restarted the run.
The first 250M molecules have been fragmented and extracted to Echo/S3. Here's a summary of execution stats for the
A total run-time of about 171h (7 days 21 hours) - i.e.
@tdudgeon says the first chunk of 50M (…). @alanbchristie says there was an issue on Sunday night that meant the second run needed restarting, but it is now running.
@alanbchristie suspects a problem and will investigate. Please update in this ticket if anything noteworthy comes up.
@alanbchristie reports that the runs stall; a few things have been tried but Alan has not found a fix yet. There may be an issue with the second set due to the size of the batch / longer execution time (5-6 hours).
@alanbchristie reports that the fragmentation problem has been resolved and the second 50M batch is nearing completion. @alanbchristie will then move the fragmentation job to the new cluster to free up hardware for #1451.
Fragmentation has been successfully moved to the new (k8s 1.30) cluster, and appears to be running as expected. This leaves the old (k8s 1.23) cluster running the extract of the
The new cluster is fragmenting the 50M molecules from chunk
The extract of the 2nd 50 million molecules (C2) is complete. Here's a brief summary: -
A total run-time similar to the C1 run of about 172h (7 days 22 hours) - i.e. approximately 8 days, with an extracted set of CSV files totalling 74 GiB
Thanks @alanbchristie for the super informative breakdown. How well could the long jobs parallelise? Would the run-time shrink once you engineer use of Diamond's (much more) infinite-ish available CPUs?
The actual 'fragment' generation in fragmentation is executed in parallel, so that really just depends on cores, but a significant majority of the work involves database interactions, some of which are run in parallel. Others might benefit from a re-write. There's also the 'quality' of the hardware: for example, I have had to start the third (C3) fragmentation run four times since last Friday, after each attempt was affected by some sort of underlying cloud/network issue.
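To make the shape of that concrete, here is a rough sketch (not the fragmentor's real code) of the split described above: per-molecule fragment generation parallelises across cores, while results still funnel through the database in batches. `fragment_one()` and `write_batch()` are placeholders.

```python
"""Not the fragmentor's real code - just the shape described above: the
per-molecule 'fragment' step scales with cores, while results still funnel
through the database in batches. fragment_one() and write_batch() are
placeholders for the real work."""
from itertools import islice
from multiprocessing import Pool


def fragment_one(smiles: str) -> list[str]:
    # CPU-bound: break one molecule into its fragments (placeholder).
    return [smiles]


def write_batch(rows: list[list[str]]) -> None:
    # Database interaction: largely I/O-bound and harder to parallelise.
    pass


def batches(iterable, size):
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


if __name__ == "__main__":
    with open("molecules.smi") as fh, Pool() as pool:
        smiles = (line.split()[0] for line in fh)
        for batch in batches(smiles, 10_000):
            fragments = pool.map(fragment_one, batch)   # parallel across cores
            write_batch(fragments)                      # serial bottleneck
```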
Thanks. Run as much as you can, and whatever remains just doesn't make it in - no disaster. (Scientifically, it's precisely equivalent to Matteo having run some filter more stringently, or the vendors having a slightly smaller catalogue. Whatever.) When you get to reengineering it (for future iterations):
@alanbchristie had to restart the run after the NFS server outage. The third batch is running again now.
@tdudgeon says that the database itself is probably at the peak of its performance, even if allocated more resources. The parallel data manipulation in Postgres is the limiting factor. Tim suggests that a refactor using Kafka message queues could improve the fragmentation performance beyond the estimated 2x gain from allocating more resources to the DB, but this is a large body of work.
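Purely as a sketch of the direction Tim suggests (not a design): fragment producers could publish records to a Kafka topic for database loaders to consume at their own pace, decoupling fragment generation from Postgres write throughput. The broker address and topic name below are assumptions, using the kafka-python client.

```python
"""Sketch of the suggested direction only (not a design): fragment producers
publish records to a Kafka topic and database loaders consume them at their
own pace, decoupling fragment generation from Postgres write throughput.
Broker address and topic name are assumptions; uses the kafka-python client."""
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",               # assumed broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def publish_fragments(parent_smiles: str, fragments: list[str]) -> None:
    # Records go onto the queue; downstream loaders write them to the DB.
    producer.send("fragments", {"parent": parent_smiles, "fragments": fragments})


publish_fragments("c1ccccc1CCO", ["c1ccccc1", "CCO"])
producer.flush()
```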
@alanbchristie says that we are hitting networking issues, but the job is running again.
The 3rd 'chunk' of 50 million molecules (C3) is nearly complete (with just the extract play to run); here's a brief summary: -
A total run-time of about 110h (4.5 days) - with an extracted set of CSV files totalling 73 GiB
@alanbchristie reports that a new failure happened over the weekend, but the playbooks have been patched to allow for faster/easier restarts. We're now working on batch 4. @Waztom asks about documentation (see #1619) and @alanbchristie says that the READMEs are up to date. @alanbchristie please link to the repositories here.
We're running fragmentation using machines in the Development cluster. It is written in Ansible and uses two container images (the actual fragmentor and a player that orchestrates each play). The container images are built from the
The extract of the 4th 'chunk' of 50 million molecules (C4) is complete. Here's a brief summary: -
A total run-time of about 138h (5 days) - with an extracted set of CSV files totalling 92 GiB
The extract of the 5th 'chunk' of 50 million molecules (C5) is complete. Here's a brief summary: -
A total run-time of about 166h (7 days) - with an extracted set of CSV files totalling 96 GiB
@alanbchristie's bandwidth is limited until after the workshop. @alanbchristie will need to work with @Waztom to come up with a strategy to deduplicate hundreds of GB of data.
The fragmented molecules from the five separate runs and the existing graph database have to be combined into a single graph. The combine process already exists and includes de-duplication of molecules. The combination previously took weeks on many cores, and these jobs currently cannot be interrupted. @alanbchristie is working on a dry-run of the combination step (ETA today) to get an estimate of the resources required on IRIS. @Waztom suggests aggregating on molecular weight, but @alanbchristie says that it would need changes to the playbooks, and @tdudgeon says it may duplicate the existing hashing. @phraenquex says that we are unable to change the IRIS resources, so more work may be needed to introduce checkpointing. @tdudgeon says that molecular weight, heavy atom count, and/or molecular formula could be used to split the combination into chunks within which duplicates may exist. Checkpointing will be difficult, but if a small job fails it could be re-run. @Waztom says this will help because you won't have to check all nodes, only a subset with the same molecular weight.
@tdudgeon has been exploring molecular-property binning for de-duplication. Tim says that using molecular weight concerns him: the weights follow a roughly normal distribution, so the bins around the median molecular weight will be very large. @phraenquex says that the key thing is to cluster fragments with the smallest predictable sets of other fragments that could possibly be duplicates (e.g. they must have the same molecular properties). @tdudgeon says that hashing would also achieve this partition, i.e. group fragments based on a hash of their SMILES, and you will only get very few collisions.
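A minimal sketch of that hash-binning idea (illustrative only; bin count and file layout are assumptions): any two records of the same canonical SMILES hash to the same bin, so de-duplication can run independently per bin, and a failed bin can simply be re-run, which gives coarse checkpointing.

```python
"""Minimal sketch of SMILES-hash binning (bin count and file layout are
assumptions): two records of the same canonical SMILES always hash to the
same bin, so de-duplication runs independently per bin and a failed bin can
simply be re-run."""
import hashlib

N_BINS = 256  # small enough that each bin's de-duplication job stays modest


def bin_for(canonical_smiles: str) -> int:
    digest = hashlib.sha256(canonical_smiles.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % N_BINS


def shard(input_path: str) -> None:
    # Append each record to its bin's file; duplicates can only collide
    # within a single bin, never across bins.
    handles = {}
    try:
        with open(input_path) as src:
            for line in src:
                smiles = line.split()[0]
                b = bin_for(smiles)
                if b not in handles:
                    handles[b] = open(f"bin_{b:03d}.smi", "a")
                handles[b].write(line)
    finally:
        for fh in handles.values():
            fh.close()
```

Unlike molecular-weight bins, hash bins come out roughly uniform in size, which avoids the very large middle bins that are the concern above.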
@alanbchristie is waiting for discussions with @tdudgeon on the molecule hash binning.
@tdudgeon is prototyping the new combine step and will work with @alanbchristie on the implementation tomorrow (@tdudgeon to check Confluence access, and then @Waztom can kick off the contractor onboarding for @alanbchristie).
Code written to handle the new combine algorithm; now integrating into NextFlow. |
Matteo is providing a new set of molecules that need to be added to the graph database. The expectation is that there will be about 200M molecules.
To process this we need to re-instate the fragmentation machinery, and as the database will now not be co-located with the compute cluster we need to make minor changes to the process to copy files between clusters (a sketch of that copy step follows below).
We aim to process this data as a new dataset (e.g. starting with an empty database) and then use the combine play to combine it with the old data. Hence the process will look like this:
We plan to run a small test dataset through this process to make sure that it's running properly.
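Regarding the copy step mentioned above, a minimal sketch of the sort of minor change involved: pushing the extracted files to the database host with rsync. The host name and paths are placeholders.

```python
"""Sketch of the copy step between clusters mentioned above: push the
extracted files to the database host with rsync. Host name and paths are
placeholders."""
import subprocess


def copy_extract(local_dir: str, remote: str = "dbhost:/data/extracts/") -> None:
    # -a preserve attributes, -z compress in transit, --partial allow resume
    subprocess.run(
        ["rsync", "-az", "--partial", "--progress", local_dir, remote],
        check=True,
    )


copy_extract("extract/new-dataset/")
```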