Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix squonk job execution #1559

Open
mwinokan opened this issue Oct 31, 2024 · 15 comments
Open

Fix squonk job execution #1559

mwinokan opened this issue Oct 31, 2024 · 15 comments
Assignees
Labels
2024-04-26 orange Design (RHS) dissemination 2024-10-31 oscars WP1 OSCARS grant WP/task 1

Comments

@mwinokan
Copy link
Collaborator

mwinokan commented Oct 31, 2024

@tdudgeon and @alanbchristie please give us an estimate for the amount of work needed to update the deployments to use the new data model and resuscitate the algorithms

@mwinokan mwinokan converted this from a draft issue Oct 31, 2024
@mwinokan mwinokan moved this from Backlog to Orange in Fragalysis Oct 31, 2024
@mwinokan mwinokan added the 2024-04-26 orange Design (RHS) dissemination label Oct 31, 2024
@mwinokan
Copy link
Collaborator Author

mwinokan commented Nov 14, 2024

@tdudgeon says that the work needed depends largely on whether the shared filesystem between Fragalysis and Squonk is needed.

@phraenquex says that the first step is to get the job submission working again and that conversations with SC have ruled out a shared file system between the Fragalysis stack and thread for job execution

@alanbchristie A schematic of the architecture for Fragalysis and Squonk would help immensely

@alanbchristie
Copy link
Collaborator

alanbchristie commented Nov 18, 2024

Here are some references for the original design: -

The original high-level design for Squonk Job Execution in Fragalysis: -
With references to various google docs
#839

A simplified low-level design document (Fragalysis/Squonk AC (937) LLD)
https://docs.google.com/document/d/1lFpN29dK1luz80lwSGi0Rnj1Rqula2_CTRuyWDUBu14/edit?tab=t.0#heading=h.y56jkt6kfj5y

A brief but practical documentation on how to use Job Execution on the b/e ReadTheDocs site: -
https://fragalysis-stack-kubernetes.readthedocs.io/en/latest/architecture/index.html
See the section: - Squonk Integration

An issue discussing the global Job Configuration
#944

Other related issues include: -

Fix Squonk access control
An early and detailed discussion of how Fragalysis and Squonk objects are related
#937
Implement job status refresh
Relates to availability of a token and a refresh mechanism in the F/E
An attempt to review job status that may have bene lost (due to callback errors)
#872

API

job_request: https://fragalysis.xchem.diamond.ac.uk/api/job_request/
job_file_transfer: https://fragalysis.xchem.diamond.ac.uk/api/job_file_transfer/
job_callback: https://fragalysis.xchem.diamond.ac.uk/api/job_callback/
job_config: "https://fragalysis.xchem.diamond.ac.uk/api/job_config/
job_override: https://fragalysis.xchem.diamond.ac.uk/api/job_override/
job_access: https://fragalysis.xchem.diamond.ac.uk/viewer/job_access/

Views (in viewer.views)

  • JobRequestView [GET, POST]
  • JobFileTransferView [GET, POST]
  • JobCallbackView [GET, PUT] (PUT used by Squonk)
  • JobConfigView [GET]
  • JobOverrideView [GET, POST]
  • JobAccessView [GET]

Basic flow

  1. F/E initiates a File Transfer
    Providing: -
    • Target Access String (Project/visit)
    • Target
    • Snapshot
    • Session Project
    • Proteins
    • Compounds
      The B/E then transfers files to Squonk using the User and Target
      as the basis for the Project name. Each Stack also has
      adds a unique prefix so that projects from Alan on Stack X
      are different from those on Stack Y. A typical Squonk Project name
      will be Fragalysis fs-abc achristie::lb18141-1
  2. F/E initiates a Job by using Job Request
    Providing: -
    • Target Access String (Project/visit)
    • Target
    • Snapshot
    • Session Project
    • Squonk Job Name
    • Squonk Job Spec
  3. Squonk calls into Fragalysis using the Job Callback API
  4. The B/E responds to callbacks that indicate 'completion' by fetching the generated file from Squonk's filesystem and then uploading it as a new "computed/compound set'.
  5. F/E polls Job Request until complete

Transferring Target files (to squonk)

After a lot of "preamble" logic the JobFileTransferView: -

  • Creates a FileTransfer record
  • Runs the Transfer logic as a Celery task. This is accomplished by process_job_file_transfer.delay(...) in tasks.py, which is given the ID of the JobTransfer record. The record contains JSON record listing the proteins and compounds required (amongst other things).
  • Ultimately this code relies on logic in squonk_job_file_transfer.py to locate the files and then use the Data Manager API to send them to the Squonk filesystem (using its put_unmanaged_project_files() API method).

Uploading Job results (from squonk)

The JobCallbackView received notifications about completed Jobs. When a job is complete: -

  • A number of "chained" Celery tasks are executed: -
    • process_compound_set_job_file(...)
    • validate_compound_set()
    • process_compound_set()
    • erase_compound_set_job_material()

These four functions retrieve the SD file (and JSON parameters) from Squonk and 'format' the content to better fit Fragalysis, then simply validate and process the file (using MolOps) to create a new "compound set" before removing the transient files.

Effort

If Squonk Job execution remains the same, as does the configuration, then the only things broken will be the file transfer's ability to locate the files. We think the RHS upload will still work, but might need an adjustment. In summary: -

  • 0.5 days to reconfigure squonk in the new cluster and make sure its API works
  • 2 days to adjust the file transfer code, test and release the new logic

Maybe 2-3 days?

@mwinokan
Copy link
Collaborator Author

@alanbchristie asks if the f/e feature for starting squonk jobs is still there

@boriskovar-m2ms says it's there but won't work with the new data format, and expects there to be several issues

@mwinokan mwinokan added the 2024-11-27 oscars WP3 OSCARS grant WP/task 3 label Nov 26, 2024
@phraenquex phraenquex moved this from Orange to In Progress (DEV) in Fragalysis Dec 10, 2024
@mwinokan mwinokan added 2024-10-31 oscars WP1 OSCARS grant WP/task 1 and removed 2024-11-27 oscars WP3 OSCARS grant WP/task 3 labels Dec 10, 2024
@alanbchristie
Copy link
Collaborator

alanbchristie commented Jan 7, 2025

Before we start to tackle the broken Squonk/2 job execution I think it would be extremely valuable if we first used the existing fragalysis-stack-behaviour-tests repo to write a small number of behaviour (gherkin) CI tests (based on a clean deployment of the latest stack code).

The repo is functional and is already able to create stacks and run tests against the Fragalysis Stack REST API.

So let's use this repository to write some business-level tests that will help to document the feature and define its precise behaviour. Tests that will: -

  1. Load (and verify) reference target data
  2. Run (and verify) the execution of Squonk Jobs on the target

While we're doing this any work we do is clearly going to benefit XCHem-Align automated (CI) testing as we'll build the foundations of a test suite that could test all sorts of future target loading behaviour (two birds, one stone?).

Image

The test language (gherkin) is also extremely useful in that it's very high level and can (should) be understood by anyone, programmer or not.

In fact we'll need to automate the loading of targets before we can test job execution.

A target loader "behaviour" test

A typical (and real) behaviour test for target loading might look like this: -

Feature: Verify legitimate Public Targets can be loaded into the stack

  This test ensures that 'well known' target data (generated by XChem-Align)
  can be loaded into a new stack against the public proposal,
  while also verifying the results. We rely on a clean stack
  and a CAS-authenticated user.

  Background: Start with a new (empty) stack
    Given an empty behaviour stack tagged latest
    Then the stack landing page should return http 200

  Scenario Template: Load a public target
    Given I can login to the behaviour stack
    When I load the target <target> from <file> under TAS <tas>
    Then the operation should return a status of CREATED
    And the operation should return a Task ID
    And the Task should complete within 10 minutes
    And I should find target <target> in the stack against TAS <tas>

    Examples:
    | target  | file        | tas       |
    | A71EV2A | A71EV2A.zip | lb18145-1 |

A Squonk/2 job execution "behaviour" test

A typical (and real) behaviour test for Squonk/2 job execution might start-out looking like this: -

Feature: Verify the stack can run Squonk/2 Jobs

  Here we check that a properly configured to run Squonk/2 Jobs.
  Tests typically load target data, run a Job and verify the results.

  Background: Start with a new (empty) stack
    Given an empty behaviour stack tagged latest
    Then the stack landing page should return http 200
    And the stack should support Squonk/2 Job execution

  Scenario Template: Run a Squonk/2 Job
    Given I can login to the behaviour stack
    And I load the target <target> from <file> under TAS <tas>
    And the target has been loaded successfully
    When I run job <job> on the target <target>
    Then the operation should return a status of CREATED
    And the operation should return a Job Request ID
    And the Job should complete within 10 minutes
    And I should find the result loaded into the stack

    Examples:
    | target  | file        | tas       | job     |
    | A71EV2A | A71EV2A.zip | lb18145-1 | Job XYZ |

@mwinokan
Copy link
Collaborator Author

mwinokan commented Jan 7, 2025

@alanbchristie clarifies that the behaviour testing has been implemented in another repository and is able to create a fresh stack on the development cluster for testing.

@tdudgeon please provide @alanbchristie with the two test data sets from XCA #1588 and help him write a minimal LHS upload test

@mwinokan
Copy link
Collaborator Author

mwinokan commented Jan 9, 2025

@alanbchristie has implemented daily automated testing of two XChem targets in the [fragalysis-stack-behaviour-tests](https://github.com/xchem/fragalysis-stack-behaviour-tests repository)

@alanbchristie says that @kaliif will be needed to help locate the files needed for job executing in the v2 schema

@alanbchristie says that getting jobs running via the API is easier to achieve than getting the frontend UI support for jobs resuscitated

The API to launch jobs passes JSON which exposes specific files to the algorithms running in the job

@mwinokan
Copy link
Collaborator Author

mwinokan commented Jan 14, 2025

@alanbchristie says that @kaliif is not needed continuously for this ticket

@alanbchristie is believes that the job execution via API will be ready by the hackathon

@alanbchristie
Copy link
Collaborator

alanbchristie commented Jan 15, 2025

API-based Job execution has been restored using minor code modifications on a development branch of the backend.

Automated tests that create a stack, load suitable target data, and run the fragmenstein-combine Job can be found in the fragalysis-stack-behaviour-tests repository in the squonk-basic-job-execution.feature feature file.

These tests now execute automatically, every day.

As a basic reminder of the existing Job execution logic, I've produced a couple of block diagrams. One illustrates the file transfer step, and the other the job execution step: -

Image

Image

Image

Image

@alanbchristie
Copy link
Collaborator

One problem we have with the current Job is the content of the generated SD-File, which has the required ref_mols property. The property, for the test I run is populated with the following content: -

>  <ref_mols>  (1) 
A71EV2A-x0202_A_147_1_A71EV2A-x3977+A+202+1_LIG,A71EV2A-x0202_A_201_1_A71EV2A-x0488+A+147+1_LIG

These ref_mols strings are not compatible with the existing stack RHS logic, which appears to expect the string to contain a SiteObservation code (something like A0202a or A0202b). These strings obviously do not contain the code, so creating of RHS data currently relies on a hard-coded hack for the current test.

What is the recommended fix for this?

@mwinokan
Copy link
Collaborator Author

@alanbchristie what are the input files to fragmenstein that produced an output with the long-codes in the SDF? I believe fragmenstein uses the molecule names from the input SD files.

In case we need to map between the short and long observation codes we can use the metadata.csv we serve in the downloads

@alanbchristie
Copy link
Collaborator

alanbchristie commented Jan 15, 2025

The files used in the test (for fragmenstein-combine) are: -

    "proteins": [
      "target_loader_data/A71EV2A_lb18145-1/upload_1/aligned_files/A71EV2A-x0152/A71EV2A-x0152_A_201_1_A71EV2A-x3977+A+202+1_apo-desolv.pdb",
    ],
    "compounds": [
      "target_loader_data/A71EV2A_lb18145-1/upload_1/aligned_files/A71EV2A-x0202/A71EV2A-x0202_A_147_1_A71EV2A-x3977+A+202+1_ligand.mol",
      "target_loader_data/A71EV2A_lb18145-1/upload_1/aligned_files/A71EV2A-x0202/A71EV2A-x0202_A_201_1_A71EV2A-x0488+A+147+1_ligand.mol",
    ],

So we either need to a) manipulate the outgoing file (but remember this is just one example) or b) add new information on the way out to Squonk or c) do a much more complicated search for the original SiteObservations for the ref_mols in returned the merged.sdf.

One of the outgoing MOLs: -

A71EV2A-x0202_A_147_1_A71EV2A-x3977+A+202+1_LIG
     RDKit          3D

 13 14  0  0  0  0  0  0  0  0999 V2000
   17.6570   -2.9850   19.5570 O   0  0  0  0  0  0  0  0  0  0  0  0
   17.9410   -1.6700   20.2190 S   0  0  0  0  0  0  0  0  0  0  0  0
   19.2010   -1.1100   19.5960 O   0  0  0  0  0  0  0  0  0  0  0  0
   16.6170   -0.5940   19.9510 N   0  0  0  0  0  0  0  0  0  0  0  0
   18.2720   -1.8890   21.9180 C   0  0  0  0  0  0  0  0  0  0  0  0
   18.6090   -0.9110   22.7690 N   0  0  0  0  0  0  0  0  0  0  0  0
   18.7800   -1.4780   23.9690 C   0  0  0  0  0  0  0  0  0  0  0  0
   18.5490   -2.7970   23.8580 N   0  0  0  0  0  0  0  0  0  0  0  0
   18.2300   -3.0840   22.5860 C   0  0  0  0  0  0  0  0  0  0  0  0
   18.7790    0.4160   22.5960 N   0  0  0  0  0  0  0  0  0  0  0  0
   19.1250    1.2410   23.5980 C   0  0  0  0  0  0  0  0  0  0  0  0
   19.3110    0.7030   24.8530 C   0  0  0  0  0  0  0  0  0  0  0  0
   19.1380   -0.6510   25.0310 C   0  0  0  0  0  0  0  0  0  0  0  0
  4  2  1  0
  3  2  2  0
  2  1  2  0
  5  2  1  0
  5  9  2  0
  9  8  1  0
  8  7  2  0
  7 13  1  0
 13 12  2  0
 12 11  1  0
 11 10  2  0
 10  6  1  0
  6  5  1  0
  7  6  1  0
M  END

@mwinokan
Copy link
Collaborator Author

@alanbchristie in my opinion we should solve this by including a custom script to run before/after fragmenstein to rename the molecules / properties using the metadata.csv. I reckon we will encounter many such cases where the inputs/outputs will need to be sanitised depending on the implementation of the algorithms. For fragmenstein, we can work on this during the hackathon, (it's in the agenda as "Update Fragmenstein job parameters/script")

@mwinokan
Copy link
Collaborator Author

@alanbchristie and @mwinokan agree that we should rename files sent to jobs so that they most resemble what is observed in the download, i.e. the .mols are named after the observation short code. Then a small change would be needed for Fragmenstein to use the filename as the fragment identifier rather than the _Name property in the MOL file.

Further discussion at the hackathon

@mwinokan
Copy link
Collaborator Author

mwinokan commented Jan 21, 2025

Option 1: Unify the long-code spec across XCA & Fragalysis

In order to have compatibility between the longcode generated by the loader and the XCA observation identifier the "v" needs to be added to the XCA code:

XCA code (now): A71EV2A-x0379_A_147_1_A71EV2A-x0379+A+147+1_LIG
Fragalysis long code: A71EV2A-x0379_A_147_v1
XCA code (future)?: A71EV2A-x0379_A_147_v1_A71EV2A-x0379+A+147+v1_LIG

Then:

  • XCA will have a breaking version change
  • The loader code that constructs the long code needs to change
  • Existing files will need to be renamed

See further discussion in #1634

Option 2: store the XCA long identifier as a database field

Store the XCA code identifier as it is now in the database, e.g. A71EV2A-x0379_A_147_1_A71EV2A-x0379+A+147+1_LIG

Then:

  • Patch the loader to store the _Name property of the ligand in the .mol file in a database field

@mwinokan mwinokan moved this from In Progress (DEV) to QA done in dev.env - move to staging in Fragalysis Jan 23, 2025
@mwinokan
Copy link
Collaborator Author

@boriskovar-m2ms needs to fix some last frontend bugs then this can go to staging

@mwinokan mwinokan moved this from QA done in dev.env - move to staging to In Progress (DEV) in Fragalysis Jan 23, 2025
@boriskovar-m2ms boriskovar-m2ms moved this from In Progress (DEV) to Dev Done - Do review (DEV) in Fragalysis Feb 4, 2025
@mwinokan mwinokan moved this from Dev Done - Do review (DEV) to QA done in dev.env - move to staging in Fragalysis Feb 4, 2025
@boriskovar-m2ms boriskovar-m2ms moved this from QA done in dev.env - move to staging to In staging - assess function vs spec in Fragalysis Feb 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2024-04-26 orange Design (RHS) dissemination 2024-10-31 oscars WP1 OSCARS grant WP/task 1
Projects
Status: In staging - assess function vs spec
Development

No branches or pull requests

5 participants