Fix squonk job execution #1559

mwinokan · 2024-10-31T12:33:46Z

@tdudgeon and @alanbchristie please give us an estimate for the amount of work needed to update the deployments to use the new data model and resuscitate the algorithms

mwinokan · 2024-11-14T11:26:06Z

@tdudgeon says that the work needed depends largely on whether the shared filesystem between Fragalysis and Squonk is needed.

@phraenquex says that the first step is to get the job submission working again and that conversations with SC have ruled out a shared file system between the Fragalysis stack and thread for job execution

@alanbchristie A schematic of the architecture for Fragalysis and Squonk would help immensely

alanbchristie · 2024-11-18T16:19:49Z

Here are some references for the original design: -

The original high-level design for Squonk Job Execution in Fragalysis: -
With references to various google docs
#839

A simplified low-level design document (Fragalysis/Squonk AC (937) LLD)
https://docs.google.com/document/d/1lFpN29dK1luz80lwSGi0Rnj1Rqula2_CTRuyWDUBu14/edit?tab=t.0#heading=h.y56jkt6kfj5y

A brief but practical documentation on how to use Job Execution on the b/e ReadTheDocs site: -
https://fragalysis-stack-kubernetes.readthedocs.io/en/latest/architecture/index.html
See the section: - Squonk Integration

An issue discussing the global Job Configuration
#944

Other related issues include: -

Fix Squonk access control
An early and detailed discussion of how Fragalysis and Squonk objects are related
#937
Implement job status refresh
Relates to availability of a token and a refresh mechanism in the F/E
An attempt to review job status that may have bene lost (due to callback errors)
#872

API

job_request: https://fragalysis.xchem.diamond.ac.uk/api/job_request/
job_file_transfer: https://fragalysis.xchem.diamond.ac.uk/api/job_file_transfer/
job_callback: https://fragalysis.xchem.diamond.ac.uk/api/job_callback/
job_config: "https://fragalysis.xchem.diamond.ac.uk/api/job_config/
job_override: https://fragalysis.xchem.diamond.ac.uk/api/job_override/
job_access: https://fragalysis.xchem.diamond.ac.uk/viewer/job_access/

Views (in viewer.views)

JobRequestView [GET, POST]
JobFileTransferView [GET, POST]
JobCallbackView [GET, PUT] (PUT used by Squonk)
JobConfigView [GET]
JobOverrideView [GET, POST]
JobAccessView [GET]

Basic flow

F/E initiates a File Transfer
Providing: -
- Target Access String (Project/visit)
- Target
- Snapshot
- Session Project
- Proteins
- Compounds
  The B/E then transfers files to Squonk using the User and Target
  as the basis for the Project name. Each Stack also has
  adds a unique prefix so that projects from Alan on Stack X
  are different from those on Stack Y. A typical Squonk Project name
  will be Fragalysis fs-abc achristie::lb18141-1
F/E initiates a Job by using Job Request
Providing: -
- Target Access String (Project/visit)
- Target
- Snapshot
- Session Project
- Squonk Job Name
- Squonk Job Spec
Squonk calls into Fragalysis using the Job Callback API
The B/E responds to callbacks that indicate 'completion' by fetching the generated file from Squonk's filesystem and then uploading it as a new "computed/compound set'.
F/E polls Job Request until complete

Transferring Target files (to squonk)

After a lot of "preamble" logic the JobFileTransferView: -

Creates a FileTransfer record
Runs the Transfer logic as a Celery task. This is accomplished by process_job_file_transfer.delay(...) in tasks.py, which is given the ID of the JobTransfer record. The record contains JSON record listing the proteins and compounds required (amongst other things).
Ultimately this code relies on logic in squonk_job_file_transfer.py to locate the files and then use the Data Manager API to send them to the Squonk filesystem (using its put_unmanaged_project_files() API method).

Uploading Job results (from squonk)

The JobCallbackView received notifications about completed Jobs. When a job is complete: -

A number of "chained" Celery tasks are executed: -
- process_compound_set_job_file(...)
- validate_compound_set()
- process_compound_set()
- erase_compound_set_job_material()

These four functions retrieve the SD file (and JSON parameters) from Squonk and 'format' the content to better fit Fragalysis, then simply validate and process the file (using MolOps) to create a new "compound set" before removing the transient files.

Effort

If Squonk Job execution remains the same, as does the configuration, then the only things broken will be the file transfer's ability to locate the files. We think the RHS upload will still work, but might need an adjustment. In summary: -

0.5 days to reconfigure squonk in the new cluster and make sure its API works
2 days to adjust the file transfer code, test and release the new logic

Maybe 2-3 days?

mwinokan · 2024-11-21T12:30:04Z

@alanbchristie asks if the f/e feature for starting squonk jobs is still there

@boriskovar-m2ms says it's there but won't work with the new data format, and expects there to be several issues

alanbchristie · 2025-01-07T11:16:59Z

Before we start to tackle the broken Squonk/2 job execution I think it would be extremely valuable if we first used the existing fragalysis-stack-behaviour-tests repo to write a small number of behaviour (gherkin) CI tests (based on a clean deployment of the latest stack code).

The repo is functional and is already able to create stacks and run tests against the Fragalysis Stack REST API.

So let's use this repository to write some business-level tests that will help to document the feature and define its precise behaviour. Tests that will: -

Load (and verify) reference target data
Run (and verify) the execution of Squonk Jobs on the target

While we're doing this any work we do is clearly going to benefit XCHem-Align automated (CI) testing as we'll build the foundations of a test suite that could test all sorts of future target loading behaviour (two birds, one stone?).

The test language (gherkin) is also extremely useful in that it's very high level and can (should) be understood by anyone, programmer or not.

In fact we'll need to automate the loading of targets before we can test job execution.

A target loader "behaviour" test

A typical (and real) behaviour test for target loading might look like this: -

Feature: Verify legitimate Public Targets can be loaded into the stack

  This test ensures that 'well known' target data (generated by XChem-Align)
  can be loaded into a new stack against the public proposal,
  while also verifying the results. We rely on a clean stack
  and a CAS-authenticated user.

  Background: Start with a new (empty) stack
    Given an empty behaviour stack tagged latest
    Then the stack landing page should return http 200

  Scenario Template: Load a public target
    Given I can login to the behaviour stack
    When I load the target <target> from <file> under TAS <tas>
    Then the operation should return a status of CREATED
    And the operation should return a Task ID
    And the Task should complete within 10 minutes
    And I should find target <target> in the stack against TAS <tas>

    Examples:
    | target  | file        | tas       |
    | A71EV2A | A71EV2A.zip | lb18145-1 |

A Squonk/2 job execution "behaviour" test

A typical (and real) behaviour test for Squonk/2 job execution might start-out looking like this: -

Feature: Verify the stack can run Squonk/2 Jobs

  Here we check that a properly configured to run Squonk/2 Jobs.
  Tests typically load target data, run a Job and verify the results.

  Background: Start with a new (empty) stack
    Given an empty behaviour stack tagged latest
    Then the stack landing page should return http 200
    And the stack should support Squonk/2 Job execution

  Scenario Template: Run a Squonk/2 Job
    Given I can login to the behaviour stack
    And I load the target <target> from <file> under TAS <tas>
    And the target has been loaded successfully
    When I run job <job> on the target <target>
    Then the operation should return a status of CREATED
    And the operation should return a Job Request ID
    And the Job should complete within 10 minutes
    And I should find the result loaded into the stack

    Examples:
    | target  | file        | tas       | job     |
    | A71EV2A | A71EV2A.zip | lb18145-1 | Job XYZ |

mwinokan · 2025-01-07T12:26:05Z

@alanbchristie clarifies that the behaviour testing has been implemented in another repository and is able to create a fresh stack on the development cluster for testing.

@tdudgeon please provide @alanbchristie with the two test data sets from XCA #1588 and help him write a minimal LHS upload test

mwinokan · 2025-01-09T11:23:10Z

@alanbchristie has implemented daily automated testing of two XChem targets in the [fragalysis-stack-behaviour-tests](https://github.com/xchem/fragalysis-stack-behaviour-tests repository)

@alanbchristie says that @kaliif will be needed to help locate the files needed for job executing in the v2 schema

@alanbchristie says that getting jobs running via the API is easier to achieve than getting the frontend UI support for jobs resuscitated

The API to launch jobs passes JSON which exposes specific files to the algorithms running in the job

mwinokan · 2025-01-14T12:24:09Z

@alanbchristie says that @kaliif is not needed continuously for this ticket

@alanbchristie is believes that the job execution via API will be ready by the hackathon

alanbchristie · 2025-01-15T16:59:12Z

API-based Job execution has been restored using minor code modifications on a development branch of the backend.

Automated tests that create a stack, load suitable target data, and run the fragmenstein-combine Job can be found in the fragalysis-stack-behaviour-tests repository in the squonk-basic-job-execution.feature feature file.

These tests now execute automatically, every day.

As a basic reminder of the existing Job execution logic, I've produced a couple of block diagrams. One illustrates the file transfer step, and the other the job execution step: -

alanbchristie · 2025-01-15T17:03:11Z

One problem we have with the current Job is the content of the generated SD-File, which has the required ref_mols property. The property, for the test I run is populated with the following content: -

>  <ref_mols>  (1) 
A71EV2A-x0202_A_147_1_A71EV2A-x3977+A+202+1_LIG,A71EV2A-x0202_A_201_1_A71EV2A-x0488+A+147+1_LIG

These ref_mols strings are not compatible with the existing stack RHS logic, which appears to expect the string to contain a SiteObservation code (something like A0202a or A0202b). These strings obviously do not contain the code, so creating of RHS data currently relies on a hard-coded hack for the current test.

What is the recommended fix for this?

mwinokan · 2025-01-15T19:16:48Z

@alanbchristie what are the input files to fragmenstein that produced an output with the long-codes in the SDF? I believe fragmenstein uses the molecule names from the input SD files.

In case we need to map between the short and long observation codes we can use the metadata.csv we serve in the downloads

alanbchristie · 2025-01-15T19:27:34Z

The files used in the test (for fragmenstein-combine) are: -

    "proteins": [
      "target_loader_data/A71EV2A_lb18145-1/upload_1/aligned_files/A71EV2A-x0152/A71EV2A-x0152_A_201_1_A71EV2A-x3977+A+202+1_apo-desolv.pdb",
    ],
    "compounds": [
      "target_loader_data/A71EV2A_lb18145-1/upload_1/aligned_files/A71EV2A-x0202/A71EV2A-x0202_A_147_1_A71EV2A-x3977+A+202+1_ligand.mol",
      "target_loader_data/A71EV2A_lb18145-1/upload_1/aligned_files/A71EV2A-x0202/A71EV2A-x0202_A_201_1_A71EV2A-x0488+A+147+1_ligand.mol",
    ],

So we either need to a) manipulate the outgoing file (but remember this is just one example) or b) add new information on the way out to Squonk or c) do a much more complicated search for the original SiteObservations for the ref_mols in returned the merged.sdf.

One of the outgoing MOLs: -

A71EV2A-x0202_A_147_1_A71EV2A-x3977+A+202+1_LIG
     RDKit          3D

 13 14  0  0  0  0  0  0  0  0999 V2000
   17.6570   -2.9850   19.5570 O   0  0  0  0  0  0  0  0  0  0  0  0
   17.9410   -1.6700   20.2190 S   0  0  0  0  0  0  0  0  0  0  0  0
   19.2010   -1.1100   19.5960 O   0  0  0  0  0  0  0  0  0  0  0  0
   16.6170   -0.5940   19.9510 N   0  0  0  0  0  0  0  0  0  0  0  0
   18.2720   -1.8890   21.9180 C   0  0  0  0  0  0  0  0  0  0  0  0
   18.6090   -0.9110   22.7690 N   0  0  0  0  0  0  0  0  0  0  0  0
   18.7800   -1.4780   23.9690 C   0  0  0  0  0  0  0  0  0  0  0  0
   18.5490   -2.7970   23.8580 N   0  0  0  0  0  0  0  0  0  0  0  0
   18.2300   -3.0840   22.5860 C   0  0  0  0  0  0  0  0  0  0  0  0
   18.7790    0.4160   22.5960 N   0  0  0  0  0  0  0  0  0  0  0  0
   19.1250    1.2410   23.5980 C   0  0  0  0  0  0  0  0  0  0  0  0
   19.3110    0.7030   24.8530 C   0  0  0  0  0  0  0  0  0  0  0  0
   19.1380   -0.6510   25.0310 C   0  0  0  0  0  0  0  0  0  0  0  0
  4  2  1  0
  3  2  2  0
  2  1  2  0
  5  2  1  0
  5  9  2  0
  9  8  1  0
  8  7  2  0
  7 13  1  0
 13 12  2  0
 12 11  1  0
 11 10  2  0
 10  6  1  0
  6  5  1  0
  7  6  1  0
M  END

mwinokan · 2025-01-16T10:30:26Z

@alanbchristie in my opinion we should solve this by including a custom script to run before/after fragmenstein to rename the molecules / properties using the metadata.csv. I reckon we will encounter many such cases where the inputs/outputs will need to be sanitised depending on the implementation of the algorithms. For fragmenstein, we can work on this during the hackathon, (it's in the agenda as "Update Fragmenstein job parameters/script")

mwinokan · 2025-01-16T11:44:15Z

@alanbchristie and @mwinokan agree that we should rename files sent to jobs so that they most resemble what is observed in the download, i.e. the .mols are named after the observation short code. Then a small change would be needed for Fragmenstein to use the filename as the fragment identifier rather than the _Name property in the MOL file.

Further discussion at the hackathon

mwinokan · 2025-01-21T13:58:42Z

Option 1: Unify the long-code spec across XCA & Fragalysis

In order to have compatibility between the longcode generated by the loader and the XCA observation identifier the "v" needs to be added to the XCA code:

XCA code (now): A71EV2A-x0379_A_147_1_A71EV2A-x0379+A+147+1_LIG
Fragalysis long code: A71EV2A-x0379_A_147_v1
XCA code (future)?: A71EV2A-x0379_A_147_v1_A71EV2A-x0379+A+147+v1_LIG

Then:

XCA will have a breaking version change
The loader code that constructs the long code needs to change
Existing files will need to be renamed

See further discussion in #1634

Option 2: store the XCA long identifier as a database field

Store the XCA code identifier as it is now in the database, e.g. A71EV2A-x0379_A_147_1_A71EV2A-x0379+A+147+1_LIG

Then:

Patch the loader to store the _Name property of the ligand in the .mol file in a database field

mwinokan · 2025-01-23T15:14:17Z

@boriskovar-m2ms needs to fix some last frontend bugs then this can go to staging

mwinokan added this to Fragalysis Oct 31, 2024

mwinokan converted this from a draft issue Oct 31, 2024

mwinokan moved this from Backlog to Orange in Fragalysis Oct 31, 2024

mwinokan added the 2024-04-26 orange Design (RHS) dissemination label Oct 31, 2024

mwinokan assigned alanbchristie and tdudgeon Oct 31, 2024

alanbchristie assigned kaliif Nov 19, 2024

mwinokan added the 2024-11-27 oscars WP3 OSCARS grant WP/task 3 label Nov 26, 2024

phraenquex moved this from Orange to In Progress (DEV) in Fragalysis Dec 10, 2024

mwinokan added 2024-10-31 oscars WP1 OSCARS grant WP/task 1 and removed 2024-11-27 oscars WP3 OSCARS grant WP/task 3 labels Dec 10, 2024

mwinokan mentioned this issue Jan 9, 2025

Workshops / hackathons at DLS #1607

Open

mwinokan moved this from In Progress (DEV) to QA done in dev.env - move to staging in Fragalysis Jan 23, 2025

mwinokan moved this from QA done in dev.env - move to staging to In Progress (DEV) in Fragalysis Jan 23, 2025

alanbchristie assigned boriskovar-m2ms Jan 24, 2025

alanbchristie unassigned tdudgeon and kaliif Jan 24, 2025

boriskovar-m2ms moved this from In Progress (DEV) to Dev Done - Do review (DEV) in Fragalysis Feb 4, 2025

mwinokan moved this from Dev Done - Do review (DEV) to QA done in dev.env - move to staging in Fragalysis Feb 4, 2025

boriskovar-m2ms moved this from QA done in dev.env - move to staging to In staging - assess function vs spec in Fragalysis Feb 5, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix squonk job execution #1559

Fix squonk job execution #1559

mwinokan commented Oct 31, 2024 •

edited

Loading

mwinokan commented Nov 14, 2024 •

edited

Loading

alanbchristie commented Nov 18, 2024 •

edited

Loading

mwinokan commented Nov 21, 2024

alanbchristie commented Jan 7, 2025 •

edited

Loading

mwinokan commented Jan 7, 2025

mwinokan commented Jan 9, 2025 •

edited

Loading

mwinokan commented Jan 14, 2025 •

edited

Loading

alanbchristie commented Jan 15, 2025 •

edited

Loading

alanbchristie commented Jan 15, 2025

mwinokan commented Jan 15, 2025

alanbchristie commented Jan 15, 2025 •

edited

Loading

mwinokan commented Jan 16, 2025

mwinokan commented Jan 16, 2025

mwinokan commented Jan 21, 2025 •

edited

Loading

mwinokan commented Jan 23, 2025

Fix squonk job execution #1559

Fix squonk job execution #1559

Comments

mwinokan commented Oct 31, 2024 • edited Loading

mwinokan commented Nov 14, 2024 • edited Loading

alanbchristie commented Nov 18, 2024 • edited Loading

Other related issues include: -

API

Views (in viewer.views)

Basic flow

Transferring Target files (to squonk)

Uploading Job results (from squonk)

Effort

mwinokan commented Nov 21, 2024

alanbchristie commented Jan 7, 2025 • edited Loading

A target loader "behaviour" test

A Squonk/2 job execution "behaviour" test

mwinokan commented Jan 7, 2025

mwinokan commented Jan 9, 2025 • edited Loading

mwinokan commented Jan 14, 2025 • edited Loading

alanbchristie commented Jan 15, 2025 • edited Loading

alanbchristie commented Jan 15, 2025

mwinokan commented Jan 15, 2025

alanbchristie commented Jan 15, 2025 • edited Loading

mwinokan commented Jan 16, 2025

mwinokan commented Jan 16, 2025

mwinokan commented Jan 21, 2025 • edited Loading

Option 1: Unify the long-code spec across XCA & Fragalysis

Option 2: store the XCA long identifier as a database field

mwinokan commented Jan 23, 2025

mwinokan commented Oct 31, 2024 •

edited

Loading

mwinokan commented Nov 14, 2024 •

edited

Loading

alanbchristie commented Nov 18, 2024 •

edited

Loading

alanbchristie commented Jan 7, 2025 •

edited

Loading

mwinokan commented Jan 9, 2025 •

edited

Loading

mwinokan commented Jan 14, 2025 •

edited

Loading

alanbchristie commented Jan 15, 2025 •

edited

Loading

alanbchristie commented Jan 15, 2025 •

edited

Loading

mwinokan commented Jan 21, 2025 •

edited

Loading