Configure datagov-harvesting-logic as a cloud.gov python application #4617

Closed
7 of 9 tasks
btylerburton opened this issue Feb 15, 2024 · 2 comments
Labels
H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0

Comments


btylerburton commented Feb 15, 2024

User Story

In order to allow datagov-harvesting-logic (DHL) to operate as a process independent of other parts of the Harvest 2.0 system, datagovteam wants to create a new cloud.gov Python application.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN I have created a Manifest file in the DHL repo
    AND I have configured it to deploy the repo as a standalone application with no public routes
    WHEN I run cf apps in the appropriate space
    THEN I expect to see datagov-harvesting-logic as its own app, with zero instances running.

  • GIVEN I have created a python script similar to loadtest.py in the DHL repo
    AND that script is tied to a command that can be invoked from the CLI and accepts a harvest source ID as an argument
    WHEN I run cf run-task with the appropriate -c flag
    THEN I expect to see the app launch a task and begin harvesting that source

  • GIVEN I have run a harvest job in the DHL repo
    AND that job has completed
    WHEN I run cf tasks datagov-harvesting-logic
    THEN I expect to see that the latest task is marked as: SUCCEEDED
    AND WHEN I run cf app datagov-harvesting-logic
    THEN I expect to see that the app has cycled back to being idle
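The third criterion could also be checked programmatically. The following is a minimal sketch, assuming the task list has already been retrieved as JSON from the Cloud Foundry v3 tasks API (for example via cf curl) and parsed into a dict. The field names `resources`, `sequence_id`, and `state` are assumptions to verify against the API documentation, and the sample payload is hand-written, not real output.

```python
from typing import Optional


def latest_task_state(payload: dict) -> Optional[str]:
    """Return the state of the most recently created task, or None if there are none."""
    tasks = payload.get("resources", [])
    if not tasks:
        return None
    # Assumption: sequence_id increases with each task created for the app,
    # so the largest value identifies the latest task.
    latest = max(tasks, key=lambda task: task["sequence_id"])
    return latest["state"]


# Hand-written sample shaped like a task listing; not real output.
sample = {
    "resources": [
        {"sequence_id": 14, "name": "example-harvest", "state": "FAILED"},
        {"sequence_id": 15, "name": "example-harvest", "state": "SUCCEEDED"},
    ]
}
print(latest_task_state(sample))
```

Picking the task by sequence number rather than by list position avoids depending on the order in which the API returns results.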

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

As we learn more about Airflow, we are realizing that bundling DHL as a PyPI module into an Airflow instance is not worthwhile. Rather, we want to push DHL as its own application that can be invoked by running a cf run-task command with a harvest source object as a payload. This means we will no longer publish it as a PyPI module.

For an idea of how the manifest should look, you can reference the catalog-gather app:
https://github.com/GSA/catalog.data.gov/blob/main/manifest.yml#L84-L109

For reference on CF tasks: https://docs.cloudfoundry.org/devguide/using-tasks.html

For reference on invoking a task with arguments, here are a few verified patterns:

  • cf run-task airflow-test-webserver -c "airflow users create --username admin --firstname admin --lastname admin --password admin --role Admin --email email@email.com"
  • cf run-task catalog-web -c "ckan report generate organization=None metrics-dashboard" --name=reports-list
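If tasks end up being launched from a script rather than typed by hand, the same pattern can be assembled as an argument list. This is a minimal sketch, not part of DHL: the helper name `build_run_task_argv` is hypothetical, and it only uses the `-c` and `--name` flags shown in the patterns above.

```python
import shlex
from typing import List, Optional


def build_run_task_argv(app: str, command: str, name: Optional[str] = None) -> List[str]:
    """Build the argument list for `cf run-task` without any shell quoting."""
    argv = ["cf", "run-task", app, "-c", command]
    if name:
        argv += ["--name", name]
    return argv


argv = build_run_task_argv(
    "catalog-web",
    "ckan report generate organization=None metrics-dashboard",
    name="reports-list",
)
# Print the shell-quoted equivalent for inspection.
print(shlex.join(argv))
# To actually submit the task (requires a logged-in cf CLI):
# import subprocess; subprocess.run(argv, check=True)
```

Passing a list to subprocess keeps the task command as a single argument, so spaces inside it do not need manual escaping.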

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

DHL should only be able to be invoked from within the cloud.gov space, so there is no threat of its being compromised by external actors.

Sketch

  • Add new manifest to datagov-harvesting-logic repo
  • Add requisite CKAN_URL and CKAN_API_TOKEN as environment variables
  • Configure the manifest so that it doesn't launch any instances of itself.
  • Create a command that can be invoked from the CLI which will accept a harvest source ID as an argument
  • Confirm app can be invoked from the CLI.
    • ex. cf run-task example-app -c "{harvest-cli-command} id={id}" --name {org_name}-harvest
  • Create a script that takes the ID, extracts the harvest source config from the DB, and executes the harvest. This can be part of a Flask API endpoint/action.
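The command described in the sketch could look roughly like the following. This is a hedged outline only: `parse_source_id`, `fetch_source_config`, and `run_harvest` are hypothetical names, the two latter functions are placeholders rather than DHL code, and accepting the `id={id}` form is an assumption based on the example invocation above.

```python
import argparse
from typing import List, Optional


def parse_source_id(token: str) -> str:
    """Accept either 'id=<value>' or a bare '<value>' and return the value."""
    value = token.split("=", 1)[1] if token.startswith("id=") else token
    if not value:
        raise argparse.ArgumentTypeError("harvest source id must not be empty")
    return value


def fetch_source_config(source_id: str) -> dict:
    """Hypothetical placeholder: a real version would read the config from the DB."""
    return {"id": source_id}


def run_harvest(config: dict) -> int:
    """Hypothetical placeholder: a real version would run the harvest pipeline."""
    print(f"harvesting source {config['id']}")
    return 0


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Harvest a single source by id.")
    parser.add_argument("source_id", type=parse_source_id, help="e.g. id=1234")
    args = parser.parse_args(argv)
    return run_harvest(fetch_source_config(args.source_id))


# Example invocation with an explicit argument list instead of sys.argv.
exit_code = main(["id=example-source-id"])
```

Returning an integer from `main` lets a console entry point pass it to `sys.exit`, so a failed harvest would surface as a non-zero task exit status.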
@gujral-rei gujral-rei moved this to 📟 Sprint Backlog [7] in data.gov team board Feb 15, 2024
@btylerburton btylerburton added the H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 label Feb 16, 2024
@FuhuXia FuhuXia moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Feb 21, 2024
@FuhuXia FuhuXia self-assigned this Feb 21, 2024
@rshewitt

Does it make sense to just create a utility function in harvesting logic to fetch a harvest source config from the DB?
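For illustration, such a utility might be shaped like this. It is a sketch under loud assumptions: sqlite3 stands in for whatever database the harvester uses, and the `harvest_source` table, its `id` and `config` columns, and the function name are all hypothetical.

```python
import json
import sqlite3
from typing import Optional


def get_harvest_source_config(conn: sqlite3.Connection, source_id: str) -> Optional[dict]:
    """Return the stored config for a harvest source, or None if the id is unknown."""
    # Hypothetical schema: one JSON-encoded config per source id.
    row = conn.execute(
        "SELECT config FROM harvest_source WHERE id = ?", (source_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# Demo against an in-memory database with one made-up row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE harvest_source (id TEXT PRIMARY KEY, config TEXT)")
conn.execute(
    "INSERT INTO harvest_source VALUES (?, ?)",
    ("example-id", json.dumps({"title": "example", "url": "https://example.com/data.json"})),
)
print(get_harvest_source_config(conn, "example-id"))
print(get_harvest_source_config(conn, "missing"))
```

Keeping the lookup in one function means the CLI command and any API endpoint could share it.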


FuhuXia commented Feb 29, 2024

The datagov-harvesting-logic app has been pushed to cloud.gov with a smoke-test script that verifies HarvestSource and Record objects can be initialized. There is no interaction with the DB at this point.

$ cf run-task datagov-harvesting-logic -c "/home/vcap/app/scripts/smoke-test.py" --name smoke-test
Creating task for app datagov-harvesting-logic in org gsa-datagov / space development as fuhu.xia@gsa.gov...
Task has been submitted successfully for execution.
OK

task name:   smoke-test
task id:     15
🐧 Thu 12:27:54 [~/git/datagov/datagov-harvesting-logic (develop =)]
$ cf logs  --recent datagov-harvesting-logic
Retrieving logs for app datagov-harvesting-logic in org gsa-datagov / space development as fuhu.xia@gsa.gov...

   2024-02-29T12:27:54.84-0500 [CELL/0] OUT Cell 4e44949f-1406-4044-98ee-a5756f487e58 creating container for instance f7f7b9c5-2a76-49a9-9a5e-56b55d20cfaf
   2024-02-29T12:27:55.21-0500 [CELL/0] OUT Security group rules were updated
   2024-02-29T12:27:55.22-0500 [CELL/0] OUT Cell 4e44949f-1406-4044-98ee-a5756f487e58 successfully created container for instance f7f7b9c5-2a76-49a9-9a5e-56b55d20cfaf
   2024-02-29T12:28:00.24-0500 [APP/TASK/smoke-test/0] OUT Invoking pre-start scripts.
   2024-02-29T12:28:00.34-0500 [APP/TASK/smoke-test/0] OUT Invoking start command.
   2024-02-29T12:28:01.16-0500 [APP/TASK/smoke-test/0] OUT HarvestSource(_title='a', _url='a', _owner_org='a', _extract_type='a', _waf_config={}, _extra_source_name='harvest_source_name', _no_harvest_resp=False, _ckan_start=0, _ckan_row=1000)
   2024-02-29T12:28:01.16-0500 [APP/TASK/smoke-test/0] OUT Record(_harvest_source='a', _identifier='a', _metadata={}, _metadata_hash='', _operation=None, _valid=None, _validation_msg='', _status='nothing', ckanified_metadata={})
   2024-02-29T12:28:01.25-0500 [APP/TASK/smoke-test/0] OUT Exit status 0
   2024-02-29T12:28:01.82-0500 [CELL/0] OUT Cell 4e44949f-1406-4044-98ee-a5756f487e58 stopping instance f7f7b9c5-2a76-49a9-9a5e-56b55d20cfaf
   2024-02-29T12:28:01.82-0500 [CELL/0] OUT Cell 4e44949f-1406-4044-98ee-a5756f487e58 destroying container for instance f7f7b9c5-2a76-49a9-9a5e-56b55d20cfaf
   2024-02-29T12:28:05.03-0500 [CELL/0] OUT Cell 4e44949f-1406-4044-98ee-a5756f487e58 successfully destroyed container for instance f7f7b9c5-2a76-49a9-9a5e-56b55d20cfaf
🐧 Thu 12:28:24 [~/git/datagov/datagov-harvesting-logic (develop =)]

@FuhuXia FuhuXia closed this as completed Feb 29, 2024
@github-project-automation github-project-automation bot moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Feb 29, 2024
@gujral-rei gujral-rei moved this from ✔ Done to 🗄 Closed in data.gov team board Feb 29, 2024