Create Harvest DB and Table Structure #4612

btylerburton · 2024-02-07T23:14:21Z

User Story

In order to track the state of the new Harvesting 2.0 pipeline, datagovteam wants to create a Harvesting DB to record the status of harvest sources, harvest jobs, and harvest errors.

The Harvest DB will be defined in code using SQLAlchemy, so as to be reproducible through environments and deployments.

NOTE: As Google docs allows for more flexibility in table structure as modifications are ongoing, we will use the Harvesting Research Doc (here) to finalize table structure before adding an accompanying ERD to datagov-harvesting-logic docs.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

GIVEN the team has settled on the Table structure above
AND it has been defined in an SQLAlchemy file
THEN it should be able to be created programaticaly in a Docker container by running a DB initialization script.
GIVEN that the DB has been initialized
WHEN tests are run locally in pytest
THEN the results should be posted to the DB and be verifiable by running a query to generate the resulting report

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

DB should be properly secured

Sketch

Define Table Structure in SQLAlchemy
Initialize this structure into a local Docker PostGres container
Create an interface in datagov-harvesting-logic to interact with the DB
Call that interface as needed to report the results of harvester operations into the DB

The text was updated successfully, but these errors were encountered:

rshewitt · 2024-02-07T23:26:53Z

sqlachemy work in harvesting logic

rshewitt · 2024-02-08T18:16:53Z

active branch

rshewitt · 2024-02-08T22:26:06Z

performance in sqlalchemy

rshewitt · 2024-02-08T22:37:43Z

web request pattern in sqlalchemy

rshewitt · 2024-02-09T19:56:37Z

migration and db versioning links

Jin-Sun-tts · 2024-02-13T18:00:32Z

Defined table structure in SQLAlchemy, create a docker container running Postgres, connected to create tables in DB:

Inserted a sample record in the harvest_source table:

Created HarvestDBInterface to create/get/update db. Will add more detail functions in.
also created a simple testing to insert records to the harvest_job table etc.

robert-bryson · 2024-02-14T17:43:37Z

DB created in dev space, see keys with cf service-key harvesting-logic-db access-test. Connecting to the service remotely needs to be done through an ssh tunnel: https://docs.cloudfoundry.org/devguide/deploy-apps/ssh-services.html#ssh-tunnel. I'm working on a proof of concept to connect to the internal from a github action.

jbrown-xentity · 2024-02-14T18:03:52Z

DB created in dev space, see keys with cf service-key harvesting-logic-db access-test. Connecting to the service remotely needs to be done through an ssh tunnel: https://docs.cloudfoundry.org/devguide/deploy-apps/ssh-services.html#ssh-tunnel. I'm working on a proof of concept to connect to the internal from a github action.

Just FYI, we have ssh turned off on prod at the entire space level. So if we need to use ssh to connect, that's probably not a good long term approach... However, knowing that we cannot connect outside of cloud.gov to the RDS services is helpful to know!

btylerburton · 2024-02-14T18:26:42Z

I would think the long-term solution is a proxy app.

btylerburton · 2024-02-14T18:29:48Z

Connecting to the service remotely needs to be done through an ssh tunnel

I tried to connect with SQLEctron yesterday and got basically the same connection error.

Jin-Sun-tts · 2024-02-15T20:56:24Z

Developed a Flask application and set up a PostgreSQL Docker container for local execution, as detailed in branch create-harvest-db.

Additionally, interfaces were established to interact with the corresponding tables.

for examples, to insert test records, utilize the following endpoints:
Add a source: /add_source
Add a job: /add_job?source_id=
Add an error: /add_error?job_id=

To-do:
Question: Regarding the harvest_source table, should organization_name and config.url be designated as the primary/unique key?
Add additional load test scripts
Setup and conduct testing on cloud.gov, utilizing the database defined within that environment.

rshewitt · 2024-02-20T18:26:55Z

Developed a Flask application and set up a PostgreSQL Docker container for local execution, as detailed in branch create-harvest-db.

Additionally, interfaces were established to interact with the corresponding tables.

for examples, to insert test records, utilize the following endpoints: Add a source: /add_source Add a job: /add_job?source_id= Add an error: /add_error?job_id=

To-do: Question: Regarding the harvest_source table, should organization_name and config.url be designated as the primary/unique key? Add additional load test scripts Setup and conduct testing on cloud.gov, utilizing the database defined within that environment.

@Jin-Sun-tts in response to your question, so far all the tables inherit an id field from Base ( source ) which is used as the primary key. are you suggesting creating a composite key using organization_name + config.url? I think that's more meaningful from a data perspective than using a uuid. I suppose it's a matter of whether we want to use a business key or a surrogate key.

Jin-Sun-tts · 2024-02-21T17:55:47Z

Pushed the Fask app to cloud.gov development https://harvesting-logic.app.cloud.gov/

btylerburton · 2024-02-22T18:37:27Z

@Jin-Sun-tts chating with @krishnasanaka about the Flask form she's working on, and I believe there is work to be done to update the harvest_source table to the latest schema (here).

Id (str) *primary
name (str)
notification_emails (array)
organization_name (str) (*self-reported)
frequency (str) (cron)
url (str)
schema_type (str)
source_type (str)

@GSA/data-gov-team can you take a look at the updates and add comments if you feel something was lost in translation? thanks.

https://docs.google.com/document/d/1XzfTrPxu-asJ_55GoeZ2UOJsie9CuCegStS28BAL_40/edit

rshewitt · 2024-02-22T19:20:15Z

@Jin-Sun-tts chating with @krishnasanaka about the Flask form she's working on, and I believe there is work to be done to update the harvest_source table to the latest schema (here).
Id (str) *primary
name (str)
notification_emails (array)
organization_name (str) (*self-reported)
frequency (str) (cron)
url (str)
schema_type (str)
source_type (str)
@GSA/data-gov-team can you take a look at the updates and add comments if you feel something was lost in translation? thanks.

https://docs.google.com/document/d/1XzfTrPxu-asJ_55GoeZ2UOJsie9CuCegStS28BAL_40/edit

we ended up exploding the config json field into separate fields so url and schema_type are their own thing now

Jin-Sun-tts · 2024-02-22T19:59:59Z

@btylerburton @rshewitt agree, that is also what @FuhuXia and I talked about the other day. And we may have the url as unique key.

Jin-Sun-tts · 2024-02-22T20:09:22Z

Do we require the primary key id to be of type UUID, generated each time a record is added, OR retrieved/copied as a string from the CKAN database?

btylerburton · 2024-02-22T20:23:23Z

Removed AC/Sketch regarding load test, as that is dependent on #4617 and related to #4619.

Copied old AC/Sketch below:

GIVEN that the above action has been successful locally
AND a database has been created on cloud.gov
THEN it should be possible to run a load test via Github Actions and generate a report of that test from the entries in the DB

Invoke a load test and confirm that the expected results are written to the DB

rshewitt · 2024-02-22T20:28:15Z

Do we require the primary key id to be of type UUID, generated each time a record is added, OR retrieved/copied as a string from the CKAN database?

i figure id for harvest_error and harvest_job would be generated on insertion. using the ckan id in harvest_source allows us to relate tables by that field which could be valuable?

btylerburton · 2024-02-22T20:30:41Z

using the ckan id in harvest_source allows us to relate tables by that field which could be valuable?

Which id are we talking about? I believe CKAN would generate an identifier for a dataset, but would there be any concept of a harvest source in this new system? I don't think so.

rshewitt · 2024-02-22T20:34:04Z

@btylerburton I believe we're talking about the id field for harvest_source, harvest_error, and harvest_job ( source )

btylerburton · 2024-02-22T22:02:51Z

to add to harvest_source db table:

harvest_source_id
harvest_source_name

@krishnasanaka these are non-required text fields, copy/pasted over from CKAN

Jin-Sun-tts · 2024-02-26T18:31:40Z

I have drafted a pull request (PR) at GSA/datagov-harvester#36 containing all the changes related to this issue. Need to transfer these changes to a separate repository, https://github.com/GSA/datagov-harvest-orchestrator, to facilitate further Flask UI modifications.

Jin-Sun-tts · 2024-02-26T18:55:21Z

@krishnasanaka Please reference the changes mentioned above and ensure that they are incorporated into the new repository datagov-harvest-orchestrator. Additionally, proceed with further modifications as required for ticket #4619

rshewitt · 2024-02-28T18:22:10Z

we should think about things we want to seed into the database as part of the initialization

Jin-Sun-tts · 2024-02-28T18:43:19Z

@jbrown-xentity @rshewitt @btylerburton @FuhuXia @robert-bryson
Could you please share your opinion on using Flask-SQLAlchemy versus pure SQLAlchemy?

Please let me know if it's necessary to create a new ticket for conducting research and experimentation with Flask-SQLAlchemy?

jbrown-xentity · 2024-02-28T18:57:53Z

@jbrown-xentity @rshewitt @btylerburton @FuhuXia @robert-bryson Could you please share your opinion on using Flask-SQLAlchemy versus pure SQLAlchemy?

Please let me know if it's necessary to create a new ticket for conducting research and experimentation with Flask-SQLAlchemy?

I don't have any knowledge or opinion on the subject, discovery would be warranted unless someone else has experience...

robert-bryson · 2024-02-28T22:44:12Z

I don't have a strong opinion. I haven't used that specific project. Sometimes those intersectional type projects are very handy. Sometimes they are more hassle than they're worth. Sometimes they aren't maintained or the maintainer goes MIA without explanation.

btylerburton · 2024-02-28T23:07:22Z

This offers some good context. Overall, seems like global session management is the win, and locking us into that lib for better or worse is the price of that.

https://stackoverflow.com/questions/14343740/flask-sqlalchemy-or-sqlalchemy

Jin-Sun-tts · 2024-02-29T17:53:05Z

thank you everyone for the feedback. We will conduct further research and uncover additional insights within the Flask app ticket #4619

Close this ticket as the table structure settled and defined in SQLAlchemy, and DB initialized and pytest tests run successfully, with results posted and verifiable via interfaces.

btylerburton added this to data.gov team board Feb 7, 2024

rshewitt self-assigned this Feb 7, 2024

gujral-rei moved this to 📔 Product Backlog in data.gov team board Feb 8, 2024

gujral-rei moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Feb 8, 2024

jbrown-xentity moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Feb 9, 2024

Jin-Sun-tts self-assigned this Feb 12, 2024

btylerburton added the H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 label Feb 12, 2024

robert-bryson self-assigned this Feb 14, 2024

btylerburton mentioned this issue Feb 16, 2024

Create Basic Flask Harvest Administration App #4619

Closed

10 tasks

Jin-Sun-tts mentioned this issue Feb 26, 2024

Create harvest db GSA/datagov-harvester#36

Merged

krishnasanaka mentioned this issue Feb 27, 2024

Outline Routes to Flask app #4634

Closed

6 tasks

Jin-Sun-tts mentioned this issue Feb 28, 2024

Create harvest db GSA/datagov-harvest-orchestrator#1

Merged

Jin-Sun-tts closed this as completed Feb 29, 2024

github-project-automation bot moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Feb 29, 2024

krishnasanaka mentioned this issue Mar 1, 2024

Create Edit Flask route #4641

Closed

3 tasks

robert-bryson mentioned this issue Mar 4, 2024

Outline delete route for H2.0 flask app #4644

Closed

6 tasks

btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Mar 4, 2024

Jin-Sun-tts mentioned this issue Mar 6, 2024

Implement Integration of Flask Application with HarvestDB Update Feature #4647

Closed

5 tasks

Jin-Sun-tts mentioned this issue Mar 18, 2024

Harvest db update GSA/datagov-harvester#46

Merged

btylerburton added H2.0/Harvest-DB Postgres and related DB tickets and removed H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 labels Apr 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create Harvest DB and Table Structure #4612

Create Harvest DB and Table Structure #4612

btylerburton commented Feb 7, 2024 •

edited by Jin-Sun-tts

Loading

rshewitt commented Feb 7, 2024

rshewitt commented Feb 8, 2024

rshewitt commented Feb 8, 2024

rshewitt commented Feb 8, 2024

rshewitt commented Feb 9, 2024

Jin-Sun-tts commented Feb 13, 2024 •

edited

Loading

robert-bryson commented Feb 14, 2024

jbrown-xentity commented Feb 14, 2024

btylerburton commented Feb 14, 2024

btylerburton commented Feb 14, 2024 •

edited

Loading

Jin-Sun-tts commented Feb 15, 2024

rshewitt commented Feb 20, 2024 •

edited

Loading

Jin-Sun-tts commented Feb 21, 2024

btylerburton commented Feb 22, 2024

rshewitt commented Feb 22, 2024

Jin-Sun-tts commented Feb 22, 2024 •

edited

Loading

Jin-Sun-tts commented Feb 22, 2024

btylerburton commented Feb 22, 2024

rshewitt commented Feb 22, 2024

btylerburton commented Feb 22, 2024

rshewitt commented Feb 22, 2024

btylerburton commented Feb 22, 2024

Jin-Sun-tts commented Feb 26, 2024

Jin-Sun-tts commented Feb 26, 2024

rshewitt commented Feb 28, 2024 •

edited

Loading

Jin-Sun-tts commented Feb 28, 2024

jbrown-xentity commented Feb 28, 2024

robert-bryson commented Feb 28, 2024

btylerburton commented Feb 28, 2024

Jin-Sun-tts commented Feb 29, 2024

Create Harvest DB and Table Structure #4612

Create Harvest DB and Table Structure #4612

Comments

btylerburton commented Feb 7, 2024 • edited by Jin-Sun-tts Loading

User Story

Acceptance Criteria

Background

Security Considerations (required)

Sketch

rshewitt commented Feb 7, 2024

rshewitt commented Feb 8, 2024

rshewitt commented Feb 8, 2024

rshewitt commented Feb 8, 2024

rshewitt commented Feb 9, 2024

Jin-Sun-tts commented Feb 13, 2024 • edited Loading

robert-bryson commented Feb 14, 2024

jbrown-xentity commented Feb 14, 2024

btylerburton commented Feb 14, 2024

btylerburton commented Feb 14, 2024 • edited Loading

Jin-Sun-tts commented Feb 15, 2024

rshewitt commented Feb 20, 2024 • edited Loading

Jin-Sun-tts commented Feb 21, 2024

btylerburton commented Feb 22, 2024

rshewitt commented Feb 22, 2024

Jin-Sun-tts commented Feb 22, 2024 • edited Loading

Jin-Sun-tts commented Feb 22, 2024

btylerburton commented Feb 22, 2024

rshewitt commented Feb 22, 2024

btylerburton commented Feb 22, 2024

rshewitt commented Feb 22, 2024

btylerburton commented Feb 22, 2024

Jin-Sun-tts commented Feb 26, 2024

Jin-Sun-tts commented Feb 26, 2024

rshewitt commented Feb 28, 2024 • edited Loading

Jin-Sun-tts commented Feb 28, 2024

jbrown-xentity commented Feb 28, 2024

robert-bryson commented Feb 28, 2024

btylerburton commented Feb 28, 2024

Jin-Sun-tts commented Feb 29, 2024

btylerburton commented Feb 7, 2024 •

edited by Jin-Sun-tts

Loading

Jin-Sun-tts commented Feb 13, 2024 •

edited

Loading

btylerburton commented Feb 14, 2024 •

edited

Loading

rshewitt commented Feb 20, 2024 •

edited

Loading

Jin-Sun-tts commented Feb 22, 2024 •

edited

Loading

rshewitt commented Feb 28, 2024 •

edited

Loading