Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create Harvest DB and Table Structure #4612

Closed
6 tasks done
btylerburton opened this issue Feb 7, 2024 · 30 comments
Closed
6 tasks done

Create Harvest DB and Table Structure #4612

btylerburton opened this issue Feb 7, 2024 · 30 comments
Assignees
Labels
H2.0/Harvest-DB Postgres and related DB tickets

Comments

@btylerburton
Copy link
Contributor

btylerburton commented Feb 7, 2024

User Story

In order to track the state of the new Harvesting 2.0 pipeline, datagovteam wants to create a Harvesting DB to record the status of harvest sources, harvest jobs, and harvest errors.

The Harvest DB will be defined in code using SQLAlchemy, so as to be reproducible through environments and deployments.

NOTE: As Google docs allows for more flexibility in table structure as modifications are ongoing, we will use the Harvesting Research Doc (here) to finalize table structure before adding an accompanying ERD to datagov-harvesting-logic docs.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN the team has settled on the Table structure above
    AND it has been defined in an SQLAlchemy file
    THEN it should be able to be created programaticaly in a Docker container by running a DB initialization script.

  • GIVEN that the DB has been initialized
    WHEN tests are run locally in pytest
    THEN the results should be posted to the DB and be verifiable by running a query to generate the resulting report

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

  • DB should be properly secured

Sketch

  • Define Table Structure in SQLAlchemy
  • Initialize this structure into a local Docker PostGres container
  • Create an interface in datagov-harvesting-logic to interact with the DB
  • Call that interface as needed to report the results of harvester operations into the DB
@rshewitt
Copy link
Contributor

rshewitt commented Feb 7, 2024

sqlachemy work in harvesting logic

@rshewitt
Copy link
Contributor

rshewitt commented Feb 8, 2024

active branch

@gujral-rei gujral-rei moved this to 📔 Product Backlog in data.gov team board Feb 8, 2024
@gujral-rei gujral-rei moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Feb 8, 2024
@rshewitt
Copy link
Contributor

rshewitt commented Feb 8, 2024

@rshewitt
Copy link
Contributor

rshewitt commented Feb 8, 2024

@jbrown-xentity jbrown-xentity moved this from 📟 Sprint Backlog [7] to 🏗 In Progress [8] in data.gov team board Feb 9, 2024
@rshewitt
Copy link
Contributor

rshewitt commented Feb 9, 2024

@Jin-Sun-tts Jin-Sun-tts self-assigned this Feb 12, 2024
@btylerburton btylerburton added the H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 label Feb 12, 2024
@Jin-Sun-tts
Copy link
Contributor

Jin-Sun-tts commented Feb 13, 2024

Defined table structure in SQLAlchemy, create a docker container running Postgres, connected to create tables in DB:
Screenshot 2024-02-13 at 12 55 56 PM

Inserted a sample record in the harvest_source table:
Screenshot 2024-02-13 at 12 56 16 PM

Created HarvestDBInterface to create/get/update db. Will add more detail functions in.
also created a simple testing to insert records to the harvest_job table etc.

@robert-bryson robert-bryson self-assigned this Feb 14, 2024
@robert-bryson
Copy link
Contributor

DB created in dev space, see keys with cf service-key harvesting-logic-db access-test. Connecting to the service remotely needs to be done through an ssh tunnel: https://docs.cloudfoundry.org/devguide/deploy-apps/ssh-services.html#ssh-tunnel. I'm working on a proof of concept to connect to the internal from a github action.

@jbrown-xentity
Copy link
Contributor

DB created in dev space, see keys with cf service-key harvesting-logic-db access-test. Connecting to the service remotely needs to be done through an ssh tunnel: https://docs.cloudfoundry.org/devguide/deploy-apps/ssh-services.html#ssh-tunnel. I'm working on a proof of concept to connect to the internal from a github action.

Just FYI, we have ssh turned off on prod at the entire space level. So if we need to use ssh to connect, that's probably not a good long term approach... However, knowing that we cannot connect outside of cloud.gov to the RDS services is helpful to know!

@btylerburton
Copy link
Contributor Author

I would think the long-term solution is a proxy app.

@btylerburton
Copy link
Contributor Author

btylerburton commented Feb 14, 2024

Connecting to the service remotely needs to be done through an ssh tunnel

I tried to connect with SQLEctron yesterday and got basically the same connection error.

@Jin-Sun-tts
Copy link
Contributor

Developed a Flask application and set up a PostgreSQL Docker container for local execution, as detailed in branch create-harvest-db.

Additionally, interfaces were established to interact with the corresponding tables.

for examples, to insert test records, utilize the following endpoints:
Add a source: /add_source
Add a job: /add_job?source_id=
Add an error: /add_error?job_id=

Image

To-do:
Question: Regarding the harvest_source table, should organization_name and config.url be designated as the primary/unique key?
Add additional load test scripts
Setup and conduct testing on cloud.gov, utilizing the database defined within that environment.

@rshewitt
Copy link
Contributor

rshewitt commented Feb 20, 2024

Developed a Flask application and set up a PostgreSQL Docker container for local execution, as detailed in branch create-harvest-db.

Additionally, interfaces were established to interact with the corresponding tables.

for examples, to insert test records, utilize the following endpoints: Add a source: /add_source Add a job: /add_job?source_id= Add an error: /add_error?job_id=

Image

To-do: Question: Regarding the harvest_source table, should organization_name and config.url be designated as the primary/unique key? Add additional load test scripts Setup and conduct testing on cloud.gov, utilizing the database defined within that environment.

@Jin-Sun-tts in response to your question, so far all the tables inherit an id field from Base ( source ) which is used as the primary key. are you suggesting creating a composite key using organization_name + config.url? I think that's more meaningful from a data perspective than using a uuid. I suppose it's a matter of whether we want to use a business key or a surrogate key.

@Jin-Sun-tts
Copy link
Contributor

Pushed the Fask app to cloud.gov development https://harvesting-logic.app.cloud.gov/

@btylerburton
Copy link
Contributor Author

@Jin-Sun-tts chating with @krishnasanaka about the Flask form she's working on, and I believe there is work to be done to update the harvest_source table to the latest schema (here).

Id (str) *primary
name (str)
notification_emails (array)
organization_name (str) (*self-reported)
frequency (str) (cron)
url (str)
schema_type (str)
source_type (str)

@GSA/data-gov-team can you take a look at the updates and add comments if you feel something was lost in translation? thanks.

https://docs.google.com/document/d/1XzfTrPxu-asJ_55GoeZ2UOJsie9CuCegStS28BAL_40/edit

@rshewitt
Copy link
Contributor

@Jin-Sun-tts chating with @krishnasanaka about the Flask form she's working on, and I believe there is work to be done to update the harvest_source table to the latest schema (here).

Id (str) *primary
name (str)
notification_emails (array)
organization_name (str) (*self-reported)
frequency (str) (cron)
url (str)
schema_type (str)
source_type (str)

@GSA/data-gov-team can you take a look at the updates and add comments if you feel something was lost in translation? thanks.

https://docs.google.com/document/d/1XzfTrPxu-asJ_55GoeZ2UOJsie9CuCegStS28BAL_40/edit

we ended up exploding the config json field into separate fields so url and schema_type are their own thing now

@Jin-Sun-tts
Copy link
Contributor

Jin-Sun-tts commented Feb 22, 2024

@btylerburton @rshewitt agree, that is also what @FuhuXia and I talked about the other day. And we may have the url as unique key.

@Jin-Sun-tts
Copy link
Contributor

Do we require the primary key id to be of type UUID, generated each time a record is added, OR retrieved/copied as a string from the CKAN database?

@btylerburton
Copy link
Contributor Author

Removed AC/Sketch regarding load test, as that is dependent on #4617 and related to #4619.

Copied old AC/Sketch below:

GIVEN that the above action has been successful locally
AND a database has been created on cloud.gov
THEN it should be possible to run a load test via Github Actions and generate a report of that test from the entries in the DB

Invoke a load test and confirm that the expected results are written to the DB

@rshewitt
Copy link
Contributor

Do we require the primary key id to be of type UUID, generated each time a record is added, OR retrieved/copied as a string from the CKAN database?

i figure id for harvest_error and harvest_job would be generated on insertion. using the ckan id in harvest_source allows us to relate tables by that field which could be valuable?

@btylerburton
Copy link
Contributor Author

using the ckan id in harvest_source allows us to relate tables by that field which could be valuable?

Which id are we talking about? I believe CKAN would generate an identifier for a dataset, but would there be any concept of a harvest source in this new system? I don't think so.

@rshewitt
Copy link
Contributor

@btylerburton I believe we're talking about the id field for harvest_source, harvest_error, and harvest_job ( source )

@btylerburton
Copy link
Contributor Author

to add to harvest_source db table:

  • harvest_source_id
  • harvest_source_name

@krishnasanaka these are non-required text fields, copy/pasted over from CKAN

@Jin-Sun-tts
Copy link
Contributor

I have drafted a pull request (PR) at GSA/datagov-harvester#36 containing all the changes related to this issue. Need to transfer these changes to a separate repository, https://github.com/GSA/datagov-harvest-orchestrator, to facilitate further Flask UI modifications.

@Jin-Sun-tts
Copy link
Contributor

@krishnasanaka Please reference the changes mentioned above and ensure that they are incorporated into the new repository datagov-harvest-orchestrator. Additionally, proceed with further modifications as required for ticket #4619

@rshewitt
Copy link
Contributor

rshewitt commented Feb 28, 2024

we should think about things we want to seed into the database as part of the initialization

@Jin-Sun-tts
Copy link
Contributor

@jbrown-xentity @rshewitt @btylerburton @FuhuXia @robert-bryson
Could you please share your opinion on using Flask-SQLAlchemy versus pure SQLAlchemy?

Please let me know if it's necessary to create a new ticket for conducting research and experimentation with Flask-SQLAlchemy?

@jbrown-xentity
Copy link
Contributor

@jbrown-xentity @rshewitt @btylerburton @FuhuXia @robert-bryson Could you please share your opinion on using Flask-SQLAlchemy versus pure SQLAlchemy?

Please let me know if it's necessary to create a new ticket for conducting research and experimentation with Flask-SQLAlchemy?

I don't have any knowledge or opinion on the subject, discovery would be warranted unless someone else has experience...

@robert-bryson
Copy link
Contributor

I don't have a strong opinion. I haven't used that specific project. Sometimes those intersectional type projects are very handy. Sometimes they are more hassle than they're worth. Sometimes they aren't maintained or the maintainer goes MIA without explanation.

@btylerburton
Copy link
Contributor Author

This offers some good context. Overall, seems like global session management is the win, and locking us into that lib for better or worse is the price of that.

https://stackoverflow.com/questions/14343740/flask-sqlalchemy-or-sqlalchemy

@Jin-Sun-tts
Copy link
Contributor

thank you everyone for the feedback. We will conduct further research and uncover additional insights within the Flask app ticket #4619

Close this ticket as the table structure settled and defined in SQLAlchemy, and DB initialized and pytest tests run successfully, with results posted and verifiable via interfaces.

@github-project-automation github-project-automation bot moved this from 🏗 In Progress [8] to ✔ Done in data.gov team board Feb 29, 2024
@btylerburton btylerburton moved this from ✔ Done to 🗄 Closed in data.gov team board Mar 4, 2024
@btylerburton btylerburton added H2.0/Harvest-DB Postgres and related DB tickets and removed H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 labels Apr 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
H2.0/Harvest-DB Postgres and related DB tickets
Projects
Archived in project
Development

No branches or pull requests

5 participants