Migrate from MongoDB to Postgres #378

Merged
merged 7 commits into from
Feb 24, 2022

Conversation

kinow
Member

@kinow kinow commented Jan 12, 2022

Closes #254

Description

Replace MongoDB with Postgres in CWLViewer.

  • update pom.xml dependencies, removing Mongo, adding Postgres + Hibernate (or data/jdbc)
  • replace the Spring configuration used by Mongo, probably add one for Postgres (pool settings, etc)
  • update the repositories to use Hibernate or JDBC instead of Data+Mongo
  • update entities and other annotations & classes as needed
  • add database migration, with initial DDL, but try to include some instructions for future DMLs too
  • update tests (will probably need TestContainers to spin up a Postgres instance for the QueuedWorkflowRepository tests)
  • manual tests
    • simple hello world workflow visualization
    • API
    • cron jobs
  • update containers
  • send to review and write notes about anything important to observe when migrating, any docs, etc

Motivation and Context

Licensing issues with MongoDB, see linked issue.

How Has This Been Tested?

  • via unit tests
  • manually running the viewer
    • running mvn spring-boot:run to trigger the flyway migrations
  • in GitHub Actions CI

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Manual tests

  • WorkflowRepository
    • findByRetrievedFrom (tested by submitting the compile.wdl example Workflow in the landing page)
    • findByCommitAndPath (tested by accessing http://localhost:8080/git/767d700e602805112a4c953d166e570cddfa2605/workflows/compile/compile1.cwl?part=main&format=jsonld)
    • findByCommit (tested by accessing http://localhost:8080/git/767d700e602805112a4c953d166e570cddfa2605/workflows/compile/compile1.cwl?format=raw)
    • findAllByOrderByRetrievedOnDesc (tested by clicking the "Explore" top menu link)
    • findByLabelContainingOrDocContainingIgnoreCase (tested by searching for "aaa" and "compile" in the search bar, after submitting the compile.cwl workflow)
  • QueuedWorkflowRepositoryImpl
    • findByRetrievedFrom (tested in unit test QueuedWorkflowRepositoryTest.java)
    • deleteByTempRepresentation_RetrievedFrom (tested in unit test QueuedWorkflowRepositoryTest.java)
  • QueuedWorkflowRepository
    • deleteByTempRepresentation_RetrievedOnLessThanEqual (tested by adding a queued workflow manually, below, and changing the scheduler to temporarily run every minute in application.properties)
    • findByTempRepresentation_RetrievedOnLessThanEqual (not used??)

To add a really old queued workflow:

insert into queued_workflow (id, cwltool_status, cwltool_version, message, temp_representation, workflow_list)
values ('bruno-testing', '{}'::jsonb, '1', 'yo', '{
"retrievedOn": "2016-02-04 00:45:56.000"
}'::jsonb, '{}'::jsonb);

Then query the table:

select * from queued_workflow;

And wait for the log line to confirm, or use a debugger. The log line should look like:

2022-02-24 22:44:23,001 INFO  [scheduling-1] org.commonwl.view.Scheduler: Deleting queued workflows older than or equal to Wed Feb 23 22:44:23 NZDT 2022
2022-02-24 22:44:23,004 INFO  [scheduling-1] org.commonwl.view.Scheduler: 1 Old queued workflows removed
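
The scheduler check above amounts to computing a cutoff timestamp and removing every queued workflow whose retrievedOn is at or before it. Here is a minimal sketch in plain Java, with a list standing in for the table. The one-day window is an assumption inferred from the log line (the cutoff is exactly one day before the log timestamp), and `QueuedEntry`, `OldQueueCleanupSketch` and `removeOld` are hypothetical names, not CWLViewer code:

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a row of queued_workflow; not the CWLViewer entity. */
final class QueuedEntry {
    final String id;
    final LocalDateTime retrievedOn;

    QueuedEntry(String id, LocalDateTime retrievedOn) {
        this.id = id;
        this.retrievedOn = retrievedOn;
    }
}

final class OldQueueCleanupSketch {
    /**
     * Removes entries with retrievedOn <= (now - maxAge) from the list,
     * mirroring the "LessThanEqual" in the repository method name.
     * Returns how many entries were removed.
     */
    static int removeOld(List<QueuedEntry> queue, LocalDateTime now, Duration maxAge) {
        LocalDateTime cutoff = now.minus(maxAge);
        int before = queue.size();
        // "not after the cutoff" is the same as "less than or equal to the cutoff".
        queue.removeIf(entry -> !entry.retrievedOn.isAfter(cutoff));
        return before - queue.size();
    }

    public static void main(String[] args) {
        List<QueuedEntry> queue = new ArrayList<>();
        // Same timestamp as the manually inserted row above.
        queue.add(new QueuedEntry("bruno-testing", LocalDateTime.of(2016, 2, 4, 0, 45, 56)));
        queue.add(new QueuedEntry("fresh", LocalDateTime.of(2022, 2, 24, 22, 0, 0)));

        LocalDateTime now = LocalDateTime.of(2022, 2, 24, 22, 44, 23);
        // One day is an assumption taken from the log line, not from the code base.
        int removed = removeOld(queue, now, Duration.ofDays(1));
        System.out.println(removed + " old queued workflows removed");
    }
}
```

An entry stamped exactly at the cutoff is removed as well, which is the behaviour the "older than or equal to" wording in the log suggests.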

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@kinow
Member Author

kinow commented Jan 12, 2022

New schema

Now comes what is probably the biggest part of the work: converting the MongoDB schema to Postgres. I will try to keep things as simple as possible. Instead of creating complex tables for inputs and outputs, I'll just use a JSONB field. So it won't be too different from what we had in MongoDB; we may end up with just two tables.

To compare data, I started the main branch, uploaded some workflows, then stopped it, and started working on this branch.

Then, with docker-compose up mongo I can start just the mongo server while I develop. And use DataGrip to connect to the local mongo DB. To connect, one needs to also add a docker compose override file with something like:

version: '3.2'
services:
  spring:
    build: .
  mongo:
    image: mongo:3.4
    ports:
      - "27017:27017"
    volumes:
      - type: volume
        source: mongo
        target: /data/db

This way it's easy to consult the old DB schema to assist with the migration.

image

Migration

Good reference: https://developer.okta.com/blog/2019/02/20/spring-boot-with-postgresql-flyway-jsonb

@kinow kinow changed the title Add dependencies for Postgres with Spring Data Migrate from MongoDB to Postgres Jan 12, 2022
@kinow kinow force-pushed the goodbye-mongo branch 2 times, most recently from 321f248 to 734b2c0 Compare January 16, 2022 08:48
@kinow
Member Author

kinow commented Jan 16, 2022

Unit Tests

With the exception of one, all unit tests passed. That doesn't mean it is almost working, though :-) The one test that failed used to start a MongoDB database for the test. With PostgreSQL, the closest equivalent is running Docker containers for tests.

Luckily there is the TestContainers project that supports SpringBoot. I've used it successfully today, and it started the Docker containers with each test, then stopped them. Still trying to confirm it created the DDL and executed for tests (won't do it in production), and also need to fix this one test that is failing due to the SQL query executed having some issues.

@kinow
Member Author

kinow commented Jan 28, 2022

One last unit test failing. Almost there 🙏

@kinow
Member Author

kinow commented Feb 6, 2022

Stashing work for today. Before I knew where the problem was, but didn't know what the problem was. Now I know what the problem is, just trying to figure out a way to fix it :-)

We have at least two open problems to be fixed. The first one is in the unit tests that I committed now. The fields in Postgres are being correctly created as JSONB, and querying those fields with a TEXT JSON value works.

However, the Java code appears to be converting the GitDetails object into bytea, not jsonb. I'm trying to find a way to rewrite the repository query so that the parameter is passed as jsonb.

Once that's fixed, I will have to uncomment a part of the JSON deserialization, since it was complaining about fields that are not annotated with the JsonProperty, and do not have a setter method (creating the setter works, but not clear if that's the correct solution).

Troubleshooting notes

  1. A one-liner Docker command that starts Postgres and logs all queries to the terminal: docker run --rm --name cwlviewer-postgres -p 5432:5432 -e POSTGRES_PASSWORD=sa postgres -c log_statement=all
  2. What Hibernate created and inserted before the test failed:
create table queued_workflow (id varchar(36) not null, cwltool_status jsonb, cwltool_version varchar(1000), message varchar(1000), temp_representation jsonb, workflow_list jsonb, primary key (id));

create table workflow (id varchar(36) not null, cwltool_version varchar(1000), doc varchar(1000), docker_link varchar(1000), inputs jsonb, label varchar(1000), last_commit varchar(255), license_link varchar(1000), outputs jsonb, retrieved_from jsonb, retrieved_on timestamp, ro_bundle_path varchar(1000), steps jsonb, visualisation_dot varchar(1000), primary key (id));

create index IDX14ahubfm3f1ynds84uhdx7ews on workflow (retrieved_on);

alter table workflow add constraint UK13qbig8o1om524ht4txhbbvhf unique (retrieved_from);

-- I re-wrote it to simplify querying
insert into queued_workflow (cwltool_status, cwltool_version, message, temp_representation, workflow_list, id)
values ('{"status": "OK"}', 'v1.2', 'message?', '{"name": "buffy"}', '[{"name": "willow"}]', 1);
  3. What the test is trying to do:
-- Works!
SELECT * FROM queued_workflow q
WHERE
      cast(q.temp_representation -> 'name' as text) = '"buffy"';

-- The last part, ?1, is bound as bytea; need to figure out a way to have it converted to jsonb IN SPRING JPA NATIVE
SELECT * FROM queued_workflow q
WHERE q.temp_representation -> 'name' = ?1

-- to test the type of a column:
SELECT q.temp_representation, pg_typeof(q.temp_representation -> 'name') FROM queued_workflow q;

@kinow
Member Author

kinow commented Feb 12, 2022

Argh. CI will be broken for a little while, but I have finally understood what was wrong with the Spring repository when trying to query a JSONB type in Postgres. In case another dev needs it (from CWL or from another community/project), here it goes:

  • The best way to set up JSON/JSONB serialization and de-serialization is really with Vlad's hibernate-types library (ignore the deprecation in @Type; the Hibernate devs deprecated it without giving a replacement) - https://vladmihalcea.com/how-to-map-json-objects-using-generic-hibernate-types/
  • When you query with JSON or JSONB, with something like WHERE column -> 'json_value' = ?1, you may receive a warning because Hibernate/JPA/Spring/etc. were not able to compare the left-hand jsonb with a right-hand parameter of type bytea; you'll have to create an implementation for the repository, and use the two-interface trick to have a custom method implemented instead of only interface methods. Again, thanks Vlad - https://vladmihalcea.com/jpa-query-setparameter-hibernate/
  • After you have done that, the query might still fail if you have a typo as I had! I was doing table.column -> retrieved_from and Postgres never complained about it. After debugging all the way from the Spring data test, to the hibernate-types and Spring boot lib code, setting a breakpoint and trying a few combinations with an EntityManager, I managed to find a query that worked with retrievedFrom 🤦‍♂️
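
The shape of the two-interface trick mentioned above can be sketched in plain Java, without Spring on the classpath. This is only an outline under stated assumptions: every name below is hypothetical, a list stands in for the database, and `InMemoryQueuedItemRepository` plays the role of the proxy that Spring Data would normally generate. The part that actually fixes the bytea problem (binding the parameter with an explicit JSON type through the EntityManager) needs Hibernate and is only marked by a comment:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical, trimmed-down entity; the real one has more columns. */
final class QueuedItem {
    final String id;
    final String retrievedFromJson; // stands in for the jsonb value

    QueuedItem(String id, String retrievedFromJson) {
        this.id = id;
        this.retrievedFromJson = retrievedFromJson;
    }
}

/** Interface 1: declares only the method that needs hand-written code. */
interface QueuedItemRepositoryCustom {
    Optional<QueuedItem> findByRetrievedFrom(String retrievedFromJson);
}

/**
 * Interface 2: what the rest of the application uses. With Spring Data it
 * would also extend a framework interface (left out to stay dependency-free).
 */
interface QueuedItemRepository extends QueuedItemRepositoryCustom {
    void save(QueuedItem item);
}

/**
 * Hand-written implementation of interface 1. In the real application this is
 * where an EntityManager would build the native query and bind the parameter
 * with an explicit JSON type; here a list lookup stands in for the database.
 */
final class QueuedItemRepositoryImpl implements QueuedItemRepositoryCustom {
    private final List<QueuedItem> rows;

    QueuedItemRepositoryImpl(List<QueuedItem> rows) {
        this.rows = rows;
    }

    @Override
    public Optional<QueuedItem> findByRetrievedFrom(String retrievedFromJson) {
        return rows.stream()
                .filter(row -> row.retrievedFromJson.equals(retrievedFromJson))
                .findFirst();
    }
}

/** Stand-in for the proxy Spring Data generates: it delegates to the Impl. */
final class InMemoryQueuedItemRepository implements QueuedItemRepository {
    private final List<QueuedItem> rows = new ArrayList<>();
    private final QueuedItemRepositoryCustom custom = new QueuedItemRepositoryImpl(rows);

    @Override
    public void save(QueuedItem item) {
        rows.add(item);
    }

    @Override
    public Optional<QueuedItem> findByRetrievedFrom(String retrievedFromJson) {
        return custom.findByRetrievedFrom(retrievedFromJson);
    }
}

final class TwoInterfaceSketch {
    public static void main(String[] args) {
        QueuedItemRepository repo = new InMemoryQueuedItemRepository();
        repo.save(new QueuedItem("1", "{\"name\": \"buffy\"}"));
        System.out.println(repo.findByRetrievedFrom("{\"name\": \"buffy\"}").isPresent());
    }
}
```

Callers only ever see `QueuedItemRepository`, so the hand-written query can be swapped or fixed without touching them.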

Once you have done that, don't touch anything. Commit (fix & squash later) and go enjoy your weekend! 🍻

At least this issue was enough for me to refresh my memory about Spring Data, Hibernate, JSON/Jackson, and get more familiar with the old Mongo structure. This last failing test must be fixed soon, then will compare the 2 databases, and the final stage will commence where I will have to start importing workflows, comparing, planning the migration, etc 👍

-Bruno

p.s. a good place to set a breakpoint is in JsonTypeDescriptor, in the unwrap method 😉 that gives you the debugger when the hibernate-types is wrapping/unwrapping the objects, and you can return to the previous method where you should have some Hibernate/SpringData classes where you can explore the context, see the prepared statement with the complete SQL including the parameters.

@kinow kinow force-pushed the goodbye-mongo branch 2 times, most recently from b4ad46d to 0a5a7d0 Compare February 12, 2022 06:06
@mr-c
Member

mr-c commented Feb 12, 2022

Great to see the progress!

@kinow
Member Author

kinow commented Feb 16, 2022

Managed to update the docs and remaining parts of the code that referenced Mongo. Also updated the Dockerfile and docker-compose.yml, these should be ready and won't require further work, I think.

Finally, the application was successfully initialized, latest spring, Jena Fuseki, and Postgres. I used Docker Compose and the updated docker-compose.yml. Needed to clear volumes and re-create as I was testing.

docker-compose up --force-recreate
docker-compose down -v

I've marked the subtask for Docker as completed! Now I will work on manual tests. First test failed, but that's not too bad. At least now I already have an idea what's broken, and even think I know how to fix it quickly 🙂

image

@kinow
Member Author

kinow commented Feb 19, 2022

With the latest commit, we are now able to visualize workflows, and the data gets put into the Postgres DB 🙂 We are almost done now I think!

@kinow
Member Author

kinow commented Feb 23, 2022

Hmmm, Flyway license doesn't seem very good. Might have to use the other migration library that is integrated with Spring Boot 😞

Caused by: org.flywaydb.core.internal.license.FlywayEditionUpgradeRequiredException: Flyway Teams Edition or PostgreSQL upgrade required: PostgreSQL 9.6 is no longer supported by Flyway Community Edition, but still supported by Flyway Teams Edition.

@@ -18,9 +18,6 @@ jobs:
ref: ${{ github.event.pull_request.head.ref }}
repository: ${{ github.event.pull_request.head.repo.full_name }}

- name: MongoDB in GitHub Actions
uses: supercharge/mongodb-github-action@1.3.0

Member Author

Not needed anymore. We are using TestContainers in the QueuedWorkflowRepositoryTest. That's a Maven dependency that starts a container for that unit test only. We can control when the container is started, destroyed, etc. Later we can think about splitting the tests into groups/suites, so that mvn test doesn't need to start the container.

@lgtm-com

This comment was marked as outdated.

@kinow kinow force-pushed the goodbye-mongo branch 2 times, most recently from 8b812b4 to 4a5c13b Compare February 24, 2022 11:03
@lgtm-com

lgtm-com bot commented Feb 24, 2022

This pull request fixes 1 alert when merging 4a5c13b into 15d269f - view on LGTM.com

fixed alerts:

  • 1 for Spurious Javadoc @param tags

@kinow kinow marked this pull request as ready for review February 24, 2022 11:27
@kinow
Member Author

kinow commented Feb 24, 2022

Ready for review 🎉 working on how to migrate the data now.

@lgtm-com

lgtm-com bot commented Feb 24, 2022

This pull request fixes 1 alert when merging 7426f6b into 15d269f - view on LGTM.com

fixed alerts:

  • 1 for Spurious Javadoc @param tags

@mr-c
Member

mr-c commented Feb 24, 2022

Woo-hoo! Thank you @kinow !

Were you able to do a dump/restore from view.commonwl.org? https://github.com/common-workflow-language/cwlviewer#dumprestore

I would recommend doing this in two phases

  1. Dump view.commonwl.org
  2. Restore using the codebase from main
  3. Probably there are errors from resources being offline. Dump the database again from the instance in step 2 (not the production instance)
  4. Restore using this PR's codebase using the dump from step 3.

That way you know which errors are probably due to these changes versus external resources being down.
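
The reasoning behind the two restores can be sketched as a set difference: anything that already failed to load on main never reaches the second dump, so only workflows that are in that dump but fail to load with the PR point at the migration. Below is a rough sketch; the class, method, and workflow names are hypothetical, and it assumes each restore can be summarised as the set of workflows that loaded successfully:

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class RestoreComparisonSketch {
    /**
     * Workflows present in the dump taken from the main-branch instance (step 3)
     * that did not load with the PR code base (step 4): likely regressions.
     * The input sets are left unmodified.
     */
    static Set<String> likelyRegressions(Set<String> dumpedFromMain, Set<String> loadedOnPr) {
        Set<String> result = new TreeSet<>(dumpedFromMain);
        result.removeAll(loadedOnPr);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical workflow identifiers, for illustration only.
        Set<String> dumpedFromMain = new TreeSet<>(Arrays.asList("compile1.cwl", "hello.cwl"));
        Set<String> loadedOnPr = new TreeSet<>(Arrays.asList("hello.cwl"));
        System.out.println(likelyRegressions(dumpedFromMain, loadedOnPr));
    }
}
```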

@kinow
Member Author

kinow commented Feb 24, 2022

Thanks @mr-c ! And woo-hoo indeed!!! 🥳

Were you able to do a dump/restore from view.commonwl.org?

Not yet. I spoke with Ward & Peter, and they told me the dump/restore process can be quite lengthy... so I was looking at how hard it would be to just dump Mongo and import into PostgreSQL.

But not sure how this final part will be done. At least the code changes are done I think. Just a matter of defining how to migrate to the new PostgreSQL DB 🚀

🛏️ now for meeting in ~6:30 hrs 👋

@mr-c
Member

mr-c commented Feb 24, 2022

Please get good sleep @kinow !

I've done a dump/restore cycle before and yes the restore takes a while. The point would be to stress-test the PR. You could do a full dump & partial restore as a sanity check. Maybe 1%?

@mr-c
Member

mr-c commented Feb 24, 2022

I did a dump/load cycle until my local hard-drive filled up. Almost reached 1%!

Here's the new dump (the subset that successfully loaded). Going to load it into the codebase from this PR now

2022-02-24T150823+0000.json.gz

Member

@mr-c mr-c left a comment

After loading up ~210 workflows, this seems to work!

@kinow
Member Author

kinow commented Feb 24, 2022

After loading up ~210 workflows, this seems to work!

Whew!!! Thanks for testing it 👏 🍾 🥳 🎂

How long did it take to load these workflows, BTW?

@mr-c mr-c merged commit 9d87837 into main Feb 24, 2022
@mr-c mr-c deleted the goodbye-mongo branch February 24, 2022 19:59
Development

Successfully merging this pull request may close these issues.

replace mongodb with a F/OSS database
2 participants