Commit

Merge branch 'main' into data-pipeline-docs
btylerburton authored Jan 4, 2024
2 parents 25f8e3b + 9a278fa commit 4c16eb9
Showing 37 changed files with 1,837 additions and 414 deletions.
1 change: 1 addition & 0 deletions .env
@@ -14,3 +14,4 @@ S3FILESTORE__AWS_ACCESS_KEY_ID=_placeholder
S3FILESTORE__AWS_SECRET_ACCESS_KEY=_placeholder
S3FILESTORE__SIGNATURE_VERSION=s3v4

MDTRANSLATOR_URL=http://127.0.0.1:3000/translates
23 changes: 15 additions & 8 deletions .github/workflows/deploy.yml
@@ -1,32 +1,39 @@
name: Publish to PyPI
on:
# TODO: configure repo to and uncomment below
# pull_request:
# branches: [main]
# types: [closed]
pull_request:
branches: [main]
types: [closed]
workflow_dispatch:
inputs:
version_no:
description: 'Release Version:'
required: true

env:
POETRY_VERSION: "1.7.1"

jobs:
deploy:
name: Publish to PyPI
runs-on: ubuntu-latest
environment:
name: pypi
url: https://pypi.org/project/datagov-harvesting-logic/
if: github.event.pull_request.merged == true || github.event_name == 'workflow_dispatch'
steps:
- name: checkout
uses: actions/checkout@v3
- name: Update setup.py if manual release
if: github.event_name == 'workflow_dispatch'
run: |
# TODO update for pyproject.toml
sed -i "s/version='[0-9]\{1,2\}.[0-9]\{1,4\}.[0-9]\{1,4\}',/version='${{github.event.inputs.version_no}}',/g" setup.py
sed -i "s/version='[0-9]\{1,2\}.[0-9]\{1,4\}.[0-9]\{1,4\}',/version='${{github.event.inputs.version_no}}',/g" pyproject.toml
- name: Install Poetry
uses: abatilo/actions-poetry@v2
with:
poetry-version: ${{ env.POETRY_VERSION }}
- name: Create packages
run: |
python setup.py sdist
python setup.py bdist_wheel
poetry build --verbose
- name: pypi-publish
uses: pypa/gh-action-pypi-publish@v1.8.11
with:
4 changes: 2 additions & 2 deletions .github/workflows/test.yml
@@ -1,6 +1,6 @@
---
name: Tests on Commit
on: [push]
on: [push, pull_request]

env:
PY_VERSION: "3.10"
@@ -44,7 +44,7 @@ jobs:
run: docker-compose up -d

- name: Run Pytest
run: set -o pipefail; poetry run pytest --junitxml=pytest.xml --cov=harvester | tee pytest-coverage.txt
run: set -o pipefail; poetry run pytest --junitxml=pytest.xml --cov=harvester ./tests/unit | tee pytest-coverage.txt

- name: Report test coverage
uses: MishaKav/pytest-coverage-comment@main
6 changes: 4 additions & 2 deletions Makefile
@@ -11,7 +11,7 @@ clean-dist: ## Cleans dist dir
rm -rf dist/*

test: up ## Runs poetry tests, ignores ckan load
poetry run pytest --ignore=./tests/load/ckan
poetry run pytest --ignore=./tests/integration

up: ## Sets up local docker environment
docker compose up -d
@@ -22,8 +22,10 @@ down: ## Shuts down local docker instance
clean: ## Cleans docker images
docker compose down -v --remove-orphans

lint: ## Lints with ruff
lint: ## Lints with ruff, isort, black
ruff .
isort .
black .

# Output documentation for top-level targets
# Thanks to https://marmelab.com/blog/2016/02/29/auto-documented-makefile.html
73 changes: 62 additions & 11 deletions README.md
@@ -5,35 +5,86 @@ transformation, and loading into the data.gov catalog.

## Features

The datagov-harvesting-logic offers the following features:

- Extract
- general purpose fetching and downloading of web resources.
- catered extraction to the following data formats:
- General purpose fetching and downloading of web resources.
- Catered extraction to the following data formats:
- DCAT-US
- Validation
- DCAT-US
- jsonschema validation using draft 2020-12.
- `jsonschema` validation using draft 2020-12.
- Load
- DCAT-US
- conversion of dcatu-us catalog into ckan dataset schema
- create, delete, update, and patch of ckan package/dataset
- Conversion of dcat-us catalog into ckan dataset schema
- Create, delete, update, and patch of ckan package/dataset

## Requirements

This project is using poetry to manage this project. Install [here](https://python-poetry.org/docs/#installation).
This project uses `poetry` for dependency management. Install it [here](https://python-poetry.org/docs/#installation).

Once installed, `poetry install` installs dependencies into a local virtual environment.

## Testing

### CKAN load testing

- CKAN load testing doesn't require the services provided in the `docker-compose.yml`.
- [catalog-dev](https://catalog-dev.data.gov/) is used for ckan load testing.
  Create an api-key by signing into catalog-dev.
- Create a `credentials.py` file at the root of the project containing the variable `ckan_catalog_dev_api_key` assigned to the api-key.
- run tests with the command `poetry run pytest ./tests/load/ckan`
- Run tests with the command `poetry run pytest ./tests/load/ckan`

### Harvester testing
- These tests are found in `extract`, and `validate`. Some of them rely on services in the `docker-compose.yml`. run using docker `docker compose up -d` and with the command `poetry run pytest --ignore=./tests/load/ckan`.

- These tests are found in `extract` and `validate`. Some of them rely on services in the `docker-compose.yml`. Start those with `docker compose up -d`, then run `poetry run pytest --ignore=./tests/load/ckan`.

If you followed the instructions for `CKAN load testing` and `Harvester testing` you can simply run `poetry run pytest` to run all tests.

## Comparison

- `./tests/harvest_sources/ckan_datasets_resp.json`
- Represents what ckan would respond with after querying for the harvest source name
- `./tests/harvest_sources/dcatus_compare.json`
- Represents a changed harvest source
- Created:
- datasets[0]

```diff
+ "identifier" = "cftc-dc10"
```

- Deleted:
- datasets[0]

```diff
- "identifier" = "cftc-dc1"
```

- Updated:
- datasets[1]

```diff
- "modified": "R/P1M"
+ "modified": "R/P1M Update"
```

- datasets[2]

```diff
- "keyword": ["cotton on call", "cotton on-call"]
+ "keyword": ["cotton on call", "cotton on-call", "update keyword"]
```

- datasets[3]

```diff
"publisher": {
"name": "U.S. Commodity Futures Trading Commission",
"subOrganizationOf": {
- "name": "U.S. Government"
+ "name": "Changed Value"
}
}
```

- `./test/harvest_sources/dcatus.json`
  - Represents an original harvest source prior to change occurring.
9 changes: 9 additions & 0 deletions docker-compose.yml
@@ -1,6 +1,15 @@
version: "3"

services:
mdtranslator:
image: ghcr.io/gsa/mdtranslator:latest
ports:
- 3000:3000
healthcheck:
test: ["CMD", "curl", "-d", "{}", "-X", "POST", "http://localhost:3000/translates"]
interval: 10s
timeout: 10s
retries: 5
nginx-harvest-source:
image: nginx
volumes:
25 changes: 22 additions & 3 deletions harvester/__init__.py
@@ -1,16 +1,35 @@
# TODO: maybe turn off this ruff ignore?
# ruff: noqa: F405, F403

__all__ = ["compare", "extract", "load", "transform", "validate", "utils"]
__all__ = [
"compare",
"extract",
"traverse_waf",
"download_waf",
"load",
"create_ckan_package",
"update_ckan_package",
"patch_ckan_package",
"purge_ckan_package",
"dcatus_to_ckan",
"transform",
"validate",
"utils",
]

from dotenv import load_dotenv

# TODO these imports will need to be updated to ensure a consistent api
from .compare import compare
from .extract import extract
from .load import load
from .extract import download_waf, extract, traverse_waf
from .load import (create_ckan_package, dcatus_to_ckan, load,
patch_ckan_package, purge_ckan_package, update_ckan_package)
from .transform import transform
from .utils import *
from .validate import *

load_dotenv()

# configuration settings
bucket_name = "test-bucket"
content_types = {
19 changes: 16 additions & 3 deletions harvester/compare.py
@@ -3,9 +3,22 @@
logger = logging.getLogger("harvester")


# stub, TODO complete
def compare(compare_obj):
def compare(harvest_source, ckan_source):
"""Compares records"""
logger.info("Hello from harvester.compare()")

return compare_obj
output = {
"create": [],
"update": [],
"delete": [],
}

harvest_ids = set(harvest_source.keys())
ckan_ids = set(ckan_source.keys())
same_ids = harvest_ids & ckan_ids

output["create"] += list(harvest_ids - ckan_ids)
output["delete"] += list(ckan_ids - harvest_ids)
output["update"] += [i for i in same_ids if harvest_source[i] != ckan_source[i]]

return output
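
A minimal usage sketch of the set logic above, with illustrative records keyed by dataset identifier (the function body mirrors the diff):

```python
def compare(harvest_source, ckan_source):
    # Mirrors harvester/compare.py from this diff: ids only in the
    # harvest source are created, ids only in CKAN are deleted, and
    # shared ids whose content differs are updated.
    output = {"create": [], "update": [], "delete": []}
    harvest_ids = set(harvest_source.keys())
    ckan_ids = set(ckan_source.keys())
    same_ids = harvest_ids & ckan_ids
    output["create"] += list(harvest_ids - ckan_ids)
    output["delete"] += list(ckan_ids - harvest_ids)
    output["update"] += [i for i in same_ids if harvest_source[i] != ckan_source[i]]
    return output

# Illustrative records keyed by dataset identifier
harvest = {"cftc-dc10": {"modified": "R/P1M"}, "cftc-dc2": {"modified": "R/P1M Update"}}
ckan = {"cftc-dc1": {"modified": "R/P1M"}, "cftc-dc2": {"modified": "R/P1M"}}

result = compare(harvest, ckan)
# result["create"] == ["cftc-dc10"]
# result["delete"] == ["cftc-dc1"]
# result["update"] == ["cftc-dc2"]
```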
6 changes: 6 additions & 0 deletions harvester/extract.py
@@ -29,6 +29,9 @@ def download_dcatus_catalog(url):


def traverse_waf(url, files=[], file_ext=".xml", folder="/", filters=[]):
"""Transverses WAF
Please add docstrings
"""
# TODO: add exception handling
parent = os.path.dirname(url.rstrip("/"))

@@ -56,6 +59,9 @@ def traverse_waf(url, files=[], file_ext=".xml", folder="/", filters=[]):


def download_waf(files):
"""Downloads WAF
Please add docstrings
"""
output = []
for file in files:
data = {}
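
The bodies of `traverse_waf` and `download_waf` are elided in this diff. As an illustrative, simplified sketch of the WAF-crawling idea only (not the project's implementation), the link-filtering step might look like:

```python
import re

# Made-up directory-listing snippet standing in for a fetched WAF page;
# the real traverse_waf fetches pages over HTTP and recurses, which
# this offline sketch skips.
listing_html = (
    '<a href="file1.xml">file1.xml</a>'
    '<a href="notes.txt">notes.txt</a>'
    '<a href="file2.xml">file2.xml</a>'
)

def extract_links(html, file_ext=".xml"):
    # Keep only href targets with the desired extension
    return [href for href in re.findall(r'href="([^"]+)"', html) if href.endswith(file_ext)]

links = extract_links(listing_html)
# links == ["file1.xml", "file2.xml"]
```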

1 comment on commit 4c16eb9

@github-actions

Coverage Report
| File | Stmts | Miss | Cover |
|------|-------|------|-------|
| `harvester/__init__.py` | 12 | 0 | 100% |
| `harvester/compare.py` | 12 | 0 | 100% |
| `harvester/extract.py` | 48 | 7 | 85% |
| `harvester/load.py` | 100 | 10 | 90% |
| `harvester/transform.py` | 13 | 7 | 46% |
| `harvester/utils/__init__.py` | 2 | 0 | 100% |
| `harvester/utils/json.py` | 4 | 0 | 100% |
| `harvester/utils/util.py` | 7 | 0 | 100% |
| `harvester/validate/__init__.py` | 2 | 0 | 100% |
| `harvester/validate/dcat_us.py` | 24 | 3 | 88% |
| **TOTAL** | 224 | 27 | 88% |

| Tests | Skipped | Failures | Errors | Time |
|-------|---------|----------|--------|------|
| 28 | 0 💤 | 0 ❌ | 0 🔥 | 17.354s ⏱️ |
