Commit

Merge branch 'main' into data-pipeline-docs
btylerburton authored Jan 4, 2024
2 parents 25f8e3b + 9a278fa commit 4c16eb9
Showing 37 changed files with 1,837 additions and 414 deletions.
1 change: 1 addition & 0 deletions .env
@@ -14,3 +14,4 @@ S3FILESTORE__AWS_ACCESS_KEY_ID=_placeholder
S3FILESTORE__AWS_SECRET_ACCESS_KEY=_placeholder
S3FILESTORE__SIGNATURE_VERSION=s3v4

MDTRANSLATOR_URL=http://127.0.0.1:3000/translates
23 changes: 15 additions & 8 deletions .github/workflows/deploy.yml
@@ -1,32 +1,39 @@
name: Publish to PyPI
on:
# TODO: configure repo to and uncomment below
# pull_request:
# branches: [main]
# types: [closed]
pull_request:
branches: [main]
types: [closed]
workflow_dispatch:
inputs:
version_no:
description: 'Release Version:'
required: true

env:
POETRY_VERSION: "1.7.1"

jobs:
deploy:
name: Publish to PyPI
runs-on: ubuntu-latest
environment:
name: pypi
url: https://pypi.org/project/datagov-harvesting-logic/
if: github.event.pull_request.merged == true || github.event_name == 'workflow_dispatch'
steps:
- name: checkout
uses: actions/checkout@v3
- name: Update setup.py if manual release
if: github.event_name == 'workflow_dispatch'
run: |
# TODO update for pyproject.toml
sed -i "s/version='[0-9]\{1,2\}.[0-9]\{1,4\}.[0-9]\{1,4\}',/version='${{github.event.inputs.version_no}}',/g" setup.py
sed -i "s/version='[0-9]\{1,2\}.[0-9]\{1,4\}.[0-9]\{1,4\}',/version='${{github.event.inputs.version_no}}',/g" pyproject.toml
- name: Install Poetry
uses: abatilo/actions-poetry@v2
with:
poetry-version: ${{ env.POETRY_VERSION }}
- name: Create packages
run: |
python setup.py sdist
python setup.py bdist_wheel
poetry build --verbose
- name: pypi-publish
uses: pypa/gh-action-pypi-publish@v1.8.11
with:
4 changes: 2 additions & 2 deletions .github/workflows/test.yml
@@ -1,6 +1,6 @@
---
name: Tests on Commit
on: [push]
on: [push, pull_request]

env:
PY_VERSION: "3.10"
@@ -44,7 +44,7 @@ jobs:
run: docker-compose up -d

- name: Run Pytest
run: set -o pipefail; poetry run pytest --junitxml=pytest.xml --cov=harvester | tee pytest-coverage.txt
run: set -o pipefail; poetry run pytest --junitxml=pytest.xml --cov=harvester ./tests/unit | tee pytest-coverage.txt

- name: Report test coverage
uses: MishaKav/pytest-coverage-comment@main
6 changes: 4 additions & 2 deletions Makefile
@@ -11,7 +11,7 @@ clean-dist: ## Cleans dist dir
rm -rf dist/*

test: up ## Runs poetry tests, ignores ckan load
poetry run pytest --ignore=./tests/load/ckan
poetry run pytest --ignore=./tests/integration

up: ## Sets up local docker environment
docker compose up -d
@@ -22,8 +22,10 @@ down: ## Shuts down local docker instance
clean: ## Cleans docker images
docker compose down -v --remove-orphans

lint: ## Lints with ruff
lint: ## Lints with ruff, isort, black
ruff .
isort .
black .

# Output documentation for top-level targets
# Thanks to https://marmelab.com/blog/2016/02/29/auto-documented-makefile.html
73 changes: 62 additions & 11 deletions README.md
@@ -5,35 +5,86 @@ transformation, and loading into the data.gov catalog.

## Features

The datagov-harvesting-logic offers the following features:

- Extract
- general purpose fetching and downloading of web resources.
- catered extraction to the following data formats:
- General purpose fetching and downloading of web resources.
- Catered extraction to the following data formats:
- DCAT-US
- Validation
- DCAT-US
- jsonschema validation using draft 2020-12.
- `jsonschema` validation using draft 2020-12.
- Load
- DCAT-US
- conversion of dcatu-us catalog into ckan dataset schema
- create, delete, update, and patch of ckan package/dataset
- Conversion of dcat-us catalog into ckan dataset schema
- Create, delete, update, and patch of ckan package/dataset

## Requirements

This project is using poetry to manage this project. Install [here](https://python-poetry.org/docs/#installation).
This project uses `poetry` for dependency management. Install it [here](https://python-poetry.org/docs/#installation).

Once installed, `poetry install` installs dependencies into a local virtual environment.

## Testing

### CKAN load testing

- CKAN load testing doesn't require the services provided in the `docker-compose.yml`.
- [catalog-dev](https://catalog-dev.data.gov/) is used for ckan load testing.
  Create an api-key by signing into catalog-dev.
- Create a `credentials.py` file at the root of the project containing the variable `ckan_catalog_dev_api_key` assigned to the api-key.
- run tests with the command `poetry run pytest ./tests/load/ckan`
- Run tests with the command `poetry run pytest ./tests/load/ckan`

### Harvester testing
- These tests are found in `extract`, and `validate`. Some of them rely on services in the `docker-compose.yml`. run using docker `docker compose up -d` and with the command `poetry run pytest --ignore=./tests/load/ckan`.

- These tests are found in `extract` and `validate`. Some of them rely on services in the `docker-compose.yml`. Start those with `docker compose up -d`, then run `poetry run pytest --ignore=./tests/load/ckan`.

If you followed the instructions for `CKAN load testing` and `Harvester testing` you can simply run `poetry run pytest` to run all tests.

## Comparison

- `./tests/harvest_sources/ckan_datasets_resp.json`
- Represents what ckan would respond with after querying for the harvest source name
- `./tests/harvest_sources/dcatus_compare.json`
- Represents a changed harvest source
- Created:
- datasets[0]

```diff
+ "identifier" = "cftc-dc10"
```

- Deleted:
- datasets[0]

```diff
- "identifier" = "cftc-dc1"
```

- Updated:
- datasets[1]

```diff
- "modified": "R/P1M"
+ "modified": "R/P1M Update"
```

- datasets[2]

```diff
- "keyword": ["cotton on call", "cotton on-call"]
+ "keyword": ["cotton on call", "cotton on-call", "update keyword"]
```

- datasets[3]

```diff
"publisher": {
"name": "U.S. Commodity Futures Trading Commission",
"subOrganizationOf": {
- "name": "U.S. Government"
+ "name": "Changed Value"
}
}
```

- `./test/harvest_sources/dcatus.json`
  - Represents an original harvest source prior to change occurring.
9 changes: 9 additions & 0 deletions docker-compose.yml
@@ -1,6 +1,15 @@
version: "3"

services:
mdtranslator:
image: ghcr.io/gsa/mdtranslator:latest
ports:
- 3000:3000
healthcheck:
test: ["CMD", "curl", "-d", "{}", "-X", "POST", "http://localhost:3000/translates"]
interval: 10s
timeout: 10s
retries: 5
nginx-harvest-source:
image: nginx
volumes:
25 changes: 22 additions & 3 deletions harvester/__init__.py
@@ -1,16 +1,35 @@
# TODO: maybe turn off this ruff ignore?
# ruff: noqa: F405, F403

__all__ = ["compare", "extract", "load", "transform", "validate", "utils"]
__all__ = [
"compare",
"extract",
"traverse_waf",
"download_waf",
"load",
"create_ckan_package",
"update_ckan_package",
"patch_ckan_package",
"purge_ckan_package",
"dcatus_to_ckan",
"transform",
"validate",
"utils",
]

from dotenv import load_dotenv

# TODO these imports will need to be updated to ensure a consistent api
from .compare import compare
from .extract import extract
from .load import load
from .extract import download_waf, extract, traverse_waf
from .load import (create_ckan_package, dcatus_to_ckan, load,
patch_ckan_package, purge_ckan_package, update_ckan_package)
from .transform import transform
from .utils import *
from .validate import *

load_dotenv()

# configuration settings
bucket_name = "test-bucket"
content_types = {
19 changes: 16 additions & 3 deletions harvester/compare.py
@@ -3,9 +3,22 @@
logger = logging.getLogger("harvester")


# stub, TODO complete
def compare(compare_obj):
def compare(harvest_source, ckan_source):
"""Compares records"""
logger.info("Hello from harvester.compare()")

return compare_obj
output = {
"create": [],
"update": [],
"delete": [],
}

harvest_ids = set(harvest_source.keys())
ckan_ids = set(ckan_source.keys())
same_ids = harvest_ids & ckan_ids

output["create"] += list(harvest_ids - ckan_ids)
output["delete"] += list(ckan_ids - harvest_ids)
output["update"] += [i for i in same_ids if harvest_source[i] != ckan_source[i]]

return output
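
A minimal usage sketch of the set logic above, with illustrative records keyed by dataset identifier (the function body mirrors the diff):

```python
def compare(harvest_source, ckan_source):
    # Mirrors harvester/compare.py from this diff: ids only in the
    # harvest source are created, ids only in CKAN are deleted, and
    # shared ids whose content differs are updated.
    output = {"create": [], "update": [], "delete": []}
    harvest_ids = set(harvest_source.keys())
    ckan_ids = set(ckan_source.keys())
    same_ids = harvest_ids & ckan_ids
    output["create"] += list(harvest_ids - ckan_ids)
    output["delete"] += list(ckan_ids - harvest_ids)
    output["update"] += [i for i in same_ids if harvest_source[i] != ckan_source[i]]
    return output

# Illustrative records keyed by dataset identifier
harvest = {"cftc-dc10": {"modified": "R/P1M"}, "cftc-dc2": {"modified": "R/P1M Update"}}
ckan = {"cftc-dc1": {"modified": "R/P1M"}, "cftc-dc2": {"modified": "R/P1M"}}

result = compare(harvest, ckan)
# result["create"] == ["cftc-dc10"]
# result["delete"] == ["cftc-dc1"]
# result["update"] == ["cftc-dc2"]
```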
6 changes: 6 additions & 0 deletions harvester/extract.py
@@ -29,6 +29,9 @@ def download_dcatus_catalog(url):


def traverse_waf(url, files=[], file_ext=".xml", folder="/", filters=[]):
"""Transverses WAF
Please add docstrings
"""
# TODO: add exception handling
parent = os.path.dirname(url.rstrip("/"))

@@ -56,6 +59,9 @@ def traverse_waf(url, files=[], file_ext=".xml", folder="/", filters=[]):


def download_waf(files):
"""Downloads WAF
Please add docstrings
"""
output = []
for file in files:
data = {}
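
The bodies of `traverse_waf` and `download_waf` are elided in this diff. As an illustrative, simplified sketch of the WAF-crawling idea only (not the project's implementation), the link-filtering step might look like:

```python
import re

# Made-up directory-listing snippet standing in for a fetched WAF page;
# the real traverse_waf fetches pages over HTTP and recurses, which
# this offline sketch skips.
listing_html = (
    '<a href="file1.xml">file1.xml</a>'
    '<a href="notes.txt">notes.txt</a>'
    '<a href="file2.xml">file2.xml</a>'
)

def extract_links(html, file_ext=".xml"):
    # Keep only href targets with the desired extension
    return [href for href in re.findall(r'href="([^"]+)"', html) if href.endswith(file_ext)]

links = extract_links(listing_html)
# links == ["file1.xml", "file2.xml"]
```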

1 comment on commit 4c16eb9

@github-actions

Coverage Report
| File | Stmts | Miss | Cover |
|------|-------|------|-------|
| `harvester/__init__.py` | 12 | 0 | 100% |
| `harvester/compare.py` | 12 | 0 | 100% |
| `harvester/extract.py` | 48 | 7 | 85% |
| `harvester/load.py` | 100 | 10 | 90% |
| `harvester/transform.py` | 13 | 7 | 46% |
| `harvester/utils/__init__.py` | 2 | 0 | 100% |
| `harvester/utils/json.py` | 4 | 0 | 100% |
| `harvester/utils/util.py` | 7 | 0 | 100% |
| `harvester/validate/__init__.py` | 2 | 0 | 100% |
| `harvester/validate/dcat_us.py` | 24 | 3 | 88% |
| **TOTAL** | 224 | 27 | 88% |

| Tests | Skipped | Failures | Errors | Time |
|-------|---------|----------|--------|------|
| 28 | 0 💤 | 0 ❌ | 0 🔥 | 17.354s ⏱️ |
