Merge pull request #48 from digital-land/feat/eleventy
Feat/eleventy
eveleighoj authored Sep 30, 2024
2 parents 1c68d13 + 6613507 commit b50781f
Showing 101 changed files with 1,757 additions and 428 deletions.
90 changes: 67 additions & 23 deletions .github/workflows/deploy-documentation.yml
@@ -1,33 +1,77 @@
name: Deploy Documentation

# on:
# push:
# branches:
# - main

# jobs:
# build-and-deploy:
# runs-on: ubuntu-latest
# steps:
# - uses: actions/checkout@v4
# with:
# fetch-depth: 0
# - uses: ruby/setup-ruby@v1
# with:
# ruby-version: '3.1.2'
# - uses: actions/setup-java@v3
# with:
# distribution: 'adopt-openj9'
# java-version: '17'
# - run: sudo apt-get update && sudo apt-get install -y graphviz
# - run: gem install middleman
# - run: bundle install
# - run: git worktree add -B gh-pages build origin/gh-pages
# - run: make build
# - run: |
# git add .
# git config --global user.name "digital-land-bot"
# git config --global user.email "digitalland@communities.gov.uk"
# git commit -m "Publishing changes"
# git push
# working-directory: build

name: deploy

on:
push:
branches:
- main

# Set permissions of GITHUB_TOKEN
permissions:
contents: read
pages: write
id-token: write

# Allow one concurrent deployment
concurrency:
group: pages
cancel-in-progress: true

jobs:
build-and-deploy:
build:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Setup Pages
uses: actions/configure-pages@v5
- name: Install dependencies
run: npm ci
- name: Build with Eleventy
run: make build
- name: Upload artifact
uses: actions/upload-pages-artifact@v3

deploy:
environment:
name: github-pages
url: ${{ steps.deployment.outputs.page_url }}
runs-on: ubuntu-latest
needs: build
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- uses: ruby/setup-ruby@v1
with:
ruby-version: '3.1.2'
- uses: actions/setup-java@v3
with:
distribution: 'adopt-openj9'
java-version: '17'
- run: sudo apt-get update && sudo apt-get install -y graphviz
- run: gem install middleman
- run: bundle install
- run: git worktree add -B gh-pages build origin/gh-pages
- run: make build
- run: |
git add .
git config --global user.name "digital-land-bot"
git config --global user.email "digitalland@communities.gov.uk"
git commit -m "Publishing changes"
git push
working-directory: build
- name: Deploy to GitHub Pages
id: deployment
uses: actions/deploy-pages@v4
30 changes: 19 additions & 11 deletions Makefile
@@ -1,21 +1,29 @@
SHELL := bash
PLANTUML_VERSION := $(shell curl --silent "https://api.github.com/repos/plantuml/plantuml/releases/latest" | jq -rc '.name | .[1:]')
.PHONY: init clean

init: .bin/plantuml.jar
init:
npm install

.bin:
@mkdir -p .bin/
# old commands for plantuml that could be useful in the future
# .bin:
# @mkdir -p .bin/

# .bin/plantuml.jar: .bin
# @echo "Downloading version $(PLANTUML_VERSION) of plantuml"
# @curl -sL -o .bin/plantuml.jar "https://github.com/plantuml/plantuml/releases/download/v$(PLANTUML_VERSION)/plantuml-$(PLANTUML_VERSION).jar"

# PLANTUML_VERSION := $(shell curl --silent "https://api.github.com/repos/plantuml/plantuml/releases/latest" | jq -rc '.name | .[1:]')

.bin/plantuml.jar: .bin
@echo "Downloading version $(PLANTUML_VERSION) of plantuml"
@curl -sL -o .bin/plantuml.jar "https://github.com/plantuml/plantuml/releases/download/v$(PLANTUML_VERSION)/plantuml-$(PLANTUML_VERSION).jar"

clean:
@rm -rf .bin/

dev: init
@bundle exec middleman server
serve:
npx eleventy --serve

build:
npx eleventy




build: init
@bundle exec middleman build --verbose
6 changes: 5 additions & 1 deletion README.md
@@ -6,9 +6,13 @@ Technical Documentation for the planning data service.

### [Live Documentation](https://digital-land.github.io/technical-documentation)

This project uses the [Tech Docs Template][template], which is a [Middleman template][mmt] that you can use to build
This project used to use the [Tech Docs Template][template], which is a [Middleman template][mmt] that you can use to build
technical documentation using a GOV.UK style.

But we found that using the template required knowledge of Ruby, which isn't a requirement of our project and made it difficult for all members of the team to contribute, so we switched to the [X-GOVUK Eleventy Plugin](https://x-govuk.github.io/govuk-eleventy-plugin/).

We have customised the layout so that edits don't need any special knowledge beyond creating and editing markdown files. To make a change, simply edit the relevant document or create a new one in the docs directory and the sidenav will update automatically!

[mit]: LICENCE
[copyright]: http://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/
[mmt]: https://middlemanapp.com/advanced/project_templates/
File renamed without changes.
87 changes: 87 additions & 0 deletions archive/003-data-quality-framework.md
@@ -0,0 +1,87 @@
Author(s) - Owen Eveleigh
## Introduction

![image showing a high-level version of our data workflow with two areas highlighted where checkpoints could go](https://github.com/digital-land/digital-land/blob/main/images/high_level_architecture_v2_with_checkpoints.drawio.png)

We currently only record data quality issues during the pipeline section of our data workflow. This is shown in the issue log box in the above diagram. While this is powerful, as it can fully explain the transformations applied to a data point, it doesn't provide any framework for more checkpoint-style validations. I have highlighted two key areas above where additional validations have been requested:
- On incoming resources from provider systems, further left in the data workflow
- On the dataset sqlite files, before they are made available to the public/platform

This ODP outlines how a new Python module focused on data validations, called expectations, can apply expectations at two specific checkpoints (a rough sketch follows this list):

- Converted resource - enables us to run expectations on an individual resource so that possible errors/warnings can be communicated back to providers (we do minimal processing first to establish a common tabular representation)
- Dataset - enables us to run internal expectations to see if anything is wrong at the end of dataset creation, adding a layer of protection against putting incorrect data into the public domain.
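
As a rough illustration of the intended shape, the two checkpoints could share a common result type so the rest of the workflow can treat them uniformly. All names below are hypothetical sketches, not the existing module's API:

```python
from dataclasses import dataclass, field

# Hypothetical sketch only: these names are illustrative, not the real module's API.

@dataclass
class ExpectationIssue:
    scope: str       # e.g. "value", "row" or "dataset"
    message: str
    severity: str    # e.g. "warning" or "error"


@dataclass
class CheckpointResult:
    checkpoint: str
    passed: bool
    issues: list = field(default_factory=list)


def run_converted_resource_checkpoint(csv_path: str) -> CheckpointResult:
    """Run provider-facing expectations against a converted (tabular) resource."""
    raise NotImplementedError


def run_dataset_checkpoint(sqlite_path: str) -> CheckpointResult:
    """Run internal expectations against a built dataset before it is published."""
    raise NotImplementedError
```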

## Status

Open

* Draft: proposal is still being authored and is not officially open for comment yet
* Open: proposal is open for comment
* Closed: proposal is closed for comment with implementation expected
* On Hold: proposal is on hold due to concerns raised/project changes with implementation not expected

## Detail

### Section 1: Current Data Quality Issues Raised
![our current pipeline setup, showing that some but not all phases have access to the issue log, and that most of them access the data row by row](https://github.com/digital-land/digital-land/blob/main/images/current-pipeline.drawio.png)

The above [picture](https://github.com/digital-land/digital-land/blob/main/images/current-pipeline.drawio.png) shows how, throughout the current pipeline step of our data workflow, we regularly record issues to an issue log. At the time of writing this is primarily focused on recording issues only when either:

- a value is transformed/changed when we believe we can fix/improve the quality of the data, e.g. we convert the CRS
- a data point/value is removed because we believe the data is incorrect, e.g. a geometry is outside of an appropriate bounding box

There are a few limitations associated with this, and it may not be capable of handling the requirements of the providers and management teams. For example:
- It probably isn't an appropriate place to record problems in the data that aren't fixed or removed, for example missing fields. It could raise these as warnings, but that would imply it is making a transformation.
- If a validation/check needs access to multiple rows (like duplication checks), then right now the pipeline only accesses data row by row.
- What if there is a critical error with the data and no further processing should be done?
- What if you wanted to raise more speculative problems rather than taking an action?

I believe that certain types of validations should have a more formal framework to register what checks should be completed against which data at what stage. For example, there should be a checkpoint at the converted resource stage (see the above diagram). This would allow us to communicate problems with a resource back to providers if there are elements missing from their data.

Processing issues should still be recorded, but only when changes are being made. Together, the processing issues and these new validation issues let us easily communicate problems back to publishers.

We have decided to name these new validations expectations. This new framework should be expandable not just to the example in the pipeline stage above but also to the sqlite dataset files we create in the dataset phase.

### Section 2: Expectations Module

This is where we need to identify or produce a framework for these additional data validation tests. The word framework is used here because, wherever these checks/tests are applied, it would be good to have similarly meaningful outputs along with commands to make them runnable in multiple environments.

After looking around, we need something very specific. The Great Expectations Python package seemed like it would be useful, but after looking at the required set-up and how difficult it is to write custom tests it seemed impractical. Hence work has been done to create our own version with similar ideals but much more customisable. We should review this regularly though, as switching may reduce the maintenance burden in the future.

The main aim of the module is that we create checkpoints which take in a set of expected inputs (including data, probably through a file path) and run a set of expectations against that data. The responses to these will be recorded as expectation responses, and if problems are found they will be raised as expectation issues.

We will need to add support for outputting expectation issues and reformat the current expectation responses.
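
A minimal sketch of that flow, assuming a checkpoint is simply a named list of expectation callables and each expectation returns a response carrying any issues it found; the names and fields here are assumptions, not the final design:

```python
from dataclasses import dataclass, field


@dataclass
class ExpectationResponse:
    expectation: str
    passed: bool
    message: str = ""
    issues: list = field(default_factory=list)   # populated when the expectation fails


@dataclass
class Checkpoint:
    name: str
    expectations: list   # callables taking the data and returning an ExpectationResponse

    def run(self, data) -> list:
        """Run every expectation against the data and collect the responses."""
        return [expectation(data) for expectation in self.expectations]


def expect_no_duplicate_references(rows) -> ExpectationResponse:
    """Example expectation: every 'reference' value should be unique."""
    references = [row.get("reference") for row in rows]
    duplicates = sorted({r for r in references if references.count(r) > 1})
    return ExpectationResponse(
        expectation="no duplicate references",
        passed=not duplicates,
        message=f"duplicate references: {duplicates}" if duplicates else "",
        issues=[{"scope": "dataset", "value": r} for r in duplicates],
    )


rows = [{"reference": "A1"}, {"reference": "A2"}, {"reference": "A1"}]
for response in Checkpoint("dataset", [expect_no_duplicate_references]).run(rows):
    print(response.passed, response.message)   # False duplicate references: ['A1']
```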


### Section 3: Changes to be made and order

The above is difficult to apply all at once and will be of different levels of interest to different parties. I suggest the following order:

1. Code up and test expectations work and apply dataset checkpoint
2. Apply converted resource checkpoint

#### 1: Code up and test expectations work

![diagram showing how the classes connect to data models](https://github.com/digital-land/digital-land/blob/main/images/Data_Issues.drawio.png)

The changes required of the expectations module:
- implement issue data classes for each issue scope, as in the above diagram (it's more of a sketch than every field)
- update the response model to link to issues
- remove the suite and just build checkpoints; they could in theory load from YAML files, but that isn't needed at the minute so just use Python lists
- update so that the dataset checkpoint works and is run overnight
- incorporate the dataset data into digital-land-builder so that it's available on Datasette

Use the dataset checkpoint as a starting point: build a base checkpoint with core functionality, then delegate loading of expectations to the checkpoint class.
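
A sketch of that base/derived split, where the base checkpoint owns the run and save logic and the dataset checkpoint supplies its expectations as a plain Python list; the class and function names are assumptions for illustration, not the existing codebase:

```python
import json


class BaseCheckpoint:
    checkpoint = "base"

    def expectations(self) -> list:
        """Concrete checkpoints return their expectations as a plain Python list."""
        return []

    def run(self, data) -> list:
        return [expectation(data) for expectation in self.expectations()]

    def save(self, responses, path: str) -> None:
        """Write responses (and any issues they carry) out for later loading."""
        with open(path, "w") as f:
            json.dump(responses, f, indent=2)


class DatasetCheckpoint(BaseCheckpoint):
    checkpoint = "dataset"

    def expectations(self) -> list:
        # hypothetical whole-dataset checks, defined in code rather than YAML
        return [
            lambda rows: {"expectation": "has rows", "passed": bool(rows), "issues": []},
            lambda rows: {
                "expectation": "entities have a reference",
                "passed": all(row.get("reference") for row in rows),
                "issues": [
                    {"scope": "row", "row": i, "message": "missing reference"}
                    for i, row in enumerate(rows)
                    if not row.get("reference")
                ],
            },
        ]


responses = DatasetCheckpoint().run([{"reference": "A1"}, {"reference": ""}])
print(json.dumps(responses, indent=2))
```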

#### 2: Apply converted resource checkpoint

![diagram with altered flow for additional checkpoint](https://github.com/digital-land/digital-land/blob/main/images/add-converted-resource-check-point.drawio.png)

We will need to run expectations on the CSV that is a converted version of the provider's data. This will allow us to run checks and raise problems that can be directly connected to the data they provided. These checks would be run in both the pipeline and the check tool.

As you can see from the above, I think it will be worth altering the pipeline for this checkpoint. Right now the mapping phase takes place after the file is converted and read in row by row. This checkpoint needs to take place before the streaming begins, both so that the pipeline can be stopped if checks fail with an error and so that the checkpoint has access to the entire file. To write consistent checks that take into account the changed column names, it would be best for the mapping phase to happen before the checks are run.

Also, the mapping phase is repeated for every row right now; given that it will be the same for every row, it makes sense to do it in one step. We can then use the column_field mapping to translate between columns and fields whenever we need to refer to the original data.
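
A sketch of that reordering, assuming the converted CSV is loaded whole, the column-to-field mapping is applied once, and a failing critical expectation stops processing before any streaming starts; none of this mirrors the real pipeline code:

```python
import csv
from dataclasses import dataclass


@dataclass
class ExpectationResponse:
    expectation: str
    passed: bool
    severity: str = "warning"   # "error" means no further processing should happen


class CriticalExpectationError(Exception):
    """Raised when a failed expectation means the pipeline should stop."""


def load_converted_resource(path: str, column_field: dict) -> list:
    """Read the whole converted CSV and apply the column -> field mapping in one step."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    return [
        {column_field.get(column, column): value for column, value in row.items()}
        for row in rows
    ]


def run_converted_resource_checkpoint(rows: list, expectations: list) -> list:
    """Run whole-file expectations before the row-by-row phases begin."""
    responses = [expectation(rows) for expectation in expectations]
    if any(not response.passed and response.severity == "error" for response in responses):
        raise CriticalExpectationError("resource failed a critical expectation")
    return responses
```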

## Design Comments
46 changes: 46 additions & 0 deletions archive/004‐handling-of-empty-fields.md
@@ -0,0 +1,46 @@
Author(s) - [Chris Johns](mailto:cjohns@scottlogic.com)

## Introduction

The platform currently removes empty fields from the data during processing. This is usually, but not always, the required behavior.

A scenario where this isn't the required behavior is when a later resource has a blank end-date. See [This Ticket](https://trello.com/c/xtDuvX0z/1347-bug-nullable-fields-cannot-be-updated-to-blank).

## Status

Open

* Draft: proposal is still being authored and is not officially open for comment yet
* Open: proposal is open for comment
* Closed: proposal is closed for comment with implementation expected
* On Hold: proposal is on hold due to concerns raised/project changes with implementation not expected

## Detail

### The difference between blank and missing data

One of the distinctions that needs to be made is between data that is not provided, and data that is provided as blank. An example for a CSV source is not having a column, vs having a column with an empty field.

In addition, we have some fields which can be expected to be empty - such as the end-date in the above example.

### Nullable fields

In order to accommodate fields that must be present but may be blank, the specification needs to be extended to reflect this. This can be done by adding an additional 'nullable' attribute to the field. If this is set to true then the field can contain blank values. If set to false (or not present) the field cannot contain blank values.

Blank values in a non-nullable field should be considered an issue.
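
A sketch of how a nullable attribute could drive that check; the specification layout and field choices below are simplified assumptions for illustration only:

```python
# Assumed, simplified view of a field specification with the proposed attribute.
FIELD_SPEC = {
    "reference":  {"nullable": False},
    "name":       {"nullable": False},
    "start-date": {"nullable": False},
    "end-date":   {"nullable": True},   # must be present, but may be blank
}


def blank_value_issues(row: dict) -> list:
    """Return an issue for every blank value that sits in a non-nullable field."""
    issues = []
    for field_name, value in row.items():
        nullable = FIELD_SPEC.get(field_name, {}).get("nullable", False)
        if value == "" and not nullable:
            issues.append({"field": field_name, "issue-type": "blank value in non-nullable field"})
    return issues


row = {"reference": "A1", "name": "", "start-date": "2025-01-01", "end-date": ""}
print(blank_value_issues(row))
# only "name" is reported; the blank end-date is allowed because end-date is nullable
```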

### Processing empty fields in the pipeline

Currently, the pipeline will remove any empty fields from the facts (done in the `FactPrunePhase`). This phase needs to keep these fields in. In addition, the dataset builder package excludes empty 'facts' when building the entities.

### Nullable fields in the pipeline

Currently the `HarmonisePhase` will check for mandatory fields in a hard-coded list, and generate an issue if they are missing or blank. This would make it a good candidate to also check for nullability. Longer term, the check for mandatory fields should move to being data-driven (and most likely get a better name). The mandatory fields do (currently) vary between collections, which may impact this (or result in a standard set). This aspect is outside the scope of this proposal.

### Updating to blank

The root cause of the above bug appears to be that a later resource is not correctly updating the end-date to be blank. Not stripping the blank facts is a pre-requisite of this, but the dataset generation code will also require updating. This behavior needs to be updated even in the case where the field is not nullable. If the data we are given is blank, that is what we should reflect. If it SHOULDN'T be blank then an issue should be raised.
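
A toy illustration of the intended behaviour (not the real dataset builder): facts are applied in the order the resources arrived, and a later blank end-date overwrites the earlier value instead of being pruned away:

```python
# Apply facts in resource order, keeping blank values so that a later blank
# end-date clears the earlier one rather than being stripped out.
facts = [
    # (resource order, field, value)
    (1, "name", "Conservation area A"),
    (1, "end-date", "2024-01-31"),
    (2, "end-date", ""),   # later resource blanks the end-date
]

entity = {}
for _, field_name, value in sorted(facts, key=lambda fact: fact[0]):
    entity[field_name] = value   # keep blanks instead of pruning them

print(entity)   # {'name': 'Conservation area A', 'end-date': ''}
```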

## Scenario List

![image](https://github.com/digital-land/digital-land/assets/95475146/abb9b8fa-c714-4bd4-b405-67f76a05c520)