Merge pull request #50 from digital-land/Data_Management_Docs
add data operational manual docs
Showing 24 changed files with 1,173 additions and 6 deletions.
1 change: 1 addition & 0 deletions
docs/data-operations-manual/Explanation/Key-Concepts/Data-quality-needs.md
You can find a Google Sheet with all the data quality needs [here](https://docs.google.com/spreadsheets/d/1kMAKOAm6Wam-AJb6R0KU-vzdvRCmLVN7PbbTAUh9Sa0/edit?gid=2142834080#gid=2142834080).
25 changes: 25 additions & 0 deletions
docs/data-operations-manual/Explanation/Key-Concepts/Organisation-and-provision.md
## Organisation

We maintain a list of organisations that supply data to our platform. These organisations are categorised into various types, such as development corporations, government organisations, local authorities, and more. Each data source added to our platform must be linked to the organisation that provided the data.

The [organisation](https://datasette.planning.data.gov.uk/digital-land/organisation) table includes key details for each organisation, such as the organisation name, website, start date and end date.

## Provision

The [Provision](https://datasette.planning.data.gov.uk/digital-land/provision?_sort=rowid) table identifies the organisations from which we expect data for a given dataset, i.e. the expected publishers for that dataset.

It contains key information such as:

* provision_reason: the reason why an organisation is expected to provide a particular dataset.
* provision_rule: the rules governing the data we expect each organisation to supply. These rules are used to generate the final provision dataset.

## Provision Rule

The [Provision Rule](https://datasette.planning.data.gov.uk/digital-land/provision_rule) table contains two key fields, **project** and **role**, which are used to identify the organisations expected to provide data.

For example, in the case of the Article 4 Direction dataset:

* The role is set to local-planning-authority. Organisations linked to this role are stored in the [Role Organisation](https://datasette.planning.data.gov.uk/digital-land/role_organisation?role=local-planning-authority) table, and all organisations associated with this role are added as expected providers for the dataset.
* Additionally, the dataset is associated with the project Open Digital Planning. Organisations linked to this project, found in the [Project Organisation](https://datasette.planning.data.gov.uk/digital-land/project_organisation?project=open-digital-planning) table, are also added as expected providers for this dataset.

This process ensures that all relevant organisations, based on their roles and project affiliations, are accurately associated with each dataset.
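
To make the role and project expansion concrete, the sketch below uses the datasette JSON API to list the organisations expected to provide the article-4-direction dataset. It is a minimal illustration rather than how the provision dataset is actually built, and the column names (`role`, `project`, `organisation`) are assumptions based on the tables linked above.

```
# Minimal sketch: list the organisations expected to provide
# article-4-direction by expanding its provision rules via the datasette
# JSON API. Column names are assumptions based on the tables above; the
# real provision dataset is generated by the pipeline, not this script.
import requests

DATASETTE = "https://datasette.planning.data.gov.uk/digital-land"

def rows(table, **filters):
    """Fetch all rows of a datasette table matching the given column filters."""
    params = {"_shape": "array", "_size": "max", **filters}
    response = requests.get(f"{DATASETTE}/{table}.json", params=params, timeout=30)
    response.raise_for_status()
    return response.json()

expected = set()
for rule in rows("provision_rule", dataset="article-4-direction"):
    # Organisations linked to the rule's role...
    if rule.get("role"):
        expected.update(org["organisation"] for org in rows("role_organisation", role=rule["role"]))
    # ...and organisations linked to the rule's project.
    if rule.get("project"):
        expected.update(org["organisation"] for org in rows("project_organisation", project=rule["project"]))

print(f"{len(expected)} organisations expected for article-4-direction")
```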
19 changes: 19 additions & 0 deletions
docs/data-operations-manual/Explanation/Key-Concepts/Pipeline-configuration.md
## Pipeline configuration

Configuration files control where data is collected from and how data is transformed from resources into the fact and entity model described above. Each collection has its own set of configuration files organised into two folders: `collection` and `pipeline`.

Each configuration file in `pipeline` can be used to apply configuration to a particular dataset, endpoint, or resource in the collection by using the dataset name or endpoint/resource hash values in the corresponding fields of the configuration file. The `endpoint` and `resource` fields can be left blank to apply a configuration to all resources in a dataset (useful for setting default configuration), or just the `resource` field can be left blank to apply a configuration to all resources from an endpoint.

For example, this line in the listed-building column.csv would apply a column re-mapping from `ListEntry` to `reference` for all endpoints and resources in the `listed-building` collection:

```
dataset,endpoint,resource,column,field,start-date,end-date,entry-date
listed-building,,,ListEntry,reference,,,
```

> [!NOTE]
> In most cases we prefer configuration to be made against an endpoint rather than a resource, as this means that the configuration persists when an endpoint is updated and a new resource is created.

For more information on the configuration options, read [How to configure an endpoint](../../How-To-Guides/Adding/Configure-an-endpoint).
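
As an illustration of how blank fields act as wildcards, the sketch below shows one way a row like the example above could be matched against an incoming resource. It is a simplified model of the matching logic, not the pipeline's actual implementation.

```
# Simplified illustration of wildcard matching in pipeline configuration:
# a blank endpoint or resource field means the row applies to everything
# in the dataset. This is a sketch, not the digital-land pipeline code.
import csv
from io import StringIO

COLUMN_CSV = """dataset,endpoint,resource,column,field,start-date,end-date,entry-date
listed-building,,,ListEntry,reference,,,
"""

def column_mapping(dataset, endpoint, resource):
    """Return {source column -> field} for a given dataset/endpoint/resource."""
    mapping = {}
    for row in csv.DictReader(StringIO(COLUMN_CSV)):
        if row["dataset"] != dataset:
            continue
        # Blank endpoint/resource fields match any endpoint/resource.
        if row["endpoint"] and row["endpoint"] != endpoint:
            continue
        if row["resource"] and row["resource"] != resource:
            continue
        mapping[row["column"]] = row["field"]
    return mapping

# The listed-building rule applies whatever the endpoint or resource hash is.
print(column_mapping("listed-building", "anyendpointhash", "anyresourcehash"))
# {'ListEntry': 'reference'}
```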
11 changes: 11 additions & 0 deletions
docs/data-operations-manual/Explanation/Key-Concepts/Specification.md
## Specification Repo

The [specification](https://github.com/digital-land/specification) repository contains the information regarding fields, tables, and other essential configurations.

Within the [datasets](https://github.com/digital-land/specification/tree/main/content/dataset) folder, you'll find a list of all available datasets, each accompanied by key details, including:

* Collection: The collection to which the dataset belongs.
* Entity Minimum-Maximum: The range of entity numbers allocated for use with this dataset, used when assigning entities to it.
* Fields: The specific fields that publishers can provide for this dataset.
* [Licence](https://datasette.planning.data.gov.uk/digital-land/licence): The type of licence governing the use of the dataset.
* [Typology](https://github.com/digital-land/specification/blob/40c777610a8e292145635ff875203145ee5f1e49/specification/typology.csv): The classification or typology of the dataset.
57 changes: 57 additions & 0 deletions
docs/data-operations-manual/Explanation/Key-Concepts/pipeline-processes.md
## Pipeline process / data model

See the [about section of the planning.data website](https://www.planning.data.gov.uk/about/) to learn more about the website and programme objectives:

*“Our platform collects planning and housing data from local planning authorities (LPAs) and transforms it into a consistent state, across England. Anyone can view, download and analyse the data we hold.”*

We ask Local Planning Authorities (LPAs) to publish open data on their website in the form of an accessible URL, or API endpoint. These URLs are called **endpoints**.

The system used to take data from endpoints and process it into a consistent format is called the **pipeline**. The pipeline is able to collect data hosted in many different formats, identify common quality issues with data (and in some cases resolve them), and transform data into a consistent state to be presented on the website.

For more detail on how the pipeline works, see the [documentation here](https://github.com/digital-land/digital-land/wiki/Historic-Documentation#run-the-pipeline-to-make-the-dataset).

Data is organised into separate **datasets**, each of which may consist of data collected from one or many endpoints. Datasets might be referred to as either **compiled** or **national** based on how data for them is provided. For example, the [article-4-direction-area dataset](https://www.planning.data.gov.uk/dataset/article-4-direction-area) has many providers, as we collect data from LPAs to add to this dataset, and is therefore a compiled dataset. The [agricultural-land-classification dataset](https://www.planning.data.gov.uk/dataset/agricultural-land-classification), on the other hand, has just one provider, as it is a dataset with national coverage published by Natural England, and is therefore a national dataset.

Each dataset is organised into separate **collections**, which are groups of datasets collected together based on their similarity. For example, the `conservation-area-collection` is the home for the `conservation-area` and `conservation-area-document` datasets. There are a few key components to collections, which are outlined below using the conservation-area-collection as an example:

* The collection repo (note the “-collection” after the name): [https://github.com/digital-land/conservation-area-collection/](https://github.com/digital-land/conservation-area-collection/). This is the repo used to build the collection data, and it is triggered each night by a GitHub workflow.
* The collection and pipeline configuration files, which store configuration data controlling how data feeding into the collection is processed (see the [section below](#pipeline-configuration) for more detail):
  * [https://github.com/digital-land/config/tree/main/collection/conservation-area](https://github.com/digital-land/config/tree/main/collection/conservation-area)
  * [https://github.com/digital-land/config/tree/main/pipeline/conservation-area](https://github.com/digital-land/config/tree/main/pipeline/conservation-area)

The data management team is responsible for adding data to the platform and maintaining it once it’s there; see [the list of team responsibilities](https://docs.google.com/document/d/1PoAUktKj80qOTvI4BB3qZkZdwpiGEq_woEfrIdwg2Ac/edit#heading=h.aoi2nezcsd1h) in the Planning Data Service Handbook.

## Resources

Once an endpoint is added to our data processing pipeline, it is checked each night for the latest data. When an endpoint is added for the first time we take a copy of the data; this unique copy is referred to as a **resource**. If the pipeline detects any changes in the data, no matter how small, we save a new version of the entire dataset, creating a new resource. Each separate resource is given a unique reference which we can use to identify it.
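
As a rough illustration of the idea, the sketch below derives a resource reference from a hash of the downloaded content, so that any change to the data, however small, produces a new reference. Treating the reference as a SHA-256 content digest is an assumption based on the hash-like references that appear in the facts example below; the real collector records much more (logs, entry dates and so on).

```
# Sketch of resource change detection: hash the downloaded content and
# only record a new resource when the content differs from anything seen
# before. The SHA-256 reference is an assumption for illustration.
import hashlib
import requests

def collect(endpoint_url, known_references):
    """Return a new resource reference if the endpoint's data has changed, else None."""
    content = requests.get(endpoint_url, timeout=60).content
    reference = hashlib.sha256(content).hexdigest()
    if reference in known_references:
        return None  # no change since the last collection, no new resource
    # Any change, no matter how small, yields a different hash and a new resource.
    with open(f"{reference}.data", "wb") as f:
        f.write(content)
    known_references.add(reference)
    return reference
```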
## Facts

The data from each resource is saved as a series of facts. If we imagine a resource as a table of data, then each combination of entry (row) and field (column) generates a separate **fact**: a record of the value for that entry and field. For example, if a table has a field called “reference”, and the value of that field for the first entry is “Ar4.28”, we record the name of the field and its value along with a unique reference for this fact. You can see how this appears in our system [here](https://datasette.planning.data.gov.uk/article-4-direction-area?sql=select+fr.resource%2C+f.fact%2C+f.entity%2C+f.field%2C+f.value%0D%0Afrom+fact_resource+fr%0D%0Ainner+join+fact+f+on+fr.fact+%3D+f.fact%0D%0Awhere+%0D%0A+++resource+%3D+%22684deb1f613f6e74e31858176704c33c4437996c60210975c27be5f0c82b4057%22%0D%0A+++and+field+%3D+%22reference%22%0D%0A+++and+value+%3D+%22Ar4.28%22%0D%0A%0D%0A).

So a table with 10 rows and 10 columns would generate 100 facts. And each time data changes on an endpoint, all of the facts for the new resource are recorded again, including any new facts. We can use these records to trace back through the history of data from an endpoint.

A fact has the following attributes (a sketch of how facts are generated follows this list):

* `fact` - UUID, primary key on the `fact` table in the database
* `entity` - optional, numeric ID, the `entity` to which the fact applies
* `start-date` - optional, date at which the fact begins to apply (not the date at which the fact is created within the data platform)
* `end-date` - optional, date at which the fact ceases to apply
* `entry-date` - optional, date at which the fact was first collected
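
The sketch below shows how an entry/field grid turns into fact records. The field names and the use of a UUID for the fact reference are illustrative; in practice facts are linked back to the resources they came from through the `fact_resource` table used in the query above.

```
# Illustration of fact generation: one fact per entry (row) and field
# (column) combination in a resource. Field names are made up; the UUID
# stands in for whatever unique fact reference the pipeline assigns.
import csv
import uuid
from io import StringIO

RESOURCE = """reference,name,start-date
Ar4.28,Example article 4 direction area,2019-01-01
Ar4.29,Another area,2020-06-15
"""

facts = []
for entry_number, row in enumerate(csv.DictReader(StringIO(RESOURCE)), start=1):
    for field, value in row.items():
        facts.append({
            "fact": str(uuid.uuid4()),   # unique reference for this fact
            "entry-number": entry_number,
            "field": field,
            "value": value,
        })

print(len(facts))  # 2 rows x 3 columns -> 6 facts (a 10 x 10 table would give 100)
```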
## Entities

An entity is the basic unit of data within the platform. It can take on one of many types [defined by `digital-land/specification/typology.csv`](https://github.com/digital-land/specification/blob/40c777610a8e292145635ff875203145ee5f1e49/specification/typology.csv). An entity has the following attributes:

* `entity` - incrementing numeric ID, manually assigned on ingest; different numeric ranges represent different datasets; primary key on the `entity` table in the SQLite and PostGIS databases
* `start-date` - optional, date at which the entity comes into existence (not the date at which the entity is created within the data platform)
* `end-date` - optional, date at which the entity ceases to exist
* `entry-date` - optional, date at which the entity was first collected
* `dataset` - optional, name of the `dataset` (which should correspond to the [`dataset` field in `digital-land/specification/dataset.csv`](https://github.com/digital-land/specification/blob/40c777610a8e292145635ff875203145ee5f1e49/specification/dataset.csv)) to which the entity belongs
* `geojson` - optional, a JSON object conforming to the [RFC 7946 specification](https://datatracker.ietf.org/doc/html/rfc7946) which specifies the geographical bounds of the entity
* `typology` - optional, the type of the entity, which should correspond to [the `typology` field in `digital-land/specification/typology.csv`](https://github.com/digital-land/specification/blob/40c777610a8e292145635ff875203145ee5f1e49/specification/typology.csv)
* `json` - optional, a JSON object containing metadata relating to the entity

Facts that are collected from resources are assigned to entities based on a combination of the reference of the record in the resource, the organisation that provided the resource and the dataset it belongs to (*needs more clarification, or a link out to more detail somewhere*).

So as well as the default attributes above, an [entity in the article-4-direction-area dataset](https://www.planning.data.gov.uk/entity/5010000101) can also have attributes like `permitted-development-rights` and `notes`.
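
For illustration, an entity record with the attributes above might look like the following. The values are invented for the example; the real entity 5010000101 can be inspected via the platform link above.

```
# An illustrative entity record (values invented for the example).
entity = {
    "entity": 5010000101,          # numeric ID from the range allocated to the dataset
    "dataset": "article-4-direction-area",
    "typology": "geography",
    "entry-date": "2021-11-05",
    "start-date": "1972-04-27",
    "end-date": None,
    "geojson": {
        "type": "Feature",
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[-0.12, 51.50], [-0.11, 51.50], [-0.11, 51.51], [-0.12, 51.50]]],
        },
    },
    # dataset-specific attributes carried alongside the core fields
    "json": {"permitted-development-rights": "example value", "notes": "example only"},
}
```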
81 changes: 81 additions & 0 deletions
docs/data-operations-manual/Explanation/Operational-Procedures.md
# Operational Procedures

One of the key responsibilities of the data management team is adding new endpoints and keeping existing ones up to date. This page gives an overview of the important concepts behind the procedures we follow at different stages of the data lifecycle.

These procedures can vary based on whether a dataset is national or compiled, whether the data provider publishes updates to the same endpoint or a completely new one, and what sort of update has been made to an endpoint.

To help with this complexity, we have a few levels of documentation:

1. This explanatory overview is at the highest level.
2. Below that, the tutorials section covers a range of different scenarios that can occur when [adding](Adding-Data) and [maintaining](Maintaining-Data) data and explains the procedure that should be followed in each one.
3. The procedure steps in the scenarios link to the most detailed level of documentation - the [how-to guides](How-to-guides) - which give step-by-step instructions for how to complete particular tasks.

## Validating data

When receiving data from an LPA, we first need to validate it to check that it conforms to our data requirements.

Depending on the dataset, LPAs usually use the [check tool](https://submit.planning.data.gov.uk/check/) to check whether their data is good to go. They don't always do this, so we still need to validate the data manually. In addition, the check tool does not yet work for brownfield-land/site datasets, so we always need to validate those on our end.

Read the [how to validate an endpoint guide](Validate-an-endpoint) to see the steps we follow.

## Adding data

There are two main scenarios for adding data:

- Adding an endpoint for a new dataset and/or collection (i.e. we don't have the dataset at all yet)
- Adding a new endpoint to an existing dataset

The process differs slightly in each case.

A how-to on adding a new dataset and collection can be found [here](Add-a-new-dataset-and-collection).

A how-to on adding a new endpoint to an existing dataset can be found [here](Add-an-endpoint). Endpoints come in a variety of types: the format can differ from endpoint to endpoint, as can the plugins required to process the endpoint correctly.

More information on types can be found [here](Endpoint-URL-Types-And-Plugins#data-formats-of-resources-that-can-be-processed).

More information on plugins can be found [here](Endpoint-URL-Types-And-Plugins#adding-query-parameters-to-arcgis-server-urls).

## Maintaining data

Maintaining data means making sure that the changes a data provider makes to their data are reflected on the platform, which might be done either by adding new endpoints or by managing updates the provider makes to existing ones.

### Assigning entities

All entries on the platform must be assigned an entity number in the `lookup.csv` for the collection. This usually happens automatically when adding a new endpoint through the `add-endpoints-and-lookups` script. However, when an endpoint is already on the platform but the LPA has indicated that it has been updated with a new resource and new entries, we can’t simply re-add the endpoint. Instead, we assign the new entries their entity numbers differently.
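
As a simplified picture of what this involves, the sketch below allocates entity numbers for new references from the dataset's entity minimum-maximum range (defined in the specification), leaving existing lookups untouched. It is a sketch of the concept only; the how-to guide linked below describes the actual process and tooling.

```
# Conceptual sketch of assigning entity numbers to new entries: reuse
# existing lookups, and give each new reference the next free number from
# the range allocated to the dataset in the specification. Illustrative
# only; the real process goes through lookup.csv and the existing tooling.
def assign_entities(existing, new_references, entity_minimum, entity_maximum):
    """Return reference -> entity mappings, including any new references."""
    assigned = dict(existing)
    next_entity = max(assigned.values(), default=entity_minimum - 1) + 1
    for reference in new_references:
        if reference in assigned:
            continue  # already has an entity number, keep it stable
        if next_entity > entity_maximum:
            raise ValueError("dataset has exhausted its allocated entity range")
        assigned[reference] = next_entity
        next_entity += 1
    return assigned

# Example: two existing lookups, one new entry arriving with a new resource
lookups = {"Ar4.28": 5010000101, "Ar4.29": 5010000102}
print(assign_entities(lookups, ["Ar4.30"], 5010000000, 5019999999))
```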
A how-to on assigning entities can be found [here](Assign-entities).

### Merging entities

There can be duplicates present in a dataset. This primarily happens when multiple organisations provide data about the same object (or entity). We do not automatically detect and remove these; instead, the old-entity table is used to highlight these duplicates and bring them together under a single entity number.

A how-to on merging entities can be found [here](Merge-entities).

## Retiring data

### Retiring endpoints

When an endpoint consistently fails, or an LPA gives us a different endpoint (as opposed to the one we already have) from which to retrieve the data, we need to retire the old or failing endpoint. It is important to understand that retiring an endpoint does not mean that the data associated with it is retired as well; it only means that the collector stops collecting new resources (data) from the endpoint. The data previously retrieved from that endpoint will still be on the platform and will still be used.

When we retire an endpoint, we also need to retire the source(s) associated with it, as sources are dependent on endpoints.
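
A minimal sketch of what retirement could look like, assuming it is recorded by setting an end-date on the matching rows of the collection's endpoint and source configuration files rather than deleting them (the file names, columns and mechanism here are assumptions; follow the how-to guide below for the actual procedure):

```
# Sketch of endpoint retirement. Assumes (not confirmed by this page)
# that retirement is recorded by adding an end-date to the matching rows
# of the collection's endpoint.csv and source.csv.
import csv
from datetime import date
from pathlib import Path

def set_end_date(csv_path, key_field, key_value):
    """Set today's date as the end-date on every row matching the key."""
    with csv_path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return
    for row in rows:
        if row.get(key_field) == key_value and not row.get("end-date"):
            row["end-date"] = date.today().isoformat()
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

def retire_endpoint(collection_dir, endpoint_hash):
    # Retire the endpoint itself and every source that depends on it.
    set_end_date(Path(collection_dir) / "endpoint.csv", "endpoint", endpoint_hash)
    set_end_date(Path(collection_dir) / "source.csv", "endpoint", endpoint_hash)
```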
Read [how-to retire an endpoint](Retire-endpoints) to learn more.

### Retiring resources

It won’t be necessary to do this step often; however, sometimes a resource should not continue to be processed and included in the platform. This can happen for multiple reasons, and in most cases it will occur when the resource has been found to contain significant errors.

A how-to on retiring resources can be found [here](Retire-resources).

### Retiring entities

**Note:** We may want to keep old entities on our platform as historical data. There are two reasons an entity might be removed:

1. It was added in error. In this case, we should remove it from our system.
2. It has been stopped for some reason. In this scenario, we should retain the entity. Ideally, we would mark such entities with end-dates to indicate they have been stopped, but implementing this requires additional work.

For example, a World Heritage Site was added as an entity to our platform. Although it is no longer a World Heritage Site, we want to retain the entity to indicate that it held this status during a specific period.

In any given scenario, determine the reason why the entities are no longer present, and check with Swati before deleting entities.
# Explanation

This section explains our data operations and the key concepts needed to understand the project:

- [Data Quality Needs](./Key-Concepts/Data-quality-needs)
- [Organisation and provision](./Key-Concepts/Organisation-and-provision)
- [Pipeline configuration](./Key-Concepts/Pipeline-configuration)
- [Pipeline processes](./Key-Concepts/pipeline-processes)
- [Specification](./Key-Concepts/Specification)
- [Operational Procedures](Operational-Procedures/)