Explanation fixed
averheecke-tpx committed Oct 2, 2024
1 parent 78e5768 commit ca28528
Showing 9 changed files with 72 additions and 71 deletions.
listed-building,,,ListEntry,reference,,,
> [!NOTE]
> In most cases we prefer configuration to be made against an endpoint rather than a resource, as this means that the configuration persists when an endpoint is updated and a new resource is created.
For more information on the configuration options, read [How to configure an endpoint](../../How-To-Guides/Adding/Configure-an-endpoint)

10 changes: 10 additions & 0 deletions docs/data-operations-manual/Explanation/index.md
# Explanation

This section explains our data operations and the key concepts needed to understand the project:

- [Data Quality Needs](./Key-Concepts/Data-quality-needs)
- [Organisation and provision](./Key-Concepts/Organisation-and-provision)
- [Pipeline configuration](./Key-Concepts/Pipeline-configuration)
- [Pipeline processes](./Key-Concepts/pipeline-processes)
- [Specification](./Key-Concepts/Specification)
- [Operational Procedures](Operational-Procedures/)
50 changes: 25 additions & 25 deletions docs/data-operations-manual/How-To-Guides/Adding/Add-an-endpoint.md

**Prerequisites:**

- Cloned the [config repo](https://github.com/digital-land/config) by running `git clone [gitURL]` and updated it with `make init` in your virtual environment
- Validated the data. If you haven’t done this yet, follow the steps in ‘[Validating an Endpoint](../../Validating/Validate-an-endpoint)’ before continuing.

> [!NOTE]
> The endpoint_checker will pre-populate some of the commands mentioned in the steps below; check the end of the notebook underneath ‘_scripting_’.
1. **Create an import file**
If you don’t already have an import.csv file in the root of the config repository, create one with the command `touch import.csv`.

1. **Add configurations**

> [!TIP]
> Check the [Endpoint edge-cases](../../Adding/Add-an-endpoint#endpoint-edge-cases) section below for guidance on how to handle configuration for some non-standard scenarios, like a single endpoint being used for multiple provisions, or an endpoint for the `tree` dataset with polygon instead of point geometry.
1. **Populate the import file**

The following columns need to be included in `import.csv`:

- `endpoint-url` - the url that the collector needs to extract data from
- `documentation-url` - a url on the provider's website which contains information about the data
- `start-date` - the date that the collector should start from (this can be in the past)
- `plugins` - if a plugin is required to extract the data then it can be noted here, otherwise leave blank
- `pipelines` - the pipelines that need to be run on resources collected from this endpoint. These are equivalent to the datasets and, where more than one is necessary, they should be separated by `;`
- `organisation` - the organisation which the endpoint belongs to. The name should be in [this list](https://datasette.planning.data.gov.uk/digital-land/organisation)
- `licence` - the type of licence the data is published with. This can usually be found at the dataset's documentation url.

The endpoint checker should output text you can copy into `import.csv` with the required headers and values, or alternatively copy the headers below:

```
organisation,endpoint-url,documentation-url,start-date,pipelines,plugin,licence
```
Using the same example from [validating an endpoint](../../Validating/Validate-an-endpoint), the `import.csv` should look like this:
```
organisation,documentation-url,endpoint-url,start-date,pipelines,plugin,licence
local-authority-eng:SAW,https://www.sandwell.gov.uk/downloads/download/868/article-4-directions-planning-data,https://www.sandwell.gov.uk/downloads/file/2894/article-4-direction-dataset,,article-4-direction,,ogl3
```
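Before moving on, it can help to sanity-check `import.csv` from the shell. A minimal sketch, assuming a standard Unix environment and that none of the values contain quoted commas:
```
# Every row (header and data) should report the same field count (7 for the headers above)
awk -F, '{ print NF }' import.csv

# Line the columns up as a table for a quick visual check
column -s, -t < import.csv
```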
1. **Make changes to pipeline configuration files**
Use the [how to configure an endpoint guide](../Configure-an-endpoint) to see how each of the configuration files works.
The most common step here will be using `column.csv` to add in extra column mappings.
1. **Run add_endpoint_and_lookups script**
Run the add_endpoint_and_lookups script inside the config repository within the virtual environment (the endpoint checker’s scripting section pre-populates the exact command). After it has run:
- A new line should be added to endpoint.csv and source.csv.
- For each new lookup, a new line should be added to the lookup.csv.
The console output will show a list of new lookup entries organised by organisation and resource-hash. Seeing this is a good indication that the command ran successfully.
1. **Test locally**
Once the changes have been made and pushed, check out the relevant collection repository, i.e. if the data added was conservation-area, check out the conservation-area collection repository. Run the pipeline in the collection repo by running `make`. After the pipeline has finished running, use `make datasette` to interrogate the local datasets; this will enable you to check that the data is on the local platform as expected. In `lookups`, check if the entities added in the lookup.csv in step 4 are there.
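If you prefer to check from the terminal rather than the datasette UI, you can query the locally built SQLite file directly. A sketch, assuming the conservation-area example and that the built dataset sits under `dataset/` in the collection repo (adjust the path, dataset name and table to match what you rebuilt):
```
# Hypothetical path; the dataset file location and table layout may differ per collection
sqlite3 dataset/conservation-area.sqlite3 \
  "select entity, reference, name from entity order by entity desc limit 10;"
```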
1. **Push changes**
Use git to push changes up to the repository; each night when the collection runs, the files are downloaded from here. It is a good idea to name the commit after the organisation you are importing.
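A typical sequence might look like the following; the branch name and paths are illustrative, and whether you push straight to the default branch or open a pull request depends on your team's workflow:
```
git checkout -b add-sandwell-article-4-direction
# Stage the files the script updated (endpoint.csv, source.csv, lookup.csv); paths are indicative
git add collection/ pipeline/
git commit -m "Add Sandwell (local-authority-eng:SAW) article-4-direction endpoint"
git push origin add-sandwell-article-4-direction
```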
1. **Run action workflow (optional)**
Optionally, you can run the overnight workflow yourself if you don’t want to wait until the next day to check if the data is actually on the platform. Navigate to the corresponding collection’s repository actions page e.g. [article-4-direction-collection](https://github.com/digital-land/article-4-direction-collection/actions) and under ‘Call Collection Run’, run the workflow manually. Depending on the collection, this can take a while but after it has finished running you can check on datasette if the data is on the platform.
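If you have the GitHub CLI installed, the same workflow can be triggered from the terminal. A sketch, assuming the workflow is named ‘Call Collection Run’ and accepts a manual dispatch without extra inputs:
```
gh workflow run "Call Collection Run" --repo digital-land/article-4-direction-collection
```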
## Endpoint edge-cases
### Handling Combined Endpoints
Note that, when adding an endpoint that feeds into separate datasets or pipelines (such as an endpoint with data for _tree-preservation-zone_ and _tree_), the pipeline field in the import.csv file should be formatted to contain both datasets, separated by `;`, as in the sketch below.
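A sketch of such a row, with a placeholder organisation and example.gov.uk URLs; only the `pipelines` value is the point here:
```
organisation,documentation-url,endpoint-url,start-date,pipelines,plugin,licence
local-authority-eng:XXX,https://example.gov.uk/tpo-documentation,https://example.gov.uk/tpo-data,,tree;tree-preservation-zone,,ogl3
```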
When handling this type of endpoint, two possible scenarios may arise.
At times, the endpoints we receive might include both Tree and TPZ data. In cases like these, we need to add to the `filter.csv` file in the `tree-preservation-order` pipeline. The filter works based on the `tree-preservation-zone-type` pattern. Any data that corresponds to an `Area` pattern relates to a TPZ, while data corresponding to an `Individual` pattern relates to a Tree.
For example:
```
tree-preservation-zone,28cff16a15892b5d99e0fbdb99921bf1cfce6ac4a72017c54c012c4c07378169,tree-preservation-zone-type,Area,,,,,
tree,28cff16a15892b5d99e0fbdb99921bf1cfce6ac4a72017c54c012c4c07378169,tree-preservation-zone-type,Individual,,,,,
```
To find out whether there are multiple datasets in an endpoint, look at the raw data by searching for `tree-preservation-zone-type`. Based on the value, each record will belong either to TPZ or to tree.
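One quick way to do this from the command line, as a sketch assuming the endpoint returns GeoJSON and that `curl` and `jq` are available (substitute the real endpoint url for the placeholder):
```
# Tally how many features are Area (TPZ) vs Individual (tree)
curl -s "https://example.gov.uk/tpo-data" \
  | jq -r '.features[].properties["tree-preservation-zone-type"]' \
  | sort | uniq -c
```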
### Tree data with polygon instead of point
By default, the tree dataset `wkt` field (which is the incoming geometry from the resource) is mapped to `point` by a global mapping in `column.csv`. When a provider gives us `polygon` data instead of a `point`, we need to add a mapping in the `column.csv` file for the specific endpoint or resource from `wkt` to `geometry`, which will override the default mapping.
For example:
```
tree,422e2a9f2fb1d809d8849e05556aa7c232060673c1cc51d84bcf9bb586d5de52,,WKT,geometry,,,
```
As an example, this [datasette query](https://datasette.planning.data.gov.uk/digital-land/column_field?_sort=rowid&resource__exact=0889c8a96914abc22521f738a6cbad7b104ccff6256118a0a39bf94912cb38d4) shows a resource where we were provided a `polygon` dataset for tree so we mapped `wkt` to `geometry`.
Whereas [this](https://datasette.planning.data.gov.uk/digital-land/column_field?resource=05182443ad8ea72ec17fd2f46dd6e19126e86ddbc2d5f386bb2dab8b5f922d49) one was in `point` format, so we did not need to override the mapping. You’ll notice that the field related to the column `wkt` is point.
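To check which geometry type a resource actually contains before deciding whether the override is needed, a rough sketch, assuming the downloaded resource is a CSV with a WKT column:
```
# Rough string-based tally of the WKT geometry types in the resource
grep -o -E 'MULTIPOLYGON|POLYGON|POINT' resource.csv | sort | uniq -c
```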