README.qmd

---
title: "Testing network joining functions"
author: "Robin Lovelace"
format: gfm
execute: 
  echo: false
  message: false
  warning: false
---

# Introduction

Joining data is key to data science, allowing value to be added to disparate datasets by combining them.

There are various types of join, including based on shared 'key' values and shared space for spatial joins.
However, neither of these join types works for joining network data of the type shown below, which represents 2 separate networks with different but related geometries (source: [ATIP browse tool](https://acteng.github.io/atip/browse.html?style=streets#13.32/53.79562/-1.6874)).

![](images/paste-1.png)

Imagine you want to know what kind of cycle infrastructure is associated with each segment of the MRN.
That's the kind of problem that network joins can tackle.

This guide outlines the challenges of network joining and demonstrates implementation-agnostic solutions.

It's based on previous work:

-   The networkmerge project (and related [parenx](https://github.com/anisotropi4/parenx) Python package available on pip): <https://nptscot.github.io/networkmerge/>

-   An approach in JavaScript at <https://github.com/acteng/amat/tree/main/pct_lcwip_join>, described at <https://github.com/acteng/amat/blob/main/js/model.md#pct-join>

-   The rnetmatch approach, which has been implemented in Rust with a nascent R wrapper (there are plans for a Python wrapper)
    - See text for paper at <https://github.com/nptscot/rnetmatch/blob/main/paper.qmd>
    - See description of the algorithm implemented in Rust here: <https://github.com/nptscot/rnetmatch/blob/main/rust/README.md>

We'll use data from the Propensity to Cycle Tool and the [OpenRoads](https://osdatahub.os.uk/downloads/open/OpenRoads) dataset as an example.

# Example dataset

Datasets were take from a few case study areas.

```{r}
#| include: false
remotes::install_dev("stplanr")
library(tidyverse)
library(patchwork)
```

We'll focus on Thornbury, West Yorkshire for now.

```{r}
#| include: false
study_area_name = "Thornbury West Yorkshire"
study_area_zones = zonebuilder::zb_zone(study_area_name)
mapview::mapview(study_area_zones)
study_area_centroid = sf::st_centroid(study_area_zones[1, ])
study_area_1km = sf::st_buffer(study_area_centroid, 1000)
study_area_1km_projected = sf::st_transform(study_area_1km, 27700)
```

```{r}
if (!file.exists("data/open_roads_thornbury.geojson")) {
    message("You lack open roads data locally")
    u = "https://api.os.uk/downloads/v1/products/OpenRoads/downloads?area=GB&format=GeoPackage&redirect"
    f = "oproad_gpkg_gb.zip"
    if (!file.exists(f)) {
        download.file(u, f)
        unzip(f)
    } 
    sf::st_layers("Data/oproad_gb.gpkg")
    open_roads_national = sf::read_sf("Data/oproad_gb.gpkg", layer = "road_link")
    open_roads_thornbury = open_roads_national[study_area_1km_projected, , op = sf::st_within]
    names(open_roads_thornbury)
#  [1] "id"                         "fictitious"                
#  [3] "road_classification"        "road_function"             
#  [5] "form_of_way"                "road_classification_number"
#  [7] "name_1"                     "name_1_lang"               
#  [9] "name_2"                     "name_2_lang"               
# [11] "road_structure"             "length"                    
# [13] "length_uom"                 "loop"                      
# [15] "primary_route"              "trunk_road"                
# [17] "start_node"                 "end_node"                  
# [19] "road_number_toid"           "road_name_toid"            
# [21] "geometry"   
    table(open_roads_thornbury$road_classification)
    table(open_roads_thornbury$road_function)
    table(open_roads_thornbury$form_of_way)
    table(open_roads_thornbury$road_structure)
    table(open_roads_thornbury$trunk_road)
    plot(open_roads_thornbury$geom)
    sf::st_write(open_roads_thornbury, "data/open_roads_thornbury.gpkg")
    # Prepare to save as geojson
    open_roads_thornbury |>
      sf::st_transform(4326) |>
      sf::write_sf("data/open_roads_thornbury.geojson")
}
```

```{r}
#| include: false
#| label: pct-data
if (!file.exists("data/pct_thornbury.geojson")) {
    message("You lack PCT data locally")
    rnet_wyca = pct::get_pct_rnet("west-yorkshire")
    rnet_wyca = rnet_wyca |>
      transmute(flow = bicycle)
    names(rnet_wyca)
    rnet_wyca_projected = sf::st_transform(rnet_wyca, 27700)
    pct_thornbury = rnet_wyca_projected[study_area_1km_projected, , op = sf::st_within]
    sf::st_write(pct_thornbury, "data/pct_thornbury.gpkg", delete_dsn = TRUE)
    pct_thornbury = pct_thornbury |>
      sf::st_transform(4326) |>
      sf::write_sf("data/pct_thornbury.geojson")
}
```

The following commands will load the datar (available in the [data](data)) folder in the repo into R (the equivalent function in Python would be `geopandas.read_file`):

```{r}
#| echo: true
net_x = sf::read_sf("data/open_roads_thornbury.geojson")
net_y = sf::read_sf("data/pct_thornbury.geojson")
```

For the purposes of this demo we'll use projected data, although the functions should work with unprojected data too:

```{r}
#| echo: true
net_x = sf::st_transform(net_x, "EPSG:27700")
net_y = sf::st_transform(net_y, "EPSG:27700")
```


```{r}
#| label: load-data-thornbury
# net_combined = bind_rows(
#     net_x |> transmute(source = "OpenRoads"),
#     net_y |> transmute(source = "PCT")
# )
# net_combined |>
#     ggplot() +
#     geom_sf(aes(colour = source, size = source)) +
#     scale_size_manual(values = c("OpenRoads" = 1, "PCT" = 2)) +
#     theme_void()
ggplot() +
    geom_sf(data = net_x, colour = "grey", linewidth = 2) +
    geom_sf(data = net_y, aes(colour = flow)) +
    # Categorised flow with breaks at 0, 5, 10, 20:
    scale_colour_viridis_c(breaks = c(0, 5, 10, 20)) +
    theme_void()
```

# Subsetting the target 'x' network (optional)

A first step, to speed-up the join and reduce the size of the data, can be to keep only the records in the target 'x' dataset that are relevant.
After this filtering step, the datasets look like this:

```{r}
#| label: subset-data
#| layout-ncol: 2
# ?stplanr::rnet_subset
net_x_subset = stplanr::rnet_subset(net_x, net_y, dist = 30, crop = FALSE)
net_x_subset_10 = stplanr::rnet_subset(net_x, net_y, dist = 10, crop = FALSE)
net_x_subset_20 = stplanr::rnet_subset(net_x, net_y, dist = 20, crop = FALSE)
net_x_subset_cropped = stplanr::rnet_subset(net_x, net_y, dist = 30, crop = TRUE)
g1 = net_x_subset |>
    ggplot() +
    geom_sf(colour = "grey", linewidth = 2) +
    geom_sf(data = net_y, aes(colour = flow)) +
    scale_colour_viridis_c(breaks = c(0, 5, 10, 20)) +
    theme_void() +
    ggtitle("Subset without cropping (dist = 30)")    

g2 = net_x_subset_10 |>
    ggplot() +
    geom_sf(colour = "grey", linewidth = 2) +
    geom_sf(data = net_y, aes(colour = flow)) +
    scale_colour_viridis_c(breaks = c(0, 5, 10, 20)) +
    theme_void() +
    ggtitle("Subset without cropping (dist = 10)")

g3 = net_x_subset_20 |>
    ggplot() +
    geom_sf(colour = "grey", linewidth = 2) +
    geom_sf(data = net_y, aes(colour = flow)) +
    scale_colour_viridis_c(breaks = c(0, 5, 10, 20)) +
    theme_void() +
    ggtitle("Subset without cropping (dist = 20)")

g4 = net_x_subset_cropped |>
    ggplot() +
    geom_sf(colour = "grey", linewidth = 2) +
    geom_sf(data = net_y, aes(colour = flow)) +
    scale_colour_viridis_c(breaks = c(0, 5, 10, 20)) +
    theme_void() +
    ggtitle("Subset with cropping")

(g1 | g2) / (g3 | g4)
```

Of the four options, the third (with a distance of 20) looks like the best compromise between omitting unwanted links while retaining the majority of the network.

The most appropriate distance depends on your data and use case, it may be worth keeping more of the 'x' network than you need and using the join to filter out unwanted links (the subsetting stage is not essential).

# Basic spatial join

A simple approach to joining the two networks is with a simple spatial join, using one of the available 'binary predicates', such as `st_intersects`, `st_within` (relevant for buffers), `st_contains` or `st_touches`.

The results when the `flow` attributes are summed are shown below:

```{r}
#| label: spatial-join
net_join = sf::st_join(net_x_subset_20, net_y, join = sf::st_intersects)
# nrow(net_join) / nrow(net_x_subset_20)
net_join_values = net_join |>
  sf::st_drop_geometry() |>
  group_by(id) |>
  summarise(flow = sum(flow, na.rm = TRUE))
net_x_joined = left_join(net_x_subset_20, net_join_values, by = "id")
g1 = net_y |>
    ggplot() +
    geom_sf(aes(colour = flow)) +
    scale_colour_viridis_c(breaks = c(0, 5, 10, 20)) +
    theme_void()
g2 = net_x_joined |>
    ggplot() +
    geom_sf(aes(colour = flow)) +
    scale_colour_viridis_c(breaks = c(0, 5, 10, 20)) +
    theme_void()
g1 + g2
total_flow_simple_join = sum(net_x_joined$flow * sf::st_length(net_x_joined)) |>
  as.numeric()
total_flow_y = sum(net_y$flow * sf::st_length(net_y)) |>
  as.numeric()
```

Unfortunately the results are way out: the total flow in the joined network using this basic join approach is less than half (`r round(total_flow_simple_join / total_flow_y * 100)` %) the total flow in the original network.

# The rnet_merge approach

The `rnet_join` and `rnet_merge()` functions in {stplanr}were developed to address this issue.
Code to undertake the join are shown below.

```{r}
#| label: rnet-merge
#| echo: true
rnet_joined_1 = stplanr::rnet_join(net_x_subset_20, net_y, dist = 15)
rnet_joined_2 = stplanr::rnet_join(net_x_subset_20, net_y, dist = 15, segment_length = 10)
```

The results are based on 'flat headed' buffers around the x geometry, with results kept in this form in the output, as shown in the table and figure below.

```{r}
rnet_joined_2 |>
  sf::st_drop_geometry() |>
  slice(c(1, 2, 9)) |>
  knitr::kable()
```

```{r}
g1 = rnet_joined_1 |>
    ggplot() +
    geom_sf(aes(fill = flow), colour = NA, alpha = 0.2) +
    scale_fill_viridis_c(breaks = c(0, 5, 10, 20)) +
    theme_void()
g2 = rnet_joined_2 |>
    ggplot() +
    geom_sf(aes(fill = flow), colour = NA, alpha = 0.2) +
    scale_fill_viridis_c(breaks = c(0, 5, 10, 20)) +
    theme_void()

g1 + g2
```

The plot shows big differences in the results of the `rnet_join()` function depending on the `segment_length` parameter, which splits `y` links into segments of the specified length, before doing the join to the buffer.

Let's aggregate the results and re-join to the geometries of `x`, this time using the lengths of `x` and `y` to ensure more important values are given more weight, and see if the results are more accurate.

```{r}
rnet_joined_1_values = rnet_joined_1 |>
  sf::st_drop_geometry() |>
  group_by(id) |>
  mutate(flow_distance = flow * length_y) |>
  summarise(flow_distance = sum(flow_distance, na.rm = TRUE))
rnet_joined_2_values = rnet_joined_2 |>
    sf::st_drop_geometry() |>
    group_by(id) |>
    mutate(flow_distance = flow * length_y) |>
    summarise(flow_distance = sum(flow_distance, na.rm = TRUE))
rnet_joined_1 = left_join(net_x_subset_20, rnet_joined_1_values, by = "id")

rnet_joined_2 = left_join(net_x_subset_20, 
rnet_joined_2_values, by = "id")

# re-caluclate total flow
rnet_joined_1 = rnet_joined_1 |>
  mutate(flow = flow_distance / length) 
rnet_joined_2 = rnet_joined_2 |>
  mutate(flow = flow_distance / length)

total_flow_rnet_join_1 = sum(rnet_joined_1$flow * sf::st_length(rnet_joined_1)) |>
  as.numeric()
total_flow_rnet_join_2 = sum(rnet_joined_2$flow * sf::st_length(rnet_joined_2)) |>
    as.numeric()
```

The total flow on the networks are `r round(total_flow_rnet_join_1 / total_flow_y * 100)` % and `r round(total_flow_rnet_join_2 / total_flow_y * 100)` % of the total flow in the original network, respectively.
This shows the importance of the `segment_length` parameter in the `rnet_join()` function, which splits the 'y' network segments with attributes into segments.

An alternative approach, not yet implemented, would be to split y not at regular intervals but at intersections with x, which would be more accurate but more computationally intensive.

<!-- The results are shown below: -->

```{r}
g1 = rnet_joined_1 |>
    ggplot() +
    geom_sf(aes(colour = flow)) +
    scale_colour_viridis_c(breaks = c(0, 5, 10, 20)) +
    theme_void()
g2 = rnet_joined_2 |>
    ggplot() +
    geom_sf(aes(colour = flow)) +
    scale_colour_viridis_c(breaks = c(0, 5, 10, 20)) +
    theme_void()
g1 + g2
```


```{r}
#| include: false
# This shows the values are quite different
waldo::compare(rnet_joined_1$geometry, rnet_joined_2$geometry)
waldo::compare(rnet_joined_1$id, rnet_joined_2$id) # big differences
waldo::compare(rnet_joined_1$flow, rnet_joined_2$flow)
```

The results above show that the `rnet_join()` function works well, capturing the majority of the flow in the `y` network, with the `segment_length` parameter allowing the user to control the level of detail in the output.
Users have full control over the aggregating functions used, which can be useful for different use cases, e.g. to classify the highway type, as shown in the code snippet below, which demonstrates the function's ability to work in either direction (x and y can be swapped).

```{r}
most_common_value = function(x) {
  if (length(x) == 0) {
    return(NA)
  } else {
    # Remove NA values if length X is greater than 1 and there are non NA values:
    x = x[!is.na(x)]
    res = names(sort(table(x), decreasing = TRUE)[1])
    if (is.null(res)) {
      return(NA)
    } else {
      return(res)
    }
  }
}
```

```{r}
#| label: rnet-join-classify
#| echo: true
net_y = sf::read_sf("data/open_roads_thornbury.geojson") |>
  transmute(road_function = road_function) |>
  sf::st_transform(27700)
net_x = sf::read_sf("data/pct_thornbury.geojson") |>
  transmute(id = 1:n()) |>
  sf::st_transform(27700)
net_x = stplanr::rnet_subset(net_x, net_y, dist = 20)
rnet_joined = stplanr::rnet_join(net_x, net_y, segment_length = 10, dist = 15)
rnet_joined_values = rnet_joined |>
  sf::st_drop_geometry() |>
  group_by(id) |>
  summarise(
    # Most frequent road function:
    road_function = most_common_value(road_function)
    )
rnet_joined_x = left_join(net_x, rnet_joined_values, by = "id")
rnet_joined_x |>
  ggplot() +
  geom_sf(aes(colour = road_function)) +
  theme_void()
```


```{r}
#| eval: false
# See https://github.com/JosiahParry/rsgeo/issues/42
# Try stplanr::line_sement on every line to find error:
for (i in 1:nrow(net_y)) {
  tryCatch({
    stplanr::line_segment(net_y[i, ], segment_length = 10, use_rsgeo = TRUE)
  }, error = function(e) {
    print(i)
    print(e)
  })
}
# 350 fails:
sf = net_y[350, ]
sf$geom
mapview::mapview(sf)
stplanr::line_segment(sf, segment_length = 10, use_rsgeo = TRUE)
sfc = sf::st_geometry(sf)
sf::st_crs(sfc) = NA
dput(sfc)

remotes::install_dev("rsgeo")

sfc_integer = sf::st_linestring(
    cbind(
        c(418938.4, 418949.7, 418961),
        c(434303.2, 434280.1, 434257)
    )
) |>
  sf::st_sfc()
sfc_no_integer = sf::st_linestring(
    cbind(
        c(418938.4, 418949.7, 418961.1),
        c(434303.2, 434280.1, 434257)
    )
) |>
  sf::st_sfc()

rsgeo::line_segmentize(rsgeo::as_rsgeo(sfc_integer), n = 6) |>
  sf::st_as_sfc() |>
  sf::st_cast("LINESTRING") |>
  length()

rsgeo::line_segmentize(rsgeo::as_rsgeo(sfc_no_integer), n = 6) |>
  sf::st_as_sfc() |>
  sf::st_cast("LINESTRING") |>
  length()

sf::st_crs(sfc_integer) = "EPSG:27700"
sf::st_crs(sfc_no_integer) = "EPSG:27700"
stplanr::line_segment(sfc_integer, segment_length = 10, use_rsgeo = TRUE)
stplanr::line_segment(sfc_no_integer, segment_length = 10, use_rsgeo = TRUE)

sfc_integer_wgs84 = sf::st_transform(sfc_integer, 4326)

res = lwgeom::st_geod_segmentize(sfc_integer_wgs84, max_seg_length = units::set_units(10, "m")) 
waldo::compare(res, sfc_integer_wgs84)
```