feat: implemented transfer distance validator #1958

Open
wants to merge 6 commits into
base: master

@cka-y cka-y commented Feb 3, 2025

Summary

This PR introduces an info notice when the transfer distance exceeds 2 km and a warning when it exceeds 10 km.
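A minimal sketch of the rule described above, for readers who want to see the thresholds in code. The class and method names, and the choice of a great-circle (haversine) distance between the two stops of a transfer, are illustrative assumptions and not the validator's actual implementation; only the 2 km and 10 km thresholds come from this PR.

```java
// Hypothetical sketch, not the code added by this PR.
final class TransferDistanceSketch {
  enum Severity { NONE, INFO, WARNING }

  static final double INFO_THRESHOLD_KM = 2.0;     // from the PR summary
  static final double WARNING_THRESHOLD_KM = 10.0; // from the PR summary
  static final double EARTH_RADIUS_KM = 6371.0;    // mean Earth radius (assumption)

  /** Great-circle distance in km between two lat/lon points, via the haversine formula. */
  static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
    double dLat = Math.toRadians(lat2 - lat1);
    double dLon = Math.toRadians(lon2 - lon1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
        + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
            * Math.sin(dLon / 2) * Math.sin(dLon / 2);
    return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
  }

  /** Maps a transfer distance to a severity; "exceeds" is read as strictly greater than. */
  static Severity classify(double distanceKm) {
    if (distanceKm > WARNING_THRESHOLD_KM) {
      return Severity.WARNING;
    }
    if (distanceKm > INFO_THRESHOLD_KM) {
      return Severity.INFO;
    }
    return Severity.NONE;
  }
}
```

Under these assumptions, a transfer between stops 5 km apart would yield an info notice, and one between stops 24 km apart would yield a warning.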

Expected Behavior

Using mdb-784, we observe the following results:
[Two screenshots]

Distribution Analysis

The graph below shows the distribution of the max transfer distance across feeds in the Mobility Database.
Most transfer distances fall below the 2 km info threshold, with only a few exceeding the warning threshold of 10 km.

[Figure: transfer_distance_distribution]

Top 3 Largest Distances

  1. mdb-784: 5,463 km
  2. mdb-927: 24 km
  3. mdb-1172: 18 km

While mdb-784 contains an extreme outlier, most feeds remain within the defined thresholds (2 km for info notices, 10 km for warnings).
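As a rough illustration of how such a per-feed tally could be computed, the sketch below counts feeds whose maximum transfer distance exceeds a threshold. The class and method names are hypothetical, and the three values are read from the top 3 list above as 5,463 km, 24 km and 18 km; the full distribution behind the graph is not reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: tallies feeds by their maximum transfer distance.
final class MaxDistanceTally {
  /** Returns how many feeds have a maximum transfer distance strictly above thresholdKm. */
  static long countAbove(Map<String, Double> maxKmByFeed, double thresholdKm) {
    long count = 0;
    for (double km : maxKmByFeed.values()) {
      if (km > thresholdKm) {
        count++;
      }
    }
    return count;
  }

  /** The three largest values listed in the PR description (feed id -> max distance in km). */
  static Map<String, Double> top3FromPr() {
    Map<String, Double> feeds = new LinkedHashMap<>();
    feeds.put("mdb-784", 5463.0);
    feeds.put("mdb-927", 24.0);
    feeds.put("mdb-1172", 18.0);
    return feeds;
  }
}
```

Calling `MaxDistanceTally.countAbove(MaxDistanceTally.top3FromPr(), 10.0)` counts all three listed feeds, since each exceeds the 10 km threshold.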

  • Run the unit tests with gradle test to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification (https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)

@emmambd emmambd linked an issue Feb 3, 2025 that may be closed by this pull request

emmambd commented Feb 3, 2025

@cka-y I know this isn't code review ready yet but I got excited...

  1. transfer_distance_too_large should be a warning and transfer_distance_above_threshold should be INFO
  2. Let's change the name of transfer_distance_above_threshold to something easier to understand: transfer_distance_above_2km?

@emmambd emmambd requested a review from skalexch February 3, 2025 20:57
@cka-y cka-y marked this pull request as draft February 3, 2025 21:11
@MobilityData MobilityData deleted a comment from github-actions bot Feb 3, 2025
@cka-y cka-y marked this pull request as ready for review February 3, 2025 21:27

github-actions bot commented Feb 3, 2025

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit 0ad6a39
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Errors (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

New Warnings (4 out of 1801 datasets, ~0%) ✅

Details of new errors due to code change, which is less than the provided threshold of 1%.

| Dataset | Notice Code |
|---|---|
| de-thuringen-verkehrsverbund-mittelthuringen-vmt-gtfs-1172 | transfer_distance_too_large |
| fr-ile-de-france-regie-autonome-des-transports-parisiens-gtfs-1291 | transfer_distance_too_large |
| ie-unknown-bus-eireann-gtfs-941 | transfer_distance_too_large |
| us-new-york-sullivan-county-transit-gtfs-927 | transfer_distance_too_large |

Dropped Warnings (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.
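The pass/fail decision in the report above compares the share of affected datasets against a 1% threshold. A minimal sketch of that comparison follows; the class and method names are hypothetical, and the exact rule the acceptance workflow applies (for example, which severities count toward the threshold) is an assumption here.

```java
// Hypothetical sketch of a percentage-based acceptance check.
final class AcceptanceThresholdSketch {
  /** Percentage of datasets that gained a new notice. */
  static double affectedPercent(int affectedDatasets, int totalDatasets) {
    if (totalDatasets <= 0) {
      throw new IllegalArgumentException("totalDatasets must be positive");
    }
    return 100.0 * affectedDatasets / totalDatasets;
  }

  /** Passes when the affected share stays strictly below the threshold (assumed rule). */
  static boolean passes(int affectedDatasets, int totalDatasets, double thresholdPercent) {
    return affectedPercent(affectedDatasets, totalDatasets) < thresholdPercent;
  }
}
```

With the figures from this report, 4 affected datasets out of 1801 is roughly 0.22%, which is why the report rounds it to ~0% and marks it as passing.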

🛡️ Corruption Check

0 out of 1801 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

| Time Metric | Dataset ID | Reference (s) | Latest (s) | Difference (s) |
|---|---|---|---|---|
| Average | -- | 3.65 | 3.76 | ⬆️+0.11 |
| Median | -- | 1.33 | 1.42 | ⬆️+0.09 |
| Standard Deviation | -- | 10.46 | 10.46 | ⬆️+0.01 |
| Minimum in Reference Reports | us-california-city-of-wasco-gtfs-1788 | 0.48 | 0.55 | ⬆️+0.08 |
| Maximum in Reference Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 280.08 | 281.76 | ⬆️+1.68 |
| Minimum in Latest Reports | us-california-city-of-wasco-gtfs-1788 | 0.48 | 0.55 | ⬆️+0.08 |
| Maximum in Latest Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 280.08 | 281.76 | ⬆️+1.68 |

📜 Memory Consumption

| Metric | Dataset ID | Reference | Latest | Difference |
|---|---|---|---|---|
| Average | -- | 459.29 MiB | 468.13 MiB | ⬆️+8.84 MiB |
| Median | -- | 331.92 MiB | 333.92 MiB | ⬆️+2.00 MiB |
| Standard Deviation | -- | 752.18 MiB | 784.36 MiB | ⬆️+32.18 MiB |
| Minimum in Reference Reports | ro-vrancea-consiliul-judetean-vrancea-gtfs-1984 | 38.38 MiB | 69.06 MiB | ⬆️+30.68 MiB |
| Maximum in Reference Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 10.58 GiB | 10.58 GiB | ⬇️-946.71 KiB |
| Minimum in Latest Reports | us-virginia-star-transit-gtfs-819 | 415.92 MiB | 39.89 MiB | ⬇️-376.03 MiB |
| Maximum in Latest Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 10.58 GiB | 10.58 GiB | ⬇️-946.71 KiB |

@skalexch skalexch left a comment

I checked the validator table and the test code. The output looks correct.

emmambd commented Feb 4, 2025

@cka-y - Looks great.

  1. Curious why the list of feeds in the acceptance tests for transfer_distance_too_large looks smaller than it did in the first tests
  2. Can we make an issue to add INFO notices to the acceptance tests in the future? As we add more of the non-spec, threshold-based rules to the validator, being able to view the INFO severity will be important.

cka-y commented Feb 4, 2025

@emmambd

  1. In the first acceptance tests, I had +17 errors and +4 warnings. The 2 km threshold rule was actually set at the error level, while the 10 km threshold was at the warning level (the PR wasn't quite ready for review yet, which is why I deleted the acceptance test report! 😅). The warning results remain the same because the acceptance tests do not report info-level notices.
  2. Add INFO level notices to acceptance tests #1962

emmambd commented Feb 4, 2025

  1. I think I'm still confused why there were 17 transfer_distance_too_large notices before but now there are only 4.

github-actions bot commented Feb 4, 2025

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit f1ba463
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Errors (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

New Warnings (4 out of 1801 datasets, ~0%) ✅

Details of new errors due to code change, which is less than the provided threshold of 1%.

| Dataset | Notice Code |
|---|---|
| de-thuringen-verkehrsverbund-mittelthuringen-vmt-gtfs-1172 | transfer_distance_too_large |
| fr-ile-de-france-regie-autonome-des-transports-parisiens-gtfs-1291 | transfer_distance_too_large |
| ie-unknown-bus-eireann-gtfs-941 | transfer_distance_too_large |
| us-new-york-sullivan-county-transit-gtfs-927 | transfer_distance_too_large |

Dropped Warnings (0 out of 1801 datasets, ~0%) ✅

No changes were detected due to the code change.

🛡️ Corruption Check

0 out of 1801 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

| Time Metric | Dataset ID | Reference (s) | Latest (s) | Difference (s) |
|---|---|---|---|---|
| Average | -- | 3.66 | 3.75 | ⬆️+0.09 |
| Median | -- | 1.37 | 1.44 | ⬆️+0.07 |
| Standard Deviation | -- | 10.34 | 10.31 | ⬇️-0.03 |
| Minimum in Reference Reports | us-massachusetts-massachusetts-area-express-max-gtfs-431 | 0.48 | 0.55 | ⬆️+0.07 |
| Maximum in Reference Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 280.27 | 273.55 | ⬇️-6.72 |
| Minimum in Latest Reports | us-idaho-pocatello-regional-transit-gtfs-171 | 0.69 | 0.54 | ⬇️-0.15 |
| Maximum in Latest Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 280.27 | 273.55 | ⬇️-6.72 |

📜 Memory Consumption

| Metric | Dataset ID | Reference | Latest | Difference |
|---|---|---|---|---|
| Average | -- | 462.48 MiB | 465.13 MiB | ⬆️+2.65 MiB |
| Median | -- | 335.90 MiB | 331.92 MiB | ⬇️-3.97 MiB |
| Standard Deviation | -- | 756.38 MiB | 755.66 MiB | ⬇️-729.13 KiB |
| Minimum in Reference Reports | us-colorado-town-of-telluride-gtfs-2050 | 39.14 MiB | 379.92 MiB | ⬆️+340.79 MiB |
| Maximum in Reference Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 10.57 GiB | 10.88 GiB | ⬆️+319.84 MiB |
| Minimum in Latest Reports | ro-vrancea-consiliul-judetean-vrancea-gtfs-1984 | 70.45 MiB | 40.08 MiB | ⬇️-30.38 MiB |
| Maximum in Latest Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 10.57 GiB | 10.88 GiB | ⬆️+319.84 MiB |

github-actions bot commented Feb 6, 2025

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit 757f5c2
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (0 out of 1809 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Errors (0 out of 1809 datasets, ~0%) ✅

No changes were detected due to the code change.

New Warnings (4 out of 1809 datasets, ~0%) ✅

Details of new errors due to code change, which is less than the provided threshold of 1%.

| Dataset | Notice Code |
|---|---|
| de-thuringen-verkehrsverbund-mittelthuringen-vmt-gtfs-1172 | transfer_distance_too_large |
| fr-ile-de-france-regie-autonome-des-transports-parisiens-gtfs-1291 | transfer_distance_too_large |
| ie-unknown-bus-eireann-gtfs-941 | transfer_distance_too_large |
| us-new-york-sullivan-county-transit-gtfs-927 | transfer_distance_too_large |

Dropped Warnings (0 out of 1809 datasets, ~0%) ✅

No changes were detected due to the code change.

🛡️ Corruption Check

0 out of 1809 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

| Time Metric | Dataset ID | Reference (s) | Latest (s) | Difference (s) |
|---|---|---|---|---|
| Average | -- | 3.65 | 3.72 | ⬆️+0.07 |
| Median | -- | 1.33 | 1.40 | ⬆️+0.07 |
| Standard Deviation | -- | 10.54 | 10.53 | ⬇️-0.01 |
| Minimum in Reference Reports | us-oregon-high-desert-point-gtfs-636 | 0.47 | 0.56 | ⬆️+0.09 |
| Maximum in Reference Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 288.90 | 289.28 | ⬆️+0.39 |
| Minimum in Latest Reports | ph-unknown-hm-transport-inc-and-robinsons-malls-gtfs-1105 | 0.49 | 0.48 | ⬇️-0.01 |
| Maximum in Latest Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 288.90 | 289.28 | ⬆️+0.39 |

📜 Memory Consumption

| Metric | Dataset ID | Reference | Latest | Difference |
|---|---|---|---|---|
| Average | -- | 470.32 MiB | 462.38 MiB | ⬇️-7.94 MiB |
| Median | -- | 332.27 MiB | 331.92 MiB | ⬇️-360.41 KiB |
| Standard Deviation | -- | 777.11 MiB | 745.28 MiB | ⬇️-31.83 MiB |
| Minimum in Reference Reports | mexico-jalisco-direccion-general-de-transporte-publico-de-puerto-vallarta-gtfs-2034 | 36.42 MiB | 407.92 MiB | ⬆️+371.50 MiB |
| Maximum in Reference Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 10.65 GiB | 10.94 GiB | ⬆️+303.53 MiB |
| Minimum in Latest Reports | us-california-redding-area-bus-authority-raba-gtfs-114 | 40.64 MiB | 38.92 MiB | ⬇️-1.72 MiB |
| Maximum in Latest Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 10.65 GiB | 10.94 GiB | ⬆️+303.53 MiB |

@skalexch skalexch left a comment

@cka-y just one question before approving: what happened to mdb-784? It doesn't appear among the 4 new warnings even though its transfer distance is 5,463 km.

cka-y commented Feb 10, 2025

@skalexch mdb-784 (equivalent to de-unknown-rursee-schifffahrt-kg-gtfs-784) is actually omitted from the acceptance tests because of its size. Ref

@skalexch skalexch left a comment

@cka-y thanks for clarifying. Everything looks fine on the spec side.

jcpitre commented Feb 16, 2025

> @cka-y I know this isn't code review ready yet but I got excited...
>
>   1. transfer_distance_too_large should be a warning and transfer_distance_above_threshold should be INFO
>   2. Let's change the name of transfer_distance_above_threshold to something easier to understand: transfer_distance_above_2km?

@emmambd If we change the transfer_distance_above_threshold to transfer_distance_above_2km, shouldn't transfer_distance_too_large become transfer_distance_above_10km?

And, to go the other way, is it advisable to have the distance set in the notice name? Are we sure we will never decide that, for example, 1 km is enough of a distance to warrant an info instead of 2km?
Will we just add a new notice transfer_distance_above_1km in that case, while keeping transfer_distance_above_2km for backwards compatibility?

📝 Acceptance Test Report

📋 Summary

✅ The rule acceptance has passed for commit 99fe817
Download the full acceptance test report here (report will disappear after 90 days).

📊 Notices Comparison

New Errors (0 out of 1811 datasets, ~0%) ✅

No changes were detected due to the code change.

Dropped Errors (0 out of 1811 datasets, ~0%) ✅

No changes were detected due to the code change.

New Warnings (3 out of 1811 datasets, ~0%) ✅

Details of new errors due to code change, which is less than the provided threshold of 1%.

| Dataset | Notice Code |
|---|---|
| de-thuringen-verkehrsverbund-mittelthuringen-vmt-gtfs-1172 | transfer_distance_too_large |
| fr-ile-de-france-regie-autonome-des-transports-parisiens-gtfs-1291 | transfer_distance_too_large |
| ie-unknown-bus-eireann-gtfs-941 | transfer_distance_too_large |

Dropped Warnings (0 out of 1811 datasets, ~0%) ✅

No changes were detected due to the code change.

🛡️ Corruption Check

0 out of 1811 sources (~0 %) are corrupted.

⏱️ Performance Assessment

📈 Validation Time

Assess the performance in terms of seconds taken for the validation process.

| Time Metric | Dataset ID | Reference (s) | Latest (s) | Difference (s) |
|---|---|---|---|---|
| Average | -- | 3.65 | 3.72 | ⬆️+0.07 |
| Median | -- | 1.33 | 1.39 | ⬆️+0.06 |
| Standard Deviation | -- | 10.44 | 10.49 | ⬆️+0.05 |
| Minimum in Reference Reports | ph-unknown-hm-transport-inc-and-robinsons-malls-gtfs-1105 | 0.48 | 0.58 | ⬆️+0.11 |
| Maximum in Reference Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 283.63 | 284.41 | ⬆️+0.77 |
| Minimum in Latest Reports | us-california-city-of-wasco-gtfs-1788 | 0.51 | 0.49 | ⬇️-0.02 |
| Maximum in Latest Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 283.63 | 284.41 | ⬆️+0.77 |

📜 Memory Consumption

| Metric | Dataset ID | Reference | Latest | Difference |
|---|---|---|---|---|
| Average | -- | 471.61 MiB | 463.79 MiB | ⬇️-7.82 MiB |
| Median | -- | 335.92 MiB | 335.92 MiB | ⬇️0 bytes |
| Standard Deviation | -- | 781.51 MiB | 755.47 MiB | ⬇️-26.04 MiB |
| Minimum in Reference Reports | mexico-jalisco-direccion-general-de-transporte-publico-de-puerto-vallarta-gtfs-2034 | 38.60 MiB | 403.92 MiB | ⬆️+365.32 MiB |
| Maximum in Reference Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 11.13 GiB | 11.03 GiB | ⬇️-100.80 MiB |
| Minimum in Latest Reports | us-california-redding-area-bus-authority-raba-gtfs-114 | 411.92 MiB | 38.89 MiB | ⬇️-373.04 MiB |
| Maximum in Latest Reports | gb-unknown-uk-aggregate-feed-gtfs-2014 | 11.13 GiB | 11.03 GiB | ⬇️-100.80 MiB |

emmambd commented Feb 20, 2025

Hey @jcpitre! Sorry for the wait; I asked @skalexch to take a look at how this worked in the old deprecated validator first. The old validator has only one notice, and its severity varies depending on the result, as does its message, which dynamically displays the invalid transfer distance.

I still think we should keep the behavior as is: transfer_distance_too_large for the warning level and transfer_distance_above_2km as the info level. If we do change the threshold for the info level, we would deprecate the notice and follow the path defined in #1964.
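To make the trade-off in this exchange concrete, here is a minimal sketch of the single-notice design described for the old validator: one notice code, with severity and message derived from the measured distance. Everything in it (class, field and method names, the message wording, and passing thresholds as parameters) is hypothetical and is not taken from either validator's code.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical sketch of a single notice whose severity depends on the distance.
final class SingleNoticeSketch {
  enum Severity { INFO, WARNING }

  static final class Notice {
    final String code;
    final Severity severity;
    final String message;

    Notice(String code, Severity severity, String message) {
      this.code = code;
      this.severity = severity;
      this.message = message;
    }
  }

  /** Returns a notice when distanceKm exceeds infoKm; severity rises above warningKm. */
  static Optional<Notice> check(double distanceKm, double infoKm, double warningKm) {
    if (distanceKm <= infoKm) {
      return Optional.empty();
    }
    Severity severity = distanceKm > warningKm ? Severity.WARNING : Severity.INFO;
    // The message carries the measured distance, so the notice code needs no number in it.
    String message = String.format(Locale.ROOT, "Transfer distance is %.1f km", distanceKm);
    return Optional.of(new Notice("transfer_distance_too_large", severity, message));
  }
}
```

In this shape a threshold change is a parameter change rather than a rename, which is the concern raised above. The approach kept in this PR instead uses two fixed notices, so each code maps to exactly one severity.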

Successfully merging this pull request may close these issues.

Implement transfer distance verification
4 participants