Decouple OrganisationBoundaryReview and LGBCE Scraper #2235

Open
chris48s opened this issue Sep 12, 2024 · 0 comments

Currently our OrganisationBoundaryReview model is doing 2 things.

  1. It is a source of nice clean data about boundary reviews. Some of that comes from LGBCE. Some of it we enter by hand (e.g. Community Gov reviews, Structural Change Orders, Wales/Scotland/NI stuff)
  2. It is a "mirror" of the LGBCE site which allows us to track changes over time. This allows us to (for example) fire a notification when a review moves from "in progress" to "complete"
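As a rough illustration of the "mirror" purpose, change tracking can be thought of as comparing two scraped snapshots and picking out reviews whose status moved from "in progress" to "complete". This is only a sketch: `ScrapedReview`, `completed_since` and the exact status strings are hypothetical names invented for the example, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapedReview:
    """Hypothetical snapshot of one review as seen on the LGBCE site."""
    slug: str
    status: str


def completed_since(previous, current):
    """Return slugs of reviews that moved from 'in progress' to 'complete'."""
    old_status = {review.slug: review.status for review in previous}
    return [
        review.slug
        for review in current
        if review.status == "complete"
        and old_status.get(review.slug) == "in progress"
    ]
```

Each slug returned is a candidate for a notification. Reviews that were already complete, or that appear for the first time already complete, are deliberately ignored in this sketch.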

These purposes are quite closely linked, but not exactly aligned.
One big issue arises when LGBCE's site links to the wrong thing. For example (all real examples):

  • Completed review still links to draft legislation
  • Legislation link points to consultation URL
  • Newcastle Under Lyme's legislation link points to Newcastle Upon Tyne's legislation

We can't fix any of that in our DB without breaking the scraper, so we have to leave it wrong. Also, in principle, unexpected edits to the LGBCE site can retrospectively break our data. We're not really in control of it.

With our other scraper, which spiders for PDFs that look like a Notice of Election document, we flag things that might be a NoE, but a human reviews each one before we create an Election object, because sometimes we've scraped a Parish Council election, a Neighbourhood Planning Referendum or something similar.

LGBCE's site is a bit more structured and we do have some validation in place. That said, I think it would be useful for us to separate the concepts of "mirror the LGBCE site for scraping purposes" and "nice clean Boundary Review data we can edit", with some kind of manual "Create OrganisationBoundaryReview from scraped record" process that:

  • de-couples the two
  • gives us a one-click creation
  • allows us to edit our data independently of the LGBCE site
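
One possible shape for this, sketched with plain dataclasses rather than Django models so it runs on its own: the scraper only ever writes to a mirror record, and a separate editable record is created from it on demand. All names here (`ScrapedLGBCEReview`, `BoundaryReview`, `create_from_scraped`) and the example.org URLs are hypothetical placeholders, not the existing schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrapedLGBCEReview:
    """Mirror of one LGBCE page. Only the scraper writes to this."""
    slug: str
    status: str
    legislation_url: str


@dataclass
class BoundaryReview:
    """Clean, editable record. The scraper never writes to this."""
    organisation: str
    status: str
    legislation_url: str
    source: Optional[ScrapedLGBCEReview] = None  # None for hand-entered reviews


def create_from_scraped(scraped, organisation):
    """'One-click' creation: copy the scraped values and link back to the source."""
    return BoundaryReview(
        organisation=organisation,
        status=scraped.status,
        legislation_url=scraped.legislation_url,
        source=scraped,
    )


# Usage: fix a wrong link on our record without touching the mirror.
scraped = ScrapedLGBCEReview(
    slug="newcastle-under-lyme",
    status="complete",
    legislation_url="https://example.org/wrong-legislation",  # placeholder URL
)
review = create_from_scraped(scraped, "Newcastle-under-Lyme")
review.legislation_url = "https://example.org/correct-legislation"  # placeholder URL
```

Because the values are copied rather than shared, correcting `review.legislation_url` leaves the mirror untouched, so the scraper can keep diffing against what LGBCE actually publishes. Hand-entered reviews simply have no `source`.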