Guam: ParserError: Unknown string format: Guam 12, 11 #789

Closed
sentry-io bot opened this issue Nov 29, 2023 · 2 comments

@sentry-io

sentry-io bot commented Nov 29, 2023

Sentry Issue: COURTLISTENER-5PX

ParserError: Unknown string format: Guam 12, 11
(3 additional frame(s) were not displayed)
...
  File "cl/scrapers/management/commands/cl_scrape_opinions.py", line 368, in handle
    self.parse_and_scrape_site(mod, options["full_crawl"])
  File "cl/scrapers/management/commands/cl_scrape_opinions.py", line 337, in parse_and_scrape_site
    site = mod.Site().parse()
@flooie
Contributor

flooie commented Dec 13, 2023

This appears broken

@flooie flooie moved this to Todo in @grossir's backlog Dec 27, 2023
@flooie flooie moved this from Todo to State Supreme/Appellate/OA in @grossir's backlog Dec 28, 2023
@grossir
Contributor

grossir commented Jan 26, 2024

The problem is the date regex, date_match = re.search(r"[A-Za-z]+\.?\s+[0-9]+,\s+[0-9]+", text), which never matches rows whose date is formatted like 12-28-2023. On some of those rows it instead picks up a mix of the citation and the date, such as Guam 12, 11, which the date parser then rejects. This date format seems to have started appearing in mid-2023.
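
To make the failure concrete, here is a minimal sketch using only the standard library. The old pattern is the one quoted above; the sample row text, the two replacement patterns and the `extract_date` helper are assumptions for illustration, not the regexes or code from the actual fix.

```python
import re
from datetime import date, datetime
from typing import Optional

# Pattern quoted in this comment: month name, day, comma, then any digits.
OLD_DATE_RE = re.compile(r"[A-Za-z]+\.?\s+[0-9]+,\s+[0-9]+")

# Hypothetical replacements (NOT the patterns from the real fix):
# a numeric MM-DD-YYYY date, and a month-name date with a 4-digit year.
NUMERIC_DATE_RE = re.compile(r"\b\d{1,2}-\d{1,2}-\d{4}\b")
WORDY_DATE_RE = re.compile(r"\b[A-Za-z]+\.?\s+\d{1,2},\s+\d{4}\b")


def extract_date(text: str) -> Optional[date]:
    """Return the first date found in `text`, or None if there is none."""
    match = NUMERIC_DATE_RE.search(text)
    if match:
        return datetime.strptime(match.group(0), "%m-%d-%Y").date()

    match = WORDY_DATE_RE.search(text)
    if match:
        # Drop the abbreviation dot and collapse whitespace before parsing.
        cleaned = re.sub(r"\s+", " ", match.group(0).replace(".", ""))
        for fmt in ("%B %d, %Y", "%b %d, %Y"):
            try:
                return datetime.strptime(cleaned, fmt).date()
            except ValueError:
                continue
    return None


# Hypothetical row text: a citation followed by a numeric date.
row = "2023 Guam 12, 11-28-2023"

# The old pattern straddles the citation and the date.
old_match = OLD_DATE_RE.search(row).group(0)

# The sketch reads the numeric date instead.
new_date = extract_date(row)
```

Requiring a four-digit year in the month-name pattern is what keeps it from stopping at the first two digits of a numeric date. Abbreviations that `strptime` does not know (for example "Sept.") would still return None here.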

We are missing all opinions that have that date format. In CL we only have 70 opinions for this court_id: 2 from a non-scraper source and the rest from the scraper, covering 2021 to the present.

I counted the opinions for the years present in CL, and there should be 75 against the 68 the scraper collected, so we are missing 7. I ran the backscraper from the PR and hit some problems.

From 2017 backwards, some records have no docket. For example, the only record with no docket from a year more recent than 2017:
image

From 2008 backwards, most records have no dates, so they can't be collected without triggering an error in AbstractSite._check_sanity().
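
One way to picture the undated-record problem is to split records before they reach validation. This is a minimal sketch, assuming each record is a plain dict with a hypothetical "date" key. It is not juriscraper's data model and does not call any juriscraper API.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def split_by_date(rows: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate records with a non-empty "date" from those without one."""
    dated: List[Record] = []
    undated: List[Record] = []
    for row in rows:
        # A missing key or an empty string both count as "no date".
        (dated if row.get("date") else undated).append(row)
    return dated, undated


# Hypothetical sample records, made up for illustration only.
sample = [
    {"name": "Case A", "date": "12-28-2023"},
    {"name": "Case B", "date": ""},
    {"name": "Case C"},
]
dated, undated = split_by_date(sample)
```

The undated list could then be logged or counted, which would show how many older records a backscrape would have to leave out.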

grossir added a commit to grossir/juriscraper that referenced this issue Jan 26, 2024
Solves freelawproject#789

- Validated and improved regexes for date, docket and citation
@grossir grossir moved this from State Supreme/Appellate to In Progress in @grossir's backlog Jan 26, 2024
@grossir grossir self-assigned this Jan 26, 2024
@flooie flooie closed this as completed Jan 31, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in @grossir's backlog Jan 31, 2024