isaacus-dev / open-australian-legal-corpus-creator Public

Releases: isaacus-dev/open-australian-legal-corpus-creator

v3.0.4

08 Aug 11:34

umarbutler

v3.0.4 Latest

Fixed

Fixed the Creator unnecessarily rewriting the entire Corpus on every run in order to detect and remove duplicates and outdated documents and otherwise repair it, which caused excessive writes and wore out disks. The Creator now first reads the Corpus and only overwrites it if found necessary. Although this can sometimes double read time, reading is much cheaper than writing on SSDs, which most modern drives are (#2).
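
The read-first approach can be sketched roughly as follows. This is an illustration only, not the Creator's actual code: it assumes the Corpus is a JSON Lines file whose records carry a version_id field, and the function names find_bad_lines and repair_corpus are hypothetical.

```python
import json
import os
import tempfile


def find_bad_lines(path: str) -> set[int]:
    """Read-only pass: return the indices of corrupt or duplicate lines."""
    seen = set()
    bad = set()
    with open(path, encoding="utf-8") as file:
        for number, line in enumerate(file):
            try:
                # 'version_id' is an assumed field name.
                version_id = json.loads(line)["version_id"]
                duplicate = version_id in seen
            except (json.JSONDecodeError, KeyError, TypeError):
                bad.add(number)  # Corrupt or malformed record.
                continue
            if duplicate:
                bad.add(number)
            else:
                seen.add(version_id)
    return bad


def repair_corpus(path: str) -> bool:
    """Rewrite the file only if the read-only pass found something to remove."""
    bad = find_bad_lines(path)
    if not bad:
        return False  # Nothing to repair, so nothing is written.

    # Second read: copy the good lines to a temporary file, then swap it in.
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, encoding="utf-8") as source, tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=directory, delete=False
    ) as target:
        for number, line in enumerate(source):
            if number not in bad:
                target.write(line)
    os.replace(target.name, path)
    return True
```

In this sketch the file is read twice only when a repair is needed, which mirrors the trade-off described above: an occasional extra read in exchange for no writes on a clean Corpus.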

v3.0.3

05 Aug 11:00

umarbutler

Fixed

Fixed a bug preventing the scraping of documents from the NSW Legislation database that are stored as PDFs but are reported by the database's web server as being HTML files.
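
One common way to cope with a server that mislabels PDFs as HTML is to inspect the first bytes of the response body instead of trusting the reported content type. The sketch below illustrates that general technique; it is not taken from the Creator, and the function name sniff_mime is hypothetical.

```python
def sniff_mime(body: bytes, reported: str) -> str:
    """Return 'application/pdf' if `body` looks like a PDF, else the reported type."""
    # PDF files start with the marker '%PDF-'. Leading whitespace is tolerated
    # here as a precaution; that tolerance is an assumption.
    if body.lstrip()[:5] == b"%PDF-":
        return "application/pdf"
    # Otherwise fall back to the reported type without parameters
    # such as '; charset=utf-8'.
    return reported.split(";", 1)[0].strip().lower()
```

For example, sniff_mime(b"%PDF-1.7 ...", "text/html; charset=utf-8") yields "application/pdf", while an ordinary HTML body keeps the reported "text/html".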

v3.0.2

04 Aug 13:02

umarbutler

Fixed

Fixed a bug that caused only the first volume of multivolume documents on the Federal Register of Legislation available in an HTML format to be scraped instead of all volumes.

v3.0.1

26 Jul 11:03

umarbutler

Fixed

Fixed a bug that caused the earliest versions of documents from the Federal Register of Legislation not available in an HTML format to be scraped instead of their latest versions.

v3.0.0

01 Jun 11:36

umarbutler

Added

Added the date field.
Added the mime field for storing the original MIME type of documents.
Began lightly cleaning texts.
Introduced the max_concurrent_ocr argument to Creator and the -m/--max-concurrent-ocr argument to mkoalc to limit the maximum number of PDFs that may be OCR'd concurrently.
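
A limit such as max_concurrent_ocr is typically enforced with a semaphore. The following is a minimal sketch of that idea using asyncio, assuming the OCR step is a blocking function; the names ocr_all and ocr are hypothetical and the Creator's internals may differ.

```python
import asyncio


async def ocr_all(pdfs, ocr, max_concurrent_ocr: int = 1):
    """Run `ocr` on every item of `pdfs`, at most `max_concurrent_ocr` at a time."""
    semaphore = asyncio.Semaphore(max_concurrent_ocr)

    async def ocr_one(pdf):
        # Wait here until one of the `max_concurrent_ocr` slots is free.
        async with semaphore:
            # Run the blocking OCR call in a worker thread so that the
            # event loop stays responsive.
            return await asyncio.to_thread(ocr, pdf)

    # Results come back in the same order as `pdfs`.
    return await asyncio.gather(*(ocr_one(pdf) for pdf in pdfs))
```

Capping concurrency this way keeps memory and CPU use bounded, since OCR is far more expensive than fetching the PDFs themselves.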

Changed

Suffixed the ids of documents in the Western Australian legislation database with their version ids, delimited by a slash, in order to make it easier to track changes to documents.
Started filtering out documents with texts that, after being cleaned and stripped of non-alphabetic characters, are fewer than 9 characters long.
Replaced PDF text extraction via pdfplumber with OCR via tesseract and tesserocr as most PDFs were poorly OCR'd.
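
The short-text filter mentioned above could look roughly like the sketch below. The 9-character threshold comes from the release note; the function name is_too_short and the use of str.isalpha (which also accepts non-ASCII letters) are assumptions.

```python
MIN_ALPHABETIC_CHARS = 9  # Threshold stated in the release note.


def is_too_short(text: str) -> bool:
    """Return True if `text` has fewer than 9 alphabetic characters."""
    # Keep letters only; digits, punctuation and whitespace are discarded.
    alphabetic = "".join(char for char in text if char.isalpha())
    return len(alphabetic) < MIN_ALPHABETIC_CHARS
```

A text such as "1234 -- abc" would be filtered out under this sketch, whereas "Crimes Act 1900" has exactly 9 letters and would be kept.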

Fixed

Improved removal of empty and restricted decisions from the NSW Caselaw database by making existing keyword searches for 'Decision number not in use' and 'Decision restricted' case- and whitespace-insensitive.
Fixed documents from the Western Australian legislation database never being updated. The last modified dates of documents' status pages were used as version ids, but those dates remained constant for all pages. Version ids are now the XXH3 64-bit hexadecimal hash of the main element of the status pages.
Fixed a bug preventing the scraping of documents from the Tasmanian Legislation database caused by the improper skipping of documents that contain the substring 'Content Not Found.', and changed the substring to skip on to 'Content Not Found' (without a period, as the database does not use one).
Ensured that warnings are raised when the only version of a document available from the Federal Register of Legislation is a DOC.
Fixed a bug preventing the scraping of PDFs from the Federal Register of Legislation database.
Fixed a bug causing roughly 5.3k documents to be missed from the Federal Register of Legislation database during indexing as a result of a likely bug in the database.
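
The case- and whitespace-insensitive keyword search described in the first fix above can be sketched as follows. This takes "whitespace-insensitive" to mean that all whitespace is ignored, which is one possible reading; the function names are hypothetical, and only the two keywords come from the release note.

```python
import re

# Keywords quoted in the release note.
KEYWORDS = ("Decision number not in use", "Decision restricted")


def normalise(text: str) -> str:
    """Remove all whitespace from `text` and fold its case."""
    return re.sub(r"\s+", "", text).casefold()


NORMALISED_KEYWORDS = tuple(normalise(keyword) for keyword in KEYWORDS)


def is_empty_or_restricted(text: str) -> bool:
    """Return True if `text` contains a keyword, ignoring case and whitespace."""
    normalised = normalise(text)
    return any(keyword in normalised for keyword in NORMALISED_KEYWORDS)
```

Normalising both the keywords and the text means that variants such as "DECISION  NUMBER\nnot in use" still match.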

Removed

Removed unused dict2inst helper function that converted dictionaries to instances of classes.

v2.0.0

18 May 09:41

umarbutler

Added

Introduced the when_scraped field of documents.
Started retrying requests when parsing errors are encountered to cope with servers being overloaded but returning successful status codes.
Added support for Python 3.10 and 3.11.
Began checking for and removing corrupted documents from the Corpus.
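
Retrying on parsing errors, as mentioned in the list above, treats a response that cannot be parsed like a failed request even when its status code is successful. A minimal sketch, in which fetch_and_parse, fetch, parse, max_attempts and wait are all hypothetical names and values:

```python
import time


def fetch_and_parse(fetch, parse, max_attempts: int = 3, wait: float = 1.0):
    """Call `fetch` and then `parse`, refetching if parsing raises an exception."""
    for attempt in range(1, max_attempts + 1):
        body = fetch()
        try:
            return parse(body)
        except Exception:
            # An overloaded server may return a 'successful' but unusable
            # page. Give up only after the final attempt.
            if attempt == max_attempts:
                raise
            time.sleep(wait)
```

Refetching, instead of reparsing the same body, is the point of the technique: the second response may be the real document.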

Changed

Switched from attrs and orjson to msgspec in order to speed up and simplify the serialisation and deserialisation of Corpus data.
Reduced the semaphore limit for the NSW Caselaw and Federal Court of Australia databases from 30 to 10 to avoid overloading them.
Made micro-optimisations by replacing lambda functions with named functions.

Fixed

Skipped scraping web pages from the NSW Legislation database that contain the substring 'No fragments found.' due to a newly identified bug in the database (see, eg, https://legislation.nsw.gov.au/view/whole/html/inforce/2021-03-25/act-1944-031).
Skipped scraping web pages from the Tasmanian Legislation database that contain the substring 'No fragments found.' due to a newly identified bug in the database (see, eg, https://www.legislation.tas.gov.au/view/whole/html/inforce/current/act-2022-033).
Fixed a bug wherein documents from the Federal Register of Legislation database stored as DOC files were parsed as DOCX files. The scraper now searches for PDF versions of such documents instead or otherwise skips them.

v1.0.1

17 Feb 02:27

umarbutler

Fixed

Refactored the scraper for the Federal Register of Legislation database in order to resolve breaking API changes brought about by the database's redesign, thereby fixing #1.

v1.0.0

09 Nov 02:11

umarbutler

Added

Created a scraper for the High Court of Australia database.
Added status code 429 as a default retryable status code.

Changed

Improved performance.
Expanded the maximum number of seconds to wait between retries.
Expanded the maximum number of seconds that can be waited between retries before raising an exception.
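
The retry settings touched on in this release (retryable status codes, a cap on each wait, and a cap on total waiting) are often combined as capped exponential backoff. The sketch below shows that general pattern only. Apart from 429, every status code and every number in it is an assumption, as is the name backoff_delays.

```python
# Only 429 is named in the release note; the other codes are assumed.
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}


def backoff_delays(base: float = 1.0, max_delay: float = 30.0, max_total: float = 120.0):
    """Yield doubling delays, each capped at `max_delay`, while the
    running total stays within `max_total`."""
    total = 0.0
    delay = base
    while total + delay <= max_total:
        yield delay
        total += delay
        delay = min(delay * 2, max_delay)
```

A caller would sleep for each yielded delay after receiving a status code in RETRYABLE_STATUS_CODES, and raise an exception once the generator is exhausted.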

v0.2.0

02 Nov 03:43

umarbutler

Added

Created a scraper for the NSW Caselaw database.

Changed

Sped up the parsing of PDFs from the Queensland Legislation database.

v0.1.2

30 Oct 04:10

umarbutler

Fixed

Fixed a bug where everything after the first occurrence of a document's abbreviated jurisdiction was stripped from its citation by switching to searching for abbreviated jurisdictions enclosed in parentheses.
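
To illustrate why the parenthesised search matters, consider a title that itself begins with the jurisdiction's abbreviation. The sketch below is illustrative only: the function name strip_after_jurisdiction and the sample citation are assumptions, not taken from the Creator.

```python
def strip_after_jurisdiction(citation: str, jurisdiction: str) -> str:
    """Drop anything that follows the parenthesised jurisdiction, e.g. '(NSW)'."""
    # Searching for '(NSW)' instead of a bare 'NSW' avoids cutting the
    # citation at an abbreviation that appears inside the title.
    marker = f"({jurisdiction})"
    index = citation.find(marker)
    if index == -1:
        return citation  # No parenthesised jurisdiction: leave unchanged.
    return citation[: index + len(marker)]
```

With a bare search for "NSW", a citation such as "NSW Trustee and Guardian Act 2009 (NSW)" would be cut down to "NSW"; searching for "(NSW)" keeps the full citation.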