Releases: isaacus-dev/open-australian-legal-corpus-creator

v3.0.4

08 Aug 11:34

Fixed

  • Fixed the Creator unnecessarily rewriting the entire Corpus on every run in order to detect and remove duplicates and outdated documents and otherwise repair it, which caused excessive writes and wore out disks. The Creator now reads the Corpus first and only overwrites it if repairs are found to be necessary. Although this can sometimes double read time, reading is much cheaper than writing on SSDs, which most modern drives are (#2).
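
The read-first approach can be illustrated with a minimal sketch. It assumes the Corpus is a JSON Lines file whose documents carry a string `id` field; `repair_corpus`, `valid_ids` and the temporary file name are hypothetical and are not taken from the Creator's code.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Iterator


def _doc_id(line: str) -> str | None:
    """Return the id stored in a JSON line, or None if the line is corrupted."""
    try:
        doc = json.loads(line)
    except json.JSONDecodeError:
        return None
    if isinstance(doc, dict) and isinstance(doc.get("id"), str):
        return doc["id"]
    return None


def _scan(path: Path, valid_ids: set[str]) -> Iterator[tuple[str, bool]]:
    """Yield (line, keep) pairs; keep is False for corrupted, outdated or duplicate lines."""
    seen: set[str] = set()
    with path.open(encoding="utf-8") as file:
        for line in file:
            doc_id = _doc_id(line)
            keep = doc_id is not None and doc_id in valid_ids and doc_id not in seen
            if keep:
                seen.add(doc_id)
            yield line, keep


def repair_corpus(path: Path, valid_ids: set[str]) -> bool:
    """Rewrite the file only if a read-only pass finds a problem. Returns True if rewritten."""
    # First pass: read only. A clean file is never written to.
    if all(keep for _, keep in _scan(path, valid_ids)):
        return False
    # Second pass: read again and write the surviving lines to a temporary file.
    tmp = path.with_name(path.name + ".tmp")
    with tmp.open("w", encoding="utf-8") as out:
        for line, keep in _scan(path, valid_ids):
            if keep:
                out.write(line if line.endswith("\n") else line + "\n")
    tmp.replace(path)
    return True
```

In this sketch a clean file costs one read and no writes, while a file that needs repair is read twice, which is where the doubled read time comes from.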

v3.0.3

05 Aug 11:00

Fixed

  • Fixed a bug preventing the scraping of documents from the NSW Legislation database that are stored as PDFs but are reported by the database's web server as being HTML files.
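
One way to cope with a server that mislabels PDFs is to inspect the response body instead of trusting the reported type. The sketch below is only an illustration of that idea: `sniff_mime` is a hypothetical helper, and checking for the `%PDF-` signature at the start of the body is an assumption about how such a check could be done, not a description of the actual fix.

```python
def sniff_mime(body: bytes, reported_mime: str) -> str:
    """Return 'application/pdf' if the body looks like a PDF, else the reported MIME type."""
    # PDF files begin with the signature '%PDF-'; leading whitespace is tolerated here.
    if body.lstrip().startswith(b"%PDF-"):
        return "application/pdf"
    return reported_mime
```

For example, `sniff_mime(b"%PDF-1.7 ...", "text/html")` returns `"application/pdf"`, while a genuine HTML body keeps its reported type.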

v3.0.2

04 Aug 13:02

Fixed

  • Fixed a bug that caused only the first volume of multivolume documents on the Federal Register of Legislation available in a HTML format to be scraped instead of all volumes.

v3.0.1

26 Jul 11:03

Fixed

  • Fixed a bug that caused the earliest versions of documents from the Federal Register of Legislation not available in a HTML format to be scraped instead of their latest versions.

v3.0.0

01 Jun 11:36

Added

  • Added the date field.
  • Added the mime field for storing the original MIME type of documents.
  • Began lightly cleaning texts.
  • Introduced the max_concurrent_ocr argument to Creator and -m/--max-concurrent-ocr argument to mkoalc to limit the maximum number of PDFs that may be OCR'd concurrently.
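
A limit such as max_concurrent_ocr is commonly enforced with a semaphore. The following is a minimal sketch of that pattern using asyncio, not the Creator's actual implementation; `ocr_all` and the `ocr` callable are hypothetical stand-ins for whatever performs the OCR.

```python
import asyncio
from typing import Callable, Sequence


async def ocr_all(
    pdfs: Sequence[bytes],
    max_concurrent_ocr: int,
    ocr: Callable[[bytes], str],
) -> list[str]:
    """OCR every PDF, never running more than max_concurrent_ocr jobs at once."""
    semaphore = asyncio.Semaphore(max_concurrent_ocr)

    async def limited(pdf: bytes) -> str:
        # Wait for a free slot, then run the blocking OCR call in a worker thread.
        async with semaphore:
            return await asyncio.to_thread(ocr, pdf)

    # gather preserves the order of the inputs in its results.
    return await asyncio.gather(*(limited(pdf) for pdf in pdfs))
```

Capping concurrency this way bounds memory and CPU use during OCR while leaving the rest of the scraping free to proceed.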

Changed

  • Suffixed the ids of documents in the Western Australian legislation database with their version ids, delimited by a slash, in order to make it easier to track changes to documents.
  • Started filtering out documents with texts that, after being cleaned and stripped of non-alphabetic characters, are less than 9 characters long.
  • Replaced PDF text extraction via pdfplumber with OCR via tesseract and tesserocr as most PDFs were poorly OCR'd.
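
The length filter described above could look roughly like the sketch below. It assumes that "alphabetic" means characters for which `str.isalpha()` is true and that cleaning has already been applied; `is_too_short` and `MIN_ALPHA_CHARS` are hypothetical names.

```python
MIN_ALPHA_CHARS = 9  # documents with fewer alphabetic characters are filtered out


def is_too_short(cleaned_text: str) -> bool:
    """Return True if the text has fewer than MIN_ALPHA_CHARS alphabetic characters."""
    alphabetic = "".join(ch for ch in cleaned_text if ch.isalpha())
    return len(alphabetic) < MIN_ALPHA_CHARS
```

Under these assumptions a stub such as "Page 1 of 2" would be dropped, since only six of its characters are alphabetic.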

Fixed

  • Improved removal of empty and restricted decisions from the NSW Caselaw database by making existing keyword searches for 'Decision number not in use' and 'Decision restricted' case- and whitespace-insensitive.
  • Fixed documents from the Western Australian legislation database never being updated. The last modified date of the status pages of documents was used as their version ids, but that date remained constant for all pages. Version ids are now the XXH3 64-bit hexadecimal hash of the main element of the status pages.
  • Fixed a bug preventing the scraping of documents from the Tasmanian Legislation database, caused by the improper skipping of documents that contain the substring 'Content Not Found.', and changed the substring to skip on to 'Content Not Found' (without a period, as the database does not use one).
  • Ensured that warnings are raised when the only version of a document available from the Federal Register of Legislation is a DOC.
  • Fixed a bug preventing the scraping of PDFs from the Federal Register of Legislation database.
  • Fixed a bug causing roughly 5.3k documents to be missed from the Federal Register of Legislation database during indexing as a result of a likely bug in the database.
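
For the NSW Caselaw fix at the top of this list, a case- and whitespace-insensitive keyword search might be sketched as follows. Treating "whitespace-insensitive" as "ignore all whitespace" is an assumption, and `is_empty_or_restricted` is a hypothetical name.

```python
import re


def _normalise(text: str) -> str:
    """Remove all whitespace and fold case so that comparisons ignore both."""
    return re.sub(r"\s+", "", text).casefold()


# The two keywords named in the changelog entry, normalised once up front.
KEYWORDS = tuple(
    _normalise(keyword)
    for keyword in ("Decision number not in use", "Decision restricted")
)


def is_empty_or_restricted(text: str) -> bool:
    """Return True if the text contains either keyword, ignoring case and whitespace."""
    normalised = _normalise(text)
    return any(keyword in normalised for keyword in KEYWORDS)
```

With this normalisation, variants such as "DECISION  NUMBER\nNOT IN USE" match the same keyword as the original spelling.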

Removed

  • Removed the unused dict2inst helper function that converted dictionaries to instances of classes.

v2.0.0

18 May 09:41

Added

  • Introduced the when_scraped field of documents.
  • Started retrying requests when parsing errors are encountered to cope with servers being overloaded but returning successful status codes.
  • Added support for Python 3.10 and 3.11.
  • Began checking for and removing corrupted documents from the Corpus.

Changed

  • Switched from attrs and orjson to msgspec in order to speed up and simplify the serialisation and deserialisation of Corpus data.
  • Reduced the semaphore limit for the NSW Caselaw and Federal Court of Australia databases from 30 to 10 to avoid overloading them.
  • Made micro-optimisations by replacing lambda functions with named functions.

Fixed

v1.0.1

17 Feb 02:27

Fixed

  • Refactored the scraper for the Federal Register of Legislation database in order to resolve breaking API changes brought about by the database's redesign, thereby fixing #1.

v1.0.0

09 Nov 02:11

Added

  • Created a scraper for the High Court of Australia database.
  • Added status code 429 as a default retryable status code.

Changed

  • Improved performance.
  • Expanded the maximum number of seconds to wait between retries.
  • Expanded the maximum number of seconds that can be waited between retries before raising an exception.
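
The two retry limits above (a cap on each wait and a cap on the total time waited) can be pictured with a small sketch. The changelog does not state the retry strategy or the actual values, so the exponential backoff, the `backoff_waits` helper and every number here are assumptions made purely for illustration.

```python
from typing import Iterator


def backoff_waits(max_wait: float, max_total: float, base: float = 1.0) -> Iterator[float]:
    """Yield waits that double each time, capped at max_wait, until max_total would be exceeded."""
    total = 0.0
    attempt = 0
    while True:
        wait = min(base * 2**attempt, max_wait)
        if total + wait > max_total:
            # The caller would raise an exception at this point instead of retrying again.
            return
        total += wait
        yield wait
        attempt += 1
```

A caller would sleep for each yielded value between attempts on a retryable status code such as 429 and give up once the generator is exhausted.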

v0.2.0

02 Nov 03:43

Added

  • Created a scraper for the NSW Caselaw database.

Changed

  • Sped up the parsing of PDFs from the Queensland Legislation database.

v0.1.2

30 Oct 04:10

Fixed

  • Fixed a bug where everything after the first occurrence of a document's abbreviated jurisdiction was stripped from its citation by switching to searching for abbreviated jurisdictions enclosed in parentheses.
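
The difference between the two searches can be shown with a short sketch. `trim_citation` is a hypothetical helper and the citation used in the usage note is only an illustrative string; the Creator's real code may differ.

```python
def trim_citation(citation: str, jurisdiction: str) -> str:
    """Strip everything after the parenthesised abbreviated jurisdiction, if it is present."""
    marker = f"({jurisdiction})"
    index = citation.find(marker)
    if index == -1:
        # No parenthesised jurisdiction found, so leave the citation untouched.
        return citation
    return citation[: index + len(marker)]
```

Given "NSW Trustee and Guardian Act 2009 (NSW) s 5" and the jurisdiction "NSW", this keeps "NSW Trustee and Guardian Act 2009 (NSW)", whereas a search for the bare abbreviation would have cut the citation down to its first three letters.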