diff --git a/pombola/south_africa/data/members-interests/README.md b/pombola/south_africa/data/members-interests/README.md
index 548533e6e..042d74596 100644
--- a/pombola/south_africa/data/members-interests/README.md
+++ b/pombola/south_africa/data/members-interests/README.md
@@ -1,15 +1,23 @@
 # ZA Members' Interests Data
 
-There are several files in this directory:
+## Prepping and importing members' interests
 
-## DOCX scraper
+Importing members' interests is a multistep process.
 
-The scraper currently scrapes `.docx` files.
-To prepare the file:
+### Obtain register PDF
 
-1. Split the `PDF` into seperate files small enough to open in Google Drive. [PDF Arranger](https://github.com/pdfarranger/pdfarranger) works well
-2. Open the files in Google Drive and download each in `.docx` format
-3. Store the these files in `./pombola/south_africa/data/members-interests/scraper/docx_files/`
+Get the latest register `PDF` from [parliament](https://www.parliament.gov.za/register-members-Interests).
+
+### Trim PDF
+
+Use [PDF Arranger](https://github.com/pdfarranger/pdfarranger) to remove the cover and contents pages, so that the final `PDF` contains only the register itself (a scripted alternative is sketched below).
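+
+If you prefer to script this step, here is a minimal sketch using
+[pypdf](https://pypi.org/project/pypdf/) -- an assumption, since it is not
+part of the scraper's requirements (`pip install pypdf` first). The page
+numbers are illustrative, so check where the register actually starts:
+
+```
+from pypdf import PdfReader, PdfWriter
+
+reader = PdfReader("register.pdf")
+writer = PdfWriter()
+
+# keep page 5 to the end (0-indexed: 4), dropping the cover and contents pages
+for page in reader.pages[4:]:
+    writer.add_page(page)
+
+with open("register-trimmed.pdf", "wb") as out:
+    writer.write(out)
+```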
+
+### Convert PDF to DOCX
+
+Previously we used `Google Docs` to convert the `PDF` to `DOCX`. This was very cumbersome and meant working in batches of 80 pages at a time.
+A faster, workable third-party solution is [ilovepdf.com](https://www.ilovepdf.com/pdf_to_word).
+
+### Convert DOCX to HTML
 
 Create an environment and install dependencies in the `./pombola/south_africa/data/members-interests/scraper` directory:
 
 ```
@@ -18,67 +26,43 @@
 source venv/bin/activate
 pip install -r requirements.txt
 ```
 
-Run the script with the necessary arguments, e.g.
-```
-python scrape_interests_docx.py --input ./docx_files/ --output ../2021.json --year 2021 --source https://static.pmg.org.za/Register_of_Members_Interests_2021.pdf
-```
-
-This will combine documents into a single HTML file `main_html_file.html`
-
-Run the Jupyter script `membersinterest.ipynb` making sure to update the input file name. The output should be `register.json`
-
-Copy `register.json` to the `members-interests` directory and rename it to the corresponding year
-
-## Conversion script
-
-    convert_to_import_json.py
+Use the first cell of `./scraper/docx_to_html_to_json.ipynb` to convert the `DOCX` to `HTML` with Mammoth.
 
-This script takes the raw data and processes it to clean it up, to match the mp
-to entries in the database and to put it in the format that the
-`interests_register_import_from_json` management command expects. This script
-is highly specific to the above JSON files and the MP entries in the database
-at the time of writing (3 Dec 2013).
+### Convert to JSON
 
-You can run the script like this:
+Run the other cells in the notebook to convert the `HTML` to a workable `JSON` file.
 
-    cd pombola/south_africa/data/members-interests/
-    ./convert_to_import_json.py 2022.json > 2022_for_import.json
+### Copy to server
 
-This will require a production or equivalent data for the persons table to filter against.
-You can either run the script in prod or build a local database instance like so:
+The next steps need to be run on the server, as they use the production database.
 
-`dokku postgres:export pombola > dumpitydump.dump`
+### Convert to importable JSON file
 
-`pg_restore -U pombola -d pombola dumpitydumplol.dump`
+```
+cd pombola/south_africa/data/members-interests/
+./convert_to_import_json.py 2024.json > 2024_for_import.json
+```
 
 When processing new data you may well need to add more entries to the
 `slug_corrections` attribute. Change the `finding_slug_corrections` to `True`
 to enable some code that'll help you do that. Change it back to False when
 done.
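+
+For reference, `slug_corrections` maps a name as it appears in the register to
+the matching person slug in the database. A sketch of the shape (these entries
+are made up):
+
+```
+# in convert_to_import_json.py
+slug_corrections = {
+    "Surname, Ms F N": "felicity-n-surname",
+    "Example, Adv G": "gerald-example",
+}
+
+# True enables the helper code that suggests corrections for unmatched
+# names; set it back to False once slug_corrections has been updated
+finding_slug_corrections = False
+```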
 
-## Final importable data
-
-    2010_for_import.json
-    2011_for_import.json
-    2012_for_import.json
-    2013_for_import.json
-    2014_for_import.json
-    2015_for_import.json
-    2016_for_import.json
-    2017_for_import.json
-    2018_for_import.json
-
-This is the output of the above conversion script. It is committed for ease of
-adding to the database, and as looking at the diffs is an easy way to see the
-results of changes to the conversion script.
+### Import final JSON file
 
 To load this data into the database, you can run the management command:
 (if there are already some entries you should delete them all using the admin)
 
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2010_for_import.json
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2011_for_import.json
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2012_for_import.json
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2013_for_import.json
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2014_for_import.json
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2015_for_import.json
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2016_for_import.json
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2017_for_import.json
+```
+./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2024_for_import.json
+```
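+
+If you would rather clear the previously imported entries from a shell than
+click through the admin, here is a hedged sketch for `./manage.py shell`. The
+model and field names are assumptions about the `interests_register` app;
+check `pombola/interests_register/models.py` before running anything:
+
+```
+from pombola.interests_register.models import Release, Entry
+
+# assumption: the entries for the year being re-imported hang off one Release
+release = Release.objects.get(name="2024")
+Entry.objects.filter(release=release).delete()
+```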
\n",
+ "cleaned_html = re.sub(r' ', ' ', cleaned_html)\n", + "\n", + "# Save the HTML to a file\n", + "with open(\"output.html\", \"w\") as html_file:\n", + " html_file.write(cleaned_html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tdsl7kk2mjYV" + }, + "outputs": [], + "source": [ + "\n", + "def split_document_by_pattern(html_content):\n", + "\n", + " # Split at table after TRUSTS\n", + "\n", + " pattern = r\"(
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "tdsl7kk2mjYV"
+   },
+   "outputs": [],
+   "source": [
+    "def split_document_by_pattern(html_content):\n",
+    "\n",
+    "    # Split at table after TRUSTS\n",
+    "    # NOTE: the original pattern (an alternation of HTML-tag snippets) was\n",
+    "    # lost from this copy of the notebook. The reconstruction below splits\n",
+    "    # after the table that follows each TRUSTS heading -- verify it against\n",
+    "    # the real output.html before relying on it.\n",
+    "    pattern = r\"(TRUSTS\\s*</h3>\\s*<table>.*?</table>)\"\n",
+    "    return re.split(pattern, html_content)\n",
+    "\n",
+    "# NOTE: the code that bucketed each person's sections under 'RAW-...'\n",
+    "# keys, extracted person_party and defined process_person() was also lost\n",
+    "# from this copy; the stub below keeps the loop runnable. TODO: restore it.\n",
+    "def process_person(person, section):\n",
+    "    return None\n",
+    "\n",
+    "people = []\n",
+    "for person in split_document_by_pattern(cleaned_html):\n",
+    "    if isinstance(person, str):\n",
+    "        person_title = None  # initialised so single-part names don't break below\n",
+    "        person_party = None  # TODO: restore the original party extraction\n",
+    "        # the <h3> tag is an assumption -- adjust to match the Mammoth output\n",
+    "        person_name = re.findall(r\"<h3>(.*?)</h3>\", person)[0] if re.findall(r\"<h3>(.*?)</h3>\", person) else None\n",
+    "\n",
+    "        if person_name:\n",
+    "            parts = person_name.split(\", \")\n",
+    "            surname = parts[0].strip()  # Always the first part\n",
+    "            if len(parts) > 1:\n",
+    "                person_title = parts[1].split()[0].strip()  # Only the first word is the title\n",
+    "                given_names = \" \".join(parts[1].split()[1:]).strip()  # Remaining words are given names\n",
+    "                person_name = f\"{given_names} {surname}\".strip()\n",
+    "\n",
+    "        people.append({\n",
+    "            \"mp\": person_name,\n",
+    "            \"title\": person_title,\n",
+    "            \"party\": person_party,\n",
+    "            \"SHARES AND OTHER FINANCIAL INTERESTS\": process_person(person, \"RAW-SHARES AND OTHER FINANCIAL INTERESTS\"),\n",
+    "            \"REMUNERATED EMPLOYMENT OR WORK OUTSIDE OF PARLIAMENT\": process_person(person, \"RAW-REMUNERATED EMPLOYMENT OR WORK OUTSIDE OF PARLIAMENT\"),\n",
+    "            \"DIRECTORSHIPS AND PARTNERSHIPS\": process_person(person, \"RAW-DIRECTORSHIPS AND PARTNERSHIPS\"),\n",
+    "            \"CONSULTANCIES AND RETAINERSHIPS\": process_person(person, \"RAW-CONSULTANCIES AND RETAINERSHIPS\"),\n",
+    "            \"SPONSORSHIPS\": process_person(person, \"RAW-SPONSORSHIPS\"),\n",
+    "            \"GIFTS AND HOSPITALITY\": process_person(person, \"RAW-GIFTS AND HOSPITALITY\"),\n",
+    "            \"BENEFITS AND INTERESTS FREE LOANS\": process_person(person, \"RAW-BENEFITS AND INTERESTS FREE LOANS\"),\n",
+    "            \"TRAVEL\": process_person(person, \"RAW-TRAVEL\"),\n",
+    "            \"OWNERSHIP IN LAND AND PROPERTY\": process_person(person, \"RAW-OWNERSHIP IN LAND AND PROPERTY\"),\n",
+    "            \"PENSIONS\": process_person(person, \"RAW-PENSIONS\"),\n",
+    "            \"RENTED PROPERTY\": process_person(person, \"RAW-RENTED PROPERTY\"),\n",
+    "            \"INCOME GENERATING ASSETS\": process_person(person, \"RAW-INCOME GENERATING ASSETS\"),\n",
+    "            \"TRUSTS\": process_person(person, \"RAW-TRUSTS\")\n",
+    "        })\n",
+    "\n",
+    "    else:\n",
+    "        print(f\"Skipping non-string person entry: {type(person)}\")\n",
+    "\n",
+    "# clean people by dumping any entry where mp = None\n",
+    "people = [person for person in people if person['mp'] is not None]\n",
+    "\n",
+    "with open(\"/content/drive/MyDrive/PROJECTS/PMG/MI/output.json\", \"w\") as outfile:\n",
+    "    json.dump(people, outfile)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "el1f_xdV1R4q"
+   },
+   "outputs": [],
+   "source": [
+    "# Just to debug the final JSON\n",
+    "\n",
+    "import json\n",
+    "\n",
+    "def load_json_as_array(file_path):\n",
+    "    try:\n",
+    "        with open(file_path, 'r') as file:\n",
+    "            data = json.load(file)\n",
+    "            if isinstance(data, list):\n",
+    "                return data\n",
+    "            else:\n",
+    "                print(\"Warning: JSON file does not contain an array of objects. Returning the loaded data as is.\")\n",
+    "                return data\n",
+    "    except FileNotFoundError:\n",
+    "        print(f\"Error: File not found at {file_path}\")\n",
+    "        return None\n",
+    "    except json.JSONDecodeError:\n",
+    "        print(f\"Error: Invalid JSON format in {file_path}\")\n",
+    "        return None\n",
+    "\n",
+    "# Example usage\n",
+    "file_path = \"output.json\"\n",
+    "data = load_json_as_array(file_path)"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}