diff --git a/pombola/south_africa/data/members-interests/README.md b/pombola/south_africa/data/members-interests/README.md
index 548533e6e..042d74596 100644
--- a/pombola/south_africa/data/members-interests/README.md
+++ b/pombola/south_africa/data/members-interests/README.md
@@ -1,15 +1,23 @@
 # ZA Members' Interests Data
 
-There are several files in this directory:
+## Prepping and importing members' interests
 
-## DOCX scraper
+Importing members' interests is a multi-step process.
 
-The scraper currently scrapes `.docx` files.
-To prepare the file:
+### Obtain register PDF
 
-1. Split the `PDF` into seperate files small enough to open in Google Drive. [PDF Arranger](https://github.com/pdfarranger/pdfarranger) works well
-2. Open the files in Google Drive and download each in `.docx` format
-3. Store the these files in `./pombola/south_africa/data/members-interests/scraper/docx_files/`
+Get the latest register `PDF` from [parliament](https://www.parliament.gov.za/register-members-Interests).
+
+### Trim PDF
+
+Use [PDF Arranger](https://github.com/pdfarranger/pdfarranger) to remove the cover and contents pages, so that the final `PDF` contains just the register itself.
+
+### Convert PDF to DOCX
+
+Previously we used `Google Docs` to convert the `PDF` to `DOCX`. This was very cumbersome and meant working in batches of 80 pages at a time.
+A workable and faster third-party alternative is [ilovepdf.com](https://www.ilovepdf.com/pdf_to_word).
+
+### Convert DOCX to HTML
 
 Create an environment and install dependencies in the `./pombola/south_africa/data/members-interests/scraper` directory:
 ```
@@ -18,67 +26,43 @@
 source venv/bin/activate
 pip install -r requirements.txt
 ```
 
-Run the script with the necessary arguments, e.g.
-```
-python scrape_interests_docx.py --input ./docx_files/ --output ../2021.json --year 2021 --source https://static.pmg.org.za/Register_of_Members_Interests_2021.pdf
-```
-
-This will combine documents into a single HTML file `main_html_file.html`
-
-Run the Jupyter script `membersinterest.ipynb` making sure to update the input file name. The output should be `register.json`
-
-Copy `register.json` to the `members-interests` directory and rename it to the corresponding year
-
-## Conversion script
-
-    convert_to_import_json.py
+Use the first cell of `./scraper/docx_to_html_to_json.ipynb` to convert the `DOCX` to `HTML` with Mammoth.
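+The `mammoth` package also installs a command-line entry point, so running `mammoth register.docx output.html` in the scraper directory should produce the same `HTML` if you want to sanity-check a conversion outside the notebook.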
 
-This script takes the raw data and processes it to clean it up, to match the mp
-to entries in the database and to put it in the format that the
-`interests_register_import_from_json` management command expects. This script
-is highly specific to the above JSON files and the MP entries in the database
-at the time of writing (3 Dec 2013).
+### Convert to JSON
 
-You can run the script like this:
+Run the other cells in the notebook to convert the `HTML` to a workable `JSON` file.
 
-    cd pombola/south_africa/data/members-interests/
-    ./convert_to_import_json.py 2022.json > 2022_for_import.json
+### Copy to server
 
-This will require a production or equivalent data for the persons table to filter against.
-You can either run the script in prod or build a local database instance like so:
+The next steps need to be run on the server, as they use the production database.
 
-`dokku postgres:export pombola > dumpitydump.dump`
+### Convert to importable JSON file
 
-`pg_restore -U pombola -d pombola dumpitydumplol.dump`
+```
+cd pombola/south_africa/data/members-interests/
+./convert_to_import_json.py 2024.json > 2024_for_import.json
+```
 
 When processing new data you may well need to add more entries to the
 `slug_corrections` attribute. Change the `finding_slug_corrections` to `True`
 to enable some code that'll help you do that. Change it back to False when
 done.
 
-## Final importable data
-
-    2010_for_import.json
-    2011_for_import.json
-    2012_for_import.json
-    2013_for_import.json
-    2014_for_import.json
-    2015_for_import.json
-    2016_for_import.json
-    2017_for_import.json
-    2018_for_import.json
-
-This is the output of the above conversion script. It is committed for ease of
-adding to the database, and as looking at the diffs is an easy way to see the
-results of changes to the conversion script.
+### Import final JSON file
 
 To load this data into the database, you can run the management command:
 (if there are already some entries you should delete them all using the admin)
 
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2010_for_import.json
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2011_for_import.json
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2012_for_import.json
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2013_for_import.json
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2014_for_import.json
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2015_for_import.json
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2016_for_import.json
-    ./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2017_for_import.json
+```
+./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2024_for_import.json
+```
+
+## Some useful notes for possible issues
+
+- Name, surname and title order may change from year to year. See lines 1325 and 1328 of
+`/pombola/south_africa/data/members-interests/convert_to_import_json.py` to tweak this if need be:
+`name_ordered = re.sub(r'^(\w+\b\s+\w+\b)\s+(.*)$', r'\2 \1', name_only)`, where `\2 \1` determines
+the order (see the worked example after this list).
+
+- If the list of missing slugs is long, export the existing slugs from Metabase and use ChatGPT to
+suggest matches. Confirm the matches with PMG before importing.
+
+- Section names might change; if they do, update the convert script so that it matches the `JSON` file.
+
+- Regex patterns might also change; if there are broken entries or overlaps in the `JSON` file,
+check that the patterns and sections are still correct.
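+
+As a worked example of that substitution (the name parts here are placeholders, not real data):
+
+```
+>>> import re
+>>> re.sub(r'^(\w+\b\s+\w+\b)\s+(.*)$', r'\2 \1', 'SURNAME FIRST Second Third')
+'Second Third SURNAME FIRST'
+```
+
+Swap the groups (`\2 \1` vs `\1 \2`) to match how names are ordered in that year's register.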
\ No newline at end of file
diff --git a/pombola/south_africa/data/members-interests/scraper/docx_to_html_to_json.ipynb b/pombola/south_africa/data/members-interests/scraper/docx_to_html_to_json.ipynb
new file mode 100644
index 000000000..9229c6da3
--- /dev/null
+++ b/pombola/south_africa/data/members-interests/scraper/docx_to_html_to_json.ipynb
@@ -0,0 +1,272 @@
+{
+ "cells": [
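+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Convert the register `register.docx` (expected in this directory) to HTML with Mammoth, clean the markup, split it into one chunk per MP, and write the parsed entries to `output.json`."
+   ]
+  },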
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "LllNYwnLqjn6"
+   },
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "import re\n",
+    "\n",
+    "import mammoth\n",
+    "from bs4 import BeautifulSoup"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "_ESbMiFoQJMo"
+   },
+   "outputs": [],
+   "source": [
+    "def docx_to_html(file_path):\n",
+    "    with open(file_path, \"rb\") as docx_file:\n",
+    "        result = mammoth.convert_to_html(docx_file)\n",
+    "        html = result.value\n",
+    "        messages = result.messages  # Any warnings or errors during conversion\n",
+    "    return html\n",
+    "\n",
+    "docx_file_path = \"register.docx\"\n",
+    "html_output = docx_to_html(docx_file_path)\n",
+    "\n",
+    "# find and delete this pattern in html_output </table><table>(.*?) - this is\n",
+    "# for tables that span pages (rejoins a table split across a page break)\n",
+    "matches = re.findall(r'</table><table>(.*?)', html_output)\n",
+    "cleaned_html = re.sub(r'</table><table>(.*?)', '', html_output)\n",
+    "\n",
+    "# replace </p><p> with \" \" - this is for paragraphs in <td></td> cells\n",
+    "cleaned_html = re.sub(r'</p><p>', ' ', cleaned_html)\n",
+    "\n",
+    "# Save the HTML to a file\n",
+    "with open(\"output.html\", \"w\") as html_file:\n",
+    "    html_file.write(cleaned_html)"
+   ]
+  },
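+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Split the cleaned HTML into one chunk per MP. TRUSTS is the last section in each entry, so each chunk ends at the table that follows the TRUSTS heading."
+   ]
+  },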
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "tdsl7kk2mjYV"
+   },
+   "outputs": [],
+   "source": [
+    "def split_document_by_pattern(html_content):\n",
+    "\n",
+    "    # Split at the table after TRUSTS: the heading's nested-list markup,\n",
+    "    # then the table that follows it\n",
+    "    pattern = r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*TRUSTS\\s*</li>\\s*<li>\\s*</li>\\s*</ol>\\s*</li>\\s*<li>\\s*</li>\\s*</ul>\\s*</li>\\s*<li>\\s*</li>\\s*</ul>\\s*<table>.*?</table>)\"\n",
+    "\n",
+    "    # Split the document using the pattern\n",
+    "    sections = re.split(pattern, html_content, flags=re.DOTALL)\n",
+    "\n",
+    "    # Combine the sections after splitting (capturing groups leave pattern matches in the split result)\n",
+    "    combined_sections = []\n",
+    "    for i in range(0, len(sections) - 1, 2):\n",
+    "        # Add the content before and including the match\n",
+    "        combined_sections.append(sections[i] + sections[i + 1])\n",
+    "\n",
+    "    # Add the final leftover content if any\n",
+    "    if len(sections) % 2 != 0:\n",
+    "        combined_sections.append(sections[-1])\n",
+    "\n",
+    "    return combined_sections\n",
+    "\n",
+    "with open(\"output.html\", \"r\", encoding=\"utf-8\") as file:\n",
+    "    html_data = file.read()\n",
+    "\n",
+    "sections = split_document_by_pattern(html_data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "lbZtE00n2P4g"
+   },
+   "outputs": [],
+   "source": [
+    "# Section patterns - each one captures the chunk between a section heading\n",
+    "# and the next section heading\n",
+    "sections_split = {\n",
+    "    \"RAW-SHARES AND OTHER FINANCIAL INTERESTS\": r\"SHARES AND OTHER FINANCIAL INTERESTS(.*?)REMUNERATED EMPLOYMENT OR WORK OUTSIDE OF PARLIAMENT\",\n",
+    "    \"RAW-REMUNERATED EMPLOYMENT OR WORK OUTSIDE OF PARLIAMENT\": r\"REMUNERATED EMPLOYMENT OR WORK OUTSIDE OF PARLIAMENT(.*?)DIRECTORSHIPS AND PARTNERSHIPS\",\n",
+    "    \"RAW-DIRECTORSHIPS AND PARTNERSHIPS\": r\"DIRECTORSHIPS AND PARTNERSHIPS(.*?)CONSULTANCIES AND RETAINERSHIPS\",\n",
+    "    \"RAW-CONSULTANCIES AND RETAINERSHIPS\": r\"CONSULTANCIES AND RETAINERSHIPS(.*?)SPONSORSHIPS\",\n",
+    "    \"RAW-SPONSORSHIPS\": r\"SPONSORSHIPS(.*?)GIFTS AND HOSPITALITY\",\n",
+    "    \"RAW-GIFTS AND HOSPITALITY\": r\"GIFTS AND HOSPITALITY(.*?)BENEFITS AND INTERESTS FREE LOANS\",\n",
+    "    \"RAW-BENEFITS AND INTERESTS FREE LOANS\": r\"BENEFITS AND INTERESTS FREE LOANS(.*?)TRAVEL\",\n",
+    "    \"RAW-TRAVEL\": r\"TRAVEL(.*?)OWNERSHIP IN LAND AND PROPERTY\",\n",
+    "    \"RAW-OWNERSHIP IN LAND AND PROPERTY\": r\"OWNERSHIP IN LAND AND PROPERTY(.*?)PENSIONS\",\n",
+    "    \"RAW-PENSIONS\": r\"PENSIONS(.*?)RENTED PROPERTY\",\n",
+    "    \"RAW-RENTED PROPERTY\": r\"RENTED PROPERTY(.*?)INCOME GENERATING ASSETS\",\n",
+    "    \"RAW-INCOME GENERATING ASSETS\": r\"INCOME GENERATING ASSETS(.*?)TRUSTS\",\n",
+    "    \"RAW-TRUSTS\": r\"TRUSTS(.*?)$\"\n",
+    "}\n",
+    "\n",
+    "def parse_table_to_json(html_table, key_name):\n",
+    "    if not isinstance(html_table, str):\n",
+    "        return\n",
+    "\n",
+    "    soup = BeautifulSoup(html_table, \"html.parser\")\n",
+    "    rows = soup.find_all(\"tr\")\n",
+    "\n",
+    "    # Extract headers from the first row\n",
+    "    headers = [header.get_text(strip=True) for header in rows[0].find_all(\"p\")]\n",
+    "\n",
+    "    # Extract data from the remaining rows\n",
+    "    data = []\n",
+    "    for row in rows[1:]:\n",
+    "        values = [value.get_text(strip=True) for value in row.find_all(\"p\")]\n",
+    "        entry = {headers[i]: values[i] if i < len(values) else \"\" for i in range(len(headers))}\n",
+    "        data.append(entry)\n",
+    "\n",
+    "    return data\n",
+    "\n",
+    "def process_person(person_html, section_name):\n",
+    "    content = {}\n",
+    "\n",
+    "    # Extract each section\n",
+    "    for key, pattern in sections_split.items():\n",
+    "        matches = re.findall(pattern, person_html)\n",
+    "        content[key] = matches[0] if matches else None\n",
+    "\n",
+    "    # Keep only the first table in each section's chunk\n",
+    "    for html in content:\n",
+    "        table_pattern = r\"<table>(.*?)</table>\"\n",
+    "        if isinstance(content[html], str):\n",
+    "            tables = re.findall(table_pattern, content[html])\n",
+    "            content[html] = \"<table>\" + tables[0] + \"</table>\" if tables else None\n",
+    "\n",
+    "    key_name = section_name.replace(\"RAW-\", \"\")\n",
+    "    return parse_table_to_json(content['RAW-' + key_name], key_name)\n",
+    "\n",
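+    "# Walk each per-MP chunk: pull out the MP's name, title and party,\n",
+    "# then parse each section's table into a list of row dicts\n",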
+    "people = []\n",
+    "\n",
+    "for person in sections:\n",
+    "    if isinstance(person, str):\n",
+    "\n",
+    "        person_name = \"\"\n",
+    "        person_title = \"\"\n",
+    "        person_party = \"\"\n",
+    "\n",
+    "        if re.findall(r\"<h1>(.*?)</h1>\", person):\n",
+    "            person_name = re.findall(r\"<h1>(.*?)</h1>\", person)[0]\n",
+    "\n",
+    "        # If the first heading is a section heading rather than the MP's\n",
+    "        # name, the name sits in the nested list markup instead\n",
+    "        if person_name == \"SHARES AND OTHER FINANCIAL INTERESTS\":\n",
+    "            person_name = re.findall(r\"<ol>\\s*<li>(.*?)<ol>\\s*<li>(.*?)</li>\", person)[0][1]\n",
+    "\n",
+    "        person_party = re.findall(r\"<td><p>(.*?)</p></td>\", person)[0] if re.findall(r\"<td><p>(.*?)</p></td>\", person) else None\n",
+    "\n",
+    "        if person_name:\n",
+    "            parts = person_name.split(\", \")\n",
+    "            surname = parts[0].strip()  # Always the first part\n",
+    "            if len(parts) > 1:\n",
+    "                person_title = parts[1].split()[0].strip()  # Only the first word is the title\n",
+    "                given_names = \" \".join(parts[1].split()[1:]).strip()  # Remaining words are given names\n",
+    "                person_name = f\"{given_names} {surname}\".strip()\n",
+    "\n",
+    "        people.append({\n",
+    "            \"mp\": person_name,\n",
+    "            \"title\": person_title,\n",
+    "            \"party\": person_party,\n",
+    "            \"SHARES AND OTHER FINANCIAL INTERESTS\": process_person(person, \"RAW-SHARES AND OTHER FINANCIAL INTERESTS\"),\n",
+    "            \"REMUNERATED EMPLOYMENT OR WORK OUTSIDE OF PARLIAMENT\": process_person(person, \"RAW-REMUNERATED EMPLOYMENT OR WORK OUTSIDE OF PARLIAMENT\"),\n",
+    "            \"DIRECTORSHIPS AND PARTNERSHIPS\": process_person(person, \"RAW-DIRECTORSHIPS AND PARTNERSHIPS\"),\n",
+    "            \"CONSULTANCIES AND RETAINERSHIPS\": process_person(person, \"RAW-CONSULTANCIES AND RETAINERSHIPS\"),\n",
+    "            \"SPONSORSHIPS\": process_person(person, \"RAW-SPONSORSHIPS\"),\n",
+    "            \"GIFTS AND HOSPITALITY\": process_person(person, \"RAW-GIFTS AND HOSPITALITY\"),\n",
+    "            \"BENEFITS AND INTERESTS FREE LOANS\": process_person(person, \"RAW-BENEFITS AND INTERESTS FREE LOANS\"),\n",
+    "            \"TRAVEL\": process_person(person, \"RAW-TRAVEL\"),\n",
+    "            \"OWNERSHIP IN LAND AND PROPERTY\": process_person(person, \"RAW-OWNERSHIP IN LAND AND PROPERTY\"),\n",
+    "            \"PENSIONS\": process_person(person, \"RAW-PENSIONS\"),\n",
+    "            \"RENTED PROPERTY\": process_person(person, \"RAW-RENTED PROPERTY\"),\n",
+    "            \"INCOME GENERATING ASSETS\": process_person(person, \"RAW-INCOME GENERATING ASSETS\"),\n",
+    "            \"TRUSTS\": process_person(person, \"RAW-TRUSTS\")\n",
+    "        })\n",
+    "\n",
+    "    else:\n",
+    "        print(f\"Skipping non-string person entry: {type(person)}\")\n",
+    "\n",
+    "# clean people by dropping any entry where no MP name was found\n",
+    "people = [person for person in people if person['mp']]\n",
+    "\n",
+    "with open(\"output.json\", \"w\") as outfile:\n",
+    "    json.dump(people, outfile)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "el1f_xdV1R4q"
+   },
+   "outputs": [],
+   "source": [
+    "# Just to debug the final JSON\n",
+    "\n",
+    "import json\n",
+    "\n",
+    "def load_json_as_array(file_path):\n",
+    "    try:\n",
+    "        with open(file_path, 'r') as file:\n",
+    "            data = json.load(file)\n",
+    "            if isinstance(data, list):\n",
+    "                return data\n",
+    "            else:\n",
+    "                print(\"Warning: JSON file does not contain an array of objects. Returning the loaded data as is.\")\n",
+    "                return data\n",
+    "    except FileNotFoundError:\n",
+    "        print(f\"Error: File not found at {file_path}\")\n",
+    "        return None\n",
+    "    except json.JSONDecodeError:\n",
+    "        print(f\"Error: Invalid JSON format in {file_path}\")\n",
+    "        return None\n",
+    "\n",
+    "# Example usage\n",
+    "file_path = \"output.json\"\n",
+    "data = load_json_as_array(file_path)"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}