-
Notifications
You must be signed in to change notification settings - Fork 2
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #258 from OpenUpSA/members-interest-2024
updated member's interests import procedure
- Loading branch information
Showing
2 changed files
with
312 additions
and
56 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
272 changes: 272 additions & 0 deletions
272
pombola/south_africa/data/members-interests/scraper/docx_to_html_to_json.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,272 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "LllNYwnLqjn6" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import mammoth\n", | ||
"\n", | ||
"import re\n", | ||
"from pprint import pprint\n", | ||
"import json\n", | ||
"from bs4 import BeautifulSoup" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "_ESbMiFoQJMo" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"def docx_to_html(file_path):\n", | ||
" with open(file_path, \"rb\") as docx_file:\n", | ||
" result = mammoth.convert_to_html(docx_file)\n", | ||
" html = result.value \n", | ||
" messages = result.messages # Any warnings or errors during conversion\n", | ||
" return html\n", | ||
"\n", | ||
"docx_file_path = \"register.docx\"\n", | ||
"html_output = docx_to_html(docx_file_path)\n", | ||
"\n", | ||
"# find and delete this pattern in html_output </table><table><tr>(.*?)</tr> - this is for tables that span pages\n", | ||
"matches = re.findall(r'</table><table><tr>(.*?)</tr>', html_output)\n", | ||
"cleaned_html = re.sub(r'</table><table><tr>(.*?)</tr>', '', html_output)\n", | ||
"\n", | ||
"# replace </p><p> with a single space - This is for paragraphs in <td>\n", | ||
"cleaned_html = re.sub(r'</p><p>', ' ', cleaned_html)\n", | ||
"\n", | ||
"# Save the HTML to a file\n", | ||
"with open(\"output.html\", \"w\") as html_file:\n", | ||
" html_file.write(cleaned_html)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "tdsl7kk2mjYV" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"\n", | ||
"def split_document_by_pattern(html_content):\n", | ||
"\n", | ||
" # Split at table after TRUSTS\n", | ||
"\n", | ||
" pattern = r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>TRUSTS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\"\n", | ||
"\n", | ||
" # Split the document using the pattern\n", | ||
" sections = re.split(pattern, html_content, flags=re.DOTALL)\n", | ||
"\n", | ||
" # Combine the sections after splitting (capturing groups leave pattern matches in the split result)\n", | ||
" combined_sections = []\n", | ||
" for i in range(0, len(sections) - 1, 2):\n", | ||
" combined_sections.append(sections[i] + sections[i + 1]) # Add the content before and including the match\n", | ||
"\n", | ||
" # Add the final leftover content if any\n", | ||
" if len(sections) % 2 != 0:\n", | ||
" combined_sections.append(sections[-1])\n", | ||
"\n", | ||
" return combined_sections\n", | ||
"\n", | ||
"with open(\"output.html\", \"r\", encoding=\"utf-8\") as file:\n", | ||
" html_data = file.read()\n", | ||
"\n", | ||
"sections = split_document_by_pattern(html_data)\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "lbZtE00n2P4g" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Section Patterns\n", | ||
"\n", | ||
"sections_split = {\n", | ||
" \"RAW-SHARES AND OTHER FINANCIAL INTERESTS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>SHARES AND OTHER FINANCIAL INTERESTS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n", | ||
" \"RAW-REMUNERATED EMPLOYMENT OR WORK OUTSIDE OF PARLIAMENT\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>REMUNERATED EMPLOYMENT OR WORK OUTSIDE OF PARLIAMENT</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n", | ||
" \"RAW-DIRECTORSHIPS AND PARTNERSHIPS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>DIRECTORSHIPS AND PARTNERSHIPS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n", | ||
" \"RAW-CONSULTANCIES AND RETAINERSHIPS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>CONSULTANCIES AND RETAINERSHIPS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n", | ||
" \"RAW-SPONSORSHIPS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>SPONSORSHIPS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n", | ||
" \"RAW-GIFTS AND HOSPITALITY\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>GIFTS AND HOSPITALITY</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n", | ||
" \"RAW-BENEFITS AND INTERESTS FREE LOANS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>BENEFITS AND INTERESTS FREE LOANS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n", | ||
" \"RAW-TRAVEL\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>TRAVEL</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n", | ||
" \"RAW-OWNERSHIP IN LAND AND PROPERTY\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>OWNERSHIP IN LAND AND PROPERTY</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n", | ||
" \"RAW-PENSIONS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>PENSIONS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n", | ||
" \"RAW-RENTED PROPERTY\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>RENTED PROPERTY</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n", | ||
" \"RAW-INCOME GENERATING ASSETS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>INCOME GENERATING ASSETS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n", | ||
" \"RAW-TRUSTS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>TRUSTS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\"\n", | ||
"}\n", | ||
"\n", | ||
"def parse_table_to_json(html_table, key_name):\n", | ||
" if not isinstance(html_table, str):\n", | ||
" return\n", | ||
"\n", | ||
" soup = BeautifulSoup(html_table, \"html.parser\")\n", | ||
" rows = soup.find_all(\"tr\")\n", | ||
"\n", | ||
" # Extract headers from the first row\n", | ||
" headers = [header.get_text(strip=True) for header in rows[0].find_all(\"p\")]\n", | ||
"\n", | ||
" # Extract data from the remaining rows\n", | ||
" data = []\n", | ||
" for row in rows[1:]:\n", | ||
" values = [value.get_text(strip=True) for value in row.find_all(\"p\")]\n", | ||
" entry = {headers[i]: values[i] if i < len(values) else \"\" for i in range(len(headers))}\n", | ||
" data.append(entry)\n", | ||
"\n", | ||
" # Construct the final JSON object\n", | ||
" result = data\n", | ||
" return result\n", | ||
"\n", | ||
"def process_person(person_html, section_name):\n", | ||
"\n", | ||
"\n", | ||
" content = {}\n", | ||
"\n", | ||
"\n", | ||
"\n", | ||
"\n", | ||
" # Extract each section\n", | ||
" for key, pattern in sections_split.items():\n", | ||
" matches = re.findall(pattern, person_html)\n", | ||
" content[key] = matches[0] if matches else None\n", | ||
"\n", | ||
"\n", | ||
" for html in content:\n", | ||
" table_pattern = r\"<table.*?>(.*?)</table>\"\n", | ||
"\n", | ||
" if isinstance(content[html], str):\n", | ||
" table_contents = re.findall(table_pattern, content[html])[0]\n", | ||
" content[html] = \"<table>\" + table_contents + \"</table>\"\n", | ||
"\n", | ||
"\n", | ||
" key_name = section_name.replace(\"RAW-\", \"\")\n", | ||
" result = parse_table_to_json(content['RAW-' + key_name], key_name)\n", | ||
"\n", | ||
" return(result)\n", | ||
"\n", | ||
"\n", | ||
"\n", | ||
"\n", | ||
"\n", | ||
"people = []\n", | ||
"\n", | ||
"for person in sections:\n", | ||
" if isinstance(person, str):\n", | ||
"\n", | ||
" person_name = \"\"\n", | ||
" person_title = \"\"\n", | ||
" person_party = \"\"\n", | ||
"\n", | ||
" if re.findall(r\"<ul><li><ol><li>(.*?)</li></ol></li></ul>\", person):\n", | ||
" person_name = re.findall(r\"<ul><li><ol><li>(.*?)</li></ol></li></ul>\", person)[0]\n", | ||
"\n", | ||
" if person_name == \"<strong>SHARES AND OTHER FINANCIAL INTERESTS</strong>\":\n", | ||
" person_name = re.findall(r\"<ol><li>(.*?)<ol><li>(.*?)</li></ol></li></ol>\",person)[0][1]\n", | ||
"\n", | ||
" person_party = re.findall(r\"<p>(.*?)</p>\",person)[0] if re.findall(r\"<p>(.*?)</p>\",person) else None\n", | ||
"\n", | ||
" if person_name:\n", | ||
" parts = person_name.split(\", \")\n", | ||
" surname = parts[0].strip() # Always the first part\n", | ||
" if len(parts) > 1:\n", | ||
" person_title = parts[1].split()[0].strip() # Only the first word is the title\n", | ||
" given_names = \" \".join(parts[1].split()[1:]).strip() # Remaining words are given names\n", | ||
" person_name = f\"{given_names} {surname}\".strip()\n", | ||
"\n", | ||
" people.append({\n", | ||
" \"mp\": person_name,\n", | ||
" \"title\": person_title,\n", | ||
" \"party\": person_party,\n", | ||
" \"SHARES AND OTHER FINANCIAL INTERESTS\": process_person(person, \"RAW-SHARES AND OTHER FINANCIAL INTERESTS\"),\n", | ||
" \"REMUNERATED EMPLOYMENT OR WORK OUTSIDE OF PARLIAMENT\": process_person(person, \"RAW-REMUNERATED EMPLOYMENT OR WORK OUTSIDE OF PARLIAMENT\"),\n", | ||
" \"DIRECTORSHIPS AND PARTNERSHIPS\": process_person(person, \"RAW-DIRECTORSHIPS AND PARTNERSHIPS\"),\n", | ||
" \"CONSULTANCIES AND RETAINERSHIPS\": process_person(person, \"RAW-CONSULTANCIES AND RETAINERSHIPS\"),\n", | ||
" \"SPONSORSHIPS\": process_person(person, \"RAW-SPONSORSHIPS\"),\n", | ||
" \"GIFTS AND HOSPITALITY\": process_person(person, \"RAW-GIFTS AND HOSPITALITY\"),\n", | ||
" \"BENEFITS AND INTERESTS FREE LOANS\": process_person(person, \"RAW-BENEFITS AND INTERESTS FREE LOANS\"),\n", | ||
" \"TRAVEL\": process_person(person, \"RAW-TRAVEL\"),\n", | ||
" \"OWNERSHIP IN LAND AND PROPERTY\": process_person(person, \"RAW-OWNERSHIP IN LAND AND PROPERTY\"),\n", | ||
" \"PENSIONS\": process_person(person, \"RAW-PENSIONS\"),\n", | ||
" \"RENTED PROPERTY\": process_person(person, \"RAW-RENTED PROPERTY\"),\n", | ||
" \"INCOME GENERATING ASSETS\": process_person(person, \"RAW-INCOME GENERATING ASSETS\"),\n", | ||
" \"TRUSTS\": process_person(person, \"RAW-TRUSTS\")\n", | ||
" })\n", | ||
"\n", | ||
" else:\n", | ||
" print(f\"Skipping non-string person entry: {type(person)}\")\n", | ||
"\n", | ||
"\n", | ||
"# clean people by dumping any entry where mp = None\n", | ||
"people = [person for person in people if person['mp'] is not None]\n", | ||
"\n", | ||
"\n", | ||
"with open(\"/content/drive/MyDrive/PROJECTS/PMG/MI/output.json\", \"w\") as outfile:\n", | ||
" json.dump(people, outfile)\n", | ||
"\n", | ||
"\n", | ||
"\n", | ||
"\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"id": "el1f_xdV1R4q" | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Just to debug final JSON\n", | ||
"\n", | ||
"import json\n", | ||
"\n", | ||
"def load_json_as_array(file_path):\n", | ||
" try:\n", | ||
" with open(file_path, 'r') as file:\n", | ||
" data = json.load(file)\n", | ||
" if isinstance(data, list):\n", | ||
" return data\n", | ||
" else:\n", | ||
" print(f\"Warning: JSON file does not contain an array of objects. Returning the loaded data as is.\")\n", | ||
" return data\n", | ||
" except FileNotFoundError:\n", | ||
" print(f\"Error: File not found at {file_path}\")\n", | ||
" return None\n", | ||
" except json.JSONDecodeError:\n", | ||
" print(f\"Error: Invalid JSON format in {file_path}\")\n", | ||
" return None\n", | ||
"\n", | ||
"# Example usage\n", | ||
"file_path = \"output.json\" \n", | ||
"data = load_json_as_array(file_path)\n" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"colab": { | ||
"provenance": [] | ||
}, | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"name": "python" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 0 | ||
} |