Commit ee686ee: Merge pull request #258 from OpenUpSA/members-interest-2024

updated member's interests import procedure

desafinadude authored Jan 13, 2025 (2 parents: 3a54b20 + 67a8dc5)

Showing 2 changed files with 312 additions and 56 deletions.

pombola/south_africa/data/members-interests/README.md (40 additions, 56 deletions)
# ZA Members' Interests Data

## Prepping and importing members' interests

Importing members' interests is a multistep process.

### Obtain register PDF

Get the latest register `PDF` from [parliament](https://www.parliament.gov.za/register-members-Interests).

### Trim PDF

Use [PDF Arranger](https://github.com/pdfarranger/pdfarranger) to remove the cover and contents pages, leaving a final `PDF` that is just the register.

### Convert PDF to DOCX

Previously we used `Google Docs` to convert the `PDF` to `DOCX`. This was very cumbersome and meant working in batches of 80 pages at a time.
A workable and faster third-party solution is [ilovepdf.com](https://www.ilovepdf.com/pdf_to_word).

### Convert DOCX to HTML

Create an environment and install dependencies in the `./pombola/south_africa/data/members-interests/scraper` directory:
```
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Use the first cell of `./scraper/docx_to_html_to_json.ipynb` to convert the `DOCX` to `HTML` with Mammoth.
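
The heart of that cell is a single Mammoth call, roughly (with `register.docx` as the input filename, as in the notebook):

```
import mammoth

with open("register.docx", "rb") as docx_file:
    html = mammoth.convert_to_html(docx_file).value
```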

### Convert to JSON

Run the other cells in the notebook to convert it to a workable `json` file.
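
Each entry in the resulting file should have roughly this shape (the top-level keys come from the notebook; the field names and values below are invented placeholders):

```
{
  "mp": "Jane Doe",
  "title": "Ms",
  "party": "EXAMPLE PARTY",
  "SHARES AND OTHER FINANCIAL INTERESTS": [
    {"Company": "Example Holdings", "No of shares": "100"}
  ],
  "TRUSTS": null
}
```

Sections with no table for a member come out as `null`.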

cd pombola/south_africa/data/members-interests/
./convert_to_import_json.py 2022.json > 2022_for_import.json
### Copy to server

The next steps need to be run on the server, as they use the production database.

### Convert to importable json file

`convert_to_import_json.py` cleans the raw data, matches each MP to a person in the database, and produces the format that the `interests_register_import_from_json` management command expects:
```
cd pombola/south_africa/data/members-interests/
./convert_to_import_json.py 2024.json > 2024_for_import.json
```

When processing new data you may well need to add more entries to the `slug_corrections` attribute. Set `finding_slug_corrections` to `True` to enable some code that will help you do that, and set it back to `False` when done.
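
For orientation, a `slug_corrections` entry maps a slug derived from the register to the slug of the matching person in the database; a hypothetical sketch (illustrative only, not from the real script):

```
slug_corrections = {
    'j-p-smith': 'john-peter-smith',  # hypothetical register slug -> database slug
}
finding_slug_corrections = False  # set to True to enable the helper code, back to False when done
```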

### Import final json file

To load the data into the database, run the management command below. If there are already entries for this register, first delete them all using the admin.

```
./manage.py interests_register_import_from_json pombola/south_africa/data/members-interests/2024_for_import.json
```

## Some useful notes for possible issues

- Name, surname and title order may change from year to year. See `/pombola/south_africa/data/members-interests/convert_to_import_json.py` lines 1325 and 1328 to tweak this if need be: `name_ordered = re.sub(r'^(\w+\b\s+\w+\b)\s+(.*)$', r'\2 \1', name_only)`, where `\2 \1` determines the order (see the sketch after this list).

- If the list of missing slugs is long, export existing slugs from Metabase and use ChatGPT to suggest matches. Confirm with PMG before importing.

- Section names might change; if they do, update the convert script to match the `json` file.

- Regex patterns might also change; if there are broken entries or overlaps in the `json` file, check that the patterns and sections are correct.
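
A quick illustration of that substitution, with a hypothetical name:

```
import re

name_only = "JOHN PETER SMITH"  # two given names followed by a surname
name_ordered = re.sub(r'^(\w+\b\s+\w+\b)\s+(.*)$', r'\2 \1', name_only)
print(name_ordered)  # "SMITH JOHN PETER"; use r'\1 \2' to keep the input order
```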
pombola/south_africa/data/members-interests/scraper/docx_to_html_to_json.ipynb (272 additions)
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LllNYwnLqjn6"
},
"outputs": [],
"source": [
"import mammoth\n",
"\n",
"import re\n",
"from pprint import pprint\n",
"import json\n",
"from bs4 import BeautifulSoup"
]
},
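{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Convert DOCX to HTML.** The next cell converts `register.docx` to HTML with Mammoth, removes table breaks introduced by page boundaries, joins paragraphs inside table cells, and writes the result to `output.html`."
]
},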
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_ESbMiFoQJMo"
},
"outputs": [],
"source": [
"def docx_to_html(file_path):\n",
" with open(file_path, \"rb\") as docx_file:\n",
" result = mammoth.convert_to_html(docx_file)\n",
" html = result.value \n",
" messages = result.messages # Any warnings or errors during conversion\n",
" return html\n",
"\n",
"docx_file_path = \"register.docx\"\n",
"html_output = docx_to_html(docx_file_path)\n",
"\n",
"# find and delete this pattern in html_output </table><table><tr>(.*?)</tr> - this is for tables that span pages\n",
"matches = re.findall(r'</table><table><tr>(.*?)</tr>', html_output)\n",
"cleaned_html = re.sub(r'</table><table><tr>(.*?)</tr>', '', html_output)\n",
"\n",
"# replace </p><p> with \"\" - This is for paragraphs in <td>\n",
"cleaned_html = re.sub(r'</p><p>', ' ', cleaned_html)\n",
"\n",
"# Save the HTML to a file\n",
"with open(\"output.html\", \"w\") as html_file:\n",
" html_file.write(cleaned_html)"
]
},
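{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Split per member.** The next cell splits the cleaned HTML into one chunk per member, using the table that follows each `TRUSTS` heading (the closing section of a member's entry) as the delimiter."
]
},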
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tdsl7kk2mjYV"
},
"outputs": [],
"source": [
"\n",
"def split_document_by_pattern(html_content):\n",
"\n",
" # Split at table after TRUSTS\n",
"\n",
" pattern = r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>TRUSTS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\"\n",
"\n",
" # Split the document using the pattern\n",
" sections = re.split(pattern, html_content, flags=re.DOTALL)\n",
"\n",
" # Combine the sections after splitting (capturing groups leave pattern matches in the split result)\n",
" combined_sections = []\n",
" for i in range(0, len(sections) - 1, 2):\n",
" combined_sections.append(sections[i] + sections[i + 1]) # Add the content before and including the match\n",
"\n",
" # Add the final leftover content if any\n",
" if len(sections) % 2 != 0:\n",
" combined_sections.append(sections[-1])\n",
"\n",
" return combined_sections\n",
"\n",
"with open(\"output.html\", \"r\", encoding=\"utf-8\") as file:\n",
" html_data = file.read()\n",
"\n",
"sections = split_document_by_pattern(html_data)\n",
"\n"
]
},
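{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Parse each member's sections.** The next cell matches every register section by its heading, reduces it to its table, parses the rows into dicts keyed by the header row, normalises names from `SURNAME, Title Given Names` to `Given Names Surname`, and dumps the list of people to a JSON file."
]
},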
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "lbZtE00n2P4g"
},
"outputs": [],
"source": [
"# Section Patterns\n",
"\n",
"sections_split = {\n",
" \"RAW-SHARES AND OTHER FINANCIAL INTERESTS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>SHARES AND OTHER FINANCIAL INTERESTS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n",
" \"RAW-REMUNERATED EMPLOYMENT OR WORK OUTSIDE OF PARLIAMENT\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>REMUNERATED EMPLOYMENT OR WORK OUTSIDE OF PARLIAMENT</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n",
" \"RAW-DIRECTORSHIPS AND PARTNERSHIPS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>DIRECTORSHIPS AND PARTNERSHIPS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n",
" \"RAW-CONSULTANCIES AND RETAINERSHIPS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>CONSULTANCIES AND RETAINERSHIPS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n",
" \"RAW-SPONSORSHIPS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>SPONSORSHIPS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n",
" \"RAW-GIFTS AND HOSPITALITY\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>GIFTS AND HOSPITALITY</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n",
" \"RAW-BENEFITS AND INTERESTS FREE LOANS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>BENEFITS AND INTERESTS FREE LOANS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n",
" \"RAW-TRAVEL\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>TRAVEL</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n",
" \"RAW-OWNERSHIP IN LAND AND PROPERTY\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>OWNERSHIP IN LAND AND PROPERTY</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n",
" \"RAW-PENSIONS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>PENSIONS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n",
" \"RAW-RENTED PROPERTY\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>RENTED PROPERTY</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n",
" \"RAW-INCOME GENERATING ASSETS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>INCOME GENERATING ASSETS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\",\n",
" \"RAW-TRUSTS\": r\"(<ul>\\s*<li>\\s*<ul>\\s*<li>\\s*<ol>\\s*<li>\\s*<strong>TRUSTS</strong>\\s*</li>\\s*</ol>\\s*</li>\\s*</ul>\\s*</li>\\s*</ul>.*?</table>)\"\n",
"}\n",
"\n",
"def parse_table_to_json(html_table, key_name):\n",
" if not isinstance(html_table, str):\n",
" return\n",
"\n",
" soup = BeautifulSoup(html_table, \"html.parser\")\n",
" rows = soup.find_all(\"tr\")\n",
"\n",
" # Extract headers from the first row\n",
" headers = [header.get_text(strip=True) for header in rows[0].find_all(\"p\")]\n",
"\n",
" # Extract data from the remaining rows\n",
" data = []\n",
" for row in rows[1:]:\n",
" values = [value.get_text(strip=True) for value in row.find_all(\"p\")]\n",
" entry = {headers[i]: values[i] if i < len(values) else \"\" for i in range(len(headers))}\n",
" data.append(entry)\n",
"\n",
" # Construct the final JSON object\n",
" result = data\n",
" return result\n",
"\n",
"def process_person(person_html, section_name):\n",
"\n",
"\n",
" content = {}\n",
"\n",
"\n",
"\n",
"\n",
" # Extract each section\n",
" for key, pattern in sections_split.items():\n",
" matches = re.findall(pattern, person_html)\n",
" content[key] = matches[0] if matches else None\n",
"\n",
"\n",
" for html in content:\n",
" table_pattern = r\"<table.*?>(.*?)</table>\"\n",
"\n",
" if isinstance(content[html], str):\n",
" table_contents = re.findall(table_pattern, content[html])[0]\n",
" content[html] = \"<table>\" + table_contents + \"</table>\"\n",
"\n",
"\n",
" key_name = section_name.replace(\"RAW-\", \"\")\n",
" result = parse_table_to_json(content['RAW-' + key_name], key_name)\n",
"\n",
" return(result)\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"people = []\n",
"\n",
"for person in sections:\n",
" if isinstance(person, str):\n",
"\n",
" person_name = \"\"\n",
" person_title = \"\"\n",
" person_party = \"\"\n",
"\n",
" if re.findall(r\"<ul><li><ol><li>(.*?)</li></ol></li></ul>\", person):\n",
" person_name = re.findall(r\"<ul><li><ol><li>(.*?)</li></ol></li></ul>\", person)[0]\n",
"\n",
" if person_name == \"<strong>SHARES AND OTHER FINANCIAL INTERESTS</strong>\":\n",
" person_name = re.findall(r\"<ol><li>(.*?)<ol><li>(.*?)</li></ol></li></ol>\",person)[0][1]\n",
"\n",
" person_party = re.findall(r\"<p>(.*?)</p>\",person)[0] if re.findall(r\"<p>(.*?)</p>\",person) else None\n",
"\n",
" if person_name:\n",
" parts = person_name.split(\", \")\n",
" surname = parts[0].strip() # Always the first part\n",
" if len(parts) > 1:\n",
" person_title = parts[1].split()[0].strip() # Only the first word is the title\n",
" given_names = \" \".join(parts[1].split()[1:]).strip() # Remaining words are given names\n",
" person_name = f\"{given_names} {surname}\".strip()\n",
"\n",
" people.append({\n",
" \"mp\": person_name,\n",
" \"title\": person_title,\n",
" \"party\": person_party,\n",
" \"SHARES AND OTHER FINANCIAL INTERESTS\": process_person(person, \"RAW-SHARES AND OTHER FINANCIAL INTERESTS\"),\n",
" \"REMUNERATED EMPLOYMENT OR WORK OUTSIDE OF PARLIAMENT\": process_person(person, \"RAW-REMUNERATED EMPLOYMENT OR WORK OUTSIDE OF PARLIAMENT\"),\n",
" \"DIRECTORSHIPS AND PARTNERSHIPS\": process_person(person, \"RAW-DIRECTORSHIPS AND PARTNERSHIPS\"),\n",
" \"CONSULTANCIES AND RETAINERSHIPS\": process_person(person, \"RAW-CONSULTANCIES AND RETAINERSHIPS\"),\n",
" \"SPONSORSHIPS\": process_person(person, \"RAW-SPONSORSHIPS\"),\n",
" \"GIFTS AND HOSPITALITY\": process_person(person, \"RAW-GIFTS AND HOSPITALITY\"),\n",
" \"BENEFITS AND INTERESTS FREE LOANS\": process_person(person, \"RAW-BENEFITS AND INTERESTS FREE LOANS\"),\n",
" \"TRAVEL\": process_person(person, \"RAW-TRAVEL\"),\n",
" \"OWNERSHIP IN LAND AND PROPERTY\": process_person(person, \"RAW-OWNERSHIP IN LAND AND PROPERTY\"),\n",
" \"PENSIONS\": process_person(person, \"RAW-PENSIONS\"),\n",
" \"RENTED PROPERTY\": process_person(person, \"RAW-RENTED PROPERTY\"),\n",
" \"INCOME GENERATING ASSETS\": process_person(person, \"RAW-INCOME GENERATING ASSETS\"),\n",
" \"TRUSTS\": process_person(person, \"RAW-TRUSTS\")\n",
" })\n",
"\n",
" else:\n",
" print(f\"Skipping non-string person entry: {type(person)}\")\n",
"\n",
"\n",
"# clean people by dumping any entry where mp = None\n",
"people = [person for person in people if person['mp'] is not None]\n",
"\n",
"\n",
"with open(\"/content/drive/MyDrive/PROJECTS/PMG/MI/output.json\", \"w\") as outfile:\n",
" json.dump(people, outfile)\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
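{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Sanity check.** The final cell reloads the generated JSON to confirm it parses as an array of objects."
]
},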
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "el1f_xdV1R4q"
},
"outputs": [],
"source": [
"# Just to debug final JSON\n",
"\n",
"import json\n",
"\n",
"def load_json_as_array(file_path):\n",
" try:\n",
" with open(file_path, 'r') as file:\n",
" data = json.load(file)\n",
" if isinstance(data, list):\n",
" return data\n",
" else:\n",
" print(f\"Warning: JSON file does not contain an array of objects. Returning the loaded data as is.\")\n",
" return data\n",
" except FileNotFoundError:\n",
" print(f\"Error: File not found at {file_path}\")\n",
" return None\n",
" except json.JSONDecodeError:\n",
" print(f\"Error: Invalid JSON format in {file_path}\")\n",
" return None\n",
"\n",
"# Example usage\n",
"file_path = \"output.json\" \n",
"data = load_json_as_array(file_path)\n"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
