Location Normalization (#11)
* Initial commit

* Datasets

* First version of the 300-event information extracted by GPT-4, with summarization by GPT-3.5

* Add small data table for testing purposes

* Create enwiki-title-matched-cold-spells.jsonl

* Add Wikipedia files

* Add windstorm keyphrases

* Add keyphrases for additional categories

* no message

* no message

* no message

* no message

* 🙈 Ignore results and pycache

* Create comparison-test.py

Add script for testing comparison module.

* Add normalisation and comparison modules

* Update comparison module

* Update comparison test

* Update normalisation module

* Add initial comparison analysis

* Extend conversion from text to integers

* Add precision, recall and null penalty

* Add comparison of event sets

* Add event set comparison experiments

* 🙈 Ignore results and pycache

* 📌 Pin dependencies

* 💡 Add TODO comments

* 🗃️ Add preliminary schema

* 🗃️ Fix schema sqlite3 compat + add validation checks

* 🚚 Fix python script location

* ➕ Add pandas as dep for json parsing

* 🗃️ Fix database check validation for day/month

* 🎨 Format sql file

* 🗃️ Add Date field alongside d/m/y split

* 🗃️ Parse 'Events' table

* ➕ Add deps for data parsing

* ♻️ Refactor uid (7 alphanumeric)

* 🗃️ Insert Events without annotation

* ♻️ Refactor Events insert

* 💬 Fix readme

* ♻️ Refactor json parsing (safe fail)

* ✨ Parse date into (day, month, year)

* ✨ Handle missing dates + split dates to d/m/y cols

* 🎨 Add pre-commit for format/lint

* 📌 Add pre-commit deps

* 🚨 Fix lint + formatting

* 🚨 Fix lint / extra line

* 💡 Remove comment

* ♻️ Refactor y/m/d format strings

* 🧐 Fix postprocessing after change in json schema

* 🚚 Fix output path

* 🗃️ Json-normalize specific impact data

* 🗃️ Split dates into d/m/y

* ⚰️ Remove unused functions

* 📝 Add docs + helpful comments

* 🧐 Parse subevents

* 🗃️ Update schema with start+end dates for subevents

* 🚚 Rename Location_* columns

* 🐛 Fix bad date tuple return

* 🚚 Rename insertion file + add json with 8 events

* 🗃️ Add subevents to sqlite3 db

* 🙈 Ignore .DS_Store

* Fix typo

* ♻️ Small refactors + formatting

* 🔥 Remove dead file

* 🗃️ Add database + fix schema and drop annotations from subevents

* 🗃️ Add country column (raw file + in json)

* 🗃️ Add country col

* 🗃️ Update database

* 🔨 Add raw sample files

* 💬 Fix col name order

* Normalize digit/word numerals to floats in a (min, max) range (#9)

* ✨ Add number normalization to extract col data

* ➕ Add number normalization related deps

* 🧑‍💻 Install spacy model if missing

* 💡 Add comments to explain the flow (to myself)

* ♻️ Load spacy model in func

* ♻️ Fix formatting + small refactors

* 🐛 Fix inequality function (check for approx)

* ♻️ Refactor code and logic

* ♻️ Refactor label checking

* 💡 Add comments

* 🐛 Fix approx check order

* 🐛 Handle millions/billions/etc

* ✨ Add extracted/normalized min,max,approx to parquet

* 🗃️ Add min,max,approx cols to db

* ⬇️ Downgrade to python3.9

* ⬇️ Downgrade deps for python3.8

* 🏷️ Fix List type

* 🏷️ Fix more list types -> List

* 🏷️ Fix types to run code on python3.8

* 🧑‍💻 Add args to pass to python scripts

* 📝 Add docs

* Fix NormalizeNum instantiation

* Fix string '0' being identified as approximation case

* Fix Indian Rupee normalization

* Fix 'hundreds of' cases

* Fix '>43 total' case

* 🩹 Fix "none" for Total_* cols meaning "zero"

* 🚨 Fix lint + format

* 🎨 Clean out comments

* ⚰️ Remove stats

* ♻️ Add catch-all for approx

* 📦️ Refresh parquet data files and db

---------

Co-authored-by: Shorouq <shorouq.zahra@ri.se>
Co-authored-by: chanjuan meng <brionymeng@gmail.com>

* 🙈 Ignore geopy cache

* ✨ Normalize locations (functions + example)

* 🙈 Ignore .env + format with comments

* 📝 Add docs on getting Bing api key

* ➕ Add deps for extracting/normalizing locations

* ♻️ Refactor normalizing locations

* 🗃️ Normalize locations in db

* ♻️ Refactor util functions into own file

* 🎨 Fix formatting/lint

* 🗃️ Add GADM csv with location types/levels

* ♻️ Refactor location splitting function

* 🚚 Move data to dir

* ♻️ Refactor function to handle more cases

* ♻️ General refactor of parsing code

* 🚨 Fix lint

* 🔥 Remove moved file

* ♻️ Handle cases with missing columns

* ➕ Add dev dep + description

* 🚚 Refactor module name

* 🏷️ Convert list to str for sql db

* 🔨 Use OpenStreetMap (Nominatim); ditch BING

* 🔨 Add GADM normalization layer

* ⚰️ Clean print statements

* 🐛 Fix selecting only parquet

* 🗃️ Add GADM + UNSD datasets

* ♻️ Fix var name

* ♻️ Refactor GADM id getter for robustness

* ✨ Fuzzy-match world regions

* ♻️ Split function into several

* ♻️ Refactor GADM gid function

* 🚨 Format and lint

* ♻️ Refactor function

* 🚨 Fix lint + format

* ♻️ Refactor GADM data for the USA

* 🙈 Ignore excel files

* 🗃️ Parse events with Nominatim

* 📝 Update docs to remove BING access key instructions

* 📝 Add more instructions

* 🙈 Ignore pycache no matter where it is

* 🚚 Move and rename files for clarity

* ♻️ Prefer locations with multipolygon/polygon

* 🚨 Format + lint

* 🗃️ Fix database schema

* 🗃️ Add location normalization (normalized name, gid, type, geojson/geometry)

* ➖ Remove unused dep

* 🚚 Fix normalization class names

* 🔊 Add logger

* 🐛 Fix cache_uninstall bug

* 🗃️ Convert GID and location type to str

* ✏️ Fix typo

* ♻️ Refactor to improve quality

* ♻️ Refactor to handle wider cases + quickfix for cardinals

* 🐛 Fix cardinal normalization

* 💬 Expand unwanted location types

* ⚡️ Add caching

* 🐛 Fix not finding lowercase unsd regions

* ⚡️ Generalize to american state if country not found

* ⚡️ Improve country matching in gadm

* ⚗️ Get locations by segment if normal querying fails

* 💡 Fix comment

* 🔊 Fix log

* ♻️ Refactor api return

* ⚡️ Match any segment of location if all fails

* ♻️ Return original area if not normalized

* 🎨 Fix formatting

* ⚡️ Drop events/subevents with no location or year

* ♻️ Expand list of location segments to remove (like city/prefecture/district/etc)

* 💬 Expand list of unwanted location types

* ✏️ Fix typo

* ⚡️ Return "name" if "int_name" not available

* 🥅 Catch pycountry exception

* 🚨 Fix lint/whitespace

* ⚡️ Expand allowed names to be returned for countries (in order of preference)

* ✏️ Fix typo

* 🐛 Fix subevents where the country and location are identical (generalize to country col)

* 🔊 Add logs to table creation script + update db

* 🐛 Convert GID list to str for sqlite3

* 🐛 Fix inserting partial columns into impactDB

* 🗃️ Add alternative name to Mexico

* Upgrade pre-commit

* 🚨 Format

* 📦️ Set up lfs for large files

* 🙈 Ignore dev files

* 📦️ Add parquet output before db insert

* 📝 Add doc on git lfs

* 💡 Remove debug script

---------

Co-authored-by: Ni Li <100538534+liniiiiii@users.noreply.github.com>
Co-authored-by: CUMULUS\nili <deanlnee@163.com>
Co-authored-by: olofg <olof.gornerup@ri.se>
Co-authored-by: LiNi12 <nili7518@gmail.com>
Co-authored-by: Shorouq Zahra <shorouqza@MAPG99XN4K97N.local>
Co-authored-by: Shorouq <shorouq.zahra@ri.se>
Co-authored-by: chanjuan meng <brionymeng@gmail.com>
8 people authored May 9, 2024
1 parent 0846fd0 commit 501df99
Showing 26 changed files with 17,775 additions and 556 deletions.
9 changes: 9 additions & 0 deletions .gitattributes
@@ -0,0 +1,9 @@
.csv filter=lfs diff=lfs merge=lfs -text
.db filter=lfs diff=lfs merge=lfs -text
.sqlite filter=lfs diff=lfs merge=lfs -text
.parquet filter=lfs diff=lfs merge=lfs -text
.json filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.db filter=lfs diff=lfs merge=lfs -text
*.csv filter=lfs diff=lfs merge=lfs -text
*.sqlite filter=lfs diff=lfs merge=lfs -text
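
Of the nine patterns above, the first five have no leading `*`. Git matches a slash-free gitattributes pattern as a glob against the file's basename, so `.csv` would only match a file literally named `.csv`, while `*.csv` matches any CSV file. That would also leave `.json` without a working counterpart. The sketch below approximates that rule with Python's `fnmatch`; it is not Git's own matcher, the helper `matches` is made up for illustration, and `Database/output/events.parquet` is a hypothetical path.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def matches(pattern: str, path: str) -> bool:
    """Approximate a slash-free gitattributes pattern by globbing the basename."""
    return fnmatchcase(PurePosixPath(path).name, pattern)


paths = [
    "Database/data/gadm_world.csv",    # path taken from this commit
    "Database/output/events.parquet",  # hypothetical path
    ".csv",                            # a file literally named ".csv"
]

# Pattern without the wildcard, as in the first five lines of the hunk.
bare = [p for p in paths if matches(".csv", p)]

# Pattern with the wildcard, as in the last four lines of the hunk.
glob = [p for p in paths if matches("*.csv", p)]

print("'.csv' matches:", bare)
print("'*.csv' matches:", glob)
```

For a real check, `git check-attr filter -- <path>` reports which attributes Git actually applies to a given file.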
15 changes: 14 additions & 1 deletion .gitignore
@@ -1,3 +1,16 @@
# ignore dev files
results
*/__pycache__
.env
Database/raw/*.xlsx
Database/output/*.csv
Database/output/Ni/*.csv
Database/output/dev/*

# ignore pycache
**/__pycache__

# ignore mac-related files
.DS_Store

# ignore geopy cache (used for normalizing locations faster)
geopy_cache.sqlite
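
The last entry ignores a local SQLite file that, per its comment, speeds up location normalization. The commit view does not show how that cache is built, so the following is only a minimal sketch of the general idea (memoizing geocoder responses in SQLite), using the standard library alone. `GeocodeCache` and `fake_geocode` are hypothetical names; `fake_geocode` stands in for a real geocoder call such as a Nominatim query.

```python
import json
import sqlite3
from typing import Callable, Dict, List


class GeocodeCache:
    """Hypothetical SQLite-backed cache keyed on the raw query string."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (query TEXT PRIMARY KEY, result TEXT)"
        )

    def lookup(self, query: str, geocode: Callable[[str], Dict]) -> Dict:
        row = self.conn.execute(
            "SELECT result FROM cache WHERE query = ?", (query,)
        ).fetchone()
        if row is not None:
            # Cache hit: skip the (slow, rate-limited) geocoder call.
            return json.loads(row[0])
        result = geocode(query)
        self.conn.execute(
            "INSERT INTO cache (query, result) VALUES (?, ?)",
            (query, json.dumps(result)),
        )
        self.conn.commit()
        return result


calls: List[str] = []


def fake_geocode(query: str) -> Dict:
    """Stand-in for a real geocoder; records each call it receives."""
    calls.append(query)
    return {"name": query.title(), "type": "placeholder"}


cache = GeocodeCache()  # in-memory here; a file path would persist between runs
first = cache.lookup("uppsala", fake_geocode)
second = cache.lookup("uppsala", fake_geocode)
print(first == second, len(calls))
```

Because a file-backed cache like this is machine-specific and can be regenerated, keeping it out of version control is a sensible default.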
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -1,6 +1,6 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
-rev: v2.3.0
+rev: v4.5.0
hooks:
- id: end-of-file-fixer
- id: trailing-whitespace
3 changes: 3 additions & 0 deletions Database/data/UNSD — Methodology.csv
Git LFS file not shown
3 changes: 3 additions & 0 deletions Database/data/gadm_world.csv
Git LFS file not shown