Commit
* Initial commit
* Datasets
* The 1st version of 300events information extracted by GPT4 and summarizing process by GPT3.5
* Add small data table for testing purposes
* Create enwiki-title-matched-cold-spells.jsonl
* Add Wikipedia files
* Add windstorm keyphrases
* Add keyphrases for additional categories
* no message
* no message
* no message
* no message
* 🙈 Ignore results and pycache
* Create comparison-test.py
  Add script for testing comparison module.
* Add normalisation and comparison modules
* Update comparison module
* Update comparison test
* Update normalisation module
* Add initial comparison analysis
* Extend conversion from text to integers
* Add precision, recall and null penalty
* Add comparison of event sets
* Add event set comparison experiments
* 🙈 Ignore results and pycache
* 📌 Pin dependencies
* 💡 Add TODO comments
* 🗃️ Add preliminary schema
* 🗃️ Fix schema sqlite3 compat + add validation checks
* 🚚 Fix python script location
* ➕ Add pandas as dep for json parsing
* 🗃️ Fix database check validation for day/month
* 🎨 Format sql file
* 🗃️ Add Date field alongside d/m/y split
* 🗃️ Parse 'Events' table
* ➕ Add deps for data parsing
* ♻️ Refactor uid (7 alphanumeric)
* 🗃️ Insert Events without annotation
* ♻️ Refactor Events insert
* 💬 Fix readme
* ♻️ Refactor json parsing (safe fail)
* ✨ Parse date into (day, month, year)
* ✨ Handle missing dates + split dates to d/m/y cols
* 🎨 Add pre-commit for format/lint
* 📌 Add pre-commit deps
* 🚨 Fix lint + formatting
* 🚨 Fix lint / extra line
* 💡 Remove comment
* ♻️ Refactor y/m/d format strings
* 🧐 Fix postprocessing after change in json schema
* 🚚 Fix output path
* 🗃️ Json-normalize specific impact data
* 🗃️ Split dates into d/m/y
* ⚰️ Remove unused functions
* 📝 Add docs + helpful comments
* 🧐 Parse subevents
* 🗃️ Update schema with start+end dates for subevents
* 🚚 Rename Location_* columns
* 🐛 Fix bad date tuple return
* 🚚 Rename insertion file + add json with 8 events
* 🗃️ Add subevents to sqlite3 db
* 🙈 Ignore .DS_Store
* Fix typo
* ♻️ Small refactors + formatting
* 🔥 Remove dead file
* 🗃️ Add database + fix schema and drop annotations from subevents
* 🗃️ Add country column (raw file + in json)
* 🗃️ Add country col
* 🗃️ Update database
* 🔨 Add raw sample files
* 💬 Fix col name order
* Normalize digit/word numerals to floats in a (min, max) range (#9)
* ✨ Add number normalization to extract col data
* ➕ Add number normalization related deps
* 🧑‍💻 Install spacy model if missing
* 💡 Add comments to explain the flow (to myself)
* ♻️ Load spacy model in func
* ♻️ Fix formatting + small refactors
* 🐛 Fix inequality function (check for approx)
* ♻️ Refactor code and logic
* ♻️ Refactor label checking
* 💡 Add comments
* 🐛 Fix approx check order
* 🐛 Handle millions/billions/etc
* ✨ Add extracted/normalized min,max,approx to parquet
* 🗃️ Add min,max,approx cols to db
* ⬇️ Downgrade to python3.9
* ⬇️ Downgrade deps for python3.8
* 🏷️ Fix List type
* 🏷️ Fix more list types -> List
* 🏷️ Fix types to run code on python3.8
* 🧑‍💻 Add args to pass to python scripts
* 📝 Add docs
* Fix NormalizeNum instantiation
* Fix string '0' being identified as approximation case
* Fix Indian Rupee normalization
* Fix 'hundreds of' cases
* Fix '>43 total' case
* 🩹 Fix "none" for Total_* cols meaning "zero"
* 🚨 Fix lint + format
* 🎨 Clean out comments
* ⚰️ Remove stats
* ♻️ Add catch-all for approx
* 📦️ Refresh parquet data files and db

---------

Co-authored-by: Shorouq <shorouq.zahra@ri.se>
Co-authored-by: chanjuan meng <brionymeng@gmail.com>

* 🙈 Ignore geopy cache
* ✨ Normalize locations (functions + example)
* 🙈 Ignore .env + format with comments
* 📝 Add docs on getting Bing api key
* ➕ Add deps for extracting/normalizing locations
* ♻️ Refactor normalizing locations
* 🗃️ Normalize locations in db
* ♻️ Refactor util functions into own file
* 🎨 Fix formating/lint
* 🗃️ Add GADM csv with location types/levels
* ♻️ Refactor location splitting function
* 🚚 Move data to dir
* ♻️ Refactor function to handle more cases
* ♻️ General refactor of parsing code
* 🚨 Fix lint
* 🔥 Remove moved file
* ♻️ Handle cases with missing columns
* ➕ Add dev dep + description
* 🚚 Refactor module name
* 🏷️ Convert list to str for sql db
* 🔨 Use OpenStreetMap (Nominatim); ditch BING
* 🔨 Add GADM normalization layer
* ⚰️ Clean print statements
* 🐛 Fix selecting only parquet
* 🗃️ Add GADM + UNSD datasets
* ♻️ Fix var name
* ♻️ Refactor GADM id getter for robustness
* ✨ Fuzzy-match world regions
* ♻️ Split function into several
* ♻️ Refactor GADM gid function
* 🚨 Format and lint
* ♻️ Refactor function
* 🚨 Fix lint + format
* ♻️ Refactor GADM data for the USA
* 🙈 Ignore excel files
* 🗃️ Parse events with Nominatim
* 📝 Update docs to remove BING access key instructions
* 📝 Add more instructions
* 🙈 Ignore pycache no matter where it is
* 🚚 Move and rename files for clarity
* ♻️ Prefer locations with multigon/polygon
* 🚨 Format + lint
* 🗃️ Fix database schema
* 🗃️ Add location normalization (normalized name, gid, type, geojson/geometry)
* ➖ Remove unused dep
* 🚚 Fix normalization class names
* 🔊 Add logger
* 🐛 Fix cache_uninstall bug
* 🗃️ Convert GID and location type to str
* ✏️ Fix typo
* ♻️ Refactor to improve quality
* ♻️ Refactor to handle wider cases + quickfix for cardinals
* 🐛 Fix cardinal normalization
* 💬 Expand unwanted location types
* ⚡️ Add caching
* 🐛 Fix not finding lowercase unsd regions
* ⚡️ Generalize to american state if country not found
* ⚡️ Improve country matching in gadm
* ⚗️ Get locations by segment if normal querying fails
* 💡 Fix comment
* 🔊 Fix log
* ♻️ Refactor api return
* ⚡️ Match any segment of location if all fails
* ♻️ Return original area if not normalized
* 🎨 Fix formatting
* ⚡️ Drop events/subevents with no location or year
* ♻️ Expand list of location segments to remove (like city/prefecture/district/etc)
* 💬 Expand list of unwanted location types
* ✏️ Fix typo
* ⚡️ Return "name" if "int_name" not available
* 🥅 Catch pycountry exception
* 🚨 Fix lint/whitespace
* ⚡️ Expand allowed names to be returned for countries (in order of preference)
* ✏️ Fix typo
* 🐛 Fix subevents where the country and location are identical (generalize to country col)
* 🔊 Add logs to table creation script + update db
* 🐛 Convert GID list to str for sqlite3
* 🐛 Fix inserting partial columns into impactDB
* 🗃️ Add alternative name to Mexico
* Upgrade pre-commit
* 🚨 Format
* 📦️ Set up lfs for large files
* 🙈 Ignore dev files
* 📦️ Add parquet output before db insert
* 📝 Add doc on git lfs
* 💡 Remove debug script

---------

Co-authored-by: Ni Li <100538534+liniiiiii@users.noreply.github.com>
Co-authored-by: CUMULUS\nili <deanlnee@163.com>
Co-authored-by: olofg <olof.gornerup@ri.se>
Co-authored-by: LiNi12 <nili7518@gmail.com>
Co-authored-by: Shorouq Zahra <shorouqza@MAPG99XN4K97N.local>
Co-authored-by: Shorouq <shorouq.zahra@ri.se>
Co-authored-by: chanjuan meng <brionymeng@gmail.com>