Consider partitioning countries at the file level rather than marking countries in a TSV row #18

Open
marklit opened this issue Apr 10, 2023 · 0 comments


marklit commented Apr 10, 2023

Oceania-Full.zip is currently 282 MB. If its data were partitioned into one GeoJSON file per country and each file were sorted, the ZIP would be 244 MB instead (about 13% smaller). That would let people download the ZIP faster, and they would use less disk space by extracting only the countries they're interested in. The GeoJSON would also open directly in QGIS and other GIS software without first needing to ETL the TSV.
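For illustration, the partitioning step could look roughly like the sketch below. The TSV layout here is an assumption made only for this example (column 1 is the 3-letter country code, column 2 is one GeoJSON feature per row), and `partition_by_country` is a hypothetical helper name, so adjust both to the real schema.

```shell
#!/usr/bin/env bash
set -eu

# ASSUMPTION: column 1 = 3-letter country code, column 2 = one GeoJSON
# feature per row. The real TSV schema may differ.
partition_by_country() {
    tsv="$1"
    outdir="$2"
    mkdir -p "$outdir"
    # Write each row's feature to <outdir>/<country>.geojson.
    # awk truncates each output file on first write, then appends.
    awk -F'\t' -v dir="$outdir" '{ print $2 > (dir "/" $1 ".geojson") }' "$tsv"
}

# Tiny made-up sample, only to show the behaviour.
workdir="$(mktemp -d)"
printf 'NZL\t{"type":"Feature","id":2}\nAUS\t{"type":"Feature","id":1}\nNZL\t{"type":"Feature","id":3}\n' \
    > "$workdir/sample.tsv"

partition_by_country "$workdir/sample.tsv" "$workdir/out"
ls "$workdir/out"
```

A single `awk` pass keeps one open file per country, which should be fine for a region with a handful of countries.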

$ vi a.sh
sort AUS.geojson > AUS.sorted.geojson
sort NZL.geojson > NZL.sorted.geojson
sort PNG.geojson > PNG.sorted.geojson
sort VUT.geojson > VUT.sorted.geojson
sort FJI.geojson > FJI.sorted.geojson
sort SLB.geojson > SLB.sorted.geojson
sort TON.geojson > TON.sorted.geojson
sort WSM.geojson > WSM.sorted.geojson
sort FSM.geojson > FSM.sorted.geojson
sort KIR.geojson > KIR.sorted.geojson
sort PLW.geojson > PLW.sorted.geojson
sort MHL.geojson > MHL.sorted.geojson
sort TUV.geojson > TUV.sorted.geojson
sort NRU.geojson > NRU.sorted.geojson
$ cat a.sh | xargs -P4 -I% bash -xc '%'
$ zip -9 Oceania.sorted.zip \
    AUS.sorted.geojson \
    NZL.sorted.geojson \
    PNG.sorted.geojson \
    VUT.sorted.geojson \
    FJI.sorted.geojson \
    SLB.sorted.geojson \
    TON.sorted.geojson \
    WSM.sorted.geojson \
    FSM.sorted.geojson \
    KIR.sorted.geojson \
    PLW.sorted.geojson \
    MHL.sorted.geojson \
    TUV.sorted.geojson \
    NRU.sorted.geojson

$ unzip -l Oceania.sorted.zip
Archive:  Oceania.sorted.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
1071521607  2023-04-10 18:58   AUS.sorted.geojson
185466598  2023-04-10 18:57   NZL.sorted.geojson
 28007237  2023-04-10 18:57   PNG.sorted.geojson
  6470562  2023-04-10 18:57   VUT.sorted.geojson
  5832797  2023-04-10 18:57   FJI.sorted.geojson
  4423195  2023-04-10 18:57   SLB.sorted.geojson
  1047604  2023-04-10 18:57   TON.sorted.geojson
  1066450  2023-04-10 18:57   WSM.sorted.geojson
   307308  2023-04-10 18:57   FSM.sorted.geojson
   190892  2023-04-10 18:57   KIR.sorted.geojson
   242639  2023-04-10 18:57   PLW.sorted.geojson
   119872  2023-04-10 18:57   MHL.sorted.geojson
    44300  2023-04-10 18:57   TUV.sorted.geojson
    38006  2023-04-10 18:57   NRU.sorted.geojson
---------                     -------
1304779067                     14 files
$ unzip Oceania.sorted.zip NZL.sorted.geojson
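The hand-written a.sh above could also be generated with a loop. This is only a sketch: it assumes each .geojson file holds one feature per line (which the plain `sort` calls above imply), `sort_countries` is a hypothetical helper name, and `LC_ALL=C` is my addition to make the byte ordering reproducible across machines.

```shell
#!/usr/bin/env bash
set -eu

# Sort every per-country file in a directory, up to 4 at a time.
# ASSUMPTION: one GeoJSON feature per line, so line order carries no meaning.
sort_countries() {
    dir="$1"
    find "$dir" -maxdepth 1 -name '*.geojson' ! -name '*.sorted.geojson' -print0 |
        xargs -0 -P4 -I{} sh -c \
            'LC_ALL=C sort "$1" > "${1%.geojson}.sorted.geojson"' _ {}
}

# Made-up stand-in files, only to show the behaviour.
workdir="$(mktemp -d)"
printf 'b\na\n' > "$workdir/NRU.geojson"
printf 'z\ny\nx\n' > "$workdir/TUV.geojson"

sort_countries "$workdir"
ls "$workdir"
```

The sorted files can then be passed to `zip -9` exactly as in the transcript above.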

For some of the largest datasets, like Canada and Japan, the 3-letter country identifier is redundant, since every record in each of those ZIPs belongs to that one country.
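One way to check whether the identifier is redundant for a given file is to count its distinct country codes. The sketch below again assumes, purely for illustration, that the code sits in the first tab-separated column; `count_countries` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
set -eu

# Count distinct values in column 1 of a TSV.
# ASSUMPTION: column 1 is the 3-letter country code.
count_countries() {
    cut -f1 "$1" | sort -u | wc -l | tr -d ' '
}

# Made-up samples: one single-country file, one mixed file.
workdir="$(mktemp -d)"
printf 'JPN\t{"id":1}\nJPN\t{"id":2}\n' > "$workdir/single.tsv"
printf 'AUS\t{"id":1}\nNZL\t{"id":2}\nAUS\t{"id":3}\n' > "$workdir/mixed.tsv"

echo "single.tsv: $(count_countries "$workdir/single.tsv") distinct code(s)"
echo "mixed.tsv:  $(count_countries "$workdir/mixed.tsv") distinct code(s)"
```

A count of 1 means the column adds no information for that file.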

@marklit marklit changed the title Consider partitioning countries at the file-level rather than marking records in a TSV file Consider partitioning countries at the file-level rather than marking countries in a TSV row Apr 10, 2023
@marklit marklit changed the title Consider partitioning countries at the file-level rather than marking countries in a TSV row Consider partitioning countries at the file level rather than marking countries in a TSV row Apr 10, 2023