
Update extracted table definitions #13

Merged — 8 commits merged into master, Mar 23, 2024

Conversation

@dfsnow (Member) commented Mar 20, 2024

This PR makes a few adjustments to the Hive table definitions used by sqoop and the subsequent Parquet files that are saved to S3. Primarily, it:

  • Adjusts the scale and precision of the coordinate fields to deal with #12 (Investigate legdat.xcoord & legdat.ycoord truncation).
  • Removes file bucketing for most tables. Currently, Parquet files are split many times within a single partition (e.g. taxyr=2022/ would contain 20 Parquet files), resulting in many small files. This is bad, as Athena actually performs better with larger files (~100MB).
  • Adjusts bucketing to use cur instead of seq. Athena bucketing sorts/splits files according to the value in a column. Ideally, we want to bucket on columns that are commonly used for filtering. Since seq is rarely used in most queries, I figure cur is a better choice. A sketch of the two DDL shapes follows this list.
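
For reference, here's a minimal sketch of the two table styles involved. Column lists are elided and the STORED AS PARQUET clause is my assumption based on the description above; the real definitions are generated by the script and shown in the diffs below.

-- Unbucketed (most tables after this PR): Parquet files split only by partition
CREATE TABLE `iasworld.legdat`(
`jur` varchar(6),
`parid` varchar(30)
-- ... remaining columns elided ...
)
PARTITIONED BY (`taxyr` string)
STORED AS PARQUET;

-- Bucketed (e.g. asmt_all): files within each partition are further split by
-- parid and internally sorted by cur
CREATE TABLE `iasworld.asmt_all_bucketed`(
`jur` varchar(6),
`parid` varchar(30),
`cur` varchar(1)
-- ... remaining columns elided ...
)
PARTITIONED BY (`taxyr` string)
CLUSTERED BY (`parid`) SORTED BY (`cur`) INTO 20 BUCKETS
STORED AS PARQUET;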

Most of the heinously large diff here is whitespace changes and the removal of the secondary copy of each table that was previously needed for bucketing. It should be safe to ignore most of the actual table-level definition changes.

Closes #12.

@dfsnow added the enhancement label Mar 20, 2024
@dfsnow requested a review from jeancochrane March 20, 2024 20:36
@dfsnow self-assigned this Mar 20, 2024
@jeancochrane marked this pull request as draft March 20, 2024 20:39
@@ -1,44 +1,44 @@
CREATE TABLE `iasworld.aasysjur`(
`jur` varchar(6),
dfsnow (Member Author):
In cases where the table was never bucketed, there will be line-level changes like this. In the bucketed case, an additional version of the table is removed.

Comment on lines -62 to +63
`xcoord` decimal(10,0),
`ycoord` decimal(10,0),
`strcd` varchar(10),
`strreloc` varchar(150),
`jur` varchar(6),
`parid` varchar(30),
`card` decimal(4,0),
`lline` decimal(4,0),
`tble` varchar(30),
`tabseq` decimal(3,0),
`seq` decimal(3,0),
`cur` varchar(1),
`who` varchar(50),
`wen` string,
`adrpre` varchar(10),
`adrno` decimal(10,0),
`adrgrid` varchar(12),
`adradd` varchar(10),
`adrdir` varchar(2),
`adrstr` varchar(50),
`adrsuf` varchar(8),
`adrsuf2` varchar(8),
`cityname` varchar(50),
`statecode` varchar(2),
`unitdesc` varchar(20),
`unitno` varchar(20),
`zip1` varchar(5),
`zip2` varchar(4),
`loc2` varchar(40),
`defaddr` varchar(1),
`deactivat` string,
`iasw_id` decimal(10,0),
`trans_id` decimal(10,0),
`upd_status` varchar(1),
`gislink` varchar(20),
`bldgno` varchar(10),
`childparid` varchar(30),
`user1` varchar(40),
`user2` varchar(40),
`user3` varchar(40),
`user4` varchar(40),
`user5` varchar(40),
`user6` varchar(40),
`user7` varchar(40),
`user8` varchar(40),
`user9` varchar(40),
`user10` varchar(40),
`user11` varchar(40),
`user12` varchar(40),
`user13` varchar(40),
`user14` varchar(40),
`user15` varchar(40),
`adrid` decimal(10,0),
`adrparchild` varchar(1),
`adrstatus` varchar(2),
`adrpremod` varchar(20),
`adrpretype` varchar(20),
`adrpostmod` varchar(20),
`floorno` varchar(20),
`coopid` varchar(30),
`country` varchar(30),
`postalcode` varchar(10),
`addrsrc` varchar(10),
`addrvalid` varchar(1),
`xcoord` decimal(15,8),
`ycoord` decimal(15,8),
dfsnow (Member Author):
The coord fields are just NUMBER in Oracle, which apparently stores the number as originally entered. When pulled via sqoop, NUMBER becomes the default Hive decimal(10,0), so we adjust it manually here to account for the precision of these coords.

We don't need to do this for most fields, since they already have a precision and scale specified.
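
A quick illustration of what the widened type preserves (the coordinate value is made up; runnable as-is in Hive or Athena):

SELECT
  CAST(1152345.67891234 AS DECIMAL(10,0)) AS old_type, -- 1152346: fraction rounded away
  CAST(1152345.67891234 AS DECIMAL(15,8)) AS new_type; -- 1152345.67891234: preserved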

Comment on lines +34 to +37
# Manually update some columns with corrected data types
for coord in xcoord ycoord zcoord; do
sed -i "s/\`$coord\` decimal(10,0)/\`$coord\` decimal(15,8)/" "$TABLE".sql.tmp1
done
dfsnow (Member Author):
Super hot bash right here

@@ -40,7 +45,7 @@ for TABLE in ${TABLES}; do
cp "$TABLE".sql.tmp1 "$TABLE".sql.tmp2
sed -i "/^CREATE TABLE/s/${TABLE_LC}/${TABLE_LC}\_bucketed/" "$TABLE".sql.tmp2
echo "PARTITIONED BY (\`taxyr\` string)
CLUSTERED BY (\`parid\`) SORTED BY (\`seq\`) INTO ${NUM_BUCKETS} BUCKETS
CLUSTERED BY (\`parid\`) SORTED BY (\`cur\`) INTO ${NUM_BUCKETS} BUCKETS
dfsnow (Member Author):
I changed the sorting field for bucketed/clustered files to cur, since we're much more likely to filter on that than the seq field.
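
For context, the query pattern this sorting is meant to serve looks something like the following (a hypothetical filter, not a query from this repo; assumes cur holds a current-record flag like 'Y'):

SELECT parid, taxyr
FROM iasworld.asmt_all_bucketed
WHERE taxyr = '2022'
  AND cur = 'Y';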

jeancochrane:
[Question, non-blocking] Generally I think this is a better choice, but given that the only tables that seem to be bucketed anymore are asmt_all and asmt_hist, is it an issue that cur is actually rarely used in the specific case of asmt_all?

dfsnow (Member Author):
Nah, this will just sort the cur values for each PIN, but I don't actually think the sorting matters that much tbh.

dfsnow (Member Author):
Here are the actual changes to the number of buckets for each table. See ADDN in Athena/S3 for a table updated using the new definition and bucketing.

dfsnow (Member Author):
sqoop can't pull into a bucketed table directly, so my fix was to create a table for the initial pull and then a secondary bucketed table to copy into. Now that we're dropping bucketing for most tables, we can also drop the secondary table definition.

Two upsides here: faster queries and a faster sqoop pull (since it doesn't need to transfer the data to a second table).
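
For anyone following along, the old two-step flow looked roughly like this (a sketch, not the repo's actual statements; dynamic-partition settings elided):

-- 1. sqoop lands the raw pull in a plain partitioned table (iasworld.asmt_all)
-- 2. a second pass copies it into the bucketed definition
INSERT OVERWRITE TABLE iasworld.asmt_all_bucketed
PARTITION (taxyr)
SELECT * FROM iasworld.asmt_all;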

@dfsnow marked this pull request as ready for review March 21, 2024 17:11
@dfsnow requested a review from wrridgeway March 21, 2024 17:11
@jeancochrane left a comment

Looking great, I like this cleanup! Only tangentially related, but I'm curious if you have resources explaining why Athena prefers larger files.

ASMT_ALL,TRUE,30
ASMT_HIST,TRUE,30
APRVAL,TRUE,1
ASMT_ALL,TRUE,20

jeancochrane:

[Question, non-blocking] Given that cur is supposed to be an enumerated type with at most 3 values, does it still make sense to set NUM_BUCKETS=20?

dfsnow (Member Author):
The bucketing here is actually by parid; cur is just a field used to internally sort the Parquet once it's bucketed.
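
Put differently (a conceptual sketch of Hive's bucketing semantics, not code from this repo):

-- For the asmt_all definition above:
--   bucket for a row = hash(parid) mod 20  -- CLUSTERED BY picks which file
--   order in a file  = cur                 -- SORTED BY only orders rows within it
-- A low-cardinality cur never limits the bucket count; parid's cardinality does.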

@dfsnow (Member Author) commented Mar 23, 2024

Looking great, I like this cleanup! Only tangentially related, but I'm curious if you have resources explaining why Athena prefers larger files.

@jeancochrane Item 4 in this old AWS blog post says to aim for files ~128MB.

@dfsnow merged commit d006a19 into master Mar 23, 2024
1 check passed
@dfsnow deleted the dfsnow/update-table-definitions branch March 23, 2024 16:38
@dfsnow mentioned this pull request Apr 8, 2024