
Update extracted table definitions #13

Merged — 8 commits merged into master, Mar 23, 2024

Conversation

@dfsnow (Member) commented Mar 20, 2024

This PR makes a few adjustments to the Hive table definitions used by sqoop and the subsequent Parquet files that are saved to S3. Primarily, it:

  • Adjusts the scale and precision of the coordinate fields to deal with #12 (Investigate legdat.xcoord & legdat.ycoord truncation).
  • Removes file bucketing for most tables. Currently, Parquet files are split many times within a single partition (e.g. taxyr=2022/ would contain 20 Parquet files), resulting in many small files. This is bad, as Athena actually performs better with larger files (~100MB).
  • Adjusts bucketing to use cur instead of seq. Athena bucketing sorts/splits files according to the value in a column. Ideally, we want to bucket on columns that are commonly used for filtering. Since seq is rarely used in most queries, I figure cur is a better choice. A sketch of the two DDL shapes follows this list.
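
For reference, here's a minimal sketch of the two table styles involved. Column lists are elided and the STORED AS PARQUET clause is my assumption based on the description above; the real definitions are generated by the script and shown in the diffs below.

-- Unbucketed (most tables after this PR): Parquet files split only by partition
CREATE TABLE `iasworld.legdat`(
`jur` varchar(6),
`parid` varchar(30)
-- ... remaining columns elided ...
)
PARTITIONED BY (`taxyr` string)
STORED AS PARQUET;

-- Bucketed (e.g. asmt_all): files within each partition are further split by
-- parid and internally sorted by cur
CREATE TABLE `iasworld.asmt_all_bucketed`(
`jur` varchar(6),
`parid` varchar(30),
`cur` varchar(1)
-- ... remaining columns elided ...
)
PARTITIONED BY (`taxyr` string)
CLUSTERED BY (`parid`) SORTED BY (`cur`) INTO 20 BUCKETS
STORED AS PARQUET;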

Most of the heinously large diff here is whitespace changes and the removal of the secondary copy of each table that was previously needed for bucketing. It should be safe to ignore most of the actual table-level definition changes.

Closes #12.

@dfsnow added the enhancement label Mar 20, 2024
@dfsnow requested a review from jeancochrane March 20, 2024 20:36
@dfsnow self-assigned this Mar 20, 2024
@jeancochrane marked this pull request as draft March 20, 2024 20:39
@@ -1,44 +1,44 @@
CREATE TABLE `iasworld.aasysjur`(
`jur` varchar(6),
dfsnow (Member Author):
In cases where the table was never bucketed, there will be line-level changes like this. In the bucketed case, an additional version of the table is removed.

Comment on lines -62 to +63
`xcoord` decimal(10,0),
`ycoord` decimal(10,0),
`strcd` varchar(10),
`strreloc` varchar(150),
`jur` varchar(6),
`parid` varchar(30),
`card` decimal(4,0),
`lline` decimal(4,0),
`tble` varchar(30),
`tabseq` decimal(3,0),
`seq` decimal(3,0),
`cur` varchar(1),
`who` varchar(50),
`wen` string,
`adrpre` varchar(10),
`adrno` decimal(10,0),
`adrgrid` varchar(12),
`adradd` varchar(10),
`adrdir` varchar(2),
`adrstr` varchar(50),
`adrsuf` varchar(8),
`adrsuf2` varchar(8),
`cityname` varchar(50),
`statecode` varchar(2),
`unitdesc` varchar(20),
`unitno` varchar(20),
`zip1` varchar(5),
`zip2` varchar(4),
`loc2` varchar(40),
`defaddr` varchar(1),
`deactivat` string,
`iasw_id` decimal(10,0),
`trans_id` decimal(10,0),
`upd_status` varchar(1),
`gislink` varchar(20),
`bldgno` varchar(10),
`childparid` varchar(30),
`user1` varchar(40),
`user2` varchar(40),
`user3` varchar(40),
`user4` varchar(40),
`user5` varchar(40),
`user6` varchar(40),
`user7` varchar(40),
`user8` varchar(40),
`user9` varchar(40),
`user10` varchar(40),
`user11` varchar(40),
`user12` varchar(40),
`user13` varchar(40),
`user14` varchar(40),
`user15` varchar(40),
`adrid` decimal(10,0),
`adrparchild` varchar(1),
`adrstatus` varchar(2),
`adrpremod` varchar(20),
`adrpretype` varchar(20),
`adrpostmod` varchar(20),
`floorno` varchar(20),
`coopid` varchar(30),
`country` varchar(30),
`postalcode` varchar(10),
`addrsrc` varchar(10),
`addrvalid` varchar(1),
`xcoord` decimal(15,8),
`ycoord` decimal(15,8),
dfsnow (Member Author):
The coord fields are just NUMBER in Oracle, which apparently stores the number as originally entered. When pulled via sqoop, NUMBER becomes the default Hive decimal(10,0), so we adjust it manually here to account for the precision of these coords.

We don't need to do this for most fields, since they already have a precision and scale specified.
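
A quick illustration of what the widened type preserves (the coordinate value is made up; runnable as-is in Hive or Athena):

SELECT
  CAST(1152345.67891234 AS DECIMAL(10,0)) AS old_type, -- 1152346: fraction rounded away
  CAST(1152345.67891234 AS DECIMAL(15,8)) AS new_type; -- 1152345.67891234: preserved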

Comment on lines +34 to +37
# Manually update some columns with corrected data types
for coord in xcoord ycoord zcoord; do
sed -i "s/\`$coord\` decimal(10,0)/\`$coord\` decimal(15,8)/" "$TABLE".sql.tmp1
done
dfsnow (Member Author):
Super hot bash right here

@@ -40,7 +45,7 @@ for TABLE in ${TABLES}; do
cp "$TABLE".sql.tmp1 "$TABLE".sql.tmp2
sed -i "/^CREATE TABLE/s/${TABLE_LC}/${TABLE_LC}\_bucketed/" "$TABLE".sql.tmp2
echo "PARTITIONED BY (\`taxyr\` string)
CLUSTERED BY (\`parid\`) SORTED BY (\`seq\`) INTO ${NUM_BUCKETS} BUCKETS
CLUSTERED BY (\`parid\`) SORTED BY (\`cur\`) INTO ${NUM_BUCKETS} BUCKETS
dfsnow (Member Author):
I changed the sorting field for bucketed/clustered files to cur, since we're much more likely to filter on that than the seq field.
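
For context, the query pattern this sorting is meant to serve looks something like the following (a hypothetical filter, not a query from this repo; assumes cur holds a current-record flag like 'Y'):

SELECT parid, taxyr
FROM iasworld.asmt_all_bucketed
WHERE taxyr = '2022'
  AND cur = 'Y';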

jeancochrane:
[Question, non-blocking] Generally I think this is a better choice, but given that the only tables that seem to be bucketed anymore are asmt_all and asmt_hist, is it an issue that cur is actually rarely used in the specific case of asmt_all?

dfsnow (Member Author):
Nah, this will just sort the cur values for each PIN, but I don't actually think the sorting matters that much tbh.

dfsnow (Member Author):
Here are the actual changes to the number of buckets for each table. See ADDN in Athena/S3 for a table updated using the new definition and bucketing.

dfsnow (Member Author):
sqoop can't pull into a bucketed table directly, so my fix was to create a table for the initial pull and then a secondary bucketed table to copy into. Now that we're dropping bucketing for most tables, we can also drop the secondary table definition.

Two upsides here: faster queries and a faster sqoop pull (since it doesn't need to transfer the data to a second table).
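
For anyone following along, the old two-step flow looked roughly like this (a sketch, not the repo's actual statements; dynamic-partition settings elided):

-- 1. sqoop lands the raw pull in a plain partitioned table (iasworld.asmt_all)
-- 2. a second pass copies it into the bucketed definition
INSERT OVERWRITE TABLE iasworld.asmt_all_bucketed
PARTITION (taxyr)
SELECT * FROM iasworld.asmt_all;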

@dfsnow marked this pull request as ready for review March 21, 2024 17:11
@dfsnow requested a review from wrridgeway March 21, 2024 17:11
@jeancochrane left a comment

Looking great, I like this cleanup! Only tangentially related, but I'm curious if you have resources explaining why Athena prefers larger files.

ASMT_ALL,TRUE,30
ASMT_HIST,TRUE,30
APRVAL,TRUE,1
ASMT_ALL,TRUE,20

jeancochrane:

[Question, non-blocking] Given that cur is supposed to be an enumerated type with at most 3 values, does it still make sense to set NUM_BUCKETS=20?

dfsnow (Member Author):
The bucketing here is actually by parid; cur is just a field used to internally sort the Parquet once it's bucketed.
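
Put differently (a conceptual sketch of Hive's bucketing semantics, not code from this repo):

-- For the asmt_all definition above:
--   bucket for a row = hash(parid) mod 20  -- CLUSTERED BY picks which file
--   order in a file  = cur                 -- SORTED BY only orders rows within it
-- A low-cardinality cur never limits the bucket count; parid's cardinality does.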

@dfsnow (Member Author) commented Mar 23, 2024

Looking great, I like this cleanup! Only tangentially related, but I'm curious if you have resources explaining why Athena prefers larger files.

@jeancochrane Item 4 in this old AWS blog post says to aim for files ~128MB.

@dfsnow merged commit d006a19 into master Mar 23, 2024
1 check passed
@dfsnow deleted the dfsnow/update-table-definitions branch March 23, 2024 16:38
@dfsnow mentioned this pull request Apr 8, 2024