@@ -1,44 +1,44 @@
CREATE TABLE `iasworld.aasysjur`(
`jur` varchar(6),
In cases where the table was never bucketed, there will be line-level changes like this. In the bucketed case, an additional version of the table is removed.
`xcoord` decimal(10,0),
`ycoord` decimal(10,0),
`strcd` varchar(10),
`strreloc` varchar(150),
`jur` varchar(6),
`parid` varchar(30),
`card` decimal(4,0),
`lline` decimal(4,0),
`tble` varchar(30),
`tabseq` decimal(3,0),
`seq` decimal(3,0),
`cur` varchar(1),
`who` varchar(50),
`wen` string,
`adrpre` varchar(10),
`adrno` decimal(10,0),
`adrgrid` varchar(12),
`adradd` varchar(10),
`adrdir` varchar(2),
`adrstr` varchar(50),
`adrsuf` varchar(8),
`adrsuf2` varchar(8),
`cityname` varchar(50),
`statecode` varchar(2),
`unitdesc` varchar(20),
`unitno` varchar(20),
`zip1` varchar(5),
`zip2` varchar(4),
`loc2` varchar(40),
`defaddr` varchar(1),
`deactivat` string,
`iasw_id` decimal(10,0),
`trans_id` decimal(10,0),
`upd_status` varchar(1),
`gislink` varchar(20),
`bldgno` varchar(10),
`childparid` varchar(30),
`user1` varchar(40),
`user2` varchar(40),
`user3` varchar(40),
`user4` varchar(40),
`user5` varchar(40),
`user6` varchar(40),
`user7` varchar(40),
`user8` varchar(40),
`user9` varchar(40),
`user10` varchar(40),
`user11` varchar(40),
`user12` varchar(40),
`user13` varchar(40),
`user14` varchar(40),
`user15` varchar(40),
`adrid` decimal(10,0),
`adrparchild` varchar(1),
`adrstatus` varchar(2),
`adrpremod` varchar(20),
`adrpretype` varchar(20),
`adrpostmod` varchar(20),
`floorno` varchar(20),
`coopid` varchar(30),
`country` varchar(30),
`postalcode` varchar(10),
`addrsrc` varchar(10),
`addrvalid` varchar(1),
`xcoord` decimal(15,8),
`ycoord` decimal(15,8),
The `coord` fields are just `NUMBER` in Oracle, which will apparently just store the number as originally entered. When pulling via sqoop, `NUMBER` becomes the default Hive `decimal(10,0)`, so we adjust it manually here to account for the precision of these coords. We don't need to do this for most fields since they already have a precision and scale specified.
# Manually update some columns with corrected data types
for coord in xcoord ycoord zcoord; do
    sed -i "s/\`$coord\` decimal(10,0)/\`$coord\` decimal(15,8)/" "$TABLE".sql.tmp1
done
Super hot bash right here
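For context, a minimal before/after sketch of the generated DDL lines this substitution targets. The column types come straight from the diff above; the table name is made up for illustration, and `zcoord` is handled the same way in tables where it appears.

```sql
-- Before: sqoop maps Oracle NUMBER to Hive's default decimal(10,0),
-- which drops the fractional part of the coordinates.
CREATE TABLE `iasworld.legdat_example`(
  `parid` varchar(30),
  `xcoord` decimal(10,0),
  `ycoord` decimal(10,0)
);

-- After the sed substitution: precision widened so the coordinates
-- survive the pull intact.
CREATE TABLE `iasworld.legdat_example`(
  `parid` varchar(30),
  `xcoord` decimal(15,8),
  `ycoord` decimal(15,8)
);
```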
@@ -40,7 +45,7 @@ for TABLE in ${TABLES}; do
cp "$TABLE".sql.tmp1 "$TABLE".sql.tmp2
sed -i "/^CREATE TABLE/s/${TABLE_LC}/${TABLE_LC}\_bucketed/" "$TABLE".sql.tmp2
echo "PARTITIONED BY (\`taxyr\` string)
CLUSTERED BY (\`parid\`) SORTED BY (\`seq\`) INTO ${NUM_BUCKETS} BUCKETS
CLUSTERED BY (\`parid\`) SORTED BY (\`cur\`) INTO ${NUM_BUCKETS} BUCKETS
I changed the sorting field for bucketed/clustered files to `cur`, since we're much more likely to filter on that than the `seq` field.
[Question, non-blocking] Generally I think this is a better choice, but given that the only tables that seem to be bucketed anymore are `asmt_all` and `asmt_hist`, is it an issue that `cur` is actually rarely used in the specific case of `asmt_all`?
Nah, this will just sort the `cur` values for each PIN, but I don't actually think the sorting matters that much tbh.
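For reference, a rough sketch of the bucketed-table DDL the script ends up emitting after this change, assuming `NUM_BUCKETS=20` from the `asmt_all` row in the updated CSV below. The column list is abbreviated, and the Parquet storage clause is assumed from the rest of the PR rather than shown in this diff.

```sql
-- Rough shape of the generated bucketed-table definition (illustrative only;
-- the real column list is the full sqoop-generated one).
CREATE TABLE `iasworld.asmt_all_bucketed`(
  `parid` varchar(30),
  `seq` decimal(3,0),
  `cur` varchar(1)
)
PARTITIONED BY (`taxyr` string)
CLUSTERED BY (`parid`) SORTED BY (`cur`) INTO 20 BUCKETS
STORED AS PARQUET;  -- storage format assumed; the PR stores these as Parquet on S3
```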
Here are the actual changes to the number of buckets for each table. See ADDN in Athena/S3 for a table updated using the new definition and bucketing.
sqoop can't pull into a bucketed table directly. So my fix was to create a table for the initial pull and then a secondary bucketed table to copy into. Now that we're dropping bucketing for most tables, we can also drop the secondary table definition.
Two upsides here: faster queries and faster sqoop pull (since it doesn't need to transfer the data to a second table).
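For anyone reading along, a sketch of the two-step pattern that goes away for most tables. Table names, settings, and column handling here are illustrative; the exact statements in the script may differ.

```sql
-- Step 1 (unchanged): sqoop lands the data in a plain, non-bucketed table,
-- since it can't write into a bucketed table directly.

-- Step 2 (the part dropped for most tables): copy from that table into the
-- bucketed table so Hive can apply the CLUSTERED BY / SORTED BY layout.
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;

INSERT OVERWRITE TABLE iasworld.asmt_all_bucketed PARTITION (taxyr)
SELECT *                  -- assumes taxyr is the last selected column, as Hive
FROM iasworld.asmt_all;   -- requires for dynamic partition inserts

-- Without bucketing, step 2 and the second table definition disappear, so the
-- sqoop pull writes straight to the final table.
```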
Looking great, I like this cleanup! Only tangentially related, but I'm curious if you have resources explaining why Athena prefers larger files.
ASMT_ALL,TRUE,30
ASMT_HIST,TRUE,30
APRVAL,TRUE,1
ASMT_ALL,TRUE,20
[Question, non-blocking] Given that `cur` is supposed to be an enumerated type with at most 3 values, does it still make sense to set `NUM_BUCKETS=20`?
The bucketing here is actually by `parid`; `cur` is just a field to internally sort the Parquet once it's bucketed.
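To make the distinction concrete, here is a hypothetical query of the kind this layout is meant to serve (the PIN and `cur` value are made up for illustration): the `parid` filter can be pruned to a single bucket, while `cur` only affects row order inside each bucket file.

```sql
-- Illustrative query only. Bucketing on parid lets the engine read just the
-- bucket file whose hash matches this PIN; the within-bucket sort on cur
-- merely keeps the current-record rows grouped together inside that file.
SELECT *
FROM iasworld.asmt_all
WHERE taxyr = '2022'                  -- partition pruning
  AND parid = '01-23-456-789-0000'    -- bucket pruning on the clustered column
  AND cur = 'Y';                      -- sort key within the bucket, not the bucket key
```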
@jeancochrane Item 4 in this old AWS blog post says to aim for files ~128MB.
This PR makes a few adjustments to the Hive table definitions used by sqoop and the subsequent Parquet files that are saved to S3. Primarily, it:

- Fixes `legdat.xcoord` & `legdat.ycoord` truncation (#12).
- Drops bucketing for most tables. Bucketing splits each partition into `NUM_BUCKETS` files (e.g. `taxyr=2022/` would have 20 parquet files within it), resulting in small files. This is bad as Athena actually prefers larger files ~100MB.
- Sorts bucketed files by `cur` instead of `seq`. Athena bucketing sorts/splits files according to the value in a column. Ideally, we want to bucket on columns which are commonly used for filtering. Considering `seq` is never used in most queries, I figure `cur` would be a better choice.

Most of the heinously large diff here is whitespace changes and dropping a copy of the table that was necessary for bucketing. It should be safe to ignore most of the actual table-level definition changes.

Closes #12.