Is this your first time submitting a feature request?
Describe the feature
https://aws.amazon.com/blogs/aws/aws-glue-data-catalog-now-supports-automatic-compaction-of-apache-iceberg-tables/
Honestly I have not fully thought through how it would work; I'm hoping to spark some discussion in the thread.
Perhaps just another config variable for this? E.g.
```sql
-- materialized could also be insert_by_period; not relevant for table/view
{{ config(
    materialized='incremental',
    table_type='iceberg',
    use_glue_automatic_compaction=true,
) }}
```
When use_glue_automatic_compaction is specified, then we would use the Glue {Create,Update}TableOptimizer API operations to create the optimiser for compaction.
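For context, a minimal boto3 sketch of what that call could look like (the catalog ID, database/table names, and role ARN are placeholders, and the exact TableOptimizerConfiguration shape should be double-checked against the current Glue API docs):

```python
import boto3

glue = boto3.client("glue")

# Sketch: enable Glue automatic compaction for one Iceberg table.
# All identifiers below are placeholders for illustration only.
glue.create_table_optimizer(
    CatalogId="123456789012",
    DatabaseName="my_schema",
    TableName="my_model",
    Type="compaction",
    TableOptimizerConfiguration={
        "roleArn": "arn:aws:iam::123456789012:role/GlueCompactionRole",
        "enabled": True,
    },
)
```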
Describe alternatives you've considered
You can just run OPTIMIZE {{ this.schema }}.{{ this.identifier }} ... in your post_hook, yes, but on a full-refresh of a very large table (e.g. one requiring insert_by_period) this may fail due to a timeout or the Iceberg "not finished, please run compaction again" message. Regardless, I think it would be good to let AWS just handle it.
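For example, a minimal sketch of that alternative (REWRITE DATA USING BIN_PACK is Athena's compaction statement for Iceberg tables; whether you also need a WHERE clause depends on your partitioning):

```sql
{{ config(
    materialized='incremental',
    table_type='iceberg',
    post_hook="OPTIMIZE {{ this.schema }}.{{ this.identifier }} REWRITE DATA USING BIN_PACK"
) }}
```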
Caveat: I haven't actually tried to use the automatic compaction feature, so I have no idea how it performs in practice. Maybe it just scans your entire table once a day and you get charged for 100 DPUs 😂.
Who will this benefit?
Anybody with large datasets in Iceberg. I would expect quite a lot of overlap with users of insert_by_period.
Are you interested in contributing this feature?
Maybe; it depends on how much work it would be.
Anything else?
#514 is somewhat related, in the realm of "table optimisation management".
> You can just run OPTIMIZE {{ this.schema }}.{{ this.identifier }} ... in your post_hook, yes, but on a full-refresh of a very large table (e.g. one requiring insert_by_period) this may fail due to a timeout or the Iceberg "not finished, please run compaction again" message. Regardless, I think it would be good to let AWS just handle it.
If you partition your table, I can share a post-hook that is ugly but does the job: it pretty much optimises your table by partition values, using a batch size < 100 to avoid the partition limitation issue in Athena.
That said, what you describe is relevant and should be relatively easy to implement:
- we need a new method in impl.py that makes the boto3 call to set up automatic compaction;
- we expose that method via a macro that can then be called from the different table materialisations (a rough sketch of that second step follows below).
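Purely as an illustration of how the macro side could look (the macro name, the config flag, and the adapter method are all made up for this sketch; the adapter method would wrap the boto3 call shown earlier and be exposed via @available in impl.py):

```sql
{# Sketch only: assumes impl.py exposes an @available method named
   enable_glue_automatic_compaction that makes the Glue API call. #}
{% macro apply_glue_automatic_compaction(relation) %}
  {% if config.get('use_glue_automatic_compaction', default=false) %}
    {% do adapter.enable_glue_automatic_compaction(relation.schema, relation.identifier) %}
  {% endif %}
{% endmacro %}
```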
> If you partition your table, I can share a post-hook that is ugly but does the job: it pretty much optimises your table by partition values, using a batch size < 100 to avoid the partition limitation issue in Athena.
😮 I would love to see it! I think it might also be helpful for others who stumble upon this ticket.
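To give an idea of the shape such a post-hook could take (this is not the macro offered above, just a rough sketch; it assumes a single string-typed partition column, here called dt, and a batch size chosen to stay under Athena's per-statement partition limit):

```sql
{# Sketch only: compact an Iceberg table partition-by-partition in batches. #}
{% macro optimize_by_partition(relation, partition_column='dt', batch_size=90) %}
  {% if execute %}
    {% set results = run_query('select distinct ' ~ partition_column ~ ' from ' ~ relation) %}
    {% for batch in results.columns[0].values() | batch(batch_size) %}
      {% set in_list = "'" ~ (batch | map('string') | join("','")) ~ "'" %}
      {% do run_query(
        'OPTIMIZE ' ~ relation ~ ' REWRITE DATA USING BIN_PACK WHERE '
        ~ partition_column ~ ' IN (' ~ in_list ~ ')'
      ) %}
    {% endfor %}
  {% endif %}
{% endmacro %}
```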