Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document Full Refresh Overwrite Deduped #43413

Merged
merged 4 commits into from
Aug 8, 2024
Merged

Conversation

gosusnp
Copy link
Contributor

@gosusnp gosusnp commented Aug 8, 2024

What

Document Full Refresh Overwrite Deduped

Review guide

User Impact

Can this PR be safely reverted and rolled back?

  • YES πŸ’š
  • NO ❌

@gosusnp gosusnp requested a review from davinchia August 8, 2024 21:32
Copy link

vercel bot commented Aug 8, 2024

The latest updates on your projects. Learn more about Vercel for Git β†—οΈŽ

Name Status Preview Comments Updated (UTC)
airbyte-docs βœ… Ready (Inspect) Visit Preview πŸ’¬ Add feedback Aug 8, 2024 11:43pm


## Destination-specific mechanism for full refresh

The mechanism by which a destination connector acomplishes the full refresh will vary wildly from destination to destinaton. For our certified database and data warehouse destinations, we will be recreating the final table each sync. This allows us leave the previous sync's data viewable by writing to a "final-table-tmp" location as the sync is running, and at the end dropping the olf "final" table, and renaming the new one into place. That said, this may not possible for all destinations, and we may need to erase the existing data at the start of each full-refresh sync.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The mechanism by which a destination connector acomplishes the full refresh will vary wildly from destination to destinaton. For our certified database and data warehouse destinations, we will be recreating the final table each sync. This allows us leave the previous sync's data viewable by writing to a "final-table-tmp" location as the sync is running, and at the end dropping the olf "final" table, and renaming the new one into place. That said, this may not possible for all destinations, and we may need to erase the existing data at the start of each full-refresh sync.
The mechanism by which a destination connector acomplishes the full refresh will vary wildly from destination to destinaton. For our certified database and data warehouse destinations, we will be recreating the final table each sync. This allows us leave the previous sync's data viewable by writing to a "final-table-tmp" location as the sync is running, and at the end dropping the old "final" table, and renaming the new one into place. That said, this may not possible for all destinations, and we may need to erase the existing data at the start of each full-refresh sync.

The **Full Refresh** modes are the simplest methods that Airbyte uses to sync data, as they always retrieve all available information requested from the source, regardless of whether it has been synced before. This contrasts with [**Incremental sync**](./incremental-append.md), which does not sync data that has already been synced before.

In the **Overwrite + Deduped** variant, new syncs will replace all data in the existing destination table and then pull the new data in. Therefore, data that has been removed from the source after an old sync will be deleted in the destination table. The **deduped** variant also means that the data in the final table will be unique per primary key \(unlike [Full Refresh Overwrite](./full-refresh-overwrite.md)\). This is determined by sorting the data using the cursor field and keeping only the latest de-duplicated data row.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a quick sentence here about "As Full Refresh syncs are Resumable as much as possible, we recommend Full Refresh Overwrite Dedup if avoiding duplicates is a strong requirement. Bear in mind this incurs additional warehouse costs.', and linking the Resumable doc page in there.

Something like this!

Copy link
Contributor

@davinchia davinchia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggestions. Feel free to merge after!

@gosusnp gosusnp merged commit 599b569 into master Aug 8, 2024
29 checks passed
@gosusnp gosusnp deleted the gosusnp/update-frod-doc branch August 8, 2024 23:51
LouisAuneau pushed a commit to LouisAuneau/airbyte that referenced this pull request Aug 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants