[SNOW-147] Filter out deleted ACLs from ACL_LATEST #140

Merged
merged 9 commits into from
Feb 18, 2025

jaymedina
Contributor

problem

aclsnapshots does not capture deleted ACLs, and there was an issue in the way the latest ACLs were retrieved. Both are addressed in the solution below.

solution

  • Remove R script creating the ACL_LATEST dynamic table
  • Add V script re-introducing the ACL_LATEST dynamic table, refactored
  • Add a snapshot_date window so that only the latest ACLs are kept, which should minimize the risk of including deleted ACLs
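
As a rough illustration of the snapshot_date window, here is a minimal Python sketch (a stand-in for the Snowflake logic, not the SQL in this PR). It assumes that live ACLs keep appearing in recent snapshots while deleted ones stop being snapshotted; that assumption, the helper name `latest_within_window`, and the sample rows are all hypothetical.

```python
from datetime import date, timedelta

def latest_within_window(rows, today, days=14):
    """Keep the most recent row per owner_id among rows inside the window.

    rows: iterable of (owner_id, snapshot_date, acl) tuples (hypothetical shape).
    """
    cutoff = today - timedelta(days=days)
    latest = {}
    for owner_id, snapshot_date, acl in rows:
        if snapshot_date < cutoff:
            continue  # too old: treated as stale, possibly a deleted ACL
        if owner_id not in latest or snapshot_date > latest[owner_id][1]:
            latest[owner_id] = (owner_id, snapshot_date, acl)
    return latest

# Hypothetical sample data, not taken from aclsnapshots.
rows = [
    (1, date(2025, 2, 10), {100: ["read"]}),
    (1, date(2025, 2, 3), {100: ["read"], 200: ["read"]}),
    (7, date(2024, 11, 1), {300: ["read"]}),  # last seen months ago
]
result = latest_within_window(rows, today=date(2025, 2, 13))
```

Owner 7 drops out because its last snapshot predates the cutoff. The same mechanism would also drop a live ACL that simply has not been snapshotted recently, which is the trade-off of a fixed window.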

Original Process

  1. Unpack
  2. Dedup

Original Data Structure

In this example, principal ID 200 gets READ privileges removed in 2025.

| owner-id | record-access | timestamp |
|---|---|---|
| 2 | {100: [read, download], 200: [read]} | 12-1-2024 |
| 2 | {100: [read, download]} | 1-1-2025 |

Data Structure (After Unpacking & Deduplicating)

Notice how unpacking before deduplicating, with a duplicate defined as a repeated combination of owner_id, access_type, and principal_id, lets the ACL for pid 200 sneak into the final latest table, even though its permissions were removed on 1-1-2025.

| owner-id | access-type | pid | timestamp |
|---|---|---|---|
| 2 | read | 100 | 1-1-2025 |
| 2 | download | 100 | 1-1-2025 |
| 2 | read | 200 | 12-1-2024 |

New Process

  1. Dedup
  2. Unpack

Data Structure (After Deduplicating & Unpacking)

When we first deduplicate based only on owner_id, and then unpack the access_type and principal_id, only the ACL from 1-1-2025 is kept in the latest table.

| owner-id | access-type | pid | timestamp |
|---|---|---|---|
| 2 | read | 100 | 1-1-2025 |
| 2 | download | 100 | 1-1-2025 |
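
The difference between the two orderings can be reproduced with a small Python sketch built on the example data above. This is only an illustration of the idea, not the Snowflake SQL from this PR; the helper names `unpack` and `latest_by` are hypothetical.

```python
from datetime import date

# Example snapshots from the tables above: (owner_id, record_access, timestamp)
snapshots = [
    (2, {100: ["read", "download"], 200: ["read"]}, date(2024, 12, 1)),
    (2, {100: ["read", "download"]}, date(2025, 1, 1)),
]

def unpack(rows):
    """Emit one (owner_id, access_type, pid, timestamp) row per permission."""
    return [
        (owner_id, access_type, pid, ts)
        for owner_id, acl, ts in rows
        for pid, access_types in acl.items()
        for access_type in access_types
    ]

def latest_by(rows, key, ts):
    """Keep only the most recent row for each key."""
    latest = {}
    for row in rows:
        k = key(row)
        if k not in latest or ts(row) > ts(latest[k]):
            latest[k] = row
    return list(latest.values())

# Original process: unpack, then dedup on (owner_id, access_type, pid).
old = latest_by(unpack(snapshots), key=lambda r: r[:3], ts=lambda r: r[3])

# New process: dedup on owner_id alone, then unpack.
new = unpack(latest_by(snapshots, key=lambda r: r[0], ts=lambda r: r[2]))

old_pids = {pid for _, _, pid, _ in old}
new_pids = {pid for _, _, pid, _ in new}
```

Here `old_pids` still contains 200 because the (2, read, 200) combination has no newer row to replace it, while `new_pids` contains only 100.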

testing

For testing, I created a temporary table for `ACL_LATEST` that I ran all the queries against.
```sql
CREATE OR REPLACE TEMPORARY TABLE temp_acl_latest AS
-- Step 1 (dedup): keep only the most recent snapshot per OWNER_ID
-- among snapshots from the last 14 days.
WITH dedup_acl AS (
    SELECT
        *,
        parse_json(resource_access) AS acl
    FROM synapse_data_warehouse.SYNAPSE_RAW.ACLSNAPSHOTS
    WHERE
        SNAPSHOT_DATE >= CURRENT_TIMESTAMP - INTERVAL '14 days'
    QUALIFY
        ROW_NUMBER() OVER (
            PARTITION BY OWNER_ID
            ORDER BY CHANGE_TIMESTAMP DESC, SNAPSHOT_TIMESTAMP DESC
        ) = 1
),
-- Step 2 (unpack): one row per entry in the ACL.
-- The keys appear under several spellings, so try each in turn.
dedup_acl_expanded AS (
    SELECT
        CREATED_ON,
        CHANGE_TIMESTAMP,
        CHANGE_TYPE,
        OWNER_ID,
        OWNER_TYPE,
        SNAPSHOT_DATE,
        SNAPSHOT_TIMESTAMP,
        COALESCE(
            array_sort(value:"accesstype"::variant),
            array_sort(value:"accessType"::variant),
            array_sort(value:"accesstype#1"::variant),
            array_sort(value:"accesstype#2"::variant),
            array_sort(value:"accesstype#3"::variant)
        ) AS access_type,
        COALESCE(
            value:"principalId"::number,
            value:"principalid"::number,
            value:"principalid#1"::number
        ) AS principal_id
    FROM
        dedup_acl,
        LATERAL FLATTEN(acl, outer => TRUE)
)
SELECT
    *
FROM
    dedup_acl_expanded;
```
Next, I queried for the discrepancies between the temporary table and the current `acl_latest`:
```sql
WITH temp_counts AS (
    -- temp_acl_latest names this column access_type (see the table definition)
    SELECT owner_id, COUNT(access_type) AS temp_count
    FROM temp_acl_latest
    GROUP BY owner_id
),
acl_counts AS (
    SELECT owner_id, COUNT(access_type) AS acl_count
    FROM synapse_data_warehouse.synapse.acl_latest
    GROUP BY owner_id
)
SELECT
    COALESCE(t.owner_id, a.owner_id) AS owner_id,
    t.temp_count,
    a.acl_count,
    ABS(COALESCE(t.temp_count, 0) - COALESCE(a.acl_count, 0)) AS difference
FROM temp_counts t
FULL OUTER JOIN acl_counts a
    ON t.owner_id = a.owner_id
WHERE COALESCE(t.temp_count, 0) != COALESCE(a.acl_count, 0);
```
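
For readers less familiar with the FULL OUTER JOIN plus COALESCE pattern, the comparison amounts to the following Python sketch. It is an illustration only: the function name `count_discrepancies`, the row shape, and the sample rows are hypothetical and not taken from either table.

```python
from collections import Counter

def count_discrepancies(temp_rows, current_rows):
    """Return {owner_id: (temp_count, current_count)} where the counts differ.

    Rows are (owner_id, access_type, principal_id) tuples. Rows with a null
    access_type are not counted, mirroring COUNT(access_type). An owner that
    is missing from one side counts as 0, mirroring COALESCE(..., 0).
    """
    temp_counts = Counter(o for o, access_type, _ in temp_rows if access_type is not None)
    current_counts = Counter(o for o, access_type, _ in current_rows if access_type is not None)
    owners = temp_counts.keys() | current_counts.keys()
    return {
        o: (temp_counts[o], current_counts[o])
        for o in sorted(owners)
        if temp_counts[o] != current_counts[o]
    }

# Hypothetical sample rows.
temp_rows = [(2, "read", 100), (2, "download", 100)]
current_rows = [(2, "read", 100), (2, "download", 100), (2, "read", 200), (3, "read", 300)]
diff = count_discrepancies(temp_rows, current_rows)
```

Owners that exist in only one of the two tables still show up in the result, which is the reason for the full outer join rather than an inner join.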
Lastly, I arbitrarily selected entities (`owner_id`) from the previous query and compared the ACL results from the new solution table and the current ACL_LATEST table against the entity itself on Synapse.
```sql
select owner_id, change_timestamp, access_type, principal_id
from temp_acl_latest
where owner_id = 24246541
order by change_timestamp desc, principal_id desc;

select *
from synapse_data_warehouse.synapse.acl_latest
where owner_id = 24246541
order by change_timestamp desc, principal_id desc;
```
Here are some example results that confirm the results from `temp_acl_latest` are more accurate:
  1. syn3482905 no longer exists; its ACL does not show up in the new solution, but does in acl_latest.
  2. The same is true of syn27899252.
  3. syn63997960 has 6 ACLs on Synapse, matching the new solution. The original solution still lists 3 ACLs that were removed.

Leveraging NODE_LATEST/TEAM_LATEST for further verification

I looked to see what nodes might be missing from the new solution, and found that all the ones listed are nodes that don't exist anymore (a separate problem for node_latest).

```sql
SELECT *
FROM synapse.node_latest
WHERE benefactor_id = id
  AND id NOT IN (
    SELECT DISTINCT owner_id
    FROM temp_acl_latest
  );
```

Likewise, I did the same for teams and saw that most of the missing IDs are missing either because they don't exist in the original aclsnapshots table or because of the 14-day cutoff in our new solution (the query below finds the ones that are in aclsnapshots but don't make it through the 14-day cutoff):

```sql
with ids_in_team_latest_not_in_acl_latest as (
    select id
    from synapse.team_latest
    where id not in (
        select distinct owner_id
        from temp_acl_latest
    )
    order by snapshot_timestamp desc, change_timestamp desc
)
select *
from synapse_raw.aclsnapshots
where owner_id in (select id from ids_in_team_latest_not_in_acl_latest);
```

@jaymedina jaymedina marked this pull request as ready for review February 13, 2025 20:24
@jaymedina jaymedina requested a review from a team as a code owner February 13, 2025 20:24
@philerooski
Collaborator

No differences between that interactive rebase I did and this new branch, so you're good on that front 👍

…est_refactored.sql

Co-authored-by: BryanFauble <17128019+BryanFauble@users.noreply.github.com>
Contributor

@BryanFauble BryanFauble left a comment

LGTM!

@jaymedina
Contributor Author

Thanks! I'll be merging this once #142 merges into main.

@jaymedina
Contributor Author

jaymedina commented Feb 18, 2025

I just bumped up to the latest version, would y'all mind giving this one last review before I merge? Thanks!

@jaymedina jaymedina merged commit 68e43e3 into dev Feb 18, 2025
3 checks passed
@jaymedina jaymedina deleted the snow-147-acl-latest-remove-deleted-2 branch February 18, 2025 18:34