Add new content ID function #1766

Open · TG1999 wants to merge 4 commits into main from add_new_content_id_function
Conversation

@TG1999 (Contributor) commented on Jan 29, 2025

Reference: #1583

@keshav-space (Member) left a comment

Thanks @TG1999, see some suggestions below.

# Normalize fields
normalized_data = {
    "summary": normalize_text(advisory_data.summary),
    "affected_packages": normalize_list(advisory_data.affected_packages),
@keshav-space (Member) commented on Jan 31, 2025

There's no guarantee this will work, since we don't have a proper implementation for comparing AffectedPackage objects.

For example, this will not be able to normalize the list of affected_packages below:

affected_packages = [
    AffectedPackage(
        package=PackageURL(
            type="alpine",
            namespace=None,
            name="linux-lts",
            version=None,
            qualifiers={
                "arch": "aarch64",
                "distroversion": "v3.20",
                "reponame": "main",
            },
            subpath=None,
        ),
        affected_version_range=None,
        fixed_version="6.6.13-r1",
    ),
    AffectedPackage(
        package=PackageURL(
            type="alpine",
            namespace=None,
            name="linux-lts",
            version=None,
            qualifiers={"arch": "armhf", "distroversion": "v3.21", "reponame": "main"},
            subpath=None,
        ),
        affected_version_range=None,
        fixed_version="6.6.13-r1",
    ),
]
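A minimal sketch of one way to make this list normalizable, assuming normalize_list sorts its items before hashing: give AffectedPackage a deterministic comparison key. The _cmp_key below mirrors the one discussed later in this review; sort_affected_packages is a hypothetical helper.

# Hypothetical sketch: a deterministic sort key for AffectedPackage.
def _cmp_key(self):
    # Stringify each part so None values and version-range objects
    # compare consistently.
    return (
        str(self.package),
        str(self.affected_version_range or ""),
        str(self.fixed_version or ""),
    )

def sort_affected_packages(affected_packages):
    # A stable ordering makes the computed content ID reproducible.
    return sorted(affected_packages, key=lambda pkg: pkg._cmp_key())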

"summary": normalize_text(advisory_data.summary),
"affected_packages": normalize_list(advisory_data.affected_packages),
"references": normalize_list(advisory_data.references),
"weaknesses": normalize_list(advisory_data.weaknesses),
@keshav-space (Member) commented

Similar to the affected_packages list above, this also won't work.

}

if include_metadata:
    normalized_data["created_by"] = advisory_data.created_by
@keshav-space (Member) commented

created_by is a model field, not an attribute of AdvisoryData.
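A minimal sketch of the distinction, assuming the metadata should come from the Advisory model instance (named advisory here for illustration) rather than from AdvisoryData:

if include_metadata:
    # created_by lives on the Advisory model, not on AdvisoryData,
    # so it has to be read from the model instance.
    normalized_data["created_by"] = advisory.created_by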

Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
@TG1999 force-pushed the add_new_content_id_function branch from 8793912 to ebf1a32 on February 6, 2025
@keshav-space (Member) left a comment

Thanks @TG1999, some nits for your consideration.

@@ -1369,6 +1370,11 @@ def save(self, *args, **kwargs):
        self.unique_content_id = checksum.hexdigest()
        super().save(*args, **kwargs)

    def save(self, *args, **kwargs):
        advisory_data = self.to_advisory_data()
        self.unique_content_id = compute_content_id(advisory_data, include_metadata=False)[:31]
@keshav-space (Member) commented

Why trim the hash to 31 characters when we're increasing the max_length of unique_content_id to 64?

Suggested change
self.unique_content_id = compute_content_id(advisory_data, include_metadata=False)[:31]
self.unique_content_id = compute_content_id(advisory_data, include_metadata=False)

Comment on lines +36 to +41
duplicate_content_ids = (
    Advisory.objects.values("unique_content_id")
    .annotate(count=Count("id"))
    .filter(count__gt=1)
    .values_list("unique_content_id", flat=True)
)
@keshav-space (Member) commented

There are potentially millions of duplicate advisories; we should paginate them instead of loading them all into memory at once.
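A minimal sketch of one way to do this, using Django's queryset iterator to stream results in chunks (the chunk size is illustrative):

duplicate_content_ids = (
    Advisory.objects.values("unique_content_id")
    .annotate(count=Count("id"))
    .filter(count__gt=1)
    .values_list("unique_content_id", flat=True)
    .iterator(chunk_size=2000)  # stream in chunks instead of loading everything
)

for content_id in duplicate_content_ids:
    ...  # handle one content ID at a time

Note that an iterator has no len(), so the count used in the log line would need a separate query.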

f"Found {len(duplicate_content_ids)} content IDs with duplicates", level=logging.INFO
)

for content_id in duplicate_content_ids:
@keshav-space (Member) commented

We should use LoopProgress.
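A minimal sketch, matching how LoopProgress is used later in this review (the import path is an assumption):

from aboutcode.pipeline import LoopProgress  # assumed import path

progress = LoopProgress(
    total_iterations=duplicate_content_ids.count(),
    logger=self.log,
    progress_step=1,
)
for content_id in progress.iter(duplicate_content_ids):
    ...  # process one content ID, with progress logged periodically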

Update content IDs for all advisories that don't have one.
"""
advisories = Advisory.objects.filter(
    Q(unique_content_id="") | Q(unique_content_id__isnull=True)
@keshav-space (Member) commented on Feb 7, 2025

Since we have changed how we compute the content ID and are now using SHA256 instead of MD5, we need to recompute the content ID for all advisories, irrespective of whether the field is empty.


for advisory in advisories:
    advisory.unique_content_id = compute_content_id(advisory)
    advisory.save()
@keshav-space (Member) commented

Calling save() individually on millions of advisories is not right. We should instead do a bulk update.
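A minimal sketch of the batched alternative, assuming Django's bulk_update (the batch size is illustrative):

batch = []
for advisory in advisories.iterator(chunk_size=2000):
    advisory.unique_content_id = compute_content_id(advisory)
    batch.append(advisory)
    if len(batch) >= 10000:
        # One UPDATE per batch instead of one per advisory.
        Advisory.objects.bulk_update(batch, fields=["unique_content_id"])
        batch.clear()

if batch:
    Advisory.objects.bulk_update(batch, fields=["unique_content_id"])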

Comment on lines +32 to +79
"""
Find advisories with the same content and keep only the latest one.
"""
# Get all advisories that have duplicates based on content ID
duplicate_content_ids = (
Advisory.objects.values("unique_content_id")
.annotate(count=Count("id"))
.filter(count__gt=1)
.values_list("unique_content_id", flat=True)
)

self.log(
f"Found {len(duplicate_content_ids)} content IDs with duplicates", level=logging.INFO
)

for content_id in duplicate_content_ids:
# Get all advisories with this content ID
advisories = Advisory.objects.filter(unique_content_id=content_id)

# Find the latest advisory
latest = advisories.latest("date_imported")

# Delete all except the latest
advisories.exclude(id=latest.id).delete()

if self.log:
self.log(
f"Kept advisory {latest.id} and removed "
f"{advisories.count() - 1} duplicates for content ID {content_id}",
level=logging.INFO,
)

def update_content_ids(self):
"""
Update content IDs for all advisories that don't have one.
"""
advisories = Advisory.objects.filter(
Q(unique_content_id="") | Q(unique_content_id__isnull=True)
)

self.log(f"Found {advisories.count()} advisories without content ID", level=logging.INFO)

for advisory in advisories:
advisory.unique_content_id = compute_content_id(advisory)
advisory.save()

if self.log:
self.log(f"Updated content ID for advisory {advisory.id}", level=logging.DEBUG)
@keshav-space (Member) commented on Feb 7, 2025

Maybe we can do something simpler like this.

Suggested change (replacing the lines quoted above)

"""
Recompute content id and remove advisories with the same content and keep only the latest one.
"""
advisories = Advisory.objects.all().order_by("-id").paginated()
advisories_count = Advisory.objects.all().count()
self.log(f"Computing new content id for {advisories_count} advisories and removing duplicates.")

batch_size = 10000
deleted_advisory_count = 0
updated_advisory_count = 0
duplicate_advisory_id = []
updated_advisory = []
content_ids = set()

progress = LoopProgress(
    total_iterations=advisories_count,
    logger=self.log,
    progress_step=1,
)
for advisory in progress.iter(advisories):
    content_id = compute_content_id(advisory)
    if content_id in content_ids:
        duplicate_advisory_id.append(advisory.id)
    else:
        advisory.unique_content_id = content_id
        updated_advisory.append(advisory)
        content_ids.add(content_id)

    if len(duplicate_advisory_id) > batch_size:
        deleted_advisory_count += delete_advisories(
            advisory_ids=duplicate_advisory_id,
            logger=self.log,
        )
    if len(updated_advisory) > batch_size:
        updated_advisory_count += bulk_update_advisory(
            items=updated_advisory,
            fields=["unique_content_id"],
            logger=self.log,
        )

deleted_advisory_count += delete_advisories(
    advisory_ids=duplicate_advisory_id,
    logger=self.log,
)
updated_advisory_count += bulk_update_advisory(
    items=updated_advisory,
    fields=["unique_content_id"],
    logger=self.log,
)

self.log(f"Removed {deleted_advisory_count} duplicate advisories.")
self.log(f"Updated content id for {updated_advisory_count} advisories.")


def bulk_update_advisory(items, fields, logger):
    item_count = 0
    if items:
        try:
            Advisory.objects.bulk_update(objs=items, fields=fields)
            item_count += len(items)
        except Exception as e:
            logger(f"Error updating Advisory: {e}")
        items.clear()
    return item_count


def delete_advisories(advisory_ids, logger):
    item_count = 0
    if advisory_ids:
        try:
            Advisory.objects.filter(id__in=advisory_ids).delete()
            item_count += len(advisory_ids)
        except Exception as e:
            logger(f"Error deleting Advisory: {e}")
        advisory_ids.clear()
    return item_count

@pombredanne (Member) left a comment

Here is some more feedback.


# TODO: Add cache
def _cmp_key(self):
    return (str(self.system), self.value, self.scoring_elements, self.published_at)
@pombredanne (Member) commented

This is less costly than building a string from the whole object.

Suggested change
return (str(self.system), self.value, self.scoring_elements, self.published_at)
return (self.system.identifier, self.value, self.scoring_elements, self.published_at)


# TODO: Add cache
def _cmp_key(self):
    return (str(self.package), str(self.affected_version_range), str(self.fixed_version))
@pombredanne (Member) commented

If there is no fixed_version or no affected_version_range, their value would be None and the string value would be "None", which is not what we want IMHO; we want "" instead.

Suggested change
return (str(self.package), str(self.affected_version_range), str(self.fixed_version))
return (str(self.package), str(self.affected_version_range or ""), str(self.fixed_version or ""))

@@ -1315,7 +1316,7 @@ class Advisory(models.Model):
    """

    unique_content_id = models.CharField(
        max_length=32,
        max_length=64,
@pombredanne (Member) commented

We need a help text, including an explanation of why we use 64 chars, e.g. a SHA256 as hex.
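A minimal sketch of what that could look like (the wording is illustrative and other field options are omitted):

unique_content_id = models.CharField(
    max_length=64,  # a SHA256 digest rendered as hex is 64 characters
    help_text="SHA256 checksum (hex) of the normalized advisory content, used to detect duplicate advisories.",
)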

Advisory.objects.values("unique_content_id")
.annotate(count=Count("id"))
.filter(count__gt=1)
.values_list("unique_content_id", flat=True)
@pombredanne (Member) commented

You want the ids to delete, not the unique_content_id IMHO

Suggested change
.values_list("unique_content_id", flat=True)
.values_list("id", flat=True)

advisories = Advisory.objects.filter(unique_content_id=content_id)

# Find the latest advisory
latest = advisories.latest("date_imported")
@pombredanne (Member) commented

Keep the earliest, IMHO, not the latest.
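A minimal sketch of that change, using Django's earliest():

# Keep the first-imported advisory and drop the rest.
earliest = advisories.earliest("date_imported")
advisories.exclude(id=earliest.id).delete()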

# Delete all except the latest
advisories.exclude(id=latest.id).delete()

if self.log:
@pombredanne (Member) commented

Why this test?
