Add new content ID function #1766

Open · TG1999 wants to merge 4 commits into main from add_new_content_id_function
Conversation

@TG1999 (Contributor) commented on Jan 29, 2025

Reference: #1583

@keshav-space (Member) left a comment

Thanks @TG1999, see some suggestions below.

# Normalize fields
normalized_data = {
    "summary": normalize_text(advisory_data.summary),
    "affected_packages": normalize_list(advisory_data.affected_packages),
@keshav-space (Member) commented on Jan 31, 2025

There's no guarantee this will work, since we don't have a proper implementation for comparing AffectedPackage objects.

For example, this will not be able to normalize the list of affected_packages below:

affected_packages = [
    AffectedPackage(
        package=PackageURL(
            type="alpine",
            namespace=None,
            name="linux-lts",
            version=None,
            qualifiers={
                "arch": "aarch64",
                "distroversion": "v3.20",
                "reponame": "main",
            },
            subpath=None,
        ),
        affected_version_range=None,
        fixed_version="6.6.13-r1",
    ),
    AffectedPackage(
        package=PackageURL(
            type="alpine",
            namespace=None,
            name="linux-lts",
            version=None,
            qualifiers={"arch": "armhf", "distroversion": "v3.21", "reponame": "main"},
            subpath=None,
        ),
        affected_version_range=None,
        fixed_version="6.6.13-r1",
    ),
]
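A minimal sketch of one way to make this list normalizable, assuming normalize_list sorts its items before hashing: give AffectedPackage a deterministic comparison key. The _cmp_key below mirrors the one discussed later in this review; sort_affected_packages is a hypothetical helper.

# Hypothetical sketch: a deterministic sort key for AffectedPackage.
def _cmp_key(self):
    # Stringify each part so None values and version-range objects
    # compare consistently.
    return (
        str(self.package),
        str(self.affected_version_range or ""),
        str(self.fixed_version or ""),
    )

def sort_affected_packages(affected_packages):
    # A stable ordering makes the computed content ID reproducible.
    return sorted(affected_packages, key=lambda pkg: pkg._cmp_key())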

"summary": normalize_text(advisory_data.summary),
"affected_packages": normalize_list(advisory_data.affected_packages),
"references": normalize_list(advisory_data.references),
"weaknesses": normalize_list(advisory_data.weaknesses),
@keshav-space (Member) commented

Similar to the affected_packages list above, this also won't work.

}

if include_metadata:
    normalized_data["created_by"] = advisory_data.created_by
@keshav-space (Member) commented

created_by is a model field, not an attribute of AdvisoryData.
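A minimal sketch of the distinction, assuming the metadata should come from the Advisory model instance (named advisory here for illustration) rather than from AdvisoryData:

if include_metadata:
    # created_by lives on the Advisory model, not on AdvisoryData,
    # so it has to be read from the model instance.
    normalized_data["created_by"] = advisory.created_by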

Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
@TG1999 force-pushed the add_new_content_id_function branch from 8793912 to ebf1a32 on February 6, 2025
@keshav-space (Member) left a comment

Thanks @TG1999, some nits for your consideration.

@@ -1369,6 +1370,11 @@ def save(self, *args, **kwargs):
        self.unique_content_id = checksum.hexdigest()
        super().save(*args, **kwargs)

    def save(self, *args, **kwargs):
        advisory_data = self.to_advisory_data()
        self.unique_content_id = compute_content_id(advisory_data, include_metadata=False)[:31]
@keshav-space (Member) commented

Why trim the hash to 31 characters when we're increasing the max_length of unique_content_id to 64?

Suggested change
self.unique_content_id = compute_content_id(advisory_data, include_metadata=False)[:31]
self.unique_content_id = compute_content_id(advisory_data, include_metadata=False)

Comment on lines +36 to +41
duplicate_content_ids = (
    Advisory.objects.values("unique_content_id")
    .annotate(count=Count("id"))
    .filter(count__gt=1)
    .values_list("unique_content_id", flat=True)
)
@keshav-space (Member) commented

There are potentially millions of duplicate advisories; we should paginate them instead of loading them all into memory at once.
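A minimal sketch of one way to do this, using Django's queryset iterator to stream results in chunks (the chunk size is illustrative):

duplicate_content_ids = (
    Advisory.objects.values("unique_content_id")
    .annotate(count=Count("id"))
    .filter(count__gt=1)
    .values_list("unique_content_id", flat=True)
    .iterator(chunk_size=2000)  # stream in chunks instead of loading everything
)

for content_id in duplicate_content_ids:
    ...  # handle one content ID at a time

Note that an iterator has no len(), so the count used in the log line would need a separate query.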

f"Found {len(duplicate_content_ids)} content IDs with duplicates", level=logging.INFO
)

for content_id in duplicate_content_ids:
@keshav-space (Member) commented

We should use LoopProgress.
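A minimal sketch, matching how LoopProgress is used later in this review (the import path is an assumption):

from aboutcode.pipeline import LoopProgress  # assumed import path

progress = LoopProgress(
    total_iterations=duplicate_content_ids.count(),
    logger=self.log,
    progress_step=1,
)
for content_id in progress.iter(duplicate_content_ids):
    ...  # process one content ID, with progress logged periodically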

Update content IDs for all advisories that don't have one.
"""
advisories = Advisory.objects.filter(
    Q(unique_content_id="") | Q(unique_content_id__isnull=True)
@keshav-space (Member) commented on Feb 7, 2025

Since we have changed how we compute the content ID and are now using SHA256 instead of MD5, we need to recompute the content ID for all advisories, irrespective of whether the field is empty.


for advisory in advisories:
    advisory.unique_content_id = compute_content_id(advisory)
    advisory.save()
@keshav-space (Member) commented

Calling save() individually on millions of advisories is not right. We should instead do a bulk update.
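A minimal sketch of the batched alternative, assuming Django's bulk_update (the batch size is illustrative):

batch = []
for advisory in advisories.iterator(chunk_size=2000):
    advisory.unique_content_id = compute_content_id(advisory)
    batch.append(advisory)
    if len(batch) >= 10000:
        # One UPDATE per batch instead of one per advisory.
        Advisory.objects.bulk_update(batch, fields=["unique_content_id"])
        batch.clear()

if batch:
    Advisory.objects.bulk_update(batch, fields=["unique_content_id"])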

Comment on lines +32 to +79
"""
Find advisories with the same content and keep only the latest one.
"""
# Get all advisories that have duplicates based on content ID
duplicate_content_ids = (
Advisory.objects.values("unique_content_id")
.annotate(count=Count("id"))
.filter(count__gt=1)
.values_list("unique_content_id", flat=True)
)

self.log(
f"Found {len(duplicate_content_ids)} content IDs with duplicates", level=logging.INFO
)

for content_id in duplicate_content_ids:
# Get all advisories with this content ID
advisories = Advisory.objects.filter(unique_content_id=content_id)

# Find the latest advisory
latest = advisories.latest("date_imported")

# Delete all except the latest
advisories.exclude(id=latest.id).delete()

if self.log:
self.log(
f"Kept advisory {latest.id} and removed "
f"{advisories.count() - 1} duplicates for content ID {content_id}",
level=logging.INFO,
)

def update_content_ids(self):
"""
Update content IDs for all advisories that don't have one.
"""
advisories = Advisory.objects.filter(
Q(unique_content_id="") | Q(unique_content_id__isnull=True)
)

self.log(f"Found {advisories.count()} advisories without content ID", level=logging.INFO)

for advisory in advisories:
advisory.unique_content_id = compute_content_id(advisory)
advisory.save()

if self.log:
self.log(f"Updated content ID for advisory {advisory.id}", level=logging.DEBUG)
@keshav-space (Member) commented on Feb 7, 2025

Maybe we can do something simpler like this.

Suggested change (replacing the lines quoted above)

"""
Recompute content id and remove advisories with the same content and keep only the latest one.
"""
advisories = Advisory.objects.all().order_by("-id").paginated()
advisories_count = Advisory.objects.all().count()
self.log(f"Computing new content id for {advisories_count} advisories and removing duplicates.")

batch_size = 10000
deleted_advisory_count = 0
updated_advisory_count = 0
duplicate_advisory_id = []
updated_advisory = []
content_ids = set()

progress = LoopProgress(
    total_iterations=advisories_count,
    logger=self.log,
    progress_step=1,
)
for advisory in progress.iter(advisories):
    content_id = compute_content_id(advisory)
    if content_id in content_ids:
        duplicate_advisory_id.append(advisory.id)
    else:
        advisory.unique_content_id = content_id
        updated_advisory.append(advisory)
        content_ids.add(content_id)

    if len(duplicate_advisory_id) > batch_size:
        deleted_advisory_count += delete_advisories(
            advisory_ids=duplicate_advisory_id,
            logger=self.log,
        )
    if len(updated_advisory) > batch_size:
        updated_advisory_count += bulk_update_advisory(
            items=updated_advisory,
            fields=["unique_content_id"],
            logger=self.log,
        )

deleted_advisory_count += delete_advisories(
    advisory_ids=duplicate_advisory_id,
    logger=self.log,
)
updated_advisory_count += bulk_update_advisory(
    items=updated_advisory,
    fields=["unique_content_id"],
    logger=self.log,
)

self.log(f"Removed {deleted_advisory_count} duplicate advisories.")
self.log(f"Updated content id for {updated_advisory_count} advisories.")


def bulk_update_advisory(items, fields, logger):
    item_count = 0
    if items:
        try:
            Advisory.objects.bulk_update(objs=items, fields=fields)
            item_count += len(items)
        except Exception as e:
            logger(f"Error updating Advisory: {e}")
        items.clear()
    return item_count


def delete_advisories(advisory_ids, logger):
    item_count = 0
    if advisory_ids:
        try:
            Advisory.objects.filter(id__in=advisory_ids).delete()
            item_count += len(advisory_ids)
        except Exception as e:
            logger(f"Error deleting Advisory: {e}")
        advisory_ids.clear()
    return item_count

@pombredanne (Member) left a comment

Here is some more feedback.


# TODO: Add cache
def _cmp_key(self):
    return (str(self.system), self.value, self.scoring_elements, self.published_at)
@pombredanne (Member) commented

This is less costly than building a string from the whole object.

Suggested change
return (str(self.system), self.value, self.scoring_elements, self.published_at)
return (self.system.identifier, self.value, self.scoring_elements, self.published_at)


# TODO: Add cache
def _cmp_key(self):
    return (str(self.package), str(self.affected_version_range), str(self.fixed_version))
@pombredanne (Member) commented

If there is no fixed_version or no affected_version_range, their value would be None and the string value would be "None", which is not what we want IMHO; we want "" instead.

Suggested change
return (str(self.package), str(self.affected_version_range), str(self.fixed_version))
return (str(self.package), str(self.affected_version_range or ""), str(self.fixed_version or ""))

@@ -1315,7 +1316,7 @@ class Advisory(models.Model):
    """

    unique_content_id = models.CharField(
        max_length=32,
        max_length=64,
@pombredanne (Member) commented

We need a help text, including an explanation of why we use 64 chars, e.g. a SHA256 as hex.
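A minimal sketch of what that could look like (the wording is illustrative and other field options are omitted):

unique_content_id = models.CharField(
    max_length=64,  # a SHA256 digest rendered as hex is 64 characters
    help_text="SHA256 checksum (hex) of the normalized advisory content, used to detect duplicate advisories.",
)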

Advisory.objects.values("unique_content_id")
.annotate(count=Count("id"))
.filter(count__gt=1)
.values_list("unique_content_id", flat=True)
@pombredanne (Member) commented

You want the ids to delete, not the unique_content_id IMHO

Suggested change
.values_list("unique_content_id", flat=True)
.values_list("id", flat=True)

advisories = Advisory.objects.filter(unique_content_id=content_id)

# Find the latest advisory
latest = advisories.latest("date_imported")
@pombredanne (Member) commented

Keep the earliest, IMHO, not the latest.
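A minimal sketch of that change, using Django's earliest():

# Keep the first-imported advisory and drop the rest.
earliest = advisories.earliest("date_imported")
advisories.exclude(id=earliest.id).delete()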

# Delete all except the latest
advisories.exclude(id=latest.id).delete()

if self.log:
@pombredanne (Member) commented

Why this test?
