Add new content ID function #1766
base: main
Conversation
Thanks @TG1999, see some suggestions below.
vulnerabilities/utils.py
Outdated
# Normalize fields
normalized_data = {
    "summary": normalize_text(advisory_data.summary),
    "affected_packages": normalize_list(advisory_data.affected_packages),
There is no certainty that this will work, since we don't have a proper implementation for handling comparisons between AffectedPackage objects.
For example, this will not be able to normalize the list of affected_packages below:
affected_packages = [
    AffectedPackage(
        package=PackageURL(
            type="alpine",
            namespace=None,
            name="linux-lts",
            version=None,
            qualifiers={
                "arch": "aarch64",
                "distroversion": "v3.20",
                "reponame": "main",
            },
            subpath=None,
        ),
        affected_version_range=None,
        fixed_version="6.6.13-r1",
    ),
    AffectedPackage(
        package=PackageURL(
            type="alpine",
            namespace=None,
            name="linux-lts",
            version=None,
            qualifiers={"arch": "armhf", "distroversion": "v3.21", "reponame": "main"},
            subpath=None,
        ),
        affected_version_range=None,
        fixed_version="6.6.13-r1",
    ),
]
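One way to make such lists normalizable, sketched below, is to sort on a canonical string key instead of relying on object comparison. This is a hypothetical illustration, not VulnerableCode code: the `FakeAffectedPackage` stand-in, `sort_key`, and `normalize_list` names are assumptions.

```python
# Hypothetical sketch: sort otherwise non-comparable dataclasses by a
# canonical string key so normalization is deterministic.
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeAffectedPackage:
    purl: str
    affected_version_range: Optional[str]
    fixed_version: Optional[str]


def sort_key(pkg):
    # Map None fields to "" so every element of the key is a plain string.
    return (pkg.purl, pkg.affected_version_range or "", pkg.fixed_version or "")


def normalize_list(items):
    return sorted(items, key=sort_key)


a = FakeAffectedPackage("pkg:alpine/linux-lts?arch=armhf", None, "6.6.13-r1")
b = FakeAffectedPackage("pkg:alpine/linux-lts?arch=aarch64", None, "6.6.13-r1")

# The same set normalizes identically regardless of input order.
assert normalize_list([a, b]) == normalize_list([b, a])
```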
    "summary": normalize_text(advisory_data.summary),
    "affected_packages": normalize_list(advisory_data.affected_packages),
    "references": normalize_list(advisory_data.references),
    "weaknesses": normalize_list(advisory_data.weaknesses),
Similar to the affected_packages list above, this also won't work.
vulnerabilities/utils.py
Outdated
}

if include_metadata:
    normalized_data["created_by"] = advisory_data.created_by
created_by is a model field, not an attribute of AdvisoryData.
Signed-off-by: Tushar Goel <tushar.goel.dav@gmail.com>
Force-pushed from 8793912 to ebf1a32
Thanks @TG1999, some nits for your consideration.
@@ -1369,6 +1370,11 @@ def save(self, *args, **kwargs):
         self.unique_content_id = checksum.hexdigest()
         super().save(*args, **kwargs)

+    def save(self, *args, **kwargs):
+        advisory_data = self.to_advisory_data()
+        self.unique_content_id = compute_content_id(advisory_data, include_metadata=False)[:31]
Why trim the hash to 31 characters when we're increasing the max_length for unique_content_id to 64?
Suggested change:
- self.unique_content_id = compute_content_id(advisory_data, include_metadata=False)[:31]
+ self.unique_content_id = compute_content_id(advisory_data, include_metadata=False)
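For context on the field width: a SHA-256 digest rendered as hex is always exactly 64 characters, so the full digest fits the new max_length and the slice silently discards more than half of the hash.

```python
import hashlib

# A sha256 hex digest is always 64 characters long.
digest = hashlib.sha256(b"advisory content").hexdigest()
assert len(digest) == 64

# The [:31] slice keeps fewer than half of those characters.
assert len(digest[:31]) == 31
```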
duplicate_content_ids = (
    Advisory.objects.values("unique_content_id")
    .annotate(count=Count("id"))
    .filter(count__gt=1)
    .values_list("unique_content_id", flat=True)
)
There are potentially millions of duplicate advisories; we should paginate them instead of loading them all into memory at once.
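The batching idea can be sketched in plain Python (in Django terms one would reach for `QuerySet.iterator(chunk_size=...)` or keyset pagination instead; the `chunked` helper below is an illustrative assumption, not part of this codebase):

```python
def chunked(iterable, size):
    """Yield lists of at most `size` items, so callers never
    materialize the whole sequence in memory at once."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


batches = list(chunked(range(10), 4))
assert batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```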
    f"Found {len(duplicate_content_ids)} content IDs with duplicates", level=logging.INFO
)

for content_id in duplicate_content_ids:
We should use LoopProgress.
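The LoopProgress call shape used elsewhere in this thread is `LoopProgress(total_iterations=..., logger=..., progress_step=N)` followed by `progress.iter(...)`. The class below is a minimal stand-in mimicking that interface for illustration only; it is not the real aboutcode/VulnerableCode implementation.

```python
class LoopProgressStandIn:
    """Minimal stand-in for the LoopProgress interface used in this PR:
    LoopProgress(total_iterations=..., logger=..., progress_step=N).iter(...).
    Logs roughly every `progress_step` percent of the loop."""

    def __init__(self, total_iterations, logger, progress_step=10):
        self.total = total_iterations
        self.logger = logger
        # Convert a percentage step into an item interval (at least 1).
        self.step = max(1, total_iterations * progress_step // 100)

    def iter(self, iterable):
        for index, item in enumerate(iterable, start=1):
            if index % self.step == 0 or index == self.total:
                self.logger(f"Progress: {index}/{self.total}")
            yield item


messages = []
progress = LoopProgressStandIn(total_iterations=4, logger=messages.append, progress_step=50)
assert list(progress.iter("abcd")) == ["a", "b", "c", "d"]
assert messages[-1] == "Progress: 4/4"
```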
Update content IDs for all advisories that don't have one.
"""
advisories = Advisory.objects.filter(
    Q(unique_content_id="") | Q(unique_content_id__isnull=True)
Since we have changed how we compute the content id, and we're now using sha256 instead of md5, we need to recompute the content id for all advisories irrespective of whether or not the field is empty.
for advisory in advisories:
    advisory.unique_content_id = compute_content_id(advisory)
    advisory.save()
Calling save() individually on millions of advisories is not right. We should instead do a bulk update.
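The accumulate-then-flush pattern behind a bulk update can be shown in plain Python. The `flush` helper and names below are illustrative assumptions; in Django the flush step would be `Advisory.objects.bulk_update(objs=..., fields=[...])`.

```python
def flush(buffer, sink):
    """Apply one bulk operation for everything buffered, then clear the buffer."""
    count = len(buffer)
    if buffer:
        sink.extend(buffer)  # stands in for a single bulk_update() call
        buffer.clear()
    return count


BATCH_SIZE = 3
pending, written = [], []
total_flushed = 0

for n in range(7):
    pending.append(n)
    # One bulk write per BATCH_SIZE items instead of one write per item.
    if len(pending) >= BATCH_SIZE:
        total_flushed += flush(pending, written)

total_flushed += flush(pending, written)  # final partial batch

assert written == [0, 1, 2, 3, 4, 5, 6]
assert total_flushed == 7
```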
    """
    Find advisories with the same content and keep only the latest one.
    """
    # Get all advisories that have duplicates based on content ID
    duplicate_content_ids = (
        Advisory.objects.values("unique_content_id")
        .annotate(count=Count("id"))
        .filter(count__gt=1)
        .values_list("unique_content_id", flat=True)
    )

    self.log(
        f"Found {len(duplicate_content_ids)} content IDs with duplicates", level=logging.INFO
    )

    for content_id in duplicate_content_ids:
        # Get all advisories with this content ID
        advisories = Advisory.objects.filter(unique_content_id=content_id)

        # Find the latest advisory
        latest = advisories.latest("date_imported")

        # Delete all except the latest
        advisories.exclude(id=latest.id).delete()

        if self.log:
            self.log(
                f"Kept advisory {latest.id} and removed "
                f"{advisories.count() - 1} duplicates for content ID {content_id}",
                level=logging.INFO,
            )

def update_content_ids(self):
    """
    Update content IDs for all advisories that don't have one.
    """
    advisories = Advisory.objects.filter(
        Q(unique_content_id="") | Q(unique_content_id__isnull=True)
    )

    self.log(f"Found {advisories.count()} advisories without content ID", level=logging.INFO)

    for advisory in advisories:
        advisory.unique_content_id = compute_content_id(advisory)
        advisory.save()

        if self.log:
            self.log(f"Updated content ID for advisory {advisory.id}", level=logging.DEBUG)
Maybe we can do something simpler like this.
Suggested change (replacing the two methods above):

"""
Recompute content ids and remove advisories with the same content, keeping only the latest one.
"""
advisories = Advisory.objects.all().order_by("-id").paginated()
advisories_count = Advisory.objects.all().count()
self.log(f"Computing new content id for {advisories_count} advisories and removing duplicates.")

batch_size = 10000
deleted_advisory_count = 0
updated_advisory_count = 0
duplicate_advisory_id = []
updated_advisory = []
content_ids = set()

progress = LoopProgress(
    total_iterations=advisories_count,
    logger=self.log,
    progress_step=1,
)
for advisory in progress.iter(advisories):
    content_id = compute_content_id(advisory)
    if content_id in content_ids:
        duplicate_advisory_id.append(advisory.id)
    else:
        advisory.unique_content_id = content_id
        updated_advisory.append(advisory)
        content_ids.add(content_id)

    if len(duplicate_advisory_id) > batch_size:
        deleted_advisory_count += delete_advisories(
            advisory_ids=duplicate_advisory_id,
            logger=self.log,
        )
    if len(updated_advisory) > batch_size:
        updated_advisory_count += bulk_update_advisory(
            items=updated_advisory,
            fields=["unique_content_id"],
            logger=self.log,
        )

deleted_advisory_count += delete_advisories(
    advisory_ids=duplicate_advisory_id,
    logger=self.log,
)
updated_advisory_count += bulk_update_advisory(
    items=updated_advisory,
    fields=["unique_content_id"],
    logger=self.log,
)

self.log(f"Removed {deleted_advisory_count} duplicate advisories.")
self.log(f"Updated content id for {updated_advisory_count} advisories.")


def bulk_update_advisory(items, fields, logger):
    item_count = 0
    if items:
        try:
            Advisory.objects.bulk_update(objs=items, fields=fields)
            item_count += len(items)
        except Exception as e:
            logger(f"Error updating Advisory: {e}")
        items.clear()
    return item_count


def delete_advisories(advisory_ids, logger):
    item_count = 0
    if advisory_ids:
        try:
            Advisory.objects.filter(id__in=advisory_ids).delete()
            item_count += len(advisory_ids)
        except Exception as e:
            logger(f"Error deleting Advisory: {e}")
        advisory_ids.clear()
    return item_count
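The core of that rewrite is first-occurrence-wins deduplication: iterate newest-first, keep the first advisory seen for each content id, and flag later ones for deletion. A pure-Python sketch (the record shapes and helper name are hypothetical):

```python
def split_by_content_id(advisories, content_id_of):
    """Return (kept_ids, duplicate_ids): the first occurrence of each
    content id wins; later occurrences are flagged for deletion."""
    seen = set()
    kept_ids, duplicate_ids = [], []
    for advisory_id, payload in advisories:
        cid = content_id_of(payload)
        if cid in seen:
            duplicate_ids.append(advisory_id)
        else:
            seen.add(cid)
            kept_ids.append(advisory_id)
    return kept_ids, duplicate_ids


# Rows ordered newest-first (descending id), as in the suggested code.
rows = [(3, "GHSA-x: fix in 1.2"), (2, "GHSA-y: fix in 2.0"), (1, "GHSA-x: fix in 1.2")]
kept, dupes = split_by_content_id(rows, content_id_of=lambda payload: payload)
assert kept == [3, 2]   # newest copy of each content survives
assert dupes == [1]     # older duplicate is deleted
```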
Here is some more feedback.
# TODO: Add cache
def _cmp_key(self):
    return (str(self.system), self.value, self.scoring_elements, self.published_at)
This is less costly than getting a string on the whole object.
Suggested change:
- return (str(self.system), self.value, self.scoring_elements, self.published_at)
+ return (self.system.identifier, self.value, self.scoring_elements, self.published_at)
# TODO: Add cache
def _cmp_key(self):
    return (str(self.package), str(self.affected_version_range), str(self.fixed_version))
If there is no fixed_version or no affected_version_range, their value would be None, and the string value would be "None", which is not what we want IMHO; we want "" instead.
Suggested change:
- return (str(self.package), str(self.affected_version_range), str(self.fixed_version))
+ return (str(self.package), str(self.affected_version_range or ""), str(self.fixed_version or ""))
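The pitfall here is easy to demonstrate: `str(None)` is the literal string "None", which silently pollutes the comparison key, while `or ""` maps missing values to the empty string. A minimal sketch (the `cmp_key` helper is illustrative, not the actual method):

```python
# str() turns None into the literal string "None".
assert str(None) == "None"
# `or ""` short-circuits None (and other falsy values) to the empty string.
assert str(None or "") == ""


def cmp_key(fixed_version):
    """Illustrative one-field comparison key using the suggested `or ""` guard."""
    return str(fixed_version or "")


assert cmp_key(None) == ""
assert cmp_key("6.6.13-r1") == "6.6.13-r1"
```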
@@ -1315,7 +1316,7 @@ class Advisory(models.Model):
     """

     unique_content_id = models.CharField(
-        max_length=32,
+        max_length=64,
We need a help text, including an explanation of why we use 64 chars, e.g. sha256 as hex.
Advisory.objects.values("unique_content_id")
.annotate(count=Count("id"))
.filter(count__gt=1)
.values_list("unique_content_id", flat=True)
You want the ids to delete, not the unique_content_id IMHO
advisories = Advisory.objects.filter(unique_content_id=content_id)

# Find the latest advisory
latest = advisories.latest("date_imported")
Keep the earliest IMHO not the latest.
# Delete all except the latest
advisories.exclude(id=latest.id).delete()

if self.log:
Why this test?
Reference: #1583