VCIO-next: Unstable advisory `unique_content_id` #1583

keshav-space · 2024-09-04T12:59:24Z

The current implementation of unique_content_id is unstable because JSONField does not preserve the order of keys.
On PostgreSQL, JSONField uses jsonb, which preserves neither key order nor whitespace.
For more information, see https://docs.djangoproject.com/en/5.1/ref/models/fields/#django.db.models.JSONField and https://www.postgresql.org/docs/16/datatype-json.html.
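
To illustrate the failure mode outside the database, here is a minimal sketch showing that hashing a JSON serialization is sensitive to key order unless keys are sorted. The `digest` helper and the sample dicts are made up for illustration; this is not the project's actual content ID code.

```python
import hashlib
import json


def digest(data, sort_keys=False):
    """Return a hex digest of the JSON serialization of ``data`` (hypothetical helper)."""
    serialized = json.dumps(data, sort_keys=sort_keys)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# Same content, different key insertion order, as can happen after a
# round trip through a store that does not preserve key order.
before_save = {"url": "https://example.com", "reference_id": "CVE-2020-13371337"}
after_reload = {"reference_id": "CVE-2020-13371337", "url": "https://example.com"}

# The dicts compare equal, yet their unsorted serializations differ.
assert before_save == after_reload
unstable = digest(before_save) != digest(after_reload)

# Sorting keys at serialization time makes the digest order-independent.
stable = digest(before_save, sort_keys=True) == digest(after_reload, sort_keys=True)
```

Both `unstable` and `stable` evaluate to `True` here, which matches the behavior described above: equal data, different digest, unless the serialization is canonical.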

This can lead to widespread duplication of advisory data, resulting in increased storage usage.

Below is a snippet to reproduce the bug, where the same advisory data leads to different unique_content_id values:

In [1]: from vulnerabilities import importer

In [2]: from packageurl import PackageURL

In [3]: from univers.version_range import VersionRange

In [4]: from django.utils import timezone

In [5]: from vulnerabilities.pipes.advisory import insert_advisory

In [6]: from vulnerabilities.models import Advisory

In [7]: advisory_data = importer.AdvisoryData(
   ...:     aliases=["CVE-2020-13371337"],
   ...:     summary="vulnerability description here",
   ...:     affected_packages=[
   ...:         importer.AffectedPackage(
   ...:             package=PackageURL(type="pypi", name="dummy"),
   ...:             affected_version_range=VersionRange.from_string("vers:pypi/>=1.0.0|<=2.0.0"),
   ...:         )
   ...:     ],
   ...:     references=[importer.Reference(url="https://example.com/with/more/info/CVE-2020-13371337")],
   ...:     date_published=timezone.now(),
   ...:     url="https://test.com",
   ...: )

In [8]: r = insert_advisory(advisory_data, "test")

In [9]: r.unique_content_id
Out[9]: '2ececc550f7f6b5537e5f1a767ef0f25'

In [10]: k = Advisory.objects.get(unique_content_id=r.unique_content_id)

In [11]: k.unique_content_id
Out[11]: '2ececc550f7f6b5537e5f1a767ef0f25'

In [12]: k.date_imported = None # Change any field not used for computing content id

In [12]: k.save()

In [13]: k.unique_content_id
Out[13]: 'bf83e58fc8f7eb54d04a59c27f0680f8'

In [14]: assert k.unique_content_id == r.unique_content_id
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[14], line 1
----> 1 assert k.unique_content_id == r.unique_content_id


keshav-space · 2025-01-24T15:56:49Z

Remediation

We should have a function that computes the content_id from the AdvisoryData object. It should normalize the content of each field before computing the digest.

Summary: remove spaces, then lowercase
Lists: sort the items so the order is stable
Objects: order the members (keys)

Once each item in the object is normalized, combine the normalized data into a single data structure. Then serialize it to JSON, encode it, and compute a SHA-256 digest to get a unique content_id.
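
The steps above could be sketched roughly as follows. This is only an illustration under assumptions: it works on a plain dict rather than a real AdvisoryData object, dates are assumed to be already serialized as strings, "remove spaces" is interpreted as removing all whitespace, and the names `normalize_summary`, `normalize`, and `compute_content_id` are hypothetical.

```python
import hashlib
import json


def normalize_summary(summary):
    """Remove all whitespace and lowercase (one reading of 'remove spaces')."""
    return "".join((summary or "").split()).lower()


def normalize(value):
    """Recursively sort list items; dict keys are ordered later, at dump time."""
    if isinstance(value, dict):
        return {key: normalize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        items = [normalize(item) for item in value]
        # Sort by canonical JSON text so mixed or nested items stay comparable.
        return sorted(items, key=lambda item: json.dumps(item, sort_keys=True))
    return value


def compute_content_id(advisory):
    """Hypothetical helper: SHA-256 hex digest of the normalized advisory dict."""
    normalized = normalize(advisory)
    normalized["summary"] = normalize_summary(advisory.get("summary"))
    serialized = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


# Two dicts with the same content but different key order, list order,
# and summary spacing or casing.
first = {
    "aliases": ["CVE-2020-2", "CVE-2020-1"],
    "summary": "Vulnerability description here",
    "references": [{"url": "https://example.com/b"}, {"url": "https://example.com/a"}],
}
second = {
    "summary": "vulnerability  description HERE",
    "references": [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}],
    "aliases": ["CVE-2020-1", "CVE-2020-2"],
}
same_id = compute_content_id(first) == compute_content_id(second)
```

With this normalization `same_id` is `True`. Whether created_by and the source URL belong in the dict passed to the function is the open question noted below.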

TBD: Should the content_id also include created_by and the URL of the source?

Next Steps:

Once we have a way to generate a stable content_id, we need to create a one-time pipeline to dedupe advisories. No migration, please!
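
As a rough sketch of what the grouping step of such a dedupe pass might look like, the snippet below groups in-memory records by content ID and separates one record to keep from the rest. Keeping the first record seen in each group is an assumption made for illustration, as are the `find_duplicates` name and the `content_id_of` callable; a real pipeline would operate on the Advisory table.

```python
from collections import defaultdict


def find_duplicates(advisories, content_id_of):
    """Group records by content ID and return (kept, duplicates).

    ``content_id_of`` is any callable returning a stable ID for a record.
    """
    groups = defaultdict(list)
    for advisory in advisories:
        groups[content_id_of(advisory)].append(advisory)

    kept, duplicates = [], []
    for group in groups.values():
        kept.append(group[0])         # assumption: keep the first record seen
        duplicates.extend(group[1:])  # the rest are candidates for removal
    return kept, duplicates


# Toy records; the summary stands in for a stable content ID here.
records = [
    {"id": 1, "summary": "a"},
    {"id": 2, "summary": "a"},
    {"id": 3, "summary": "b"},
]
kept, duplicates = find_duplicates(records, lambda record: record["summary"])
```

Here `kept` holds records 1 and 3, and `duplicates` holds record 2.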

keshav-space added bug Priority: high design-needed labels Sep 4, 2024

aboutcode-org deleted a comment Sep 4, 2024

pombredanne added the 3-next label Oct 15, 2024

pombredanne added this to the v36.0.0 - 3-next milestone Oct 15, 2024

TG1999 assigned keshav-space Nov 26, 2024

pombredanne changed the title ~~Unstable advisory unique_content_id~~ VCIO-next: Unstable advisory unique_content_id Dec 23, 2024

keshav-space assigned TG1999 and unassigned keshav-space Jan 24, 2025

keshav-space mentioned this issue Jan 27, 2025

Add pipeline to compute Advisory ToDos #1764

Open

TG1999 mentioned this issue Feb 11, 2025

Add new content ID function #1766

Open
