Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCIO-next: Unstable advisory unique_content_id #1583

Open
keshav-space opened this issue Sep 4, 2024 · 1 comment
Open

VCIO-next: Unstable advisory unique_content_id #1583

keshav-space opened this issue Sep 4, 2024 · 1 comment

Comments

@keshav-space
Copy link
Member

keshav-space commented Sep 4, 2024

Current implementation of unique_content_id is unstable because the order of keys is not preserved in JsonField.
On PostgreSQL by default JsonField uses jsonb, which does not preserve the order or whitespace.
For more information, see https://docs.djangoproject.com/en/5.1/ref/models/fields/#django.db.models.JSONField and https://www.postgresql.org/docs/16/datatype-json.html.

This can lead to widespread duplication of advisory data, resulting in increased storage usage.

Below is a snippet to reproduce the bug, where the same advisory data leads to different unique_content_id values:

In [1]: from vulnerabilities import importer

In [2]: from packageurl import PackageURL

In [3]: from univers.version_range import VersionRange

In [4]: from django.utils import timezone

In [5]: from vulnerabilities.pipes.advisory import insert_advisory

In [6]: from vulnerabilities.importer import AdvisoryData

In [7]: advisory_data = importer.AdvisoryData(
   ...:     aliases=["CVE-2020-13371337"],
   ...:     summary="vulnerability description here",
   ...:     affected_packages=[
   ...:         importer.AffectedPackage(
   ...:             package=PackageURL(type="pypi", name="dummy"),
   ...:             affected_version_range=VersionRange.from_string("vers:pypi/>=1.0.0|<=2.0.0"),
   ...:         )
   ...:     ],
   ...:     references=[importer.Reference(url="https://example.com/with/more/info/CVE-2020-13371337")],
   ...:     date_published=timezone.now(),
   ...:     url="https://test.com",
   ...: )

In [8]: r = insert_advisory(advisory_data, "test")

In [9]: r.unique_content_id
Out[9]: '2ececc550f7f6b5537e5f1a767ef0f25'

In [10]: k = Advisory.objects.get(unique_content_id=r.unique_content_id)

In [11]: k.unique_content_id
Out[11]: '2ececc550f7f6b5537e5f1a767ef0f25'

In [12]: k.date_imported = None # Change any field not used for computing content id

In [12]: k.save()

In [13]: k.unique_content_id
Out[13]: 'bf83e58fc8f7eb54d04a59c27f0680f8'

In [14]: assert k.unique_content_id == r.unique_content_id
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[14], line 1
----> 1 assert k.unique_content_id == r.unique_content_id
@aboutcode-org aboutcode-org deleted a comment Sep 4, 2024
@pombredanne pombredanne added this to the v36.0.0 - 3-next milestone Oct 15, 2024
@pombredanne pombredanne changed the title Unstable advisory unique_content_id VCIO-next: Unstable advisory unique_content_id Dec 23, 2024
@keshav-space
Copy link
Member Author

keshav-space commented Jan 24, 2025

Remediation

We should have a function that computes the content_id using the AdvisoryData object. It should normalize the content for each field before computing the digest.

  • For summary -> remove spaces -> lowercase
  • All lists should be ordered
  • Objects should have ordered members

Once each item in the object is normalized, create a single data structure from the normalized data. Then perform JSON dumping and encoding, followed by a SHA-256 digest to get a unique content_id.

TBD: Should the content_id also include created_by and the URL of the source?

Next Steps:

Once we have a way to generate a stable content_id, we need to create a one-time pipeline to dedupe advisories. No migration, please!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants