We should have a function that computes the content_id using the AdvisoryData object. It should normalize the content for each field before computing the digest.
For summary -> remove spaces -> lowercase
All lists should be ordered
Objects should have ordered members
Once each item in the object is normalized, create a single data structure from the normalized data. Then perform JSON dumping and encoding, followed by a SHA-256 digest to get a unique content_id.
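The normalization and digest steps above could be sketched roughly as follows. This is a minimal sketch, not the project's implementation: it assumes the advisory is available as a plain dict (a hypothetical stand-in for AdvisoryData), and the helper names `normalize_text`, `normalize`, and `compute_content_id` are illustrative only.

```python
import hashlib
import json


def normalize_text(value):
    # Remove all whitespace and lowercase, as proposed for the summary field.
    return "".join(value.split()).lower()


def normalize(value):
    # Recursively order dict members and list items so the serialized
    # form does not depend on insertion order.
    if isinstance(value, dict):
        return {key: normalize(value[key]) for key in sorted(value)}
    if isinstance(value, (list, tuple, set)):
        items = [normalize(item) for item in value]
        # Sort by canonical JSON text so nested or mixed items are comparable.
        return sorted(items, key=lambda item: json.dumps(item, sort_keys=True))
    return value


def compute_content_id(advisory):
    # `advisory` is a plain dict standing in for AdvisoryData (assumed shape).
    data = dict(advisory)
    if isinstance(data.get("summary"), str):
        data["summary"] = normalize_text(data["summary"])
    normalized = normalize(data)
    # Compact separators and sorted keys give one canonical byte string.
    payload = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this approach, two advisories that differ only in key order, list order, or summary spacing and casing map to the same content_id. Whether list order is ever semantically significant for a given field is a design question this sketch does not settle.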
TBD: Should the content_id also include created_by and the URL of the source?
Next Steps:
Once we have a way to generate a stable content_id, we need to create a one-time pipeline to dedupe advisories. No migration, please!
The current implementation of unique_content_id is unstable because the order of keys is not preserved in JSONField. On PostgreSQL, JSONField uses jsonb by default, which does not preserve key order or whitespace. For more information, see https://docs.djangoproject.com/en/5.1/ref/models/fields/#django.db.models.JSONField and https://www.postgresql.org/docs/16/datatype-json.html.
This can lead to widespread duplication of advisory data, resulting in increased storage usage.
Below is a snippet to reproduce the bug, where the same advisory data leads to different unique_content_id values:
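As an illustration of the failure mode (not the snippet from the report), here is a minimal sketch. It assumes the digest is computed over `json.dumps` output without sorted keys, and that a round trip through jsonb can return the same keys in a different order. The function names and sample data are hypothetical.

```python
import hashlib
import json


def naive_content_id(data):
    # No sort_keys: the serialized text follows dict insertion order.
    return hashlib.sha256(json.dumps(data).encode("utf-8")).hexdigest()


def stable_content_id(data):
    # sort_keys=True makes the serialized text independent of key order.
    text = json.dumps(data, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The same advisory data, with keys in two different orders
# (placeholder values, not a real advisory).
as_created = {"summary": "example", "aliases": ["CVE-0000-0000"]}
as_reloaded = {"aliases": ["CVE-0000-0000"], "summary": "example"}

print(naive_content_id(as_created) == naive_content_id(as_reloaded))
print(stable_content_id(as_created) == stable_content_id(as_reloaded))
```

The first comparison is False and the second is True: the two dicts are equal as data, but only the key-sorted serialization yields the same digest for both.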