-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SPDX format support for SBOM #608
base: main
Are you sure you want to change the base?
Conversation
I thought I will eventually integrate the changes into new subcommand implemented here: #593 but it wasn't merged yet when I started working on this one |
5e78923
to
e64cec4
Compare
e64cec4
to
4a10a9d
Compare
typo in commit message and PR: s/SDPX/SPDX/ |
e08cfd6
to
0f308c0
Compare
Thanks for all the reviews and comments. I also wanted to add spdx support into merge-sboms subcommand but I noticed that in the tests you don't really test merged output. Or am I missing something? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly LGTM, a couple of minor changes is needed. Edit: you would also need to make sure UTs pass.
|
Pushing the change to collect early round of feedback on rework. There is still some work to do, mainly to add more tests and to restructure the history to make it more manageable. |
Tried this out on https://github.com/cachito-testing/gomod-pandemonium cachi2 fetch-deps --sbom-output-type spdx '[
{"type": "gomod"},
{"type": "gomod", "path": "terminaltor"},
{"type": "gomod", "path": "weird"}
]' Noticed a problem: Multiple purls for one packageThere are 15 packages with 2 purls (and curiously, none with more than 2) jq < cachi2-output/bom.json '.packages | map(select(.externalRefs | length >= 2))' Some interesting ones: crypto/rand vs. math/rand -
|
Then I also tried out SBOM merging on https://github.com/cachito-testing/pip-e2e-test cachi2 fetch-deps --sbom-output-type spdx pip
cd ..
# the two repos are next to each other
cachi2 merge-sboms --sbom-output-type spdx gomod-pandemonium/cachi2-output/bom.json pip-e2e-test/cachi2-output/bom.json > merged.bom.json First minor observation - the output SBOM type does not default to the input SBOM type, which feels weird Validity problem - the The individual SBOMs do have it, and the merged SBOM still references it in relationships: {
"spdxElementId": "SPDXRef-DOCUMENT",
"comment": "",
"relatedSpdxElement": "SPDXRef-DocumentRoot-File-",
"relationshipType": "DESCRIBES"
},
{
"spdxElementId": "SPDXRef-DocumentRoot-File-",
"comment": "",
"relatedSpdxElement": "SPDXRef-Package-vendor/golang.org/x/sys/cpu-None-fff45ff8a23684e4ba82a21914e5418abe0d79156f3d0d8f33925961edde6d19",
"relationshipType": "CONTAINS"
} But there's no such SPDXID in the |
For reference, how I found out about the issues: I'm currently working on the script that merges cachi2 SBOMs with syft SBOMs, and the cachi2 SBOM failed this check chmeliik/build-tasks-dockerfiles@d4ee394 |
Another annoying thing: # use any non-empty SBOM
cp cachi2-output/bom.json bom1.json
cp cachi2-output/bom.json bom2.json
cachi2 merge-sboms --sbom-output-type spdx bom1.json bom2.json > merged1.json
cachi2 merge-sboms --sbom-output-type spdx bom1.json bom2.json > merged2.json
diff merged1.json merged2.json
# created date is different, which is fine, and relationships order doesn't match, which is annoying |
Turns out the original implementation ignored
The current default is still CycloneDX, so it is assumed that if you don't say otherwise you expect to get CyDX as output. This could be changed to SPDX, but I am not sure trying to guess correct output type basing on input type is a good idea: what should cachi2 do when it is asked to merge one CyDX and one SPDX? |
Ah, I didn't know cachi2 supports that. The behaviour could be:
But defaulting to CycloneDX is fine too |
16c9f7b
to
da4e1cf
Compare
Good catch, thank you. Fixed.
Yes, it definitely could be, but I am more concerned about user perspective -- having a known default and a way to override it feels more straightforward than having some logic which sometimes works by itself and sometimes requires assistance. I think I'll keep the current behavior for now and will change it if there is more popular demand for that. |
63fa229
to
c229a34
Compare
Added SDPX format support for SBOM Support for SPDX format was added to fetch-depds command and also to merge_syft_sboms. No changes were made in particular package manager generating components which are then converted to cyclonedx format. SPDX sbom can be obtained by calling Sbom.to_spdx(). New switch sbom-type was added to merge_syfy_sboms, so user can choose which output format should be generated - default is cyclonedx. Once all tooling is ready to consume spdx sboms, cutoff changes in this repository can be started. SPDXRef-DocumentRoot-File- includes all spdx packages and is set to be described by SPDXRef-DOCUMENT. This way of spdx generation is closer to way syft generates spdx Included fixes from sbom validation tool - Added required documentNamespace attribute - Added creationInfo.created attribute - changed SPDXID-Package-* to SPDXRef-Package - changed annotationDate to not include ms - changed annotator to include prefix Tool: Co-authered-by: Alexey Ovchinnikov <aovchinn@redhat.com> Signed-off-by: Jindrich Luza <jluza@redhat.com>
Having merge_outputs in global utils can result in import loops in rather unexpected places. Moving it to a different scope to prevent those from appearing. Signed-off-by: Alexey Ovchinnikov <aovchinn@redhat.com>
This commit builds on top of proposed SPDX support and addresses the following: * simplifies merge logic to align it with design in konflux-ci/architecture#213; * adjusts the model to make it work with Syft generated SPDX Sboms; * adds tests for some of existing and some new SPDX-CyDX handling and merging cases. This is not the final form and more work still has to be done: * several important test cases are pending; * CyDX SBOM has to be amended to better fit into the new structure; * Decisions need to be additionally documented within the code; * This commit and its parent must be split into several parts to simplify repository history. Co-authored-by: Jindrich Luza <midnightercz@gmail.com> Signed-off-by: Alexey Ovchinnikov <aovchinn@redhat.com>
Adding test conditions and simplifying overall code structure. Signed-off-by: Alexey Ovchinnikov <aovchinn@redhat.com>
c229a34
to
e2daf82
Compare
""" | ||
|
||
def create_document_root() -> SPDXPackage: | ||
return SPDXPackage(name="", versionInfo="", SPDXID="SPDXRef-DocumentRoot-File-") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this element required by the spec? I tried to find references to it, but couldn't. I was also wondering why does it end with a hyphen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I found the reference for it in the Konflux SPDX design (which triggered this change here).
So the intention was to inform the folder that was processed to generate the SBOM. If so, I think we want to keep it as "SPDXRef-DocumentRoot-Directory-{dirname}", and maybe use the same dirname
for the "name" package attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might make more sense to have one "main element" for each "main package". So cachi2 fetch-deps '[{"type": "pip"}, {"type": "npm"}]'
would result in something like the example here https://github.com/konflux-ci/build-tasks-dockerfiles/tree/main/sbom-utility-scripts/scripts/add-image-reference-script#list-of-updates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that would need more changes I think, so a made-up main element is fine in the meantime.
But if you do use dirname
for the name attribute, please use ./{dirname}
to clearly make it look like a file rather than an actual package
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might make more sense to have one "main element" for each "main package
I like this idea, it could be a future improvement.
# mypy is upset by patrition_by being broadly typed. | ||
packages = [create_document_root()] + libs_to_packages(libraries) # type: ignore | ||
files = files_to_packages(cydxfiles) # type: ignore | ||
relationships = [create_root_relationship()] + link_to_root(packages) + link_to_root(files) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that the spec does not require these relationships to be informed (I also checked the official examples to try to figure this out).
Now, IIUC, we're saying that every dependency is contained by this "virtual" root package SPDXRef-DocumentRoot-File-
. I'm wondering if this is adding any value in the end, as it does add a lot of extra lines to the SBOM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am following a precedent set by Syft here. Since these relationships seem to be added by default I think it is best to keep them (there will be some interaction with Syft, at least for some time).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently, Syft does at least two types of relationships:
{
"spdxElementId": "SPDXRef-DocumentRoot-Directory--home-bpimente-Projects-test-repos-porta",
"relatedSpdxElement": "SPDXRef-Package-npm-babel-plugin-polyfill-corejs3-a77653646f05b06b",
"relationshipType": "CONTAINS"
}
And one for the file that described the dependency (e.g. package.json
):
{
"spdxElementId": "SPDXRef-Package-npm-string-decoder-14144cdbecdae249",
"relatedSpdxElement": "SPDXRef-File-yarn.lock-dcee26e98fae7b36",
"relationshipType": "OTHER",
"comment": "evident-by: indicates the package's existence is evident by the given file"
}
I'm not sure if we'd also want to add the latter at some point, but it does feel more useful than the first one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case, feel free to close this thread, it's a non-blocker.
A stricter rule is applied to packages equivalency determination. Now only those packages whos purls sets are identical will be considered identical, they will be reported separately otherwise. This change also resulted in a redefinition of one of test cases Signed-off-by: Alexey Ovchinnikov <aovchinn@redhat.com>
/ok-to-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't dare to comment on the overall design of this at this point in time without diving into the depths of the SPDX spec, even more so since 3 people have done that already. I can only comment on the overall structure of the PR, more specifically, what caught my eye as I was going through it and what I think would have helped me understand the topic without actually diving into the spec. To be fair, @a-ovchinnikov you've done a great job of describing majority of the functions/methods so that any non-intuitive logic is covered by an extensive commentary. That said, I think if you try to squash changes into the very first commit and introduce other changes more gradually as I tried to hint, second pass over this PR is going to be a much better experience for me such that I'll understand at least a fraction of what's actually going on.
) -> None: | ||
"""Process a single SPDX relationship. | ||
|
||
Add relatationship to merged relationships list while replacing spdxElementId and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/relatationship/relationship.
@@ -196,22 +195,3 @@ def get_cache_dir() -> Path: | |||
except KeyError: | |||
cache_dir = Path.home().joinpath(".cache") | |||
return cache_dir.joinpath("cachi2") | |||
|
|||
|
|||
def merge_outputs(outputs: Iterable[RequestOutput]) -> RequestOutput: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like a cleanup of a pre-existing behaviour not strictly related to the PR, so it could be the first commit, couldn't it?
@@ -101,124 +101,3 @@ def merge(self, other: "Self") -> "Self": | |||
pip_package_binary=self.pip_package_binary or other.pip_package_binary, | |||
bundler_package_binary=self.bundler_package_binary or other.bundler_package_binary, | |||
) | |||
|
|||
|
|||
def merge_relationships( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels really weird for a reviewer to have to read "initial implementation", trying to get a grasp of it only to realize a few commits later in the same patch series that much of it is simply dropped.
My guess is that this commit being proposed standalone is supposed to be a way of giving credit to the original author of "initial implementation" commit? Be it as it may and while a noble gesture, strictly from review POV these line dropping changes should be squashed to the original commit to reduce the confusion and declutter the overall diff.
The common practice is to just use a bunch of Signed-off-by
s with every contributor's name/email. It's not like you need to take the authorship away from commit 1. I believe readability of the changes in such a complex topic as adding SPDX support should take precedence over any noble gestures in the name of collaboration :).
def __hash__(self) -> int: | ||
return hash(self.algorithm + self.checksumValue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this to make instances hashable and therefore usable in maps/sets? I suppose so, but this is insufficient because the instances are still mutable, a single update to any of the values invalidates all places that such an instance would have been used as a hashable object. I know it may be unlikely, but I feel like there are enough models here where it might be quite hard to track down a potential violator. Can we make use of the frozen
config option for these models?
https://docs.pydantic.dev/latest/api/config/#pydantic.config.ConfigDict.frozen
cachi2/core/models/sbom.py
Outdated
@staticmethod | ||
def deduplicate_spdx_packages(items: Iterable[SPDXPackage]) -> list[SPDXPackage]: | ||
"""Deduplicate SPDX packages and merge external references. | ||
|
||
If package with same name and version is found multiple times in the list, | ||
merge external references of all the packages into one package. | ||
""" | ||
# NOTE: keeping this implementation mostly intact for historical reasons. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate, this implementation was done within this very PR so indirectly referring to it as "inferior" and only keeping it intact for historical reasons doesn't seem very convincing in general.
cachi2/core/models/sbom.py
Outdated
@@ -135,7 +139,7 @@ def to_spdx(self, doc_namespace: str) -> "SPDXSbom": | |||
for prop in component.properties: | |||
annotations.append( | |||
SPDXPackageAnnotation( | |||
annotator="Tool:cachi2:jsonencoded", | |||
annotator="Tool: cachi2:jsonencoded", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, is any of Tool
and jsonencoded
somewhat common/used in the SPDX ecosystem?
@@ -197,6 +198,12 @@ def get_cache_dir() -> Path: | |||
return cache_dir.joinpath("cachi2") | |||
|
|||
|
|||
def first(predicate: Callable, iterable: Iterable, fallback: Any) -> Any: | |||
def first_for(predicate: Callable, iterable: Iterable, fallback: Any) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like this change should happen in the commit which is introducing first
as utility function rather than renaming it later in the series.
def partition_by(predicate: Callable, iterable: Iterable) -> tuple[Iterable, Iterable]: | ||
"""Partition iterable in two by predicate.""" | ||
i1, i2 = tee(iterable) | ||
return filterfalse(predicate, i1), filter(predicate, i2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similarly to first
, feels like partition_by
could be a standalone patch OR at least paired with first
in a standalone patch without other logic around.
cachi2/interface/cli.py
Outdated
cyclonedx_sboms_to_merge = [] | ||
for _sbom in sboms_to_merge: | ||
if not isinstance(_sbom, Sbom): | ||
cyclonedx_sboms_to_merge.append(_sbom.to_cyclonedx()) | ||
else: | ||
cyclonedx_sboms_to_merge.append(_sbom) | ||
# TODO: merging SBOMs should be done uniformly via "+" | ||
sbom: Union[Sbom, SPDXSbom] = Sbom( | ||
components=merge_component_properties( | ||
chain.from_iterable(s.components for s in cyclonedx_sboms_to_merge) | ||
) | ||
) | ||
start_sbom = sboms_to_merge[0].to_cyclonedx() | ||
else: | ||
spdx_sboms_to_merge = [] | ||
for _sbom in sboms_to_merge: | ||
if not isinstance(_sbom, SPDXSbom): | ||
spdx_sboms_to_merge.append(_sbom.to_spdx(doc_namespace="NOASSERTION")) | ||
else: | ||
spdx_sboms_to_merge.append(_sbom) | ||
|
||
sbom = sum(spdx_sboms_to_merge, start=spdx_sboms_to_merge[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial implementation squash material?
PropertyName = Literal[ | ||
"cachi2:bundler:package:binary", | ||
"cachi2:found_by", | ||
"cachi2:missing_hash:in_file", | ||
"cachi2:pip:package:binary", | ||
"cdx:npm:package:bundled", | ||
"cdx:npm:package:development", | ||
] | ||
|
||
def merge_component_properties(components: Iterable[Component]) -> list[Component]: | ||
"""Sort and de-duplicate components while merging their `properties`.""" | ||
components = sorted(components, key=Component.key) | ||
grouped_components = groupby(components, key=Component.key) | ||
|
||
def merge_component_group(component_group: Iterable[Component]) -> Component: | ||
component_group = list(component_group) | ||
prop_sets = (PropertySet.from_properties(c.properties) for c in component_group) | ||
merged_prop_set = functools.reduce(PropertySet.merge, prop_sets) | ||
component = component_group[0] | ||
return component.model_copy(update={"properties": merged_prop_set.to_properties()}) | ||
class Property(pydantic.BaseModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This hunk isn't strictly SPDX related, is it? IOW Could we introduce it at the beginning of the series so that reviewers can see it and forget about it as they go through the PR?
Is it known how a property like this will be replicated for SPDX? "components": [
{
"externalReferences": [
{
"type": "distribution",
"url": "https://github.com/cachito-testing/cachi2-generic/archive/refs/tags/v2.0.0.zip"
}
],
"name": "archive.zip",
"properties": [
{
"name": "cachi2:found_by",
"value": "cachi2"
}
],
"purl": "pkg:generic/archive.zip?checksum=sha256:386428a82f37345fa24b74068e0e79f4c1f2ff38d4f5c106ea14de4a2926e584&download_url=https://github.com/cachito-testing/cachi2-generic/archive/refs/tags/v2.0.0.zip",
"type": "file"
},
] |
Support for SPDX format was added to fetch-depds command and also to merge_syft_sboms.
No changes were made in particular package manager generating components which are then converted to cyclonedx format. SPDX sbom can be obtained by calling Sbom.to_spdx().
New switch sbom-type was added to merge_syfy_sboms, so user can choose which output format should be generated - default is cyclonedx. Once all tooling is ready to consume spdx sboms, cutoff changes in this repository can be started.
Maintainers will complete the following section
Note: if the contribution is external (not from an organization member), the CI
pipeline will not run automatically. After verifying that the CI is safe to run:
/ok-to-test
(as is the standard for Pipelines as Code)