Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON output that groups aliases #29

Closed
another-rex opened this issue Dec 8, 2022 · 10 comments
Closed

JSON output that groups aliases #29

another-rex opened this issue Dec 8, 2022 · 10 comments
Labels
enhancement New feature or request

Comments

@another-rex
Copy link
Collaborator

Improve JSON output so that it encodes alias information so that the number of vulnerabilities are not artificially increased by the number of different advisories publishing the same vulnerability.

@another-rex another-rex added the enhancement New feature or request label Dec 8, 2022
@oliverchang
Copy link
Collaborator

Maybe we can output multiple IDs per item in the "vulnerabilities" output? These IDs would all be linked together by their aliases (i.e. grouped).

i.e.

{
  "results": [
    {
      "filePath": "sbom:file/path/test.spdx.json",
      "packages": [
        {
          "name": "mercurial",
          "version": "4.8.2",
          "ecosystem": "pypi",
          "vulnerabilities": [
            {
              "ids": ["PYSEC-2019-188", "GHSA-foo"],
              "aliases": [ "CVE-blah" ]
            },
            {
              "ids": ["PYSEC-2019-189", "GHSA-bar"],
              "aliases": [ "CVE-blah2" ]
            }
          ]
        }]
  }
  ]
} 

The distinction between the "ids" and "aliases" in the output would be that anything in "ids" would exist in the OSV.dev DB, while anything in "aliases" doesn't exist. But maybe we also don't need this distinction.

CC @G-Rath for thoughts too.

@G-Rath
Copy link
Collaborator

G-Rath commented Dec 8, 2022

@oliverchang my thinking is that vulnerabilities should have the whole OSV (which I assume it does right now, but maybe it doesn't?) in which case what you're proposing would go against the spec if I understand you correctly.

But it sounds like you're just wanting to deduplicate, which something we're already doing in the detector by checking against the id and aliases 😅

I guess what I'm saying is my thoughts are the point of aliases is that the two entities should be considered equivalent and so really have the same information in the long-run, meaning there shouldn't be a reason to keep both OSVs in the output - so you should be fine to just deduplicate.

@G-Rath
Copy link
Collaborator

G-Rath commented Dec 8, 2022

one optimization that does come to mind though would be if the API included the aliases in its bulk endpoint response, so that tools could avoid hydrating OSVs that it is just going to de-duplicate away because they're aliases.

@oliverchang
Copy link
Collaborator

We are also already grouping by aliases as well in the human readable output -- just not in the JSON.

Re deduplication -- I'm not sure if we want to make the decision on which source to present, so we'd like to give the option to the user and present every option.

The vulnerabilities part currently does not have the whole OSV. There's a good point about including the full OSV result in the JSON though -- perhaps we could do something like:

{
  "osvs": {
    "PYSEC-2019-188": { full OSV },
    "PYSEC-2019-189": { full OSV },
    "GHSA-foo": { full OSV },
    "GHSA-bar": { full OSV },
  },
  "matches": [
    {
      "filePath": "sbom:file/path/test.spdx.json",
      "packages": [
        {
          "name": "mercurial",
          "version": "4.8.2",
          "ecosystem": "pypi",
          "vulnerabilities": [
            {
              "ids": ["PYSEC-2019-188", "GHSA-foo"],
              "aliases": [ "CVE-blah" ]
            },
            {
              "ids": ["PYSEC-2019-189", "GHSA-bar"],
              "aliases": [ "CVE-blah2" ]
            }
          ]
        }]
  }
  ]
} 

one optimization that does come to mind though would be if the API included the aliases in its bulk endpoint response, so that tools could avoid hydrating OSVs that it is just going to de-duplicate away because they're aliases.

That sounds like something that's quite easily doable.

@another-rex
Copy link
Collaborator Author

my thinking is that vulnerabilities should have the whole OSV (which I assume it does right now, but maybe it doesn't?)

Right now vulnerabilities only contains

id string
aliases []string

but as Oliver says, it's definitely a good idea to output the full OSV, since we are already querying for it anyway.

The distinction between the "ids" and "aliases" in the output would be that anything in "ids" would exist in the OSV.dev DB, while anything in "aliases" doesn't exist. But maybe we also don't need this distinction.

We might run into the problem presented here: google/osv.dev#888 where two different vulnerabilities are grouped together and presented as just one because they both alias a common vulnerability.

one optimization that does come to mind though would be if the API included the aliases in its bulk endpoint response, so that tools could avoid hydrating OSVs that it is just going to de-duplicate away because they're aliases.

That sounds like something that's quite easily doable.

If we are presenting the full OSV in the output anyway, this won't be needed anymore since it doesn't save any additional queries.

@oliverchang
Copy link
Collaborator

We might run into the problem presented here: google/osv.dev#888 where two different vulnerabilities are grouped together and presented as just one because they both alias a common vulnerability.

Would this be a real problem in practice? The vulns considered for grouping are the ones that match a particular package/version for the same lockfile right?

@oliverchang
Copy link
Collaborator

oliverchang commented Dec 8, 2022

Also another alternative approach to the output:

{
  "results": [
    {
      "filePath": "sbom:file/path/test.spdx.json",
      "packages": [
        {
          "name": "mercurial",
          "version": "4.8.2",
          "ecosystem": "pypi",
          "vulnerabilities": [
            {
             full OSV 
            },
            {
              full OSV
            },
           ...
          ],
          "groups": [
            {"ids": ["PYSEC-2019-188", "GHSA-foo"]},
            {"ids": ["PYSEC-2019-189", "GHSA-bar"]},
          ]
    }
  ]
} 

This might be a little better because it doesn't overload "aliases", and "vulnerabilities" contain the exact OSV entries which seems more logical.

Users wanting an accurate count of vulnerabilities can just do len(groups).

@G-Rath
Copy link
Collaborator

G-Rath commented Dec 8, 2022

I'm not sure if we want to make the decision on which source to present, so we'd like to give the option to the user and present every option.

I think I disagree because that feels like you're forcing more work onto the user when the acknowledged goal is for unity/consistency (aka the same information should be on all OSVs that are for the same vulnerability, in the long-run) - and likewise with some of the other suggestions being presented, it feels like it's putting more work on the user that could easily be done by the tool.

This is probably more about what is the point of the tool: is it just to be a local client to interface with the API? (in which case yes including all OSVs and having the user sort out which they use is the right thing), or is it to be a tool for using the API? (in which case imo it should be more opinionated, to reduce the amount of work users need to do).

I think also keeping in mind the API exists right? so users should already have a very easy way to get the OSV data for aliases if they wish.

@oliverchang
Copy link
Collaborator

I'm not sure if we want to make the decision on which source to present, so we'd like to give the option to the user and present every option.

I think I disagree because that feels like you're forcing more work onto the user when the acknowledged goal is for unity/consistency (aka the same information should be on all OSVs that are for the same vulnerability, in the long-run) - and likewise with some of the other suggestions being presented, it feels like it's putting more work on the user that could easily be done by the tool.

This is probably more about what is the point of the tool: is it just to be a local client to interface with the API? (in which case yes including all OSVs and having the user sort out which they use is the right thing), or is it to be a tool for using the API? (in which case imo it should be more opinionated, to reduce the amount of work users need to do).

I think also keeping in mind the API exists right? so users should already have a very easy way to get the OSV data for aliases if they wish.

Indeed we want to reduce the work required by the user of this too. However, I don't think OSV should be the one making a judgement on which source is better, particularly when there may be small discrepancies.

There's a few steps we can take towards making this workflow easier:

  1. Extract the common action the end user needs to take from an alias group based. e.g. on "Upgrade to version X". The full OSV variants are there for users to get context/info on the vulnerability itself.
  2. Let the user provide a priority list of which sources to prefer (possibly per ecosystem).
  3. Automatically generate patches based on satisfying every OSV match. This is the ultimate end goal we have, which will eliminate all of the manual work required.

@G-Rath
Copy link
Collaborator

G-Rath commented Dec 9, 2022

Let the user provide a priority list of which sources to prefer (possibly per ecosystem).

ooh I really like this because it can be used in other places i.e. preferring to link to GHSA advisories, because all of the dev's repos are on GH and that way they can easily see what other projects are affected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants