JSON output that groups aliases #29

another-rex · 2022-12-08T00:55:04Z

Improve JSON output so that it encodes alias information so that the number of vulnerabilities are not artificially increased by the number of different advisories publishing the same vulnerability.

oliverchang · 2022-12-08T01:11:13Z

Maybe we can output multiple IDs per item in the "vulnerabilities" output? These IDs would all be linked together by their aliases (i.e. grouped).

i.e.

{
  "results": [
    {
      "filePath": "sbom:file/path/test.spdx.json",
      "packages": [
        {
          "name": "mercurial",
          "version": "4.8.2",
          "ecosystem": "pypi",
          "vulnerabilities": [
            {
              "ids": ["PYSEC-2019-188", "GHSA-foo"],
              "aliases": [ "CVE-blah" ]
            },
            {
              "ids": ["PYSEC-2019-189", "GHSA-bar"],
              "aliases": [ "CVE-blah2" ]
            }
          ]
        }]
  }
  ]
}

The distinction between the "ids" and "aliases" in the output would be that anything in "ids" would exist in the OSV.dev DB, while anything in "aliases" doesn't exist. But maybe we also don't need this distinction.

CC @G-Rath for thoughts too.

G-Rath · 2022-12-08T01:25:45Z

@oliverchang my thinking is that vulnerabilities should have the whole OSV (which I assume it does right now, but maybe it doesn't?) in which case what you're proposing would go against the spec if I understand you correctly.

But it sounds like you're just wanting to deduplicate, which something we're already doing in the detector by checking against the id and aliases 😅

I guess what I'm saying is my thoughts are the point of aliases is that the two entities should be considered equivalent and so really have the same information in the long-run, meaning there shouldn't be a reason to keep both OSVs in the output - so you should be fine to just deduplicate.

G-Rath · 2022-12-08T01:27:45Z

one optimization that does come to mind though would be if the API included the aliases in its bulk endpoint response, so that tools could avoid hydrating OSVs that it is just going to de-duplicate away because they're aliases.

oliverchang · 2022-12-08T02:08:33Z

We are also already grouping by aliases as well in the human readable output -- just not in the JSON.

Re deduplication -- I'm not sure if we want to make the decision on which source to present, so we'd like to give the option to the user and present every option.

The vulnerabilities part currently does not have the whole OSV. There's a good point about including the full OSV result in the JSON though -- perhaps we could do something like:

{
  "osvs": {
    "PYSEC-2019-188": { full OSV },
    "PYSEC-2019-189": { full OSV },
    "GHSA-foo": { full OSV },
    "GHSA-bar": { full OSV },
  },
  "matches": [
    {
      "filePath": "sbom:file/path/test.spdx.json",
      "packages": [
        {
          "name": "mercurial",
          "version": "4.8.2",
          "ecosystem": "pypi",
          "vulnerabilities": [
            {
              "ids": ["PYSEC-2019-188", "GHSA-foo"],
              "aliases": [ "CVE-blah" ]
            },
            {
              "ids": ["PYSEC-2019-189", "GHSA-bar"],
              "aliases": [ "CVE-blah2" ]
            }
          ]
        }]
  }
  ]
}

one optimization that does come to mind though would be if the API included the aliases in its bulk endpoint response, so that tools could avoid hydrating OSVs that it is just going to de-duplicate away because they're aliases.

That sounds like something that's quite easily doable.

another-rex · 2022-12-08T02:17:11Z

my thinking is that vulnerabilities should have the whole OSV (which I assume it does right now, but maybe it doesn't?)

Right now vulnerabilities only contains

id string
aliases []string

but as Oliver says, it's definitely a good idea to output the full OSV, since we are already querying for it anyway.

The distinction between the "ids" and "aliases" in the output would be that anything in "ids" would exist in the OSV.dev DB, while anything in "aliases" doesn't exist. But maybe we also don't need this distinction.

We might run into the problem presented here: google/osv.dev#888 where two different vulnerabilities are grouped together and presented as just one because they both alias a common vulnerability.

one optimization that does come to mind though would be if the API included the aliases in its bulk endpoint response, so that tools could avoid hydrating OSVs that it is just going to de-duplicate away because they're aliases.

That sounds like something that's quite easily doable.

If we are presenting the full OSV in the output anyway, this won't be needed anymore since it doesn't save any additional queries.

oliverchang · 2022-12-08T02:21:43Z

We might run into the problem presented here: google/osv.dev#888 where two different vulnerabilities are grouped together and presented as just one because they both alias a common vulnerability.

Would this be a real problem in practice? The vulns considered for grouping are the ones that match a particular package/version for the same lockfile right?

oliverchang · 2022-12-08T02:24:44Z

Also another alternative approach to the output:

{
  "results": [
    {
      "filePath": "sbom:file/path/test.spdx.json",
      "packages": [
        {
          "name": "mercurial",
          "version": "4.8.2",
          "ecosystem": "pypi",
          "vulnerabilities": [
            {
             full OSV 
            },
            {
              full OSV
            },
           ...
          ],
          "groups": [
            {"ids": ["PYSEC-2019-188", "GHSA-foo"]},
            {"ids": ["PYSEC-2019-189", "GHSA-bar"]},
          ]
    }
  ]
}

This might be a little better because it doesn't overload "aliases", and "vulnerabilities" contain the exact OSV entries which seems more logical.

Users wanting an accurate count of vulnerabilities can just do len(groups).

G-Rath · 2022-12-08T02:34:05Z

I'm not sure if we want to make the decision on which source to present, so we'd like to give the option to the user and present every option.

I think I disagree because that feels like you're forcing more work onto the user when the acknowledged goal is for unity/consistency (aka the same information should be on all OSVs that are for the same vulnerability, in the long-run) - and likewise with some of the other suggestions being presented, it feels like it's putting more work on the user that could easily be done by the tool.

This is probably more about what is the point of the tool: is it just to be a local client to interface with the API? (in which case yes including all OSVs and having the user sort out which they use is the right thing), or is it to be a tool for using the API? (in which case imo it should be more opinionated, to reduce the amount of work users need to do).

I think also keeping in mind the API exists right? so users should already have a very easy way to get the OSV data for aliases if they wish.

oliverchang · 2022-12-08T03:24:55Z

I'm not sure if we want to make the decision on which source to present, so we'd like to give the option to the user and present every option.

I think I disagree because that feels like you're forcing more work onto the user when the acknowledged goal is for unity/consistency (aka the same information should be on all OSVs that are for the same vulnerability, in the long-run) - and likewise with some of the other suggestions being presented, it feels like it's putting more work on the user that could easily be done by the tool.

This is probably more about what is the point of the tool: is it just to be a local client to interface with the API? (in which case yes including all OSVs and having the user sort out which they use is the right thing), or is it to be a tool for using the API? (in which case imo it should be more opinionated, to reduce the amount of work users need to do).

I think also keeping in mind the API exists right? so users should already have a very easy way to get the OSV data for aliases if they wish.

Indeed we want to reduce the work required by the user of this too. However, I don't think OSV should be the one making a judgement on which source is better, particularly when there may be small discrepancies.

There's a few steps we can take towards making this workflow easier:

Extract the common action the end user needs to take from an alias group based. e.g. on "Upgrade to version X". The full OSV variants are there for users to get context/info on the vulnerability itself.
Let the user provide a priority list of which sources to prefer (possibly per ecosystem).
Automatically generate patches based on satisfying every OSV match. This is the ultimate end goal we have, which will eliminate all of the manual work required.

G-Rath · 2022-12-09T03:13:01Z

Let the user provide a priority list of which sources to prefer (possibly per ecosystem).

ooh I really like this because it can be used in other places i.e. preferring to link to GHSA advisories, because all of the dev's repos are on GH and that way they can easily see what other projects are affected.

another-rex added the enhancement New feature or request label Dec 8, 2022

another-rex mentioned this issue Dec 11, 2022

Group json output #33

Merged

another-rex closed this as completed Dec 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JSON output that groups aliases #29

JSON output that groups aliases #29

another-rex commented Dec 8, 2022

oliverchang commented Dec 8, 2022

G-Rath commented Dec 8, 2022

G-Rath commented Dec 8, 2022

oliverchang commented Dec 8, 2022

another-rex commented Dec 8, 2022

oliverchang commented Dec 8, 2022

oliverchang commented Dec 8, 2022 •

edited

Loading

G-Rath commented Dec 8, 2022 •

edited

Loading

oliverchang commented Dec 8, 2022

G-Rath commented Dec 9, 2022

JSON output that groups aliases #29

JSON output that groups aliases #29

Comments

another-rex commented Dec 8, 2022

oliverchang commented Dec 8, 2022

G-Rath commented Dec 8, 2022

G-Rath commented Dec 8, 2022

oliverchang commented Dec 8, 2022

another-rex commented Dec 8, 2022

oliverchang commented Dec 8, 2022

oliverchang commented Dec 8, 2022 • edited Loading

G-Rath commented Dec 8, 2022 • edited Loading

oliverchang commented Dec 8, 2022

G-Rath commented Dec 9, 2022

oliverchang commented Dec 8, 2022 •

edited

Loading

G-Rath commented Dec 8, 2022 •

edited

Loading