Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Optionally support regular expressions when applying ignore rules for package names/package upstream names #2445

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

adammcclenaghan
Copy link

Description

Hello folks 👋

In #1809 both ifPackageNameApplies and ifUpstreamPackageNameApplies were updated to use regular expressions for all rule names/upstream names. This was recently updated in #2320 too.

So far as I can see from the codebase only the linux kernel headers ignore rules require regular expression matching.

In my testing I've found that FindMatches is significantly slowed down by the amount of regular expression compilation being performed.

For one of my test scans with about ~60k packages, using this branch results in a 2x speedup of FindMatches with ~8GB less cumulative allocations

Before

Screenshot 2025-02-13 at 11 16 29

After

Screenshot 2025-02-13 at 11 21 11

For now, I've modified ifPackageNameApplies and ifUpstreamPackageNameApplies to handle the linux-headers as special cases.

In future if we wish to support regular expressions more broadly perhaps there should be a nicer interface passed into the likes of ifPackageNameApplies and ifUpstreamPackageNameApplies rather than a string, and this interface could define whether or not to treat the match type as a regex etc...

I've had to make some changes to the unit tests.

I've removed two tests which seem to be testing behaviour that Grype doesn't have configuration for:

  • there are no existing ignore rules which define kernel-headers.* so I've removed the test ignore on name regex
  • there are no existing ignore rules which define regular expresions that start and end with anchor tags, so I've removed the test ignore on name regex, line termination test match

I've also modified some of the tests to match the actual patterns set out in Grype, namely:
-- linux-.*-headers-.* is now declared linux(-.*)?-headers-.*
-- linux-.* is now defined as linux.* .. as a result of fixing this I've also updated the upstream name from linux to notalinux for the test case. The ignore rules defined today would actually ignore linux, the only reason the test was passing before is because it defined the regex as linux-.*

In lieue of a larger refactor to have a nicer interface for ifPackageNameApplies/ifUpstreamPackageNameApplies which I've previously mentioned, it would at least perhaps be nice to declare the linux(-.*)?-headers-.* and linux.* as a const in some shared place.

If the maintainers can suggest somewhere that would be a good fit, I'd be happy to create this.

It looks like only the linux kernel headers ignore rules
make use of pattern matching.

Create and use regex patterns for these cases only to reduce
memory allocations from pattern.compile and improve FindMatches performance.

In future if we wish to support regular expressions more broadly perhaps
then perhaps ignore rules should have some way to define whether they
should be treated as a regular expression or not.

Signed-off-by: Adam McClenaghan <adam@mcclenaghan.co.uk>
Signed-off-by: Adam McClenaghan <adam@mcclenaghan.co.uk>
Signed-off-by: Adam McClenaghan <adam@mcclenaghan.co.uk>
pattern, err := packageNameRegex(name)
if err != nil {
return func(Match) bool { return false }
if name == "linux(-.*)?-headers-.*" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was my hope that we would move these ignore rules to be data driven at some point, but by hardcoding them in the functions here, this seems to move in the opposite direction. I'm having a hard time understanding where the time/allocation is: should we just check to see if strings.Contains(name, ".*") here to avoid making patterns and return a different function which just does a string == comparison?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was my hope that we would move these ignore rules to be data driven at some point, but by hardcoding them in the functions here, this seems to move in the opposite direction

Yes this seems like a good way to do things, however the fact will still remain that some of the ignore rules (as far as I can tell just one currently) need to be treated as regex while the vast majority just need to be string equals comparisons.

In this data driven approach we could move towards a more 'robust' solution similar to the one I allude to in the PR description where the data itself lets callers know whether to treat the rule as a regular expression or not.

I'm having a hard time understanding where the time/allocation is

I could have explained that better and just realised my PR description screenshots maybe don't make it totally clear.

This pprof diagram might be a bit more helpful

Screenshot 2025-02-13 at 15 55 47

You can see that ApplyExplicitIgnoreRules perform a lot of allocations.

If we look at top10 we can see this as well

(pprof) top10
Showing nodes accounting for 10164.59MB, 62.40% of 16288.23MB total
Dropped 376 nodes (cum <= 81.44MB)
Showing top 10 nodes out of 140
      flat  flat%   sum%        cum   cum%
 3203.41MB 19.67% 19.67%  3203.41MB 19.67%  regexp/syntax.(*compiler).inst (inline)
 1913.64MB 11.75% 31.42%  1913.64MB 11.75%  regexp.onePassCopy
 1023.75MB  6.29% 37.70%  1023.75MB  6.29%  encoding/json.(*Decoder).refill
  882.94MB  5.42% 43.12%  9653.96MB 59.27%  github.com/anchore/grype/grype/match.ApplyExplicitIgnoreRules

And we can jump to the problem with:

(pprof) list github.com/anchore/grype/grype/match.ApplyExplicitIgnoreRules
Total: 15.91GB
ROUTINE ======================== github.com/anchore/grype/grype/match.ApplyExplicitIgnoreRules in /Users/adammcclenaghan/go/pkg/mod/github.com/anchore/grype@v0.87.0/grype/match/explicit_ignores.go
  882.94MB     9.43GB (flat, cum) 59.27% of Total
       2MB        2MB     72:func ApplyExplicitIgnoreRules(provider ExclusionProvider, matches Matches) (Matches, []IgnoredMatch) {
         .          .     73:	var ignoreRules []IgnoreRule
  880.94MB   880.94MB     74:	ignoreRules = append(ignoreRules, explicitIgnoreRules...)
         .          .     75:
         .    30.60MB     76:	for _, m := range matches.Sorted() {
         .   101.06MB     77:		r, err := provider.GetRules(m.Vulnerability.ID)
         .          .     78:
         .          .     79:		if err != nil {
         .          .     80:			log.Warnf("unable to get ignore rules for vuln id=%s", m.Vulnerability.ID)
         .          .     81:			continue
         .          .     82:		}
         .          .     83:
         .          .     84:		ignoreRules = append(ignoreRules, r...)
         .          .     85:	}
         .          .     86:
         .     8.44GB     87:	return ApplyIgnoreRules(matches, ignoreRules)
         .          .     88:}

The majority of allocations are happening here:

         .     8.44GB     87:	return ApplyIgnoreRules(matches, ignoreRules)

To step through the code that gets us here...

  1. For each package here and matcher we apply the defined ignore rules . When there's a lot of packages, we'll call this a lot.
  2. All of the defined ignore rules then get applied here. Keep in mind no patterns have been compiled yet.
  3. Calls here
  4. If you follow through, we get to here. (Recap on context, this line gets called (numPackage * numMatches * rules) times and importantly it calls filter.IgnoreMatch(match)... that many times...
  5. If we follow into filter.IgnoreMatch it calls here getIgnoreConditionsForRule(r)
  6. And getIgnoreConditionsForRule is where the allocations are happening. This is because every time it gets called it's compiling a pattern for the rule. So we compile regex patterns (Num Packages * Matches * Rules) times.

We can confirm through the heap profile:

(pprof) list getIgnoreConditionsForRule
Total: 15.91GB
ROUTINE ======================== github.com/anchore/grype/grype/match.getIgnoreConditionsForRule in /Users/adammcclenaghan/go/pkg/mod/github.com/anchore/grype@v0.87.0/grype/match/ignore.go
     106MB     8.35GB (flat, cum) 52.48% of Total
         .          .    111:func getIgnoreConditionsForRule(rule IgnoreRule) []ignoreCondition {
         .          .    112:	var ignoreConditions []ignoreCondition
         .          .    113:
         .          .    114:	if v := rule.Vulnerability; v != "" {
      14MB    54.50MB    115:		ignoreConditions = append(ignoreConditions, ifVulnerabilityApplies(v))
         .          .    116:	}
         .          .    117:
         .          .    118:	if ns := rule.Namespace; ns != "" {
         .          .    119:		ignoreConditions = append(ignoreConditions, ifNamespaceApplies(ns))
         .          .    120:	}
         .          .    121:
         .          .    122:	if n := rule.Package.Name; n != "" {
      32MB     8.19GB    123:		ignoreConditions = append(ignoreConditions, ifPackageNameApplies(n))
         .          .    124:	}
         .          .    125:
         .          .    126:	if v := rule.Package.Version; v != "" {
         .          .    127:		ignoreConditions = append(ignoreConditions, ifPackageVersionApplies(v))
         .          .    128:	}
         .          .    129:
         .          .    130:	if l := rule.Package.Language; l != "" {
         .          .    131:		ignoreConditions = append(ignoreConditions, ifPackageLanguageApplies(l))
         .          .    132:	}
         .          .    133:
         .          .    134:	if t := rule.Package.Type; t != "" {
      60MB   102.50MB    135:		ignoreConditions = append(ignoreConditions, ifPackageTypeApplies(t))
         .          .    136:	}
         .          .    137:
         .          .    138:	if l := rule.Package.Location; l != "" {
         .          .    139:		ignoreConditions = append(ignoreConditions, ifPackageLocationApplies(l))
         .          .    140:	}

Here you see the largest offender is the package name rules being re-compiled:

         .          .    122:	if n := rule.Package.Name; n != "" {
      32MB     8.19GB    123:		ignoreConditions = append(ignoreConditions, ifPackageNameApplies(n))
         .          .    124:	}

And just to see it all out through to the end:

(pprof) list ifPackageNameApplies
Total: 15.91GB
ROUTINE ======================== github.com/anchore/grype/grype/match.ifPackageNameApplies in /Users/adammcclenaghan/go/pkg/mod/github.com/anchore/grype@v0.87.0/grype/match/ignore.go
   22.50MB     8.16GB (flat, cum) 51.32% of Total
         .          .    182:func ifPackageNameApplies(name string) ignoreCondition {
         .     8.14GB    183:	pattern, err := packageNameRegex(name)
         .          .    184:	if err != nil {
         .          .    185:		return func(Match) bool { return false }
         .          .    186:	}
         .          .    187:
   22.50MB    22.50MB    188:	return func(match Match) bool {
         .          .    189:		return pattern.MatchString(match.Package.Name)
         .          .    190:	}
         .          .    191:}
         .          .    192:
         .          .    193:func ifPackageVersionApplies(version string) ignoreCondition {

And the compile calls from packageNameRegex:

(pprof) list packageNameRegex
Total: 15.91GB
ROUTINE ======================== github.com/anchore/grype/grype/match.packageNameRegex in /Users/adammcclenaghan/go/pkg/mod/github.com/anchore/grype@v0.87.0/grype/match/ignore.go
   33.50MB     8.14GB (flat, cum) 51.18% of Total
         .          .    174:func packageNameRegex(packageName string) (*regexp.Regexp, error) {
         .          .    175:	pattern := packageName
         .          .    176:	if packageName[0] != '$' || packageName[len(packageName)-1] != '^' {
   33.50MB    33.50MB    177:		pattern = "^" + packageName + "$"
         .          .    178:	}
         .     8.11GB    179:	return regexp.Compile(pattern)
         .          .    180:}
         .          .    181:
         .          .    182:func ifPackageNameApplies(name string) ignoreCondition {
         .          .    183:	pattern, err := packageNameRegex(name)
         .          .    184:	if err != nil {

Wrt time the problem is that the amount of short lived allocations causes a lot of work for the GC as expected:
Screenshot 2025-02-13 at 16 05 31


should we just check to see if strings.Contains(name, ".*") here to avoid making patterns and return a different function which just does a string == comparison

I may be misunderstanding, are you suggesting that we use a strings.Contains(name, ".*") check to determine if we should perform pattern matching, otherwise perform the == comparison?

The problem is if we do this to try to support regular expressions more broadly, it does not go far enough. There are a whole host of other symbols we'd have to check for to determine if the rule is a valid regular expression, in fact the best way to check would be to compile the regex in the first place (which we are trying to avoid)

I think that while it looks a little ugly to have the hard coded definitions here, it may make the intent clearer to readers for now.. Its also why I suggested that we perhaps use a shared const to make it even clearer that the linux kernel headers are the exception to the rule (for now)

As mentioned, a more proper solution is to have rules surface how they should be treated and how they should be matched against. However I felt doing that is beyond the scope of this fix for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants