-
Notifications
You must be signed in to change notification settings - Fork 604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Optionally support regular expressions when applying ignore rules for package names/package upstream names #2445
base: main
Are you sure you want to change the base?
Conversation
It looks like only the linux kernel headers ignore rules make use of pattern matching. Create and use regex patterns for these cases only to reduce memory allocations from pattern.compile and improve FindMatches performance. In future if we wish to support regular expressions more broadly perhaps then perhaps ignore rules should have some way to define whether they should be treated as a regular expression or not. Signed-off-by: Adam McClenaghan <adam@mcclenaghan.co.uk>
Signed-off-by: Adam McClenaghan <adam@mcclenaghan.co.uk>
Signed-off-by: Adam McClenaghan <adam@mcclenaghan.co.uk>
pattern, err := packageNameRegex(name) | ||
if err != nil { | ||
return func(Match) bool { return false } | ||
if name == "linux(-.*)?-headers-.*" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was my hope that we would move these ignore rules to be data driven at some point, but by hardcoding them in the functions here, this seems to move in the opposite direction. I'm having a hard time understanding where the time/allocation is: should we just check to see if strings.Contains(name, ".*")
here to avoid making patterns and return a different function which just does a string ==
comparison?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was my hope that we would move these ignore rules to be data driven at some point, but by hardcoding them in the functions here, this seems to move in the opposite direction
Yes this seems like a good way to do things, however the fact will still remain that some of the ignore rules (as far as I can tell just one currently) need to be treated as regex while the vast majority just need to be string equals comparisons.
In this data driven approach we could move towards a more 'robust' solution similar to the one I allude to in the PR description where the data itself lets callers know whether to treat the rule as a regular expression or not.
I'm having a hard time understanding where the time/allocation is
I could have explained that better and just realised my PR description screenshots maybe don't make it totally clear.
This pprof diagram might be a bit more helpful

You can see that ApplyExplicitIgnoreRules
perform a lot of allocations.
If we look at top10
we can see this as well
(pprof) top10
Showing nodes accounting for 10164.59MB, 62.40% of 16288.23MB total
Dropped 376 nodes (cum <= 81.44MB)
Showing top 10 nodes out of 140
flat flat% sum% cum cum%
3203.41MB 19.67% 19.67% 3203.41MB 19.67% regexp/syntax.(*compiler).inst (inline)
1913.64MB 11.75% 31.42% 1913.64MB 11.75% regexp.onePassCopy
1023.75MB 6.29% 37.70% 1023.75MB 6.29% encoding/json.(*Decoder).refill
882.94MB 5.42% 43.12% 9653.96MB 59.27% github.com/anchore/grype/grype/match.ApplyExplicitIgnoreRules
And we can jump to the problem with:
(pprof) list github.com/anchore/grype/grype/match.ApplyExplicitIgnoreRules
Total: 15.91GB
ROUTINE ======================== github.com/anchore/grype/grype/match.ApplyExplicitIgnoreRules in /Users/adammcclenaghan/go/pkg/mod/github.com/anchore/grype@v0.87.0/grype/match/explicit_ignores.go
882.94MB 9.43GB (flat, cum) 59.27% of Total
2MB 2MB 72:func ApplyExplicitIgnoreRules(provider ExclusionProvider, matches Matches) (Matches, []IgnoredMatch) {
. . 73: var ignoreRules []IgnoreRule
880.94MB 880.94MB 74: ignoreRules = append(ignoreRules, explicitIgnoreRules...)
. . 75:
. 30.60MB 76: for _, m := range matches.Sorted() {
. 101.06MB 77: r, err := provider.GetRules(m.Vulnerability.ID)
. . 78:
. . 79: if err != nil {
. . 80: log.Warnf("unable to get ignore rules for vuln id=%s", m.Vulnerability.ID)
. . 81: continue
. . 82: }
. . 83:
. . 84: ignoreRules = append(ignoreRules, r...)
. . 85: }
. . 86:
. 8.44GB 87: return ApplyIgnoreRules(matches, ignoreRules)
. . 88:}
The majority of allocations are happening here:
. 8.44GB 87: return ApplyIgnoreRules(matches, ignoreRules)
To step through the code that gets us here...
- For each package here and matcher we apply the defined ignore rules . When there's a lot of packages, we'll call this a lot.
- All of the defined ignore rules then get applied here. Keep in mind no patterns have been compiled yet.
- Calls here
- If you follow through, we get to here. (Recap on context, this line gets called (numPackage * numMatches * rules) times and importantly it calls
filter.IgnoreMatch(match)...
that many times... - If we follow into
filter.IgnoreMatch
it calls heregetIgnoreConditionsForRule(r)
- And
getIgnoreConditionsForRule
is where the allocations are happening. This is because every time it gets called it's compiling a pattern for the rule. So we compile regex patterns (Num Packages * Matches * Rules) times.
We can confirm through the heap profile:
(pprof) list getIgnoreConditionsForRule
Total: 15.91GB
ROUTINE ======================== github.com/anchore/grype/grype/match.getIgnoreConditionsForRule in /Users/adammcclenaghan/go/pkg/mod/github.com/anchore/grype@v0.87.0/grype/match/ignore.go
106MB 8.35GB (flat, cum) 52.48% of Total
. . 111:func getIgnoreConditionsForRule(rule IgnoreRule) []ignoreCondition {
. . 112: var ignoreConditions []ignoreCondition
. . 113:
. . 114: if v := rule.Vulnerability; v != "" {
14MB 54.50MB 115: ignoreConditions = append(ignoreConditions, ifVulnerabilityApplies(v))
. . 116: }
. . 117:
. . 118: if ns := rule.Namespace; ns != "" {
. . 119: ignoreConditions = append(ignoreConditions, ifNamespaceApplies(ns))
. . 120: }
. . 121:
. . 122: if n := rule.Package.Name; n != "" {
32MB 8.19GB 123: ignoreConditions = append(ignoreConditions, ifPackageNameApplies(n))
. . 124: }
. . 125:
. . 126: if v := rule.Package.Version; v != "" {
. . 127: ignoreConditions = append(ignoreConditions, ifPackageVersionApplies(v))
. . 128: }
. . 129:
. . 130: if l := rule.Package.Language; l != "" {
. . 131: ignoreConditions = append(ignoreConditions, ifPackageLanguageApplies(l))
. . 132: }
. . 133:
. . 134: if t := rule.Package.Type; t != "" {
60MB 102.50MB 135: ignoreConditions = append(ignoreConditions, ifPackageTypeApplies(t))
. . 136: }
. . 137:
. . 138: if l := rule.Package.Location; l != "" {
. . 139: ignoreConditions = append(ignoreConditions, ifPackageLocationApplies(l))
. . 140: }
Here you see the largest offender is the package name rules being re-compiled:
. . 122: if n := rule.Package.Name; n != "" {
32MB 8.19GB 123: ignoreConditions = append(ignoreConditions, ifPackageNameApplies(n))
. . 124: }
And just to see it all out through to the end:
(pprof) list ifPackageNameApplies
Total: 15.91GB
ROUTINE ======================== github.com/anchore/grype/grype/match.ifPackageNameApplies in /Users/adammcclenaghan/go/pkg/mod/github.com/anchore/grype@v0.87.0/grype/match/ignore.go
22.50MB 8.16GB (flat, cum) 51.32% of Total
. . 182:func ifPackageNameApplies(name string) ignoreCondition {
. 8.14GB 183: pattern, err := packageNameRegex(name)
. . 184: if err != nil {
. . 185: return func(Match) bool { return false }
. . 186: }
. . 187:
22.50MB 22.50MB 188: return func(match Match) bool {
. . 189: return pattern.MatchString(match.Package.Name)
. . 190: }
. . 191:}
. . 192:
. . 193:func ifPackageVersionApplies(version string) ignoreCondition {
And the compile calls from packageNameRegex
:
(pprof) list packageNameRegex
Total: 15.91GB
ROUTINE ======================== github.com/anchore/grype/grype/match.packageNameRegex in /Users/adammcclenaghan/go/pkg/mod/github.com/anchore/grype@v0.87.0/grype/match/ignore.go
33.50MB 8.14GB (flat, cum) 51.18% of Total
. . 174:func packageNameRegex(packageName string) (*regexp.Regexp, error) {
. . 175: pattern := packageName
. . 176: if packageName[0] != '$' || packageName[len(packageName)-1] != '^' {
33.50MB 33.50MB 177: pattern = "^" + packageName + "$"
. . 178: }
. 8.11GB 179: return regexp.Compile(pattern)
. . 180:}
. . 181:
. . 182:func ifPackageNameApplies(name string) ignoreCondition {
. . 183: pattern, err := packageNameRegex(name)
. . 184: if err != nil {
Wrt time the problem is that the amount of short lived allocations causes a lot of work for the GC as expected:
should we just check to see if strings.Contains(name, ".*") here to avoid making patterns and return a different function which just does a string == comparison
I may be misunderstanding, are you suggesting that we use a strings.Contains(name, ".*")
check to determine if we should perform pattern matching, otherwise perform the ==
comparison?
The problem is if we do this to try to support regular expressions more broadly, it does not go far enough. There are a whole host of other symbols we'd have to check for to determine if the rule is a valid regular expression, in fact the best way to check would be to compile the regex in the first place (which we are trying to avoid)
I think that while it looks a little ugly to have the hard coded definitions here, it may make the intent clearer to readers for now.. Its also why I suggested that we perhaps use a shared const to make it even clearer that the linux kernel headers are the exception to the rule (for now)
As mentioned, a more proper solution is to have rules surface how they should be treated and how they should be matched against. However I felt doing that is beyond the scope of this fix for now.
Description
Hello folks 👋
In #1809 both
ifPackageNameApplies
andifUpstreamPackageNameApplies
were updated to use regular expressions for all rule names/upstream names. This was recently updated in #2320 too.So far as I can see from the codebase only the linux kernel headers ignore rules require regular expression matching.
In my testing I've found that
FindMatches
is significantly slowed down by the amount of regular expression compilation being performed.For one of my test scans with about ~60k packages, using this branch results in a 2x speedup of
FindMatches
with ~8GB less cumulative allocationsBefore
After
For now, I've modified
ifPackageNameApplies
andifUpstreamPackageNameApplies
to handle the linux-headers as special cases.In future if we wish to support regular expressions more broadly perhaps there should be a nicer interface passed into the likes of
ifPackageNameApplies
andifUpstreamPackageNameApplies
rather than a string, and this interface could define whether or not to treat the match type as a regex etc...I've had to make some changes to the unit tests.
I've removed two tests which seem to be testing behaviour that Grype doesn't have configuration for:
kernel-headers.*
so I've removed the testignore on name regex
ignore on name regex, line termination test match
I've also modified some of the tests to match the actual patterns set out in Grype, namely:
--
linux-.*-headers-.*
is now declaredlinux(-.*)?-headers-.*
--
linux-.*
is now defined aslinux.*
.. as a result of fixing this I've also updated the upstream name fromlinux
tonotalinux
for the test case. The ignore rules defined today would actually ignorelinux
, the only reason the test was passing before is because it defined the regex aslinux-.*
In lieue of a larger refactor to have a nicer interface for
ifPackageNameApplies
/ifUpstreamPackageNameApplies
which I've previously mentioned, it would at least perhaps be nice to declare thelinux(-.*)?-headers-.*
andlinux.*
as a const in some shared place.If the maintainers can suggest somewhere that would be a good fit, I'd be happy to create this.