Docker layer caching #2696
base: main
Changes from all commits: bff60ef, 1dade6c, e5b691b, 1e14c68, 9b09684, e3cb083
@@ -28,6 +28,7 @@ import (
	"github.com/trufflesecurity/trufflehog/v3/pkg/pb/source_metadatapb"
	"github.com/trufflesecurity/trufflehog/v3/pkg/pb/sourcespb"
	"github.com/trufflesecurity/trufflehog/v3/pkg/sources"
	"github.com/trufflesecurity/trufflehog/v3/pkg/sources/docker"
)

var overlapError = errors.New("More than one detector has found this result. For your safety, verification has been disabled. You can override this behavior by using the --allow-verification-overlap flag.")
@@ -105,6 +106,10 @@ type Engine struct {
	// verify determines whether the scanner will attempt to verify candidate secrets
	verify bool

	// dockerCache and dockerCacheDb are used to cache the results of scanning docker layers.
	dockerCache bool
	dockerCacheDb string

	// Note: bad hack only used for testing
	verificationOverlapTracker *verificationOverlapTracker
}
@@ -239,6 +244,14 @@ func WithVerificationOverlap(verificationOverlap bool) Option {
	}
}

// WithDockerCache enables caching of the results of scanning docker layers.
func WithDockerCache(dockerCache bool, dockerCacheDb string) Option {
	return func(e *Engine) {
		e.dockerCache = dockerCache
		e.dockerCacheDb = dockerCacheDb
	}
}

func filterDetectors(filterFunc func(detectors.Detector) bool, input []detectors.Detector) []detectors.Detector {
	var out []detectors.Detector
	for _, detector := range input {
@@ -864,6 +877,32 @@ func (e *Engine) processResult(ctx context.Context, data detectableChunk, res de

func (e *Engine) notifyResults(ctx context.Context) {
	for r := range e.ResultsChan() {

		// Handle docker layer caching if applicable
		if e.dockerCache && r.SourceType == sourcespb.SourceType_SOURCE_TYPE_DOCKER {
			layer := r.SourceMetadata.GetDocker().Layer
			db, err := docker.ConnectToLayersDB(e.dockerCacheDb)
Review thread on the ConnectToLayersDB call above:

suggestion: I noticed that the code creates and closes a new database connection for each result processed in notifyResults. Establishing and tearing down a connection for every result could introduce significant performance overhead, especially when the number of results is large. To mitigate this, I suggest establishing the database connection earlier in the engine's lifecycle and leveraging connection pooling, so multiple goroutines can share and reuse a pool of pre-established connections instead of opening and closing one per operation. Concretely, we could create the pool when initializing the engine, acquire a connection from it for each result, perform the database operations, and release the connection back to the pool when finished. What are your thoughts? Do you think it's feasible to establish connections earlier and use connection pooling here?

Author reply: As I authored the code, I did think it was creating/closing db connections a lot, but I honestly wasn't sure of the most efficient way to implement it. I think connection pooling would be a good addition, but I could use a hand with it since I haven't implemented anything like that before.

Reviewer reply: Sure thing. Maybe we can do something like:
// Add the pool to the engine.
// Connect during init.
// Use the pool for operations.
			if err != nil {
				ctx.Logger().Error(err, "error connecting to docker cache")
				err = docker.UpdateCompleted(db, layer, false)
				if err != nil {
					ctx.Logger().Error(err, "error updating docker cache")
				}
			}
			if r.Verified {
				err = docker.UpdateVerified(db, layer, true)
			} else if r.VerificationError() != nil {
				err = docker.UpdateUnverified(db, layer, true)
			}
			if err != nil {
				ctx.Logger().Error(err, "error adding to docker cache")
				err = docker.UpdateCompleted(db, layer, false)
				if err != nil {
					ctx.Logger().Error(err, "error updating docker cache")
				}
			}
		}

		// Filter unwanted results, based on `--results`.
		if !r.Verified {
			if r.VerificationError() != nil {
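For reference, here is a rough sketch (not part of this PR) of what the connection-reuse suggestion from the thread above could look like in Go. A *sql.DB from database/sql is already a concurrency-safe connection pool, so the engine could open it once during initialization and share it across results; the driver, table, type, and helper names below are illustrative assumptions rather than TruffleHog's actual code.

package enginecache

import (
	"context"
	"database/sql"
	"time"

	_ "github.com/mattn/go-sqlite3" // assumed sqlite driver; the PR may use another
)

// Engine mirrors only the fields relevant to this sketch.
type Engine struct {
	dockerCache   bool
	dockerCacheDB *sql.DB // shared pool, opened once during engine init
}

// openDockerCache opens the cache database a single time; *sql.DB then
// manages pooled connections internally and is safe for concurrent use.
func openDockerCache(path string) (*sql.DB, error) {
	db, err := sql.Open("sqlite3", path)
	if err != nil {
		return nil, err
	}
	// SQLite tolerates few concurrent writers, so keep the pool small.
	db.SetMaxOpenConns(1)
	db.SetMaxIdleConns(1)
	db.SetConnMaxLifetime(time.Hour)
	return db, db.Ping()
}

// markLayerVerified reuses the shared pool for each result instead of
// dialing a fresh connection inside notifyResults.
func (e *Engine) markLayerVerified(ctx context.Context, layer string) error {
	_, err := e.dockerCacheDB.ExecContext(ctx,
		`UPDATE layers SET verified = 1 WHERE digest = ?`, layer)
	return err
}

With something like this in place, notifyResults would only touch the shared e.dockerCacheDB rather than calling docker.ConnectToLayersDB once per result.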
A second review thread discusses the scope of the cache:

If we're persisting data, it may be prudent to make the name more generic in case there's other stuff worth caching in the future, e.g. caching binary files rather than re-scanning them.
Are you thinking about data caching outside of the Docker source or within it?
I was thinking about data caching in general. I've previously experimented with using sqlite to do things like skip duplicate docker layers, binary files, and commits; cache GitHub API E-Tags; track progress so that scans can be resumed; etc.
Then again, based on @rosecodym's comment about performance it may not be desirable to re-use the same database for multiple purposes.
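To make the "more generic" idea concrete, here is a loose sketch (again, not code from this PR or from TruffleHog) of a single sqlite table keyed by artifact kind and digest that could cover docker layers, binary files, and commits alike; every name below is hypothetical.

package scancache

import (
	"database/sql"

	_ "github.com/mattn/go-sqlite3" // assumed sqlite driver
)

const schema = `
CREATE TABLE IF NOT EXISTS scanned_artifacts (
	kind      TEXT NOT NULL, -- "docker_layer", "binary_file", "commit", ...
	digest    TEXT NOT NULL, -- layer digest, file hash, or commit SHA
	completed INTEGER NOT NULL DEFAULT 0,
	verified  INTEGER NOT NULL DEFAULT 0,
	PRIMARY KEY (kind, digest)
);`

// Open prepares the shared cache database and ensures the schema exists.
func Open(path string) (*sql.DB, error) {
	db, err := sql.Open("sqlite3", path)
	if err != nil {
		return nil, err
	}
	if _, err := db.Exec(schema); err != nil {
		db.Close()
		return nil, err
	}
	return db, nil
}

// Seen reports whether an artifact has already been fully scanned, letting a
// source skip it (e.g. a duplicate layer or binary file).
func Seen(db *sql.DB, kind, digest string) (bool, error) {
	var completed int
	err := db.QueryRow(
		`SELECT completed FROM scanned_artifacts WHERE kind = ? AND digest = ?`,
		kind, digest).Scan(&completed)
	if err == sql.ErrNoRows {
		return false, nil
	}
	return completed == 1, err
}

Whether such a general-purpose cache should share a database with the docker layer cache is exactly the open question raised above, given the per-result write traffic discussed in the performance thread.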