Docker layer caching #2696

Open · wants to merge 6 commits into base: main
1 change: 1 addition & 0 deletions go.mod
@@ -60,6 +60,7 @@ require (
github.com/lrstanley/bubblezone v0.0.0-20240125042004-b7bafc493195
github.com/marusama/semaphore/v2 v2.5.0
github.com/mattn/go-isatty v0.0.20
github.com/mattn/go-sqlite3 v1.14.22
github.com/mholt/archiver/v4 v4.0.0-alpha.8
github.com/microsoft/go-mssqldb v1.7.0
github.com/mitchellh/go-ps v1.0.0
2 changes: 2 additions & 0 deletions go.sum
@@ -544,6 +544,8 @@
github.com/mattn/go-runewidth v0.0.9/go.mod h1:H031xJmbD/WCDINGzjvQ9THkh0rPKHF+m
github.com/mattn/go-runewidth v0.0.12/go.mod h1:RAqKPSqVFrSLVXbA8x7dzmKdmGzieGRCM46jaSJTDAk=
github.com/mattn/go-runewidth v0.0.15 h1:UNAjwbU9l54TA3KzvqLGxwWjHmMgBUVhBiTjelZgg3U=
github.com/mattn/go-runewidth v0.0.15/go.mod h1:Jdepj2loyihRzMpdS35Xk/zdY8IAYHsh153qUoGf23w=
github.com/mattn/go-sqlite3 v1.14.22 h1:2gZY6PC6kBnID23Tichd1K+Z0oS6nE/XwU+Vz/5o4kU=
github.com/mattn/go-sqlite3 v1.14.22/go.mod h1:Uh1q+B4BYcTPb+yiD3kU8Ct7aC0hY9fxUwlHK0RXw+Y=
github.com/mholt/archiver/v4 v4.0.0-alpha.8 h1:tRGQuDVPh66WCOelqe6LIGh0gwmfwxUrSSDunscGsRM=
github.com/mholt/archiver/v4 v4.0.0-alpha.8/go.mod h1:5f7FUYGXdJWUjESffJaYR4R60VhnHxb2X3T1teMyv5A=
github.com/microcosm-cc/bluemonday v1.0.25 h1:4NEwSfiJ+Wva0VxN5B8OwMicaJvD8r9tlJWm9rtloEg=
5 changes: 5 additions & 0 deletions main.go
@@ -150,6 +150,8 @@ var (

	dockerScan       = cli.Command("docker", "Scan Docker Image")
	dockerScanImages = dockerScan.Flag("image", "Docker image to scan. Use the file:// prefix to point to a local tarball, otherwise an image registry is assumed.").Required().Strings()
	dockerCache      = dockerScan.Flag("cache", "Use layer caching. Don't re-scan a layer that has already been scanned and is in the layer caching db.").Bool()
	dockerCacheDB    = dockerScan.Flag("cache-db", "Path to the layer caching database. Default is trufflehog_layers.sqlite3.").Default("trufflehog_layers.sqlite3").String()
@rgmz (Contributor) commented on Apr 11, 2024:
If we're persisting data, it may be prudent to make the name more generic in case there's other stuff worth caching in the future.

e.g., caching binary files rather than re-scanning them

The Contributor Author replied:
Are you thinking about data caching outside of the Docker source or within it?

The Contributor replied:
I was thinking about data caching in general. I've previously experimented with using sqlite to do things like skip duplicate docker layers, binary files, and commits; cache GitHub API E-Tags; track progress so that scans can be resumed; etc.

Then again, based on @rosecodym's comment about performance it may not be desirable to re-use the same database for multiple purposes.


travisCiScan = cli.Command("travisci", "Scan TravisCI")
travisCiScanToken = travisCiScan.Flag("token", "TravisCI token. Can also be provided with environment variable").Envar("TRAVISCI_TOKEN").Required().String()
@@ -448,6 +450,7 @@ func run(state overseer.State) {
		engine.WithFilterEntropy(*filterEntropy),
		engine.WithVerificationOverlap(*allowVerificationOverlap),
		engine.WithJobReportWriter(jobReportWriter),
		engine.WithDockerCache(*dockerCache, *dockerCacheDB),
	)
	if err != nil {
		logFatal(err, "error initializing engine")
@@ -583,6 +586,8 @@ func run(state overseer.State) {
			Credential: &sourcespb.Docker_DockerKeychain{
				DockerKeychain: true,
			},
			Cache:   *dockerCache,
			CacheDb: *dockerCacheDB,
		}
		anyConn, err := anypb.New(&dockerConn)
		if err != nil {
39 changes: 39 additions & 0 deletions pkg/engine/engine.go
@@ -28,6 +28,7 @@ import (
	"github.com/trufflesecurity/trufflehog/v3/pkg/pb/source_metadatapb"
	"github.com/trufflesecurity/trufflehog/v3/pkg/pb/sourcespb"
	"github.com/trufflesecurity/trufflehog/v3/pkg/sources"
	"github.com/trufflesecurity/trufflehog/v3/pkg/sources/docker"
)

var overlapError = errors.New("More than one detector has found this result. For your safety, verification has been disabled. You can override this behavior by using the --allow-verification-overlap flag.")
@@ -105,6 +106,10 @@ type Engine struct {
	// verify determines whether the scanner will attempt to verify candidate secrets
	verify bool

	// dockerCache and dockerCacheDb are used to cache the results of scanning docker layers.
	dockerCache   bool
	dockerCacheDb string

	// Note: bad hack only used for testing
	verificationOverlapTracker *verificationOverlapTracker
}
@@ -239,6 +244,14 @@ func WithVerificationOverlap(verificationOverlap bool) Option {
	}
}

// WithDockerCache enables caching of the results of scanning docker layers.
func WithDockerCache(dockerCache bool, dockerCacheDb string) Option {
	return func(e *Engine) {
		e.dockerCache = dockerCache
		e.dockerCacheDb = dockerCacheDb
	}
}

func filterDetectors(filterFunc func(detectors.Detector) bool, input []detectors.Detector) []detectors.Detector {
	var out []detectors.Detector
	for _, detector := range input {
@@ -864,6 +877,32 @@ func (e *Engine) processResult(ctx context.Context, data detectableChunk, res de

func (e *Engine) notifyResults(ctx context.Context) {
	for r := range e.ResultsChan() {

		// Handle docker layer caching if applicable
		if e.dockerCache && r.SourceType == sourcespb.SourceType_SOURCE_TYPE_DOCKER {
			layer := r.SourceMetadata.GetDocker().Layer
			db, err := docker.ConnectToLayersDB(e.dockerCacheDb)
A Collaborator commented:

suggestion: I noticed that the code creates and closes a new database connection for each result processed in the notifyResults function. This approach of establishing and tearing down connections for every result could potentially introduce a significant performance overhead, especially if the number of results is large.

To mitigate this performance impact, I suggest exploring the possibility of establishing database connections earlier in the engine's lifecycle and leveraging connection pooling. Connection pooling should allow multiple goroutines to efficiently share and reuse a pool of pre-established database connections, reducing the overhead of creating and closing connections for each operation.

By implementing connection pooling, we could create a pool of connections when initializing the engine. Then, instead of creating a new connection for each result, we could acquire a connection from the pool, perform the necessary database operations, and release the connection back to the pool when finished. This approach would minimize the connection overhead and improve the overall performance of the notifyResults function.

What are your thoughts? Do you think it's feasible to establish connections earlier and utilize connection pooling in this scenario?

The Contributor Author replied:

As I authored the code, I thought it was creating/closing db connections a lot, but I honestly wasn't super sure the most efficient way to implement it. I think connection pooling would be a good addition, but I could use a hand with it since I haven't implemented anything like that in the past.

The Collaborator replied:

Sure thing. Maybe we can do something like:

```go
// Add the pool to the engine.
type Engine struct {
	...
	dbPool *sql.DB
}

// Connect during init.
func WithDockerCache(dockerCache bool, dockerCacheDb string) Option {
	return func(e *Engine) {
		if dockerCache {
			err := e.InitializeDockerCache(dockerCacheDb)
			...
		}
	}
}

func (e *Engine) InitializeDockerCache(dockerCacheDb string) error {
	var err error
	e.dbPool, err = sql.Open("sqlite3", dockerCacheDb)
	...

	// Not sure if we want to make these configurable or use defaults.
	// Default could be the number of workers used for notifying?
	e.dbPool.SetMaxOpenConns(10)
	e.dbPool.SetMaxIdleConns(5)
	return nil
}

// Use the pool for operations.
func (e *Engine) updateDockerCache(layer string) error {
	conn, err := e.dbPool.Conn(context.Background())
	if err != nil {
		return err
	}
	defer conn.Close()

	// Perform database operations using the acquired connection
	// ...

	return nil
}
```

			if err != nil {
				ctx.Logger().Error(err, "error connecting to docker cache")
				err = docker.UpdateCompleted(db, layer, false)
				if err != nil {
					ctx.Logger().Error(err, "error updating docker cache")
				}
			}
			if r.Verified {
				err = docker.UpdateVerified(db, layer, true)
			} else if r.VerificationError() != nil {
				err = docker.UpdateUnverified(db, layer, true)
			}
			if err != nil {
				ctx.Logger().Error(err, "error adding to docker cache")
				err = docker.UpdateCompleted(db, layer, false)
				if err != nil {
					ctx.Logger().Error(err, "error updating docker cache")
				}
			}
		}

		// Filter unwanted results, based on `--results`.
		if !r.Verified {
			if r.VerificationError() != nil {