Reimplement DB collections for mirrors, repos and snapshots #766
Codecov Report
@@            Coverage Diff             @@
##           master     #766      +/-   ##
==========================================
+ Coverage   63.71%   64.05%   +0.34%
==========================================
  Files          50       50
  Lines        6308     6326      +18
==========================================
+ Hits         4019     4052      +33
+ Misses       1797     1778      -19
- Partials      492      496       +4
LGTM in general, just one comment above concerning tests. Maybe it would also be good to do some performance and memory measurements with a large repo,
just to see what the memory usage gain and the performance hit are. The code does get more complicated with this change, so I guess it would be good to prove that it actually decreases memory usage with an acceptable performance hit.
		return nil, err
	}

	r := &LocalRepo{}
This doesn't seem to be covered by a test (https://codecov.io/gh/aptly-dev/aptly/pull/766/diff#D2-210).
Since this is a success path, I think it would be good to have a unit test for it. There are some similar uncovered code bits in the other collections too.
yep, thanks, I will check it!
I've improved test coverage and added some benchmarks for ByUUID() and ByName(). They exercise the "worst case": a collection is created and one lookup is performed. On the master branch this always leads to loading all the objects.
For ByUUID(), loading is fast in the new approach, as the object is looked up directly (only a single element is loaded). ByName() still requires scanning the collection: in the worst case it is as bad as the old approach, and on average it's 50% better.
ForEach() doesn't cache objects in memory, so this should help #761, as objects can be GCed as soon as they're scanned. ByUUID() is used a lot to look up the source of published repositories.
On master:
BenchmarkSnapshotCollectionByUUID-8 500 2953932 ns/op 1352168 B/op 30743 allocs/op
BenchmarkSnapshotCollectionByName-8 500 2922043 ns/op 1352504 B/op 30747 allocs/op
This branch:
BenchmarkSnapshotCollectionByUUID-8 300000 4492 ns/op 1792 B/op 39 allocs/op
BenchmarkSnapshotCollectionByName-8 1000 1433058 ns/op 533994 B/op 14870 allocs/op
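To illustrate why the two lookups differ so much in the numbers above, here is a minimal sketch of the two access patterns. It is not aptly's code: `Store`, `Snapshot`, `NewStore`, and the `Decoded` counter are hypothetical names, and a plain map stands in for the real database and its encoding. `ByUUID()` fetches one key directly, while `ByName()` has to load objects one by one until it finds a match.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Snapshot is a simplified, hypothetical stand-in for a stored object.
type Snapshot struct {
	UUID string
	Name string
}

// ErrNotFound is returned when no object matches the lookup.
var ErrNotFound = errors.New("snapshot not found")

// Store is a toy key-value database keyed by UUID (an assumption for
// illustration). Decoded counts how many objects lookups had to load.
type Store struct {
	data    map[string]string // UUID -> name, stands in for encoded bytes
	Decoded int
}

// NewStore builds a store holding the given snapshots.
func NewStore(snaps ...Snapshot) *Store {
	s := &Store{data: make(map[string]string)}
	for _, snap := range snaps {
		s.data[snap.UUID] = snap.Name
	}
	return s
}

// load simulates decoding a single object from the database.
func (s *Store) load(uuid string) Snapshot {
	s.Decoded++
	return Snapshot{UUID: uuid, Name: s.data[uuid]}
}

// ByUUID looks the object up directly by key: exactly one object is loaded.
func (s *Store) ByUUID(uuid string) (Snapshot, error) {
	if _, ok := s.data[uuid]; !ok {
		return Snapshot{}, ErrNotFound
	}
	return s.load(uuid), nil
}

// ByName scans objects in key order and stops at the first match, so it
// loads between one and all of the objects.
func (s *Store) ByName(name string) (Snapshot, error) {
	uuids := make([]string, 0, len(s.data))
	for uuid := range s.data {
		uuids = append(uuids, uuid)
	}
	sort.Strings(uuids) // deterministic scan order for the example
	for _, uuid := range uuids {
		if snap := s.load(uuid); snap.Name == name {
			return snap, nil
		}
	}
	return Snapshot{}, ErrNotFound
}

func main() {
	st := NewStore(
		Snapshot{UUID: "u1", Name: "a"},
		Snapshot{UUID: "u2", Name: "b"},
		Snapshot{UUID: "u3", Name: "c"},
	)
	snap, err := st.ByUUID("u3")
	fmt.Println(snap, err, "objects loaded:", st.Decoded)
}
```

In this sketch the early exit in `ByName()` is what gives the "50% better on average" behaviour: a match in the middle of the key space means only about half of the objects are loaded.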
Looks good. I guess we can merge this and try to get some feedback from the users who reported the specific problem in #761 on whether this fix helps.
I also had an idea to implement a very simple "index" to look up things by name (which is a really frequent lookup): just a key in the DB which maps name -> UUID. ByName() could use this index optimistically: if it's missing or doesn't point to the right entry, ByName() falls back to a full scan.
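A rough sketch of that idea, under stated assumptions: `Collection`, `Snapshot`, `NewCollection`, and the `Scans` counter are hypothetical, in-memory maps stand in for DB keys, and names are assumed to be unique. Repairing the index after a fallback scan is an extra step added here for illustration; it is not part of the proposal above.

```go
package main

import "fmt"

// Snapshot is a simplified, hypothetical stand-in for a stored object.
type Snapshot struct {
	UUID string
	Name string
}

// Collection keeps objects by UUID plus an advisory name -> UUID index.
// Scans counts how often ByName had to fall back to a full scan.
type Collection struct {
	objects map[string]Snapshot // UUID -> object
	index   map[string]string   // name -> UUID, may be missing or stale
	Scans   int
}

// NewCollection returns an empty collection.
func NewCollection() *Collection {
	return &Collection{
		objects: make(map[string]Snapshot),
		index:   make(map[string]string),
	}
}

// Add stores the object and records its name in the index.
func (c *Collection) Add(s Snapshot) {
	c.objects[s.UUID] = s
	c.index[s.Name] = s.UUID
}

// ByName trusts the index only if the entry it points to really has the
// requested name; otherwise it falls back to scanning every object.
func (c *Collection) ByName(name string) (Snapshot, bool) {
	if uuid, ok := c.index[name]; ok {
		if s, ok := c.objects[uuid]; ok && s.Name == name {
			return s, true // fast path: index hit verified against the object
		}
	}
	c.Scans++
	for _, s := range c.objects {
		if s.Name == name {
			c.index[name] = s.UUID // repair the index (illustrative extra)
			return s, true
		}
	}
	delete(c.index, name) // drop a stale entry that matched nothing
	return Snapshot{}, false
}

func main() {
	c := NewCollection()
	c.Add(Snapshot{UUID: "u1", Name: "nightly"})
	s, ok := c.ByName("nightly")
	fmt.Println(s, ok, "full scans:", c.Scans)
}
```

Because the index entry is always verified against the object it points to, a missing or stale entry costs only a fallback scan and never returns a wrong result.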
This could be a way. I think we should first see how these changes perform in an actual use case before we make the code even more complex.
See #765, #761
Collections were relying on keeping an in-memory list of all the objects
for any kind of operation, which doesn't scale well with the number of
objects in the database.
With this rewrite, objects are loaded only on demand, which might
be a pessimization in some edge cases but should improve performance
and memory footprint significantly.
This doesn't touch PublishedRepoCollection, as it relies on the list of all the objects in many places to implement unique checks and proper cleanup.