-
Notifications
You must be signed in to change notification settings - Fork 99
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Milestone-3] Serokell: New OrderedSet.mo & OrderedMap fixup #662
Conversation
One quick observation. subset and equality seem very expensive: I wonder if that's because the isSubsetHelper still uses an iterator, that allocates. Maybe just use recursion again? Perhaps there's even faster way but I don't know enough about red-black trees to suggest one. If the representation is canonical, maybe you can exploit that. But I'm not sure it is or depends on the order of insertions/deletions. Do you know? |
It's not canonical, so we cannot have just a linear one-to-one comparison. I've experimented with a couple of alternatives I think it works pretty well now. I've managed to use the order of nodes in the tree to optimize it a bit. I have updated the benchmark results in the description, for the comparison of the version before and after the last commit see serokell#37 . |
That's much better indeed! Thank you! |
Also, I've optimized |
src/OrderedMap.mo
Outdated
}; | ||
|
||
/// Test helpers | ||
public module MapDebug { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move this module into the test dir, to avoid polluting the library?
Store the set size in the root node and make set operations to use the set size instead of calculating black height.
* rename elements() -> vals() * add valsRev() * Hide helper types for iterations into the Internal module
* rename `rbSet` to `set` or `s` * add import `Debug` into examples * fix typos
+ Minor call optimization
Also: * fix code format * fix typo * refactor internal functions a bit
Co-authored-by: Claudio Russo <claudio@dfinity.org>
Co-authored-by: Claudio Russo <claudio@dfinity.org>
Head branch was pushed to by a user without write access
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the adjustments!
This is an MR for the 3rd Milestone of the Serokell's grant about improving Motoko's base library. The main goal of the PR is to introduce a new functional implementation of the set data structure to the' base' library. Also, it brings a few changes to the new functional map that was added in #664 , #654 . # General changes: * rename `PersistentOrderedMap` to `OrderedMap` (same for the `OrderedSet`) * improve docs # Functional Map changes: ## New functionality: + add `any`/`all` functions + add `contains` function + add `minEntry`/`maxEntry` ## Optimizations: + Store `size` in the Map, [benchmark results](serokell#35) ## Fixup: + add `entriesRev()`, remove `iter()` # NEW functional Set: The new data structure implements an ordered set interface using Red-Black trees as well as the new functional map from the 1-2 Milestones. ## API implemented: * Basic operations (based on the map): `put`, `delete`, `contains`, `fromIter`, etc * Maps and folds: `map`, `mapFilter`, `foldLeft`, `foldRight` * Set operations: `union` , `intersect`, `diff`, `isSubset`, `equal` * Additional operations (as for the `OrderedMap`): `min`/`max`, `all`/`some` ## Maintainance support: * Unit, property tests * Documentation ## Applied optimizations: * Same optimizations that were useful for the functional map: * inline node color * float-out exceeded matching in iteration * `map`/`filterMap` through `foldLeft` * direct recursion in `foldLeft` * [Benchmark results for all four optimizations together](serokell#27) * store size in the root of the tree, [benchmark results](serokell#36 (comment)) * Pattern matching order optimization, [benchmark results](serokell#36 (comment)) * Other optimizations: * Inline code of `OrderedMap` instead of sharing it, [benchmark results](serokell#25) * `intersect` optimization: use order of output values to build the resulting tree faster, see serokell#39 * `isSubset`, `equal` optimization: use early exit and use order of subtrees to reduce intermediate tree height, see serokell#37 ## Rejected optimizations: * Nipkow's implementation of set operation [Tobias Nipkow's "Functional Data Structures and Algorithms", 117]. Initially, we were planning to use an implementation of set operations (`intersect`, `union`, `diff`) from Nipkow's book. However, the experiment shows that naive implementation with a simple size heuristic performs better. [The benchmark results](serokell#33) are comparing 3 versions: * persistentset_baseline -- original implementation that uses Nipkow's algorithms. However, the black height is calculated before each set operation (the book assumes it's stored). * persistentset_bh -- the same as the baseline but the black height is stored in each node. * persistentset -- naive implementation that looks up in a smaller set and modifies a bigger one (it gives us `O(min(n,m)log((max(n,m))` which is very close to Nipkow's version). Sizes of sets are also stored but only in the root. The last one outperforms others and keeps a tree slim in terms of byte size. Thus, we have picked it. ## Final benchmark results: ### Collection benchmarks | |binary_size|generate|max mem|batch_get 50|batch_put 50|batch_remove 50|upgrade| |--:|--:|--:|--:|--:|--:|--:|--:| |orderedset+100|218_168|186_441|37_916|53_044|121_237|127_460|346_108| |trieset+100|211_245|574_022|47_652|131_218|288_429|268_499|729_696| |orderedset+1000|218_168|2_561_296|520_364|69_883|158_349|170_418|3_186_579| |trieset+1000|211_245|7_374_045|633_440|162_806|383_594|375_264|9_178_466| |orderedset+10000|218_168|40_015_301|320_532|84_660|192_931|215_592|31_522_120| |trieset+10000|211_245|105_695_670|682_792|192_931|457_923|462_594|129_453_045| |orderedset+100000|218_168|476_278_087|3_200_532|98_553|230_123|258_372|409_032_232| |trieset+100000|211_245|1_234_038_235|6_826_516|222_247|560_440|549_813|1_525_692_388| |orderedset+1000000|218_168|5_514_198_432|32_000_532|115_836|268_236|306_896|4_090_302_778| |trieset+1000000|211_245|13_990_048_548|68_228_312|252_211|650_405|642_099|17_455_845_492| ### set API | |size|intersect|union|diff|equals|isSubset| |--:|--:|--:|--:|--:|--:|--:| |orderedset+100|100|146_264|157_544|215_871|28_117|27_726| |trieset+100|100|352_496|411_306|350_935|201_896|201_456| |orderedset+1000|1000|162_428|194_198|286_747|242_329|241_938| |trieset+1000|1000|731_650|1_079_906|912_629|2_589_090|4_023_673| |orderedset+10000|10000|177_080|231_070|345_529|2_383_587|2_383_591| |trieset+10000|10000|3_986_854|21_412_306|5_984_106|46_174_710|31_885_381| |orderedset+100000|100000|190_727|267_008|402_081|91_300_348|91_300_393| |trieset+100000|100000|178_863_894|209_889_623|199_028_396|521_399_350|521_399_346| |orderedset+1000000|1000000|205_022|304_937|464_859|912_901_595|912_901_558| |trieset+1000000|1000000|1_782_977_198|2_092_850_787|1_984_818_266|5_813_335_155|5_813_335_151| ### new set API | |size|foldLeft|foldRight|mapfilter|map| |--:|--:|--:|--:|--:|--:| |orderedset|100|16_487|16_463|88_028|224_597| |orderedset|1000|133_685|131_953|1_526_510|4_035_782| |orderedset|10000|1_305_120|1_287_495|28_455_361|51_527_733| |orderedset|100000|13_041_665|12_849_418|344_132_505|630_692_463| |orderedset|1000000|130_428_573|803_454_777|4_019_592_041|7_453_944_902| --------- Co-authored-by: Andrei Borzenkov <andreyborzenkov2002@gmail.com> Co-authored-by: Andrei Borzenkov <root@sandwitch.dev> Co-authored-by: Sergey Gulin <sergeygulin96@gmail.com> Co-authored-by: Claudio Russo <claudio@dfinity.org>
Squashed and fast-forward merged with commit 1961fab (see https://dfinity.slack.com/archives/D07MAEML9RD/p1731527989433379) |
This is an MR for the 3rd Milestone of the Serokell's grant about improving Motoko's base library.
The main goal of the PR is to introduce a new functional implementation of the set data structure to the' base' library. Also, it brings a few changes to the new functional map that was added in #664 , #654 .
General changes:
PersistentOrderedMap
toOrderedMap
(same for theOrderedSet
)Functional Map changes:
New functionality:
any
/all
functionscontains
functionminEntry
/maxEntry
Optimizations:
size
in the Map, benchmark resultsFixup:
entriesRev()
, removeiter()
NEW functional Set:
The new data structure implements an ordered set interface using Red-Black trees as well as the new functional map from the 1-2 Milestones.
API implemented:
put
,delete
,contains
,fromIter
, etcmap
,mapFilter
,foldLeft
,foldRight
union
,intersect
,diff
,isSubset
,equal
OrderedMap
):min
/max
,all
/some
Maintainance support:
Applied optimizations:
map
/filterMap
throughfoldLeft
foldLeft
OrderedMap
instead of sharing it, benchmark resultsintersect
optimization: use order of output values to build the resulting tree faster, see Optimize set operations: intersect serokell/motoko-base#39isSubset
,equal
optimization: use early exit and use order of subtrees to reduce intermediate tree height, see OrderedSet: optimze isSubset,equal serokell/motoko-base#37Rejected optimizations:
Initially, we were planning to use an implementation of set operations (
intersect
,union
,diff
) from Nipkow's book. However, the experiment shows that naive implementation with a simple size heuristic performs better. The benchmark results are comparing 3 versions:O(min(n,m)log((max(n,m))
which is very close to Nipkow's version). Sizes of sets are also stored but only in the root.The last one outperforms others and keeps a tree slim in terms of byte size. Thus, we have picked it.
Final benchmark results:
Collection benchmarks
set API
new set API