-
Notifications
You must be signed in to change notification settings - Fork 347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make constructors for SpaceSaver subclasses public so that deserializers... #289
Conversation
…ers that are not in this package have access to them
@@ -101,7 +101,7 @@ object SSMany { | |||
private def apply[T](capacity: Int, counters: Map[T, (Long, Long)], buckets: SortedMap[Long, Set[T]]): SSMany[T] = | |||
SSMany(capacity, counters, Some(buckets)) | |||
|
|||
private def apply[T](capacity: Int, counters: Map[T, (Long, Long)]): SSMany[T] = | |||
def apply[T](capacity: Int, counters: Map[T, (Long, Long)]): SSMany[T] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you made public the constructor for the SSOne, but an apply method for SSMany. Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the SSMany constructor has a third argument bucketsOption that i would prefer not to be visible to the world.
it is possible to build a SSMany with invalid internal state by passing in bucketsOption that is inconsistent with counters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if this is for the purpose of serialization are we not potentially incurring a re-calculation overhead upon every ser/deser? Things won't necessarily RT through ser now? the field is a public parameter of the class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes bucketsOption is public and immutable. and yes we are incurring
re-calculation. however this is reduce-side and i have found it to be
immaterial for performance, so i didnt bother serializing bucketsOption. at
the time i wrote that there was also some issue with serializing sorted
maps, but that has been resolved i think.
if you care about RT through serialization than i can add it to serialized
class. or to address that i can make bucketsOption private and change
equals etc.
On Mon, Mar 10, 2014 at 1:02 PM, ianoc notifications@github.com wrote:
In algebird-core/src/main/scala/com/twitter/algebird/SpaceSaver.scala:
@@ -101,7 +101,7 @@ object SSMany {
private def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29], buckets: SortedMap[Long, Set[T]]): SSMany[T] =
SSMany(capacity, counters, Some(buckets))
- private def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29]): SSMany[T] =
- def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29]): SSMany[T] =
But if this is for the purpose of serialization are we not potentially
incurring a re-calculation overhead upon every ser/deser? Things won't
necessarily RT through ser now? the field is a public parameter of the
class?Reply to this email directly or view it on GitHubhttps://github.com//pull/289/files#r10438445
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So i think being able to safely/sanely RT any structure is critical to ensuring our serialization paths are robust.
How you mean though that it only comes up during reduce? If we pre-aggregate then we will have data in these fields we will need to re-calculate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
map-side lots of SSOnes are added to SSMany
reduce side its simply a matter of merging the SSManys, which depending on
your cluster side isnt that many. this is where the wasteful re-calculation
of bucketsOption is done. my experience is that this reduce side step is
very fast. the work sits all map-side. so i didnt put any effort into
optimizing this.
sure we can make it RT serialization. choices are
- making bucketsOption an implementation detail by making it private and
changing equals and print and all that good stuff, or - serializing bucketsOption and opening up the constructor that can lead
to invalid state if someone misuses it
On Mon, Mar 10, 2014 at 1:41 PM, ianoc notifications@github.com wrote:
In algebird-core/src/main/scala/com/twitter/algebird/SpaceSaver.scala:
@@ -101,7 +101,7 @@ object SSMany {
private def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29], buckets: SortedMap[Long, Set[T]]): SSMany[T] =
SSMany(capacity, counters, Some(buckets))
- private def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29]): SSMany[T] =
- def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29]): SSMany[T] =
So i think being able to safely/sanely RT any structure is critical to
ensuring our serialization paths are robust.How you mean though that it only comes up during reduce? If we
pre-aggregate then we will have data in these fields we will need to
re-calculate?Reply to this email directly or view it on GitHubhttps://github.com//pull/289/files#r10440404
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation, very useful. In a streaming compute context there can be many merges occurring reduce side. Different cache settings, would be coming into play often. Serialization to a store would be interesting here, given you would have a longer lived SSMany your merging with a newer one to update it... if the rebuilding here would be expensive to do the merge I'm certainly in favour of including the field in the serialization. -- Just trying to think of this in a context other than hadoop.
End users shouldn't be using either SSMany or SSOne directly right? and we have apply methods without this arg, so if they create something in an invalid state i'm not sure there is too much to be done?
Unless you have a big objection could you use the full set of parms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK i will change the algebird pullreq to open up the constructor again then
too
On Mon, Mar 10, 2014 at 2:03 PM, ianoc notifications@github.com wrote:
In algebird-core/src/main/scala/com/twitter/algebird/SpaceSaver.scala:
@@ -101,7 +101,7 @@ object SSMany {
private def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29], buckets: SortedMap[Long, Set[T]]): SSMany[T] =
SSMany(capacity, counters, Some(buckets))
- private def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29]): SSMany[T] =
- def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29]): SSMany[T] =
Thanks for the explanation, very useful. In a streaming compute context
there can be many merges occurring reduce side. Different cache settings,
would be coming into play often. Serialization to a store would be
interesting here, given you would have a longer lived SSMany your merging
with a newer one to update it... if the rebuilding here would be expensive
to do the merge I'm certainly in favour of including the field in the
serialization. -- Just trying to think of this in a context other than
hadoop.End users shouldn't be using either SSMany or SSOne directly right? and we
have apply methods without this arg, so if they create something in an
invalid state i'm not sure there is too much to be done?Unless you have a big objection could you use the full set of parms?
Reply to this email directly or view it on GitHubhttps://github.com//pull/289/files#r10441611
.
…or for SSMany" After discussion with ianoc we decided to also serialize bucketsOption which requires the main constructor of SSMany to be public. This reverts commit bda7837.
make constructors for SpaceSaver subclasses public so that deserializers...
Merged, thanks! |
... that are not in this package have access to them
i was just working on adding serializers to scalding's AlgebirdSerializers when i realized i couldnt deserialize without access to constructors...