Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make constructors for SpaceSaver subclasses public so that deserializers... #289

Merged
merged 3 commits into from
Mar 11, 2014
Merged

make constructors for SpaceSaver subclasses public so that deserializers... #289

merged 3 commits into from
Mar 11, 2014

Conversation

koertkuipers
Copy link
Contributor

... that are not in this package have access to them

i was just working on adding serializers to scalding's AlgebirdSerializers when i realized i couldnt deserialize without access to constructors...

@@ -101,7 +101,7 @@ object SSMany {
private def apply[T](capacity: Int, counters: Map[T, (Long, Long)], buckets: SortedMap[Long, Set[T]]): SSMany[T] =
SSMany(capacity, counters, Some(buckets))

private def apply[T](capacity: Int, counters: Map[T, (Long, Long)]): SSMany[T] =
def apply[T](capacity: Int, counters: Map[T, (Long, Long)]): SSMany[T] =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you made public the constructor for the SSOne, but an apply method for SSMany. Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the SSMany constructor has a third argument bucketsOption that i would prefer not to be visible to the world.

it is possible to build a SSMany with invalid internal state by passing in bucketsOption that is inconsistent with counters

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if this is for the purpose of serialization are we not potentially incurring a re-calculation overhead upon every ser/deser? Things won't necessarily RT through ser now? the field is a public parameter of the class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes bucketsOption is public and immutable. and yes we are incurring
re-calculation. however this is reduce-side and i have found it to be
immaterial for performance, so i didnt bother serializing bucketsOption. at
the time i wrote that there was also some issue with serializing sorted
maps, but that has been resolved i think.

if you care about RT through serialization than i can add it to serialized
class. or to address that i can make bucketsOption private and change
equals etc.

On Mon, Mar 10, 2014 at 1:02 PM, ianoc notifications@github.com wrote:

In algebird-core/src/main/scala/com/twitter/algebird/SpaceSaver.scala:

@@ -101,7 +101,7 @@ object SSMany {
private def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29], buckets: SortedMap[Long, Set[T]]): SSMany[T] =
SSMany(capacity, counters, Some(buckets))

  • private def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29]): SSMany[T] =
  • def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29]): SSMany[T] =

But if this is for the purpose of serialization are we not potentially
incurring a re-calculation overhead upon every ser/deser? Things won't
necessarily RT through ser now? the field is a public parameter of the
class?

Reply to this email directly or view it on GitHubhttps://github.com//pull/289/files#r10438445
.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So i think being able to safely/sanely RT any structure is critical to ensuring our serialization paths are robust.

How you mean though that it only comes up during reduce? If we pre-aggregate then we will have data in these fields we will need to re-calculate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map-side lots of SSOnes are added to SSMany

reduce side its simply a matter of merging the SSManys, which depending on
your cluster side isnt that many. this is where the wasteful re-calculation
of bucketsOption is done. my experience is that this reduce side step is
very fast. the work sits all map-side. so i didnt put any effort into
optimizing this.

sure we can make it RT serialization. choices are

  1. making bucketsOption an implementation detail by making it private and
    changing equals and print and all that good stuff, or
  2. serializing bucketsOption and opening up the constructor that can lead
    to invalid state if someone misuses it

On Mon, Mar 10, 2014 at 1:41 PM, ianoc notifications@github.com wrote:

In algebird-core/src/main/scala/com/twitter/algebird/SpaceSaver.scala:

@@ -101,7 +101,7 @@ object SSMany {
private def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29], buckets: SortedMap[Long, Set[T]]): SSMany[T] =
SSMany(capacity, counters, Some(buckets))

  • private def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29]): SSMany[T] =
  • def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29]): SSMany[T] =

So i think being able to safely/sanely RT any structure is critical to
ensuring our serialization paths are robust.

How you mean though that it only comes up during reduce? If we
pre-aggregate then we will have data in these fields we will need to
re-calculate?

Reply to this email directly or view it on GitHubhttps://github.com//pull/289/files#r10440404
.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, very useful. In a streaming compute context there can be many merges occurring reduce side. Different cache settings, would be coming into play often. Serialization to a store would be interesting here, given you would have a longer lived SSMany your merging with a newer one to update it... if the rebuilding here would be expensive to do the merge I'm certainly in favour of including the field in the serialization. -- Just trying to think of this in a context other than hadoop.

End users shouldn't be using either SSMany or SSOne directly right? and we have apply methods without this arg, so if they create something in an invalid state i'm not sure there is too much to be done?

Unless you have a big objection could you use the full set of parms?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK i will change the algebird pullreq to open up the constructor again then
too

On Mon, Mar 10, 2014 at 2:03 PM, ianoc notifications@github.com wrote:

In algebird-core/src/main/scala/com/twitter/algebird/SpaceSaver.scala:

@@ -101,7 +101,7 @@ object SSMany {
private def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29], buckets: SortedMap[Long, Set[T]]): SSMany[T] =
SSMany(capacity, counters, Some(buckets))

  • private def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29]): SSMany[T] =
  • def apply[T](capacity: Int, counters: Map[T, %28Long, Long%29]): SSMany[T] =

Thanks for the explanation, very useful. In a streaming compute context
there can be many merges occurring reduce side. Different cache settings,
would be coming into play often. Serialization to a store would be
interesting here, given you would have a longer lived SSMany your merging
with a newer one to update it... if the rebuilding here would be expensive
to do the merge I'm certainly in favour of including the field in the
serialization. -- Just trying to think of this in a context other than
hadoop.

End users shouldn't be using either SSMany or SSOne directly right? and we
have apply methods without this arg, so if they create something in an
invalid state i'm not sure there is too much to be done?

Unless you have a big objection could you use the full set of parms?

Reply to this email directly or view it on GitHubhttps://github.com//pull/289/files#r10441611
.

…or for SSMany"

After discussion with ianoc we decided to also serialize bucketsOption which requires the main constructor of SSMany to be public.
This reverts commit bda7837.
ianoc added a commit that referenced this pull request Mar 11, 2014
make constructors for SpaceSaver subclasses public so that deserializers...
@ianoc ianoc merged commit 038aad7 into twitter:develop Mar 11, 2014
@ianoc
Copy link
Collaborator

ianoc commented Mar 11, 2014

Merged, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants