Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-21394][SPARK-21432][PYTHON] Reviving callable object/partial function support in UDF in PySpark #18615

Closed
wants to merge 4 commits into from

Conversation

HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented Jul 12, 2017

What changes were proposed in this pull request?

This PR proposes to avoid __name__ in the tuple naming the attributes assigned directly from the wrapped function to the wrapper function, and use self._name (func.__name__ or obj.__class__.name__).

After SPARK-19161, we happened to break callable objects as UDFs in Python as below:

from pyspark.sql import functions


class F(object):
    def __call__(self, x):
        return x

foo = F()
udf = functions.udf(foo)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/functions.py", line 2142, in udf
    return _udf(f=f, returnType=returnType)
  File ".../spark/python/pyspark/sql/functions.py", line 2133, in _udf
    return udf_obj._wrapped()
  File ".../spark/python/pyspark/sql/functions.py", line 2090, in _wrapped
    @functools.wraps(self.func)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/functools.py", line 33, in update_wrapper
    setattr(wrapper, attr, getattr(wrapped, attr))
AttributeError: F instance has no attribute '__name__'

This worked in Spark 2.1:

from pyspark.sql import functions


class F(object):
    def __call__(self, x):
        return x

foo = F()
udf = functions.udf(foo)
spark.range(1).select(udf("id")).show()
+-----+
|F(id)|
+-----+
|    0|
+-----+

After

from pyspark.sql import functions


class F(object):
    def __call__(self, x):
        return x

foo = F()
udf = functions.udf(foo)
spark.range(1).select(udf("id")).show()
+-----+
|F(id)|
+-----+
|    0|
+-----+

In addition, we also happened to break partial functions as below:

from pyspark.sql import functions
from functools import partial


partial_func = partial(lambda x: x, x=1)
udf = functions.udf(partial_func)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/functions.py", line 2154, in udf
    return _udf(f=f, returnType=returnType)
  File ".../spark/python/pyspark/sql/functions.py", line 2145, in _udf
    return udf_obj._wrapped()
  File ".../spark/python/pyspark/sql/functions.py", line 2099, in _wrapped
    @functools.wraps(self.func, assigned=assignments)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/functools.py", line 33, in update_wrapper
    setattr(wrapper, attr, getattr(wrapped, attr))
AttributeError: 'functools.partial' object has no attribute '__module__'

This worked in Spark 2.1:

from pyspark.sql import functions
from functools import partial


partial_func = partial(lambda x: x, x=1)
udf = functions.udf(partial_func)
spark.range(1).select(udf()).show()
+---------+
|partial()|
+---------+
|        1|
+---------+

After

from pyspark.sql import functions
from functools import partial


partial_func = partial(lambda x: x, x=1)
udf = functions.udf(partial_func)
spark.range(1).select(udf()).show()
+---------+
|partial()|
+---------+
|        1|
+---------+

How was this patch tested?

Unit tests in python/pyspark/sql/tests.py and manual tests.

@HyukjinKwon
Copy link
Member Author

cc @holdenk, could you take a look when you have some time?

@SparkQA
Copy link

SparkQA commented Jul 12, 2017

Test build #79572 has finished for PR 18615 at commit 6d3ef48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this and the quick fix. One suggestion that we should leave a comment about the intention behind what we are doing but otherwise looks reasonable.

@@ -2087,10 +2087,13 @@ def _wrapped(self):
"""
Wrap this udf with a function and attach docstring from func
"""
@functools.wraps(self.func)
assignments = tuple(a for a in functools.WRAPPER_ASSIGNMENTS if a != "__name__")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd put a comment here saying what this is for just so that when we see this next year we don't forget why we stripped out name from the things we asked functools to assign.

@HyukjinKwon
Copy link
Member Author

Sure, thanks @holdenk. I just address your comments.

@SparkQA
Copy link

SparkQA commented Jul 13, 2017

Test build #79582 has finished for PR 18615 at commit 2317226.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

(gentle ping @holdenk)

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM will merge tomorrow unless anyone else has anything to say.

@HyukjinKwon HyukjinKwon changed the title [SPARK-21394][PYTHON] Reviving callable object support in UDF in PySpark [SPARK-21394][SPARK-21432][PYTHON] Reviving callable object/partial function support in UDF in PySpark Jul 17, 2017
@functools.wraps(self.func)

# It is possible for a callable instance without __name__ attribute or/and
# __module__ attribute to be wrapped here For example, functools.partial. In this case,
Copy link
Member

@viirya viirya Jul 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here For example -> here. For example
Or:
here For example -> here, for example

@HyukjinKwon
Copy link
Member Author

Thanks @viirya.

@holdenk, I happened to find another issue and tried to fix both here together.

@viirya
Copy link
Member

viirya commented Jul 17, 2017

LGTM

@SparkQA
Copy link

SparkQA commented Jul 17, 2017

Test build #79653 has finished for PR 18615 at commit 4950b8d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 17, 2017

Test build #79654 has finished for PR 18615 at commit 8167ea1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Copy link
Contributor

holdenk commented Jul 17, 2017

The update looks good to me. I'll merge this to master.

@asfgit asfgit closed this in 4ce735e Jul 17, 2017
@HyukjinKwon
Copy link
Member Author

Thank you @holdenk and @viirya.

@holdenk
Copy link
Contributor

holdenk commented Jul 17, 2017

Merged into master.

ptkool added a commit to Shopify/spark that referenced this pull request Oct 3, 2017
ptkool added a commit to Shopify/spark that referenced this pull request Oct 4, 2017
Resolve issues with udf functions and nullability.

Backport apache#18615

Eliminate whitespace diffs.
@HyukjinKwon HyukjinKwon deleted the callable-object branch January 2, 2018 03:41
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Aug 20, 2018
Resolve issues with udf functions and nullability.

Backport apache#18615

Eliminate whitespace diffs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants