[SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ should validate input types #16537
Conversation
Test build #71164 has finished for PR 16537 at commit …
```python
def test_udf_should_validate_input_args(self):
    from pyspark.sql.functions import udf

    self.assertRaises(TypeError, udf(lambda x: x), None)
```
I think this should have positive tests for a column and a string as well as a negative test.
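For reference, a minimal sketch of what such positive/negative tests might look like (assuming a test case in the style of `pyspark.sql.tests` with an active session available as `self.spark`; the method name is hypothetical):

```python
def test_udf_accepts_column_and_string(self):
    from pyspark.sql.functions import udf, col

    df = self.spark.createDataFrame([("a",)], ["x"])
    identity = udf(lambda v: v)
    # positive cases: a Column and a column-name string are both accepted
    self.assertEqual(df.select(identity(col("x"))).first()[0], "a")
    self.assertEqual(df.select(identity("x")).first()[0], "a")
    # negative case: anything else should fail fast with TypeError
    self.assertRaises(TypeError, identity, None)
```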
It is pretty well covered by existing `udf` tests. The more the merrier, but I am not sure what could be added without duplicating other test cases.

Do you think we should try some validation of the number of arguments?
Pros:
- It is easy to implement with `inspect` or `func.__code__` for plain Python objects (see the sketch below).
- It is nice to fail without starting a complex job.

Cons:
- It most likely won't work well for C extensions and such.
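For context, a rough sketch of the arity check under discussion, based on `inspect` (an illustration only, not code from this PR; on Python 2, `inspect.getargspec` would be needed instead of `inspect.signature`):

```python
import inspect

def expected_arity(func):
    """Best-effort positional arity of func; None means 'inconclusive'."""
    try:
        sig = inspect.signature(func)
    except (TypeError, ValueError):
        # builtins and C extensions often expose no introspectable signature
        return None
    params = sig.parameters.values()
    if any(p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD) for p in params):
        return None  # *args / **kwargs make the arity open-ended
    return sum(p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
               for p in params)

assert expected_arity(lambda x: x) == 1
```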
If this is covered by existing tests, then that's fine. Good point.
To validate the number of args, I think it is a good idea, as long as we know that it won't fail for C extensions (where it may simply be inconclusive).
Yeah. I am afraid it can actually cause more trouble than it's worth:
- If we throw an exception there is a chance we hit some border cases.
- Issuing a warning doesn't prevent task failure so it doesn't provide the same advantages as failing early.
Maybe it is better to leave it as is. Right now users get clear feedback if there is an incorrect type, and for additional safety one can always use annotations and a type checker.
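As an aside, the annotation route mentioned above might look like this (a sketch; the function name is made up, and a checker such as mypy only verifies direct callers of the plain function unless stubs like pyspark-stubs cover the wrapped UDF):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def shorten(s: str) -> str:
    # mypy can flag callers that pass `shorten` a non-str
    return s[:10]

shorten_udf = udf(shorten, StringType())
```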
Sounds good. It's probably worth exploring eventually, but there's no need to hold up this PR.
In the meantime I removed [WIP] and hopefully it will get merged :)
python/pyspark/sql/functions.py
Outdated
```python
for c in cols:
    if not isinstance(c, (Column, str)):
        raise TypeError(
            "All arguments should be Columns or strings representing column names. "
```
"All arguments" is a little vague, since this is going to be called later in code than the UDF definition. What about an error message like "Invalid UDF argument, not a String or Column: {0} of type {1}"
Sounds good. I think it should also provide a suggestion that one can use literals (`lit`, `array`, `struct` and `create_map`).
A literal ends up being a `Column`; aren't the others as well?
Yes, it is. My point is that it is not always obvious to new users. Giving a hint that there is such a thing as literals could be a good idea.
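To illustrate the hint, wrapping a plain Python value with `lit` (or `array`, `struct`, `create_map` for composite values) turns it into a `Column` that a UDF call accepts (a usage sketch assuming an active `SparkSession` named `spark`):

```python
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import IntegerType

add = udf(lambda a, b: a + b, IntegerType())
df = spark.createDataFrame([(1,), (2,)], ["x"])

# add(df.x, 10) would raise TypeError under this PR: 10 is neither a Column
# nor a column-name string. A literal Column makes the intent explicit:
df.select(add(df.x, lit(10))).show()
```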
Test build #71207 has finished for PR 16537 at commit …

Test build #71676 has finished for PR 16537 at commit …

Test build #71687 has finished for PR 16537 at commit …

Test build #72244 has finished for PR 16537 at commit …
+1
This LGTM. @holdenk?
So I'm curious what the motivation is for adding these checks. Looking in the mailing list archives, this doesn't seem like a common error (and the only Stack Overflow post about this I saw was using UDFs inside of a `parallelize`, so I don't think this would have helped). If this is an error people have run into and found confusing, though, the change looks pretty reasonable. I'm just a bit hesitant to start doing a lot of type-checking PRs for things people aren't running into or having difficulties with.
Test build #72826 has finished for PR 16537 at commit …
For me it is all about the bigger picture. I've been working with Python for quite a while now (probably too long for my own good) and I am used to two things: …

PySpark is not there yet. We get Py4J exceptions (although this improved a lot in 2.x), we get runtime exceptions with huge JVM tracebacks when it would be possible to fail fast (on the driver), and finally we get silent errors (like returning …). It is not always possible or practical to avoid these failures, but I believe that in cases where: …

it is a good idea to be proactive.
I definitely think moving errors earlier is important (nothing is worse than a 9-hour job that fails in the middle because of the wrong type). That being said, in this case the error isn't caught any earlier. I think doing this selectively in the cases where it makes sense would be good; I just don't want to spend a lot of time adding type checks in places where they don't give us a lot.
I don't think it is a good idea to dismiss this as having little use because it is a dumb mistake to pass something that isn't callable. In this case, it's easy to accidentally reuse a name for a function and a variable (e.g., …). Spark should have reasonable behavior for any error, as opposed to being harder to work with because we thought the user wasn't likely to hit a particular problem. This is very few lines of code that will make a user's experience much better, because it can report exactly what the problem is without running a job.
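For instance (a made-up illustration of the name-reuse failure mode, assuming some DataFrame `df`):

```python
from pyspark.sql.functions import udf

upper = udf(lambda s: s.upper())
col_name = "first_name"
col_name = len(col_name)      # oops: the variable now holds an int
df.select(upper(col_name))    # with this PR: an immediate, clear TypeError;
                              # without it: a cryptic Py4J error
```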
And there is of course the matter of user experience. Even if failure is cheap, something like this:

```
In [4]: from pyspark.sql.functions import udf

In [5]: udf(lambda x: x)(1)
---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<ipython-input-5-729166e23ad0> in <module>()
----> 1 udf(lambda x: x)(1)
...
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)
```

is not a useful exception for anyone who is not familiar with Spark internals.
I explored an alternative approach, adding type hints (https://github.com/zero323/pyspark-stubs), but I doubt it'll become particularly popular, and I won't even try to push it into the main repository :)
Sorry, my example was for validating the object passed to `udf`.
@rdblue I think we're perhaps thinking of different type checks. My understanding is that in this case the error is already thrown right away. It's also not that the user needs to pass a callable here; we're checking that the UDF is called with Columns or strings as arguments. I agree the current error message is a bit obtuse to new users, but I also don't want us to go adding type checking to every individual Py4J function wrapper in PR after PR; that just isn't scalable given the level of committer bandwidth available for Python. I think we should prioritize issues that users have actually run into, or figure out a more scalable way to solve this (either bigger batches of cleanups, improved explanations of the possible causes of Py4J errors, or type hints, although as @zero323 points out that's a larger discussion). (And for this particular type check I didn't see any posts related to it.)
Yeah, I thought this was the other PR that validates that the function is callable. Still, I don't agree that it's okay for Python to be less friendly as long as we don't think people will hit the problem too much, or because they solve the problem before asking a list. There are reasonable ways to hit this, and Spark should give a good explanation about what went wrong. I'm not saying we have to go fix all of the cases like this, but where there's a PR ready to go I think it's better to include it than not.
I think the overhead of doing this piecemeal takes review time away from more important changes (like places where users are actively encountering confusing error messages, incorrect behaviour, or missing functionality needed for Scala parity). As illustrated by your confusion about the purpose of this PR, there are other outstanding PRs adding similar checks in other places, and @zero323 is already familiar with the code base, hence my suggestion that they look at a more scalable solution (from the point of view of review time). Of course this is just a personal request that we solve this in a less piecemeal way to reduce review overhead (and a biased one at that), but I'll probably triage these issues as less important unless there is a clear link to a user issue (except those from new contributors getting familiar with the code base, which is valuable in other ways too). That being said, this discussion has gotten pretty far off topic for this individual PR and we should maybe move it to the JIRA or the lists if we want to continue it (though personally I think we are at an agree-to-disagree about priorities, and no one is obliged to listen to mine :)).
Maybe we're at an agree-to-disagree situation, but I think we may be talking about different things. If you're saying that we should try to keep these changes together to make reviews easier, I'd agree. I was under the impression that this change might be rejected because it isn't an important enough problem, which I think isn't a good way of looking at it.
Ah, perhaps then we are simply agreeing with each other. I'm fine with adding these types of fixes, but doing it one function at a time is just going to be too time-consuming and distracting from other, more useful changes.
Putting this particular PR and the scalability of the improvement process aside, Spark is heavily underdocumented. This is something that hits Python and R users much harder than everyone else: in the worst-case scenario, when working with Scala you can at least follow the types. It wouldn't be a problem if PySpark used consistent conventions and idiomatic Python and didn't make hidden assumptions once in a while :) Take things like …

In hindsight I overdid it with the number of tasks, but in my defense, I am pretty sure that at least some of these won't get merged. Moreover, the problem is not imaginary. For many users it is not obvious how to use UDFs (http://stackoverflow.com/q/35546576, http://stackoverflow.com/q/35375255, http://stackoverflow.com/q/39254503), and the docstring of …
Test build #72884 has finished for PR 16537 at commit …

Test build #72953 has finished for PR 16537 at commit …

Test build #73936 has finished for PR 16537 at commit …
@holdenk If you don't see this getting merged, could you resolve the JIRA ticket and I'll just close the PR? No reason to keep this open ad infinitum :) TIA
@zero323 Hi, are you still working on this?
So it seems there isn't a solid reason not to merge this, provided we aren't going to go down the rabbit hole we've been talking about. Let's make sure everything is still OK with Jenkins (Jenkins, retest this please).
Jenkins, retest this please.
Test build #78307 has finished for PR 16537 at commit …
If you have the time to update/fix this, @zero323, I'm happy to merge it pending Jenkins; otherwise I'll just close the issue at the end of the month.
@holdenk I'll try to reproduce this problem, but it looks a bit awkward:

…

Doesn't look like something related to this PR at all 😕
Test build #78313 has finished for PR 16537 at commit …
I cannot reproduce this locally, but do we really use …?
Test build #78315 has finished for PR 16537 at commit …
Jenkins, retest this please.
Test build #78330 has finished for PR 16537 at commit …
```python
@@ -1949,6 +1949,14 @@ def _create_judf(self):
        return judf

    def __call__(self, *cols):
        for c in cols:
            if not isinstance(c, (Column, str)):
```
Doesn't this break `unicode` support in Python 2?

```python
from pyspark.sql.functions import udf
udf(lambda x: x)(u"a")
```

Before:

```
Column<<lambda>(a)>
```

After:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/functions.py", line 1970, in wrapper
    return self(*args)
  File ".../spark/python/pyspark/sql/functions.py", line 1958, in __call__
    "lit, array, struct or create_map.".format(c, type(c)))
TypeError: Invalid UDF argument, not a str or Column: a of type <type 'unicode'>. For Column literals use sql.functions lit, array, struct or create_map.
```
@HyukjinKwon Sorry for the delayed response, I am seldom online these days. You're right, it looks like an issue. I'll take a look at this when I have more time.
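A common way to handle the `str`/`unicode` split at the time was a small version shim; a sketch of one possible fix (the helper name is hypothetical, and this is not necessarily what was eventually merged):

```python
import sys

from pyspark.sql.column import Column

if sys.version_info[0] < 3:
    # Python 2 column names may arrive as either str or unicode
    _string_types = (str, unicode)  # noqa: F821
else:
    _string_types = (str,)

def _validate_udf_arg(c):
    if not isinstance(c, (Column,) + _string_types):
        raise TypeError(
            "Invalid UDF argument, not a str or Column: "
            "{0} of type {1}.".format(c, type(c)))
```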
I see people run into this kind of thing quite a bit.
Let me take over this one, with credit to @zero323.
Thanks @HyukjinKwon :D
Thanks @HyukjinKwon
What changes were proposed in this pull request?

Adds basic input validation for `UserDefinedFunction.__call__` to avoid failing with cryptic Py4J errors.

How was this patch tested?

Unit tests.