
[SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ should validate input types #16537

Closed
wants to merge 1 commit into from

Conversation

zero323
Member

@zero323 zero323 commented Jan 10, 2017

What changes were proposed in this pull request?

Adds basic input validation for UserDefinedFunction.__call__ to avoid failing with cryptic Py4J errors.

How was this patch tested?

Unit tests.

@SparkQA

SparkQA commented Jan 10, 2017

Test build #71164 has finished for PR 16537 at commit dd79069.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def test_udf_should_validate_input_args(self):
    from pyspark.sql.functions import udf

    self.assertRaises(TypeError, udf(lambda x: x), None)
Contributor

I think this should have positive tests for a column and a string as well as a negative test.

Member Author

It is pretty well covered by existing udf tests. The more the merrier, but I am not sure what could be added without duplicating other test cases.

Do you think we should also try to validate the number of arguments?

Pros:

  • It is easy to implement with inspect or func.__code__ for plain Python objects.
  • It is nice to fail without starting a complex job.

Cons:

  • It most likely won't work well for C extensions and such.
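The arity check described above can be sketched with `inspect` (illustrative only, not part of this patch; this uses the Python 3 `inspect.signature` API rather than `func.__code__`):

```python
import inspect

def positional_arity(f):
    """Number of positional parameters of f, or None when introspection
    is unsupported (as can happen with C extensions)."""
    try:
        sig = inspect.signature(f)
    except (TypeError, ValueError):
        # Some builtins and extension functions expose no signature;
        # in that case the check would have to be skipped (inconclusive).
        return None
    return sum(
        1 for p in sig.parameters.values()
        if p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
    )

assert positional_arity(lambda x, y: x) == 2
```

Returning `None` for non-introspectable callables matches the "may be inconclusive" caveat discussed below: the check can only warn or skip, never reject, when the arity is unknown.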

Contributor

If this is covered by existing tests, then that's fine. Good point.

To validate number of args, I think it is a good idea, as long as we know that it won't fail C extensions (but may be inconclusive).

Member Author

Yeah. I am afraid it can actually cause more trouble than it's worth:

  • If we throw an exception, there is a chance we hit some edge cases.
  • Issuing a warning doesn't prevent task failure, so it doesn't provide the same advantages as failing early.

Maybe it is better to leave it as is. Right now users get clear feedback if there is an incorrect type, and for additional safety one can always use annotations and a type checker.

Contributor

Sounds good. It's probably worth exploring eventually, but there's no need to hold up this PR.

Member Author

In the meantime, I removed [WIP] and hopefully it will get merged :)

for c in cols:
    if not isinstance(c, (Column, str)):
        raise TypeError(
            "All arguments should be Columns or strings representing column names. "
Contributor

"All arguments" is a little vague, since this is going to be called later in the code than the UDF definition. What about an error message like "Invalid UDF argument, not a String or Column: {0} of type {1}"?

Member Author

Sounds good. I think it should also suggest that one can use literals (lit, array, struct and create_map).
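A sketch of a message builder combining both suggestions (the wording is illustrative; the exact text that landed may differ):

```python
def invalid_udf_arg_message(c):
    # Name the offending value and its type, then hint at the
    # literal-construction helpers for users who passed a plain value.
    return (
        "Invalid UDF argument, not a str or Column: {0} of type {1}. "
        "For Column literals use sql.functions "
        "lit, array, struct or create_map.".format(c, type(c)))

print(invalid_udf_arg_message(1))
```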

Contributor

A literal ends up being a Column; aren't the others Columns as well?

Member Author

Yes, it is. My point is that this is not always obvious to new users. Giving a hint that there is such a thing as literals could be a good idea.

@SparkQA

SparkQA commented Jan 11, 2017

Test build #71207 has finished for PR 16537 at commit a94f755.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 19, 2017

Test build #71676 has finished for PR 16537 at commit 9e0ab69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71687 has finished for PR 16537 at commit 7aa1607.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 1, 2017

Test build #72244 has finished for PR 16537 at commit 5a25ee2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Feb 1, 2017

+1

@zero323 zero323 changed the title [SPARK-19165][PYTHON][SQL][WIP] UserDefinedFunction.__call__ should validate input types [SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ should validate input types Feb 1, 2017
Contributor

@nchammas left a comment

This LGTM. @holdenk?

@holdenk
Contributor

holdenk commented Feb 13, 2017

So I'm curious what the motivation is for adding these checks. Looking at the mailing list archives, this doesn't seem like a common error (and the only Stack Overflow post about it I saw was using UDFs inside of a parallelize, so I don't think this would have helped). If this is an error people have run into and found confusing, though, the change looks pretty reasonable; I'm just a bit hesitant to start doing a lot of type-checking PRs for things people aren't running into or having difficulties with.

@SparkQA

SparkQA commented Feb 13, 2017

Test build #72826 has finished for PR 16537 at commit 60bde4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member Author

zero323 commented Feb 13, 2017

For me it is all about the bigger picture. I've been working with Python for quite a while now (probably too long for my own good) and I am used to two things:

  • A language that is relatively forgiving when it comes to types. I am more used to thinking about abstract base classes than concrete types.
  • A language that communicates failures clearly. If there is a problem with incompatible types or interfaces, I expect to receive clear feedback, and of course an interactive debugger on top of that if something goes particularly wrong.

PySpark is not there yet. We get Py4J exceptions (although this improved a lot in 2.x), we get runtime exceptions with huge JVM tracebacks when it would be possible to fail fast (on the driver), and finally we get silent errors (like returning an int from a UDF with declared type float).

It is not always possible or practical to avoid these failures, but I believe that in cases where:

  • we have very strict requirements regarding types,
  • the cost of checking is low (O(1), not, for example, O(N)), and
  • we fail early and prevent an expensive failure in the middle of a pipeline,

it is a good idea to be proactive.

@holdenk
Contributor

holdenk commented Feb 13, 2017

I definitely think moving errors earlier is important (nothing is worse than a 9-hour job that fails in the middle because of the wrong type). That being said, in this case the error isn't caught any earlier. I think doing this selectively in the cases where it makes sense would be good; I just don't want to spend a lot of time adding type checks in places where they don't give us a lot.

@rdblue
Contributor

rdblue commented Feb 13, 2017

I don't think it is a good idea to assume this has little use just because it is a dumb mistake to pass something that isn't callable. In this case, it's easy to accidentally reuse a name for a function and a variable (e.g., format), especially as scripts change over time and pass from one maintainer to another.

Spark should have reasonable behavior for any error, as opposed to being harder to work with because we thought the user wasn't likely to hit a particular problem. This is very few lines of code that will make a user's experience much better because it can catch exactly what the problem is, without running a job.
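The name-reuse mistake is easy to reproduce (a hypothetical script; `format` is just an example of a commonly shadowed name):

```python
def format(record):
    """Format a record before handing it to a UDF."""
    return record.upper()

# A later edit reuses the name for a configuration value...
format = "%Y-%m-%d"

# ...so passing `format` to udf(...) would now hand Spark a string,
# not a callable -- exactly the slip that early validation can surface.
assert not callable(format)
```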

@zero323
Member Author

zero323 commented Feb 13, 2017

And there is, of course, the matter of user experience. Even if the failure is cheap, something like this:

In [4]: from pyspark.sql.functions import udf

In [5]: udf(lambda x: x)(1)
---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<ipython-input-5-729166e23ad0> in <module>()
----> 1 udf(lambda x: x)(1)

...

Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)

is not a useful exception for anyone who is not familiar with Spark internals.

@zero323
Member Author

zero323 commented Feb 13, 2017

I explored an alternative approach, adding type hints (https://github.com/zero323/pyspark-stubs), but I doubt it'll become particularly popular, and I won't even try to push it into the main repository :)

@rdblue
Contributor

rdblue commented Feb 13, 2017

Sorry, my example was about validating that the object passed to udf is callable, not about the use of the UDF. I still think it's a good idea not to make assumptions about how a user makes a mistake. Error checking should be proportional to how difficult it is to find out what happened, not how likely it is to happen.

@holdenk
Contributor

holdenk commented Feb 13, 2017

@rdblue I think we may be talking about different type checks. My understanding is that in this case the error is already thrown right away. It's also not that the user needs to pass a callable here; we're checking that the UDF is called with Columns or strings as arguments.

I agree the current error message is a bit obtuse to new users, but I also don't want us to go adding type checking to every individual function wrapper to Py4J in another PR - that just isn't scalable given the level of committer bandwidth available in Python. I think we should prioritize issues that users have run into, or figure out a more scalable way to solve this (either bigger batches of cleanups, improved explanations of the possible causes of Py4J errors, or type hints - although, as zero323 points out, that's a larger discussion). (And for this particular type check I didn't see any posts related to it.)

@rdblue
Contributor

rdblue commented Feb 13, 2017

Yeah, I thought this was the other PR, the one that validates that the function is callable. Still, I don't agree that it's okay for Python to be less friendly as long as we don't think people will hit the problem much, or because they solve the problem before asking on a list. There are reasonable ways to hit this, and Spark should give a good explanation of what went wrong.

I'm not saying we have to go fix all of the cases like this, but where there's a PR ready to go, I think it's better to include it than not.

@holdenk
Contributor

holdenk commented Feb 13, 2017

I think the overhead of doing this piecemeal takes review time away from more important changes (like places where users are actively encountering confusing error messages, incorrect behaviour, or missing functionality for Scala parity).

As illustrated by your confusion about the purpose of this PR, there are other outstanding PRs adding similar checks in other places, and @zero323 is already familiar with the code base - hence my suggestion that they look at a more scalable solution (from the point of view of review time).

Of course this is just a personal request that we look at solving this in a less piecemeal way to reduce review overhead (and a biased one at that), but I'll probably triage these issues as less important unless there is a clear link to a user issue (except for new contributors getting familiar with the code base, which is valuable in other ways too).

That being said, this discussion has gotten pretty far off topic for this individual PR, and we should maybe move it to the JIRA or the lists if we want to continue it (though personally I think we are at an agree-to-disagree point about priorities, and no one is obliged to listen to my priorities :)).

@rdblue
Contributor

rdblue commented Feb 13, 2017

Maybe we're at an agree-to-disagree point, but I think we may be talking about different things. If you're saying that we should try to batch these changes together to make reviews easier, I'd agree. I was under the impression that this change might be rejected because it isn't an important enough problem, which I don't think is a good way of looking at it.

@holdenk
Contributor

holdenk commented Feb 13, 2017

Ah, perhaps then we are simply agreeing with each other. I'm fine with adding these types of fixes - but doing it one function at a time is just going to be too time-consuming and distracting from other, more useful changes.

@zero323
Member Author

zero323 commented Feb 13, 2017

Putting this particular PR and the scalability of the improvement process aside, Spark is heavily underdocumented. This hits Python and R users far more than everyone else. In the worst case, when working with Scala, you can at least follow the types. It wouldn't be a problem if Spark used consistent conventions and idiomatic Python and didn't make hidden assumptions once in a while :) Take things like DataFrame.replace, or the parts of the DStream API that expect a function rather than an arbitrary Callable, for example. I am not really a stakeholder here, but I really believe that small things like this are crucial to a decent user experience.

In hindsight, I overdid it with the number of tasks, but in my defense I am pretty sure that at least some of these won't get merged. Moreover, the problem is not imaginary. For many users it is not obvious how to use UDFs (http://stackoverflow.com/q/35546576, http://stackoverflow.com/q/35375255, http://stackoverflow.com/q/39254503), and the docstring of udf is actually the only documented example I found.

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72884 has finished for PR 16537 at commit c2add7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 15, 2017

Test build #72953 has finished for PR 16537 at commit 85e4a7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 5, 2017

Test build #73936 has finished for PR 16537 at commit 0222c44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member Author

zero323 commented Mar 16, 2017

@holdenk If you don't see this being merged, could you resolve the JIRA ticket and I'll just close the PR? No reason to keep this open ad infinitum :) TIA

@ueshin
Member

ueshin commented Jun 20, 2017

@zero323 Hi, are you still working on this?

@holdenk
Contributor

holdenk commented Jun 20, 2017

So it seems there isn't a solid reason not to merge this, provided we aren't going to go down the rabbit hole we've been talking about. Let's make sure everything is still OK with Jenkins (Jenkins, retest this please).

@ueshin
Member

ueshin commented Jun 20, 2017

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 20, 2017

Test build #78307 has finished for PR 16537 at commit 0222c44.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Jun 20, 2017

If you have the time to update/fix this, @zero323, I'm happy to merge it pending Jenkins; otherwise I'll just close the issue at the end of the month.

@zero323
Member Author

zero323 commented Jun 20, 2017

@holdenk I'll try to reproduce this problem but it looks a bit awkward:

AttributeError: 'function' object has no attribute 'closure'

Doesn't look like something related to this PR at all 😕

@SparkQA

SparkQA commented Jun 20, 2017

Test build #78313 has finished for PR 16537 at commit 0086723.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323 zero323 reopened this Jun 20, 2017
@zero323
Member Author

zero323 commented Jun 20, 2017

I cannot reproduce this locally, but do we really use pypy-2.0.2?

@SparkQA

SparkQA commented Jun 20, 2017

Test build #78315 has finished for PR 16537 at commit d476faf.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Member

ueshin commented Jun 20, 2017

Jenkins, retest this please.

@SparkQA

SparkQA commented Jun 20, 2017

Test build #78330 has finished for PR 16537 at commit d476faf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1949,6 +1949,14 @@ def _create_judf(self):
        return judf

    def __call__(self, *cols):
        for c in cols:
            if not isinstance(c, (Column, str)):
Member

Doesn't this break unicode support in Python 2?

from pyspark.sql.functions import udf
udf(lambda x: x)(u"a")

before

Column<<lambda>(a)>

after

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/functions.py", line 1970, in wrapper
    return self(*args)
  File ".../spark/python/pyspark/sql/functions.py", line 1958, in __call__
    "lit, array, struct or create_map.".format(c, type(c)))
TypeError: Invalid UDF argument, not a str or Column: a of type <type 'unicode'>. For Column literals use sql.functions lit, array, struct or create_map.
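One way to keep the check Python 2-friendly is to test against a version-dependent tuple of string types (a sketch; the eventual fix may have differed):

```python
import sys

# On Python 2, both str and unicode should count as column names;
# basestring covers both. On Python 3, str alone is enough.
if sys.version_info[0] >= 3:
    string_types = (str,)
else:
    string_types = (basestring,)  # noqa: F821

assert isinstance(u"a", string_types)
```

The `isinstance(c, (Column, str))` check in the patch would then become `isinstance(c, (Column,) + string_types)`, accepting unicode column names on Python 2 without changing Python 3 behaviour.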

Member Author

@HyukjinKwon Sorry for the delayed response, I am seldom online these days. You're right, it looks like an issue. I'll take a look at this when I have more time.

@zero323 zero323 closed this Jul 13, 2017
@felixcheung
Member

I see people run into this kind of thing quite a bit. It sounds like this is important to have. How about reviving some form of this?

@HyukjinKwon
Member

Let me take over this one, with credit to @zero323.

@holdenk
Contributor

holdenk commented Aug 22, 2017

Thanks @HyukjinKwon :D

@zero323
Member Author

zero323 commented Aug 29, 2017

Thanks @HyukjinKwon
