[SPARK-19165][PYTHON][SQL] UserDefinedFunction.__call__ should validate input types #16537
Conversation
Test build #71164 has finished for PR 16537 at commit …
```python
def test_udf_should_validate_input_args(self):
    from pyspark.sql.functions import udf

    self.assertRaises(TypeError, udf(lambda x: x), None)
```
I think this should have positive tests for a column and a string as well as a negative test.
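For reference, a minimal sketch of what such positive/negative tests might look like (assuming a test case in the style of `pyspark.sql.tests` with an active session available as `self.spark`; the method name is hypothetical):

```python
def test_udf_accepts_column_and_string(self):
    from pyspark.sql.functions import udf, col

    df = self.spark.createDataFrame([("a",)], ["x"])
    identity = udf(lambda v: v)
    # positive cases: a Column and a column-name string are both accepted
    self.assertEqual(df.select(identity(col("x"))).first()[0], "a")
    self.assertEqual(df.select(identity("x")).first()[0], "a")
    # negative case: anything else should fail fast with TypeError
    self.assertRaises(TypeError, identity, None)
```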
It is pretty well covered by existing `udf` tests. The more the merrier, but I am not sure what could be added without duplicating other test cases.

Do you think we should try some validation of the number of arguments?
Pros:
- It is easy to implement with `inspect` or `func.__code__` for plain Python objects (see the sketch below).
- It is nice to fail without starting a complex job.

Cons:
- It most likely won't work well for C extensions and such.
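For context, a rough sketch of the arity check under discussion, based on `inspect` (an illustration only, not code from this PR; on Python 2, `inspect.getargspec` would be needed instead of `inspect.signature`):

```python
import inspect

def expected_arity(func):
    """Best-effort positional arity of func; None means 'inconclusive'."""
    try:
        sig = inspect.signature(func)
    except (TypeError, ValueError):
        # builtins and C extensions often expose no introspectable signature
        return None
    params = sig.parameters.values()
    if any(p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD) for p in params):
        return None  # *args / **kwargs make the arity open-ended
    return sum(p.kind in (p.POSITIONAL_ONLY, p.POSITIONAL_OR_KEYWORD)
               for p in params)

assert expected_arity(lambda x: x) == 1
```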
If this is covered by existing tests, then that's fine. Good point.
To validate the number of args, I think it is a good idea, as long as we know that it won't fail for C extensions (where it may simply be inconclusive).
Yeah. I am afraid it can actually cause more trouble than it's worth:
- If we throw an exception there is a chance we hit some border cases.
- Issuing a warning doesn't prevent task failure so it doesn't provide the same advantages as failing early.
Maybe it is better to leave it as is. Right now users get clear feedback if there is an incorrect type, and for additional safety one can always use annotations and a type checker.
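As an aside, the annotation route mentioned above might look like this (a sketch; the function name is made up, and a checker such as mypy only verifies direct callers of the plain function unless stubs like pyspark-stubs cover the wrapped UDF):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def shorten(s: str) -> str:
    # mypy can flag callers that pass `shorten` a non-str
    return s[:10]

shorten_udf = udf(shorten, StringType())
```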
Sounds good. It's probably worth exploring eventually, but there's no need to hold up this PR.
In the meantime I removed [WIP] and hopefully it will get merged :)
python/pyspark/sql/functions.py
Outdated
```python
for c in cols:
    if not isinstance(c, (Column, str)):
        raise TypeError(
            "All arguments should be Columns or strings representing column names. "
```
"All arguments" is a little vague, since this is going to be called later in code than the UDF definition. What about an error message like "Invalid UDF argument, not a String or Column: {0} of type {1}"
Sounds good. I think it should also provide a suggestion that one can use literals (`lit`, `array`, `struct` and `create_map`).
A literal ends up being a `Column`; aren't the others as well?
Yes, it is. My point is that it is not always obvious to new users. Giving a hint that there is such a thing as literals could be a good idea.
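To illustrate the hint, wrapping a plain Python value with `lit` (or `array`, `struct`, `create_map` for composite values) turns it into a `Column` that a UDF call accepts (a usage sketch assuming an active `SparkSession` named `spark`):

```python
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import IntegerType

add = udf(lambda a, b: a + b, IntegerType())
df = spark.createDataFrame([(1,), (2,)], ["x"])

# add(df.x, 10) would raise TypeError under this PR: 10 is neither a Column
# nor a column-name string. A literal Column makes the intent explicit:
df.select(add(df.x, lit(10))).show()
```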
Test build #71207 has finished for PR 16537 at commit …

Test build #71676 has finished for PR 16537 at commit …

Test build #71687 has finished for PR 16537 at commit …

Test build #72244 has finished for PR 16537 at commit …
+1
This LGTM. @holdenk?
So I'm curious what the motivation is for adding these checks. Looking in the mailing list archives, this doesn't seem like a common error (and the only Stack Overflow post about this I saw was using UDFs inside of a `parallelize`, so I don't think this would have helped). If this is an error people have run into and found confusing, though, the change looks pretty reasonable. I'm just a bit hesitant to start doing a lot of type-checking PRs for things people aren't running into or having difficulties with.
Test build #72826 has finished for PR 16537 at commit …
For me it is all about the bigger picture. I've been working with Python for quite a while now (probably too long for my own good) and I am used to two things: …

PySpark is not there yet. We get Py4J exceptions (although this improved a lot in 2.x), we get runtime exceptions with huge JVM tracebacks when it would be possible to fail fast (on the driver), and finally we get silent errors (like returning …). It is not always possible or practical to avoid these failures, but I believe that in cases where: …

it is a good idea to be proactive.
I definitely think moving errors earlier is important (nothing is worse than a 9-hour job that fails in the middle because of the wrong type). That being said, in this case the error isn't caught any earlier. I think doing this selectively in the cases where it makes sense would be good; I just don't want to spend a lot of time adding type checks in places where they don't give us a lot.
I don't think it is a good idea to dismiss this as having little use because it is a dumb mistake to pass something that isn't callable. In this case, it's easy to accidentally reuse a name for a function and a variable (e.g., …). Spark should have reasonable behavior for any error, as opposed to being harder to work with because we thought the user wasn't likely to hit a particular problem. This is very few lines of code that will make a user's experience much better, because it can report exactly what the problem is without running a job.
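For instance (a made-up illustration of the name-reuse failure mode, assuming some DataFrame `df`):

```python
from pyspark.sql.functions import udf

upper = udf(lambda s: s.upper())
col_name = "first_name"
col_name = len(col_name)      # oops: the variable now holds an int
df.select(upper(col_name))    # with this PR: an immediate, clear TypeError;
                              # without it: a cryptic Py4J error
```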
And there is of course the matter of user experience. Even if failure is cheap, something like this:

```
In [4]: from pyspark.sql.functions import udf

In [5]: udf(lambda x: x)(1)
---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<ipython-input-5-729166e23ad0> in <module>()
----> 1 udf(lambda x: x)(1)
...
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)
```

is not a useful exception for anyone who is not familiar with Spark internals.
I explored an alternative approach, adding type hints (https://github.com/zero323/pyspark-stubs), but I doubt it'll become particularly popular, and I won't even try to push it into the main repository :)
Sorry, my example was for validating the object passed to `udf`.
@rdblue I think we're perhaps thinking of different type checks. My understanding is that in this case the error is already thrown right away. It's also not that the user needs to pass a callable here; we're checking that the UDF is called with Columns or strings as arguments. I agree the current error message is a bit obtuse to new users, but I also don't want us to go adding type checking to every individual Py4J function wrapper in PR after PR; that just isn't scalable given the level of committer bandwidth available for Python. I think we should prioritize issues that users have actually run into, or figure out a more scalable way to solve this (either bigger batches of cleanups, improved explanations of the possible causes of Py4J errors, or type hints, although as @zero323 points out that's a larger discussion). (And for this particular type check I didn't see any posts related to it.)
Yeah, I thought this was the other PR that validates that the function is callable. Still, I don't agree that it's okay for Python to be less friendly as long as we don't think people will hit the problem too much, or because they solve the problem before asking a list. There are reasonable ways to hit this, and Spark should give a good explanation about what went wrong. I'm not saying we have to go fix all of the cases like this, but where there's a PR ready to go I think it's better to include it than not.
I think the overhead of doing this piecemeal takes review time away from more important changes (like places where users are actively encountering confusing error messages, incorrect behaviour, or missing functionality needed for Scala parity). As illustrated by your confusion about the purpose of this PR, there are other outstanding PRs adding similar checks in other places, and @zero323 is already familiar with the code base, hence my suggestion that they look at a more scalable solution (from the point of view of review time). Of course this is just a personal request that we solve this in a less piecemeal way to reduce review overhead (and a biased one at that), but I'll probably triage these issues as less important unless there is a clear link to a user issue (except those from new contributors getting familiar with the code base, which is valuable in other ways too). That being said, this discussion has gotten pretty far off topic for this individual PR and we should maybe move it to the JIRA or the lists if we want to continue it (though personally I think we are at an agree-to-disagree about priorities, and no one is obliged to listen to mine :)).
Maybe we're at an agree-to-disagree situation, but I think we may be talking about different things. If you're saying that we should try to keep these changes together to make reviews easier, I'd agree. I was under the impression that this change might be rejected because it isn't an important enough problem, which I think isn't a good way of looking at it.
Ah, perhaps then we are simply agreeing with each other. I'm fine with adding these types of fixes, but doing it one function at a time is just going to be too time-consuming and distracting from other, more useful changes.
Putting this particular PR and the scalability of the improvement process aside, Spark is heavily underdocumented. This is something that hits Python and R users much harder than everyone else: in the worst-case scenario, when working with Scala you can at least follow the types. It wouldn't be a problem if PySpark used consistent conventions and idiomatic Python and didn't make hidden assumptions once in a while :) Take things like …

In hindsight I overdid it with the number of tasks, but in my defense, I am pretty sure that at least some of these won't get merged. Moreover, the problem is not imaginary. For many users it is not obvious how to use UDFs (http://stackoverflow.com/q/35546576, http://stackoverflow.com/q/35375255, http://stackoverflow.com/q/39254503), and the docstring of …
Test build #72884 has finished for PR 16537 at commit …

Test build #72953 has finished for PR 16537 at commit …

Test build #73936 has finished for PR 16537 at commit …
@holdenk If you don't see this getting merged, could you resolve the JIRA ticket and I'll just close the PR? No reason to keep this open ad infinitum :) TIA
@zero323 Hi, are you still working on this?
So it seems there isn't a solid reason not to merge this, provided we aren't going to go down the rabbit hole we've been talking about. Let's make sure everything is still OK with Jenkins (Jenkins, retest this please).
Jenkins, retest this please.
Test build #78307 has finished for PR 16537 at commit …
If you have the time to update/fix this, @zero323, I'm happy to merge it pending Jenkins; otherwise I'll just close the issue at the end of the month.
@holdenk I'll try to reproduce this problem, but it looks a bit awkward:

…

Doesn't look like something related to this PR at all 😕
Test build #78313 has finished for PR 16537 at commit …
I cannot reproduce this locally, but do we really use …?
Test build #78315 has finished for PR 16537 at commit …
Jenkins, retest this please.
Test build #78330 has finished for PR 16537 at commit …
```python
@@ -1949,6 +1949,14 @@ def _create_judf(self):
        return judf

    def __call__(self, *cols):
        for c in cols:
            if not isinstance(c, (Column, str)):
```
Doesn't this break `unicode` support in Python 2?

```python
from pyspark.sql.functions import udf
udf(lambda x: x)(u"a")
```

Before:

```
Column<<lambda>(a)>
```

After:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/functions.py", line 1970, in wrapper
    return self(*args)
  File ".../spark/python/pyspark/sql/functions.py", line 1958, in __call__
    "lit, array, struct or create_map.".format(c, type(c)))
TypeError: Invalid UDF argument, not a str or Column: a of type <type 'unicode'>. For Column literals use sql.functions lit, array, struct or create_map.
```
@HyukjinKwon Sorry for the delayed response, I am seldom online these days. You're right, it looks like an issue. I'll take a look at this when I have more time.
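A common way to handle the `str`/`unicode` split at the time was a small version shim; a sketch of one possible fix (the helper name is hypothetical, and this is not necessarily what was eventually merged):

```python
import sys

from pyspark.sql.column import Column

if sys.version_info[0] < 3:
    # Python 2 column names may arrive as either str or unicode
    _string_types = (str, unicode)  # noqa: F821
else:
    _string_types = (str,)

def _validate_udf_arg(c):
    if not isinstance(c, (Column,) + _string_types):
        raise TypeError(
            "Invalid UDF argument, not a str or Column: "
            "{0} of type {1}.".format(c, type(c)))
```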
I see people run into this kind of thing quite a bit.
Let me take over this one, with credit to @zero323.
Thanks @HyukjinKwon :D
Thanks @HyukjinKwon
What changes were proposed in this pull request?

Adds basic input validation for `UserDefinedFunction.__call__` to avoid failing with cryptic Py4J errors.

How was this patch tested?

Unit tests.