[SPARK-16542][SQL][PYSPARK] Fix bugs about types that result in an array of null when creating DataFrame using python #18444
Conversation
Python's array module has more types than Python itself; for example, Python only has float, while array supports both 'f' (float) and 'd' (double). Switching to array.typecode helps Spark make a better inference. For example, for the code:

```python
from pyspark.sql.types import _infer_type
from array import array

a = array('f', [1, 2, 3, 4, 5, 6])
_infer_type(a)
```

we get ArrayType(DoubleType,true) before this change, but ArrayType(FloatType,true) after it.
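For illustration, here is a minimal sketch of the typecode-based inference this change relies on; the names `_example_mappings` and `infer_array_element_type` are illustrative, not the PR's actual identifiers:

```python
from array import array
from pyspark.sql.types import ArrayType, FloatType, DoubleType

# Hypothetical subset of a typecode -> Spark type table; the PR builds a
# fuller table covering integer and string typecodes as well.
_example_mappings = {'f': FloatType, 'd': DoubleType}

def infer_array_element_type(arr):
    # Infer from the array's typecode rather than from its first element,
    # so 'f' maps to FloatType instead of being widened to DoubleType.
    return ArrayType(_example_mappings[arr.typecode](), False)

print(infer_array_element_type(array('f', [1, 2, 3])))  # element type: FloatType
print(infer_array_element_type(array('d', [1, 2, 3])))  # element type: DoubleType
```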
Thank you for working on this!
I left some comments for now.
python/pyspark/sql/tests.py
Outdated
```python
for t in int_types:
    # test positive numbers
    a = array.array(t, [1])
    while True:
```
What is this loop for?
Let me figure that out. This patch is more than a year old and I forget what was happening.
python/pyspark/sql/types.py
Outdated
```python
    for v in obj:
        if v is not None:
            return ArrayType(_infer_type(obj[0]), True)
    else:
        return ArrayType(NullType(), True)
elif isinstance(obj, array):
    if obj.typecode in _array_type_mappings:
        return ArrayType(_array_type_mappings[obj.typecode](), True)
```
I'm not sure if we can change the element type. Doesn't this break backward-compatibility? cc @HyukjinKwon
Thanks for cc'ing me. If we say that ignoring the specified type in `array`, e.g., `float` becoming `double`, is a bug, I think this is the same kind of issue as the Pandas type related PR we helped review before.
@HyukjinKwon Could you send me the link to "the Pandas type related PR we helped review before"?
I meant 67c7502. I think it is okay.
ok to test
Test build #78775 has finished for PR 18444 at commit
While I am here, I just ran some tests as below:

```python
from array import array
from pyspark.sql import Row

spark.createDataFrame([Row(floatarray=array('f', [1, 2, 3]))]).show()
spark.createDataFrame([Row(unicodearray=array('u', [u"a", u"b"]))]).show()
```

Before:

```python
>>> spark.createDataFrame([Row(floatarray=array('f', [1, 2, 3]))]).show()
>>> spark.createDataFrame([Row(unicodearray=array('u', [u"a", u"b"]))]).show()
```

After:

```python
>>> spark.createDataFrame([Row(floatarray=array('f', [1, 2, 3]))]).show()
>>> spark.createDataFrame([Row(unicodearray=array('u', [u"a", u"b"]))]).show()
```

I think we should not drop this support. I guess the same thing would happen to
```python
# array types depend on C implementation on the machine. Therefore there
# is no machine independent correspondence between python's array types
# and Scala types.
# See: https://docs.python.org/2/library/array.html
```
@ueshin The answer to your question is explained here, in the couple of lines of comments I just added.
@HyukjinKwon Thanks for figuring that out. I will fix those issues.
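To make the machine dependence concrete, here is a small sketch (not part of the PR) showing how the bit widths behind typecodes like 'i' and 'l' can be measured at runtime with ctypes, which is why any typecode-to-Catalyst mapping has to be computed per machine:

```python
import ctypes

# The C types backing array typecodes vary by platform; e.g. a C long is
# 32 bits on some systems and 64 bits on others.
print(ctypes.sizeof(ctypes.c_int) * 8)   # bits in a C int on this machine
print(ctypes.sizeof(ctypes.c_long) * 8)  # bits in a C long on this machine
```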
Test build #78802 has finished for PR 18444 at commit
@HyukjinKwon It is strange that the float array does not work and the pyspark unit tests fail. This patch was created more than a year ago and it worked well at that time. Maybe some changes in Spark's internals broke this patch. I didn't follow Spark's internal changes, so I have no idea how to fix this right now. Please allow me some time to figure things out, and if you have an idea of what the problem is, please let me know.
Btw, seems like
In my local environment, this patch looks to be working well for floatarray:

```python
df = spark.createDataFrame([Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))])
df.printSchema()
df.show()
```

@HyukjinKwon Could you check this again please?
Ahhh... there was a Scala change here. I checked this again and you are right, @ueshin. I will read carefully and make a clean build next time... I also updated #18444 (comment).
@zasdfgbnm, I am sorry for a false alarm.
Test build #79006 has finished for PR 18444 at commit
Test build #79010 has finished for PR 18444 at commit
Test build #79705 has finished for PR 18444 at commit
Okay. For me LGTM except for the comments.
```python
if obj.typecode in _array_type_mappings:
    return ArrayType(_array_type_mappings[obj.typecode](), False)
else:
    raise TypeError("not supported type: array(%s)" % obj.typecode)
```
How about falling back to type inference in this case?
Is there any reason to do so? I don't think an array with an unsupported typecode will be correctly serialized or deserialized (if it would be, why not add it to the supported list?). In this case, it is better to raise a TypeError and let the user pick another type.
Okay. To double check, does it cover all the cases we supported before in Spark? If we can be sure of that, I am fine with it as is.
I was trying to leave a sign-off for this reason - it fixes a case that was never reachable before (the 'c' type in a specific Python version) and a bug (assigning the correct type for `array.array`), and all other cases work as before.
If we fall back to `_infer_type` here, then `array('q')` will be inferred as something like an array of Long, and then the user will see an error from net.razorvine.pickle saying `bad array typecode q`, due to SPARK-21420, which seems to me more confusing than failing explicitly here.
Yes, this is worth a double check. Let me do some tests on old Spark.
How about doing the following?
- Open a new JIRA, explaining the issue about 'L' support.
- Make the changes to tests as you suggest, and make sure we clearly document in comments why 'L' is an exception for Python 2, referencing the new JIRA issue (and say that this exception should be removed and better support for large integers should be implemented).
- Finish this PR
Sounds good to me.
OK. Let me do this. If it is possible, I would like to finish this PR before I travel.
Sounds good to me, too. Let's discuss it in the next PR.
```diff
     'L' -> 11, 'l' -> 13, 'f' -> 15, 'd' -> 17, 'u' -> 21
   )
 } else {
-  Map('c' -> 1, 'B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8,
+  Map('B' -> 0, 'b' -> 1, 'H' -> 2, 'h' -> 4, 'I' -> 6, 'i' -> 8,
```
Last question to double check: so 'c' never worked in a specific Python version, but this PR fixes it?
`c` was supported by Spark 2.2.0:

```python
>>> row = Row(myarray=array('c', ["a"]))
>>> df = spark.createDataFrame([row])
>>> df.first()["myarray"][0]
u'a'
```

I think this support existed because array was inferred the same way as list. But after we changed the type inference of array, we had to change this accordingly to bring it back to work.
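For context, here is a sketch (assumed, not the PR's exact code) of how string-like typecodes can be version-gated so that 'c' keeps working on Python 2 while 'u' works everywhere:

```python
import sys
from pyspark.sql.types import StringType

_array_type_mappings = {}  # illustrative; the PR also fills this with numeric types

# 'u' (unicode character) exists in both Python 2 and 3;
# 'c' (char) exists only in Python 2, where it behaves like a one-byte string.
_array_type_mappings['u'] = StringType
if sys.version_info[0] < 3:
    _array_type_mappings['c'] = StringType
```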
Ah, yea. I meant to say that `'c' -> 1` was never reachable, for sure.
Yes
By the way, I'm traveling tomorrow and will be back next Tuesday. While traveling, I may not be able to answer comments, questions, etc.
Sure, take your time @zasdfgbnm. Thanks for your quick response.
LGTM if we have a proper conclusion in #18444 (comment).
@HyukjinKwon Take a look at my newest commit. I think I found a better way to solve the problem that keeps all the hacking code for
Okay. LGTM.
LGTM, too. Pending Jenkins.
Test build #79736 has finished for PR 18444 at commit
Looks to be an unrelated error. Sent from my OnePlus ONEPLUS A3000 using FastHub
Sounds actually related:
Yea, it's related to typecode
Btw, now I'm wondering whether Python 3 can handle an array like
I just tested this with Python 3.6 to help:

```python
from pyspark.sql import Row
import array

row = Row(myarray=array.array('L', [9223372036854775807]))
df = spark.createDataFrame([row])
df.show()
```

Master:

After this PR:

He tested other cases as well - https://github.com/apache/spark/pull/18444/files#r128131853.
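As a side note on why 'L' is awkward: on platforms where a C unsigned long is 64 bits, its maximum value exceeds what Spark's signed 64-bit LongType can hold. A small sketch (not from the PR) making the comparison explicit:

```python
import ctypes

# Maximum value an unsigned long can hold on this machine.
ulong_bits = ctypes.sizeof(ctypes.c_ulong) * 8
max_ulong = (1 << ulong_bits) - 1

# Maximum value of a signed 64-bit integer (Spark's LongType).
max_long = (1 << 63) - 1

print(max_ulong)             # e.g. 18446744073709551615 when unsigned long is 64 bits
print(max_long)              # 9223372036854775807
print(max_ulong > max_long)  # True on such platforms: 'L' cannot map safely to LongType
```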
python/pyspark/sql/types.py
Outdated
```python
# compute array typecode mappings for unsigned integer types
for _typecode in _array_unsigned_int_typecode_ctype_mappings.keys():
    # JVM does not have unsigned types, so use signed types that is at list 1
```
nit: typo `at list`.
@HyukjinKwon Oh, I missed his test for
Hmm, how was 'L' supported in master with Python 2? Why was there no such error there? Sent from my OnePlus ONEPLUS A3000 using FastHub
We just changed the dirty hack to something like:
Sent from my OnePlus ONEPLUS A3000 using FastHub
python/pyspark/sql/types.py
Outdated
```python
def _int_size_to_type(size):
    """
    Return the Scala type from the size of integers.
```
nit: Scala type -> Catalyst datatype.
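Putting the two hunks above together, here is a plausible self-contained sketch of what this part of types.py computes (names taken from the diff; the function body and the exact ctypes table are assumptions, with the nit applied):

```python
import ctypes
from pyspark.sql.types import ByteType, ShortType, IntegerType, LongType

def _int_size_to_type(size):
    """
    Return the Catalyst datatype from the size of integers.
    """
    if size <= 8:
        return ByteType
    if size <= 16:
        return ShortType
    if size <= 32:
        return IntegerType
    if size <= 64:
        return LongType
    # sizes above 64 bits have no Catalyst integral type; caller gets None

_array_type_mappings = {}

# Unsigned C types backing array typecodes; sizes are platform dependent.
_array_unsigned_int_typecode_ctype_mappings = {
    'B': ctypes.c_uint8,
    'H': ctypes.c_ushort,
    'I': ctypes.c_uint,
    'L': ctypes.c_ulong,
}

# compute array typecode mappings for unsigned integer types:
# the JVM has no unsigned types, so an unsigned N-bit value needs a signed
# type at least N+1 bits wide.
for _typecode, _ctype in _array_unsigned_int_typecode_ctype_mappings.items():
    dt = _int_size_to_type(ctypes.sizeof(_ctype) * 8 + 1)
    if dt is not None:
        _array_type_mappings[_typecode] = dt
```

On a 64-bit platform this leaves 'L' unmapped (a 65-bit signed type does not exist), which matches the discussion above about 'L' and large integers.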
Test build #79758 has finished for PR 18444 at commit
fixed
Test build #79774 has finished for PR 18444 at commit
Thanks! Merging to master.
What changes were proposed in this pull request?
This is a reopening of #14198, with merge conflicts resolved.
@ueshin Could you please take a look at my code?
Fix bugs about types that result in an array of null values when creating a DataFrame using Python.
Python's `array.array` has richer types than Python itself, e.g. we can have `array('f', [1, 2, 3])` and `array('d', [1, 2, 3])`. The code in spark-sql and pyspark didn't take this into consideration, which might cause a problem where you get an array of null values when you have `array('f')` in your rows.
A simple piece of code to reproduce this bug, whose output is an array of null values, is sketched below.
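The original repro snippet and its output are not quoted above; a minimal sketch of the kind of code that triggers the bug, based on the float-array test run earlier in this thread (assuming `spark` is an active SparkSession in the pyspark shell):

```python
from array import array
from pyspark.sql import Row

# Before this PR, the 'f' (C float) elements were inferred as DoubleType,
# so they failed to deserialize and showed up as an array of nulls.
df = spark.createDataFrame([Row(floatarray=array('f', [1.0, 2.0, 3.0]))])
df.show()
```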
How was this patch tested?
New test case added