[SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null when creating dataframe using python #14198
Conversation
Python's `array` module has more types than Python itself; for example, Python only has `float` while `array` supports `'f'` (float) and `'d'` (double). Switching to `array.typecode` helps Spark make a better inference. For example, for the code:

```
from pyspark.sql.types import _infer_type
from array import array
a = array('f', [1, 2, 3, 4, 5, 6])
_infer_type(a)
```

we got `ArrayType(DoubleType,true)` before the change, but `ArrayType(FloatType,true)` after it.
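As a minimal sketch of the idea, the typecode carries type information that the Python values themselves lose. The mapping below is hypothetical (the real mapping lives in `pyspark.sql.types` and differs in detail); it only illustrates inferring the element type from `array.typecode` rather than from the Python type of the elements:

```python
from array import array

# 'f' is a C float, 'd' a C double; as Python objects both read back as float,
# so the typecode is the only place the distinction survives.
a_f = array('f', [1, 2, 3, 4, 5, 6])
a_d = array('d', [1, 2, 3, 4, 5, 6])
assert a_f.itemsize < a_d.itemsize  # 'f' elements are narrower than 'd'

# Hypothetical typecode-to-Spark-type mapping in the spirit of the fix:
TYPECODE_TO_SPARK = {
    'b': 'ByteType', 'h': 'ShortType', 'i': 'IntegerType', 'l': 'LongType',
    'f': 'FloatType', 'd': 'DoubleType',
}

def infer_array_element_type(arr):
    """Infer the element type from the array's typecode, not from the
    Python type of its elements (which would collapse 'f' and 'd')."""
    return TYPECODE_TO_SPARK[arr.typecode]

print(infer_array_element_type(a_f))  # FloatType
print(infer_array_element_type(a_d))  # DoubleType
```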
Oh interesting - thanks for working on this @zasdfgbnm, and sorry it's sort of fallen through the cracks. Is this something you are still working on? For PRs to get in you generally need some form of automated tests; let me know if you would like some help adding tests for this issue.
I'd love to help @holdenk
Something to mention: there is still one problem that I'm not sure I solved correctly. Python's array supports unsigned types, but unsigned types are not supported in the JVM. The solution in this PR is to convert unsigned types to a larger type, e.g. unsigned int -> long. I'm not sure whether it would be better to reject the unsigned types in Python and throw an exception.
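The widening described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the mapping names and the `widen_unsigned` helper are invented for the example, and the fallback for typecodes that cannot fit a signed 64-bit long mirrors the "throw an exception" alternative being discussed:

```python
from array import array

# JVM primitives are all signed, so each unsigned Python typecode maps to
# the next larger signed Spark type that can hold its full value range.
UNSIGNED_TO_SIGNED_WIDER = {
    'B': 'ShortType',    # unsigned char  -> 2-byte signed short
    'H': 'IntegerType',  # unsigned short -> 4-byte signed int
    'I': 'LongType',     # unsigned int   -> 8-byte signed long
    # 'L' and 'Q' can exceed a signed 64-bit long; this sketch rejects them.
}

def widen_unsigned(typecode):
    try:
        return UNSIGNED_TO_SIGNED_WIDER[typecode]
    except KeyError:
        raise TypeError("unsigned typecode %r cannot fit a JVM primitive" % typecode)

print(widen_unsigned('I'))  # LongType
```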
Hi @holdenk, I think I'm done. I created a test for this issue, and I found from the test that Spark has the same issue not only for float but also for byte and short. After several commits, to be clear, I need to say that only arrays with typecode
Would you, or any other developer, review my code and get it merged?
cc @ueshin
@zasdfgbnm Are you still working on this? If so, could you rebase or merge master to fix conflicts please?
We are closing it due to inactivity. Please do reopen if you want to push it forward. Thanks!
@ueshin @gatorsmile I'm happy to resolve the conflicts IF AND ONLY IF a developer will work on the code review. This PR was opened more than a year ago, and I have been waiting for review for a year. If it is guaranteed that a reviewer will be assigned soon, I will resolve the conflicts. Otherwise, I don't want to maintain a PR forever just waiting for review.
@zasdfgbnm I think you can ping @ueshin to review.
@zasdfgbnm Please reopen the PR and @ueshin can help review your PR. Thanks!
reopened at #18444
retest this please
… of null when creating DataFrame using python

## What changes were proposed in this pull request?

This is the reopen of #14198, with merge conflicts resolved.

ueshin Could you please take a look at my code?

Fix bugs about types that result in an array of null when creating a DataFrame using python.

Python's `array.array` has richer types than Python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. Code in spark-sql and pyspark didn't take this into consideration, which might cause a problem that you get an array of null values when you have `array('f')` in your rows. A simple piece of code to reproduce this bug is:

```
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row, DataFrame
from array import array

sc = SparkContext()
sqlContext = SQLContext(sc)

row1 = Row(floatarray=array('f', [1, 2, 3]), doublearray=array('d', [1, 2, 3]))
rows = sc.parallelize([row1])
df = sqlContext.createDataFrame(rows)
df.show()
```

which has the output

```
+---------------+------------------+
|    doublearray|        floatarray|
+---------------+------------------+
|[1.0, 2.0, 3.0]|[null, null, null]|
+---------------+------------------+
```

## How was this patch tested?

New test case added.

Author: Xiang Gao <qasdfgtyuiop@gmail.com>
Author: Gao, Xiang <qasdfgtyuiop@gmail.com>
Author: Takuya UESHIN <ueshin@databricks.com>

Closes #18444 from zasdfgbnm/fix_array_infer.

What changes were proposed in this pull request?

Fix bugs about types that result in an array of null when creating a DataFrame using python.

Python's `array.array` has richer types than Python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. Code in spark-sql and pyspark didn't take this into consideration, which might cause a problem that you get an array of null values when you have `array('f')` in your rows.

How was this patch tested?

Tested manually.