[SPARK-21375][PYSPARK][SQL] Add Date and Timestamp support to ArrowConverters for toPandas() Conversion #18664
Closed: BryanCutler wants to merge 39 commits into apache:master from BryanCutler:arrow-date-timestamp-SPARK-21375
Commits
5aa8b9e added date type and started test, still some issue with time difference (BryanCutler)
20313f9 DateTimeUtils forces defaultTimeZone (BryanCutler)
69e1e21 fix style checks (BryanCutler)
dbfbef3 date type java tests passing (BryanCutler)
436afff timestamp type java tests passing (BryanCutler)
78119ca adding date and timestamp data to python tests, not passing (BryanCutler)
b709d78 TimestampType is correctly inferred as datetime64[ns] (BryanCutler)
399e527 Merge remote-tracking branch 'upstream/master' into arrow-date-timest… (BryanCutler)
e6d8590 Adding DateType and TimestampType to ArrowUtils conversions (BryanCutler)
719e77c using default timezone, fixed tests (BryanCutler)
3585520 fixed scala tests for timestamp (BryanCutler)
f977d0b Adding sync between Python and Java default timezones (BryanCutler)
b826445 Merge remote-tracking branch 'upstream/master' into arrow-date-timest… (BryanCutler)
3b83d7a added date timestamp writers, fixed tests (BryanCutler)
a6009a5 Modify ArrowUtils to have timeZoneId when convert schema to Arrow sch… (ueshin)
2ec98cc fixed python test tearDownClass (BryanCutler)
c29018c using Date.valueOf for tests instead (BryanCutler)
7dbdb1f Made timezone id required for TimestampType (BryanCutler)
c3f4e4d added test for TimestampType without specifying timezone id (BryanCutler)
ddbea24 added date and timestamp to ArrowWriter and tests (BryanCutler)
c6b597d removed unused import (BryanCutler)
874f104 Merge remote-tracking branch 'upstream/master' into arrow-date-timest… (BryanCutler)
d8bae0b added Python timezone converions for working with Pandas (BryanCutler)
36f58b1 Merge remote-tracking branch 'upstream/master' into arrow-date-timest… (BryanCutler)
c4fd5ae fix compilation (BryanCutler)
d1617fd fixed test comp (BryanCutler)
d7d9b47 add conversion to Python system local timezone before localize (BryanCutler)
efe3e27 timestamps with Arrow almost working for pandas_udfs (BryanCutler)
9894519 added workaround for Series to_pandas with timestamps, store os.envir… (BryanCutler)
a3ba4ac change use of xrange for py3 (BryanCutler)
7266304 remove check for valid timezone in vector for ArrowWriter (BryanCutler)
e428cbe added note for 'us' conversion (BryanCutler)
cade921 changed python api for is_datetime64 (BryanCutler)
f512deb remove Option for timezoneId (BryanCutler)
171d9e1 Merge remote-tracking branch 'upstream/master' into arrow-date-timest… (BryanCutler)
79bb93f added pandas_udf test for date (BryanCutler)
c555207 added workaround for date casting, put back check for timestamp conve… (BryanCutler)
4d40893 added fillna for null timestamp values (BryanCutler)
addd35f added check for pandas_udf return is a timestamp with tz, added comme… (BryanCutler)
After running some tests, this change does not significantly degrade performance, but there seems to be a small difference. cc @ueshin
I ran various columns of random data through a pandas_udf repeatedly, with and without this change. The test was in local mode with the default Spark conf, looking at the min wall clock time of 10 loops:
before change: 2.595558
after change: 2.681813
Do you think the difference here is acceptable for now, until Arrow is upgraded and we can look into it again?
pandas_udf_perf.py.txt
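For readers without the attachment, the measurement described above (minimum wall clock time over 10 loops, compared before and after the change) could be sketched roughly as follows. This is not the attached script: `min_wall_clock`, `relative_change`, and `workload` are hypothetical names, and the workload is a pure-Python stand-in for a Spark job that applies a pandas_udf to columns of random data.

```python
import random
import time


def min_wall_clock(fn, loops=10):
    """Run fn() `loops` times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(loops):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def relative_change(before, after):
    """Fractional change of `after` relative to `before`."""
    return (after - before) / before


# Hypothetical stand-in workload (assumption): the real test would run a
# Spark job applying a pandas_udf to random data instead of this loop.
def workload():
    data = [random.random() for _ in range(10000)]
    return sum(x + 1.0 for x in data)


elapsed = min_wall_clock(workload, loops=10)
print("min wall clock over 10 loops: %.6f" % elapsed)

# Timings quoted in the comment above.
slowdown = relative_change(2.595558, 2.681813)
print("relative slowdown: %.1f%%" % (slowdown * 100))
```

Taking the minimum rather than the mean reduces the influence of one-off stalls such as JVM warm-up or garbage collection. With the two timings quoted above, the relative difference works out to roughly 3.3%.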
I ran your script locally, too.
I think it's okay to use this workaround.