
BUG: Fix concat series loss of timezone #24027

Merged (25 commits) on Dec 5, 2018
Conversation

evangelinehl
Contributor

@evangelinehl commented Nov 30, 2018

Closes #23816
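For context, the linked issue is about `pd.concat` dropping the timezone when a tz-aware datetime Series is concatenated with a categorical Series. A minimal reproducer (a sketch with assumed values, not the issue's exact example):

```python
import pandas as pd

# tz-aware datetimes concatenated with a categorical Series
a = pd.Series(pd.date_range("2017-01-01", periods=2, tz="US/Eastern"))
b = pd.Series(["a", "b"], dtype="category")

result = pd.concat([a, b], ignore_index=True)
# on affected versions the timezone was dropped; with the fix the
# result is object dtype and each datetime keeps its tz
```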

@pep8speaks

pep8speaks commented Nov 30, 2018

Hello @evangelineliu! Thanks for updating the PR.

Comment last updated on December 05, 2018 at 22:33 UTC

@codecov

codecov bot commented Dec 1, 2018

Codecov Report

Merging #24027 into master will increase coverage by <.01%.
The diff coverage is 0%.


@@            Coverage Diff             @@
##           master   #24027      +/-   ##
==========================================
+ Coverage   42.46%   42.46%   +<.01%     
==========================================
  Files         161      161              
  Lines       51557    51556       -1     
==========================================
  Hits        21892    21892              
+ Misses      29665    29664       -1
Flag Coverage Δ
#single 42.46% <0%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/dtypes/concat.py 56.25% <0%> (ø) ⬆️
pandas/core/reshape/tile.py 11.69% <0%> (+0.06%) ⬆️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b6b346...1804d7d. Read the comment docs.

@codecov

codecov bot commented Dec 1, 2018

Codecov Report

Merging #24027 into master will increase coverage by 49.68%.
The diff coverage is 100%.


@@             Coverage Diff             @@
##           master   #24027       +/-   ##
===========================================
+ Coverage   42.52%    92.2%   +49.68%     
===========================================
  Files         161      162        +1     
  Lines       51697    51727       +30     
===========================================
+ Hits        21982    47695    +25713     
+ Misses      29715     4032    -25683
Flag Coverage Δ
#multiple 90.6% <100%> (?)
#single 43.02% <0%> (+0.5%) ⬆️
Impacted Files Coverage Δ
pandas/core/dtypes/concat.py 96.6% <100%> (+40.35%) ⬆️
pandas/core/internals/construction.py 96.64% <0%> (ø)
pandas/core/computation/pytables.py 92.37% <0%> (+0.3%) ⬆️
pandas/io/pytables.py 92.3% <0%> (+0.92%) ⬆️
pandas/util/_test_decorators.py 93.24% <0%> (+4.05%) ⬆️
pandas/compat/__init__.py 58.3% <0%> (+8.1%) ⬆️
pandas/core/config_init.py 99.24% <0%> (+9.84%) ⬆️
pandas/core/reshape/util.py 100% <0%> (+11.53%) ⬆️
pandas/compat/numpy/__init__.py 92.85% <0%> (+14.28%) ⬆️
pandas/core/computation/common.py 85.71% <0%> (+14.28%) ⬆️
... and 120 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1d3ed91...1867b3a. Read the comment docs.

@TomAugspurger
Contributor

Can you add “closes https://github.com/pandas-dev/pandas/issues/23816” to the original post?

We also need a test (the one from the issue is fine) and a release note.

@@ -193,7 +193,8 @@ def _concat_categorical(to_concat, axis=0):

 def _concat_asobject(to_concat):
     to_concat = [x.get_values() if is_categorical_dtype(x.dtype)
-                 else np.asarray(x).ravel() for x in to_concat]
+                 else np.asarray(x).ravel() if not is_datetime64tz_dtype(x)
Contributor

this is not the way
_concat_compat already handles this

Contributor

I didn’t realize we do call _concat_compat from here... I think we’ve already lost the tz by the time we call it.

Contributor Author

So do these if checks not even need to be there? Not sure what this means for the change we had here


@jreback after closer examination, I'm not sure if _concat_compat actually already handles this. If the type is also a category, it simply calls _concat_categorical which then summons _concat_asobject. I think that therefore the change to the code should happen inside _concat_categorical or _concat_asobject.


@jreback calling _concat_compat before the list is modified like this causes _concat_compat to just call _concat_categorical again and back and forth until they reach the maximum recursion depth. We have to modify the array somehow before that happens.

Contributor

I guess I'm not sure why we aren't doing

to_concat = [np.asarray(x.astype(object)) for x in to_concat]

The function name is _concat_asobject, so let's convert to object and concat them. Can you see if that works?

Please write the test first though, and ensure that the test fails with master. Then implement the fix and make sure it works.
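The suggestion can be checked in isolation: converting a tz-aware Series to object dtype first boxes each value as a tz-aware `Timestamp`, so the timezone survives the trip through NumPy (a sketch, not the PR's final code):

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.date_range("2018-01-01", periods=2, tz="US/Eastern"))

# astype(object) boxes each value as a tz-aware Timestamp, so
# np.asarray no longer coerces the values to naive datetime64
arr = np.asarray(s.astype(object))
```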

Contributor Author

@evangelinehl Dec 3, 2018

I believe I tried this previously but the build fails (I think that was the dimensions bug we had before) @TomAugspurger

Contributor

Seems like we have some inconsistencies in concat that's breaking this approach. IIRC there's an outstanding PR fixing some of these up.

I was going by the loose rule of "different dtypes means the result type is object". But things like

In [1]: import pandas as pd; import numpy as np; import pickle

In [2]: a = pd.Series([1, 2], dtype='category')

In [3]: b = pd.Series([1, None], dtype='category')

In [4]: pd.concat([a, b], ignore_index=True)
Out[4]:
0    1.0
1    2.0
2    1.0
3    NaN
dtype: float64

mess that up. Not sure what's best here...

@TomAugspurger
Contributor

TomAugspurger commented Dec 1, 2018 via email

@evangelinehl
Contributor Author

evangelinehl commented Dec 1, 2018

Hi, so should we be calling from concat_compat? Why is calling from concat_compat better? Does the current way of fixing the issue conflict with something else in the codebase? @jreback

@jakezimmer

Just wanted to test something here, I'm decently sure that these tests will fail.

@gfyoung added the Bug, Datetime, and Reshaping labels Dec 2, 2018
Contributor

@jreback left a comment

this needs a test first thing

@evangelinehl
Contributor Author

Sorry, still trying to figure out why the dimensions would be different for the arrays passed to np.concatenate. In your experience why does this usually happen?

@jreback
Contributor

jreback commented Dec 2, 2018

Sorry, still trying to figure out why the dimensions would be different for the arrays passed to np.concatenate. In your experience why does this usually happen?

datetimes with timezones are backed by a DatetimeIndex, which is a 1-D object. Naive datetimes are stored as 2-D blocks internally.
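In other words (a sketch of the 0.23-era internals described above; the exact block layout is an implementation detail that has since been reworked):

```python
import pandas as pd

naive = pd.Series(pd.date_range("2018-01-01", periods=3))
aware = pd.Series(pd.date_range("2018-01-01", periods=3, tz="UTC"))

# naive datetimes use the plain datetime64[ns] dtype and were stored
# as 2-D blocks; tz-aware ones use an extension dtype backed by a
# 1-D DatetimeIndex-like structure, hence the dimension mismatch
# when both are handed to np.concatenate
print(naive.dtype)   # datetime64[ns]
print(aware.dtype)   # datetime64[ns, UTC]
```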

@evangelinehl
Contributor Author

Taking a look at concat_compat, I agree with @jakezimmer that I'm not super sure if it actually takes care of the if checks that we tried to delete.

@evangelinehl
Contributor Author

Is this a test for the new output? Because I tried what you said earlier already but it didn't build

@TomAugspurger
Contributor

That's the expected output from whatever ends up fixing the original issue.

@evangelinehl
Contributor Author

Should we add in the test case? I'm a little bit confused about what's happening exactly with the tests

@TomAugspurger
Contributor

Yes. That test is asserting the correct behavior when concatenating a Categorical and timezone-aware series. The expected output is an object-dtype series.
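A test along those lines might look like this (a sketch; names and values are assumed, not the PR's actual test):

```python
import pandas as pd
import pandas.testing as tm

a = pd.Series(pd.date_range("2017-01-01", periods=2, tz="US/Eastern"))
b = pd.Series(["a", "b"], dtype="category")

result = pd.concat([a, b], ignore_index=True)

# falling back to object dtype keeps each Timestamp's timezone
expected = pd.Series(list(a) + ["a", "b"], dtype=object)
tm.assert_series_equal(result, expected)
```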

@jreback
Contributor

jreback commented Dec 4, 2018

@evangelineliu pushed a small cleanup. pls add a whatsnew note (reshaping bug fixes). ping on green.

@jreback added this to the 0.24.0 milestone Dec 4, 2018
@jakezimmer

jakezimmer commented Dec 4, 2018

Hi @jreback, I noticed that the fixup commit broke some linting, as you can see here. Do you know what could be causing this?

Also, I added the whatsnew change. This is my first time doing this, so would you be able to take a look at the proposed change and see if it is acceptable?

@TomAugspurger
Contributor

TomAugspurger commented Dec 4, 2018 via email

attempting to rerun the tests
@jakezimmer

@jreback all tests passed!

Contributor

@jreback left a comment

ping on green.

@@ -1545,6 +1545,7 @@ Reshaping
- Bug in :meth:`DataFrame.append` with a :class:`Series` with a dateutil timezone would raise a ``TypeError`` (:issue:`23682`)
- Bug in ``Series`` construction when passing no data and ``dtype=str`` (:issue:`22477`)
- Bug in :func:`cut` with ``bins`` as an overlapping ``IntervalIndex`` where multiple bins were returned per item instead of raising a ``ValueError`` (:issue:`23980`)
- Bug in :func:`pandas.concat` when joining series datetimetz with series category would lose timezone (:issue:`23816`)
Contributor

Bug in :func:`pandas.concat` when joining a datetime w/tz aware Series with a categorical dtyped Series, lose timezone (:issue:`23816`)

use double backticks on Series

@jreback
Contributor

jreback commented Dec 5, 2018

also pls merge master

@jreback
Contributor

jreback commented Dec 5, 2018

thanks ping on green.

@jakezimmer

@jreback do you know what would cause the pandas Azure Pipelines job to fail like this on the "before install" task? It looks like it couldn't download Anaconda. The only change was removing that whitespace, and this has never happened in previous commits. Is there a way to retry that job?

You can see the error here under Linux py36_locale_slow:
https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=4780

@jreback merged commit b841374 into pandas-dev:master Dec 5, 2018
@jreback
Contributor

jreback commented Dec 5, 2018

thanks @jakezimmer

that pipeline failed on some network timeout.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Dec 6, 2018
commit 28c61d770f6dfca6857fd0fa6979d4119a31129e
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 12:18:19 2018 -0600

    uncomment

commit bae2e322523efc73a1344464f51611e2dc555ccb
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 12:17:09 2018 -0600

    maybe fixes

commit 6cb4db05c9d6ceba3794096f0172cae5ed5f6019
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 09:57:37 2018 -0600

    we back

commit d97ab57fb32cb23371169d9ed659ccfac34cfe45
Merge: a117de4 b78aa8d
Author: Tom Augspurger <tom.w.augspurger@gmail.com>
Date:   Thu Dec 6 09:51:51 2018 -0600

    Merge remote-tracking branch 'upstream/master' into disown-tz-only-rebased2

commit b78aa8d
Author: gfyoung <gfyoung17+GitHub@gmail.com>
Date:   Thu Dec 6 07:18:44 2018 -0500

    REF/TST: Add pytest idiom to reshape/test_tile (pandas-dev#24107)

commit 2993b8e
Author: gfyoung <gfyoung17+GitHub@gmail.com>
Date:   Thu Dec 6 07:17:55 2018 -0500

    REF/TST: Add more pytest idiom to scalar/test_nat (pandas-dev#24120)

commit b841374
Author: evangelineliu <hsiyinliu@gmail.com>
Date:   Wed Dec 5 18:21:46 2018 -0500

    BUG: Fix concat series loss of timezone (pandas-dev#24027)

commit 4ae63aa
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 5 14:44:50 2018 -0800

    Implement DatetimeArray._from_sequence (pandas-dev#24074)

commit 2643721
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 5 14:43:45 2018 -0800

    CLN: Follow-up to pandas-dev#24100 (pandas-dev#24116)

commit 8ea7744
Author: chris-b1 <cbartak@gmail.com>
Date:   Wed Dec 5 14:21:23 2018 -0600

    PERF: ascii c string functions (pandas-dev#23981)

commit cb862e4
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 5 12:19:46 2018 -0800

    BUG: fix mutation of DTI backing Series/DataFrame (pandas-dev#24096)

commit aead29b
Author: topper-123 <contribute@tensortable.com>
Date:   Wed Dec 5 19:06:00 2018 +0000

    API: rename MultiIndex.labels to MultiIndex.codes (pandas-dev#23752)
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019