ENH/BUG: Fix names, levels and labels handling in MultiIndex #4039
    values = list(values)
    if len(values) != self.nlevels:
        raise ValueError(('Length of names (%d) must be same as level '
                          '(%d)') % (len(values),self.nlevels))
complete bikeshedding, but no need for double parens here, as long as there's one set of parens python knows what 2 do :)
heh, I just moved it straight from `__new__`, certainly worth it to change. :)
as an aside, this leads to weird things like

    idx.names = list("abcdef")
    idx.droplevel("d")  # Gives an index out of range-esque error

yep i used to get that and i hacked around by recreating frames/series and some other trickery that will hopefully never see the light of day :)
@cpcloud okay, I think I'm understanding more of the problem here: when you slice an index, the levels remain the same...e.g.:

    chunklet = idx[-3:]
    assert chunklet.levels[0] is idx.levels[0]  # True

So, when you assign names, it mutates the underlying levels of both. This seems to follow the convention in Moreover, if you pass levels to the

    new_idx = MultiIndex(idx.levels, idx.labels)
    assert new_idx.levels[0] is idx.levels[0]  # True

so what ought to be happening here? Should names be assigned to underlying levels or just left alone?
    index.names = ["a", "b"]
    ind_names = list(index.names)
    level_names = [level.name for level in index.levels]
    self.assertListEqual(ind_names, level_names)
fyi you can't use this because it was introduced in py27
thanks for the heads up ... didn't realize (and definitely not worth porting). Second change to make.
On Wed, Jun 26, 2013 at 10:01 PM, Phillip Cloud notifications@github.com wrote:

In pandas/tests/test_index.py:

    # initializing with bad names (should always be equivalent)
    major_axis, minor_axis = self.index.levels
    major_labels, minor_labels = self.index.labels
    assertRaisesRegexp(ValueError, "^Length of names", MultiIndex, levels=[major_axis, minor_axis],
                       labels=[major_labels, minor_labels],
                       names=['first'])
    assertRaisesRegexp(ValueError, "^Length of names", MultiIndex, levels=[major_axis, minor_axis],
                       labels=[major_labels, minor_labels],
                       names=['first', 'second', 'third'])

    # names are assigned
    index.names = ["a", "b"]
    ind_names = list(index.names)
    level_names = [level.name for level in index.levels]
    self.assertListEqual(ind_names, level_names)

fyi you can't use this because it was introduced in py27 http://docs.python.org/2/library/unittest.html#unittest.TestCase.assertListEqual
In python >= 2.7 `assertEqual` will dispatch to e.g., `assertListEqual` if two lists are passed, so that's nice.
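To make the dispatch concrete, here is a minimal sketch using only the standard library (the test class, the names list and the `run_suite` helper are made up for illustration, not taken from the PR). On Python >= 2.7, calling `assertEqual` with two lists should produce the list-aware failure message, so `assertListEqual` never has to be called directly:

```python
import unittest


class NamesTest(unittest.TestCase):
    """Illustrative test case; not part of the pandas test suite."""

    def test_names_match(self):
        ind_names = ["a", "b"]
        level_names = ["a", "b"]
        # With two lists, assertEqual picks the list-specific comparison.
        self.assertEqual(ind_names, level_names)

    def test_mismatch_message(self):
        with self.assertRaises(AssertionError) as ctx:
            self.assertEqual(["a", "b"], ["a", "c"])
        # The message format comes from the list-aware helper.
        self.assertIn("Lists differ", str(ctx.exception))


def run_suite():
    """Run the illustrative tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NamesTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

So plain `assertEqual` keeps the test portable to py26 while still giving the nicer diff on newer versions.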
@cpcloud made those two changes and rebased.
@wesm any reason why
nope. i just never added any validation and left names as a simple attribute.
+1 for a validator here...@jtratner what breaks?
this is basically same issue as #3742
indeed...close this one, that one?
def should be some kind of validator on setting of
can close that one (maybe move example to here though) as another test case
Example from #3742 cc @thriveth

I have raised the issue in this question on Stack Overflow, but I'm not sure it ever made it to the Pandas issue tracker. I have a MultiIndex'ed DataFrame which I want to expand by using

    lev1 = ['hans', 'hans', 'hans', 'grethe', 'grethe', 'grethe']
    lev2 = ['1', '2', '3'] * 2
    idx = pd.MultiIndex.from_arrays(
        [lev1, lev2],
        names=['Name', 'Number'])
    df = pd.DataFrame(
        np.random.randn(6, 4),
        columns=['one', 'two', 'three', 'four'],
        index=idx)
    df = df.sortlevel()
    df

This shows a neat and nice object, just as I expected, with proper naming of the index columns. If I now run:

    df.set_value(('grethe', '3'), 'one', 99.34)

the result is also as expected. But if I run:

    df.set_value(('grethe', '4'), 'one', 99.34)

The column names of the index are gone, and the
also #3714 same issue too, except assigning to levels, needs validation as well

This is what I was trying to get across earlier :) If you pass levels through the MultiIndex constructor, they have their names set to the

    if names is None:
        # !!!This is why names get reset to None
        subarr.names = [None] * subarr.nlevels
    else:
        if len(names) != subarr.nlevels:
            raise AssertionError(('Length of names (%d) must be same as level '
                                  '(%d)') % (len(names),subarr.nlevels))
        subarr.names = list(names)

    # THIS IS WHERE NAMES GET OVERWRITTEN WITHOUT BEING COPIED
    # set the name
    for i, name in enumerate(subarr.names):
        subarr.levels[i].name = name

An easy solution would be for the MultiIndex to copy the levels it receives first and then rename them. Then, any time you set the names attribute, it would set the name on every level and the

    levels = [_ensure_index(lev) for lev in levels]

Because later on it just assigns it to the object:

    subarr.levels = levels

I think levels should be a cached_readonly property, so that you don't end up creating indices twice (once in the If this all makes sense to you, I can write it up into a PR soon.
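The "copy the levels first, then rename" idea can be sketched without pandas. Everything below (`Level`, `SharedLevelsIndex`, `CopyingIndex`) is a hypothetical stand-in invented for illustration, not the pandas implementation; it only shows why a shallow copy is enough to stop names leaking back to the caller:

```python
import copy


class Level:
    """Hypothetical stand-in for one level of a MultiIndex."""

    def __init__(self, values, name=None):
        self.values = tuple(values)  # the data itself is treated as immutable
        self.name = name


class SharedLevelsIndex:
    """Stores the levels it receives as-is, so renaming leaks to other holders."""

    def __init__(self, levels, names):
        self.levels = list(levels)
        for level, name in zip(self.levels, names):
            level.name = name


class CopyingIndex:
    """Shallow-copies each level first, then renames only its own copies."""

    def __init__(self, levels, names):
        self.levels = [copy.copy(level) for level in levels]
        for level, name in zip(self.levels, names):
            level.name = name


letters = Level(["a", "b"], name="letters")
SharedLevelsIndex([letters], ["renamed"])
leaked_name = letters.name  # the caller's level was renamed behind its back

numbers = Level([1, 2], name="numbers")
copied = CopyingIndex([numbers], ["renamed"])
kept_name = numbers.name  # the caller's level keeps its original name
shares_values = copied.levels[0].values is numbers.values  # data is not duplicated
```

The shallow copy duplicates only the metadata holder; the underlying values are still shared, which is the trade-off discussed later in the thread.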
well they are immutable on the values

@cpcloud I learned something today: Python cheats and compares lists first by object equality, then checks individual items...I'm wondering if this error is lurking other places in the code:

    >>> arr1, arr2 = np.array(range(10)), np.array(range(10))
    >>> assert [arr1, arr2] == [arr1, arr2] # succeeds
    >>> assert [arr1, arr2] == [arr1, arr2.copy()]
    Traceback (most recent call last):
    File "<console>", line 1, in <module>
    ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
    >>> assert_almost_equal([arr1, arr2], [arr1.copy(), arr2])
    True

I found this in

    def test_copy(self):
        i_copy = self.index.copy()

        # Equal...but not the same object
        self.assert_(i_copy.levels == self.index.levels)
        self.assert_(i_copy.levels is not self.index.levels)
Actually that whole test is not great, because it's actually (was) testing that two different lists were created.
@jtratner that is good to know.
i always assumed that sequence equality was done recursively. docs seem to imply that that is the case..strange
@cpcloud yep, that's exactly right. (I'm sure that's implementation dependent, but it is important to know if using numpy arrays.)
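The identity shortcut can be reproduced without numpy. The sketch below assumes CPython behaviour; `Ambiguous` and `ElementwiseResult` are hypothetical classes that only mimic an array whose `==` returns something that refuses to become a single bool:

```python
class ElementwiseResult:
    """Result of an element-wise comparison; has no single truth value."""

    def __init__(self, flags):
        self.flags = list(flags)

    def __bool__(self):
        raise ValueError("truth value of a multi-element result is ambiguous")


class Ambiguous:
    """Hypothetical stand-in for an array-like object."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # Element-wise comparison, as an ndarray would do.
        return ElementwiseResult(a == b for a, b in zip(self.values, other.values))


first, second = Ambiguous(range(3)), Ambiguous(range(3))

# Same objects on both sides: the list comparison stops at the identity
# check, so Ambiguous.__eq__ is never called.
same_objects_equal = [first, second] == [first, second]

# A distinct (but equal-valued) object forces a real comparison, which raises.
try:
    [first, second] == [first, Ambiguous(range(3))]
    raised = False
except ValueError:
    raised = True

# The same shortcut is visible with NaN, which is not equal to itself.
nan = float("nan")
nan_equal = nan == nan
nan_lists_equal = [nan] == [nan]
```

This is why a check like `i_copy.levels == self.index.levels` can pass only because both lists hold the very same objects.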
maybe some sort of lightweight
Maybe you could use
@cpcloud maybe. Right now I just changed levels, names and labels to return tuples and used On that note - is it okay to make that change? Much easier to prevent erroneous assignment (like
@cpcloud well, I'm using shallow copies/views, which I think means that only metadata is copied. This is necessary anyways, because you want to be able to set names on the underlying levels without worrying about messing up other indexes.
(I just pushed what I have so far - it's failing because a ton of tests assume that index names, levels and labels will be lists...)
* `FrozenNDArray` - thin wrapper around ndarray that disallows setting methods (will be used for levels on `MultiIndex`)
* `FrozenList` - thin wrapper around list that disallows setting methods (needed because of type checks elsewhere)

Index inherits from FrozenNDArray now and also actually copies for deepcopy. Assumption is that underlying array is still immutable-ish
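As a rough illustration of the "thin wrapper around list" idea, here is a minimal sketch. It is an assumption about the general shape, not the class added in this PR: the set of blocked methods and the error message are chosen for the example only.

```python
class FrozenList(list):
    """Illustrative sketch: a list subclass whose mutating methods raise."""

    def _disabled(self, *args, **kwargs):
        raise TypeError("%s does not support mutable operations"
                        % type(self).__name__)

    # Block item assignment/deletion and in-place operators.
    __setitem__ = __delitem__ = __iadd__ = __imul__ = _disabled
    # Block the named mutating methods.
    append = extend = insert = pop = remove = sort = reverse = clear = _disabled


names = FrozenList(["first", "second"])

# Still a list for isinstance checks elsewhere, and still readable.
is_list = isinstance(names, list)
first_name = names[0]


def mutation_blocked(operation):
    """Return True if calling `operation` raises TypeError."""
    try:
        operation()
    except TypeError:
        return True
    return False
```

Subclassing `list` (rather than wrapping one) is what keeps existing `isinstance(x, list)` checks working, which matches the "needed because of type checks elsewhere" remark.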
@jreback tell me if you want docs on set_names, set_levels, or the new copy constructor.
I think an example for v0.13 is good
okay, there's an example in v0.13.0.txt and changed indexing.rst slightly to add the index names section (moved around the Index objects part so it could address MultiIndex too)
@jreback this is all working now + has the docs, etc.
@@ -82,6 +88,20 @@ pandas 0.13
- removed the ``warn`` argument from ``open``. Instead a ``PossibleDataLossError`` exception will
be raised if you try to use ``mode='w'`` with an OPEN file handle (:issue:`4367`)
- allow a passed locations array or mask as a ``where`` condition (:issue:`4467`)
- ``Index`` and ``MultiIndex`` changes (:issue:`4039`):
doesn't this need to be indented? (eg outer level should be same as existing and inner level indented?)
yep needs to be indented
That indentation level is because the previous entry is a sub-bullet of the HDFStore changes. (this matches up with the outer indentation level, which is 2 spaces.).
then can you change the ones below it cuz this one sticks out
What follows are sub-bullets of "`Index` and `MultiIndex` changes", just like how the HDFStore changes are grouped above it - I can change it if you want, I was matching the look of other elements that have multiple changes.
oh you're right! sorry about that. only thing is that to prevent sphinx from complaining you should put a newline between an outdented bullet point and the previously indented one
@cpcloud added the space.
* `names` is now a property *and* is set up as an immutable tuple.
* `levels` are always (shallow) copied now and it is deprecated to set directly
* `labels` are set up as a property now, moving all the processing of labels out of `__new__` + shallow-copied.
* `levels` and `labels` are immutable.
* Add names tests, motivating example from pandas-dev#3742, reflect tuple-ish output from names, and level names check to reindex test.
* Add set_levels, set_labels, set_names and rename to index
* Deprecate setting labels and levels directly

Similar to other set_* methods...allows mutation if necessary but otherwise returns same object. Labels are now converted to `FrozenNDArray` and wrapped in a `FrozenList`. Should mostly resolve pandas-dev#3714 because you have to work to actually make assignments to an `Index`.

BUG: Give MultiIndex its own astype method

Fixes issue with set_value forgetting names.
* Index derivatives can set `name` or `names` as well as `dtype` on copy. MultiIndex can set `levels`, `labels`, and `names`.
* Also, `__deepcopy__` just calls `copy(deep=True)`
* Now, BlockManager.copy() takes an additional argument `copy_axes` which copies axes as well. Defaults to False.
* `Series.copy()` takes an optional deep argument, which causes it to copy its index.
* `DataFrame.copy()` passes `copy_axes=True` when deepcopying.
* Add copy kwarg to MultiIndex `__new__`
ok bombs away
ENH/BUG: Fix names, levels and labels handling in MultiIndex
@jreback actually, I was flip-flopping on this for a while. I tried copying axes in a few places, and I kept getting issues with the check that
@jreback After you brought this up - I'm thinking that you could copy index first, then pass it as a parameter to the copy on blocks to overwrite items and ref_items. Maybe that would work? What's the difference between ref_items and items?
@jtratner So the only thing you can do is to copy it BEFORE you start and then use that one (and yes, could be done in
This PR covers:

Fixes: #4202, #3714, #3742 (there might be some others, but I've blanked on them...)

Bug fixes:

* `MultiIndex` preserves names as much as possible and it's now harder to overwrite index metadata by making changes down the line.
* `set_values` no longer messes up names.

External API Changes:

* `FrozenList` and `FrozenNDArray`)
* `MultiIndex` now shallow copies levels and labels before storing them.
* `astype` method to `MultiIndex` to resolve issue with `set_values` in `NDFrame`
* `set_names`, `set_labels`, and `set_levels` methods allow setting of these attributes and take an `inplace=True` keyword argument to mutate in place.
* `Index` has a `rename` method that works similarly to the `set_*` methods.
* `Index` methods to be more descriptive / more specific (e.g., replacing `Exception` with `ValueError`, etc.)
* `Index.copy()` now accepts keyword arguments (`name=`, `names=`, `levels=`, `labels=`,) which return a new copy with those attributes set. It also accepts `deep`, which is there for compatibility with other `copy()` methods, but doesn't actually change what copy does (though, for MultiIndex, it makes the copy operation slower)

Internal changes:

* `MultiIndex` now uses `_set_levels`, `_get_levels`, `_set_labels`, `_get_labels` internally to handle labels and levels (and uses that directly in `__array_finalize__` and `__setstate__`, etc.)
* `MultiIndex.copy(deep=True)` will deepcopy levels, labels, and names.
* `Index` objects handle names with `_set_names` and `_get_names`.
* `Index` now inherits from `FrozenNDArray` which (mostly) blocks mutable methods (except for `view()` and `reshape()`)
* `Index` now actually copies ndarrays when copy=True is passed to constructor and dtype=None
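For readers unfamiliar with the `set_*` / `inplace` pattern described above, the following is a simplified sketch of how a validated `set_names` could behave. `NamedIndex` is a hypothetical class written for this illustration; the return value for `inplace=True` (here `None`) and the tuple storage are assumptions, not a statement of what this PR does:

```python
class NamedIndex:
    """Hypothetical, simplified index that only tracks values and names."""

    def __init__(self, values, names):
        self._values = tuple(values)
        self._names = tuple(names)

    @property
    def names(self):
        # Exposed as an immutable tuple so callers cannot mutate it in place.
        return self._names

    @names.setter
    def names(self, values):
        # Direct assignment goes through the same validation as set_names.
        self.set_names(values, inplace=True)

    def set_names(self, names, inplace=False):
        names = tuple(names)
        if len(names) != len(self._names):
            raise ValueError("Length of names (%d) must be same as level (%d)"
                             % (len(names), len(self._names)))
        if inplace:
            self._names = names
            return None  # assumption: in-place mutation returns nothing
        # Otherwise leave this object alone and hand back a renamed copy.
        return NamedIndex(self._values, names)


idx = NamedIndex([("hans", "1"), ("grethe", "2")], ["Name", "Number"])
renamed = idx.set_names(["person", "n"])
```

Here `idx` keeps `("Name", "Number")` while `renamed` carries the new names; passing a list of the wrong length raises `ValueError` whether it arrives through `set_names` or through plain attribute assignment.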