Fix parse empty df #14596
@@ -57,5 +57,6 @@ Bug Fixes
- Bug in ``DataFrame.to_json`` where ``lines=True`` and a value contained a ``}`` character (:issue:`14391`)
- Bug in ``df.groupby`` causing an ``AttributeError`` when grouping a single index frame by a column and the index level (:issue`14327`)
- Bug in ``df.groupby`` where ``TypeError`` raised when ``pd.Grouper(key=...)`` is passed in a list (:issue:`14334`)
- Bug in ``pd.read_csv`` where reading files fails if the number of headers is equal to the number of lines in the file (:issue:`14515`)
0.20
"""
df = self.read_csv(StringIO(data), header=[0])
expected = DataFrame(columns=[('a'), ('b')])
tm.assert_frame_equal(df, expected)
use the same styling as above
e.g. expected
@@ -714,7 +714,9 @@ cdef class TextReader:
start = self.parser.line_start[0]

# e.g., if header=3 and file only has 2 lines
elif self.parser.lines < hr + 1:
if (self.parser.lines < hr + 1
and not isinstance(self.orig_header, list)) or (
This is really odd. What are you trying to do?
Counteract the extension to the header added in: https://github.com/pandas-dev/pandas/blob/master/pandas/parser.pyx#L519 The issue is that if the header is passed in as a list, it's extended to enable reading in the index name line (I think that's what it's for, if I interpreted that comment correctly). But that extended header may end up being longer than the total length of the file. The check here for whether the file is too short doesn't take into account whether the header has been artificially extended. So this checks whether the header has been artificially extended and, if it was extended beyond the length of the file, disables the complaint about the file being too short.
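To make the logic described in this comment concrete, here is a minimal pure-Python sketch. It is not the code in `parser.pyx`: the function name `readable_header_rows` and its arguments are hypothetical, and it assumes a non-empty list header whose extension row is simply the last header row plus one.

```python
def readable_header_rows(n_lines, header):
    """Return the header rows that can actually be read from the file.

    n_lines: number of lines in the file (hypothetical stand-in for the
             parser's line count).
    header:  an int, or a non-empty list of 0-based row indices.
    """
    passed_as_list = isinstance(header, list)
    rows = list(header) if passed_as_list else [header]

    # A list header is extended by one row so the reader can look for an
    # index-name line directly after the header rows.
    if passed_as_list:
        rows.append(rows[-1] + 1)

    readable = []
    for position, row in enumerate(rows):
        is_extension = passed_as_list and position == len(rows) - 1
        if n_lines < row + 1:
            if is_extension:
                # Only the artificial extension row is missing: the file
                # just has no index-name line, so do not complain.
                break
            raise ValueError(
                f"header row {row} requested, but the file has only "
                f"{n_lines} lines"
            )
        readable.append(row)
    return readable
```

With this sketch, a one-line file read with `header=[0]` yields `[0]` rather than an error, while `header=3` on a two-line file still raises, which matches the intent stated above.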
@bkandel You have several tests that are failing now (I think also the one you added)
@jorisvandenbossche Sorry for the delay in fixing this -- there was more complexity here than I realized. I think it's getting there.
9d2265f to fc330ef
@@ -80,3 +80,4 @@ Performance Improvements

Bug Fixes
~~~~~~~~~
- Bug in ``pd.read_csv`` where reading files fails if the number of headers is equal to the number of lines in the file (:issue:`14515`)
can move to 0.19.2
data = """a,b
"""
df = self.read_csv(StringIO(data), header=[0])
expected = DataFrame(columns=[('a',), ('b',)])
I would expect these to resolve to an Index, actually (we don't have a single-level MI):
In [6]: pd.MultiIndex.from_tuples([('a',), ('b',)])
Out[6]: Index(['a', 'b'], dtype='object')
That's what I thought too, but I got dtype matching errors when I did that. I'll try to figure out what's going on.
In [29]: data = 'a,b\n'
In [30]: df = pd.read_csv(StringIO(data), header=[0])
In [31]: df.columns
Out[31]: Index([(u'a',), (u'b',)], dtype='object')
In [32]: df.columns == Index(['a', 'b'])
Out[32]: array([False, False], dtype=bool)
In [33]: df.columns == Index(['a', 'b'], dtype='object')
Out[33]: array([False, False], dtype=bool)
Looks like it's being parsed as a list of tuples, not a MultiIndex, but that's probably incorrect. I'll see if I can fix that.
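The mismatch in the session above comes down to one-element tuples not comparing equal to the strings they wrap. A small pure-Python illustration follows; the helper `flatten_single_level` is hypothetical and is not how pandas resolves the columns.

```python
# Labels as they appear in the parsed columns versus the expected labels.
parsed_labels = [("a",), ("b",)]
expected_labels = ["a", "b"]

# A one-element tuple never equals the bare string it contains, so an
# elementwise comparison is False everywhere.
elementwise = [p == e for p, e in zip(parsed_labels, expected_labels)]


def flatten_single_level(labels):
    """Unwrap one-element tuples; leave longer tuples and scalars alone."""
    return [
        label[0] if isinstance(label, tuple) and len(label) == 1 else label
        for label in labels
    ]


flattened = flatten_single_level(parsed_labels)
```

Note also that `('a')` without a trailing comma is just the string `'a'`, not a tuple, which is why an expected value written as `[('a'), ('b')]` and one written as `[('a',), ('b',)]` describe different columns.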
""" | ||
df2 = self.read_csv(StringIO(data_multiline), header=[0, 1])
expected2 = DataFrame(columns=[('a', 'c'), ('b', 'd')])
tm.assert_frame_equal(df2, expected2)
then this works nicely
In [7]: pd.MultiIndex.from_tuples([('a','c'), ('b', 'd')])
Out[7]:
MultiIndex(levels=[['a', 'b'], ['c', 'd']],
labels=[[0, 1], [0, 1]])
71810be to 524f2cc
Current coverage is 85.20% (diff: 100%)
@@             master   #14596   diff @@
==========================================
  Files           143      143
  Lines         50787    50793     +6
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
+ Hits          43273    43280     +7
+ Misses         7514     7513     -1
  Partials          0        0
@bkandel looks pretty good. ping when all green.
@jreback all green. Should be ready for final review now.
read_csv would fail on files if the number of header lines passed in includes all the lines in the files. This commit fixes that bug.
A test in test_to_csv checked for the presence of exactly the behavior we're fixing here: A file with 5 lines that asks for a header of length 5 should work and return an empty dataframe, not error.
6ba1bf4 to 32e3b0a
@jreback just rebased and should be good to go now.
thanks!
closes pandas-dev#14515

This commit fixes a bug where `read_csv` failed when given a file with a multiindex header and empty content. Because pandas reads index names as a separate line following the header lines, the reader looks for the line with index names in it. If the content of the dataframe is empty, the reader will choke. This bug surfaced after pandas-dev#6618 stopped writing an extra line after multiindex columns, which led to a situation where pandas could write CSVs that it couldn't then read.

This commit changes that behavior by explicitly checking if the index name row exists, and processing it correctly if it doesn't.

Author: Ben Kandel <ben.kandel@gmail.com>

Closes pandas-dev#14596 from bkandel/fix-parse-empty-df and squashes the following commits:

32e3b0a [Ben Kandel] lint
e6b1237 [Ben Kandel] lint
fedfff8 [Ben Kandel] fix multiindex column parsing
518982d [Ben Kandel] move to 0.19.2
fc23e5c [Ben Kandel] fix errant this_columns
3d9bbdd [Ben Kandel] whatsnew
68eadf3 [Ben Kandel] Modify test.
17e44dd [Ben Kandel] fix python parser too
72adaf2 [Ben Kandel] remove unnecessary test
bfe0423 [Ben Kandel] typo
2f64d57 [Ben Kandel] pep8
b8200e4 [Ben Kandel] BUG: read_csv with empty df

(cherry picked from commit f862b52)
git diff upstream/master | flake8 --diff
This commit fixes a bug where `read_csv` failed when given a file with a multiindex header and empty content. Because pandas reads index names as a separate line following the header lines, the reader looks for the line with index names in it. If the content of the dataframe is empty, the reader will choke. This bug surfaced after #6618 stopped writing an extra line after multiindex columns, which led to a situation where pandas could write CSVs that it couldn't then read.

This commit changes that behavior by explicitly checking if the index name row exists, and processing it correctly if it doesn't.