Fix parse empty df #14596

bkandel · 2016-11-06T02:13:00Z

closes Pandas 0.19 read_csv with header=[0, 1] on an empty df throws error #14515
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

This commit fixes a bug where read_csv failed when given a file with a multiindex header and empty content. Because pandas reads index names as a separate line following the header lines, the reader looks for a line containing the index names. If the dataframe has no content, that line does not exist and the reader fails. This bug surfaced after #6618 stopped writing an extra line after multiindex columns, which led to a situation where pandas could write CSVs that it couldn't then read.

This commit changes that behavior by explicitly checking if the index name row exists, and processing it correctly if it doesn't.
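A minimal sketch of the case described above, assuming a pandas version that includes this fix. The two-line input is illustrative and not taken from the PR's test data verbatim: the file consists of two header rows and nothing else, so there is no index-name line for the reader to find.

```python
from io import StringIO

import pandas as pd

# Two header rows and no data rows: every line in the file is a header line.
data = "a,b\nc,d\n"

# With the fix, this is expected to return an empty frame rather than raise.
df = pd.read_csv(StringIO(data), header=[0, 1])

print(df.empty)
print(list(df.columns))
```

The same call is what the linked issue reports as raising on pandas 0.19.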

jreback · 2016-11-06T02:14:27Z

doc/source/whatsnew/v0.19.1.txt

@@ -57,5 +57,6 @@ Bug Fixes
 - Bug in ``DataFrame.to_json`` where ``lines=True`` and a value contained a ``}`` character (:issue:`14391`)
 - Bug in ``df.groupby`` causing an ``AttributeError`` when grouping a single index frame by a column and the index level (:issue`14327`)
 - Bug in ``df.groupby`` where ``TypeError`` raised when ``pd.Grouper(key=...)`` is passed in a list (:issue:`14334`)
+- Bug in ``pd.read_csv`` where reading files fails if the number of headers is equal to the number of lines in the file (:issue:`14515`)


jreback · 2016-11-06T02:15:01Z

pandas/io/tests/parser/common.py

+"""
+        df = self.read_csv(StringIO(data), header=[0])
+        expected = DataFrame(columns=[('a'), ('b')])
+        tm.assert_frame_equal(df, expected)


use the same styling as above
e.g. expected

jreback · 2016-11-06T02:15:54Z

pandas/parser.pyx

@@ -714,7 +714,9 @@ cdef class TextReader:
                    start = self.parser.line_start[0]

                # e.g., if header=3 and file only has 2 lines
-                elif self.parser.lines < hr + 1:
+                if (self.parser.lines < hr + 1
+                    and not isinstance(self.orig_header, list)) or (


this is really odd, what are you trying to do?

This counteracts the extension to the header added in:
https://github.com/pandas-dev/pandas/blob/master/pandas/parser.pyx#L519

The issue is that if the header is passed in as a list, it's extended to enable reading the index name line (I think that's what it's for, if I interpreted that comment correctly). But that extended header may end up being longer than the file itself. The existing check for a file that is too short doesn't take into account whether the header has been artificially extended. So this change checks whether the header was extended and, if it was extended beyond the length of the file, disables the complaint about the file being too short.
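The reasoning above can be illustrated with a toy model in plain Python. This is only a sketch of the visible half of the condition in the diff hunk; the second half of the condition is cut off in the hunk and is not reproduced here. The function name and parameters are hypothetical stand-ins for `self.parser.lines`, `hr`, and `self.orig_header`, not the actual `TextReader` code.

```python
def header_exceeds_file(n_lines, header_row, orig_header):
    """Toy stand-in for the first clause of the condition in the hunk above.

    n_lines     -- number of lines the tokenizer has seen (hypothetical name)
    header_row  -- zero-based header row currently being processed
    orig_header -- the header argument exactly as the caller passed it
    """
    too_short = n_lines < header_row + 1
    # A list-valued header may have been padded by one extra row to look for
    # index names, so the plain length check is not trusted in that case.
    return too_short and not isinstance(orig_header, list)


# header=3 on a two-line file: still reported as too short.
print(header_exceeds_file(2, 3, 3))

# header=[0, 1] on a two-line file, padded to row 2: no longer reported.
print(header_exceeds_file(2, 2, [0, 1]))
```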

jorisvandenbossche · 2016-11-11T09:05:54Z

@bkandel You have several tests that are failing now (I think also the one you added)

bkandel · 2016-11-11T14:32:08Z

@jorisvandenbossche Sorry for the delay in fixing this -- there was more complexity here than I realized. I think it's getting there.

jreback · 2016-11-12T16:40:58Z

doc/source/whatsnew/v0.20.0.txt

@@ -80,3 +80,4 @@ Performance Improvements

 Bug Fixes
 ~~~~~~~~~
+- Bug in ``pd.read_csv`` where reading files fails if the number of headers is equal to the number of lines in the file (:issue:`14515`)


can move to 0.19.2

jreback · 2016-11-12T16:43:21Z

pandas/io/tests/parser/common.py

+        data = """a,b
+"""
+        df = self.read_csv(StringIO(data), header=[0])
+        expected = DataFrame(columns=[('a',), ('b',)])


I would expect these to resolve to an Index actually (we don't have a single level MI)

In [6]: pd.MultiIndex.from_tuples([('a',), ('b',)])
Out[6]: Index(['a', 'b'], dtype='object')

That's what I thought too, but I got dtype matching errors when I did that. I'll try to figure out what's going on.

In [29]: data = 'a,b\n'
In [30]: df = pd.read_csv(StringIO(data), header=[0])
In [31]: df.columns
Out[31]: Index([(u'a',), (u'b',)], dtype='object')
In [32]: df.columns == Index(['a', 'b'])
Out[32]: array([False, False], dtype=bool)
In [33]: df.columns == Index(['a', 'b'], dtype='object')
Out[33]: array([False, False], dtype=bool)

Looks like it's being parsed as a list of tuples, not a MultiIndex, but that's probably incorrect. I'll see if I can fix that.

jreback · 2016-11-12T16:43:55Z

pandas/io/tests/parser/common.py

+"""
+        df2 = self.read_csv(StringIO(data_multiline), header=[0, 1])
+        expected2 = DataFrame(columns=[('a', 'c'), ('b', 'd')])
+        tm.assert_frame_equal(df2, expected2)


then this works nicely

In [7]: pd.MultiIndex.from_tuples([('a','c'), ('b', 'd')])
Out[7]: MultiIndex(levels=[['a', 'b'], ['c', 'd']], labels=[[0, 1], [0, 1]])
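As a rough illustration of the suggestion, the expected frame for the two-row header case could be built from an explicit MultiIndex instead of a bare list of tuples. This is a sketch, not the test as merged; the variable names simply follow the ones in the hunk above.

```python
import pandas as pd

# Column labels given as 2-tuples, as in the test under review.
cols = pd.MultiIndex.from_tuples([("a", "c"), ("b", "d")])

# An empty frame whose columns are a genuine two-level MultiIndex.
expected2 = pd.DataFrame(columns=cols)

print(type(expected2.columns).__name__, expected2.columns.nlevels)
```

Building the columns this way makes the intended index type explicit in the test, rather than relying on how the DataFrame constructor interprets a list of tuples.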

codecov-io · 2016-11-13T13:30:33Z

Current coverage is 85.20% (diff: 100%)

Merging #14596 into master will increase coverage by <.01%

@@             master     #14596   diff @@
==========================================
  Files           143        143          
  Lines         50787      50793     +6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43273      43280     +7   
+ Misses         7514       7513     -1   
  Partials          0          0

Powered by Codecov. Last update f26b049...32e3b0a

jreback · 2016-11-15T19:40:30Z

@bkandel looks pretty good. ping when all green.

bkandel · 2016-11-15T21:19:45Z

@jreback all green. Should be ready for final review now.

read_csv would fail on files if the number of header lines passed in includes all the lines in the files. This commit fixes that bug.

A test in test_to_csv checked for the presence of exactly the behavior we're fixing here: A file with 5 lines that asks for a header of length 5 should work and return an empty dataframe, not error.

bkandel · 2016-11-22T01:49:51Z

@jreback just rebased and should be good to go now.

jreback · 2016-11-22T11:24:02Z

thanks!

closes pandas-dev#14515

This commit fixes a bug where `read_csv` failed when given a file with a multiindex header and empty content. Because pandas reads index names as a separate line following the header lines, the reader looks for the line with index names in it. If the content of the dataframe is empty, the reader will choke. This bug surfaced after pandas-dev#6618 stopped writing an extra line after multiindex columns, which led to a situation where pandas could write CSV's that it couldn't then read.

This commit changes that behavior by explicitly checking if the index name row exists, and processing it correctly if it doesn't.

Author: Ben Kandel <ben.kandel@gmail.com>

Closes pandas-dev#14596 from bkandel/fix-parse-empty-df and squashes the following commits:

32e3b0a [Ben Kandel] lint
e6b1237 [Ben Kandel] lint
fedfff8 [Ben Kandel] fix multiindex column parsing
518982d [Ben Kandel] move to 0.19.2
fc23e5c [Ben Kandel] fix errant this_columns
3d9bbdd [Ben Kandel] whatsnew
68eadf3 [Ben Kandel] Modify test.
17e44dd [Ben Kandel] fix python parser too
72adaf2 [Ben Kandel] remove unnecessary test
bfe0423 [Ben Kandel] typo
2f64d57 [Ben Kandel] pep8
b8200e4 [Ben Kandel] BUG: read_csv with empty df

(cherry picked from commit f862b52)

jreback requested changes Nov 6, 2016

sinhrks added IO CSV read_csv, to_csv MultiIndex Bug labels Nov 7, 2016

bkandel force-pushed the fix-parse-empty-df branch from 9d2265f to fc330ef Compare November 11, 2016 15:40

jreback reviewed Nov 12, 2016

bkandel force-pushed the fix-parse-empty-df branch from 71810be to 524f2cc Compare November 13, 2016 13:30

Ben Kandel added 12 commits November 21, 2016 20:48

BUG: read_csv with empty df

b8200e4

read_csv would fail on files if the number of header lines passed in includes all the lines in the files. This commit fixes that bug.

pep8

2f64d57

typo

bfe0423

remove unnecessary test

72adaf2

fix python parser too

17e44dd

Modify test.

68eadf3

A test in test_to_csv checked for the presence of exactly the behavior we're fixing here: A file with 5 lines that asks for a header of length 5 should work and return an empty dataframe, not error.

whatsnew

3d9bbdd

fix errant this_columns

fc23e5c

move to 0.19.2

518982d

fix multiindex column parsing

fedfff8

lint

e6b1237

lint

32e3b0a

bkandel force-pushed the fix-parse-empty-df branch from 6ba1bf4 to 32e3b0a Compare November 22, 2016 01:49

jreback added this to the 0.19.2 milestone Nov 22, 2016

jreback closed this in f862b52 Nov 22, 2016
