
Pandas 0.19 read_csv with header=[0, 1] on an empty df throws error #14515

Closed
kaloramik opened this issue Oct 27, 2016 · 7 comments
Labels: IO CSV read_csv, to_csv

kaloramik commented Oct 27, 2016

Pandas 0.19 fails to read back CSV files written from empty DataFrames with MultiIndex columns.

import pandas as pd
import tempfile

df = pd.DataFrame.from_records([], columns=['col_1', 'col_2'])
joined_df_in = pd.concat([df, df], keys=['a', 'b'], axis=1)
joined_df_in.reset_index(drop=True, inplace=True)

with tempfile.NamedTemporaryFile(delete=False) as f:
    joined_df_in.to_csv(f.name, index=False)

The resulting file looks like:

a,a,b,b
col_1,col_2,col_1,col_2

Expected Output

# in pandas 0.18.1
pd.read_csv(f.name, header=[0,1])

yields what we expect, an empty MultiIndex data frame

(a, col_1)  (a, col_2)  (b, col_1)  (b, col_2)
# in pandas 0.19
pd.read_csv(f.name, header=[0,1])

Throws

---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-10-1051c5f9aa58> in <module>()
----> 1 pd.read_csv(f.name, header=[0,1])

/Users/mik-OD/anaconda/envs/signals/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    643                     skip_blank_lines=skip_blank_lines)
    644 
--> 645         return _read(filepath_or_buffer, kwds)
    646 
    647     parser_f.__name__ = name

/Users/mik-OD/anaconda/envs/signals/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    386 
    387     # Create the parser.
--> 388     parser = TextFileReader(filepath_or_buffer, **kwds)
    389 
    390     if (nrows is not None) and (chunksize is not None):

/Users/mik-OD/anaconda/envs/signals/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    727             self.options['has_index_names'] = kwds['has_index_names']
    728 
--> 729         self._make_engine(self.engine)
    730 
    731     def close(self):

/Users/mik-OD/anaconda/envs/signals/lib/python3.5/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    920     def _make_engine(self, engine='c'):
    921         if engine == 'c':
--> 922             self._engine = CParserWrapper(self.f, **self.options)
    923         else:
    924             if engine == 'python':

/Users/mik-OD/anaconda/envs/signals/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1387         kwds['allow_leading_cols'] = self.index_col is not False
   1388 
--> 1389         self._reader = _parser.TextReader(src, **kwds)
   1390 
   1391         # XXX

pandas/parser.pyx in pandas.parser.TextReader.__cinit__ (pandas/parser.c:5811)()

pandas/parser.pyx in pandas.parser.TextReader._get_header (pandas/parser.c:8615)()

CParserError: Passed header=[0,1], len of 2, but only 2 lines in file

Output of pd.show_versions()

For pandas 0.18.1

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.11.1
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.1
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
jinja2: 2.7.3
boto: 2.40.0
pandas_datareader: None

For pandas 0.19


INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: None
numpy: 1.11.1
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.1
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
jinja2: 2.7.3
boto: 2.40.0
pandas_datareader: None
@jorisvandenbossche jorisvandenbossche added the IO CSV read_csv, to_csv label Oct 27, 2016
jorisvandenbossche (Member) commented

@kaloramik So the change is not in read_csv (the example you give raises for me on 0.19.0, 0.18.1, and even 0.16), but in the output that to_csv generates.

In versions < 0.19.0, the file looks like:

a,a,b,b
col_1,col_2,col_1,col_2
,,,

while in 0.19.0 it looks like (what you showed above):

a,a,b,b
col_1,col_2,col_1,col_2

So previously there was an extra line with empty values. Reading this in with 0.19.0 still gives your desired result of an empty frame:

s = """a,a,b,b
col_1,col_2,col_1,col_2
,,,"""

In [89]: pd.read_csv(StringIO(s), header=[0,1])
Out[89]: 
Empty DataFrame
Columns: [(a, col_1), (a, col_2), (b, col_1), (b, col_2)]
Index: []

In [90]: pd.__version__
Out[90]: '0.19.0'

(however, something could be said this should actually give you one row of NaNs)

So the change is in to_csv. In 0.19.0, the extra line is not added

In [94]: df = pd.DataFrame(columns=pd.MultiIndex.from_product([('a', 'b'), ('col_1', 'col_2')]))

In [96]: print(df.to_csv())
,a,a,b,b
,col_1,col_2,col_1,col_2

while in 0.18.0 there was an extra line of commas:

In [32]: df = pd.DataFrame(columns=pd.MultiIndex.from_product([('a', 'b'), ('col_1', 'col_2')]))

In [34]: print(df.to_csv())
,a,a,b,b
,col_1,col_2,col_1,col_2
,,,,

This was a bug (since you don't have any data, there should not be a line of missing values), and it was fixed in 0.19.0; see #6618

@jorisvandenbossche jorisvandenbossche added this to the No action milestone Oct 27, 2016
kaloramik (Author) commented

@jorisvandenbossche hmm really? That's not what I'm seeing at all. Is it possible I have a package that's screwing something up? Can you post your pd.show_versions()?

But looking at the behavior, shouldn't the expected behavior be what I posted? That is, if you read in a file of 2 lines and your headers take up 2 lines, it should return an empty df with those columns. I believe the same behavior applies for a single header.

The error message doesn't seem to make sense

Passed header=[0,1], len of 2, but only 2 lines in file

it DOES have 2 lines in the file, so it should be able to construct the header. In addition, the source code has the following comment
https://github.com/pandas-dev/pandas/blob/6130e77fb7c9d44fde5d98f9719bd67bb9ec2ade/pandas/parser.pyx

                # e.g., if header=3 and file only has 2 lines
                elif self.parser.lines < hr + 1:
                    msg = self.orig_header
                    if isinstance(msg, list):
                        msg = "[%s], len of %d," % (
                            ','.join([ str(m) for m in msg ]), len(msg))
                    raise CParserError(
                        'Passed header=%s but only %d lines in file'
                        % (msg, self.parser.lines))

According to the comment, the function should fail only if the file has fewer than len(header) lines, implying that it should succeed when len(header) equals the number of lines. Does that sound right?

kaloramik (Author) commented

Oh actually, scratch that, you are right about 0.18.1 writing an extra line of commas (and so the read_csv succeeds, I guess).

But this breaks behavior now: in my data pipelines, I am unable to write and then read back empty dataframes as before. I think the behavior I described above is still the desired one, unless you have better workarounds? (I don't think replicating the old behavior by forcibly adding a row of commas would be a good idea.)
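For reference, the write-side workaround being discussed could be sketched like this. The helper name `to_csv_with_padding` is ours, not a pandas API; it simply appends one row of empty fields when the frame has no data, mimicking the pre-0.19 to_csv output so that 0.19.0's read_csv still sees a line after the MultiIndex header:

```python
import pandas as pd

# Hypothetical helper (not a pandas API): when the frame is empty, append one
# row of empty fields after the header lines, replicating what to_csv wrote
# before 0.19.0.
def to_csv_with_padding(df):
    csv = df.to_csv(index=False)
    if df.empty and df.shape[1] > 0:
        csv += "," * (df.shape[1] - 1) + "\n"
    return csv

df = pd.DataFrame(columns=pd.MultiIndex.from_product([("a", "b"), ("col_1", "col_2")]))
padded = to_csv_with_padding(df)
# padded now contains the two header lines followed by ",,,"
```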

jorisvandenbossche (Member) commented

But looking at the behavior, shouldn't the expected behavior be what I posted?

Possibly. But I am just pointing out that it is not a change in read_csv. The code you link to hasn't changed in 2 years (and I tested up to 0.16 that this has been raising this error consistently).

Apart from that, it is worth discussing if we should allow this. IMO returning an empty frame is indeed more logical to do.

jorisvandenbossche (Member) commented

The bug fix in to_csv was in any case a good one, so we can only fix it in read_csv. Personally I am in favor of returning an empty frame instead of erroring.
As you point out, this is more in line with a single header line:

s = """a,b
"""

In [14]: pd.read_csv(StringIO(s))
Out[14]: 
Empty DataFrame
Columns: [a, b]
Index: []

Note that also for a single header, once you pass the header kwarg, it raises:

In [105]: pd.read_csv(StringIO(s), header=[0])
...
CParserError: Passed header=[0], len of 1, but only 1 lines in file

cc @gfyoung @chris-b1

kaloramik (Author) commented

Got it, thanks for the clarification! Actually, as a temporary workaround, I guess forcing a write of an empty row for empty data frames should be OK.

Do you know if there are any other workarounds, perhaps from the read side?

jorisvandenbossche (Member) commented

Hmm, I don't directly see a workaround on the read side. If you want to end up with the multi-index, I don't think there is an easy solution. Probably easier to temporarily fix on the write side as you point out.

@bkandel bkandel mentioned this issue Nov 6, 2016
@jreback jreback modified the milestones: 0.19.2, No action Nov 22, 2016
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue Dec 14, 2016
closes pandas-dev#14515

This commit fixes a bug where `read_csv` failed when given a file with
a multiindex header and empty content. Because pandas reads index
names as a separate line following the header lines, the reader looks
for the line with index names in it. If the content of the dataframe
is empty, the reader will choke. This bug surfaced after
pandas-dev#6618 stopped writing an
extra line after multiindex columns, which led to a situation where
pandas could write CSVs that it couldn't then read. This commit
changes that behavior by explicitly checking if the index name row
exists, and processing it correctly if it doesn't.

Author: Ben Kandel <ben.kandel@gmail.com>

Closes pandas-dev#14596 from bkandel/fix-parse-empty-df and squashes the following commits:

32e3b0a [Ben Kandel] lint
e6b1237 [Ben Kandel] lint
fedfff8 [Ben Kandel] fix multiindex column parsing
518982d [Ben Kandel] move to 0.19.2
fc23e5c [Ben Kandel] fix errant this_columns
3d9bbdd [Ben Kandel] whatsnew
68eadf3 [Ben Kandel] Modify test.
17e44dd [Ben Kandel] fix python parser too
72adaf2 [Ben Kandel] remove unnecessary test
bfe0423 [Ben Kandel] typo
2f64d57 [Ben Kandel] pep8
b8200e4 [Ben Kandel] BUG: read_csv with empty df

(cherry picked from commit f862b52)
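With that fix (shipped in 0.19.2), the header-only file from the original report parses to an empty MultiIndex frame instead of raising; a minimal check on a current pandas:

```python
import io
import pandas as pd

# On pandas >= 0.19.2, a file that is nothing but the two header lines
# parses to an empty frame with MultiIndex columns instead of raising
# CParserError.
s = "a,a,b,b\ncol_1,col_2,col_1,col_2\n"
df = pd.read_csv(io.StringIO(s), header=[0, 1])
```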