BUG: read_excel return empty dataframe when using usecols #20480

Closed
39 changes: 31 additions & 8 deletions doc/source/io.rst
@@ -2852,23 +2852,46 @@ Parsing Specific Columns

It is often the case that users will insert columns to do temporary computations
in Excel and you may not want to read in those columns. ``read_excel`` takes
a ``usecols`` keyword to allow you to specify a subset of columns to parse.
either a ``usecols`` or ``usecols_excel`` keyword to allow you to specify a
subset of columns to parse. Note that you can not use both ``usecols`` and
``usecols_excel`` named arguments at the same time.

If ``usecols`` is an integer, then it is assumed to indicate the last column
to be parsed.
If ``usecols_excel`` is supplied, then it is assumed to be a comma
separated list of Excel column letters and column ranges to be parsed.

.. code-block:: python

read_excel('path_to_file.xls', 'Sheet1', usecols=2)
read_excel('path_to_file.xls', 'Sheet1', usecols_excel='A:E')
read_excel('path_to_file.xls', 'Sheet1', usecols_excel='A,C,E:F')
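
For readers unfamiliar with the column-letter notation, here is a minimal sketch of how a spec such as ``'A,C,E:F'`` could expand to zero-based column positions. The helper names ``excel_col_to_index`` and ``expand_excel_columns`` are hypothetical and written only for illustration; they are not the helpers used in this PR, whose bodies are collapsed in the diff.

```python
def excel_col_to_index(col):
    """Convert Excel column letters (e.g. 'A', 'AA') to a zero-based index."""
    col = col.strip().upper()
    if not col:
        raise ValueError("empty Excel column")
    index = 0
    for ch in col:
        if not "A" <= ch <= "Z":
            raise ValueError("invalid Excel column: {!r}".format(col))
        # Column letters behave like a base-26 number without a zero digit.
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def expand_excel_columns(spec):
    """Expand a spec such as 'A,C,E:F' into sorted zero-based indices."""
    indices = set()
    for part in spec.split(","):
        if ":" in part:
            first, last = part.split(":")
            start, stop = excel_col_to_index(first), excel_col_to_index(last)
            # Ranges are inclusive of both sides.
            indices.update(range(start, stop + 1))
        else:
            indices.add(excel_col_to_index(part))
    return sorted(indices)


print(expand_excel_columns("A:E"))      # [0, 1, 2, 3, 4]
print(expand_excel_columns("A,C,E:F"))  # [0, 2, 4, 5]
```

Under this reading, ``'A:E'`` covers the first five columns and ``'A,C,E:F'`` covers columns 0, 2, 4 and 5.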

If `usecols` is a list of integers, then it is assumed to be the file column
indices to be parsed.
If ``usecols`` is a list of integers, then it is assumed to be the file
column indices to be parsed.

.. code-block:: python

read_excel('path_to_file.xls', 'Sheet1', usecols=[0, 2, 3])
read_excel('path_to_file.xls', 'Sheet1', usecols=[1, 3, 5])

Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.

If ``usecols`` is a list of strings, then each string is assumed to
correspond to a column name, provided either by the user in ``names`` or
inferred from the document header row(s), and those strings define which
columns will be parsed.

.. code-block:: python

read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])

Element order is ignored, so ``usecols=['baz', 'joe']`` is the same as
``['joe', 'baz']``.

If ``usecols`` is callable, the callable function will be evaluated against the
column names, returning names where the callable function evaluates to True.

.. code-block:: python

read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())

Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.
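
The three ``usecols`` forms described above (list of integers, list of strings, callable) can be sketched without touching a spreadsheet. The function ``select_columns`` below is a hypothetical stand-in that works on a plain list of header names; it only illustrates the documented semantics and is not how ``read_excel`` is implemented.

```python
def select_columns(names, usecols):
    """Return the header names that a given ``usecols`` value would keep."""
    if usecols is None:
        return list(names)
    if callable(usecols):
        # Keep the names for which the callable evaluates to True.
        return [name for name in names if usecols(name)]
    wanted = list(usecols)
    if all(isinstance(item, int) for item in wanted):
        positions = set(wanted)  # element order is ignored
        return [name for pos, name in enumerate(names) if pos in positions]
    if all(isinstance(item, str) for item in wanted):
        labels = set(wanted)  # element order is ignored
        return [name for name in names if name in labels]
    raise ValueError("usecols must be all integers or all strings")


header = ["foo", "bar", "baz", "x1"]
print(select_columns(header, [2, 0]))                 # ['foo', 'baz']
print(select_columns(header, ["baz", "foo"]))         # ['foo', 'baz']
print(select_columns(header, lambda x: x.isalpha()))  # ['foo', 'bar', 'baz']
```

In each case the result follows the order of the columns in the file, not the order of the elements passed in.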

Parsing Dates
+++++++++++++
4 changes: 2 additions & 2 deletions doc/source/whatsnew/v0.24.0.txt
@@ -36,7 +36,7 @@ Datetimelike API Changes
Other API Changes
^^^^^^^^^^^^^^^^^

-
- :func:`read_excel` has gained the keyword argument ``usecols_excel``, which receives a string containing comma separated Excel ranges and columns. The ``usecols`` keyword argument of :func:`read_excel` no longer supports a string containing comma separated Excel ranges and columns, nor an int indicating the first j columns to be read into a ``DataFrame``. In addition, ``usecols`` now supports a list of strings containing column labels and a callable. (:issue:`18273`)
Member

Better written as:

- :func:`read_excel` now has a keyword argument of ``usecols_excel`` which allows you to parse select columns via A1 notation (:issue:`18273`)

Author

@WillAyd, please take a look at this: #20480 (comment)

@jreback has asked me to point out what is changing.

@jreback, should I do what @WillAyd is asking, or should I keep the message I have already written?

-
-

@@ -148,7 +148,7 @@ I/O
^^^

-
-
- Bug in :func:`read_excel` where the ``usecols`` keyword argument as a list of strings was returning an empty ``DataFrame`` (:issue:`18273`)
-

Plotting
99 changes: 82 additions & 17 deletions pandas/io/excel.py
@@ -10,6 +10,8 @@
import abc
import warnings
import numpy as np
import string
import re
from io import UnsupportedOperation

from pandas.core.dtypes.common import (
@@ -85,20 +87,42 @@
Column (0-indexed) to use as the row labels of the DataFrame.
Pass None if there is no such column. If a list is passed,
those columns will be combined into a ``MultiIndex``. If a
subset of data is selected with ``usecols``, index_col
is based on the subset.
subset of data is selected with ``usecols_excel`` or ``usecols``,
index_col is based on the subset.
parse_cols : int or list, default None

.. deprecated:: 0.21.0
Pass in `usecols` instead.

usecols : int or list, default None
usecols : list-like or callable, default None
Contributor

was a single int allowed before?


Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or string
that correspond to column names provided either by the user in `names` or
inferred from the document header row(s). For example, a valid list-like
`usecols` parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Note that
you can not give both ``usecols`` and ``usecols_excel`` keyword arguments
at the same time.

If callable, the callable function will be evaluated against the column
names, returning names where the callable function evaluates to True. An
example of a valid callable argument would be ``lambda x: x.upper() in
['AAA', 'BBB', 'DDD']``.
Contributor

needs a versionchanged tag, mention what is changing


.. versionadded:: 0.24.0
Added support for column labels; `usecols_excel` is now the keyword that
receives a comma separated list of Excel columns and ranges.
usecols_excel : string, default None
Return a subset of the columns from a spreadsheet specified as Excel column
ranges and columns. Note that you can not use both ``usecols`` and
``usecols_excel`` keyword arguments at the same time.

* If None then parse all columns,
* If int then indicates last column to be parsed
* If list of ints then indicates list of column numbers to be parsed
* If string then indicates comma separated list of Excel column letters and
column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
both sides.
column ranges (e.g. "A:E" or "A,C,E:F") to be parsed. Ranges are
inclusive of both sides.

.. versionadded:: 0.24.0

squeeze : boolean, default False
If the parsed data only contains one column then return a Series
dtype : Type name or dict of column -> type, default None
@@ -269,6 +293,19 @@ def _get_default_writer(ext):
return _default_writers[ext]


def _is_excel_columns_notation(columns):
"""
Receive a string and check whether it is a comma separated list of
Excel column indexes and column ranges. An Excel range is a string with two
column indexes separated by ':'.
"""
if isinstance(columns, compat.string_types) and all(
(x in string.ascii_letters) for x in re.split(r',|:', columns)):
return True

return False
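
One thing worth checking in the helper above: ``x in string.ascii_letters`` is a substring test, so a multi-letter column such as ``'AA'`` fails it while an empty piece passes. A possible stricter check is sketched below using a regular expression. The name ``looks_like_excel_columns`` and the pattern are assumptions made for illustration, not part of this PR.

```python
import re
import string

# One or more ASCII letters, optionally followed by ':' and a second column.
_COLUMN = r"[A-Za-z]+"
_AREA = r"{col}(?::{col})?".format(col=_COLUMN)
# A spec is one area or several areas separated by commas.
_SPEC = re.compile(r"{area}(?:,{area})*".format(area=_AREA))


def looks_like_excel_columns(columns):
    """Return True if ``columns`` is a string such as 'A:E' or 'A,C,E:F'."""
    return isinstance(columns, str) and _SPEC.fullmatch(columns) is not None


# Substring membership behaves differently from a per-character check.
print("AA" in string.ascii_letters)  # False
print("" in string.ascii_letters)    # True

print(looks_like_excel_columns("A:AA"))   # True
print(looks_like_excel_columns("A,,B"))   # False
print(looks_like_excel_columns([0, 1]))   # False
```

Note that a single bare word such as ``'foo'`` still counts as a valid column under this pattern, since Excel column letters and short labels are indistinguishable as strings.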


def get_writer(engine_name):
try:
return _writers[engine_name]
@@ -286,6 +323,7 @@ def read_excel(io,
names=None,
index_col=None,
usecols=None,
usecols_excel=None,
squeeze=False,
dtype=None,
engine=None,
@@ -311,6 +349,7 @@ def read_excel(io,
header=header,
names=names,
index_col=index_col,
usecols_excel=usecols_excel,
usecols=usecols,
squeeze=squeeze,
dtype=dtype,
@@ -405,6 +444,7 @@ def parse(self,
names=None,
index_col=None,
usecols=None,
usecols_excel=None,
squeeze=False,
converters=None,
true_values=None,
@@ -439,6 +479,7 @@ def parse(self,
header=header,
names=names,
index_col=index_col,
usecols_excel=usecols_excel,
usecols=usecols,
squeeze=squeeze,
converters=converters,
@@ -455,7 +496,7 @@ def parse(self,
convert_float=convert_float,
**kwds)

def _should_parse(self, i, usecols):
def _should_parse(self, i, usecols_excel, usecols):

def _range2cols(areas):
"""
@@ -481,19 +522,20 @@ def _excel2num(x):
cols.append(_excel2num(rng))
return cols

if isinstance(usecols, int):
return i <= usecols
elif isinstance(usecols, compat.string_types):
return i in _range2cols(usecols)
else:
return i in usecols
# check if usecols_excel is a string that indicates a comma separated
# list of Excel column letters and column ranges
if isinstance(usecols_excel, compat.string_types):
return i in _range2cols(usecols_excel)

return True

def _parse_excel(self,
sheet_name=0,
header=0,
names=None,
index_col=None,
usecols=None,
usecols_excel=None,
squeeze=False,
dtype=None,
true_values=None,
@@ -512,6 +554,25 @@ def _parse_excel(self,

_validate_header_arg(header)

if (usecols is not None) and (usecols_excel is not None):
raise ValueError("Cannot specify both `usecols` and "
"`usecols_excel`. Choose one of them.")

# Check if some string in usecols may be interpreted as an Excel
# range or positional column
elif _is_excel_columns_notation(usecols):
warnings.warn("The `usecols` keyword argument used to refer to "
"Excel ranges and columns as strings was "
"renamed to `usecols_excel`.", UserWarning,
stacklevel=3)
usecols_excel = usecols
usecols = None

elif (usecols_excel is not None) and not _is_excel_columns_notation(
usecols_excel):
raise TypeError("`usecols_excel` must be None or a string of "
"comma separated Excel ranges and columns.")
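
The three branches above can be read as a small argument-resolution step. The standalone sketch below mirrors that logic outside the class so each outcome can be tried directly. ``resolve_usecols`` and ``_is_excel_spec`` are hypothetical names, and the regular-expression validator is an assumption standing in for ``_is_excel_columns_notation``.

```python
import re
import warnings

# Assumed validator: letters, optional ':' range, comma separated.
_SPEC = re.compile(r"[A-Za-z]+(?::[A-Za-z]+)?(?:,[A-Za-z]+(?::[A-Za-z]+)?)*")


def _is_excel_spec(value):
    return isinstance(value, str) and _SPEC.fullmatch(value) is not None


def resolve_usecols(usecols=None, usecols_excel=None):
    """Return the (usecols, usecols_excel) pair after validation."""
    if usecols is not None and usecols_excel is not None:
        raise ValueError("Cannot specify both `usecols` and `usecols_excel`.")
    if _is_excel_spec(usecols):
        # Old-style call: an Excel spec passed through `usecols`.
        warnings.warn("Excel ranges passed as `usecols` should be passed "
                      "as `usecols_excel`.", UserWarning, stacklevel=2)
        return None, usecols
    if usecols_excel is not None and not _is_excel_spec(usecols_excel):
        raise TypeError("`usecols_excel` must be None or a string of "
                        "comma separated Excel ranges and columns.")
    return usecols, usecols_excel


print(resolve_usecols(usecols=[0, 2]))            # ([0, 2], None)
print(resolve_usecols(usecols_excel="A,C,E:F"))   # (None, 'A,C,E:F')
```

Passing ``usecols="A:C"`` in this sketch returns ``(None, 'A:C')`` and emits a ``UserWarning``, which matches the fallback path in the diff.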

if 'chunksize' in kwds:
raise NotImplementedError("chunksize keyword of read_excel "
"is not implemented")
@@ -615,10 +676,13 @@ def _parse_cell(cell_contents, cell_typ):
row = []
for j, (value, typ) in enumerate(zip(sheet.row_values(i),
sheet.row_types(i))):
if usecols is not None and j not in should_parse:
should_parse[j] = self._should_parse(j, usecols)
if ((usecols is not None) or (usecols_excel is not None) or
(j not in should_parse)):
should_parse[j] = self._should_parse(j, usecols_excel,
usecols)

if usecols is None or should_parse[j]:
if (((usecols_excel is None) and (usecols is None)) or
should_parse[j]):
row.append(_parse_cell(value, typ))
data.append(row)
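
As written, the new condition joins its three tests with ``or``, so whenever either keyword is set the decision for column ``j`` appears to be recomputed for every row rather than read from ``should_parse``. A minimal sketch of the per-column caching idea follows. ``filter_rows`` and ``should_parse_column`` are hypothetical names, and the rows are plain lists rather than sheet objects.

```python
def filter_rows(rows, should_parse_column):
    """Keep only the cells whose column index passes ``should_parse_column``.

    The decision for each column index is computed once and then reused.
    """
    decisions = {}
    filtered = []
    for row in rows:
        kept = []
        for j, value in enumerate(row):
            if j not in decisions:
                # First time this column is seen: compute and remember.
                decisions[j] = should_parse_column(j)
            if decisions[j]:
                kept.append(value)
        filtered.append(kept)
    return filtered


rows = [[1, 2, 3], [4, 5, 6]]
print(filter_rows(rows, lambda j: j in {0, 2}))  # [[1, 3], [4, 6]]
```

With this shape, the predicate runs once per column regardless of how many rows the sheet has.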

@@ -674,6 +738,7 @@ def _parse_cell(cell_contents, cell_typ):
dtype=dtype,
true_values=true_values,
false_values=false_values,
usecols=usecols,
skiprows=skiprows,
nrows=nrows,
na_values=na_values,