Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Better error message if usecols doesn't match columns #17310

Merged
Merged
20 changes: 18 additions & 2 deletions pandas/io/parsers.py
Original file line number Diff line number Diff line change
Expand Up @@ -1662,14 +1662,22 @@ def __init__(self, src, **kwds):
# GH 14671
if (self.usecols_dtype == 'string' and
not set(usecols).issubset(self.orig_names)):
raise ValueError("Usecols do not match names.")
missing = [c for c in usecols if c not in self.orig_names]
raise ValueError(
"Usecols do not match columns, "
"columns expected but not found: {}".format(missing)
Copy link
Member

@gfyoung gfyoung Oct 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use keyword arguments in the string formatting. Same for your other error messages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I see a lot of duplicate code here. Let's abstract into a method (you can just create a private method outside of the class). That will make your life easier.

)

if len(self.names) > len(usecols):
self.names = [n for i, n in enumerate(self.names)
if (i in usecols or n in usecols)]

if len(self.names) < len(usecols):
raise ValueError("Usecols do not match names.")
missing = [c for c in usecols if c not in self.names]
raise ValueError(
"Usecols do not match columns, "
"columns expected but not found: {}".format(missing)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same


self._set_noconvert_columns()

Expand Down Expand Up @@ -2442,6 +2450,14 @@ def _handle_usecols(self, columns, usecols_key):
raise ValueError("If using multiple headers, usecols must "
"be integers.")
col_indices = []

missing = [c for c in self.usecols if c not in usecols_key]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure what design patterns Pandas follows for this kind of thing, but would you rather me drop this down into a try/catch for col_indices.append(usecols_key.index(col)) where the error is currently being thrown?

I worry this initial approach adds needless computation time - but unsure if it's more readable? Let me know 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate further on that logic you provided: col_indices.append(usecols_key.index(col)). I'm not sure I fully follow where / how this would be added.

Secondly, don't worry about computation time. Get a working implementation first. Then we'll worry about optimizing, if need be. Chances are that won't be an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, at the moment the error is being raised a few lines further down when we're looking for the index of the column in the usecols_key list.

The alternate proposal would be something like:

try:
    col_indices.append(usecols_key.index(col))
except ValueError:
    # Calculate missing usecols and raise error accordingly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, understood. Yes, absolutely, let's do try-except.

if len(missing) > 0:
raise ValueError(
"Usecols do not match columns, "
"columns expected but not found: {}".format(missing)
)

for col in self.usecols:
if isinstance(col, string_types):
col_indices.append(usecols_key.index(col))
Expand Down
16 changes: 8 additions & 8 deletions pandas/tests/io/parser/usecols.py
Original file line number Diff line number Diff line change
Expand Up @@ -480,10 +480,10 @@ def test_raise_on_usecols_names_mismatch(self):
# GH 14671
data = 'a,b,c,d\n1,2,3,4\n5,6,7,8'

if self.engine == 'c':
msg = 'Usecols do not match names'
else:
msg = 'is not in list'
msg = (
"Usecols do not match columns, "
"columns expected but not found: {}"
)

usecols = ['a', 'b', 'c', 'd']
df = self.read_csv(StringIO(data), usecols=usecols)
Expand All @@ -492,11 +492,11 @@ def test_raise_on_usecols_names_mismatch(self):
tm.assert_frame_equal(df, expected)

usecols = ['a', 'b', 'c', 'f']
with tm.assert_raises_regex(ValueError, msg):
with tm.assert_raises_regex(ValueError, msg.format(['f'])):
self.read_csv(StringIO(data), usecols=usecols)

usecols = ['a', 'b', 'f']
with tm.assert_raises_regex(ValueError, msg):
with tm.assert_raises_regex(ValueError, msg.format(['f'])):
self.read_csv(StringIO(data), usecols=usecols)

names = ['A', 'B', 'C', 'D']
Expand All @@ -520,9 +520,9 @@ def test_raise_on_usecols_names_mismatch(self):
# tm.assert_frame_equal(df, expected)

usecols = ['A', 'B', 'C', 'f']
with tm.assert_raises_regex(ValueError, msg):
with tm.assert_ra(ValueError, msg.format(['f'])):
self.read_csv(StringIO(data), header=0, names=names,
usecols=usecols)
usecols = ['A', 'B', 'f']
with tm.assert_raises_regex(ValueError, msg):
with tm.assert_raises_regex(ValueError, msg.format(['f'])):
self.read_csv(StringIO(data), names=names, usecols=usecols)