read_fwf: skip_blank_lines does nothing #22693

Closed
Tracked by #3
ivanradicek opened this issue Sep 13, 2018 · 10 comments
Labels: good first issue, Needs Tests (Unit test(s) needed to prevent regressions)

@ivanradicek

Code Sample, a copy-pastable example if possible

from io import StringIO
import pandas as pd

f = StringIO('''A B

C D''')

df = pd.read_fwf(f, colspecs=[(0, 1), (2,3)], header=None, skip_blank_lines=True)
print(df)

Problem description

Output:

     0    1
0    A    B
1  NaN  NaN
2    C    D

The blank second line is not skipped; instead, it produces a row with two NaN values. It seems that skip_blank_lines has no effect on read_fwf. By contrast, read_csv(f, sep=' ', header=None) produces the expected output below.

Expected Output

   0  1
0  A  B
1  C  D
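
One possible workaround for the behaviour reported above is to drop blank lines from the text before it reaches the parser. The sketch below is an illustration only: `drop_blank_lines` is a hypothetical helper, not a pandas function, and it assumes the whole input fits in memory.

```python
from io import StringIO


def drop_blank_lines(handle):
    """Return a new StringIO holding only lines that contain non-whitespace text.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    kept = [line for line in handle if line.strip()]
    return StringIO("".join(kept))


f = StringIO('''A B

C D''')

cleaned = drop_blank_lines(f)
text = cleaned.getvalue()
print(text)
```

The `cleaned` handle could then be passed to pd.read_fwf with the same colspecs. Note that this also drops whitespace-only lines, which may be undesirable if such a line is meant to be a row of empty fields.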

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.1
pytest: 3.6.2
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.5.5
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Sep 13, 2018

Thanks for the report - investigation and PRs are certainly welcome

@WillAyd added the Bug and IO Data (IO issues that don't fit into a more specific label) labels on Sep 13, 2018
@WillAyd added this to the Contributions Welcome milestone on Sep 13, 2018
@danpere

danpere commented Sep 13, 2018

The bug appears to be in pandas/io/parsers.py, in the way _remove_empty_lines() is used with fixed-width files (and probably with whitespace-delimited files too). It sees the line as ['', ''], which in a CSV would correspond to the string "," and is therefore non-empty, but in a fixed-width file this can result from an empty line. Arguably, if the line really were spaces out to the last field, empty strings might be the right thing to extract. (Edit: Never mind, in that case the fields would be runs of spaces, not empty strings.) I'm not sure whether the right fix is to change _remove_empty_lines() or its usages.
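
To make the diagnosis above concrete, here is a minimal sketch of how slicing an empty line by column positions yields ['', ''], and why a CSV-style emptiness check would then keep the row. It does not reproduce the actual pandas code: all three function names are hypothetical, and the CSV-style rule is an assumption based on the description in this comment.

```python
def slice_fixed_width(line, colspecs):
    """Cut a line into fields at the given (start, stop) column positions."""
    return [line[start:stop] for start, stop in colspecs]


def looks_empty_csv_style(fields):
    """Assumed CSV-style rule: blank only if there are no fields or one empty field."""
    return len(fields) == 0 or (len(fields) == 1 and fields[0].strip() == "")


def looks_empty_fixed_width(fields):
    """One possible fixed-width rule: blank if every field is empty or whitespace."""
    return all(field.strip() == "" for field in fields)


colspecs = [(0, 1), (2, 3)]
blank_fields = slice_fixed_width("", colspecs)   # slicing an empty line
data_fields = slice_fixed_width("A B", colspecs)

print(blank_fields, looks_empty_csv_style(blank_fields), looks_empty_fixed_width(blank_fields))
print(data_fields, looks_empty_csv_style(data_fields), looks_empty_fixed_width(data_fields))
```

The fixed-width rule here also treats fields made only of spaces as blank. Dropping the `.strip()` call would preserve the distinction raised in the edit note above, where a line padded with spaces yields fields of spaces rather than empty strings.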

@dvalters

I believe this bug also extends to read_excel and read_csv for files that have 'empty' trailing lines, so it is more general than the read_fwf function alone.

Code sample

import pandas as pd
from io import StringIO

csv_f = StringIO('''A,B,C,D
FOO,1,2,3
FOO,4,5,6
,,,
FOO,7,8,9
,10,11,12
,,,
,,,
,,,
,,,
,,,
''')

df = pd.read_csv(csv_f, header=None, skip_blank_lines=True)
print(df)

Output

      0    1    2    3
0     A    B    C    D
1   FOO    1    2    3
2   FOO    4    5    6
3   NaN  NaN  NaN  NaN
4   FOO    7    8    9
5   NaN   10   11   12
6   NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  NaN
8   NaN  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN
10  NaN  NaN  NaN  NaN

With read_excel, similar behaviour is exhibited when skip_blank_lines=True (which is the default anyway according to the docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#na-and-missing-data-handling) if the workbook has 'blank' rows containing any sort of formula that evaluates to a null string or blank cell (as opposed to a truly empty cell).

@dvalters

Also possibly related to: #10164

@jbrockmendel added the IO Fixed Width (read_fwf) label and removed the IO Data (IO issues that don't fit into a more specific label) label on Dec 11, 2019
@dcdenu4

dcdenu4 commented Jun 4, 2020

Any updates on this? I'm still seeing this behavior in pandas 1.0.3

from io import StringIO
import pandas

csv_file = StringIO('''lucode,desc,val1,val2
1,corn,0.5,2
,,,
3,peas,1,-2
''')
df = pandas.read_csv(csv_file, skip_blank_lines=True)
print(df)
   lucode  desc  val1  val2
0     1.0  corn   0.5   2.0
1     NaN   NaN   NaN   NaN
2     3.0  peas   1.0  -2.0

@jreback
Contributor

jreback commented Jun 4, 2020

@dcdenu4 or anyone can submit a PR
pandas is all volunteer and we have 3000+ open issues

@mroeschke
Member

mroeschke commented Jun 22, 2021

This looks to work on master now. Could use a test

In [1]: from io import StringIO
   ...: import pandas as pd
   ...:
   ...: f = StringIO('''A B
   ...:
   ...: C D''')
   ...:
   ...: df = pd.read_fwf(f, colspecs=[(0, 1), (2,3)], header=None, skip_blank_lines=True)
   ...: print(df)
   0  1
0  A  B
1  C  D

@mroeschke added the good first issue and Needs Tests (Unit test(s) needed to prevent regressions) labels and removed the Bug and IO Fixed Width (read_fwf) labels on Jun 22, 2021
@lucnguyen93

take

@mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022
@jnclt
Contributor

jnclt commented Dec 2, 2022

As mentioned above, the original issue isn't reproducible anymore and there is a test covering this case (test_fwf_skip_blank_lines):

def test_fwf_skip_blank_lines():

As for read_csv, I think the behavior is as expected (NaN for comma-separated blank values).

I guess this issue can be closed?
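
The distinction drawn above for read_csv, between a truly blank line and a line that contains only delimiters, can be illustrated with the standard-library csv module. This is a sketch under stated assumptions: it uses csv.reader rather than the pandas parser, so it shows the idea and not how pandas implements skip_blank_lines, and the sample text adds one truly blank line to the data from the earlier comment for contrast.

```python
import csv
from io import StringIO

# Sample from the thread, plus one truly blank line added for contrast.
text = "lucode,desc,val1,val2\n1,corn,0.5,2\n\n,,,\n3,peas,1,-2\n"
rows = list(csv.reader(StringIO(text)))

# A blank line parses to an empty list; a delimiter-only line parses to empty fields.
blank = [row for row in rows if row == []]
delimiter_only = [row for row in rows if row and all(cell == "" for cell in row)]

# Keep only rows that carry at least one non-empty value.
has_data = [row for row in rows if any(cell != "" for cell in row)]

print(blank)
print(delimiter_only)
print(has_data)
```

Under this reading, a ",,," line is a row of four missing values rather than a blank line, which matches the NaN row seen in the read_csv output earlier in the thread.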

@mroeschke
Member

Yeah looks like test_fwf_skip_blank_lines tests this so closing
