read_fwf: skip_blank_lines does nothing #22693

ivanradicek · 2018-09-13T13:50:10Z

Code Sample, a copy-pastable example if possible

from io import StringIO
import pandas as pd

f = StringIO('''A B

C D''')

df = pd.read_fwf(f, colspecs=[(0, 1), (2,3)], header=None, skip_blank_lines=True)
print(df)

Problem description

Output:

     0    1
0    A    B
1  NaN  NaN
2    C    D

The blank second line is not skipped; instead, the result contains a row of two NaN values. It seems that skip_blank_lines has no effect on read_fwf. By contrast, read_csv(f, sep=' ', header=None) produces the expected output below.

Expected Output

   0  1
0  A  B
1  C  D

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.1
pytest: 3.6.2
pip: 18.0
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.5.5
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


WillAyd · 2018-09-13T16:31:54Z

Thanks for the report. Investigation and PRs are certainly welcome.

danpere · 2018-09-13T18:59:14Z

The bug appears to be in pandas/io/parsers.py, in the way _remove_empty_lines() is used with fixed-width files (and probably with whitespace-delimited files, for that matter). It sees that the line is ['', ''], which in a CSV would correspond to the string "," and is therefore non-empty, but in a fixed-width file the same result can occur when the line is empty. ~~Arguably, if the line really were spaces out to the last field, empty strings might be the right thing to extract.~~ (Edit: Never mind, then the fields would be a series of spaces, not empty.) I'm not sure whether the right fix is to change _remove_empty_lines() or its usages.
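To illustrate the ambiguity described above, here is a minimal sketch in plain Python. The helpers `split_fixed_width` and `split_delimited` are hypothetical names made up for this example; they are not pandas internals and do not reproduce what `_remove_empty_lines()` actually does. The sketch only shows that, once a line has been split into fields, a truly blank fixed-width line and a CSV line of empty fields look the same.

```python
# Hypothetical helpers for illustration only; these are not pandas internals.
COLSPECS = [(0, 1), (2, 3)]


def split_fixed_width(line, colspecs):
    """Slice a line into fields by (start, stop) column positions."""
    return [line[start:stop] for start, stop in colspecs]


def split_delimited(line, sep=","):
    """Split a delimited line into fields."""
    return line.split(sep)


blank_fwf_row = split_fixed_width("", COLSPECS)       # truly blank line
empty_csv_row = split_delimited(",")                  # two empty fields
spaces_fwf_row = split_fixed_width("   ", COLSPECS)   # line padded with spaces

# The first two are indistinguishable after splitting; the third is not.
print(blank_fwf_row, empty_csv_row, spaces_fwf_row)
```

A check that only asks whether the list of fields is empty cannot tell the first two cases apart, which is consistent with the behaviour reported in this issue.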

dvalters · 2019-07-11T08:19:47Z

I believe this bug also extends to read_excel and read_csv in files that have 'empty' trailing lines, and is more generic than just the read_fwf function.

Code sample

import pandas as pd
from io import StringIO

csv_f = StringIO('''A,B,C,D
FOO,1,2,3
FOO,4,5,6
,,,
FOO,7,8,9
,10,11,12
,,,
,,,
,,,
,,,
,,,
'''
)

df = pd.read_csv(csv_f, header=None, skip_blank_lines=True)
print(df)

Output

      0    1    2    3
0     A    B    C    D
1   FOO    1    2    3
2   FOO    4    5    6
3   NaN  NaN  NaN  NaN
4   FOO    7    8    9
5   NaN   10   11   12
6   NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  NaN
8   NaN  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN
10  NaN  NaN  NaN  NaN

With read_excel, similar behaviour is exhibited when skip_blank_lines=True if the workbook has 'blank' lines containing any sort of formula that results in a null string or a blank (but not empty) cell. According to the docs, skip_blank_lines defaults to True anyway: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#na-and-missing-data-handling

dvalters · 2019-07-11T08:22:21Z

Also possibly related to: #10164

dcdenu4 · 2020-06-04T12:55:40Z

Any updates on this? I'm still seeing this behavior in pandas 1.0.3

from io import StringIO
import pandas
csv_file = StringIO('''lucode,desc,val1,val2
1,corn,0.5,2
,,,
3,peas,1,-2
'''
)
df = pandas.read_csv(csv_file, skip_blank_lines=True)
df
   lucode  desc  val1  val2
0     1.0  corn   0.5   2.0
1     NaN   NaN   NaN   NaN
2     3.0  peas   1.0  -2.0

jreback · 2020-06-04T14:37:32Z

@dcdenu4 or anyone can submit a PR.
pandas is all volunteer and we have 3000+ open issues.

mroeschke · 2021-06-22T05:01:15Z

This looks to work on master now. Could use a test

In [1]: from io import StringIO
   ...: import pandas as pd
   ...:
   ...: f = StringIO('''A B
   ...:
   ...: C D''')
   ...:
   ...: df = pd.read_fwf(f, colspecs=[(0, 1), (2,3)], header=None, skip_blank_lines=True)
   ...: print(df)
   0  1
0  A  B
1  C  D

lucnguyen93 · 2022-05-30T20:31:02Z

take

jnclt · 2022-12-02T17:53:06Z

As mentioned above, the original issue isn't reproducible anymore and there is a test covering this case (test_fwf_skip_blank_lines):

pandas/pandas/tests/io/parser/test_read_fwf.py

Line 354 in 36dcf51

def test_fwf_skip_blank_lines():
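For readers who do not have the repository at hand, a regression check for the original report might look roughly like the sketch below. This is not the body of test_fwf_skip_blank_lines from the pandas test suite; the function name `check_fwf_skips_blank_lines` is made up for this example, and the data is simply the sample from the top of this issue. It assumes a pandas version in which the fix is present.

```python
from io import StringIO

import pandas as pd


def check_fwf_skips_blank_lines():
    """Hypothetical check: the blank middle line should not become a NaN row."""
    data = StringIO("A B\n\nC D")
    df = pd.read_fwf(
        data,
        colspecs=[(0, 1), (2, 3)],
        header=None,
        skip_blank_lines=True,
    )
    return df


result = check_fwf_skips_blank_lines()
print(result)
```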

As for read_csv, I think the behavior is as expected (NaN for comma-separated blank values).

I guess this issue can be closed?
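Regarding the read_csv remark above: a short sketch of the distinction, using data modelled on the earlier example in this thread with one truly blank line added. The `dropna(how="all")` step is one possible workaround for anyone who also wants rows consisting only of empty fields removed; it is a suggestion, not something prescribed in this thread.

```python
from io import StringIO

import pandas as pd

data = (
    "lucode,desc,val1,val2\n"
    "1,corn,0.5,2\n"
    "\n"          # truly blank line: skipped by skip_blank_lines=True
    ",,,\n"       # line of empty fields: parsed as a row of missing values
    "3,peas,1,-2\n"
)

df = pd.read_csv(StringIO(data), skip_blank_lines=True)

# Drop rows in which every value is missing, then renumber the index.
cleaned = df.dropna(how="all").reset_index(drop=True)
print(cleaned)
```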

mroeschke · 2023-03-09T17:23:17Z

Yeah looks like test_fwf_skip_blank_lines tests this so closing

WillAyd added the Bug and IO Data labels Sep 13, 2018

WillAyd added this to the Contributions Welcome milestone Sep 13, 2018

xiejxie mentioned this issue Sep 27, 2018

Fix blank line skipping in read_fwf, issue 22693 #22849

Closed

4 tasks

jbrockmendel added the IO Fixed Width label and removed the IO Data label Dec 11, 2019

mroeschke added the good first issue and Needs Tests labels and removed the Bug and IO Fixed Width labels Jun 22, 2021

github-actions bot assigned lucnguyen93 May 30, 2022

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

phofl mentioned this issue Nov 16, 2022

TST: Fixed issues that need tests noatamir/pyladies-berlin-sprints#3

Open

17 tasks

This was referenced Jan 17, 2023

22693 test skip blank rows read csv #50798

Closed

adding test to check if rows are skipped when skip_blank_lines is set… #50843

Closed

mroeschke closed this as completed Mar 9, 2023
