ENH: read_xml handling of bad lines #59384

davetapley · 2024-08-01T17:05:18Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

Be able to read_xml and skip non-parseable lines.

E.g.

With:

<gage_rain id="" last_rpt="-999 -999" min_10="-999" min_30="-999" hour_1="-999" hour_3="-999" hour_6="-999" day_1="-999" day_3="-999" day_7="-999" day_30="-999" ytd="-999" null="-999" name="" lat=" -999" long="--999 " updated="2024-07-31 19:40:00" m1="-999" m2="-999" m3="-999" m4="-999" m5="-999" m6="-999" m7="-999" m8="-999" m9="-999" m10="-999" m11="-999" m12="-999"/>
<gage_rain id="470" last_rpt="2024-07-31 11:58:03" min_10="0.00" min_30="0.00" hour_1="0.00" hour_3="0.00" hour_6="0.00" day_1="0.00" day_3="0.00" day_7="0.67" day_30="1.93" ytd="12.25" null="-999" name="Lee Butte Precipitation" lat="34.83403" long="-111.53714" updated="2024-07-31 19:40:00" m1="1.93" m2="0.00" m3="1.45" m4="2.95" m5="1.54" m6="1.97" m7="0.86" m8="0.87" m9="0.00" m10="0.71" m11="2.87" m12="2.44"/>

If I:

dtype = {'id': str, 'lat': pd.Float32Dtype, 'long': pd.Float32Dtype}
df = pd.read_xml('fcdyc_alert_rain.xml', dtype=dtype)

I get:

  File "lib.pyx", line 2391, in pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "--999

Feature Description

#15122 but for read_xml

Alternative Solutions

read_xml with no dtype kwarg, and manually manipulate the DataFrame afterwards.
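
A minimal sketch of this workaround, assuming a cut-down version of the sample above wrapped in a hypothetical `<data>` root element (the real file's root is not shown in this issue). The columns are read without `dtype`, then coerced with `pd.to_numeric(..., errors="coerce")` so that unparseable values such as `--999 ` become missing values instead of raising:

```python
from io import StringIO

import pandas as pd

# Cut-down sample; the <data> root element is an assumption for illustration.
xml = """<data>
  <gage_rain id="" name="" lat=" -999" long="--999 "/>
  <gage_rain id="470" name="Lee Butte Precipitation" lat="34.83403" long="-111.53714"/>
</data>"""

# No dtype kwarg: columns that cannot be parsed as numbers stay as strings.
# parser="etree" uses the standard library, so lxml is not required.
df = pd.read_xml(StringIO(xml), parser="etree")

# Coerce afterwards: values that cannot be parsed become missing (NA).
for col in ["lat", "long"]:
    df[col] = pd.to_numeric(df[col], errors="coerce").astype("Float32")

print(df[["lat", "long"]])
```

This keeps every row and marks only the bad cells as missing, which is different from skipping the whole line.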

Additional Context

No response


rhshadrach · 2024-08-03T15:40:41Z

Thanks for the request. I'm open to the addition of an errors argument as in read_csv, provided the implementation is straightforward (I haven't checked). However, if this adds anything more than negligible complexity to the algorithm, I think we should reevaluate it.

jahn96 · 2024-08-04T12:55:17Z

take

jahn96 · 2024-08-19T13:37:54Z

@davetapley One clarifying question: read_csv has an option to specify what to do when it encounters a bad line, but according to the documentation, a bad line there means a line with too many fields, not a line with a non-parseable value. Could you clarify what your expectation is? Also, could you try your example again? I couldn't reproduce your issue with the same error. Thanks!

jahn96 · 2024-08-20T20:18:51Z

@davetapley Also, this seems to be more of an issue with the data than with the XML parser itself, since --999 can't be a float. Is your request to have custom error handling for these data conversion errors?

davetapley added the Enhancement and Needs Triage labels Aug 1, 2024

rhshadrach added the IO XML label and removed the Needs Triage label Aug 3, 2024

github-actions bot assigned jahn96 Aug 4, 2024
