ENH: read_xml handling of bad lines #59384

Open
1 of 3 tasks
davetapley opened this issue Aug 1, 2024 · 4 comments
Assignees
Labels
Enhancement IO XML read_xml, to_xml

Comments

@davetapley
Contributor

davetapley commented Aug 1, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Allow read_xml to skip rows whose values cannot be parsed.

E.g.

With:

<gage_rain id="" last_rpt="-999 -999" min_10="-999" min_30="-999" hour_1="-999" hour_3="-999" hour_6="-999" day_1="-999" day_3="-999" day_7="-999" day_30="-999" ytd="-999" null="-999" name="" lat=" -999" long="--999 " updated="2024-07-31 19:40:00" m1="-999" m2="-999" m3="-999" m4="-999" m5="-999" m6="-999" m7="-999" m8="-999" m9="-999" m10="-999" m11="-999" m12="-999"/>
<gage_rain id="470" last_rpt="2024-07-31 11:58:03" min_10="0.00" min_30="0.00" hour_1="0.00" hour_3="0.00" hour_6="0.00" day_1="0.00" day_3="0.00" day_7="0.67" day_30="1.93" ytd="12.25" null="-999" name="Lee Butte Precipitation" lat="34.83403" long="-111.53714" updated="2024-07-31 19:40:00" m1="1.93" m2="0.00" m3="1.45" m4="2.95" m5="1.54" m6="1.97" m7="0.86" m8="0.87" m9="0.00" m10="0.71" m11="2.87" m12="2.44"/>

If I:

import pandas as pd

dtype = {'id': str, 'lat': pd.Float32Dtype, 'long': pd.Float32Dtype}
df = pd.read_xml('fcdyc_alert_rain.xml', dtype=dtype)

I get:

  File "lib.pyx", line 2391, in pandas._libs.lib.maybe_convert_numeric
ValueError: Unable to parse string "--999 

Feature Description

#15122 but for read_xml

Alternative Solutions

Call read_xml without the dtype kwarg and convert the affected columns manually afterwards.
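For anyone landing here, the workaround might look roughly like the sketch below. It is only an illustration under some assumptions: the `<rain>` wrapper element and the trimmed attribute list are made up (the issue shows only the `<gage_rain>` rows), and `parser="etree"` is used just so the snippet does not depend on lxml.

```python
from io import StringIO

import pandas as pd

# Trimmed-down stand-in for the feed above. The <rain> root element is an
# assumption; the issue only shows the <gage_rain> rows.
xml = """<rain>
  <gage_rain id="" name="" lat=" -999" long="--999 "/>
  <gage_rain id="470" name="Lee Butte Precipitation" lat="34.83403" long="-111.53714"/>
</rain>"""

# Read without dtype for the problem columns, so no numeric conversion
# is forced at parse time.
df = pd.read_xml(StringIO(xml), parser="etree")

# Convert afterwards; errors="coerce" turns unparseable strings into missing values.
for col in ["lat", "long"]:
    df[col] = pd.to_numeric(df[col], errors="coerce").astype("Float32")

# Optionally drop the rows whose longitude could not be parsed.
clean = df.dropna(subset=["long"])
print(clean)
```

This keeps the bad row available for inspection in `df` while `clean` holds only the rows that converted.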

Additional Context

No response

@davetapley davetapley added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 1, 2024
@rhshadrach rhshadrach added IO XML read_xml, to_xml and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 3, 2024
@rhshadrach
Member

Thanks for the request. I'm open to adding an errors argument as in read_csv, provided the implementation is straightforward (I haven't checked). If it adds more than negligible complexity to the algorithm, however, I think we should reevaluate it.

@jahn96
Contributor

jahn96 commented Aug 4, 2024

take

@jahn96
Contributor

jahn96 commented Aug 19, 2024

@davetapley One clarifying question: read_csv has an option to specify what to do when it encounters a bad line, but according to the documentation a bad line is one with too many fields, not one with a non-parseable value. Could you clarify what you expect here? Also, could you try your example again? I couldn't reproduce the same error. Thanks!
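To make the distinction concrete, here is a small sketch of how the read_csv option (`on_bad_lines`) treats the two cases. The CSV content is invented for illustration and is not from the original feed.

```python
from io import StringIO

import pandas as pd

# Row 471 has an extra field; row 472 has a value that is not a valid number.
csv = "id,long\n470,-111.53714\n471,-110.5,extra\n472,--999 \n"

# on_bad_lines="skip" drops the row with too many fields...
df = pd.read_csv(StringIO(csv), on_bad_lines="skip")

# ...but the row with the unparseable value is kept, and the column is
# simply not inferred as numeric.
print(df)
print(df["long"].dtype)
```

So `on_bad_lines` addresses malformed rows, not values that fail dtype conversion, which is the case reported in this issue.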

@jahn96
Contributor

jahn96 commented Aug 20, 2024

@davetapley Also, this seems to be more of an issue with the data than with the XML parser itself, since --999 can't be converted to a float. Is your request to have custom error handling for these data conversion errors?


3 participants