Regression in 1.10.0: ComputeError when parsing quoted string #19432

Closed
2 tasks done
mihai-afternet opened this issue Oct 24, 2024 · 5 comments
Labels
invalid A bug report that is not actually a bug needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@mihai-afternet

mihai-afternet commented Oct 24, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Fails:

import polars as pl
import io

data = '''Name
"test test" test
another name
'''

df = pl.read_csv(io.StringIO(data))

Works:

import polars as pl
import io

data = '''Name
"test test" test
another name
'''

df = pl.read_csv(io.StringIO(data), quote_char=None)

Or:

import polars as pl
import io

df = pl.DataFrame({
    "name": ['"test test" test']
})

csv_buffer = io.StringIO()
df.write_csv(csv_buffer)
csv_buffer.seek(0)

df_from_csv = pl.read_csv(csv_buffer)

Log output

pydf = PyDataFrame.read_csv(
           ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: could not parse `"test test" test` as dtype `str` at column 'Name' (column number 1)

The current offset in the file is 5 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `"test test" test` to the `null_values` list.

Original error:  csv file

Field `"test test" test` is not properly escaped.

Issue description

I encountered a ComputeError when attempting to parse a string in a CSV column using Polars version 1.10.0. The string in question is "test test" test, which should be parsed as a valid string. This issue did not occur in previous versions of Polars, making it a regression introduced in 1.10.0.

I can use the quote_char=None parameter to overcome the issue.

Expected behavior

The string <"test test" test> should be successfully parsed as a valid string in the 'Name' column without raising an error.

Installed versions

--------Version info---------
Polars:              1.11.0
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.12.0 (tags/v3.12.0:0fb18b0, Oct  2 2023, 13:03:39) [MSC v.1935 64 bit (AMD64)]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.9.0
gevent               <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         1.6.0
numpy                2.1.1
openpyxl             <not installed>
pandas               2.2.2
pyarrow              <not installed>
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@mihai-afternet mihai-afternet added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Oct 24, 2024
@coastalwhite
Collaborator

This was caused by #19124. @ritchie46 can you have a look?

@ritchie46
Member

ritchie46 commented Oct 25, 2024

It's actually a correct error. The value is incorrectly escaped. It should be """test test"" test", enclosing the entire field in " and doubling (escaping) internal ".

Previously we read it incorrectly, so this was a bug fix.
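
To illustrate the escaping rule described above, here is a minimal sketch using Python's standard-library csv module rather than Polars. It is an assumption that the stdlib writer is an acceptable stand-in for demonstration purposes; the variable names (raw_value, escaped_csv, round_tripped) are hypothetical. It shows how a field containing a double quote is enclosed in quotes with its internal quotes doubled, and that reading the result back recovers the original value.

```python
import csv
import io

# The raw value from the report, containing embedded double quotes.
raw_value = '"test test" test'

# Write it with the stdlib CSV writer (default QUOTE_MINIMAL):
# a field that contains the quote character is wrapped in quotes
# and every internal quote is doubled.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Name"])
writer.writerow([raw_value])
escaped_csv = buffer.getvalue()
print(escaped_csv)

# Read the properly escaped text back and recover the original value.
rows = list(csv.reader(io.StringIO(escaped_csv)))
round_tripped = rows[1][0]
print(round_tripped)
```

The second line of escaped_csv is the form quoted in this comment: the whole field enclosed in " with each internal " doubled.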

@ritchie46 ritchie46 added invalid A bug report that is not actually a bug and removed bug Something isn't working labels Oct 25, 2024
@Filimoa

Filimoa commented Oct 25, 2024

Is there a workaround? I imagine many people don't have control over the underlying data so this makes it impossible to read certain datasets with polars.

@ritchie46
Member

Yes, set a different quote character if the data isn't quoted properly. The error message gives a few tips.

Reading it in with the quote char set to " isn't an option. It's invalid CSV.

@DeflateAwning
Contributor

In summary, setting quote_char=None allows this to work as it previously did.
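
As a rough analogy for what disabling quote handling means, here is a sketch using Python's standard-library csv module with quoting=csv.QUOTE_NONE. This is not the Polars reader, and treating it as comparable to quote_char=None is an assumption made for illustration only. With quote processing turned off, the double quotes are ordinary characters and stay in the parsed value.

```python
import csv
import io

# Same data as in the report: the first row is not valid quoted CSV.
data = 'Name\n"test test" test\nanother name\n'

# QUOTE_NONE tells the stdlib reader to do no special processing of
# quote characters, so they are kept literally in the field.
rows = list(csv.reader(io.StringIO(data), quoting=csv.QUOTE_NONE))
header = rows[0]
values = [row[0] for row in rows[1:]]
print(header, values)
```

Note that the quotes remain part of the value, so any downstream code sees "test test" test including the quote characters.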
