CSV reader incorrectly splits row in two pieces on quoted newline #19078

Closed
orlp opened this issue Oct 2, 2024 · 5 comments · Fixed by #19088
Labels: A-io-csv (Area: reading/writing CSV files), accepted (Ready for implementation), bug (Something isn't working), P-high (Priority: high)

orlp (Collaborator) commented Oct 2, 2024

Given the following test.csv, if we try to load it as follows:

import polars as pl

dtypes = {
    "Name": pl.Utf8,
    "Address": pl.Utf8,
    "Email": pl.Utf8,
    "Phonenumber": pl.Utf8,
    "Date_of_birth": pl.Date,
    "Company": pl.Utf8,
    "Job": pl.Utf8,
    "IBAN": pl.Utf8,
    "Creditcard": pl.Int64,
    "Creation_date": pl.Date,
}

df = pl.read_csv("test.csv", schema_overrides=dtypes, separator=";", ignore_errors=True)

We notice that the 5002nd entry is split across two rows:

>>> df[5001:5004]
shape: (3, 10)
┌───────────────┬───────────────┬───────────────┬───────────────┬───┬───────────────┬───────────────┬───────────────┬───────────────┐
│ Name          ┆ Address       ┆ Email         ┆ Phonenumber   ┆ … ┆ Job           ┆ IBAN          ┆ Creditcard    ┆ Creation_date │
│ ---           ┆ ---           ┆ ---           ┆ ---           ┆   ┆ ---           ┆ ---           ┆ ---           ┆ ---           │
│ str           ┆ str           ┆ str           ┆ str           ┆   ┆ str           ┆ str           ┆ i64           ┆ date          │
╞═══════════════╪═══════════════╪═══════════════╪═══════════════╪═══╪═══════════════╪═══════════════╪═══════════════╪═══════════════╡
│ Kelsey Long   ┆ 679 Lindsay   ┆ jessicagarcia ┆ 566-408-9606x ┆ … ┆ Designer, cer ┆ GB70EVGJ62649 ┆ 2131417501452 ┆ 2009-06-22    │
│               ┆ Drive Suite   ┆ @example.com  ┆ 633           ┆   ┆ amics/pottery ┆ 904070295     ┆ 57            ┆               │
│               ┆ 413           ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│               ┆ Ro…           ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│ Christopher   ┆ 5240 Williams ┆ null          ┆ null          ┆ … ┆ null          ┆ null          ┆ null          ┆ null          │
│ Collins       ┆ Forge Suite   ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│               ┆ 570           ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│ Port Gary, WY ┆ kennedybarbar ┆ (578)971-0366 ┆ 1977-07-15    ┆ … ┆ GB18OYSE78644 ┆ 4767664624172 ┆ null          ┆ null          │
│ 31010"        ┆ a@example.org ┆ x10767        ┆               ┆   ┆ 896429599     ┆ 615290        ┆               ┆               │
└───────────────┴───────────────┴───────────────┴───────────────┴───┴───────────────┴───────────────┴───────────────┴───────────────┘

But in the source CSV there appears to be no problem:

"Kelsey Long";"679 Lindsay Drive Suite 413
Rogersfort, OR 90448";"jessicagarcia@example.com";"566-408-9606x633";"1992-08-29";"Dixon PLC";"Designer, ceramics/pottery";"GB70EVGJ62649904070295";"213141750145257";"2009-06-22"
"Christopher Collins";"5240 Williams Forge Suite 570
Port Gary, WY 31010";"kennedybarbara@example.org";"(578)971-0366x10767";"1977-07-15";"Reed, Edwards and Nguyen";"Engineer, electrical";"GB18OYSE78644896429599";"4767664624172615290";"1973-09-21"
"Ronnie Giles";"USNS Brown
FPO AP 51442";"julia23@example.net";"001-632-763-2460x0516";"1997-11-28";"Holt-Hale";"Patent attorney";"GB19MXNA87198353574367";"30192887443942";"1993-11-30"
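
One way to double-check that these records are well-formed, independent of Polars, is to run them through Python's standard `csv` module. The snippet below is only a sanity check on the three records quoted above (the variable names are illustrative); it says nothing about how Polars parses the file.

```python
import csv
import io

# The three records from the issue, copied verbatim. Each Address field
# contains a newline inside the quotes.
sample = '''"Kelsey Long";"679 Lindsay Drive Suite 413
Rogersfort, OR 90448";"jessicagarcia@example.com";"566-408-9606x633";"1992-08-29";"Dixon PLC";"Designer, ceramics/pottery";"GB70EVGJ62649904070295";"213141750145257";"2009-06-22"
"Christopher Collins";"5240 Williams Forge Suite 570
Port Gary, WY 31010";"kennedybarbara@example.org";"(578)971-0366x10767";"1977-07-15";"Reed, Edwards and Nguyen";"Engineer, electrical";"GB18OYSE78644896429599";"4767664624172615290";"1973-09-21"
"Ronnie Giles";"USNS Brown
FPO AP 51442";"julia23@example.net";"001-632-763-2460x0516";"1997-11-28";"Holt-Hale";"Patent attorney";"GB19MXNA87198353574367";"30192887443942";"1993-11-30"
'''

# newline="" leaves embedded newlines untouched, as the csv docs recommend.
reader = csv.reader(io.StringIO(sample, newline=""), delimiter=";", quotechar='"')
rows = list(reader)

# A well-formed input yields one list per record, each with ten fields.
for row in rows:
    print(len(row), repr(row[0]), repr(row[1]))
```

If the quoting is intact, this yields three records of ten fields each, with the embedded newline preserved inside the Address value.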

And in fact, if you load just the three records above with the header prepended, it works without problems:

>>> pl.read_csv("small.csv", schema_overrides=dtypes, separator=";")
shape: (3, 10)
┌───────────────┬───────────────┬───────────────┬───────────────┬───┬───────────────┬───────────────┬───────────────┬───────────────┐
│ Name          ┆ Address       ┆ Email         ┆ Phonenumber   ┆ … ┆ Job           ┆ IBAN          ┆ Creditcard    ┆ Creation_date │
│ ---           ┆ ---           ┆ ---           ┆ ---           ┆   ┆ ---           ┆ ---           ┆ ---           ┆ ---           │
│ str           ┆ str           ┆ str           ┆ str           ┆   ┆ str           ┆ str           ┆ i64           ┆ date          │
╞═══════════════╪═══════════════╪═══════════════╪═══════════════╪═══╪═══════════════╪═══════════════╪═══════════════╪═══════════════╡
│ Kelsey Long   ┆ 679 Lindsay   ┆ jessicagarcia ┆ 566-408-9606x ┆ … ┆ Designer, cer ┆ GB70EVGJ62649 ┆ 2131417501452 ┆ 2009-06-22    │
│               ┆ Drive Suite   ┆ @example.com  ┆ 633           ┆   ┆ amics/pottery ┆ 904070295     ┆ 57            ┆               │
│               ┆ 413           ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│               ┆ Ro…           ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│ Christopher   ┆ 5240 Williams ┆ kennedybarbar ┆ (578)971-0366 ┆ … ┆ Engineer,     ┆ GB18OYSE78644 ┆ 4767664624172 ┆ 1973-09-21    │
│ Collins       ┆ Forge Suite   ┆ a@example.org ┆ x10767        ┆   ┆ electrical    ┆ 896429599     ┆ 615290        ┆               │
│               ┆ 570           ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│               ┆ …             ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│ Ronnie Giles  ┆ USNS Brown    ┆ julia23@examp ┆ 001-632-763-2 ┆ … ┆ Patent        ┆ GB19MXNA87198 ┆ 3019288744394 ┆ 1993-11-30    │
│               ┆ FPO AP 51442  ┆ le.net        ┆ 460x0516      ┆   ┆ attorney      ┆ 353574367     ┆ 2             ┆               │
└───────────────┴───────────────┴───────────────┴───────────────┴───┴───────────────┴───────────────┴───────────────┴───────────────┘

orlp (Collaborator, Author) commented Oct 2, 2024

The bug does not occur with n_threads=1 so I believe this to be a bug in how we split up CSV files for parallel reading.

ritchie46 (Member) commented

> The bug does not occur with n_threads=1 so I believe this to be a bug in how we split up CSV files for parallel reading.

Ah yes.. we do some looking around to see if we made a valid split. Will have to increase strictness there. If we cannot find valid splits, we fall back to a single-threaded read.
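
To illustrate why a split can be invalid, here is a minimal pure-Python sketch. It is not Polars's actual implementation, and all function names are hypothetical. It compares a naive splitter, which cuts at the first newline after each evenly spaced offset, with one that tracks quote parity so that cuts only land on real record boundaries.

```python
import bisect

QUOTE = ord('"')
NEWLINE = ord("\n")


def naive_split_points(data: bytes, n_chunks: int) -> list[int]:
    """Cut after the first newline at or past each evenly spaced offset."""
    step = len(data) // n_chunks
    points = []
    for i in range(1, n_chunks):
        nl = data.find(b"\n", i * step)
        if nl != -1 and nl + 1 < len(data):
            points.append(nl + 1)
    return points


def record_starts(data: bytes) -> list[int]:
    """Offsets where a record begins, found by scanning with quote parity."""
    starts = [0]
    in_quotes = False
    for pos, byte in enumerate(data):
        if byte == QUOTE:
            # An escaped quote ("") toggles twice, so parity is unchanged.
            in_quotes = not in_quotes
        elif byte == NEWLINE and not in_quotes:
            starts.append(pos + 1)
    return starts


def quote_aware_split_points(data: bytes, n_chunks: int) -> list[int]:
    """Snap each evenly spaced offset forward to the next record start."""
    starts = record_starts(data)
    step = len(data) // n_chunks
    points = []
    for i in range(1, n_chunks):
        j = bisect.bisect_left(starts, i * step)
        if j < len(starts) and starts[j] < len(data) and starts[j] not in points:
            points.append(starts[j])
    return points


# Small input in which every record contains a quoted newline.
rows = "".join(f'{i},"ABCDE FGHIJ\nKLMNOP"\n' for i in range(10))
data = ("index,foo\n" + rows).encode()

valid = set(record_starts(data))
naive = naive_split_points(data, 4)
bad = [p for p in naive if p not in valid]  # cuts that land inside a quoted field
safe = quote_aware_split_points(data, 4)

print("naive:", naive, "invalid:", bad)
print("quote-aware:", safe)
```

The full scan in `record_starts` is sequential, so it trades away part of the speedup that splitting is meant to provide. A real reader would more plausibly validate candidate splits locally and fall back to a single-threaded read when none can be confirmed, as described above.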

ritchie46 (Member) commented

I cannot reproduce. How many threads do you have, @orlp?

cmdlineluser (Contributor) commented

test.csv works for me also, but this seems to reproduce the problem:

import polars as pl

N = 1041

pl.read_csv(
    pl.DataFrame({"foo": ['ABCDE FGHIJ\nKLMNOP'] * N})
      .with_row_index()
      .write_csv()
      .encode()
)

# ComputeError: could not parse `KLMNOP"` as dtype `i64` at column 'index' (column number 1)

orlp (Collaborator, Author) commented Oct 3, 2024

@ritchie46 10 threads.
