CSV reader incorrectly splits row in two pieces on quoted newline #19078

orlp · 2024-10-02T22:51:42Z

Given the following test.csv, if we try to load it like this:

import polars as pl

dtypes = {
    "Name": pl.Utf8,
    "Address": pl.Utf8,
    "Email": pl.Utf8,
    "Phonenumber": pl.Utf8,
    "Date_of_birth": pl.Date,
    "Company": pl.Utf8,
    "Job": pl.Utf8,
    "IBAN": pl.Utf8,
    "Creditcard": pl.Int64,
    "Creation_date": pl.Date,
}

df = pl.read_csv("test.csv", schema_overrides=dtypes, separator=";", ignore_errors=True)

We notice that the 5002nd entry is broken up across two rows:

>>> df[5001:5004]
shape: (3, 10)
┌───────────────┬───────────────┬───────────────┬───────────────┬───┬───────────────┬───────────────┬───────────────┬───────────────┐
│ Name          ┆ Address       ┆ Email         ┆ Phonenumber   ┆ … ┆ Job           ┆ IBAN          ┆ Creditcard    ┆ Creation_date │
│ ---           ┆ ---           ┆ ---           ┆ ---           ┆   ┆ ---           ┆ ---           ┆ ---           ┆ ---           │
│ str           ┆ str           ┆ str           ┆ str           ┆   ┆ str           ┆ str           ┆ i64           ┆ date          │
╞═══════════════╪═══════════════╪═══════════════╪═══════════════╪═══╪═══════════════╪═══════════════╪═══════════════╪═══════════════╡
│ Kelsey Long   ┆ 679 Lindsay   ┆ jessicagarcia ┆ 566-408-9606x ┆ … ┆ Designer, cer ┆ GB70EVGJ62649 ┆ 2131417501452 ┆ 2009-06-22    │
│               ┆ Drive Suite   ┆ @example.com  ┆ 633           ┆   ┆ amics/pottery ┆ 904070295     ┆ 57            ┆               │
│               ┆ 413           ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│               ┆ Ro…           ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│ Christopher   ┆ 5240 Williams ┆ null          ┆ null          ┆ … ┆ null          ┆ null          ┆ null          ┆ null          │
│ Collins       ┆ Forge Suite   ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│               ┆ 570           ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│ Port Gary, WY ┆ kennedybarbar ┆ (578)971-0366 ┆ 1977-07-15    ┆ … ┆ GB18OYSE78644 ┆ 4767664624172 ┆ null          ┆ null          │
│ 31010"        ┆ a@example.org ┆ x10767        ┆               ┆   ┆ 896429599     ┆ 615290        ┆               ┆               │
└───────────────┴───────────────┴───────────────┴───────────────┴───┴───────────────┴───────────────┴───────────────┴───────────────┘

But in the source CSV there appears to be no problem:

"Kelsey Long";"679 Lindsay Drive Suite 413
Rogersfort, OR 90448";"jessicagarcia@example.com";"566-408-9606x633";"1992-08-29";"Dixon PLC";"Designer, ceramics/pottery";"GB70EVGJ62649904070295";"213141750145257";"2009-06-22"
"Christopher Collins";"5240 Williams Forge Suite 570
Port Gary, WY 31010";"kennedybarbara@example.org";"(578)971-0366x10767";"1977-07-15";"Reed, Edwards and Nguyen";"Engineer, electrical";"GB18OYSE78644896429599";"4767664624172615290";"1973-09-21"
"Ronnie Giles";"USNS Brown
FPO AP 51442";"julia23@example.net";"001-632-763-2460x0516";"1997-11-28";"Holt-Hale";"Patent attorney";"GB19MXNA87198353574367";"30192887443942";"1993-11-30"

And in fact, if you try to load just the three lines above with the header prepended, it works without problems:

>>> pl.read_csv("small.csv", schema_overrides=dtypes, separator=";")
shape: (3, 10)
┌───────────────┬───────────────┬───────────────┬───────────────┬───┬───────────────┬───────────────┬───────────────┬───────────────┐
│ Name          ┆ Address       ┆ Email         ┆ Phonenumber   ┆ … ┆ Job           ┆ IBAN          ┆ Creditcard    ┆ Creation_date │
│ ---           ┆ ---           ┆ ---           ┆ ---           ┆   ┆ ---           ┆ ---           ┆ ---           ┆ ---           │
│ str           ┆ str           ┆ str           ┆ str           ┆   ┆ str           ┆ str           ┆ i64           ┆ date          │
╞═══════════════╪═══════════════╪═══════════════╪═══════════════╪═══╪═══════════════╪═══════════════╪═══════════════╪═══════════════╡
│ Kelsey Long   ┆ 679 Lindsay   ┆ jessicagarcia ┆ 566-408-9606x ┆ … ┆ Designer, cer ┆ GB70EVGJ62649 ┆ 2131417501452 ┆ 2009-06-22    │
│               ┆ Drive Suite   ┆ @example.com  ┆ 633           ┆   ┆ amics/pottery ┆ 904070295     ┆ 57            ┆               │
│               ┆ 413           ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│               ┆ Ro…           ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│ Christopher   ┆ 5240 Williams ┆ kennedybarbar ┆ (578)971-0366 ┆ … ┆ Engineer,     ┆ GB18OYSE78644 ┆ 4767664624172 ┆ 1973-09-21    │
│ Collins       ┆ Forge Suite   ┆ a@example.org ┆ x10767        ┆   ┆ electrical    ┆ 896429599     ┆ 615290        ┆               │
│               ┆ 570           ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│               ┆ …             ┆               ┆               ┆   ┆               ┆               ┆               ┆               │
│ Ronnie Giles  ┆ USNS Brown    ┆ julia23@examp ┆ 001-632-763-2 ┆ … ┆ Patent        ┆ GB19MXNA87198 ┆ 3019288744394 ┆ 1993-11-30    │
│               ┆ FPO AP 51442  ┆ le.net        ┆ 460x0516      ┆   ┆ attorney      ┆ 353574367     ┆ 2             ┆               │
└───────────────┴───────────────┴───────────────┴───────────────┴───┴───────────────┴───────────────┴───────────────┴───────────────┘
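
As a cross-check that does not involve Polars at all, the same three records can be fed to Python's standard `csv` module. This is only a sketch: the records are copied from the excerpt above, the header row is left out, and the variable names are illustrative. If the quoting is well-formed, each record should come back as one row of 10 fields with the newline preserved inside the address.

```python
import csv
import io

# The three records quoted above, copied verbatim (no header row).
data = '''"Kelsey Long";"679 Lindsay Drive Suite 413
Rogersfort, OR 90448";"jessicagarcia@example.com";"566-408-9606x633";"1992-08-29";"Dixon PLC";"Designer, ceramics/pottery";"GB70EVGJ62649904070295";"213141750145257";"2009-06-22"
"Christopher Collins";"5240 Williams Forge Suite 570
Port Gary, WY 31010";"kennedybarbara@example.org";"(578)971-0366x10767";"1977-07-15";"Reed, Edwards and Nguyen";"Engineer, electrical";"GB18OYSE78644896429599";"4767664624172615290";"1973-09-21"
"Ronnie Giles";"USNS Brown
FPO AP 51442";"julia23@example.net";"001-632-763-2460x0516";"1997-11-28";"Holt-Hale";"Patent attorney";"GB19MXNA87198353574367";"30192887443942";"1993-11-30"
'''

# csv.reader tracks quote state across physical lines, so a newline
# inside a quoted field does not end the record.
records = list(csv.reader(io.StringIO(data, newline=""), delimiter=";"))

for record in records:
    # Print the name, the field count, and the address with its embedded newline.
    print(record[0], len(record), repr(record[1]))
```

A reader that keeps quote state from the start of the input sees three records here, which is consistent with the single-file load above succeeding.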


orlp · 2024-10-02T22:58:57Z

The bug does not occur with n_threads=1 so I believe this to be a bug in how we split up CSV files for parallel reading.

ritchie46 · 2024-10-03T06:10:04Z

> The bug does not occur with n_threads=1 so I believe this to be a bug in how we split up CSV files for parallel reading.

Ah yes... we do some looking around to see if we made a valid split. Will have to increase strictness there. If we cannot find valid splits, we fall back to a single-threaded read.
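
To illustrate why choosing a split point is delicate, here is a minimal pure-Python sketch. It is not the Polars implementation: the function names `naive_split` and `next_valid_split` and the sample bytes are hypothetical. The idea is that a newline is only a safe chunk boundary if an even number of quote characters precedes it, and knowing that parity requires information from the start of the input rather than from local context alone.

```python
from typing import Optional

QUOTE = 0x22    # ord('"')
NEWLINE = 0x0A  # ord('\n')


def naive_split(data: bytes, start: int) -> Optional[int]:
    """Offset just past the first newline at or after `start`, ignoring quotes."""
    i = data.find(b"\n", start)
    return None if i == -1 else i + 1


def next_valid_split(data: bytes, start: int) -> Optional[int]:
    """Offset just past the first newline at or after `start` that lies outside quotes.

    Quote parity is computed from the beginning of `data`, so the answer does
    not depend on guessing from nearby bytes. A doubled quote ("") inside a
    field toggles the state twice and leaves the parity unchanged.
    """
    in_quotes = data[:start].count(b'"') % 2 == 1
    for i in range(start, len(data)):
        byte = data[i]
        if byte == QUOTE:
            in_quotes = not in_quotes
        elif byte == NEWLINE and not in_quotes:
            return i + 1
    return None


# Hypothetical two-record input with a newline inside a quoted field.
data = b'"A";"1 Main St\nTown";"x"\n"B";"2 Side St\nCity";"y"\n'

bad = naive_split(data, 10)        # lands in the middle of record "A"
good = next_valid_split(data, 10)  # lands at the start of record "B"
print(data[bad:bad + 5], data[good:good + 3])
```

With the naive split the second chunk begins with `Town"`, the same kind of dangling fragment seen in the broken output earlier in this thread, while the parity-aware split begins at a real record boundary.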

ritchie46 · 2024-10-03T07:45:20Z

I cannot reproduce. How many threads have you got, @orlp?

cmdlineluser · 2024-10-03T08:50:34Z

test.csv works for me also, but this seems to reproduce the problem:

import polars as pl

N = 1041

pl.read_csv(
    pl.DataFrame({"foo": ['ABCDE FGHIJ\nKLMNOP'] * N})
      .with_row_index()
      .write_csv()
      .encode()
)

# ComputeError: could not parse `KLMNOP"` as dtype `i64` at column 'index' (column number 1)
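
The shape of that error message fits a chunk that starts in the middle of a quoted field. The following standard-library sketch shows the effect on a two-row sample. It assumes the generated CSV has the header `index,foo` with the multi-line value quoted; the helper `first_column_is_int` is illustrative and not part of any library.

```python
import csv
import io

# Assumed layout of the generated CSV, shortened to two data rows.
text = 'index,foo\n0,"ABCDE FGHIJ\nKLMNOP"\n1,"ABCDE FGHIJ\nKLMNOP"\n'

# Quote-aware parsing keeps every record whole.
records = list(csv.reader(io.StringIO(text, newline="")))

# Treating every physical line as a record ignores the quotes.
physical_lines = text.splitlines()


def first_column_is_int(line: str) -> bool:
    """True if the text before the first comma parses as an integer."""
    try:
        int(line.split(",")[0])
        return True
    except ValueError:
        return False


# Lines whose first column cannot be an integer index.
broken = [line for line in physical_lines[1:] if not first_column_is_int(line)]
print(len(records), len(physical_lines), broken)
```

The fragments collected in `broken` are the tail ends of the quoted values, each of which would be read as the first column (`index`) of a new row if a chunk began there.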

orlp · 2024-10-03T09:14:06Z

@ritchie46 10 threads.

orlp added the bug (Something isn't working), P-high (Priority: high), and A-io-csv (Area: reading/writing CSV files) labels Oct 2, 2024

github-project-automation bot added this to Backlog Oct 2, 2024

github-project-automation bot moved this to Ready in Backlog Oct 2, 2024

orlp added the accepted (Ready for implementation) label Oct 2, 2024

ritchie46 self-assigned this Oct 3, 2024

ritchie46 mentioned this issue Oct 3, 2024

perf: Use two-pass algorithm for csv to ensure correctness and SIMDize more ~17% #19088

Merged

ritchie46 closed this as completed in #19088 Oct 5, 2024

github-project-automation bot moved this from Ready to Done in Backlog Oct 5, 2024
