Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass through non-UTF8 bytes in lines preprocessor #107

Merged

Conversation

al2o3cr
Copy link
Contributor

@al2o3cr al2o3cr commented May 15, 2022

Attempting to CSV.decode a stream that contains non-UTF8 bytes raises a FunctionClauseError:

     ** (FunctionClauseError) no function clause matching in CSV.Decoding.Preprocessing.Lines.starts_sequence?/5

     The following arguments were given to CSV.Decoding.Preprocessing.Lines.starts_sequence?/5:

         # 1
         <<225, 110, 100, 101, 122>>

         # 2
         "n"

         # 3
         false

         # 4
         44

         # 5
         ""

     Attempted function clauses (showing 5 out of 5):

         defp starts_sequence?(<<34::utf8(), tail::binary()>>, last_token, false, separator, _) when last_token == <<separator::utf8()>>
         defp starts_sequence?(<<34::utf8(), tail::binary()>>, "", false, separator, _)
         defp starts_sequence?(<<34::utf8(), tail::binary()>>, _, quoted, separator, sequence_start)
         defp starts_sequence?(<<head::utf8(), tail::binary()>>, _, quoted, separator, sequence_start)
         defp starts_sequence?("", _, quoted, _, sequence_start)

     code: result = CSV.decode(stream) |> Enum.to_list()
     stacktrace:
       (csv 2.4.1) CSV.Decoding.Preprocessing.Lines.starts_sequence?/5
       (csv 2.4.1) lib/csv/decoding/preprocessing/lines.ex:85: CSV.Decoding.Preprocessing.Lines.start_sequence/3
       (elixir 1.13.0) lib/stream.ex:902: Stream.do_transform_user/6

This makes it impossible to handle encoding errors per-line or use machinery like Decoder's replacement option.

The code that would prevent this crash was accidentally deleted in 4f5069b because it is "unused" for files that only contain valid UTF8.

This PR restores the deleted clause and adds a high-level test; existing tests cover Decoder and Lexer but not the complete pipeline.

Attempting to `CSV.decode` a stream that contains non-UTF8 bytes raises
a `FunctionClauseError`:

```
     ** (FunctionClauseError) no function clause matching in CSV.Decoding.Preprocessing.Lines.starts_sequence?/5

     The following arguments were given to CSV.Decoding.Preprocessing.Lines.starts_sequence?/5:

         # 1
         <<225, 110, 100, 101, 122>>

         # 2
         "n"

         # 3
         false

         # 4
         44

         # 5
         ""

     Attempted function clauses (showing 5 out of 5):

         defp starts_sequence?(<<34::utf8(), tail::binary()>>, last_token, false, separator, _) when last_token == <<separator::utf8()>>
         defp starts_sequence?(<<34::utf8(), tail::binary()>>, "", false, separator, _)
         defp starts_sequence?(<<34::utf8(), tail::binary()>>, _, quoted, separator, sequence_start)
         defp starts_sequence?(<<head::utf8(), tail::binary()>>, _, quoted, separator, sequence_start)
         defp starts_sequence?("", _, quoted, _, sequence_start)

     code: result = CSV.decode(stream) |> Enum.to_list()
     stacktrace:
       (csv 2.4.1) CSV.Decoding.Preprocessing.Lines.starts_sequence?/5
       (csv 2.4.1) lib/csv/decoding/preprocessing/lines.ex:85: CSV.Decoding.Preprocessing.Lines.start_sequence/3
       (elixir 1.13.0) lib/stream.ex:902: Stream.do_transform_user/6
```

This makes it impossible to handle encoding errors per-line or use
machinery like `Decoder`'s `replacement` option.

The code that would prevent this crash was accidentally deleted in beatrichartz@4f5069b
because it is "unused" for files that only contain valid UTF8.
@aezell
Copy link

aezell commented Jun 7, 2022

We are running into this same error with similar data.

@nwai90
Copy link

nwai90 commented Jun 17, 2022

Same here, experienced the issue with non UTF-8 characters

Screenshot 2022-06-17 at 6 00 05 PM

@beatrichartz beatrichartz merged commit 17d7a24 into beatrichartz:master Sep 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants