Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/csv: parser does not preserve \r\n line ending #21201

Closed
ghost opened this issue Jul 28, 2017 · 12 comments
Closed

encoding/csv: parser does not preserve \r\n line ending #21201

ghost opened this issue Jul 28, 2017 · 12 comments
Labels
FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Milestone

Comments

@ghost
Copy link

ghost commented Jul 28, 2017

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.8.1 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/alexhultman/go"
GORACE=""
GOROOT="/usr/lib/golang"
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build738528540=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

Read CSV file using standard CSV parser.

If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.

What did you expect to see?

Preserved line endings aka I want the verbatim data as in the CSV file, not some made up corrupt data.

What did you see instead?

\r\n is read as \n

@ALTree
Copy link
Member

ALTree commented Jul 28, 2017

This report is unclear. Please post an auto-contained code snippet that exhibits the issue.

@ALTree ALTree changed the title CSV parser does not preserve \r\n line ending encoding/csv: parser does not preserve \r\n line ending Jul 28, 2017
@ALTree ALTree added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jul 28, 2017
@ghost
Copy link
Author

ghost commented Jul 31, 2017

package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"strings"
)

func main() {
	in := "\"Hello\r\nHi\""
	r := csv.NewReader(strings.NewReader(in))

	for {
		record, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}

		json, _ := json.Marshal(record[0])
		fmt.Println(string(json))
	}
}

Output:

"Hello\nHi"

@nussjustin
Copy link
Contributor

The function for reading the input explains this behaviour:

// Handle \r\n here. We make the simplifying assumption that
// anytime \r is followed by \n that it can be folded to \n.
// We will not detect files which contain both \r\n and bare \n.

I agree that this is unexpected (and not explicitly tested). It could easily be fixed by not using (*Reader).readRune in the Quoted case of (*Reader).parseField.

I think this is worth fixing.

/cc @pborman Any opinion on this?

@ALTree ALTree added NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Jul 31, 2017
@ALTree ALTree added this to the Go1.10 milestone Jul 31, 2017
@pborman
Copy link
Contributor

pborman commented Jul 31, 2017

Yes, it should use r1, _, err = r.r.ReadRune() inside of quotes.

@nussjustin
Copy link
Contributor

The code must not forget to increment r.column. I also think a new method (readRawRune, or something similiar) would make it more clear that the carriage return is kept explicitly.

In any case, if nobody else wants to tackle this, I can send a patch once the tree opens again.

@ajstarks
Copy link
Contributor

ajstarks commented Aug 1, 2017

See also: #7897

@nussjustin
Copy link
Contributor

We all agree that the corruption you saw is obviously a bug and should be fixed. Bugs happen and, as I said, I will happily mail a fix for this issue.

One of the problems with CSV is that many people don't care about following the specification. This leads to many different and broken implementations of CSV (e.g. the popular "just join the fields with a comma"-implementation) and CSV files that are hard to parse correctly. The encoding/csv already tries to support some common cases features (\n and \r\n for line endings, non double-quoted quotes in quoted fields, trailing commas, comments, different delimiters) and in trying to do so introduced the bug you found.

Please note that #7879 is really a feature request and not a bug report like this issue. The RFC doesn't handle \r line separators other than \r\n. Yes, it could easily be implemented and encoding/csv already supports just \n, but just because it's relatively easy doesn't mean it's without costs. More features mean more API, more code, more complexity, and more chances for bugs.

We should all strive to keep the language, the stdlib and the rest of the ecosystem simple, while not giving up on user experience. Adding new features to the stdlib, especially without any kind of evidence that they are really needed, should always be avoided if possible. Issue #7879 for example only provides a single example for the requested feature and no further evidence that it is necessary, which makes it really unlinkely that the Go team will accept it, since they have to rely on mostly experience instead.

In my experience most CSV files use either \n or \r\n as line delimiters, so just supporting these two should already handle most cases. But maybe I'm wrong and having different row separators is something common in CSV files, which could make it worth implementing. I can't find any information about this in #7879 and a quick google search didn't yield any evidence that this is the case. I'll gladly let someone convince me that I'm wrong.

If you think that the Go team made the wrong decision, for example with #7897, you can always try to collect more examples/evidence and ask them to rethink their decision or try to get opinions from the community. Not everything the Go team and us other contributors do is always perfect, and I think everyone here is old enough to understand this. Just keep in mind that making everyone happy is not always feasible, not just technically but also in terms of complexity.

I could write much more about this, but we should try to keep this about the issue itself.

@pborman
Copy link
Contributor

pborman commented Aug 1, 2017

@alexhultman I hope you realize that we agreed that you had reported a bug an it should be fixed. You have pointed out a situation in which the CSV package is not following the RFC. There have been other changes to the CSV parser that have been accepted since I initially wrote it in 2011. @nussjustin is correct that there are so many different flavors of CSV files. At that time, RFC 4180 was the only standard from a standards body I could find that discussed the CSV format.

@nussjustin if you want to provide the fix, I would be all for that. And you are absolutely correct about trying to keep correct track of the cursor while walking through the input. If you could address #19019 at the same time that would be great (since it is also related to quotes).

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/52810 mentions this issue: encoding/csv: preserve \r\n in quoted fields

@ghost
Copy link
Author

ghost commented Aug 14, 2017

thanks

@rsc
Copy link
Contributor

rsc commented Nov 15, 2017

I disagree that this is a problem. I filed #22746 to revert the CL for Go 1.10. If you think the CL should be kept, please comment there. Thanks.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/78295 mentions this issue: encoding/csv: restore Go 1.9 quoted \r\n handling in Reader

gopherbot pushed a commit that referenced this issue Nov 16, 2017
CL 52810 changed Reader to interpret a quoted \r\n as a raw \r\n
when reading fields. This seems likely to break existing users, and
discussion on both #21201 (the original issue that triggered the change)
and #22746 (discussing whether to revert the change) failed to identify
a single motivating example for this change. To avoid breaking existing
users for no clear reason, revert the change.

The Reader has been rewritten in the interim so this is not a git revert
but instead and adjustment (and slight simplification) of the new Reader.

Fixes #22746.

Change-Id: Ie857b2f4b1359a207d085b6d3c3a6d440a997d12
Reviewed-on: https://go-review.googlesource.com/78295
Reviewed-by: Joe Tsai <thebrokentoaster@gmail.com>
neeral added a commit to neeral/cockroach that referenced this issue Aug 2, 2018
See cockroachdb#25344.

It appears this is being caused by the behaviour of Golang's
encoding/csv library, which folds \r\n into \n when reading. This was
fixed in golang/go#21201 but then reverted golang/go#22746. It appears
based on that second issue that Go is unlikely to change that behavior.

Check in the stdlib `encoding/csv` into `pkg/util` with
golang/go#22746 reverted.

Release note:
`\r\n` characters in CSV files were silently converted into `\n`. This
causes imported data to be different. This is now fixed.
neeral added a commit to neeral/cockroach that referenced this issue Aug 2, 2018
See cockroachdb#25344.

It appears this is being caused by the behaviour of Golang's
encoding/csv library, which folds \r\n into \n when reading. This was
fixed in golang/go#21201 but then reverted golang/go#22746. It appears
based on that second issue that Go is unlikely to change that behavior.

Check in the stdlib `encoding/csv` into `pkg/util` with
golang/go#22746 reverted.

Release note:
`\r\n` characters in CSV files were silently converted into `\n`. This
causes imported data to be different. This is now fixed.
dt pushed a commit to dt/cockroach that referenced this issue Aug 6, 2018
See cockroachdb#25344.

It appears this is being caused by the behaviour of Golang's
encoding/csv library, which folds \r\n into \n when reading. This was
fixed in golang/go#21201 but then reverted golang/go#22746. It appears
based on that second issue that Go is unlikely to change that behavior.

Check in the stdlib `encoding/csv` into `pkg/util` with
golang/go#22746 reverted.

Release note:
`\r\n` characters in CSV files were silently converted into `\n`. This
causes imported data to be different. This is now fixed.
@golang golang locked and limited conversation to collaborators Nov 16, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Projects
None yet
Development

No branches or pull requests

6 participants