-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
encoding/csv: parser does not preserve \r\n line ending #21201
Comments
This report is unclear. Please post an auto-contained code snippet that exhibits the issue. |
package main
import (
"encoding/csv"
"encoding/json"
"fmt"
"io"
"log"
"strings"
)
func main() {
in := "\"Hello\r\nHi\""
r := csv.NewReader(strings.NewReader(in))
for {
record, err := r.Read()
if err == io.EOF {
break
}
if err != nil {
log.Fatal(err)
}
json, _ := json.Marshal(record[0])
fmt.Println(string(json))
}
} Output:
|
The function for reading the input explains this behaviour: // Handle \r\n here. We make the simplifying assumption that
// anytime \r is followed by \n that it can be folded to \n.
// We will not detect files which contain both \r\n and bare \n. I agree that this is unexpected (and not explicitly tested). It could easily be fixed by not using (*Reader).readRune in the Quoted case of (*Reader).parseField. I think this is worth fixing. /cc @pborman Any opinion on this? |
Yes, it should use |
The code must not forget to increment In any case, if nobody else wants to tackle this, I can send a patch once the tree opens again. |
See also: #7897 |
We all agree that the corruption you saw is obviously a bug and should be fixed. Bugs happen and, as I said, I will happily mail a fix for this issue. One of the problems with CSV is that many people don't care about following the specification. This leads to many different and broken implementations of CSV (e.g. the popular "just join the fields with a comma"-implementation) and CSV files that are hard to parse correctly. The encoding/csv already tries to support some common cases features (\n and \r\n for line endings, non double-quoted quotes in quoted fields, trailing commas, comments, different delimiters) and in trying to do so introduced the bug you found. Please note that #7879 is really a feature request and not a bug report like this issue. The RFC doesn't handle \r line separators other than \r\n. Yes, it could easily be implemented and encoding/csv already supports just \n, but just because it's relatively easy doesn't mean it's without costs. More features mean more API, more code, more complexity, and more chances for bugs. We should all strive to keep the language, the stdlib and the rest of the ecosystem simple, while not giving up on user experience. Adding new features to the stdlib, especially without any kind of evidence that they are really needed, should always be avoided if possible. Issue #7879 for example only provides a single example for the requested feature and no further evidence that it is necessary, which makes it really unlinkely that the Go team will accept it, since they have to rely on mostly experience instead. In my experience most CSV files use either \n or \r\n as line delimiters, so just supporting these two should already handle most cases. But maybe I'm wrong and having different row separators is something common in CSV files, which could make it worth implementing. I can't find any information about this in #7879 and a quick google search didn't yield any evidence that this is the case. I'll gladly let someone convince me that I'm wrong. If you think that the Go team made the wrong decision, for example with #7897, you can always try to collect more examples/evidence and ask them to rethink their decision or try to get opinions from the community. Not everything the Go team and us other contributors do is always perfect, and I think everyone here is old enough to understand this. Just keep in mind that making everyone happy is not always feasible, not just technically but also in terms of complexity. I could write much more about this, but we should try to keep this about the issue itself. |
@alexhultman I hope you realize that we agreed that you had reported a bug an it should be fixed. You have pointed out a situation in which the CSV package is not following the RFC. There have been other changes to the CSV parser that have been accepted since I initially wrote it in 2011. @nussjustin is correct that there are so many different flavors of CSV files. At that time, RFC 4180 was the only standard from a standards body I could find that discussed the CSV format. @nussjustin if you want to provide the fix, I would be all for that. And you are absolutely correct about trying to keep correct track of the cursor while walking through the input. If you could address #19019 at the same time that would be great (since it is also related to quotes). |
Change https://golang.org/cl/52810 mentions this issue: |
thanks |
I disagree that this is a problem. I filed #22746 to revert the CL for Go 1.10. If you think the CL should be kept, please comment there. Thanks. |
Change https://golang.org/cl/78295 mentions this issue: |
CL 52810 changed Reader to interpret a quoted \r\n as a raw \r\n when reading fields. This seems likely to break existing users, and discussion on both #21201 (the original issue that triggered the change) and #22746 (discussing whether to revert the change) failed to identify a single motivating example for this change. To avoid breaking existing users for no clear reason, revert the change. The Reader has been rewritten in the interim so this is not a git revert but instead and adjustment (and slight simplification) of the new Reader. Fixes #22746. Change-Id: Ie857b2f4b1359a207d085b6d3c3a6d440a997d12 Reviewed-on: https://go-review.googlesource.com/78295 Reviewed-by: Joe Tsai <thebrokentoaster@gmail.com>
See cockroachdb#25344. It appears this is being caused by the behaviour of Golang's encoding/csv library, which folds \r\n into \n when reading. This was fixed in golang/go#21201 but then reverted golang/go#22746. It appears based on that second issue that Go is unlikely to change that behavior. Check in the stdlib `encoding/csv` into `pkg/util` with golang/go#22746 reverted. Release note: `\r\n` characters in CSV files were silently converted into `\n`. This causes imported data to be different. This is now fixed.
See cockroachdb#25344. It appears this is being caused by the behaviour of Golang's encoding/csv library, which folds \r\n into \n when reading. This was fixed in golang/go#21201 but then reverted golang/go#22746. It appears based on that second issue that Go is unlikely to change that behavior. Check in the stdlib `encoding/csv` into `pkg/util` with golang/go#22746 reverted. Release note: `\r\n` characters in CSV files were silently converted into `\n`. This causes imported data to be different. This is now fixed.
See cockroachdb#25344. It appears this is being caused by the behaviour of Golang's encoding/csv library, which folds \r\n into \n when reading. This was fixed in golang/go#21201 but then reverted golang/go#22746. It appears based on that second issue that Go is unlikely to change that behavior. Check in the stdlib `encoding/csv` into `pkg/util` with golang/go#22746 reverted. Release note: `\r\n` characters in CSV files were silently converted into `\n`. This causes imported data to be different. This is now fixed.
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (
go version
)?go version go1.8.1 linux/amd64
What operating system and processor architecture are you using (
go env
)?GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/alexhultman/go"
GORACE=""
GOROOT="/usr/lib/golang"
GOTOOLDIR="/usr/lib/golang/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build738528540=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
What did you do?
Read CSV file using standard CSV parser.
If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.
What did you expect to see?
Preserved line endings aka I want the verbatim data as in the CSV file, not some made up corrupt data.
What did you see instead?
\r\n is read as \n
The text was updated successfully, but these errors were encountered: