Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds ability to quote always in CSV.build #6723

Merged
merged 3 commits into from
Oct 3, 2018

Conversation

maiha
Copy link
Contributor

@maiha maiha commented Sep 14, 2018

Currently, CSV.build quotes only those fields which contain special characters.
PR adds ability to quote all fields by quote_always option.

puts CSV.build {|csv| csv.row 101, "doesn't"}
puts CSV.build(quote_always: true) {|csv| csv.row 101, "doesn't"}
101,doesn't
"101","doesn't"

This allows us to link to tools that are not RFC compliant. For example, ClickHouse can't parse the former.
ClickHouse/ClickHouse#2192

Best regards,

@asterite
Copy link
Member

Maybe always_quote reads better (I'm not a native English speaker so other's opinion is best)

@maiha
Copy link
Contributor Author

maiha commented Sep 14, 2018

Yep, I investigated the other programming languages.

Language Library Name Type
C# CsvHelper QuoteAllFields bool
Go (stdlib) N/A N/A
Haskell cassava Quoting.QuoteAll enum
Java Apache Commons CSV QuoteMode.ALL enum
Python (stdlib) QUOTE_ALL enum
Ruby fastercsv force_quotes bool

Well, I found two points.

  1. naming: all is better than always stuff
  2. type: enum is better than bool for expandability

enum suggestion

class CSV::Builder
  enum Quote
    None    # No quotes
    Minimal # Quotes according to RFC 4180 (default)
    All     # Always quote
  end
end
  • values are almost same as cassava
  • naming issue: Quote, Quoting, Mode, QuoteMode ?

thought?

@bcardiff
Copy link
Member

Naming the enum Quoting as in cassava seems nice enough.
The argument can also be quoting.

@maiha
Copy link
Contributor Author

maiha commented Sep 15, 2018

I modified the code to use enum Quoting.
Here, I changed the name to RFC which is more objective since Minimal has a subjective meaning.
The default quoting method is RFC which has backword compatibility.

puts CSV.build {|csv| csv.row 1, ","}
puts CSV.build(quoting: CSV::Builder::Quoting::NONE) {|csv| csv.row 1, ","}
puts CSV.build(quoting: CSV::Builder::Quoting::RFC ) {|csv| csv.row 1, ","}
puts CSV.build(quoting: CSV::Builder::Quoting::ALL ) {|csv| csv.row 1, ","}
1,","
1,,
1,","
"1",","

return true
case @quoting
when .rfc?
if value.is_a?(String)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the value is a Char or Symbol the quoting logic is not applied.
That was a pre-existing corner case that was not handled.
Should we handled it now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I prefer this PR just provides a new feature about quoting behavior.
And then, I'd like to fix the corner case soon in another PR that provides fix the existing issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then perhaps revert the changes to def <<(value : Nil | Bool | Char | Number | Symbol), meaning value here is always String, and you can revert the if value.is_a? String change. And leave all this to a different PR.

@maiha
Copy link
Contributor Author

maiha commented Sep 15, 2018

FYI: There was almost no performance change in compatible mode(Quoting::RFC).

Benchmark.ips(warmup: 0, calculation: 10) do |x|
  x.report("csv(orig)") { build_csv1(reports) }
  x.report("csv(RFC) ") { build_csv2(reports, "RFC") }
  x.report("csv(ALL) ") { build_csv2(reports, "ALL") }
end
csv(orig)   3.27  (305.83ms) (± 6.77%)  238787461 B/op        fastest
csv(RFC)    3.22  (310.67ms) (± 9.10%)  238787333 B/op   1.02× slower
csv(ALL)    2.52  (396.54ms) (± 9.99%)  238786808 B/op   1.30× slower
  • sample data: rows=109,488, cols=16, bytes=26.5 MB

@@ -35,15 +35,21 @@ require "csv"
# 4,5,6,7,8
# ```
class CSV::Builder
enum Quoting
NONE # No quotes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This documentation doesn't work, it needs to be before the constants to show up in crystal doc.

# ```
# result = CSV.build do |csv|
# csv.row "one", "two"
# csv.row "three"
# end
# result # => "one,two\nthree\n"
# result = CSV.build(quoting: CSV::Builder::Quoting::ALL) do |csv|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quoting: :all

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, I didn't know such a cool syntax! Thank you!

@asterite
Copy link
Member

What's the point of no quotes? Isn't that a broken format?

@RX14
Copy link
Contributor

RX14 commented Sep 15, 2018

I don't see why is was made an enum, it should just be a boolean, always_quote. It's unnecessarily complex.

@RX14
Copy link
Contributor

RX14 commented Sep 15, 2018

Actually i've changed my mind. If all these CSV libs have an option for no quoting, there must be a reason. An enum is fine.

@asterite
Copy link
Member

Right, but I'd like to know the reason.

@maiha
Copy link
Contributor Author

maiha commented Sep 15, 2018

What's the point of no quotes? Isn't that a broken format?

To be honest, I just imitated other libraries, so I don't know exact reasons.

In my opinion, it can be expected to bypass the judgment process and speed up a little, such as when it is composed only of numbers or alphabets.

@asterite
Copy link
Member

Makes sense

@RX14
Copy link
Contributor

RX14 commented Sep 15, 2018

@asterite probably compatability reasons. You can ask this guy if you want.

@RX14
Copy link
Contributor

RX14 commented Sep 15, 2018

@asterite compile failed, looks like a compiler bug?

@maiha
Copy link
Contributor Author

maiha commented Sep 15, 2018

Oh, I forgot about special keywords All and None. However, this should only be enabled in the case of Flags. Right?

@bew
Copy link
Contributor

bew commented Sep 15, 2018

@maiha the error is not about the special enum members All & None, it looks like an automatic cast error

@asterite
Copy link
Member

Yes, there's a bug regarding named arguments and automatic casts. I think it's #6623 . I don't have time to fix it.

@maiha
Copy link
Contributor Author

maiha commented Sep 15, 2018

Since it seems that it takes time to solve the problem, I reverted the refactoring codes about "use Symbol for enum".

@RX14
Copy link
Contributor

RX14 commented Sep 15, 2018

This is actually a new bug introduced since 0.26.1

notice how the first std_spec compiles with the 0.26.1 compiler than fails when rebuilt on master.

@RX14
Copy link
Contributor

RX14 commented Sep 15, 2018

So it's definitely #6618

Copy link
Member

@sdogruyol sdogruyol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @maiha 👍

Copy link
Member

@bcardiff bcardiff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is GTG.
A PR should follow to handle the corner case for quoting for Chars and Symbol.

@sdogruyol sdogruyol merged commit 08e3db6 into crystal-lang:master Oct 3, 2018
@sdogruyol sdogruyol added this to the 0.27.0 milestone Oct 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants