Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV.parse not removing BOM character #43

Closed
davich opened this issue Sep 6, 2018 · 7 comments
Closed

CSV.parse not removing BOM character #43

davich opened this issue Sep 6, 2018 · 7 comments

Comments

@davich
Copy link

davich commented Sep 6, 2018

It looks like there's a bug in CSV 3.0.0 where parse with encoding 'bom|utf-8' doesn't remove the bom character (but it seems to be fine on the read method)

require 'csv'
require 'tempfile'
t = Tempfile.new("file.csv")
bom_character = 65_279
t << "name\nMy CSV".codepoints.unshift(bom_character).pack("U*")
t.rewind
csv = CSV.read(t, headers: true, encoding: 'bom|utf-8')
csv.first["name"]
=> "My CSV"
csv = CSV.parse(File.read(t), headers: true, encoding: 'bom|utf-8')
csv.first["name"]
=> nil
@kou
Copy link
Member

kou commented Sep 6, 2018

BOM is for file not string data.

I don't know why some users want to use BOM for string data.
Can you show your use case?

@davich
Copy link
Author

davich commented Sep 6, 2018

Right. So I should do:

csv = CSV.parse(File.read(t, encoding: 'bom|utf-8'), headers: true)

Thanks very much for your help! Sorry to raise an issue when the issue is with my understanding 👍

@matschaffer
Copy link

I had to do this with a file today as well (parsing an export from 新生銀行)

CSV.foreach(file, encoding: 'bom|utf-8') - thanks for posting your solution @davich

@khiav223577
Copy link

Right. So I should do:

csv = CSV.parse(File.read(t, encoding: 'bom|utf-8'), headers: true)

Thanks very much for your help! Sorry to raise an issue when the issue is with my understanding 👍

I have to manually remove BOM in order to test the response in my rspec.

BOM = "\xEF\xBB\xBF"
expect(CSV.parse(response.body.delete_prefix(BOM))).to eq [['abcdefg']]

It is not elegant, but maybe it is intended since it says: BOM is for opening a file not parse target string. in #23

@matschaffer
Copy link

Per @kou 's comment, it's a little curious that response.body would contain a BOM. But if it does, what you did seems to make sense.

I see a mention of using IO.read on https://bugs.ruby-lang.org/issues/15210 from @nobu but I'm not sure it makes sense just to remove a few characters from a string.

@adamreisnz
Copy link

If you are rendering CSV output in an API response for download to the user, and you want the CSV to include a BOM marker (to prevent character encoding issues when opening the file in older versions of Excel), then it makes perfect sense for the output to have a UTF-8 BOM marker.

So stripping it from the body in tests seems like the way to go.

@kou
Copy link
Member

kou commented Jan 7, 2025

If a response body is for download (Content-Disposition: attachment), a test should save the response to a file and read it as a file. A test should not parse the response body as a string (with delete_prefix(BOM)).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

5 participants