CSV.parse not removing BOM character #43

davich · 2018-09-06T01:15:19Z

It looks like there's a bug in CSV 3.0.0: CSV.parse with encoding: 'bom|utf-8' doesn't remove the BOM character (CSV.read handles it fine).

require 'csv'
require 'tempfile'

# Write a CSV file that starts with a BOM (U+FEFF).
t = Tempfile.new("file.csv")
bom_character = 65_279
t << "name\nMy CSV".codepoints.unshift(bom_character).pack("U*")
t.rewind

# CSV.read strips the BOM, so the "name" header matches.
csv = CSV.read(t, headers: true, encoding: 'bom|utf-8')
csv.first["name"]
# => "My CSV"

# CSV.parse keeps the BOM, so the "name" lookup fails.
csv = CSV.parse(File.read(t), headers: true, encoding: 'bom|utf-8')
csv.first["name"]
# => nil

kou · 2018-09-06T07:13:38Z

A BOM is for files, not for string data.

I don't know why some users want to use a BOM for string data.
Can you show your use case?

davich · 2018-09-06T07:37:46Z

Right. So I should do:

csv = CSV.parse(File.read(t, encoding: 'bom|utf-8'), headers: true)

Thanks very much for your help! Sorry to raise an issue when the issue is with my understanding 👍
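
For later readers, here is a minimal sketch of that approach, assuming a UTF-8 file that starts with a BOM. The file name prefix and contents are made up for illustration. The 'bom|utf-8' encoding is applied when the file is read, so the string handed to CSV.parse no longer starts with the BOM.

```ruby
require 'csv'
require 'tempfile'

# Create a small UTF-8 CSV file that starts with a BOM (U+FEFF).
# The file name and contents are illustrative only.
file = Tempfile.new(['bom_example', '.csv'])
file.write("\uFEFFname\nMy CSV\n")
file.flush

# 'bom|utf-8' is handled while reading the file, so the BOM is dropped
# before CSV.parse receives the string.
text = File.read(file.path, encoding: 'bom|utf-8')
table = CSV.parse(text, headers: true)
name = table.first['name']

# Clean up the temporary file.
file.close!
```

The same encoding option can be passed to CSV.read or CSV.foreach when the data is read straight from a path.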

matschaffer · 2021-09-27T04:58:50Z

I had to do this with a file today as well (parsing an export from 新生銀行)

CSV.foreach(file, encoding: 'bom|utf-8') - thanks for posting your solution @davich

khiav223577 · 2021-10-28T11:18:52Z

> Right. So I should do:
> csv = CSV.parse(File.read(t, encoding: 'bom|utf-8'), headers: true)
> Thanks very much for your help! Sorry to raise an issue when the issue is with my understanding 👍

I have to manually remove BOM in order to test the response in my rspec.

BOM = "\xEF\xBB\xBF"
expect(CSV.parse(response.body.delete_prefix(BOM))).to eq [['abcdefg']]

It is not elegant, but maybe it is intended, since #23 says: "BOM is for opening a file not parse target string."
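
A slightly more defensive variant of the same idea, as a sketch only: strip_utf8_bom is a hypothetical helper, not part of the csv gem, and the binary body below is a stand-in for response.body. It retags a copy of the body as UTF-8 before removing the prefix, on the assumption that the body may arrive tagged as binary (ASCII-8BIT), which can make a direct comparison with a UTF-8 literal fail.

```ruby
require 'csv'

UTF8_BOM = "\uFEFF"

# Hypothetical helper: returns a UTF-8 copy of the body without a leading BOM.
# The input string is left untouched.
def strip_utf8_bom(body)
  body.dup.force_encoding(Encoding::UTF_8).delete_prefix(UTF8_BOM)
end

# Stand-in for response.body: BOM bytes plus one row, tagged as binary.
body = "\xEF\xBB\xBFabcdefg\n".b

rows = CSV.parse(strip_utf8_bom(body))
```

In a spec this would read as expect(CSV.parse(strip_utf8_bom(response.body))).to eq [['abcdefg']].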

matschaffer · 2021-10-29T05:32:34Z

Per @kou's comment, it's a little curious that response.body would contain a BOM. But if it does, what you did seems to make sense.

I see a mention of using IO.read on https://bugs.ruby-lang.org/issues/15210 from @nobu but I'm not sure it makes sense just to remove a few characters from a string.

adamreisnz · 2025-01-06T23:27:30Z

If you are rendering CSV output in an API response for the user to download, and you want the CSV to include a BOM (to prevent character encoding issues when the file is opened in older versions of Excel), then it makes perfect sense for the output to start with a UTF-8 BOM.

So stripping it from the body in tests seems like the way to go.
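
A rough sketch of producing such a body, assuming the CSV text is built with CSV.generate and the BOM is simply prepended. The column names and values are made up.

```ruby
require 'csv'

# Build the CSV text; the rows here are illustrative only.
csv_text = CSV.generate do |csv|
  csv << ['name', 'amount']
  csv << ['My CSV', '100']
end

# Prepend the UTF-8 BOM (U+FEFF) so spreadsheet tools that rely on it
# can detect the encoding.
body = "\uFEFF" + csv_text
```

A response built this way starts with the bytes EF BB BF, which is the prefix the test above removes.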

kou · 2025-01-07T01:08:33Z

If a response body is for download (Content-Disposition: attachment), a test should save the response to a file and read it as a file. A test should not parse the response body as a string (with delete_prefix(BOM)).
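
One way that suggestion could look in a test, as a sketch: the body string below is a stand-in for response.body, and the temporary file name is arbitrary. The response is written to a file and then read back as a file, with the BOM handled by the encoding option.

```ruby
require 'csv'
require 'tempfile'

# Stand-in for response.body in a real test: BOM followed by one row.
body = "\uFEFFabcdefg\n"

rows = Tempfile.create(['download', '.csv']) do |file|
  # Write the bytes exactly as they were received.
  file.binmode
  file.write(body)
  file.flush

  # Read the saved download as a file; 'bom|utf-8' drops the BOM.
  CSV.read(file.path, encoding: 'bom|utf-8')
end
```

The block form of Tempfile.create removes the file afterwards and returns the block's value, so rows holds the parsed data.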

davich closed this as completed Sep 6, 2018

rafaltrojanowski mentioned this issue Oct 24, 2019

Testing - Write Feature/Integration Tests for UntaggedAnimalAssessmentJob rubyforgood/abalone#67

Closed

JunichiIto mentioned this issue May 4, 2024

feature request: Warn when reading BOM text with headers option #301

Closed
