Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dev/core#1424 Support MS Excel for diacritic characters when exporting to csv #15968

Merged
merged 1 commit into from
Dec 17, 2019

Conversation

eileenmcnaughton
Copy link
Contributor

@eileenmcnaughton eileenmcnaughton commented Nov 26, 2019

Overview

Fixes a long-standing issue whereby files exported from CiviCRM with diacritic characters - eg.
ę are mangled when opened in MS Excel.

Before

No BOM string pre-pended to csv. Diacritic characters present correctly in Open Office but are mangled in all version of Excel.

After

 Prepended - this tells Excel that it is UTF-8 encoded and diacritics are displayed correctly.

This is a standard so other programmes should be unaffected

Technical Details

The underlying issue is that the characters are UTF-8 encoded & MS Excel by default does not assume UTF8.
In order to do so it needs a BOM - Byte Order Marker - which is a few hex characters at the start of
the csv.

The BOM indicator is 'not recommended' .... unless you want your csv to work with
MS Excel. Since MS Excel is a major use case for csvs I think it's pretty clear that all
things being equal we want to support it... I note that there are extensions to export to excel
natively but I don't think that replaces this. The number one reason to want to export
a csv is to work with it in a spreadsheeting programme & unless we can't safely make it
work in core we shouldn't require an extension for that.

There are various recommendations over time but it seems things have improved in the MS
Excel world and what works on Windows now appears to work on Mac too - at least on a recent version.

This link https://donatstudios.com/CSV-An-Encoding-Nightmare is a pretty good discussion.

While the above link and others say that you need a different BOM for MAC than Windows my
testing shows that the one recommended for Windows works fine on Mac Excel (while you would need
to convert to UTF 16 to follow the Mac recommendation. (https://csv.thephpleague.com/9.0/interoperability/encoding/)

Other links
https://stackoverflow.com/questions/35294443/does-excel-for-mac-2016-properly-import-unicode-in-csv-files
https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom/2223926#2223926

Notably this summary :
" When should you encode with a BOM?
If you're unable to record the metadata in any other way (through a charset tag or file system meta), and the programs being used like BOMs, you should encode with a BOM. This is especially true on Windows where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.

When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, it either must, or must not have a BOM. For example, if you're using Excel 2007+ on Windows, it must be encoded with a BOM if you want to open it smoothly and not have to resort to importing the data."

So the argument for a BOM is - if you want it to be compatible with MS Excel use a BOM. This seems to apply.

The risk is that perhaps there is some variant of csv viewing programme that can't cope - the risk of this is rather mitigited by

  1. the fact that MS Excel adds the BOM to the start of any files it saves as UTF-8 encoded csv so any programme that expects to open MS Excel generated
    csvs would need to be able to cope with it. In addition it is a standard, if not required, file indicator so it feels like the programmes
    that were legacy in 2016 writeups might be of no concern now.
  2. There really is no programatic way to export these csvs so the risk that this is being parsed by code is close to zero

There are 2 other things we could do to mitigate

  1. test on more platforms - I've tested with MS Excel for Mac (Office 365 v 16.31) and Libre Office and MS Numbers
    & MS word, cot editor & notes
  2. We could add a setting. I'm generally a bit loath on settings as they are a bit of a maintenance nightmare.
    However, perhaps it would be a setting like 'export csvs in legacy format & there could be a link
    to the gitlab to explain your use-case if you feel the need to set it. If no-one does then we could later remove.

Final note - csv tables are created with the CRM_Core_Report_Excel (spot the irony) from 3 places

  • the main export, the export for custom fields & a third place which I believe to be unreachable. This is one path
    only but I can look at centralising for the custom fields export path.

Comments

…g to csv

There is a long-standing issue whereby files exported from CiviCRM with diacritic characters - eg.
ę are mangled when opened in MS Excel.

The underlying issue is that the characters are UTF-8 encoded & MS Excel by default does not assume UTF8.
In order to do so it needs a BOM - Byte Order Marker - which is a few hex characters at the start of
the csv.

The BOM indicator is 'not recommended' .... unless you want your csv to work with
MS Excel. Since MS Excel is a major use case for csvs I think it's pretty clear that all
things being equal we want to support it... I note that there are extensions to export to excel
natively but I don't think that replaces this. The number one reason to want to export
a csv is to work with it in a spreadsheeting programme & unless we can't safely make it
work in core we shouldn't require an extension for that.

There are various recommendations over time but it seems things have improved in the MS
Excel world and what works on Windows now appears to work on Mac too - at least on a recent version.

This link https://donatstudios.com/CSV-An-Encoding-Nightmare is a pretty good discussion.

While the above link and others say that you need a different BOM for MAC than Windows my
testing shows that the one recommended for Windows works fine on Mac Excel (while you would need
to convert to UTF 16 to follow the Mac recommendation. (https://csv.thephpleague.com/9.0/interoperability/encoding/)

Other links
https://stackoverflow.com/questions/35294443/does-excel-for-mac-2016-properly-import-unicode-in-csv-files
https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom/2223926#2223926

Notably this summary :
  " When should you encode with a BOM?
If you're unable to record the metadata in any other way (through a charset tag or file system meta), and the programs being used like BOMs, you should encode with a BOM. This is especially true on Windows where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.

When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, it either must, or must not have a BOM. For example, if you're using Excel 2007+ on Windows, it must be encoded with a BOM if you want to open it smoothly and not have to resort to importing the data."

So the argument for a BOM is - if you want it to be compatible with MS Excel use a BOM. This seems to apply.

The risk is that perhaps there is some variant of csv viewing programme that can't cope - the risk of this is rather mitigited by
1) the fact that MS Excel adds the BOM to the start of any files it saves as UTF-8 encoded csv so any programme that expects to open MS Excel generated
csvs would need to be able to cope with it. In addition it is a standard, if not required, file indicator so it feels like the programmes
that were legacy in 2016 writeups might be of no concern now.
2) There really is no programatic way to export these csvs so the risk that this is being parsed by code is close to zero

There are 2 other things we could do to mitigate
1) test on more platforms - I've tested with MS Excel for Mac (Office 365 v 16.31) and Libre Office and MS Numbers
& MS word, cot editor & notes
2) We could add a setting. I'm generally a bit loath on settings as they are a bit of a maintenance nightmare.
However, perhaps it would be a setting like 'export csvs in legacy format & there could be a link
to the gitlab to explain your use-case if you feel the need to set it. If no-one does then we could later remove.

Final note - csv tables are created with the CRM_Core_Report_Excel (spot the irony) from 3 places
- the main export, the export for custom fields &  a third place which I believe to be unreachable. This is one path
only but I can look at centralising for the custom fields export path.
@civibot
Copy link

civibot bot commented Nov 26, 2019

(Standard links)

@civibot civibot bot added the master label Nov 26, 2019
@eileenmcnaughton eileenmcnaughton changed the title dev/core#1424 Support MS Excel for diacritic characters when exportin… dev/core#1424 Support MS Excel for diacritic characters when exporting to csv Nov 26, 2019
@seamuslee001
Copy link
Contributor

This is a thumbs up from me 👍 I'll test it when i get a chance

@eileenmcnaughton
Copy link
Contributor Author

@seamuslee001 please specify what you test it with - I tested with a few things & I guess @demeritcowboy probably has Windows & maybe Windows Excel

@demeritcowboy
Copy link
Contributor

I can check tomorrow. I have windows excel 2019 and also an older version too. I haven't read the whole issue but people do other things with csv too, like input to mailchimp or custom processing scripts. Do people open exports on phones nowadays - is that a thing?

@seamuslee001
Copy link
Contributor

I tested this on my laptop and works for me, I have Excel 2013 on my laptop

@seamuslee001
Copy link
Contributor

Note i am latest Windows 10 but with older Excel for reasons

@mlutfy
Copy link
Member

mlutfy commented Nov 27, 2019

Seems OK to me.

I tested with Vim, LibreOffice and with file:

$ file CiviCRM_Contact_Search.csv 
CiviCRM_Contact_Search.csv: UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators

@kcristiano
Copy link
Member

I am having an issue with Office 365 v 16.0.12 when exporting from Reports, But searches export fine.

file Report_20191127-925.csv
Report_20191127-925.csv: UTF-8 Unicode text, with CRLF line terminators
file CiviCRM_Contact_Search.csv
CiviCRM_Contact_Search.csv: UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators

@demeritcowboy
Copy link
Contributor

  • PASS: Excel 2010 on windows 7
  • PASS: Import to MS Access 2013
  • PASS: Mailchimp import
  • PASS: Parsing via script with league/csv (it recognizes and removes BOM)
  • Minor Issue: Parsing via script with fgetcsv: It fails if your script needs to recognize the first cell and/or it displays weird depending what you're doing with it.

@eileenmcnaughton
Copy link
Contributor Author

@kcristiano is your comment with this patch? I haven't dug into how reports generate csvs but this patch should not alter it

@kcristiano
Copy link
Member

@eileenmcnaughton I agree this patch only covers the export of searches. I tested reports and it was not affected. I suspect we have to look into a similar fix for reports.

@eileenmcnaughton
Copy link
Contributor Author

I should note I'm working through a couple of cleanups towards this also working for custom searches - just opened this minor one #15974

@seamuslee001
Copy link
Contributor

Given everyone's testing so far is showing this to be likely to be good i'm adding the merge ready flag

@seamuslee001 seamuslee001 added the merge ready PR will be merged after a few days if there are no objections label Nov 28, 2019
@mlutfy
Copy link
Member

mlutfy commented Dec 2, 2019

The BOM indeed causes issues for fgetcsv (good catch!), but I would be tempted to ignore it considering we want it easier for Excel users, and developers can workaround it programmatically.

Last call for merge?

@seamuslee001
Copy link
Contributor

as no objections have been issued i'm merging

@seamuslee001 seamuslee001 merged commit 17072a2 into civicrm:master Dec 17, 2019
@seamuslee001 seamuslee001 deleted the bom branch December 17, 2019 19:35
eileenmcnaughton added a commit to eileenmcnaughton/civicrm-core that referenced this pull request Jan 7, 2020
Long version is here civicrm#15968

Short version is - this seems to work for me without any obvious bad effects

Bug: T237564

Change-Id: I9d9658a76b50534d87bc922a77a5c7402b011325
eileenmcnaughton added a commit to eileenmcnaughton/civicrm-core that referenced this pull request Jan 7, 2020
Long version is here civicrm#15968

Short version is - this seems to work for me without any obvious bad effects

Bug: T237564

Change-Id: I9d9658a76b50534d87bc922a77a5c7402b011325
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
master merge ready PR will be merged after a few days if there are no objections
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants