dev/core#1424 Support MS Excel for diacritic characters when exporting to csv #15968
Conversation
This is a thumbs up from me 👍 I'll test it when I get a chance
@seamuslee001 please specify what you tested it with - I tested with a few things & I guess @demeritcowboy probably has Windows & maybe Windows Excel
I can check tomorrow. I have Windows Excel 2019 and also an older version. I haven't read the whole issue but people do other things with csv too, like input to Mailchimp or custom processing scripts. Do people open exports on phones nowadays - is that a thing?
I tested this on my laptop and it works for me - I have Excel 2013 on my laptop.
Note I am on the latest Windows 10 but with older Excel for reasons
Seems OK to me. I tested with Vim, LibreOffice and with
I am having an issue with Office 365 v 16.0.12 when exporting from Reports, but searches export fine.
@kcristiano is your comment with this patch? I haven't dug into how reports generate csvs but this patch should not alter it |
@eileenmcnaughton I agree this patch only covers the export of searches. I tested reports and it was not affected. I suspect we have to look into a similar fix for reports. |
I should note I'm working through a couple of cleanups towards this also working for custom searches - just opened this minor one #15974 |
Given everyone's testing so far is showing this to be likely to be good I'm adding the merge-ready flag
The BOM indeed causes issues for fgetcsv (good catch!), but I would be tempted to ignore it considering we want it easier for Excel users, and developers can workaround it programmatically. Last call for merge? |
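To illustrate the workaround mentioned above: consumers that parse these exports programmatically can strip the BOM themselves before feeding the data to a CSV parser. This is a minimal sketch in Python rather than PHP (CiviCRM's language), since Python's `utf-8-sig` codec demonstrates the behaviour concisely; the bytes shown are an assumed example payload, not actual CiviCRM output:

```python
import csv
import io

# A CSV payload as exported after this patch: UTF-8 BOM (EF BB BF) + data.
# The row contains an e-with-ogonek ('ę', UTF-8 bytes C4 99).
payload = b"\xef\xbb\xbfchar,name\r\n\xc4\x99,e-ogonek\r\n"

# Decoding with plain 'utf-8' keeps the BOM, so it ends up glued to the
# first header cell - the same class of problem fgetcsv hits.
naive = payload.decode("utf-8")
assert naive[0] == "\ufeff"

# 'utf-8-sig' strips a leading BOM if present (and is a no-op otherwise),
# so downstream parsing sees clean headers.
rows = list(csv.reader(io.StringIO(payload.decode("utf-8-sig"))))
print(rows)  # [['char', 'name'], ['ę', 'e-ogonek']]
```

In PHP the equivalent fix is to skip the first three bytes when they equal `"\xEF\xBB\xBF"` before calling `fgetcsv`.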
As no objections have been issued I'm merging
Long version is here civicrm#15968 Short version is - this seems to work for me without any obvious bad effects Bug: T237564 Change-Id: I9d9658a76b50534d87bc922a77a5c7402b011325
Overview
Fixes a long-standing issue whereby files exported from CiviCRM with diacritic characters - e.g. ę - are mangled when opened in MS Excel.
Before
No BOM string prepended to the csv. Diacritic characters display correctly in Open Office but are mangled in all versions of Excel.
After
A UTF-8 BOM is prepended - this tells Excel that the file is UTF-8 encoded and diacritics are displayed correctly.
This is a standard so other programmes should be unaffected
Technical Details
The underlying issue is that the characters are UTF-8 encoded & MS Excel by default does not assume UTF-8.
In order to do so it needs a BOM - Byte Order Mark - which is a few bytes at the start of
the csv.
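Concretely, the UTF-8 BOM is the three bytes `EF BB BF` - the UTF-8 encoding of the code point U+FEFF. A minimal sketch of the fix's mechanics (in Python for illustration; CiviCRM itself is PHP, and the variable names here are assumptions, not CiviCRM identifiers):

```python
# The UTF-8 BOM: the three-byte UTF-8 encoding of U+FEFF.
BOM = b"\xef\xbb\xbf"
assert "\ufeff".encode("utf-8") == BOM

# Prepending it to an otherwise-unchanged UTF-8 CSV body is the whole fix:
# the data bytes are identical, only the three-byte signature is added.
csv_body = "name\r\n\u0119\r\n".encode("utf-8")  # a row containing 'ę'
export = BOM + csv_body

print(export[:3].hex())  # 'efbbbf'
```

Programmes that understand the signature (Excel, LibreOffice) use it to pick UTF-8; programmes that ignore it see three extra bytes before the first cell.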
The BOM indicator is 'not recommended' .... unless you want your csv to work with
MS Excel. Since MS Excel is a major use case for csvs I think it's pretty clear that all
things being equal we want to support it... I note that there are extensions to export to excel
natively but I don't think that replaces this. The number one reason to want to export
a csv is to work with it in a spreadsheeting programme & unless we can't safely make it
work in core we shouldn't require an extension for that.
There are various recommendations over time but it seems things have improved in the MS
Excel world and what works on Windows now appears to work on Mac too - at least on a recent version.
This link https://donatstudios.com/CSV-An-Encoding-Nightmare is a pretty good discussion.
While the above link and others say that you need a different BOM for Mac than Windows, my
testing shows that the one recommended for Windows works fine on Mac Excel (while you would need
to convert to UTF-16 to follow the Mac recommendation: https://csv.thephpleague.com/9.0/interoperability/encoding/).
Other links
https://stackoverflow.com/questions/35294443/does-excel-for-mac-2016-properly-import-unicode-in-csv-files
https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom/2223926#2223926
Notably this summary :
" When should you encode with a BOM?
If you're unable to record the metadata in any other way (through a charset tag or file system meta), and the programs being used like BOMs, you should encode with a BOM. This is especially true on Windows where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.
When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, it either must, or must not have a BOM. For example, if you're using Excel 2007+ on Windows, it must be encoded with a BOM if you want to open it smoothly and not have to resort to importing the data."
So the argument for a BOM is - if you want it to be compatible with MS Excel use a BOM. This seems to apply.
The risk is that perhaps there is some variant of csv viewing programme that can't cope - the risk of this is rather mitigated by
1) the fact that MS Excel adds the BOM to the start of any files it saves as UTF-8 encoded csv, so any programme that expects to open MS Excel generated csvs would need to be able to cope with it. In addition it is a standard, if not required, file indicator so it feels like the programmes that were legacy in 2016 writeups might be of no concern now.
2) There really is no programmatic way to export these csvs so the risk that this is being parsed by code is close to zero.
There are 2 other things we could do to mitigate:
1) test on more platforms - I've tested with MS Excel for Mac (Office 365 v 16.31) and Libre Office and MS Numbers & MS Word, CotEditor & Notes
2) We could add a setting. I'm generally a bit loath on settings as they are a bit of a maintenance nightmare.
However, perhaps it would be a setting like 'export csvs in legacy format' & there could be a link
to the gitlab to explain your use-case if you feel the need to set it. If no-one does then we could later remove it.
Final note - csv tables are created with CRM_Core_Report_Excel (spot the irony) from 3 places
- the main export, the export for custom fields & a third place which I believe to be unreachable. This is one path
only but I can look at centralising for the custom fields export path.