Consider this program:

    #include <cstdio>

    int main(void) {
      const char *filename = u8"ディセント3.txt";
      auto fp = std::fopen(filename, "r");
      if (fp) {
        std::fclose(fp);
        return 0;
      } else {
        return 1;
      }
    }

If a file named ディセント3.txt exists, then will that program
successfully open it? The answer is: it depends.

filename is going to point to these bytes:

    Raw bytes:  e3 83 87 e3 82 a3 e3 82 bb e3 83 b3 e3 83 88 33 2e 74 78 74 00
    Characters: ディセント3.txt␀

Internally, Windows uses UTF-16. When you call fopen(), Windows will
convert the filename parameter into UTF-16 [1]. If the program is run
with a UTF-8 Windows code page, then the above bytes will be correctly
interpreted as UTF-8 when being converted into UTF-16 [2]. The final
UTF-16 string will be this*:

    Raw bytes:  ff fe c7 30 a3 30 bb 30 f3 30 c8 30 33 00 2e 00 74 00 78 00 74 00
    Characters: ディセント3.txt

On the other hand, if the program is run with code page 932, then the
original bytes will be incorrectly interpreted as code page 932 when
being converted into UTF-16. The final UTF-16 string will be this*:

    Raw bytes:  ff fe 5d 7e fd ff 67 7e 63 ff 67 7e 7b ff 5d 7e 73 ff 5d 7e fd ff 33 00 2e 00 74 00 78 00 74 00
    Characters: 繝�繧」繧サ繝ウ繝�3.txt

In other words, if that program gets compiled on Windows with a UTF-8
execution character set, then it needs to be run with a UTF-8 Windows
code page. Otherwise, mojibake might happen.

*Unlike the first string, this one does not have a null terminator.
This is because the Windows kernel doesn’t use null terminated strings
for paths [3][4].

---

Before this commit, Descent 3 would pass UTF-8 to fopen(), even if
Descent 3 is run with a non-UTF-8 Windows code page [5]. This commit
makes sure that Descent 3 gets run with a UTF-8 Windows code page.

The Windows code page isn’t just used by fopen(). It also gets used by
many other functions in the Windows API [6]. I don’t know if Descent 3
uses any of those other functions, but if it does, then this commit
will also help make sure that those functions receive strings with the
correct character encoding.

Descent 3 uses UTF-8 for strings by default [7]. Making sure that
Descent 3 uses UTF-8 everywhere will make encoding-related mistakes
less likely in the future.

Fixes DescentDevelopers#483.

[1]: <https://stackoverflow.com/a/7950569/7593853>
[2]: <https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170#remarks>
[3]: <https://stackoverflow.com/a/52372115/7593853>
[4]: <https://googleprojectzero.blogspot.com/2016/02/the-definitive-guide-on-win32-to-nt.html>
[5]: <DescentDevelopers#475 (comment)>
[6]: <https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page#-a-vs--w-apis>
[7]: adf58ec (Explicitly declare execution character set, 2024-07-07)
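
The UTF-8 to UTF-16 step described in the message can be checked by hand with a small portable sketch. This is only an illustration, not the code path Windows or the C runtime actually uses. The names utf8_to_utf16 and check_commit_example are hypothetical, and the decoder assumes well-formed input and does no validation. It shows that interpreting the filename bytes as UTF-8 gives the code units c7 30, a3 30, bb 30, f3 30, c8 30 (little-endian) listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode a well-formed UTF-8 string into UTF-16 code
// units. Assumption: the input is valid UTF-8. Malformed input (overlong
// forms, stray continuation bytes, truncated sequences) is not detected.
std::u16string utf8_to_utf16(const std::string &in) {
  std::u16string out;
  std::size_t i = 0;
  while (i < in.size()) {
    const unsigned char lead = static_cast<unsigned char>(in[i++]);
    char32_t cp;
    int extra;  // number of continuation bytes that follow the lead byte
    if (lead < 0x80) {
      cp = lead;
      extra = 0;
    } else if ((lead & 0xE0) == 0xC0) {
      cp = lead & 0x1F;
      extra = 1;
    } else if ((lead & 0xF0) == 0xE0) {
      cp = lead & 0x0F;
      extra = 2;
    } else {
      cp = lead & 0x07;
      extra = 3;
    }
    // Each continuation byte contributes its low six bits.
    for (int k = 0; k < extra && i < in.size(); ++k, ++i) {
      cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
    }
    if (cp >= 0x10000) {
      // Code points outside the BMP become a surrogate pair.
      cp -= 0x10000;
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    } else {
      out.push_back(static_cast<char16_t>(cp));
    }
  }
  return out;
}

// Hypothetical check: the raw UTF-8 bytes from the commit message decode to
// the UTF-16 code units shown there (without the ff fe byte order mark).
void check_commit_example() {
  const std::string utf8 =
      "\xe3\x83\x87\xe3\x82\xa3\xe3\x82\xbb\xe3\x83\xb3\xe3\x83\x88"
      "3.txt";
  const std::u16string expected = u"\u30c7\u30a3\u30bb\u30f3\u30c8" u"3.txt";
  assert(utf8_to_utf16(utf8) == expected);
}
```

Calling check_commit_example() from a test or from main() exercises the comparison. The code page 932 case is not reproduced here, because that would need a Shift-JIS mapping table.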