SMS Alphabets
A short message (SM) may be written in several alphabets/encodings. That means that a SM may transport a different number of "text characters" depending on the encoding being used. A short message may be encoded in:
- 7 bit GSM default alphabet
- 8 bit (unspecified)
- 16 bit UCS-2 alphabet
See the old standard "GSM 03.38":http://ftp.3gpp.org/specs/html-info/0338.htm or the more recent "3GPP 23.038":http://ftp.3gpp.org/specs/html-info/23038.htm.
Btw, the encoding is specified within the GSM PDU in a field named DCS (data coding scheme).
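As a rough illustration of how the DCS selects the alphabet, here is a minimal sketch that reads the alphabet bits of a DCS octet. It only handles the "general data coding" group of 3GPP 23.038 (top two bits 00), where bits 3 and 2 carry the alphabet; every other coding group is deliberately rejected. The function name and the returned labels are made up for this example.

```python
def alphabet_from_dcs(dcs: int) -> str:
    """Return the alphabet announced by a DCS octet (general data coding group only)."""
    if dcs & 0xC0 != 0x00:
        # Other coding groups (message waiting indication, 1111 xxxx, ...)
        # are out of scope for this sketch.
        raise ValueError("only the general data coding group (00xx xxxx) is handled")
    # Bits 3..2: 00 = GSM 7 bit default alphabet, 01 = 8 bit data, 10 = UCS-2.
    labels = ("gsm7", "8bit", "ucs2", "reserved")
    return labels[(dcs >> 2) & 0x03]
```

For instance, the commonly seen DCS values 0x00, 0x04 and 0x08 map to the 7 bit, 8 bit and UCS-2 alphabets respectively.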
In the 7 bit GSM default alphabet, one SM can encode 160 characters of 7 bits each. Whenever the SM is a long message and requires concatenation, each part may encode 153 characters. The GSM default alphabet is very similar to ASCII but not identical. It has the concept of extensions, in order to support a broader range of characters, such as the euro sign (a character from the extension table takes two 7 bit positions instead of one).
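The 160/153 limits above can be turned into a small length calculator. The sketch below assumes every character of the text already belongs to the GSM default alphabet or its extension table (it does no validation), counts extension characters as two 7 bit positions, and ignores the corner case where an escape sequence would be split across two parts. The names @septet_count@ and @parts_needed@ are invented for this example.

```python
# Printable characters of the GSM 7 bit extension table; each one costs two
# septets (the escape septet 0x1B followed by the character code).
GSM7_EXTENSION = set("^{}\\[~]|\u20ac")

SINGLE_LIMIT = 160  # septets available in a single SM
PART_LIMIT = 153    # septets available in each part of a concatenated SM


def septet_count(text: str) -> int:
    """Septets needed for text, assuming all characters are in the GSM alphabet."""
    return sum(2 if ch in GSM7_EXTENSION else 1 for ch in text)


def parts_needed(text: str) -> int:
    """Number of SMs needed to carry text in the 7 bit alphabet."""
    total = septet_count(text)
    if total <= SINGLE_LIMIT:
        return 1
    return -(-total // PART_LIMIT)  # ceiling division
```

A text of 160 plain characters still fits in one SM, while 161 characters already need two parts of up to 153 characters each.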
The 8 bit encoding is normally used only in binary SMs (e.g. legacy ringtones, WAP push). Although it may be used with text, there is no way to declare which charset the bytes represent. For example, ISO-8859-1 differs from ISO-8859-15, so the same byte value may be shown as two different characters on two different handsets. Typically one of three things happens: a) the text is displayed in the charset (the language) currently active in the phone, b) the text is not displayed at all, or c) the text is all messed up.
Therefore, this encoding should not be used when encoding text messages.
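The ISO-8859-1 versus ISO-8859-15 ambiguity is easy to reproduce. This small sketch decodes the same single byte with both charsets, using Python's built-in codecs; the variable names are arbitrary.

```python
raw = bytes([0xA4])  # one 8 bit value, as it would travel in the SM

as_latin1 = raw.decode("iso-8859-1")   # U+00A4, the generic currency sign
as_latin9 = raw.decode("iso-8859-15")  # U+20AC, the euro sign

# Same byte, two different characters, depending on the handset's charset.
print(repr(as_latin1), repr(as_latin9))
```

Since nothing in the message says which of the two interpretations is intended, the receiving handset can only guess.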
As mentioned here, http://en.wikipedia.org/wiki/UTF-16/UCS-2, UCS-2 is very similar to UTF-16. UCS-2 is used by default on some current smartphones. The main reason for this is that the problems of supporting "strange" characters mostly disappear. Each character is represented by 2 bytes, so an SM may encode only 70 characters. If it's a long message, then each part may encode only 67 characters. Finally, the UCS-2 encoding allows an initial character that indicates whether the bytes of the wide characters are in big-endian or little-endian order. This character is named BOM (byte order mark). Some phones have buggy implementations of UCS-2 and wrongly try to display this character, typically as a square box.
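A minimal sketch of the UCS-2 handling described above, using Python's UTF-16 codecs as a stand-in for UCS-2. It assumes big-endian byte order when no BOM is present and refuses characters outside the Basic Multilingual Plane, since those would need UTF-16 surrogate pairs. The function names are made up for this example.

```python
import codecs


def encode_ucs2(text: str) -> bytes:
    """Encode text as big-endian UCS-2, without a BOM."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character outside the BMP cannot be encoded in UCS-2")
    return text.encode("utf-16-be")


def decode_ucs2(data: bytes) -> str:
    """Decode UCS-2 bytes, consuming a leading BOM instead of displaying it."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        data = data[2:]
    return data.decode("utf-16-be")  # assumed default: big-endian
```

With 2 bytes per character, the 70 characters of a single SM occupy the same 140 bytes as 160 characters in the 7 bit alphabet.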
The following information just clarifies some points, such as IA5, which is referred to in some places.
IA5 is roughly the same as the GSM default alphabet, but each character is represented in 8 bits rather than 7. Note that IA5 != ASCII, because some characters in the GSM default alphabet differ from ASCII. IA5 is used, for example, in UCP when encoding the (text) MSG field in standard messages (UCP 01, 30, or 51 if using MT=3). Therefore, IA5 is JUST an easier way to represent the (7 bit) GSM default alphabet, but in 8 bits. This spares applications from implementing "complex" 7 bit packing when encoding messages. To be clear, it is NOT another alphabet besides the 7 bit GSM default alphabet, 8 bit raw or the UCS-2 alphabet, which are the ones used in the SMS PDU at low level.
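For reference, the 7 bit packing that the 8 bit representation avoids looks roughly like this. The sketch takes a sequence of already mapped GSM 7 bit values (one per list element, which is essentially what the 8 bit form carries) and packs them into octets, least significant bits first. The function name is invented, and mapping text to GSM code values is left out.

```python
from typing import Iterable


def pack_septets(septets: Iterable[int]) -> bytes:
    """Pack 7 bit values into octets, filling each octet from its low bits."""
    out = bytearray()
    acc = 0    # bit accumulator
    nbits = 0  # number of valid bits currently held in acc
    for septet in septets:
        acc |= (septet & 0x7F) << nbits
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # leftover bits, zero padded
    return bytes(out)


# "hello": lower-case Latin letters have the same values in GSM and ASCII.
packed = pack_septets([0x68, 0x65, 0x6C, 0x6C, 0x6F])
```

Here @packed@ holds the 5 octets E8 32 9B FD 06, and 160 septets pack into exactly 140 octets.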
Some info on "IA5":http://www.zytrax.com/tech/ia5.html.