The conversion between Unicode and UTF-8 is straightforward. According to RFC 2044, "UTF-8, a transformation format of Unicode and ISO 10646" by François Yergeau, there is a simple mapping between the two codes, summarized as follows:
| Unicode range (hex) | Range expanded as binary | Sequence mask | UTF-8 octet sequence (binary) |
|---|---|---|---|
| 0000 - 007F | 0000 0000 0000 0000 - 0000 0000 0111 1111 | 0000 0000 0xxx xxxx | 0xxxxxxx |
| 0080 - 07FF | 0000 0000 1000 0000 - 0000 0111 1111 1111 | 0000 0xxx xxxx xxxx | 110xxxxx 10xxxxxx |
| 0800 - FFFF | 0000 1000 0000 0000 - 1111 1111 1111 1111 | xxxx xxxx xxxx xxxx | 1110xxxx 10xxxxxx 10xxxxxx |
Therefore, even if a system is not capable of displaying control characters, it can use a maximum of three 8-bit characters to represent a normal 16-bit wide character.
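The table can be read as a recipe for encoding: pick the row by range, then distribute the bits of the code point over the `x` positions. The following is a minimal sketch in Python, assuming only the three rows shown above (code points up to FFFF); the function name `encode_bmp` is made up for illustration, and the sketch does not treat the surrogate range D800 - DFFF specially.

```python
def encode_bmp(code_point: int) -> bytes:
    """Encode one code point in the range 0000 - FFFF as UTF-8 octets.

    Hypothetical helper that follows the three table rows literally.
    """
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only code points 0000 - FFFF are covered by the table")
    if code_point <= 0x7F:
        # Row 1: 0xxxxxxx (7 payload bits, one octet)
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Row 2: 110xxxxx 10xxxxxx (5 + 6 payload bits)
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    # Row 3: 1110xxxx 10xxxxxx 10xxxxxx (4 + 6 + 6 payload bits)
    return bytes([
        0xE0 | (code_point >> 12),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])
```

For instance, `encode_bmp(0x41)` yields the single octet `41`, while `encode_bmp(0xAC00)` yields the three octets `EA B0 80`.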
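For example, the UTF-8 octet sequence EA B0 80 expands to 1110 1010 1011 0000 1000 0000. Stripping the marker bits (the leading 1110 and the two leading 10 pairs) leaves the Unicode bit sequence 1010 1100 0000 0000, which is U+AC00 (\uAC00) = 가, the first character in the Hangul Syllables block.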
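The reverse direction of this example can be sketched the same way: inspect the leading bits of the first octet to learn the sequence length, then concatenate the payload bits. This is an illustrative sketch only; the name `decode_bmp` is hypothetical, it handles just the one- to three-octet forms from the table, and it does not reject overlong encodings.

```python
def decode_bmp(octets: bytes) -> int:
    """Decode a single 1- to 3-octet UTF-8 sequence into a code point.

    Hypothetical helper; assumes the input holds exactly one sequence.
    """
    if not octets:
        raise ValueError("empty input")
    first = octets[0]
    if first & 0x80 == 0x00:
        # 0xxxxxxx
        expected, value = 1, first
    elif first & 0xE0 == 0xC0:
        # 110xxxxx
        expected, value = 2, first & 0x1F
    elif first & 0xF0 == 0xE0:
        # 1110xxxx
        expected, value = 3, first & 0x0F
    else:
        raise ValueError("lead octet is not a 1- to 3-octet form")
    if len(octets) != expected:
        raise ValueError("wrong number of octets for this lead octet")
    for octet in octets[1:]:
        if octet & 0xC0 != 0x80:
            raise ValueError("continuation octet must look like 10xxxxxx")
        # Append the 6 payload bits of each continuation octet.
        value = (value << 6) | (octet & 0x3F)
    return value
```

Calling `decode_bmp(bytes.fromhex("EAB080"))` returns `0xAC00`, matching the worked example.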
(c) Copyright the Mimosa Pudica Club, 1998. Created at: 1998-03-14 17:30. Last updated: 2005-07-21 00:08:05 -0400. Version: 1.1.0.