Converting between Unicode and UTF-8

The conversion between Unicode and UTF-8 is easy. According to RFC 2044, "UTF-8, a transformation format of Unicode and ISO 10646" by François Yergeau, there is a simple conversion between the two codes, summarized as follows:

Unicode (hex)   Expanded as binary                          Sequence mask         UTF-8 octet sequence (binary)
0000 - 007F     0000 0000 0000 0000 - 0000 0000 0111 1111   0000 0000 0xxx xxxx   0xxxxxxx
0080 - 07FF     0000 0000 1000 0000 - 0000 0111 1111 1111   0000 0xxx xxxx xxxx   110xxxxx 10xxxxxx
0800 - FFFF     0000 1000 0000 0000 - 1111 1111 1111 1111   xxxx xxxx xxxx xxxx   1110xxxx 10xxxxxx 10xxxxxx
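
The table above can be turned into code fairly directly. The following is a minimal sketch in Python, not a full UTF-8 encoder: the function name encode_bmp is made up for this illustration, it covers only the 16-bit range shown in the table, and it does not reject surrogate code points (D800 - DFFF) the way a strict encoder would.

```python
def encode_bmp(code_point: int) -> bytes:
    """Encode one 16-bit code point (0x0000-0xFFFF) as UTF-8 octets.

    Hypothetical helper for illustration; it follows the three rows of
    the table and nothing more.
    """
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only 16-bit code points are covered by the table")

    if code_point <= 0x007F:
        # Row 1: 0xxxxxxx (7 payload bits, one octet)
        return bytes([code_point])

    if code_point <= 0x07FF:
        # Row 2: 110xxxxx 10xxxxxx (5 + 6 payload bits)
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])

    # Row 3: 1110xxxx 10xxxxxx 10xxxxxx (4 + 6 + 6 payload bits)
    return bytes([
        0xE0 | (code_point >> 12),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])
```

For instance, encode_bmp(0xAC00).hex() gives "eab080", the same three octets that Python's built-in "\uAC00".encode("utf-8") produces.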

Therefore, even if a system is not capable of displaying control characters, it can use a maximum of three 8-bit characters (octets) to represent a normal 16-bit wide character.

For example, the UTF-8 octet sequence EA B0 80 => 1110 1010 1011 0000 1000 0000 => Unicode bit sequence 1010 1100 0000 0000 => U+AC00 = 가, which is the first character in the Hangul Syllables block.
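
The reverse direction of this worked example can be sketched the same way: strip the marker bits from each octet and join the remaining payload bits. This is an illustrative sketch only. The name decode_bmp is hypothetical, it handles just the one-, two-, and three-octet forms, and it does not check for overlong encodings as a production decoder should.

```python
def decode_bmp(octets: bytes) -> int:
    """Decode a single UTF-8 sequence of 1-3 octets into a code point.

    Hypothetical helper for illustration; only the three forms for
    16-bit code points are handled.
    """
    if not octets:
        raise ValueError("empty input")

    first = octets[0]
    if (first & 0x80) == 0x00:
        # 0xxxxxxx: one octet, 7 payload bits
        length, value = 1, first
    elif (first & 0xE0) == 0xC0:
        # 110xxxxx: two octets, 5 payload bits in the lead octet
        length, value = 2, first & 0x1F
    elif (first & 0xF0) == 0xE0:
        # 1110xxxx: three octets, 4 payload bits in the lead octet
        length, value = 3, first & 0x0F
    else:
        raise ValueError("lead octet is outside the 1-3 octet forms")

    if len(octets) != length:
        raise ValueError("wrong number of octets for this lead octet")

    for octet in octets[1:]:
        # Every following octet must look like 10xxxxxx.
        if (octet & 0xC0) != 0x80:
            raise ValueError("invalid continuation octet")
        value = (value << 6) | (octet & 0x3F)

    return value


# The example from the text: EA B0 80 decodes to U+AC00.
code_point = decode_bmp(bytes.fromhex("EAB080"))
print(f"U+{code_point:04X}")
```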

 



(c) Copyright the Mimosa Pudica Club, 1998.
Created at: 1998-03-14 17:30
Last Updated: 2005-07-21 00:08:05 -0400
Version: 1.1.0