
Talk about the difference between UTF-8 and GB 2312.

UTF-8 variable-length character encoding

/view/25412.htm

UTF-8 is a variable-length character encoding for Unicode (also known as the universal code). It was created by Ken Thompson in 1992 and is now standardized as RFC 3629. The original design encoded a Unicode character in 1 to 6 bytes; RFC 3629 restricts UTF-8 to 1 to 4 bytes per character.

Advantages of UTF-8 encoding:

UTF-8 can be read and written quickly using bit masks and shift operations.

When comparing strings, strcmp() on the UTF-8 bytes and wcscmp() on the corresponding wide characters give the same ordering, which makes sorting easier.

The bytes FF and FE never appear in UTF-8 text, so they can be used to indicate UTF-16 or UTF-32 text (see BOM).

UTF-8 is byte-order independent. Its byte order is the same on all systems, so it does not actually need a BOM.
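The masking and shifting mentioned above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the function name `utf8_encode` is made up for this example, and it covers the 1 to 4 byte range allowed by RFC 3629.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using masks and shifts.

    Illustrative sketch only; in real code use str.encode("utf-8").
    """
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:            # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:           # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the sketch with Python's built-in encoder on a few characters.
for ch in ["A", "\u00e9", "\u554a", "\U0001F600"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")

# Sorting by UTF-8 bytes gives the same order as sorting by code points.
words = ["zebra", "\u554a", "apple", "\u00e9clair", "\U0001F600"]
assert sorted(words) == sorted(words, key=lambda w: w.encode("utf-8"))
```

The last check is the Python counterpart of the strcmp()/wcscmp() remark: Python compares `str` values by code point and `bytes` values by unsigned byte value.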

Disadvantages of UTF-8 encoding:

The number of bytes in UTF-8 text cannot be determined from the number of Unicode characters, because UTF-8 is a variable-length encoding.

Characters that need only one byte in an extended ASCII character set need two bytes in UTF-8. ISO Latin-1 is a subset of Unicode, but it is not a subset of UTF-8.

The 8-bit characters of UTF-8 may be filtered by mail gateways, because Internet mail was originally designed for 7-bit ASCII. UTF-7 was created for this reason.

More than 50% of the time, UTF-8 uses byte values of the form 100xxxxx in its representation, and existing implementations such as ISO 2022, 4873, 6429, and 8859 systems mistake them for C1 control codes. UTF-7.5 was created for this reason.
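A small Python sketch can make the first three points concrete. The helper names (`byte_lengths`, `is_valid_utf8`, `is_seven_bit_clean`) are hypothetical and exist only for this example; the codecs used are the ones that ship with Python.

```python
def byte_lengths(text: str) -> dict:
    """Character count versus encoded size in UTF-8 and ISO Latin-1."""
    return {
        "characters": len(text),
        "utf-8": len(text.encode("utf-8")),
        "latin-1": len(text.encode("latin-1")),
    }


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_seven_bit_clean(data: bytes) -> bool:
    """True if every byte fits in 7 bits, as old mail gateways expected."""
    return all(b < 0x80 for b in data)


plain = "cafe"
accented = "caf\u00e9"  # the last character is U+00E9

print(byte_lengths(plain))
print(byte_lengths(accented))

# The Latin-1 bytes of the accented word are not valid UTF-8.
print(is_valid_utf8(accented.encode("latin-1")))

# The UTF-8 bytes of the accented word are not 7-bit clean.
print(is_seven_bit_clean(accented.encode("utf-8")))
```

Both words have four characters, but only the plain ASCII one keeps the same byte count in UTF-8.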

GB 2312, or GB 2312-80, is the Chinese national standard for the simplified Chinese character set. Its full name is "Basic Set of Chinese Character Coded Character Set for Information Interchange", and it is also known simply as GB. It was promulgated by the State Administration of Standards of China and took effect on May 1, 1981. GB 2312 is used in mainland China; Singapore and other places also adopt it. Almost all Chinese-language systems and internationalized software in mainland China support GB 2312.

The GB 2312 standard contains a total of 6,763 Chinese characters: 3,755 level-1 characters and 3,008 level-2 characters. GB 2312 also contains 682 full-width characters, including Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters.

The appearance of GB 2312 basically met the needs of computer processing of Chinese characters; the characters it includes cover 99.75% of usage frequency in mainland China.

Partition representation

GB 2312 divides the characters it includes into areas, and each area contains 94 Chinese characters/symbols. This representation is also called the location code.

Areas 01-09 contain special symbols.

Areas 16-55 contain level-1 Chinese characters, sorted by pinyin.

Areas 56-87 contain level-2 Chinese characters, sorted by radical/stroke.

Areas 10-15 and 88-94 are not coded.

For example, the character 啊 ("ah") is the first Chinese character in GB 2312, and its location code is 1601 (area 16, position 01).
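The location code can be related to the bytes stored on disk with a short Python sketch. It relies on a detail not stated above: in the common byte encoding of GB 2312 (EUC-CN, which is what Python's `gb2312` codec implements), each of the two bytes is the area or position number plus 0xA0. The function names are made up for this example.

```python
def location_code(ch: str) -> str:
    """Return the 4-digit location code of a single GB 2312 character."""
    encoded = ch.encode("gb2312")
    if len(encoded) != 2:
        raise ValueError("expected one double-byte GB 2312 character")
    area = encoded[0] - 0xA0       # first byte carries the area number
    position = encoded[1] - 0xA0   # second byte carries the position
    return f"{area:02d}{position:02d}"


def from_location_code(code: str) -> str:
    """Return the character stored at a 4-digit location code."""
    area, position = int(code[:2]), int(code[2:])
    if not (1 <= area <= 94 and 1 <= position <= 94):
        raise ValueError("area and position must be between 01 and 94")
    return bytes([0xA0 + area, 0xA0 + position]).decode("gb2312")


ah = "\u554a"  # the first Chinese character in GB 2312
print(location_code(ah))
print(from_location_code("1601") == ah)

# The same character takes 2 bytes in GB 2312 but 3 bytes in UTF-8.
print(len(ah.encode("gb2312")), len(ah.encode("utf-8")))
```

This also shows one practical difference between the two encodings: GB 2312 stores each Chinese character in a fixed two bytes, while UTF-8 needs three bytes for the same character.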

Chinese coded character set for information interchange GB2312

/view/25492.htm