Sunday, June 20, 2021

character set vs encoding

A character set is a one-to-one mapping between a set of distinct integers and a set of written symbols. For example, define a new character set FOOBAR that maps the letters A, B, and C to the integers 1, 2, and 3, respectively. A character set is an abstract concept that exists only in the mind of the programmer: it says which number stands for which symbol, not how that number is stored, so computers do not directly manipulate character sets.
An encoding is a way of storing the characters of a character set as the 0s and 1s of computer memory. To implement FOOBAR support on a real computer, the most obvious encoding would represent one character per byte, writing each character's integer in the usual binary form. In this scheme, the string "AABC" would become:
00000001 00000001 00000010 00000011
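The FOOBAR scheme above is small enough to sketch in a few lines of Python. This is only an illustration of the one-character-per-byte idea: FOOBAR is the made-up character set from this post, and the names `FOOBAR_CHARSET`, `foobar_encode`, `foobar_decode`, and `as_bits` are invented here for the example.

```python
# Hypothetical FOOBAR character set from the text: A -> 1, B -> 2, C -> 3.
# This dict is the "character set": an abstract symbol <-> integer mapping.
FOOBAR_CHARSET = {"A": 1, "B": 2, "C": 3}
FOOBAR_REVERSE = {code: char for char, code in FOOBAR_CHARSET.items()}


def foobar_encode(text: str) -> bytes:
    """The "encoding": store each character's integer in one byte."""
    return bytes(FOOBAR_CHARSET[ch] for ch in text)


def foobar_decode(data: bytes) -> str:
    """Reverse the encoding: read one byte, look up one character."""
    return "".join(FOOBAR_REVERSE[b] for b in data)


def as_bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit binary strings."""
    return " ".join(f"{b:08b}" for b in data)


encoded = foobar_encode("AABC")
print(as_bits(encoded))        # 00000001 00000001 00000010 00000011
print(foobar_decode(encoded))  # AABC
```

The dictionary plays the role of the character set, and the two functions play the role of the encoding. A different encoding of the same character set (say, two bits per character) would change only the functions, not the dictionary.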
Shift JIS (a.k.a. SJIS) is an encoding of the JIS character sets, and it was the standard encoding for Japanese on Microsoft and Apple computers before the advent of Unicode. Its selling point is that, unlike EUC, it is backwards-compatible not only with ASCII but also with JIS X 0201, so Shift JIS can be used to encode both JIS X 0201 and JIS X 0208 (but not JIS X 0212). In particular, one-byte half-width katakana and punctuation are valid Shift JIS. Unfortunately, this compatibility comes at a cost: the first byte of every two-byte character has to be squeezed into the ranges that JIS X 0201 leaves free (0x81-0x9F and 0xE0-0xEF), and the second byte can fall in the ASCII range, which makes Shift JIS the messiest encoding of all.
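To make the byte layout concrete, here is a small sketch using Python's built-in `shift_jis` codec. The sample characters and the helper name `sjis_hex` are chosen here for illustration and are not part of any standard. It shows an ASCII letter and a half-width katakana taking one byte each, and full-width katakana taking two.

```python
def sjis_hex(ch: str) -> str:
    """Return the Shift JIS bytes of a string as uppercase hex."""
    return ch.encode("shift_jis").hex(" ").upper()


# Illustrative samples: (character, what it demonstrates)
samples = [
    ("A", "ASCII letter, one byte"),
    ("ｱ", "half-width katakana (JIS X 0201), one byte"),
    ("ア", "full-width katakana (JIS X 0208), two bytes"),
    ("ソ", "full-width katakana whose second byte is 0x5C"),
]

for ch, note in samples:
    print(f"{ch}: {sjis_hex(ch)}  ({note})")

# The second byte of "ソ" is 0x5C, the same value as the ASCII backslash,
# so software that scans byte-by-byte for backslashes can misread it.
assert 0x5C in "ソ".encode("shift_jis")
```

The last sample is one reason Shift JIS has a reputation for being messy: a program that treats the data as plain ASCII bytes can mistake the second half of a Japanese character for an escape character or a path separator.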

