Part 2 CHARACTER ENCODING: How do computers deal with multiple languages? by Tze Wei Sim 沈志偉
[email protected]
Contents
• Basic Computing Knowledge: Binary, Decimal and Hexadecimal Numbers
• Unicode Character Set
• Character Encoding
• Language Input Software
• Fonts
• Glyphs
Data Communication
• In order for computers to understand each other, they have to speak and understand the same language.
• In computing terms, they must have the same encoding (speaking) and decoding (understanding) protocol.
Data Communication
• Every time we press a key on a keyboard, it generates a sequence of high and low voltages which represent binary numbers.
• These sequences of data are saved in memory or transmitted to another computer via a network.
• In order for the recipient to understand (decode) what the sender said (encoded), both of them must have the same understanding (encoding) of what that string of binary numbers means.
Numeral Systems
• Computer data is represented in binary numbers (the base-2 numeral system), as opposed to the decimal numbers (base-10 numeral system) we use in daily life.

Decimal Numbers | Binary Numbers | Hexadecimal Numbers
0  | 0     | 0
1  | 1     | 1
2  | 10    | 2
3  | 11    | 3
4  | 100   | 4
5  | 101   | 5
6  | 110   | 6
7  | 111   | 7
8  | 1000  | 8
9  | 1001  | 9
10 | 1010  | A
11 | 1011  | B
12 | 1100  | C
13 | 1101  | D
14 | 1110  | E
15 | 1111  | F
16 | 10000 | 10
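A quick way to reproduce the table above, as a minimal sketch in Python using its built-in base conversions:

```python
# Print the decimal, binary, and hexadecimal forms of 0-16 side by side.
for n in range(17):
    print(f"{n:>7}  {n:>6b}  {n:>3X}")
```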
Common Character Sets
• ASCII (American Standard Code for Information Interchange)
  - originally based on the English language; encodes 128 characters: the numbers 0-9, the letters a-z and A-Z, some basic punctuation symbols, and some control codes
  - all stored in 7 binary digits (bits)

Keys      | Binary Representation | Decimal Number
A         | 1000001 | 65
B         | 1000010 | 66
C         | 1000011 | 67
!         | 0100001 | 33
?         | 0111111 | 63
$         | 0100100 | 36
Backspace | 0001000 | 8
Escape    | 0011011 | 27
Delete    | 1111111 | 127
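The printable rows of this table can be checked with Python's built-in ord(), which returns a character's code; a minimal sketch:

```python
# ord() gives the character code; the 07b format shows its 7-bit pattern.
for ch in ["A", "B", "C", "!", "?", "$"]:
    print(f"{ch}  {ord(ch):07b}  {ord(ch):>3}")
```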
Common Character Sets
• Most early computers kept data in an 8-bit byte system. With an 8-bit byte, not only is it possible to store every possible ASCII character, but there is also one whole bit to spare.
• Note: A byte is the smallest addressable unit of memory in many computer architectures.
• Because bytes have room for up to eight bits, many people had their own ideas of what should go where in the space from 10000000₂ (128₁₀) to 11111111₂ (255₁₀).
• For example, on some American PCs the character code 10000010₂ (130₁₀) would display as é, but on computers in Israel it was the Hebrew letter Gimel (ג), so when Americans sent their résumés to Israel they arrived as rגsumגs.
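This clash can still be reproduced with Python's codecs: the same byte means different things under the American (CP437) and Hebrew (CP862) code pages. A minimal sketch:

```python
# One byte, two meanings: 130 is é under CP437 but Gimel under CP862.
b = bytes([130])
print(b.decode("cp437"))  # é
print(b.decode("cp862"))  # ג
```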
Common Character Sets
• Unicode
  - A group of ambitious people came up with the idea of creating a single character set that includes every reasonable writing system in the world, covering 110,181 characters from the world's alphabets, ideograph sets, and symbol collections.
  - አማርኛ (Amharic), தமிழ் (Tamil), and even old characters which are not commonly used anymore, such as Baybayin, the old Filipino writing system, and 𡨸喃 (Chữ nôm), the old Vietnamese characters, are assigned binary codes (aka code points) to prevent confusion between computers.
Unicode
• The code assigned to a specific character in the Unicode Standard is called a code point.
• The binary number for a character can be very long. The Chinese character 𤭢 is represented by the binary number 100100101101100010 (150370₁₀).
• Note: To make the code point more concise, it is expressed in the format U+hexadecimal number. Thus, U+24B62.
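In Python, ord() and string formatting reproduce all three notations for 𤭢; a minimal sketch:

```python
cp = ord("𤭢")
print(cp)             # 150370
print(f"{cp:b}")      # 100100101101100010
print(f"U+{cp:04X}")  # U+24B62
```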
UTF-8 Encoding
• The string of bits has to be encoded and segmented into several 8-bit bytes in order to be stored in computer memory, transmitted across communication networks, and deciphered correctly by other computers.
• UTF-8 is an encoding method widely used on the internet and increasingly used as the default character encoding in operating systems, programming languages, and software applications.

First Code Point | Last Code Point | No. of Bits | No. of Bytes Required | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte | 5th Byte | 6th Byte
U+0000    | U+007F     | 7 bits  | 1 | 0xxxxxxx
U+0080    | U+07FF     | 11 bits | 2 | 110xxxxx | 10xxxxxx
U+0800    | U+FFFF     | 16 bits | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx
U+10000   | U+1FFFFF   | 21 bits | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
U+200000  | U+3FFFFFF  | 26 bits | 5 | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
U+4000000 | U+7FFFFFFF | 31 bits | 6 | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
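The table translates directly into a range check; a minimal sketch in Python (modern UTF-8 is restricted to the first four rows, but the 5- and 6-byte forms of the original design are included here to match the table):

```python
def utf8_length(cp: int) -> int:
    """Number of bytes needed to encode a code point, per the table above."""
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for nbytes, last in enumerate(limits, start=1):
        if cp <= last:
            return nbytes
    raise ValueError("code point out of range")

print(utf8_length(ord("A")))  # 1
print(utf8_length(0x0627))    # 2 (Alif)
print(utf8_length(0x24B62))   # 4 (𤭢)
```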
UTF-8 Encoding
To encode the Chinese character 𤭢, which is represented by the binary number 100100101101100010, the encoder performs the following protocol:
1. Since the code point is between U+10000 and U+1FFFFF, it will take 4 bytes to encode:
   No. of Bits | No. of Bytes Required | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte
   21 bits | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
2. Three leading zeros are added in front of 100100101101100010 to make it 000100100101101100010, so it fills up all the variable x bits.
3. The character is now made up of 4 bytes (32 bits), ready to be saved and transmitted to another computer: 11110000 10100100 10101101 10100010
Note: This lengthy binary number can be written concisely in hexadecimal: F0 A4 AD A2
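The same three steps can be written out bit by bit and checked against Python's built-in encoder; a minimal sketch:

```python
cp = 0x24B62  # code point of 𤭢

# Pack the 21 payload bits into the 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx frame.
b1 = 0b11110000 | (cp >> 18)           # top 3 bits
b2 = 0b10000000 | ((cp >> 12) & 0x3F)  # next 6 bits
b3 = 0b10000000 | ((cp >> 6) & 0x3F)   # next 6 bits
b4 = 0b10000000 | (cp & 0x3F)          # bottom 6 bits

print(" ".join(f"{b:02X}" for b in (b1, b2, b3, b4)))  # F0 A4 AD A2
print("𤭢".encode("utf-8").hex())                      # f0a4ada2
```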
Decoding
• When the recipient receives the 32-bit string 11110000 10100100 10101101 10100010, the decoder strips the prefix bits (the leading 11110 and the 10 at the start of each continuation byte), recovering the original 21-bit binary number 100100101101100010.
• It is now ready to be opened by a computer program equipped with a font, which can render the 21-bit binary number as a picture (better known as a glyph in typography).
• There are other encoding methods, such as UTF-16 and UTF-32, to suit different types of computer architectures.
Why do computers need encoding and decoding? So that the receiver can make sense of the seemingly random signal: it knows a new character is being received when it detects a leading byte pattern such as 11110.
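Decoding is the reverse masking operation; a minimal sketch that strips the prefix bits by hand and compares the result with Python's built-in decoder:

```python
data = bytes([0b11110000, 0b10100100, 0b10101101, 0b10100010])

# Keep 3 payload bits from the lead byte and 6 from each continuation byte.
cp = ((data[0] & 0b00000111) << 18) | ((data[1] & 0b00111111) << 12) \
   | ((data[2] & 0b00111111) << 6) | (data[3] & 0b00111111)

print(f"{cp:b}")             # 100100101101100010
print(chr(cp))               # 𤭢
print(data.decode("utf-8"))  # 𤭢 (the built-in decoder agrees)
```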
Language Input Software
• Typing the English language is relatively straightforward in computing. The keyboard generates the binary number 1000001₂ (65₁₀, 41₁₆) when "A" is pressed.
• To type non-English languages, the computer needs language input software that converts the 7-bit ASCII binary number to a Unicode binary number.
• To type the Arabic Alif (ا), as in العربية, we essentially press the "h" key. The keyboard generates the binary number 1001000₂ (72₁₀, 48₁₆). The language input software then converts 1001000₂ (72₁₀, 48₁₆) to 11000100111₂ (1575₁₀, 627₁₆).
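At its core, input software is a lookup from keystrokes to code points. A toy sketch (the one-key mapping below is an illustrative assumption, not a real keyboard layout):

```python
# Hypothetical one-key layout: the "h" key (ASCII 72, i.e. "H") maps to Alif.
key_to_codepoint = {"H": 0x0627}

key = "H"
print(f"ASCII in:    {ord(key):07b} ({ord(key)})")    # 1001000 (72)
cp = key_to_codepoint[key]
print(f"Unicode out: {cp:b} (U+{cp:04X} {chr(cp)})")  # 11000100111 (U+0627 ا)
```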
Language Input Software
To encode the Arabic Alif (ا), as in العربية, which is represented by the 11-bit binary number 11000100111, the encoder performs the following protocol:
1. Since the code point is between U+0080 and U+07FF, it will take 2 bytes to encode:
   No. of Bits | No. of Bytes Required | 1st Byte | 2nd Byte
   11 bits | 2 | 110xxxxx | 10xxxxxx
2. The 11-bit binary number fills up all the variable x bits.
3. The character is now made up of 2 bytes (16 bits), ready to be saved and transmitted to another computer: 11011000 10100111
Note: This binary number can be written concisely in hexadecimal: D8 A7
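The two-byte case is the same masking with a shorter frame; a minimal sketch:

```python
cp = 0x0627  # Alif, 11 bits: 11000100111

b1 = 0b11000000 | (cp >> 6)    # 110xxxxx <- top 5 bits
b2 = 0b10000000 | (cp & 0x3F)  # 10xxxxxx <- bottom 6 bits

print(f"{b1:02X} {b2:02X}")            # D8 A7
print("\u0627".encode("utf-8").hex())  # d8a7
```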
Font
• A font is a file that maps strings of binary data to designated pictorial glyphs to be shown on a computer screen.
• The most common font types are:
  1. OpenType Fonts
  2. TrueType Fonts
  3. PostScript Fonts
• Fonts can be developed with software such as Fontlab, Adobe FDK, RoboFont, Glyphs, and DTL Font Master.
Font
• Fonts are files kept in Universal Type Client or your Font Book on the Mac, and in the Fonts folder on Windows.
Font vs. Glyph
• A font file contains a collection of glyph (picture) files, each assigned a number.
• A glyph is the design of a character, a symbol, or even an object.
Character vs. Glyph
• Unicode has the principle of assigning a code point to a character, not a glyph. Both a's below are the same character but different glyphs (designs); both glyphs share the same code point.
• Unicode leaves the design of glyphs to type designers, so type designers have the liberty to decide the appearance of glyphs.
• Things get more complicated when the definitions of a character and a glyph are not clear-cut, especially in regions where logographic writing systems are used, i.e. China, Japan, Korea, and Vietnam (abbr. CJKV).
• Some glyphs are considered the same character and share the same code point. For example, the CJKV character 突 (abrupt) and its variant form used to share the code point U+7A81; the variant form was later added to Unicode and given its own code point, U+2592E.
• But some glyphs are considered different characters and have different code points, such as the similar-looking characters for "to listen" and "to hit".
Which Font is Better?
• Well-developed fonts usually have many glyphs and are therefore able to support many languages, e.g. Arial Unicode MS.
• Less-developed fonts have fewer glyphs and are therefore less versatile in coping with different languages, e.g. MHeiHK-S.
Data Processing
• This is the usual process of encoding a non-ASCII character:
  Keyboard → Language Input Software → UTF-8 Encoder (Computer A) → UTF-8 Decoder → Unicode Font (Computer B)
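The whole pipeline can be sketched in a few lines of Python, reusing the Alif example from earlier:

```python
cp = 0x0627                     # Computer A: input software emits a code point
wire = chr(cp).encode("utf-8")  # Computer A: UTF-8 encoder
print(wire.hex())               # d8a7, the bytes sent over the network
text = wire.decode("utf-8")     # Computer B: UTF-8 decoder
print(text)                     # ا, which a Unicode font then renders as a glyph
```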
Input Software-induced Disconcordance
• Some language input developers prefer to use U+807C (a rare character) over U+807D (the more common one), e.g. Microsoft Pinyin New Experience Input Style:
  Keyboard → Language Input Software (807C₁₆ or 807D₁₆) → UTF-8 Encoder (Computer A) → UTF-8 Decoder → Unicode Font (Computer B)
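The two variants really are distinct code points with distinct byte sequences, so the recipient sees whichever one the input software chose; a minimal check:

```python
# U+807C and U+807D encode to different UTF-8 byte sequences.
for cp in (0x807C, 0x807D):
    print(f"U+{cp:04X}", chr(cp), chr(cp).encode("utf-8").hex())
```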
Font-induced Disconcordance
• This is the usual computing process to type the Arabic Alif (ا):
  Keyboard (48₁₆ ASCII) → Language Input Software (627₁₆ Unicode) → UTF-8 Encoder → D8 A7₁₆ (Computer A) → D8 A7₁₆ → UTF-8 Decoder → 627₁₆ Unicode → Unicode Font (Computer B)
• However, some font developers skip the language input software and UTF-8 encoding by creating non-Unicode fonts, e.g. Kruti Dev 010, MHeiHK-S:
  Keyboard (48₁₆ ASCII) → Non-Unicode Font
Non-Unicode Fonts
• Some font developers create fonts that assign characters to code points that have already been taken by other characters.
• These fonts are called non-Unicode fonts. Text typed with these fonts cannot be read correctly with other fonts.