Part 2 CHARACTER ENCODING: How do computers deal with multiple languages? by Tze Wei Sim 沈志偉
[email protected]
Contents
• Basic Computing Knowledge: Binary, Decimal and Hexadecimal Numbers
• Unicode Character Set
• Character Encoding
• Language Input Software
• Fonts
• Glyphs
Data Communication
• In order for computers to understand each other, they have to speak and understand the same language.
• In computing terms, they must have the same encoding (speaking) and decoding (understanding) protocol.
Data Communication
• Every time we press a key on a keyboard, it generates a sequence of high and low voltages which represent binary numbers.
• These sequences of data are saved in memory or transmitted to another computer via a network.
• In order for the recipient to understand (decode) what the sender said (encoded), both of them must have the same understanding (encoding) of what that string of binary numbers means.
Numeral Systems
• Computer data is represented in binary numbers (the base-2 numeral system), as opposed to the decimal numbers (base-10 numeral system) we use in daily life.

Decimal Numbers | Binary Numbers | Hexadecimal Numbers
0  | 0     | 0
1  | 1     | 1
2  | 10    | 2
3  | 11    | 3
4  | 100   | 4
5  | 101   | 5
6  | 110   | 6
7  | 111   | 7
8  | 1000  | 8
9  | 1001  | 9
10 | 1010  | A
11 | 1011  | B
12 | 1100  | C
13 | 1101  | D
14 | 1110  | E
15 | 1111  | F
16 | 10000 | 10
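A quick way to reproduce the table above, as a minimal sketch in Python using its built-in base conversions:

```python
# Print the decimal, binary, and hexadecimal forms of 0-16 side by side.
for n in range(17):
    print(f"{n:>7}  {n:>6b}  {n:>3X}")
```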
Common Character Sets
• ASCII (American Standard Code for Information Interchange)
  - originally based on the English language; encodes 128 characters: the numbers 0-9, the letters a-z and A-Z, some basic punctuation symbols, and some control codes
  - all stored in 7 binary digits (bits)

Keys      | Binary Representation | Decimal Number
A         | 1000001 | 65
B         | 1000010 | 66
C         | 1000011 | 67
!         | 0100001 | 33
?         | 0111111 | 63
$         | 0100100 | 36
Backspace | 0001000 | 8
Escape    | 0011011 | 27
Delete    | 1111111 | 127
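The printable rows of this table can be checked with Python's built-in ord(), which returns a character's code; a minimal sketch:

```python
# ord() gives the character code; the 07b format shows its 7-bit pattern.
for ch in ["A", "B", "C", "!", "?", "$"]:
    print(f"{ch}  {ord(ch):07b}  {ord(ch):>3}")
```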
Common Character Sets
• Most early computers kept data in an 8-bit byte system. With an 8-bit byte, not only is it possible to store every possible ASCII character, but there is also one whole bit to spare.
• Note: A byte is the smallest addressable unit of memory in many computer architectures.
• Because bytes have room for up to eight bits, many people had their own ideas of what should go where in the space from 10000000₂ (128₁₀) to 11111111₂ (255₁₀).
• For example, on some American PCs the character code 10000010₂ (130₁₀) would display as é, but on computers in Israel it was the Hebrew letter Gimel (ג), so when Americans sent their résumés to Israel they arrived as rגsumגs.
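This clash can still be reproduced with Python's codecs: the same byte means different things under the American (CP437) and Hebrew (CP862) code pages. A minimal sketch:

```python
# One byte, two meanings: 130 is é under CP437 but Gimel under CP862.
b = bytes([130])
print(b.decode("cp437"))  # é
print(b.decode("cp862"))  # ג
```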
Common Character Sets
• Unicode
  - A group of ambitious people came up with the idea of creating a single character set that includes every reasonable writing system in the world, covering 110,181 characters from the world's alphabets, ideograph sets, and symbol collections.
  - አማርኛ (Amharic), தமிழ் (Tamil), and even old characters which are not commonly used anymore, such as Baybayin, the old Filipino writing system, and 𡨸喃 (Chữ nôm), the old Vietnamese characters, are assigned binary codes (aka code points) to prevent confusion between computers.
Unicode
• The code assigned to a specific character in the Unicode Standard is called a code point.
• The binary number for a character can be very long. The Chinese character 𤭢 is represented by the binary number 100100101101100010 (150370₁₀).
• Note: To make the code point more concise, it is expressed in the format U+hexadecimal number. Thus, U+24B62.
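In Python, ord() and string formatting reproduce all three notations for 𤭢; a minimal sketch:

```python
cp = ord("𤭢")
print(cp)             # 150370
print(f"{cp:b}")      # 100100101101100010
print(f"U+{cp:04X}")  # U+24B62
```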
UTF-8 Encoding
• The string of bits has to be encoded and segmented into several 8-bit bytes in order to be stored in computer memory, transmitted across communication networks, and deciphered correctly by other computers.
• UTF-8 is an encoding method widely used on the internet and increasingly used as the default character encoding in operating systems, programming languages, and software applications.

First Code Point | Last Code Point | No. of Bits | No. of Bytes Required | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte | 5th Byte | 6th Byte
U+0000    | U+007F     | 7 bits  | 1 | 0xxxxxxx
U+0080    | U+07FF     | 11 bits | 2 | 110xxxxx | 10xxxxxx
U+0800    | U+FFFF     | 16 bits | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx
U+10000   | U+1FFFFF   | 21 bits | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
U+200000  | U+3FFFFFF  | 26 bits | 5 | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
U+4000000 | U+7FFFFFFF | 31 bits | 6 | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
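The table translates directly into a range check; a minimal sketch in Python (modern UTF-8 is restricted to the first four rows, but the 5- and 6-byte forms of the original design are included here to match the table):

```python
def utf8_length(cp: int) -> int:
    """Number of bytes needed to encode a code point, per the table above."""
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for nbytes, last in enumerate(limits, start=1):
        if cp <= last:
            return nbytes
    raise ValueError("code point out of range")

print(utf8_length(ord("A")))  # 1
print(utf8_length(0x0627))    # 2 (Alif)
print(utf8_length(0x24B62))   # 4 (𤭢)
```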
UTF-8 Encoding
To encode the Chinese character 𤭢, which is represented by the binary number 100100101101100010, the encoder performs the following protocol:
1. Since the code point is between U+10000 and U+1FFFFF, it will take 4 bytes to encode:
   No. of Bits | No. of Bytes Required | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte
   21 bits | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx
2. Three leading zeros are added in front of 100100101101100010 to make it 000100100101101100010, so it fills up all the variable x bits.
3. The character is now made up of 4 bytes (32 bits), ready to be saved and transmitted to another computer: 11110000 10100100 10101101 10100010
Note: This lengthy binary number can be written concisely in hexadecimal: F0 A4 AD A2
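The same three steps can be written out bit by bit and checked against Python's built-in encoder; a minimal sketch:

```python
cp = 0x24B62  # code point of 𤭢

# Pack the 21 payload bits into the 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx frame.
b1 = 0b11110000 | (cp >> 18)           # top 3 bits
b2 = 0b10000000 | ((cp >> 12) & 0x3F)  # next 6 bits
b3 = 0b10000000 | ((cp >> 6) & 0x3F)   # next 6 bits
b4 = 0b10000000 | (cp & 0x3F)          # bottom 6 bits

print(" ".join(f"{b:02X}" for b in (b1, b2, b3, b4)))  # F0 A4 AD A2
print("𤭢".encode("utf-8").hex())                      # f0a4ada2
```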
Decoding
• When the recipient receives the 32-bit string 11110000 10100100 10101101 10100010, the decoder strips the prefix bits (the leading 11110 and the 10 at the start of each continuation byte), recovering the original 21-bit binary number 100100101101100010.
• It is now ready to be opened by a computer program equipped with a font, which can render the 21-bit binary number as a picture (better known as a glyph in typography).
• There are other encoding methods, such as UTF-16 and UTF-32, to suit different types of computer architectures.
Why do computers need encoding and decoding? So that the receiver can make sense of the seemingly random signal: it knows a new character is being received when it detects a leading byte pattern such as 11110.
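Decoding is the reverse masking operation; a minimal sketch that strips the prefix bits by hand and compares the result with Python's built-in decoder:

```python
data = bytes([0b11110000, 0b10100100, 0b10101101, 0b10100010])

# Keep 3 payload bits from the lead byte and 6 from each continuation byte.
cp = ((data[0] & 0b00000111) << 18) | ((data[1] & 0b00111111) << 12) \
   | ((data[2] & 0b00111111) << 6) | (data[3] & 0b00111111)

print(f"{cp:b}")             # 100100101101100010
print(chr(cp))               # 𤭢
print(data.decode("utf-8"))  # 𤭢 (the built-in decoder agrees)
```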
Language Input Software
• Typing the English language is relatively straightforward in computing. The keyboard generates the binary number 1000001₂ (65₁₀, 41₁₆) when "A" is pressed.
• To type non-English languages, the computer needs language input software that converts the 7-bit ASCII binary number to a Unicode binary number.
• To type the Arabic Alif (ا), as in العربية, we essentially press the "h" key. The keyboard generates the binary number 1001000₂ (72₁₀, 48₁₆). The language input software then converts 1001000₂ (72₁₀, 48₁₆) to 11000100111₂ (1575₁₀, 627₁₆).
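At its core, input software is a lookup from keystrokes to code points. A toy sketch (the one-key mapping below is an illustrative assumption, not a real keyboard layout):

```python
# Hypothetical one-key layout: the "h" key (ASCII 72, i.e. "H") maps to Alif.
key_to_codepoint = {"H": 0x0627}

key = "H"
print(f"ASCII in:    {ord(key):07b} ({ord(key)})")    # 1001000 (72)
cp = key_to_codepoint[key]
print(f"Unicode out: {cp:b} (U+{cp:04X} {chr(cp)})")  # 11000100111 (U+0627 ا)
```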
Language Input Software
To encode the Arabic Alif (ا), as in العربية, which is represented by the 11-bit binary number 11000100111, the encoder performs the following protocol:
1. Since the code point is between U+0080 and U+07FF, it will take 2 bytes to encode:
   No. of Bits | No. of Bytes Required | 1st Byte | 2nd Byte
   11 bits | 2 | 110xxxxx | 10xxxxxx
2. The 11-bit binary number fills up all the variable x bits.
3. The character is now made up of 2 bytes (16 bits), ready to be saved and transmitted to another computer: 11011000 10100111
Note: This binary number can be written concisely in hexadecimal: D8 A7
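The two-byte case is the same masking with a shorter frame; a minimal sketch:

```python
cp = 0x0627  # Alif, 11 bits: 11000100111

b1 = 0b11000000 | (cp >> 6)    # 110xxxxx <- top 5 bits
b2 = 0b10000000 | (cp & 0x3F)  # 10xxxxxx <- bottom 6 bits

print(f"{b1:02X} {b2:02X}")            # D8 A7
print("\u0627".encode("utf-8").hex())  # d8a7
```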
Font
• A font is a file that maps strings of binary data to designated pictorial glyphs to be shown on a computer screen.
• The most common font types are:
  1. OpenType Fonts
  2. TrueType Fonts
  3. PostScript Fonts
• Fonts can be developed with software such as Fontlab, Adobe FDK, RoboFont, Glyphs, and DTL Font Master.
Font
• Fonts are files kept in Universal Type Client or your Font Book on the Mac, and in the Fonts folder on Windows.
Font vs. Glyph
• A font file contains a collection of glyph (picture) files, each assigned a number.
• A glyph is the design of a character, a symbol, or even an object.
Character vs. Glyph
• Unicode has the principle of assigning a code point to a character, not a glyph. Both a's below are the same character but different glyphs (designs); both glyphs share the same code point.
• Unicode leaves the design of glyphs to type designers, so type designers have the liberty to decide the appearance of glyphs.
• Things get more complicated when the definitions of a character and a glyph are not clear-cut, especially in regions where logographic writing systems are used, i.e. China, Japan, Korea, and Vietnam (abbr. CJKV).
• Some glyphs are considered the same character and share the same code point. For example, the CJKV character 突 (abrupt) and its variant form used to share the code point U+7A81; the variant form was later added to Unicode and given its own code point, U+2592E.
• But some glyphs are considered different characters and have different code points, such as the similar-looking characters for "to listen" and "to hit".
Which Font is Better?
• Well-developed fonts usually have many glyphs and are therefore able to support many languages, e.g. Arial Unicode MS.
• Less-developed fonts have fewer glyphs and are therefore less versatile in coping with different languages, e.g. MHeiHK-S.
Data Processing
• This is the usual process of encoding a non-ASCII character:
  Keyboard → Language Input Software → UTF-8 Encoder (Computer A) → UTF-8 Decoder → Unicode Font (Computer B)
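The whole pipeline can be sketched in a few lines of Python, reusing the Alif example from earlier:

```python
cp = 0x0627                     # Computer A: input software emits a code point
wire = chr(cp).encode("utf-8")  # Computer A: UTF-8 encoder
print(wire.hex())               # d8a7, the bytes sent over the network
text = wire.decode("utf-8")     # Computer B: UTF-8 decoder
print(text)                     # ا, which a Unicode font then renders as a glyph
```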
Input Software-induced Disconcordance
• Some language input developers prefer to use U+807C (a rare character) over U+807D (the more common one), e.g. Microsoft Pinyin New Experience Input Style:
  Keyboard → Language Input Software (807C₁₆ or 807D₁₆) → UTF-8 Encoder (Computer A) → UTF-8 Decoder → Unicode Font (Computer B)
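The two variants really are distinct code points with distinct byte sequences, so the recipient sees whichever one the input software chose; a minimal check:

```python
# U+807C and U+807D encode to different UTF-8 byte sequences.
for cp in (0x807C, 0x807D):
    print(f"U+{cp:04X}", chr(cp), chr(cp).encode("utf-8").hex())
```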
Font-induced Disconcordance
• This is the usual computing process to type the Arabic Alif (ا):
  Keyboard (48₁₆ ASCII) → Language Input Software (627₁₆ Unicode) → UTF-8 Encoder → D8 A7₁₆ (Computer A) → D8 A7₁₆ → UTF-8 Decoder → 627₁₆ Unicode → Unicode Font (Computer B)
• However, some font developers skip the language input software and UTF-8 encoding by creating non-Unicode fonts, e.g. Kruti Dev 010, MHeiHK-S:
  Keyboard (48₁₆ ASCII) → Non-Unicode Font
Non-Unicode Fonts
• Some font developers create fonts that assign characters to code points that have already been taken by other characters.
• These fonts are called non-Unicode fonts. Text typed with these fonts cannot be read correctly with other fonts.