Mule: Multilingual text processing system

Kenichi HANDA

Electrotechnical Laboratory

1 Introduction

Suppose you are to write an article about places where English is not used, people who do not use English, or incidents which happen to such people in such places. In these situations, it is of great value to be able to use the characters and the orthography of the vernacular language, which avoids mutual misunderstanding and makes good communication possible.

A multilingual text processing system should follow the rules of writing, spelling and characters of each language it handles. Therefore, in designing such a system, we should first determine which languages the system should accept. Although there are 3000 or more written languages in the world and more than 100 character sets (a character set being a collection of characters used in a language) are registered with ISO, an individual user does not need all of them. We thus believe that designing a system which contains all necessary character sets from the outset is impractical because:

Our design principle, therefore, is to make the system flexible and easily extensible. We have avoided constructing a system able to handle all languages from the start. Instead, we built in only a small number of frequently used character sets. For the remaining character sets, the user can easily customize the system to handle whichever one is needed. This is achieved by supplying a framework for defining a character set. In addition, since various encoding mechanisms exist for each character set, the framework also defines the encoding mechanism (hereafter, coding-system) of the character set.

An actual text processing system, or an editor, makes use of the rules peculiar to each language to support text writing and editing. For instance, a Japanese sentence is written from left to right, or from top to bottom, and each new character is assumed to be inserted to the right of (or under) the previous one. We do not need to specify where to enter the next character while writing a Japanese sentence, because the editor decides the position based on this assumption. When we write Hebrew documents, which are written from right to left, an editor needs a different assumption about the writing direction, and that assumption determines the position of the inserted character. Different languages have different rules for text filling, too. In Japanese text we can begin a new line at any place, even in the middle of a word, but we cannot do this when filling sentences written in English. We have to hyphenate English words to split them across two lines, and even with hyphens we cannot split a syllable.

Each language has its own rules. They restrict characters, character strings, words and sentences to certain forms. The set of tools that use these language-specific rules to facilitate text generation is called the language environment. As we have done with character sets and coding-systems, we have designed the system to enable users to build up and customize the language environment for whichever language they need.

This paper describes the multilingual text processing system Mule. Mule (Multilingual Enhancement to GNU Emacs) is an extension of GNU Emacs. Mule handles multiple character sets and multiple language environments and provides the means to construct new ones.

In the following sections, we show how Mule is extended to meet multilingual requirements. Section 2 covers character sets and coding-systems, section 3 explains the input methods for multilingual characters, and section 4 describes how Mule treats character strings and words. These sections also show how users can customize the language environment for a new language.

Figure 1: Snapshot of Mule's buffer

2 Multiple character sets and coding-systems

2.1 Character sets

Figure 1 is a snapshot of editing work with Mule. The display shows ASCII characters, Japanese characters and many other characters from different languages. Every character that is processed by computers belongs to one character set. A character set is a collection of characters which are used in one language or one group of languages, and two steps are needed to specify a certain character: first you designate the character set which contains the character, and then you select one character in the set. That is, the fact that you see characters of many languages in Figure 1 means that Mule invokes various character sets and uses characters from them. Although you rarely have the chance or the need to consider differences between character sets when inputting a single language, switching character sets is indispensable in multilingual text processing.

Most character sets meet the technical requirements standardized in ISO 2022 and the format requirements in ISO 2375. Such character sets are registered with ECMA (European Computer Manufacturers Association). Some examples of registered character sets are "Japanese Character", "Thai character" and "Latin/Cyrillic character". Although Mule can potentially handle all character sets that conform to ISO 2022, those shown in Table 1 are supported by default.

As a matter of fact, character sets which do not conform to ISO 2022 (called private character sets) are widely used. Among them, Chinese Big5 and Vietnamese VISCII are supported by Mule by default. For the others, Mule offers a framework which systematically defines a character set and processes characters according to the definition, which enables users to add any private character set freely. A user only has to specify several attributes to define a character set. The basic attributes of character sets are:

Table 1: Character sets supported by default

Character set                                      Final character

94 character set
  ISO 646 USA (ASCII)                              4/2
  JIS X 0201 JAPANESE Kana                         4/9
  JIS X 0201 JAPANESE Roman                        4/10

96 character set
  Right half of ISO 8859-1 Latin alphabet No.1     4/1
  Right half of ISO 8859-2 Latin alphabet No.2     4/2
  Right half of ISO 8859-3 Latin alphabet No.3     4/3
  Right half of ISO 8859-4 Latin alphabet No.4     4/4
  Right half of ISO 8859-7 Greek alphabet          4/6
  Right half of ISO 8859-6 Arabic alphabet         4/7
  Right half of ISO 8859-8 Hebrew alphabet         4/8
  Right half of ISO 8859-5 Cyrillic alphabet       4/12
  Right half of ISO 8859-9 Latin alphabet No.5     4/13
  TIS 620-2533 Thai Character Set                  5/4

94x94 character set
  JIS C 6226-1978 Japanese                         4/0
  GB 2312-1980 Chinese                             4/1
  JIS X 0208-1983 Japanese                         4/2
  KS C 5601-1987 Korean                            4/3
  JIS X 0212 Japanese Supplement                   4/4
  CNS 11643 Set 1 Chinese                          4/7
  CNS 11643 Set 2 Chinese                          4/8
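
The column/row values listed with each character set (4/2, 4/9 and so on) are read here as ISO 2022 final characters, written in the standard's column/row notation. As an illustration only, the following Python sketch converts that notation into a byte and builds the ISO 2022 escape sequence which designates a character set. The function names are hypothetical and are not part of Mule, and the choice of G0 for 94 and 94x94 sets and G1 for 96 sets is an assumption made for the example.

```python
ESC = b"\x1b"


def final_byte(col_row: str) -> int:
    """Convert column/row notation such as "4/2" into a byte value (0x42)."""
    col, row = (int(part) for part in col_row.split("/"))
    return col * 16 + row


def designate(set_type: str, col_row: str) -> bytes:
    """Build an ISO 2022 designation sequence for one character set.

    Assumption for this sketch: 94 and 94x94 sets are designated to G0,
    96 sets to G1. Other G-elements use different intermediate bytes.
    """
    final = bytes([final_byte(col_row)])
    if set_type == "94":
        return ESC + b"(" + final      # ESC ( F
    if set_type == "96":
        return ESC + b"-" + final      # ESC - F
    if set_type == "94x94":
        # General form ESC $ ( F; a shorter form ESC $ F is also in use
        # for the oldest sets (finals 4/0, 4/1 and 4/2).
        return ESC + b"$(" + final
    raise ValueError(f"unknown set type: {set_type}")


if __name__ == "__main__":
    print(designate("94", "4/2"))      # ASCII
    print(designate("96", "4/1"))      # right half of ISO 8859-1
    print(designate("94x94", "4/3"))   # KS C 5601-1987
```

The sketch shows why a type and a final character are enough to identify a registered set: together they determine the escape sequence that announces the set in an ISO 2022 data stream.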

2.2 Coding-system

One character set can be encoded in various coding-systems. For example, the so-called JIS code, Japanese EUC code and MS Kanji code are different methods of encoding the same character set: Japanese Kanji. When processing characters, one encoding method is selected to fit the operating system or the application. Most computer systems use different coding-systems for communication over the network (electronic mail etc.) and for the internal representation of the OS or text files, because the demands on the coding-system are different. That is, in an everyday affair like receiving a mail, storing it in a file and displaying it on the screen, we have to convert between various coding-systems. Therefore, Mule lets users set independent coding-systems for file I/O, display on the screen, input from the keyboard and communication with processes.
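
To make the point concrete, the short Python sketch below encodes one Kanji in three ways and converts between them. It relies on Python's standard codecs `iso2022_jp`, `euc_jp` and `shift_jis`, which are assumed here to correspond to the JIS code, Japanese EUC code and MS Kanji code named above; it does not use or imitate Mule's own conversion routines, and `convert` is a hypothetical helper.

```python
def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode `data` from one coding-system to another via Unicode."""
    return data.decode(source).encode(target)


if __name__ == "__main__":
    kanji = "漢"
    # The same character yields a different byte sequence in each coding-system.
    for codec in ("iso2022_jp", "euc_jp", "shift_jis"):
        print(codec, kanji.encode(codec))

    # A mail received in JIS code can be stored in a file as EUC.
    mail_bytes = kanji.encode("iso2022_jp")
    file_bytes = convert(mail_bytes, "iso2022_jp", "euc_jp")
    print(file_bytes)
```

The bytes differ in every case although the character and its character set are the same, which is why an editor has to track the coding-system separately for each input and output channel.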

Among the coding-systems that conform to ISO 2022, Mule supports by default Japanese EUC code, Korean EUC code, Chinese EUC code and Compound Text. Mule also offers two encoding methods which are not standardized in ISO 2022: MS Kanji code and Big5 Chinese code.

For a new coding-system that Mule does not support by default, a user can define the coding-system with a function of Mule, and Mule processes the coding-system according to the definition. To define a coding-system, a user only has to specify several attributes which tell Mule how to convert external text into Mule's internal buffer representation. If the conversion rule does not conform to ISO 2022, a user can easily add his own conversion rule.

2.3 Internal representations of characters

Mule's buffers may contain characters which belong to different character sets. In coding-systems in general, however, the code for a certain character does not contain any information about the character set it belongs to; it only indicates a character on the assumption that one character set is fixed.

In order to handle characters from multiple character sets in one buffer, we have adopted a new internal representation of characters. We add information about the character set to the code for each character, so that we can tell the character set from the code. This supplemental information about the character set is called a leading character because it is added ahead of the original code. Each character set is assigned a one-byte leading character, and private character sets have one more byte to distinguish them. (ASCII characters are an exception and are distinguished by having no leading character.) With this framework, a character with the internal representation \222\264\301 is uniquely determined.

In short, Mule divides character sets into the following six types according to their original code length and whether or not they are private character sets. ('Type n-m' means that the original code length of the character set is n bytes and it is internally represented with m bytes.) That is, Mule needs two bytes more than the original code length to represent a character which belongs to a private character set.

Type 1-1: ASCII characters

Type 1-2: One byte characters except ASCII characters (e.g. ISO 8859-1, Latin-1)

Type 1-3: One byte characters of private character sets

Type 2-3: Two byte characters (e.g. JIS X 0208, Japanese)

Type 2-4: Two byte characters of private character sets

Type N: composite characters of variable code length
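
The following Python sketch illustrates how a buffer in such a representation can be split back into characters. It is a minimal sketch under stated assumptions, not Mule's implementation: the leading byte 0x92 is borrowed from the example \222\264\301 above and is assumed to mark a two-byte set, while the other leading byte values, the set names and the function `split_characters` are hypothetical. Composite characters (type N) are left out.

```python
# Hypothetical leading bytes for non-private sets: value -> (set name, code length).
LEADING = {
    0x81: ("one-byte-set", 1),   # type 1-2: leading byte + 1 code byte
    0x92: ("two-byte-set", 2),   # type 2-3: leading byte + 2 code bytes
}

# Hypothetical leading bytes shared by all private sets; a second byte
# then identifies the individual private set.
PRIVATE_ONE_BYTE = 0x9A          # type 1-3
PRIVATE_TWO_BYTE = 0x9C          # type 2-4


def split_characters(buf: bytes):
    """Split an internal buffer into (character_set, code_bytes) pairs."""
    chars = []
    i = 0
    while i < len(buf):
        lead = buf[i]
        if lead < 0x80:
            # Type 1-1: ASCII has no leading character.
            chars.append(("ascii", buf[i:i + 1]))
            i += 1
        elif lead in LEADING:
            name, length = LEADING[lead]
            chars.append((name, buf[i + 1:i + 1 + length]))
            i += 1 + length
        elif lead in (PRIVATE_ONE_BYTE, PRIVATE_TWO_BYTE):
            length = 1 if lead == PRIVATE_ONE_BYTE else 2
            set_id = buf[i + 1]
            chars.append((f"private-{set_id:#04x}", buf[i + 2:i + 2 + length]))
            i += 2 + length
        else:
            raise ValueError(f"unknown leading byte {lead:#04x} at offset {i}")
    return chars


if __name__ == "__main__":
    # "A" followed by the three-byte example from the text (octal 222 264 301).
    print(split_characters(b"A\222\264\301"))
```

Because the leading byte alone fixes both the character set and the number of bytes that follow, the buffer can be scanned character by character without any external state, which is the property the representation is designed to provide.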

3 Inputting methods

Mule employs the special internal representation described in section 2 in order to display, write to a file or send to other processes a mixture of characters from multiple character sets. Now the problem is how we can input these various characters. In many cases, a keyboard for the specific language is not available. Mule offers several systems which enable users to input the characters of a specific language from an ordinary ASCII keyboard. Here, we show the two main systems, Quail and Tamago.

3.1 Quail

Quail is a keyboard input translation system. Keyboard input translation is a process in which a sequence of one or more keyboard inputs is converted into one specific character. For instance, when we input Japanese Kana from an ASCII keyboard, the keyboard input "m" followed by the input "u" is translated into the single Kana "む". A translation rule receives a keyboard input sequence and converts it into one character. Many translation rules are required for the characters of one language. We call a set of translation rules for a specific language a "package". In one package, there may be only one rule for a given input sequence; that is, when a certain key input sequence has been entered, Quail can determine one character uniquely.

Quail offers packages (sets of translation rules) for more than 30 language groups. They cover Hangul, Chinese, languages written with Latin characters (English, French, Finnish, Esperanto, etc.), languages written with Cyrillic characters (Russian and Macedonian), Greek characters and Hebrew characters. Moreover, users can change the rules in a package or make a completely new one.

Figure 2: Inputting European by Quail

Figure 3: Inputting Hangul by Quail

Figures 2 and 3 show the input process with different packages. In Figure 2, the package for the Latin alphabet is used. The line beginning with Keystrokes: shows the actual key inputs, and they are translated into the line beginning with Inserted Text:. For example, the key "a" followed by an accent key is translated into an "a" with an accent.

Figure 3 shows how input is translated with the Hangul package. In this package, the translation rules are ambiguous because the number of key inputs translated into one Hangul character is not fixed and rules may be context dependent. The top line shows the ambiguity. The keyboard inputs "g", "gk" and "gks" can be translated into the characters in the next line respectively. As the key inputs "g" and "gk" may be part of other rules, their translations are shown underlined, which indicates that they are not yet determined.

The second and third lines show how context dependent translation rules are treated. A Quail package may contain context dependent rules. Because Quail gives priority to the rule with the longest key input, users have to delimit the key input with a space in order to apply other rules.
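
The behaviour described in this section can be sketched in a few lines of Python. The sketch assumes that a package can be modelled as a plain mapping from key sequences to characters; the rules in `KANA_RULES` are illustrative only and are not taken from an actual Quail package, and the functions `is_pending` and `translate` are hypothetical names.

```python
# Illustrative package: key sequence -> character (not an actual Quail package).
KANA_RULES = {"a": "あ", "mu": "む", "n": "ん", "na": "な"}


def is_pending(keys: str, rules: dict) -> bool:
    """True while `keys` is a proper prefix of a longer rule.

    This corresponds to a translation that is still undetermined
    (shown underlined in the figures).
    """
    return any(rule != keys and rule.startswith(keys) for rule in rules)


def translate(keys: str, rules: dict) -> str:
    """Translate a key sequence, always preferring the longest matching rule.

    A space ends the current key sequence, so a shorter rule can be
    applied where a longer one would otherwise win.
    """
    longest = max(len(rule) for rule in rules)
    out = []
    i = 0
    while i < len(keys):
        if keys[i] == " ":
            i += 1                      # delimiter: consumed, produces nothing
            continue
        for size in range(min(longest, len(keys) - i), 0, -1):
            chunk = keys[i:i + size]
            if chunk in rules:
                out.append(rules[chunk])
                i += size
                break
        else:
            out.append(keys[i])         # no rule applies: pass the key through
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(translate("na", KANA_RULES))    # longest rule "na" is applied
    print(translate("n a", KANA_RULES))   # the space forces "n" and "a" separately
    print(is_pending("n", KANA_RULES))    # "n" may still grow into "na"
```

Storing the rules as a mapping also enforces the uniqueness requirement stated above, since one key sequence can have only one entry in a package.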

Figure 4: Inputting Japanese by Tamago

Figure 5: Inputting Chinese by Tamago

3.2 Tamago

Tamago is an input system for Japanese and Chinese which uses a Kana-Kanji conversion system (jserver/Wnn) or a PinYin-Chinese conversion system (cWnn). The conversion takes two steps. First, Tamago translates keyboard input into Kana or PinYin just as Quail does. Next, it communicates with Wnn or cWnn through the network to convert them into Kanji or Chinese characters. Since these conversion systems have large dictionaries and grammar rules, a long Kana or PinYin sequence can be converted at once, which gives users a very convenient interface.

Figures 4 and 5 are snapshots of Japanese and Chinese character input using Tamago. In these examples, the conversion server returned the correct text the first time. When the returned text is not appropriate, however, a user can easily select a different candidate. Through this interaction with the user, the server gradually learns the correct conversions.

4 Characters and Words

The definitions of a character and a word differ from language to language. That is why Mule enables users to define both.

First, let's think about the character. It is convenient for editing work to divide characters into two classes: those that constitute a word, like "a", "b" or "c", and those that cannot be part of a word, such as the space and parentheses. For instance, because a word is delimited with spaces in some languages including English, when editing a text written in such languages we can find the beginning or the end of a word by searching for characters which cannot be part of a word. Therefore, GNU Emacs divides characters into classes, and the characters of each class are assigned a value called character syntax.

Users cannot change or define character syntax in GNU Emacs. Mule thus offers a new method of classifying characters, called the character category. Users can freely define or change the character category of characters. Table 2 shows the default character categories. These default categories are meant mainly to identify the character set a character belongs to. With these classifications, many useful functions for multilingual text editing can be composed. For example, users can search for a Hangul or a Hebrew character in a multilingual document.
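
As a rough illustration of a user-definable classification, the Python sketch below keeps a category table and searches a mixed text for the first character of a given category. It is only a sketch: the one-letter mnemonics, the function names and the use of Unicode code point ranges are all assumptions of the example and do not reproduce Mule's category table or its internal character codes.

```python
# Hypothetical category table: mnemonic -> list of (first, last) code points.
categories = {
    "h": [(0xAC00, 0xD7A3)],   # Hangul syllables
    "w": [(0x05D0, 0x05EA)],   # Hebrew letters
}


def define_category(mnemonic: str, ranges) -> None:
    """Define a new category or replace an existing one."""
    categories[mnemonic] = list(ranges)


def in_category(ch: str, mnemonic: str) -> bool:
    """True if the character `ch` belongs to the category `mnemonic`."""
    code = ord(ch)
    return any(first <= code <= last for first, last in categories[mnemonic])


def search_category(text: str, mnemonic: str, start: int = 0) -> int:
    """Return the index of the first character of the category, or -1."""
    for index in range(start, len(text)):
        if in_category(text[index], mnemonic):
            return index
    return -1


if __name__ == "__main__":
    # A user adds a category for Greek letters.
    define_category("g", [(0x0370, 0x03FF)])
    sample = "abc \ud55c \u03b1"          # Latin, one Hangul, one Greek letter
    print(search_category(sample, "h"))
    print(search_category(sample, "g"))
```

Because the table is ordinary data, adding a category for a new language requires no change to the search function, which mirrors the extensibility argued for in this section.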

In text editing, a word is a practical unit of work. Deleting a word or moving the cursor to the next word is often more useful than deleting a character or moving the cursor one character forward. Which character strings should be treated as a word, however, differs from language to language. Although spaces delimit words in many languages and GNU Emacs has adopted this as the definition of a word, we cannot rely on spaces in Japanese or Chinese text.

Thus we have designed Mule so that users can define a word using a regular expression. A regular expression is a way to specify a generic character sequence, and users define a word by specifying the character categories of the characters in the sequence. Basic editing commands like "forward-word" or "delete-word" may be set to use the new definition of "word", so editing commands can be customized to fit each language. For instance, in editing Japanese, "a Hiragana sequence following one or more Chinese characters" is a good definition of the "word" (actually a pseudo-"bunsetsu"), and editing commands using this definition are more efficient than those using the ordinary GNU Emacs style "word" definition.
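
The Japanese example can be written down as a regular expression. The Python sketch below is an illustration under assumptions: it uses Unicode ranges for Chinese characters and Hiragana in place of Mule's character categories, and `forward_word` and `split_words` are hypothetical stand-ins for editing commands, not Mule functions.

```python
import re

KANJI = "\u4e00-\u9fff"        # Chinese characters (CJK unified ideographs)
HIRAGANA = "\u3040-\u309f"

# One or more Chinese characters followed by a (possibly empty) Hiragana run.
WORD = re.compile(f"[{KANJI}]+[{HIRAGANA}]*")


def forward_word(text: str, pos: int) -> int:
    """Return the position just after the next word at or after `pos`."""
    match = WORD.search(text, pos)
    return match.end() if match else len(text)


def split_words(text: str):
    """Return all words in `text` according to the regular expression."""
    return WORD.findall(text)


if __name__ == "__main__":
    sentence = "私は本を読む"
    print(split_words(sentence))
    print(forward_word(sentence, 0))
```

Changing the definition of a word then amounts to replacing the pattern, while commands built on `forward_word` stay the same.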

5 Conclusion

We have described the multilingual text processing system Mule. We have designed Mule as a flexible multilingual system to which users can add new functions or new languages as necessary.

In Mule, the language environment is easily extensible to fit new languages and new needs. Users can add new character sets and coding-systems, change existing input methods and create new ones, and make new classifications of characters. We believe that a multilingual environment should not only handle many languages but also be adaptable to new languages, and Mule is a realization of this belief.

Many people have helped us develop Mule through the computer networks. We wish to express our deep gratitude to everyone who has given us ideas for improvement, bug reports and bug fixes, and to everyone who maintains the computer networks.

Appendix A What is needed to operate Mule

Users need the following to run Mule as an efficient multilingual editor.

Appendix B Distribution and Mailing List

Mule is available by anonymous ftp from:

Access to etlport from outside of Japan is not recommended.

The main discussion of Mule takes place in Japanese on the newsgroup fj.editor.mule. For non-Japanese speakers, we are running a mailing list. In addition, for testing new versions of Mule on various platforms before the official release, we are running another mailing list. Please send subscription requests, specifying which mailing list you want to join.