Mojibake: The unknown, very common problem of East Asian text input

Published in

Prototypr

6 min readJan 15, 2018

Typing is hard.

The internet is still largely based on English. Web-addresses are written with the English alphabet, websites often have an English equivalent, and English remains the most used language on the internet. On the flip-side, according to Nielsen, nearly half of internet users in 2017 are from Asia (the runner up is Europe, with only 17%).

With internet adoption continuing to explode in Asia, standard internet usage in East Asian countries requires constant switching between languages and input methods via a language selection menu or a keyboard shortcut. Despite decades of work in input methods, this continual code-switching still often results in production errors. For example, a user may try to type an English string using a Japanese input method. The result is gibberish, or, as commonly termed in Japanese, “mojibake” (文字化け).

Text Input in Japanese

Of the East Asian languages, Japanese computer users generate text in the most varied number of scripts. Japanese text input is also relatively quick. This expedience makes production errors in Japanese text generation particularly halting and difficult to recover from.

Common Japanese text generation generally uses three scripts. These scripts include: hiragana (commonly used for native Japanese words), katakana (commonly used for loan words), and kanji (adopted Chinese characters). In modern Japanese, Latin script (romaji) is often commonly used for words that have not been widely transliterated to katakana, or just for stylistic effect.

The phrase “mojibake” written in hiragana, katakana, kanji (final character is hiragana), and romaji, respectively.

To type a string, most Japanese users input in hiragana and/or romaji, and then transliterate the text to the appropriate script. Kanji is a bit of an exception — it is generated in a method akin to Chinese Pinyin input, and is transliterated from the users’ hiragana text input via a completion list.

Input method toggling can be done in several ways, depending on the operating system. Windows provides the user with a “Language Bar” which allows the user to choose a desired input method. Many laptops, tablets, and phones also come with dedicated language switching keys. Despite the ease of language toggling, production errors due to script-mismatch are still prevalent in Japanese text generation.

GOMS Task Analysis of Japanese Input

The possible causes for production errors are obviated by doing a GOMS analysis of Japanese text generation. Disclaimer: things get nerdy starting here.

The ultimate goal of text generation is to create a user-readable string in the correct script and encoding. The sub-goals for this task fall into two different categories: input method selection and text entry. Operators range from finding the language selection UI to typing in characters. Input method selection and text entry alternate constantly as a user tries to input text. Text entry consists the bulk of the operators present in the task of Japanese text input. Only when a different input method is needed does the task cycle loop back to input method selection. Below is a sample GOMS task analysis of Japanese text generation with multiple input methods on Windows 10:

Production Errors in Internet Usage

Production errors due to script-mismatch can hinder even the most basic actions when using a web browser. The most common errors occur generally, in all contexts of communication. There are also more specific instances, such as typing in a URL. Since web addresses are usually in Latin script, it is necessary to switch from hiragana or katakana to an alphanumeric input method before typing a URL. In the common case that this switch is forgotten, the user creates mojibake such as the example, which, of course, does not make for a valid URL.

Failure to switch away from hiragana before typing a URL often results in mojibake.

Such difficulties are also commonplace with chatting or when using a messenging client. Japanese internet users have developed a tremendous and highly varied internet jargon, with abbreviations, emoji, and intentional misspellings. Many of these character sequences have been stored into the muscle memory of Japanese internet users, and accordingly they are not typed character by character, but as a single utterance.

Hence, when the input language is incorrect, users can type entire phrases in the wrong language or the wrong script, which often proves bothersome. For example, internet searches and, of course, URL entry are often conducted in English due to the prevalence of English-based websites. The user then identifies the internet usage session with English-based content, creating a false, over-generalized cognitive map. As a result, when chatting in Japanese, the user might generate text in alphanumeric characters instead, resulting in mojibake.

These difficulties in text generation often prove to be very halting, irritating and time consuming. And while some attempts have been made to correct common errors and specialized cases, production errors are prevalent even for the most skilled computer users.

Mitigations

Modern operating systems and web browsers provide rudimentary mojibake mitigations. For example, Windows 10 will code switch when the user types “www” using an improper input method. Similar mitigations are available for other commonly used strings such as “ftp:”, “http”, “c:/”, etc.

While these string substitution methods are helpful, they are not a cure-all. URL entry errors still remain a very common production error in text generation. For example, simply entering a URL without a “www” subdomain will result in a production error. Furthermore, these rules are applied inconsistently across operating systems and browsers. There is always uncertainty if the user can expect to simply just type in the address bar.

We can do better by applying some design thinking to the mojibake input problem. Here are some ideas:

Better system status. UIs should clearly reflect what input method a person is using. This will allow the user to code switch before text generation begins.
Intelligent code switching. After text is generated, the system should match generated text against common constructions and attempt to correct mojibake. For example, if the user types “Google。cおm” into an address bar, this string can be transliterated to “google.com”.
Better correction.􏰀 Users should be able to switch how a string would be displayed by toggling the input method after text generation.
􏰀Customization. Users should be able to edit all aspects of the input methods and correction rules to maximize efficiency.
Machine-learned correction. If a user fails to navigate to “Apple。cおm” multiple times because it doesn’t exist, the system should learn that this URL is not valid and correct for the user.

Doing Better

Text generation in East Asian scripts is already difficult, and this process is only made more exhausting by constant code-switching. As demonstrated in the task analysis, generating text in Japanese requires an entire separate set of operators before a single character can be written. Production errors due to input method incorrectness are very common due to negative feedback and simple forgetfulness. Thankfully, this problem is relatively solvable with modern technologies and machine learning. The hardest part is realizing there’s an issue in the first place.

Prototypr

Mojibake: The unknown, very common problem of East Asian text input

Text Input in Japanese

GOMS Task Analysis of Japanese Input

Production Errors in Internet Usage

Mitigations

Doing Better

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Published in Prototypr

Written by Alan Sien Wei Hshieh

No responses yet