Cantonese and Jyutping

If you came here looking for a Jyutping romanization table in plain text format or an alternative database for ibus-table-jyutping, here they are.

  • Jyutping romanization table (extracted from PDFs provided by a HK government site)
  • ibus Jyutping database (for Ubuntu, first install ibus-table-jyutping, then overwrite the original jyutping.db at /usr/share/ibus-table/tables/jyutping.db and run LC_CTYPE="zh_HK.utf8" /usr/bin/ibus-daemon -r --daemonize --xim --replace). You may also modify this source and compile your own table.

That is it! But please read on if you are interested in the backstory.

Cantonese

Cantonese is a Chinese dialect widely spoken in the southern provinces of Guangdong and Guangxi. It is also the mother tongue of the Chinese population of Hong Kong and Macau and is widely used among overseas Chinese. It belongs to the Yue dialect group, which is the third most widely spoken dialect group in China, after Mandarin and Wu (which includes Shanghainese). However, it is probably the second most important Chinese dialect after Mandarin, since it is the only other dialect used as the primary language for official state functions (in Hong Kong and Macau).

Jyutping and Me

Unlike Mandarin, Cantonese has no single standard romanization system. The most common ones include Cantonese Pinyin, Yale romanization and Jyutping. Jyutping is actually a relatively new system, introduced by the Linguistic Society of Hong Kong (LSHK) in 1993.

This may sound weird to my mainland friends, but as kids in Hong Kong we never actually learned a romanization system in school. We just picked up the sounds character by character from parents and teachers. Languages were never my favorite subjects, and I had never thought of learning a romanization system for Cantonese. My real motivation for learning Jyutping was to use it as a Chinese input method. I had never been satisfied with the ways available to type my thoughts in my mother tongue. I learned Cangjie in high school. It is an input method introduced in Taiwan that is based on character "shape". It is ideal for typing once one gets very familiar with it. But it is awkward for chatting with friends online, since I need to translate my "sound bites" into "characters" and then type them. Therefore, I gradually switched to pinyin once I had learned some Mandarin. At least I could skip the audio-to-visual conversion step.

Pinyin is great when I chat with mainland friends online. But the problem comes when I chat with friends back home. To type in pinyin, I have to think in Mandarin, and so what I write sounds like I have a thick northern accent. After comparing many romanization systems, I made up my mind to learn Jyutping about two years ago. I have been hopeless at languages my whole life, and it would not have been easy had I not found a great tool named HanConv by Aaron Chan. It is a great piece of software for converting Chinese text from one form to another (traditional to simplified, for example). But the function I was interested in was extracting the Jyutping romanization from text. So to learn the romanization, I simply converted some newspaper articles to Jyutping and typed in all the articles character by character through a Jyutping IM. This may not be the most efficient way to learn Jyutping, but it was fast enough for me. While I still occasionally get stuck on some characters, as a native Cantonese speaker I became more or less "fluent" in Jyutping after a week or two of practice.

Jyutping Input Method (IM)

There are quite a few Jyutping IMs out there, but I have only tried some of them. For Windows, I used CPIME and it is quite good. For Android, I tried MultiLing Keyboard by Honso and Jyutping Keyboard by Miles Leung. MultiLing Keyboard is probably the best Jyutping IM for Android despite occasional hiccups. Jyutping Keyboard is very good too, even though it only allows character-by-character input and doesn't update its statistics from user input.

ibus-table-jyutping

As I am mostly a Linux user now, I was quite disappointed with the Jyutping support in Linux. There is a Jyutping table for ibus, but its frequencies appear to be quite a bit off. When I type in a romanization, some very rare characters (that I don't even know how to pronounce) often come up as the top choices. Luckily, ibus-table offers a simple interface for building your own tables. Following what Cantoinput did, I used the character frequency statistics gathered by Shih-Kun Huang at National Chiao Tung University in Taiwan.

The only problem left was to gather the romanization for each character. My initial choice was HanConv. Unfortunately, it is not complete and it has some weird mistakes in the pronunciations of some characters. Therefore, I dropped an email to the great folks at LSHK, thinking that they might know of some kind of Jyutping database. However, while the HK government maintains some nice PDF files listing the Jyutping romanization for all characters, it appears that no such database is available.

The first thing that came to my mind was pdftotext. However, the romanizations end up shifted out of place in the pdftotext output, which makes the whole conversion useless. Since the table in the PDF files is very regularly structured, it is not a very hard OCR problem. I am not an expert on OCR, but at least I know some pattern recognition and image processing. So I decided to build a simple OCR program just for these PDFs. What I used is just naive Bayes and belief propagation. I can't claim that the extracted table is error free, but it appears to be quite accurate. If you find any errors in the extracted table, please contact me.
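For readers curious about how the frequency statistics fix the ordering problem, here is a minimal Python sketch of the idea. It is not the script I actually used: the function name build_candidates, the dictionary layout and the counts below are all made up for illustration. It simply groups characters by syllable and puts the most frequent ones first.

```python
from collections import defaultdict


def build_candidates(readings, freq):
    """Map each Jyutping syllable to its characters, most frequent first.

    readings: dict mapping a character to a list of Jyutping syllables
    freq:     dict mapping a character to its corpus count
    (Both structures are hypothetical, chosen only for this sketch.)
    """
    table = defaultdict(list)
    for char, syllables in readings.items():
        for syl in syllables:
            table[syl].append(char)
    for syl in table:
        # Descending frequency; characters missing from the statistics
        # count as 0 and sink to the bottom. Ties fall back to code
        # point order so the result is deterministic.
        table[syl].sort(key=lambda c: (-freq.get(c, 0), c))
    return dict(table)


# Toy data: the counts are invented, not taken from the real statistics.
readings = {
    "我": ["ngo5"],
    "餓": ["ngo6"],
    "臥": ["ngo6"],
    "行": ["hang4", "hong4"],
}
freq = {"我": 9000, "餓": 120, "臥": 300, "行": 5000}

candidates = build_candidates(readings, freq)
for syl in sorted(candidates):
    print(syl, "".join(candidates[syl]))
```

A character with several readings, like 行 here, shows up under each of its syllables. A real table would also need the ibus-table header and source format, which this sketch leaves out.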