script_detector

A simple utility library for working out which CJK script a string is written in. It extends String with three core methods:

chinese?

Returns true if the string contains Chinese characters and no Japanese or Korean characters

japanese?

Returns true if the string contains specifically Japanese (hiragana or katakana) characters

korean?

Returns true if the string contains specifically Korean (hangul) characters
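The three checks above can be approximated with Unicode script properties in Ruby regular expressions. The following is only a sketch of the semantics described here, not the library's source: the module name ScriptSketch, its method signatures and the exact patterns are assumptions made for illustration.

```ruby
# Hypothetical stand-in for the three core checks; not the library's code.
module ScriptSketch
  KANA   = /[\p{Hiragana}\p{Katakana}]/ # Japanese-specific syllabaries
  HANGUL = /\p{Hangul}/                 # Korean-specific script
  HAN    = /\p{Han}/                    # Han characters shared across CJK

  module_function

  # True if the text contains at least one hiragana or katakana character.
  def japanese?(text)
    text.match?(KANA)
  end

  # True if the text contains at least one hangul character.
  def korean?(text)
    text.match?(HANGUL)
  end

  # True if the text contains Han characters and nothing specific to
  # Japanese or Korean.
  def chinese?(text)
    text.match?(HAN) && !japanese?(text) && !korean?(text)
  end
end

puts ScriptSketch.japanese?("東京です") # true: です is hiragana
puts ScriptSketch.chinese?("東京です")  # false: kana rules out Chinese
puts ScriptSketch.korean?("한국어")     # true
```

Unlike the library, this sketch uses module functions rather than extending String, so it can be tried without changing a core class.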

Once a script has been identified as Chinese, two further methods are provided for determining the script subtype:

traditional_chinese?

Returns true if the string contains traditional Chinese characters (繁體字)

simplified_chinese?

Returns true if the string contains simplified Chinese characters (简体字)

There is also a helper method that combines these to produce human-readable output:

identify_script

Tries to detect the script and returns one of “Japanese”, “Korean”, “Traditional Chinese”, “Simplified Chinese”, “Ambiguous Chinese” or “Unknown”

Detection only works reliably on longer passages of text, since a single character, or even several characters, may be valid Japanese, traditional Chinese and simplified Chinese simultaneously. For example, the string 東京 (Tokyo) will return “false” for Japanese and “true” for traditional Chinese, since it contains no kana and both kanji are also valid traditional Chinese.

Details: unicode.org/faq/han_cjk.html#4
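To see why short strings are ambiguous, it can help to count how many characters of each script a string contains. The sketch below is a hypothetical diagnostic, not part of the library; the module name ScriptCount and its counts method are assumptions.

```ruby
# Hypothetical diagnostic helper; not part of script_detector.
module ScriptCount
  module_function

  # Count characters per script using Unicode script properties.
  def counts(text)
    {
      han:    text.scan(/\p{Han}/).size,
      kana:   text.scan(/[\p{Hiragana}\p{Katakana}]/).size,
      hangul: text.scan(/\p{Hangul}/).size
    }
  end
end

# Two Han characters and no kana: nothing here is specific to Japanese.
p ScriptCount.counts("東京")

# The same word inside a sentence: the kana make the language evident.
p ScriptCount.counts("東京に行きます")
```

The first call reports only Han characters, which is why a bare place name cannot be attributed to Japanese; the second adds four kana that settle the question.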

Example

> p string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false
> string.japanese?
=> false
> string.korean?
=> false
> string.identify_script
=> "Traditional Chinese"

Implementation

Ruby 1.9 Oniguruma regular expressions are used to determine which script is in use. The lists of simplified and traditional Chinese characters have been drawn from the Unihan database’s Unihan_Variants.txt data set, using the assumption that any character with a kTraditionalVariant is simplified and vice versa.
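The variant-list approach can be illustrated roughly as follows. This is a sketch under stated assumptions: the five character pairs are hand-picked samples rather than the lists generated from Unihan_Variants.txt, the module name VariantSketch is hypothetical, and the handling of strings that match both lists or neither is a guess at what “Ambiguous Chinese” might cover.

```ruby
# Hypothetical sketch; the real lists come from Unihan_Variants.txt.
module VariantSketch
  # Tiny hand-picked samples: each simplified character below pairs with
  # the traditional character at the same position.
  SIMPLIFIED  = Regexp.union("气垫满鳝东".chars)
  TRADITIONAL = Regexp.union("氣墊滿鱔東".chars)

  module_function

  # Classify text that is already known to be Chinese.
  def subtype(text)
    simplified  = text.match?(SIMPLIFIED)
    traditional = text.match?(TRADITIONAL)

    if traditional && !simplified
      "Traditional Chinese"
    elsif simplified && !traditional
      "Simplified Chinese"
    else
      # Assumption: both or neither list matched.
      "Ambiguous Chinese"
    end
  end
end

puts VariantSketch.subtype("我的氣墊船充滿了鱔魚") # Traditional Chinese
puts VariantSketch.subtype("我的气垫船充满了鳝鱼") # Simplified Chinese
puts VariantSketch.subtype("我的")                 # Ambiguous Chinese
```

Characters that are identical in both scripts, such as 我 and 的, appear in neither list, so a string made up only of them stays ambiguous.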

Contributing to script_detector

Copyright © 2012 Jani Patokallio. See LICENSE.txt for further details.