
Furigana in SwiftUI (4)

This is part 4. The previous episode is here.

To quickly recap, we are now able to lay out text that looks like this:

Hello aka Konnichi wa aka 今日は aka こんにちは
こんにち is furigana that tells readers how to pronounce 今日

We accomplish this by passing an array of (mainText, furigana) pairs into our container view.
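As a refresher, the input to that view is roughly shaped like this (the names here are placeholders rather than the app’s actual types):

struct FuriganaPair {
    let mainText: String    // e.g. "今日"
    let furigana: String?   // e.g. "こんにち"; nil for plain kana like "は"
}

let konnichiWa: [FuriganaPair] = [
    FuriganaPair(mainText: "今日", furigana: "こんにち"),
    FuriganaPair(mainText: "は", furigana: nil),
]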

But how can we generate this array? For each entry in our strings files we typically have a romaji version and a kana/kanji version. For example, “To go” has a romaji version, “iku”, and a kana/kanji version, “行く”.

I ended up building a two-step process:

  1. convert the romaji to hiragana
  2. ‘line up’ the hiragana-only string with the kana/kanji string and infer which hiragana represent the kanji

Aside: I originally imagined a markdown scheme to represent which furigana would decorate which text. Something like this: [[今日((こんにち))]][[は]]

I eventually realized this markdown format wasn’t adding any value, so instead I generate the (hiragana, kana/kanji) pairs directly from the strings-file inputs.

Romaji to Hiragana

In an acronym: TDD. Test-driven development was essential to accomplishing this conversion. One of the bigger challenges is that converting from romaji to hiragana is ambiguous. Ō can mean おお or おう. Ji can mean じ or ぢ. Is tenin てにん or てんいん?

I started using the following process:

  1. start with the last character, and iterate through to the first
  2. at each character, prepend it to any previously unused characters
  3. determine if this updated string of characters maps to a valid hiragana
  4. if not, assume the previous string of characters did map to a valid hiragana and add it to the final result
  5. remove the used characters from the string of unused characters
  6. go back to step 2 and grab the ‘next’ character

Consider the following example: ikimasu

  1. does u have a hiragana equivalent? yup: う
  2. grab another character, s. does su have a hiragana? yup: す
  3. grab another character, a. does asu have a hiragana? nope
  4. add す to our result string, and remove su from our working value
  5. does a have a hiragana? yup: あ
  6. grab the next romaji character, m. does ma have a hiragana? yup: ま
  7. etc.
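Here is a rough Swift sketch of that backward scan. The romajiToHiragana table is a stand-in for my real lookup dictionary and only contains the entries this example needs:

let romajiToHiragana: [String: String] = [
    "a": "あ", "i": "い", "u": "う", "ki": "き", "ma": "ま", "su": "す",
]

func hiragana(fromRomaji romaji: String) -> String {
    var result = ""
    var pending = ""                          // unused romaji characters so far
    for character in romaji.lowercased().reversed() {
        let candidate = String(character) + pending
        if romajiToHiragana[candidate] != nil {
            pending = candidate               // still a valid syllable; keep growing it
        } else if let kana = romajiToHiragana[pending] {
            result = kana + result            // the previous candidate was the syllable
            pending = String(character)
        } else {
            result = pending + result         // unknown input; keep it verbatim
            pending = String(character)
        }
    }
    // flush whatever is left at the front of the string
    result = (romajiToHiragana[pending] ?? pending) + result
    return result
}

// hiragana(fromRomaji: "ikimasu") == "いきます"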

There were other wrinkles that came up right away. They included:

  • how to handle digraphs like kya, sho, chu (きゃ, しょ, ちゅ)
  • handling doubled consonants, like the double k in kekkon, with っ
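Both wrinkles fit into the same lookup-table approach, at least in sketch form: digraphs simply become multi-character keys in the dictionary, and a doubled consonant can be rewritten as っ before the lookup runs (a doubled n is a separate case, since it usually signals ん):

let digraphs: [String: String] = ["kya": "きゃ", "sho": "しょ", "chu": "ちゅ"]

func replacingDoubledConsonants(in romaji: String) -> String {
    // "kekkon" -> "keっkon"; the remaining syllables convert as usual
    var result = ""
    var previous: Character?
    for character in romaji {
        if previous == character, !"aeioun".contains(character) {
            result.removeLast()
            result.append("っ")
        }
        result.append(character)
        previous = character
    }
    return result
}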

Handling these wrinkles often forced me to refactor my algorithm. But thanks to my ever-growing collection of TDD test cases, I could instantly see if my courageous changes broke something. I was able to refactor mercilessly, which was very freeing.

Writing this, I pictured a different algorithm where step 1 is breaking the string into substrings that each end in a vowel. Then each substring could probably be converted directly using my romaji -> hiragana dictionary. This might be easier to read and maintain. Hmm…
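For what it’s worth, that vowel-splitting idea might look something like this, reusing the hypothetical romajiToHiragana table from the earlier sketch:

func vowelChunks(_ romaji: String) -> [String] {
    var chunks: [String] = []
    var current = ""
    for character in romaji {
        current.append(character)
        if "aeiou".contains(character) {
            chunks.append(current)
            current = ""
        }
    }
    if !current.isEmpty { chunks.append(current) }    // a trailing "n", for example
    return chunks
}

// vowelChunks("ikimasu") == ["i", "ki", "ma", "su"]
// Each chunk could then be looked up directly in romajiToHiragana.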

Furigana-ify my text

This felt like one of those tasks that is easy for humans to do visually, but hard to solve with a program.

When we see:

みせ に  い きます

and:

店 に  行 きます

humans are pretty good at identifying which chunks of hiragana represent the kanji below. In the happy path, it’s fairly easy to iterate through the two strings and generate the (furigana, mainText) pairs.
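Here is roughly what that happy-path walk can look like, reusing the FuriganaPair type from the recap at the top and the isKanji helper that shows up later in this post. It assumes the reading and the kana/kanji string genuinely line up, which, as we’re about to see, isn’t always true:

func furiganaPairs(reading: String, kanaKanji: String) -> [FuriganaPair] {
    let readingChars = Array(reading)
    let mainChars = Array(kanaKanji)
    var result: [FuriganaPair] = []
    var i = 0                                       // position in the reading
    var j = 0                                       // position in the kana/kanji text

    while j < mainChars.count {
        if mainChars[j].isKanji {
            // collect the whole run of consecutive kanji
            var kanjiRun = ""
            while j < mainChars.count, mainChars[j].isKanji {
                kanjiRun.append(mainChars[j])
                j += 1
            }
            // its furigana is every reading character up to the next matching kana
            var furigana = ""
            while i < readingChars.count,
                  j == mainChars.count || readingChars[i] != mainChars[j] {
                furigana.append(readingChars[i])
                i += 1
            }
            result.append(FuriganaPair(mainText: kanjiRun, furigana: furigana))
        } else {
            // plain kana matches the reading one character at a time
            var kanaRun = ""
            while j < mainChars.count, !mainChars[j].isKanji {
                kanaRun.append(mainChars[j])
                j += 1
                i += 1
            }
            result.append(FuriganaPair(mainText: kanaRun, furigana: nil))
        }
    }
    return result
}

// furiganaPairs(reading: "みせにいきます", kanaKanji: "店に行きます")
// produces: 店→みせ, に (no furigana), 行→い, きます (no furigana)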

But sadly my input data was not free of errors. There were cases where my furigana didn’t match my romaji. Also, some strings included information in brackets, e.g. some languages have masculine and feminine versions of adjectives. So if a user was going from Japanese to Croatian, the Japanese string would need to include gender: the romaji might be Takai (M) and the kana/kanji version would be 高い (男).

Sometimes this meant cleaning up the input data. Sometimes it meant tweaking the romaji to hiragana conversion. Sometimes it meant tweaking the furigana generation process. In all cases, thanks to my TDD mindset, it meant creating at least one new test case. I loved the fact that I was able to refactor mercilessly and be confident I wasn’t creating any regressions.
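As an illustration, a test along these lines might look like the following (it exercises the hypothetical sketches from earlier in this post, not the app’s actual API):

import XCTest

final class RomajiToHiraganaTests: XCTestCase {
    func testSimpleSyllables() {
        XCTAssertEqual(hiragana(fromRomaji: "ikimasu"), "いきます")
    }

    func testDoubledConsonant() {
        XCTAssertEqual(replacingDoubledConsonants(in: "kekkon"), "keっkon")
    }
}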

This post has been more hand-wavy than code-heavy, but I did come across one piece of code I want to share here.

extension Character {
    var isKanji: Bool {
    // HOW???
    }
}

For better or worse, the answer required some Unicode kookiness…

extension Character {
    var isKanji: Bool {
        // true when the character belongs to the Unicode Han script
        let result = try? /\p{Script=Han}/.firstMatch(in: String(self))
        return result != nil
    }
}
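A couple of quick sanity checks:

Character("今").isKanji    // true
Character("こ").isKanji    // false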

Implementing similar functionality on String is left as an exercise for the reader.

Alternatively, isHiragana is a more contained problem to solve:

extension Character {
    var isHiragana: Bool {
        // あ (U+3042) through ん (U+3093)
        ("あ"..."ん").contains(self)
    }
}
