
Furigana in SwiftUI (4)

This is part 4. The previous episode is here.

To quickly recap, we are now able to lay out text that looks like this:

Hello aka Konnichi wa aka 今日は aka こんにちは
こんにち is furigana that tells readers how to pronounce 今日

We accomplish this by passing an array of (mainText, furigana) pairs into our container view.
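As a refresher, the input to that view is roughly shaped like this (the names here are placeholders rather than the app’s actual types):

struct FuriganaPair {
    let mainText: String    // e.g. "今日"
    let furigana: String?   // e.g. "こんにち"; nil for plain kana like "は"
}

let konnichiWa: [FuriganaPair] = [
    FuriganaPair(mainText: "今日", furigana: "こんにち"),
    FuriganaPair(mainText: "は", furigana: nil),
]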

But how can we generate this array? For each entry in our strings files we typically have a romaji version and a kana/kanji version. For example, “To go” has a romaji version, “iku”, and a kana/kanji version, “行く”.

I ended up building a two-step process:

  1. convert the romaji to hiragana
  2. ‘line up’ the hiragana-only string with the kana/kanji string and infer which hiragana represent the kanji

Aside: I originally imagined a markdown scheme to represent which furigana would decorate which text. Something like this: [[今日((こんにち))]][[は]]

I eventually realized this markdown format wasn’t adding any value, so instead I generate the (hiragana, kana/kanji) pairs directly from the strings-file inputs.

Romaji to Hiragana

In an acronym: TDD. Test-driven development was essential to accomplishing this conversion. One of the bigger challenges is that converting from romaji to hiragana is ambiguous. Ō can mean おお or おう. Ji can mean じ or ぢ. Is tenin てにん or てんいん?

I started using the following process:

  1. start with the last character, and iterate through to the first
  2. at each character, prepend it to any previously unused characters
  3. determine if this updated string of characters maps to a valid hiragana
  4. if not, assume the previous string of characters did map to a valid hiragana and add it to the final result
  5. remove the used characters from the string of unused characters
  6. go back to step 2 and grab the ‘next’ character

Consider the following example: ikimasu

  1. does u have a hiragana equivalent? yup: う
  2. grab another character, s. does su have a hiragana? yup: す
  3. grab another character, a. does asu have a hiragana? nope
  4. add す to our result string, and remove su from our working value
  5. does a have a hiragana? yup: あ
  6. grab the next romaji character, m. does ma have a hiragana? yup: ま
  7. etc.
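Here is a rough Swift sketch of that backward scan. The romajiToHiragana table is a stand-in for my real lookup dictionary and only contains the entries this example needs:

let romajiToHiragana: [String: String] = [
    "a": "あ", "i": "い", "u": "う", "ki": "き", "ma": "ま", "su": "す",
]

func hiragana(fromRomaji romaji: String) -> String {
    var result = ""
    var pending = ""                          // unused romaji characters so far
    for character in romaji.lowercased().reversed() {
        let candidate = String(character) + pending
        if romajiToHiragana[candidate] != nil {
            pending = candidate               // still a valid syllable; keep growing it
        } else if let kana = romajiToHiragana[pending] {
            result = kana + result            // the previous candidate was the syllable
            pending = String(character)
        } else {
            result = pending + result         // unknown input; keep it verbatim
            pending = String(character)
        }
    }
    // flush whatever is left at the front of the string
    result = (romajiToHiragana[pending] ?? pending) + result
    return result
}

// hiragana(fromRomaji: "ikimasu") == "いきます"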

There were other wrinkles that came up right away. They included:

  • how to handle digraphs like kya, sho, chu (きゃ, しょ, ちゅ)
  • handling doubled consonants, like the double k in kekkon, with っ
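Both wrinkles fit into the same lookup-table approach, at least in sketch form: digraphs simply become multi-character keys in the dictionary, and a doubled consonant can be rewritten as っ before the lookup runs (a doubled n is a separate case, since it usually signals ん):

let digraphs: [String: String] = ["kya": "きゃ", "sho": "しょ", "chu": "ちゅ"]

func replacingDoubledConsonants(in romaji: String) -> String {
    // "kekkon" -> "keっkon"; the remaining syllables convert as usual
    var result = ""
    var previous: Character?
    for character in romaji {
        if previous == character, !"aeioun".contains(character) {
            result.removeLast()
            result.append("っ")
        }
        result.append(character)
        previous = character
    }
    return result
}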

Handling these wrinkles often forced me to refactor my algorithm. But thanks to my ever-growing collection of TDD test cases, I could instantly see if my courageous changes broke something. I was able to refactor mercilessly, which was very freeing.

Writing this, I pictured a different algorithm where step 1 is breaking the string into substrings that each end in a vowel. Then each substring could probably be converted directly using my romaji -> hiragana dictionary. This might be easier to read and maintain. Hmm…
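For what it’s worth, that vowel-splitting idea might look something like this, reusing the hypothetical romajiToHiragana table from the earlier sketch:

func vowelChunks(_ romaji: String) -> [String] {
    var chunks: [String] = []
    var current = ""
    for character in romaji {
        current.append(character)
        if "aeiou".contains(character) {
            chunks.append(current)
            current = ""
        }
    }
    if !current.isEmpty { chunks.append(current) }    // a trailing "n", for example
    return chunks
}

// vowelChunks("ikimasu") == ["i", "ki", "ma", "su"]
// Each chunk could then be looked up directly in romajiToHiragana.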

Furigana-ify my text

This felt like one of those tasks that is easy for humans to do visually, but hard to solve with a program.

When we see:

みせ に  い きます

and:

店 に  行 きます

humans are pretty good at identifying which chunks of hiragana represent the kanji below. In the happy path, it’s fairly easy to iterate through the two strings and generate the (furigana, mainText) pairs.
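Here is roughly what that happy-path walk can look like, reusing the FuriganaPair type from the recap at the top and the isKanji helper that shows up later in this post. It assumes the reading and the kana/kanji string genuinely line up, which, as we’re about to see, isn’t always true:

func furiganaPairs(reading: String, kanaKanji: String) -> [FuriganaPair] {
    let readingChars = Array(reading)
    let mainChars = Array(kanaKanji)
    var result: [FuriganaPair] = []
    var i = 0                                       // position in the reading
    var j = 0                                       // position in the kana/kanji text

    while j < mainChars.count {
        if mainChars[j].isKanji {
            // collect the whole run of consecutive kanji
            var kanjiRun = ""
            while j < mainChars.count, mainChars[j].isKanji {
                kanjiRun.append(mainChars[j])
                j += 1
            }
            // its furigana is every reading character up to the next matching kana
            var furigana = ""
            while i < readingChars.count,
                  j == mainChars.count || readingChars[i] != mainChars[j] {
                furigana.append(readingChars[i])
                i += 1
            }
            result.append(FuriganaPair(mainText: kanjiRun, furigana: furigana))
        } else {
            // plain kana matches the reading one character at a time
            var kanaRun = ""
            while j < mainChars.count, !mainChars[j].isKanji {
                kanaRun.append(mainChars[j])
                j += 1
                i += 1
            }
            result.append(FuriganaPair(mainText: kanaRun, furigana: nil))
        }
    }
    return result
}

// furiganaPairs(reading: "みせにいきます", kanaKanji: "店に行きます")
// produces: 店→みせ, に (no furigana), 行→い, きます (no furigana)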

But sadly my input data was not free of errors. There were cases where my furigana didn’t match my romaji. Also, some strings included information in brackets, e.g. some languages have masculine and feminine versions of adjectives. So if a user was going from Japanese to Croatian, the Japanese string would need to include gender: the romaji might be Takai (M) and the kana/kanji version would be 高い (男).

Sometimes this meant cleaning up the input data. Sometimes it meant tweaking the romaji to hiragana conversion. Sometimes it meant tweaking the furigana generation process. In all cases, thanks to my TDD mindset, it meant creating at least one new test case. I loved the fact that I was able to refactor mercilessly and be confident I wasn’t creating any regressions.
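As an illustration, a test along these lines might look like the following (it exercises the hypothetical sketches from earlier in this post, not the app’s actual API):

import XCTest

final class RomajiToHiraganaTests: XCTestCase {
    func testSimpleSyllables() {
        XCTAssertEqual(hiragana(fromRomaji: "ikimasu"), "いきます")
    }

    func testDoubledConsonant() {
        XCTAssertEqual(replacingDoubledConsonants(in: "kekkon"), "keっkon")
    }
}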

This post has been more hand-wavy than code-heavy, but I did come across one piece of code I want to share here.

extension Character {
    var isKanji: Bool {
    // HOW???
    }
}

For better or worse, the answer required some Unicode kookiness…

extension Character {
    var isKanji: Bool {
        // true when the character belongs to the Unicode Han script
        let result = try? /\p{Script=Han}/.firstMatch(in: String(self))
        return result != nil
    }
}
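A couple of quick sanity checks:

Character("今").isKanji    // true
Character("こ").isKanji    // false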

Implementing similar functionality on String is left as an exercise for the reader.

Alternatively, isHiragana is a more contained problem to solve:

extension Character {
    var isHiragana: Bool {
        // あ (U+3042) through ん (U+3093)
        ("あ"..."ん").contains(self)
    }
}
