Improve Finnish hyphenation #32

Open
akx wants to merge 1 commit into typst:main from akx:improve-finnish

akx commented Jun 3, 2026


This PR switches Lang::Finnish to hyph-fi-x-school patterns, replacing the Saarinen hyph-fi rules that haven't been updated in... a while. They're older than me, originally...

The school rules follow the simpler syllabification rules taught in Finnish schools and, crucially, include systematic vowel-sequence break rules (i1o, e1a, o1a, …) that the old set omitted.

This both improves accuracy for Finnish and shrinks the embedded trie. I originally noticed the problem with the word "äänioikeus", which was mishyphenated as ää-nioi-keus (ää-ni-oi-keus is correct).
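To illustrate why a vowel-sequence rule matters, here is a minimal sketch of Liang-style pattern matching in Python. It is not the code path this PR touches. Only `i1o` is taken from the description above; the patterns `1ni` and `1ke` are hypothetical stand-ins chosen so the example is self-contained, and the minimum of two letters on each side of a break is an assumption too.

```python
def parse_pattern(pattern):
    """Split a Liang pattern such as 'i1o' into letters 'io' and weights [0, 1, 0]."""
    letters = []
    weights = [0]
    for ch in pattern:
        if ch.isdigit():
            weights[-1] = int(ch)  # weight for the gap before the next letter
        else:
            letters.append(ch)
            weights.append(0)
    return "".join(letters), weights


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Apply Liang patterns to one word; odd weights mark allowed breaks."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."
    # scores[i] is the weight of the gap just before padded[i].
    scores = [0] * (len(padded) + 1)
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = table.get(padded[start:end])
            if weights:
                for offset, weight in enumerate(weights):
                    pos = start + offset
                    scores[pos] = max(scores[pos], weight)
    pieces = []
    last = 0
    # The gap before word[k] corresponds to scores[k + 1] because of the leading dot.
    for k in range(left_min, len(word) - right_min + 1):
        if scores[k + 1] % 2 == 1:
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return "-".join(pieces)


# Hypothetical toy pattern sets, NOT the contents of either real pattern file.
OLD_LIKE = ["1ni", "1ke"]
NEW_LIKE = OLD_LIKE + ["i1o"]

print(hyphenate("äänioikeus", OLD_LIKE))  # ää-nioi-keus
print(hyphenate("äänioikeus", NEW_LIKE))  # ää-ni-oi-keus
```

With no pattern covering the `io` sequence, nothing proposes a break inside "nioi", which is the shape of the mishyphenation described above.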

The new set was validated mechanically on a 109K word list against Voikko, a full morphological hyphenator for Finnish.

| metric | old (Saarinen) | new (school) |
|---|---|---|
| exact-match per word | 86.00% | 95.69% |
| recall | 95.15% | 98.67% |
| precision | 99.08% | 98.85% |
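For reference, here is a sketch of how metrics like these could be computed. The definitions are my assumption of the usual ones, not something the numbers above spell out: a word is an exact match when its break set equals the reference's, recall is the share of reference breaks that were found, and precision is the share of predicted breaks that are in the reference. The function names and the two-word sample are hypothetical.

```python
def break_positions(hyphenated):
    """Return the set of letter offsets at which a hyphenated form breaks."""
    positions = set()
    index = 0
    for ch in hyphenated:
        if ch == "-":
            positions.add(index)
        else:
            index += 1
    return positions


def score(predicted, reference):
    """Compare predicted hyphenations against reference ones, keyed by word."""
    exact = true_pos = n_pred = n_ref = 0
    for word, ref_form in reference.items():
        pred = break_positions(predicted[word])
        ref = break_positions(ref_form)
        exact += pred == ref
        true_pos += len(pred & ref)
        n_pred += len(pred)
        n_ref += len(ref)
    # Assumes at least one predicted and one reference break overall.
    return {
        "exact_match": exact / len(reference),
        "recall": true_pos / n_ref,
        "precision": true_pos / n_pred,
    }


# Tiny illustrative sample, not the real word list.
reference = {"äänioikeus": "ää-ni-oi-keus", "talo": "ta-lo"}
predicted = {"äänioikeus": "ää-nioi-keus", "talo": "ta-lo"}
print(score(predicted, reference))
# {'exact_match': 0.5, 'recall': 0.75, 'precision': 1.0}
```

Under these definitions a pattern set that misses breaks but never invents one keeps perfect precision while losing recall, which is the direction the old set's numbers lean.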

For this set, 7 540 native words become newly correct and 316 become newly wrong, which translates to a roughly 24:1 improvement.
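The ratio is plain division of the two counts above; a quick check, with variable names of my own choosing:

```python
newly_correct = 7540
newly_wrong = 316

ratio = newly_correct / newly_wrong
net_gain = newly_correct - newly_wrong

print(f"{ratio:.1f}:1, net {net_gain} words")  # 23.9:1, net 7224 words
```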

The residual errors can't be resolved by Liang patterns:

  • Finnish is full of compound words; you'd need to know that "aamunavaus" is the compound "aamun-avaus" to hyphenate it correctly at the compound boundary.
  • Some words in the test set are loanwords or foreign names, which would require linguistic analysis ("this word is actually not Finnish, hyphenate it differently").

The old Saarinen file had a bunch of hand-written patterns (see the comments in the file linked above) that technically help keep clusters together in certain internationalisms (which would have been common in 1980-1990 TeXnical writing). For general use, however, they break more than they fix (499 words fixed compared with Voikko, 1 976 words broken), so there's no use adding them back.

