Improve Finnish hyphenation #32
Open
akx wants to merge 1 commit into
This PR switches `Lang::Finnish` to `hyph-fi-x-school` patterns, replacing the Saarinen `hyph-fi` rules that haven't been updated in... a while. They're older than me, originally...

The school rules follow the simpler syllabification rules taught in Finnish schools and, crucially, include systematic vowel-sequence break rules (`i1o`, `e1a`, `o1a`, …) that the old set omitted.

This both improves accuracy for Finnish and shrinks the embedded trie. I originally noticed this with the word "äänioikeus", which was mishyphenated as `ää-nioi-keus` (when `ää-ni-oi-keus` is correct).

The new set was validated mechanically against Voikko, a full morphological hyphenator for Finnish, using a 109K-word list.
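To illustrate why a vowel-sequence rule such as `i1o` matters, here is a minimal Python sketch of how Liang patterns are applied. It is not this crate's implementation, and the patterns `1ni` and `1ke` are hypothetical stand-ins picked for the example; only `i1o` is taken from the description above.

```python
def parse_pattern(pattern):
    """Split a Liang pattern like 'i1o' into letters 'io' and gap weights [0, 1, 0]."""
    letters = []
    weights = [0]  # weights[j] is the gap before letter j; the last entry is the trailing gap
    for ch in pattern:
        if ch.isdigit():
            weights[-1] = int(ch)
        else:
            letters.append(ch)
            weights.append(0)
    return "".join(letters), weights


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Apply Liang patterns to a word: the highest weight per gap wins, odd means break."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[k] is the gap before padded[k]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = table.get(padded[start:end])
            if weights:
                for offset, weight in enumerate(weights):
                    scores[start + offset] = max(scores[start + offset], weight)
    pieces, last = [], 0
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:  # the gap before word[i] sits at padded index i + 1
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)


# Toy pattern sets: assumptions for illustration, not the contents of either pattern file.
without_vowel_rule = ["1ni", "1ke"]
with_vowel_rule = ["1ni", "i1o", "1ke"]

print(hyphenate("äänioikeus", without_vowel_rule))  # ää-nioi-keus
print(hyphenate("äänioikeus", with_vowel_rule))     # ää-ni-oi-keus
```

With the toy sets above, the only difference between the two outputs is the break that `i1o` inserts between `i` and `o`.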
For this set, 7 540 native words become newly correct and 316 become newly wrong, which translates to a roughly 24:1 improvement.
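For reference, the bookkeeping behind those figures can be sketched roughly as follows. This is a hedged Python outline, not the script actually used: `tally` and its callable arguments are hypothetical names, and the tiny dictionaries stand in for the old patterns, the new patterns, and the Voikko reference.

```python
def tally(words, old, new, reference):
    """Count words whose hyphenation becomes newly correct or newly wrong.

    `old`, `new`, and `reference` are callables mapping a word to its
    hyphenated form; all three are placeholders for this sketch.
    """
    newly_correct = newly_wrong = 0
    for word in words:
        expected = reference(word)
        was_ok = old(word) == expected
        is_ok = new(word) == expected
        if is_ok and not was_ok:
            newly_correct += 1
        elif was_ok and not is_ok:
            newly_wrong += 1
    return newly_correct, newly_wrong


# Made-up toy data, only to show the shape of the comparison.
reference = {"aa": "a-a", "bb": "b-b", "cc": "c-c"}
old = {"aa": "a-a", "bb": "bb", "cc": "c-c"}
new = {"aa": "a-a", "bb": "b-b", "cc": "cc"}
print(tally(reference, old.get, new.get, reference.get))  # (1, 1)

# The figures reported in this PR: 7 540 newly correct, 316 newly wrong.
print(round(7540 / 316, 1))  # 23.9
```

Words that are right or wrong under both pattern sets are deliberately not counted, since they do not change the comparison.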
The residual errors can't be resolved by Liang patterns:
The old Saarinen file had a bunch of hand-written patterns (see the comments in the file linked above) that technically help keep clusters together in certain internationalisms (which would have been common in 1980-1990 TeXnical writing). However, for general use, they break more than they fix (499 words fixed vs. 1 976 words broken, measured against Voikko). So, no use adding them back.