Support Unicode script extensions #155

@747

Is there any plan to support the Script_Extensions (scx) property, which allows a character to belong to more than one script?
It is already available in many dynamic languages such as Perl, PHP, Python, and (recently) JavaScript, and it would greatly improve usefulness on real-world text.

For example, in JavaScript since ES2018:

// match by script (= Ruby /[\p{Hani}\p{Hira}\p{Kana}]+/)
"ア行〜タ行のデータ".match(/[\p{sc=Hani}\p{sc=Hira}\p{sc=Kana}]+/gu);
// => [ "ア行", "タ行のデ", "タ" ]

// match by script_extensions
"ア行〜タ行のデータ".match(/[\p{scx=Hani}\p{scx=Hira}\p{scx=Kana}]+/gu);
// => [ "ア行〜タ行のデータ" ]
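The difference comes from characters such as 〜 (U+301C) and ー (U+30FC), whose Script value is Common but whose Script_Extensions include Hiragana. A minimal sketch that inspects single characters, assuming an ES2018-capable engine; `describeScripts` is a hypothetical helper name, not part of any library:

```javascript
// Hypothetical helper: report how one character is classified by
// the Script (sc) and Script_Extensions (scx) properties.
function describeScripts(ch) {
  return {
    scCommon: /^\p{sc=Common}$/u.test(ch), // Script is Common
    scHira: /^\p{sc=Hira}$/u.test(ch),     // Script is Hiragana
    scxHira: /^\p{scx=Hira}$/u.test(ch),   // Script_Extensions include Hiragana
  };
}

for (const ch of ["〜", "ー", "の"]) {
  console.log(ch, describeScripts(ch));
}
```

Here 〜 and ー should report `scCommon: true`, `scHira: false`, `scxHira: true`, which is why the `sc`-based pattern splits the string at those characters.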

While it is not a silver bullet, given the complications of Unicode, it would catch most of the common pitfalls in Unicode script matching. Manually reproducing the equivalent of the scx property with the plain script property often results in a non-trivial expression.

# implement \p{scx=Hira} equivalent
/[\p{Hira}、-〃〈-】〓-〟〰-〵〷〼〽\u3099-゜゠・ー﹅﹆｡-･ｰﾞﾟ]/
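One way to check a hand-written class like this is to compare it against `\p{scx=Hira}` code point by code point in an engine that already supports scx. The following is only a sketch under that assumption; `diffClasses` is a hypothetical helper, and the scanned range is chosen arbitrarily for illustration:

```javascript
// Hypothetical helper: list code points in [from, to] where two
// single-character regexes disagree. Both regexes must not use the
// g or y flag, so that test() is stateless.
function diffClasses(a, b, from, to) {
  const out = [];
  for (let cp = from; cp <= to; cp++) {
    const ch = String.fromCodePoint(cp);
    if (a.test(ch) !== b.test(ch)) out.push(cp);
  }
  return out;
}

// Compare the plain Script property with Script_Extensions over the
// CJK Symbols and Punctuation, Hiragana, and Katakana blocks.
const mismatches = diffClasses(/^\p{sc=Hira}$/u, /^\p{scx=Hira}$/u, 0x3000, 0x30ff);
console.log(mismatches.map((cp) => "U+" + cp.toString(16).toUpperCase()));
```

Every code point printed is one that a manual class would have to list explicitly, and the list can change between Unicode versions, which is the maintenance burden a built-in scx property would remove.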

Apologies if this has already been discussed somewhere; I couldn't find a relevant issue in this repository.
