Tokenizer results in blank token for extended UTF-8 characters #48

@AngledLuffa

Description

There is a question-mark-like character in one of the Universal Dependencies datasets which gets wiped out by the tokenizer for the Italian BERT & Electra models:

https://github.com/UniversalDependencies/UD_Italian-PoSTWITA

warning: big file
https://raw.githubusercontent.com/UniversalDependencies/UD_Italian-PoSTWITA/master/it_postwita-ud-train.conllu

search for "ewww" in the training file

It looks like this if I copy and paste it:

ewww 󾓺 — in viaggio Roma

according to emacs describe-char, it is character 0xFE4FA
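For anyone reproducing this, the same codepoint can be confirmed from Python as well (a quick standalone check, not tied to any tokenizer):

```python
import unicodedata

ch = "\U000FE4FA"  # the character emacs describe-char reports as 0xFE4FA
print(hex(ord(ch)))              # 0xfe4fa
print(unicodedata.category(ch))  # Co (private use, Plane 15)
```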

Anyway, hopefully that's enough background to figure out which character is causing the problem. If I run the following sentences through the tokenizer with tokenizer.tokenize(sentence) I get the following:

ewww 🐈 — in viaggio Roma   # another random character
ewww 󾓺 — in viaggio Roma    # to test, maybe need to check that this is the weird character, not just a box
ewww — in viaggio Roma
# i printed the word pieces & their IDs
(['e', '##www', '[UNK]', '—', 'in', 'viaggio', 'Roma'], [126, 18224, 101, 986, 139, 2395, 2097])
(['e', '##www', '—', 'in', 'viaggio', 'Roma'], [126, 18224, 986, 139, 2395, 2097])
(['e', '##www', '—', 'in', 'viaggio', 'Roma'], [126, 18224, 986, 139, 2395, 2097])

The missing word causes confusion for me when trying to correlate the BERT embeddings with the words they represent. Can the tokenizer be fixed to treat that character (or any other strange character) as [UNK] as well?
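A plausible explanation for the difference between the two test sentences (this is an assumption based on how BERT-style basic tokenizers in huggingface transformers typically clean text, not something verified against this exact model): any character whose Unicode category starts with 'C' is treated as a control character and deleted outright during cleaning, while symbols like the cat emoji (category 'So') survive cleaning and fall through to [UNK]. The private-use character here is category 'Co', so it is silently dropped. A pre-filter can restore word alignment by substituting such characters before tokenizing:

```python
import unicodedata

def is_control_like(ch):
    # Mirrors the usual BERT-style check (an assumption about this
    # tokenizer): any category starting with 'C' (Cc, Cf, Co, Cn)
    # other than tab/newline/CR is treated as a control character
    # and silently removed during text cleaning.
    if ch in ("\t", "\n", "\r"):
        return False
    return unicodedata.category(ch).startswith("C")

def prefilter(text, placeholder="\uFFFD"):
    # Replace would-be-deleted characters with a visible placeholder
    # (U+FFFD is category 'So') so the tokenizer emits [UNK] for the
    # word instead of dropping it entirely.
    return "".join(placeholder if is_control_like(c) else c for c in text)

print(unicodedata.category("\U0001F408"))  # So -> survives, becomes [UNK]
print(unicodedata.category("\U000FE4FA"))  # Co -> deleted by the cleaner
print(prefilter("ewww \U000FE4FA rest"))
```

Running the pre-filtered text through tokenizer.tokenize should then yield the same number of word pieces for both test sentences, keeping embeddings aligned with words.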
