Tokenizer results in blank token for extended UTF-8 characters #48

@AngledLuffa

Description

There is a question-mark-like character in one of the Universal Dependencies datasets which gets wiped out by the tokenizer for the Italian BERT & Electra models:

https://github.com/UniversalDependencies/UD_Italian-PoSTWITA

warning: big file
https://raw.githubusercontent.com/UniversalDependencies/UD_Italian-PoSTWITA/master/it_postwita-ud-train.conllu

search for "ewww" in the training file

It looks like this if I copy and paste it:

ewww 󾓺 — in viaggio Roma

according to emacs describe-char, it is character 0xFE4FA
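For anyone reproducing this, the same codepoint can be confirmed from Python as well (a quick standalone check, not tied to any tokenizer):

```python
import unicodedata

ch = "\U000FE4FA"  # the character emacs describe-char reports as 0xFE4FA
print(hex(ord(ch)))              # 0xfe4fa
print(unicodedata.category(ch))  # Co (private use, Plane 15)
```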

Anyway, hopefully that's enough background to figure out which character is causing the problem. If I run the following sentences through the tokenizer with tokenizer.tokenize(sentence) I get the following:

ewww 🐈 — in viaggio Roma   # another random character
ewww 󾓺 — in viaggio Roma    # to test, maybe need to check that this is the weird character, not just a box
ewww — in viaggio Roma
# i printed the word pieces & their IDs
(['e', '##www', '[UNK]', '—', 'in', 'viaggio', 'Roma'], [126, 18224, 101, 986, 139, 2395, 2097])
(['e', '##www', '—', 'in', 'viaggio', 'Roma'], [126, 18224, 986, 139, 2395, 2097])
(['e', '##www', '—', 'in', 'viaggio', 'Roma'], [126, 18224, 986, 139, 2395, 2097])

The missing word causes confusion for me when trying to correlate the BERT embeddings with the words they represent. Can the tokenizer be fixed to treat that character (or any other strange character) as [UNK] as well?
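A plausible explanation for the difference between the two test sentences (this is an assumption based on how BERT-style basic tokenizers in huggingface transformers typically clean text, not something verified against this exact model): any character whose Unicode category starts with 'C' is treated as a control character and deleted outright during cleaning, while symbols like the cat emoji (category 'So') survive cleaning and fall through to [UNK]. The private-use character here is category 'Co', so it is silently dropped. A pre-filter can restore word alignment by substituting such characters before tokenizing:

```python
import unicodedata

def is_control_like(ch):
    # Mirrors the usual BERT-style check (an assumption about this
    # tokenizer): any category starting with 'C' (Cc, Cf, Co, Cn)
    # other than tab/newline/CR is treated as a control character
    # and silently removed during text cleaning.
    if ch in ("\t", "\n", "\r"):
        return False
    return unicodedata.category(ch).startswith("C")

def prefilter(text, placeholder="\uFFFD"):
    # Replace would-be-deleted characters with a visible placeholder
    # (U+FFFD is category 'So') so the tokenizer emits [UNK] for the
    # word instead of dropping it entirely.
    return "".join(placeholder if is_control_like(c) else c for c in text)

print(unicodedata.category("\U0001F408"))  # So -> survives, becomes [UNK]
print(unicodedata.category("\U000FE4FA"))  # Co -> deleted by the cleaner
print(prefilter("ewww \U000FE4FA rest"))
```

Running the pre-filtered text through tokenizer.tokenize should then yield the same number of word pieces for both test sentences, keeping embeddings aligned with words.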
