Description
There is a question-mark-looking character in one of the Universal Dependencies datasets which gets silently wiped out by the tokenizer for the Italian BERT & ELECTRA models:
https://github.com/UniversalDependencies/UD_Italian-PoSTWITA
warning: big file
https://raw.githubusercontent.com/UniversalDependencies/UD_Italian-PoSTWITA/master/it_postwita-ud-train.conllu
search for "ewww" in the training file
It looks like this if I copy and paste it:
ewww <U+FE4FA> β in viaggio Roma
According to Emacs describe-char, it is character U+FE4FA.
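For what it's worth, U+FE4FA sits in Supplementary Private Use Area-A, so its Unicode general category is `Co` (private use). BERT-style basic tokenizers typically treat any category-`C*` character (other than tab, newline, and carriage return) as a control character and delete it during text cleaning, which would explain why the token vanishes rather than becoming `[UNK]`. A minimal check with the standard library:

```python
import unicodedata

# The character reported by describe-char
ch = chr(0xFE4FA)

print(hex(ord(ch)))              # 0xfe4fa
print(unicodedata.category(ch))  # 'Co' -> private-use character
# Category-C characters are commonly stripped by BERT-style text
# cleaning before wordpiece splitting ever sees them.
```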
Anyway, hopefully that's enough background to figure out which character is causing the problem. If I run the following sentences through the tokenizer with tokenizer.tokenize(sentence) I get the following:
ewww π β in viaggio Roma # another random character, for comparison
ewww <U+FE4FA> β in viaggio Roma # the problem sentence; worth double-checking this really is U+FE4FA and not just a rendering box
ewww β in viaggio Roma # the same sentence with the character removed
# I printed the word pieces & their IDs
(['e', '##www', '[UNK]', 'β', 'in', 'viaggio', 'Roma'], [126, 18224, 101, 986, 139, 2395, 2097])
(['e', '##www', 'β', 'in', 'viaggio', 'Roma'], [126, 18224, 986, 139, 2395, 2097])
(['e', '##www', 'β', 'in', 'viaggio', 'Roma'], [126, 18224, 986, 139, 2395, 2097])
The silently dropped word causes confusion when I try to correlate the BERT embeddings with the words they represent. Could the tokenizer be changed to treat that character (or any other strange character) as [UNK] instead of removing it?
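In the meantime, a possible caller-side workaround (a sketch, not part of any library API; `mask_stripped_chars` is a name I made up) is to pre-replace category-`C*` characters with U+FFFD before tokenizing. U+FFFD has category `So`, so it survives the cleaning step, and in most vocabularies it should come out as `[UNK]`, keeping the token count aligned with the original words:

```python
import unicodedata

def mask_stripped_chars(text, placeholder="\ufffd"):
    """Replace characters a BERT-style basic tokenizer would silently
    drop (general category C*, except tab/newline/CR) with U+FFFD so
    they surface as [UNK] instead of vanishing."""
    out = []
    for ch in text:
        if ch in ("\t", "\n", "\r"):
            out.append(ch)
        elif unicodedata.category(ch).startswith("C"):
            out.append(placeholder)
        else:
            out.append(ch)
    return "".join(out)
```

Since the replacement is one character for one character, offsets into the original string are preserved, which helps when aligning embeddings back to words.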