The tokenizer works in two steps. First, it does pre-tokenization, which basically splits on whitespace and separates punctuation. Let's have a look at what it does with a random Czech sentence:
```python
tokenizer.basic_tokenizer.tokenize("Kočka leze dírou.")
```
This gives you: `['kocka', 'leze', 'dirou', '.']` (notice that the basic tokenizer also lowercases the text and strips the diacritics here).
In the second step, it applies the WordPiece splitting algorithm to each of these pre-tokens:
```python
tokenizer.tokenize("Kočka leze dírou.")
```
You get: `['[UNK]', 'le', '##ze', 'di', '##ro', '##u', '.']`
If there is no way to split a token into subwords that are in the vocabulary, the whole word becomes `[UNK]`. Tokens starting with `##` are continuations that get appended to the previous ones, so by merging them back together you can find out which original word the `[UNK]` came from.
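Here is a minimal sketch of that regrouping (my own helper, not part of the tokenizer's API): merging the `##` pieces back onto the preceding token reproduces the pre-tokenized words, so the `[UNK]` lines up with `kocka`:

```python
def group_word_pieces(pieces):
    """Merge WordPiece continuations (##...) back into whole words."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]   # append continuation to the previous word
        else:
            words.append(piece)      # start a new word
    return words

pieces = ['[UNK]', 'le', '##ze', 'di', '##ro', '##u', '.']
print(group_word_pieces(pieces))
# ['[UNK]', 'leze', 'dirou', '.']  -> the [UNK] corresponds to 'kocka'
```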
(And it seems weird to me that a Spanish WordPiece tokenizer cannot split a word that consists only of Latin characters.)
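For completeness, here is a minimal, self-contained sketch of both steps, assuming the slow `BertTokenizer` from Hugging Face `transformers` and a Spanish uncased BERT checkpoint (the checkpoint name below is only an illustrative assumption, since the snippets above don't show which model was used):

```python
from transformers import BertTokenizer

# Assumed checkpoint: any Spanish uncased BERT vocabulary behaves the same way.
tokenizer = BertTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")

sentence = "Kočka leze dírou."

# Step 1: pre-tokenization (whitespace + punctuation splitting,
# plus lowercasing and accent stripping for an uncased model).
print(tokenizer.basic_tokenizer.tokenize(sentence))

# Step 2: WordPiece splitting of each pre-token against the vocabulary;
# pre-tokens that cannot be split into known pieces become [UNK].
print(tokenizer.tokenize(sentence))
```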