CopyPastor

Detecting plagiarism made easy.

Score: 0.827315628528595; Reported for: String similarity Open both answers

Possible Plagiarism

Plagiarized on 2020-02-13
by Wsaitama

Original Post

Original - Posted on 2020-02-13
by Jindřich



            
Present in both answers; Present only in the new answer; Present only in the old answer;

0
The tokenizer works in two steps. First, it does pre-tokenization, which is basically splitting on spaces and separating punctuation. Let's have a look at it on a random Czech sentence:
tokenizer.basic_tokenizer.tokenize("Kočka leze dírou.")
This gives you: `['kocka', 'leze', 'dirou', '.']`
In the second step, it applies the word piece splitting algorithm, so you get:
tokenizer.tokenize("Kočka leze dírou.")
You get: `['[UNK]', 'le', '##ze', 'di', '##ro', '##u', '.']`
If there is no way how to split the token into subwords, the whole word becomes `[UNK]`. Tokens starting with `##` get appended to the previous ones, so this way you can find out where the `[UNK]` originally came from.
(And it seems weird to me that Spanish WordPiece tokenizer cannot parse a word that only consists of Latin characters.)
The tokenizer works in two steps. First, it does pre-tokenization, which is basically splitting on spaces and separating punctuation. Let's have a look at it on a random Czech sentence:
```python tokenizer.basic_tokenizer.tokenize("Kočka leze dírou.") ``` This gives you: `['kocka', 'leze', 'dirou', '.']`
In the second step, it applies the word piece splitting algorithm, so you get:
```python tokenizer.tokenize("Kočka leze dírou.") ``` You get: `['[UNK]', 'le', '##ze', 'di', '##ro', '##u', '.']`
If there is no way how to split the token into subwords, the whole word becomes `[UNK]`. Tokens starting with `##` get appended to the previous ones, so this way you can find out where the `[UNK]` originally came from.
(And it seems weird to me that Spanish WordPiece tokenizer cannot parse a word that only consists of Latin characters.)

        
Present in both answers; Present only in the new answer; Present only in the old answer;