
[Bug] The tokenizer will recognize the "Birds" as a <unk> and an "s" #193

@Zxl19990529

Description


Related issues: #92 #66

This issue happens in the collate_fn function of the file utils/dataset.py during the validation phase. I finally found that it is rooted in a tokenizer bug that is triggered by the image "7939894288_3028c8874a_o.jpg".
The original text corresponding to this image is:

The birds have various ways of searching for food. What part of their body helps them to grab and pick up food from the ground in the picture?

After the prompt is added before line 94, the variable "conversation" in line 95 becomes:

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <im_start><image><im_end>
Birds have various ways of searching for food. What part of their body helps them to grab and pick up food from the ground in the picture? Please output segmentation mask. ASSISTANT: [SEG].

Meanwhile, the tokenizer encodes that text into the following ids:

tensor([    1,   319, 13563,  1546,   263, 12758,  5199,   322,   385, 23116,
        21082, 20255, 29889,   450, 20255,  4076,  8444, 29892, 13173, 29892,
          322,  1248,   568,  6089,   304,   278,  5199, 29915, 29879,  5155,
        29889,  3148,  1001, 29901, 32001,  -200, 32002,     0, 29879,   505,
         5164,  5837,   310, 11975,   363,  9687, 29889,  1724,   760,   310,
         1009,  3573,  6911,   963,   304, 17229,   322,  5839,   701,  9687,
          515,   278,  5962,   297,   278,  7623, 29973,  3529,  1962, 10768,
          362, 11105, 29889,   319,  1799,  9047, 13566, 29901, 32000, 29889,
            2])

Note that 32001, -200, and 32002 correspond to <im_start><image><im_end>. Right after them there is a 0, see that? That's the problem here!
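For reference, here is a rough sketch of how these ids get produced in collate_fn, as far as I can tell from the code; tokenizer_image_token is the helper from the repo's vendored llava utilities (model/llava/mm_utils.py), and this is an approximation of the call, not the exact code path:

```python
# rough repro sketch (assumes the repo's vendored llava utilities, and that
# `conversation` and `tokenizer` are the objects described above):
# tokenizer_image_token splits the prompt on the "<image>" placeholder,
# tokenizes each chunk separately, and splices IMAGE_TOKEN_INDEX (-200)
# between the chunks, which is why -200 appears between 32001 and 32002.
from model.llava.mm_utils import tokenizer_image_token

input_ids = tokenizer_image_token(conversation, tokenizer, return_tensors="pt")
print(input_ids)  # the stray 0 (<unk>) shows up right after 32002 (<im_end>)
```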
Now, let's decode those ids back using the tokenizer.decode() function:

print(tokenizer.decode([32002,     0, 29879,   505,
         5164,  5837,   310, 11975,   363,  9687, 29889,  1724,   760,   310,
         1009,  3573,  6911,   963,   304, 17229,   322,  5839,   701,  9687,
          515,   278,  5962,   297,   278,  7623, 29973,  3529,  1962, 10768,
          362, 11105, 29889,   319,  1799,  9047, 13566, 29901, 32000, 29889,
            2]))

Then we get:

<im_end> <unk>s have various ways of searching for food. What part of their body helps them to grab and pick up food from the ground in the picture? Please output segmentation mask. ASSISTANT: [SEG]  .</s>

So the word "Birds" is split into an "<unk>" and an "s", whose ids are 0 and 29879. However, 0 is also the id of the padding token, so the total_len variable computed in line 96, total_len = int(target.ne(tokenizer.pad_token_id).sum()), is smaller than the actual token length, which finally triggers the assertion error in line 135: assert cur_len == total_len
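To make the miscount concrete, here is a small self-contained demonstration with abridged ids from the tensor above, plus one possible workaround. This rests on my assumption that the training script sets tokenizer.pad_token = tokenizer.unk_token (which would be why pad and <unk> share id 0); the workaround is an untested sketch, not an official fix:

```python
import torch

# pad_token appears to be set to unk_token, so both share id 0
pad_token_id = 0

# abridged ids from the tensor above; the 0 here is the <unk> inside "Birds",
# not padding
target = torch.tensor([32002, 0, 29879, 505, 5164, 5837])

total_len = int(target.ne(pad_token_id).sum())
print(total_len)        # 5 -- the in-sentence <unk> is wrongly counted as padding
print(target.shape[0])  # 6 -- the actual number of tokens, so the assert fires

# possible workaround (untested sketch): record each sequence's real length in
# collate_fn before pad_sequence is applied, instead of reconstructing it with
# ne(pad_token_id), so an <unk> in the middle can no longer be mistaken for padding
true_len = target.shape[0]  # taken before any padding is added
```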
