Description
This issue happens in the collate_fn function of the file utils/dataset.py during the validation phase. I traced it to a tokenizer bug that is triggered by the sample for the image "7939894288_3028c8874a_o.jpg".
The original corresponding text of this image is:
The birds have various ways of searching for food. What part of their body helps them to grab and pick up food from the ground in the picture?
After adding the prompt before line 94, the variable "conversation" in line 95 is:
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <im_start>
<im_end>
Birds have various ways of searching for food. What part of their body helps them to grab and pick up food from the ground in the picture? Please output segmentation mask. ASSISTANT: [SEG].
The tokenizer then encodes that text into the following ids:
tensor([ 1, 319, 13563, 1546, 263, 12758, 5199, 322, 385, 23116,
21082, 20255, 29889, 450, 20255, 4076, 8444, 29892, 13173, 29892,
322, 1248, 568, 6089, 304, 278, 5199, 29915, 29879, 5155,
29889, 3148, 1001, 29901, 32001, -200, 32002, 0, 29879, 505,
5164, 5837, 310, 11975, 363, 9687, 29889, 1724, 760, 310,
1009, 3573, 6911, 963, 304, 17229, 322, 5839, 701, 9687,
515, 278, 5962, 297, 278, 7623, 29973, 3529, 1962, 10768,
362, 11105, 29889, 319, 1799, 9047, 13566, 29901, 32000, 29889,
2])
Note that 32001, -200, 32002 are <im_start><image><im_end>. Right after them there is a zero. That zero is the problem.
Now, let's decode those ids back using the tokenizer.decode() function:
print(tokenizer.decode([32002, 0, 29879, 505,
5164, 5837, 310, 11975, 363, 9687, 29889, 1724, 760, 310,
1009, 3573, 6911, 963, 304, 17229, 322, 5839, 701, 9687,
515, 278, 5962, 297, 278, 7623, 29973, 3529, 1962, 10768,
362, 11105, 29889, 319, 1799, 9047, 13566, 29901, 32000, 29889,
2]))
Then we get:
<im_end> <unk>s have various ways of searching for food. What part of their body helps them to grab and pick up food from the ground in the picture? Please output segmentation mask. ASSISTANT: [SEG] .</s>
So the word "Birds" is split into "<unk>" and "s", with ids 0 and 29879. However, 0 is also the id used for the padding token, so the total_len variable computed in line 96, total_len = int(target.ne(tokenizer.pad_token_id).sum()), comes out smaller than the actual number of tokens. This finally leads to the error in line 135: assert cur_len == total_len
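The miscount can be reproduced without loading the model or the tokenizer. The following is only a minimal sketch: it reuses the id tail quoted above, assumes that tokenizer.pad_token_id is 0 (the same id that decodes to <unk>), and uses two hypothetical helper names, count_non_pad and stray_pad_positions, that are not part of the repository.

```python
# Tail of the token ids quoted above, starting at <im_end> (32002).
ids = [32002, 0, 29879, 505,
       5164, 5837, 310, 11975, 363, 9687, 29889, 1724, 760, 310,
       1009, 3573, 6911, 963, 304, 17229, 322, 5839, 701, 9687,
       515, 278, 5962, 297, 278, 7623, 29973, 3529, 1962, 10768,
       362, 11105, 29889, 319, 1799, 9047, 13566, 29901, 32000, 29889,
       2]

# Assumption: the pad token shares id 0 with <unk>.
PAD_ID = 0


def count_non_pad(token_ids, pad_id):
    """Plain-list equivalent of int(target.ne(pad_id).sum())."""
    return sum(1 for t in token_ids if t != pad_id)


def stray_pad_positions(token_ids, pad_id):
    """Indices where pad_id appears before the last non-pad token,
    i.e. inside the real text rather than in trailing padding."""
    last_real = max(
        (i for i, t in enumerate(token_ids) if t != pad_id), default=-1
    )
    return [i for i, t in enumerate(token_ids[:last_real]) if t == pad_id]


total_len = count_non_pad(ids, PAD_ID)
stray = stray_pad_positions(ids, PAD_ID)

# total_len is one less than len(ids) because the in-text 0 is
# counted as padding; stray reports where that 0 sits.
print(len(ids), total_len, stray)
```

Running a check like stray_pad_positions over the tokenized dataset is one way to find other samples that would hit the same assertion.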