
[Bug] The tokenizer will recognize the "Birds" as a <unk> and an "s" #193

@Zxl19990529

Description


Related issues: #92 #66

This issue happens in the collate_fn function of the file utils/dataset.py during the validation phase. I finally found that it is rooted in a tokenizer bug that is triggered by the image "7939894288_3028c8874a_o.jpg".
The original text corresponding to this image is:

The birds have various ways of searching for food. What part of their body helps them to grab and pick up food from the ground in the picture?

After the prompt is added before line 94, the variable "conversation" in line 95 becomes:

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <im_start><image><im_end>
Birds have various ways of searching for food. What part of their body helps them to grab and pick up food from the ground in the picture? Please output segmentation mask. ASSISTANT: [SEG].

Meanwhile, the tokenizer encodes that text into the following ids:

tensor([    1,   319, 13563,  1546,   263, 12758,  5199,   322,   385, 23116,
        21082, 20255, 29889,   450, 20255,  4076,  8444, 29892, 13173, 29892,
          322,  1248,   568,  6089,   304,   278,  5199, 29915, 29879,  5155,
        29889,  3148,  1001, 29901, 32001,  -200, 32002,     0, 29879,   505,
         5164,  5837,   310, 11975,   363,  9687, 29889,  1724,   760,   310,
         1009,  3573,  6911,   963,   304, 17229,   322,  5839,   701,  9687,
          515,   278,  5962,   297,   278,  7623, 29973,  3529,  1962, 10768,
          362, 11105, 29889,   319,  1799,  9047, 13566, 29901, 32000, 29889,
            2])

Note that 32001, -200, and 32002 correspond to <im_start><image><im_end>. Right after them there is a 0, see that? That's the problem here!
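For reference, here is a rough sketch of how these ids get produced in collate_fn, as far as I can tell from the code; tokenizer_image_token is the helper from the repo's vendored llava utilities (model/llava/mm_utils.py), and this is an approximation of the call, not the exact code path:

```python
# rough repro sketch (assumes the repo's vendored llava utilities, and that
# `conversation` and `tokenizer` are the objects described above):
# tokenizer_image_token splits the prompt on the "<image>" placeholder,
# tokenizes each chunk separately, and splices IMAGE_TOKEN_INDEX (-200)
# between the chunks, which is why -200 appears between 32001 and 32002.
from model.llava.mm_utils import tokenizer_image_token

input_ids = tokenizer_image_token(conversation, tokenizer, return_tensors="pt")
print(input_ids)  # the stray 0 (<unk>) shows up right after 32002 (<im_end>)
```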
Now, let's decode those ids back using the tokenizer.decode() function:

print(tokenizer.decode([32002,     0, 29879,   505,
         5164,  5837,   310, 11975,   363,  9687, 29889,  1724,   760,   310,
         1009,  3573,  6911,   963,   304, 17229,   322,  5839,   701,  9687,
          515,   278,  5962,   297,   278,  7623, 29973,  3529,  1962, 10768,
          362, 11105, 29889,   319,  1799,  9047, 13566, 29901, 32000, 29889,
            2]))

Then we get:

<im_end> <unk>s have various ways of searching for food. What part of their body helps them to grab and pick up food from the ground in the picture? Please output segmentation mask. ASSISTANT: [SEG]  .</s>

So the word "Birds" is split into an "<unk>" and an "s", whose ids are 0 and 29879. However, 0 is also the id of the padding token, so the total_len variable computed in line 96, total_len = int(target.ne(tokenizer.pad_token_id).sum()), is smaller than the actual token length, which finally triggers the assertion error in line 135: assert cur_len == total_len
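To make the miscount concrete, here is a small self-contained demonstration with abridged ids from the tensor above, plus one possible workaround. This rests on my assumption that the training script sets tokenizer.pad_token = tokenizer.unk_token (which would be why pad and <unk> share id 0); the workaround is an untested sketch, not an official fix:

```python
import torch

# pad_token appears to be set to unk_token, so both share id 0
pad_token_id = 0

# abridged ids from the tensor above; the 0 here is the <unk> inside "Birds",
# not padding
target = torch.tensor([32002, 0, 29879, 505, 5164, 5837])

total_len = int(target.ne(pad_token_id).sum())
print(total_len)        # 5 -- the in-sentence <unk> is wrongly counted as padding
print(target.shape[0])  # 6 -- the actual number of tokens, so the assert fires

# possible workaround (untested sketch): record each sequence's real length in
# collate_fn before pad_sequence is applied, instead of reconstructing it with
# ne(pad_token_id), so an <unk> in the middle can no longer be mistaken for padding
true_len = target.shape[0]  # taken before any padding is added
```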
