Tokenizer for multiple encodings #213

@Frogley

Description

Is your feature request related to a problem? Please describe.
I need to calculate the number of tokens in a prompt, but TokenizerGpt3 produces incorrect token counts for GPT-3.5 and newer models.

TokenizerGpt3 is largely based on openai-tools. After reading the source code, I found that its implementation essentially follows data_gym_to_mergeable_bpe_ranks, which requires an encoder.json and a vocab.bpe file at runtime. According to openai_public, this approach is intended for gpt-2, and based on my test results it also works for r50k_base and p50k_base. However, it does not work for cl100k_base (GPT-4 and GPT-3.5).
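
For illustration, here is a minimal sketch using the upstream Python tiktoken package (the library that openai_public belongs to); the sample text is arbitrary. It shows that the encoder.json/vocab.bpe-based encodings (r50k_base, p50k_base) and cl100k_base do not tokenize the same input identically, which is why a GPT-2-era tokenizer miscounts for GPT-3.5/GPT-4:

```python
import tiktoken  # reference implementation that defines openai_public, load_tiktoken_bpe, etc.

text = "Token counts differ between encodings."  # arbitrary example text

# r50k_base / p50k_base follow the GPT-2-style vocabulary built from
# encoder.json + vocab.bpe (data_gym_to_mergeable_bpe_ranks).
r50k = tiktoken.get_encoding("r50k_base")
p50k = tiktoken.get_encoding("p50k_base")

# cl100k_base (gpt-3.5-turbo / gpt-4) is built from a .tiktoken ranks file
# via load_tiktoken_bpe and generally yields different tokens for the same text.
cl100k = tiktoken.get_encoding("cl100k_base")

for enc in (r50k, p50k, cl100k):
    print(enc.name, len(enc.encode(text)))
```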

Starting with r50k_base, the reference tokenizer implementation switched to load_tiktoken_bpe, which relies on a .tiktoken ranks file at runtime. Currently there are two tokenizer projects that support GPT-3.5, TiktokenSharp and SharpToken, and both are implemented this way (see the sketch below).
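
As a rough sketch of what load_tiktoken_bpe does (simplified from the upstream Python tiktoken source; the download and caching logic is omitted and the file path is hypothetical), a .tiktoken file is just one base64-encoded token per line followed by its merge rank:

```python
import base64

def load_tiktoken_bpe_simplified(path: str) -> dict[bytes, int]:
    # Each non-empty line of a .tiktoken file is "<base64 token bytes> <rank>".
    with open(path, "rb") as f:
        contents = f.read()
    return {
        base64.b64decode(token): int(rank)
        for token, rank in (line.split() for line in contents.splitlines() if line)
    }

# ranks = load_tiktoken_bpe_simplified("cl100k_base.tiktoken")  # hypothetical local path
```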

Describe the solution you'd like
It is difficult to modify the current TokenizerGpt3 to support cl100k_base; a rewrite may be the only way. Do you think it is necessary? If so, I am willing to undertake the rewrite. Please let me know your opinion.
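
For reference, in the upstream Python tiktoken the model-to-encoding resolution is a single call; a rewritten tokenizer would need an equivalent lookup so callers can pass a model name instead of an encoding name (the model names below are just examples):

```python
import tiktoken

# gpt-3.5-turbo and gpt-4 both resolve to cl100k_base;
# text-davinci-003 resolves to p50k_base.
for model in ("gpt-3.5-turbo", "gpt-4", "text-davinci-003"):
    enc = tiktoken.encoding_for_model(model)
    print(model, "->", enc.name)
```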

Describe alternatives you've considered
Or maybe we can just use TiktokenSharp.

Metadata

Labels: bug (Something isn't working)
