Tokenizer for multiple encodings #213

@Frogley

Description

Is your feature request related to a problem? Please describe.
I need to calculate the number of tokens in a prompt, but TokenizerGpt3 produces incorrect token counts for GPT-3.5 and newer models.

TokenizerGpt3 is largely based on openai-tools. After reading the source code, I found that its implementation essentially follows data_gym_to_mergeable_bpe_ranks, which requires an encoder.json and a vocab.bpe file at runtime. According to openai_public, this approach is intended for gpt-2, and based on my test results it also works for r50k_base and p50k_base. However, it does not work for cl100k_base (GPT-4 and GPT-3.5).
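
For illustration, here is a minimal sketch using the upstream Python tiktoken package (the library that openai_public belongs to); the sample text is arbitrary. It shows that the encoder.json/vocab.bpe-based encodings (r50k_base, p50k_base) and cl100k_base do not tokenize the same input identically, which is why a GPT-2-era tokenizer miscounts for GPT-3.5/GPT-4:

```python
import tiktoken  # reference implementation that defines openai_public, load_tiktoken_bpe, etc.

text = "Token counts differ between encodings."  # arbitrary example text

# r50k_base / p50k_base follow the GPT-2-style vocabulary built from
# encoder.json + vocab.bpe (data_gym_to_mergeable_bpe_ranks).
r50k = tiktoken.get_encoding("r50k_base")
p50k = tiktoken.get_encoding("p50k_base")

# cl100k_base (gpt-3.5-turbo / gpt-4) is built from a .tiktoken ranks file
# via load_tiktoken_bpe and generally yields different tokens for the same text.
cl100k = tiktoken.get_encoding("cl100k_base")

for enc in (r50k, p50k, cl100k):
    print(enc.name, len(enc.encode(text)))
```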

Starting with r50k_base, the reference tokenizer implementation switched to load_tiktoken_bpe, which relies on a .tiktoken ranks file at runtime. Currently there are two tokenizer projects that support GPT-3.5, TiktokenSharp and SharpToken, and both are implemented this way (see the sketch below).
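
As a rough sketch of what load_tiktoken_bpe does (simplified from the upstream Python tiktoken source; the download and caching logic is omitted and the file path is hypothetical), a .tiktoken file is just one base64-encoded token per line followed by its merge rank:

```python
import base64

def load_tiktoken_bpe_simplified(path: str) -> dict[bytes, int]:
    # Each non-empty line of a .tiktoken file is "<base64 token bytes> <rank>".
    with open(path, "rb") as f:
        contents = f.read()
    return {
        base64.b64decode(token): int(rank)
        for token, rank in (line.split() for line in contents.splitlines() if line)
    }

# ranks = load_tiktoken_bpe_simplified("cl100k_base.tiktoken")  # hypothetical local path
```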

Describe the solution you'd like
It is difficult to modify the current TokenizerGpt3 to support cl100k_base; a rewrite may be the only way. Do you think it is necessary? If so, I am willing to undertake the rewrite. Please let me know your opinion.
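
For reference, in the upstream Python tiktoken the model-to-encoding resolution is a single call; a rewritten tokenizer would need an equivalent lookup so callers can pass a model name instead of an encoding name (the model names below are just examples):

```python
import tiktoken

# gpt-3.5-turbo and gpt-4 both resolve to cl100k_base;
# text-davinci-003 resolves to p50k_base.
for model in ("gpt-3.5-turbo", "gpt-4", "text-davinci-003"):
    enc = tiktoken.encoding_for_model(model)
    print(model, "->", enc.name)
```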

Describe alternatives you've considered
Or maybe we can just use TiktokenSharp.

Metadata

Labels: bug (Something isn't working)
