
Improved text token counting #463

@tazlin

Description


The central issue revolves around the following function:

def get_things_count(self, generation=None):
    if generation is None:
        if self.generation is None:
            return 0
        generation = self.generation
    quick_token_count = math.ceil(len(generation) / 4)
    if quick_token_count < 20:
        quick_token_count = 20
    if self.wp.things > quick_token_count:
        # logger.debug([self.wp.things, quick_token_count])
        return quick_token_count
    return self.wp.things
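Stripped of the worker-payload cap, the estimate above boils down to a fixed 4-characters-per-token factor with a floor of 20. A standalone sketch (names are illustrative, not from the AI-Horde codebase):

```python
import math


def quick_token_estimate(generation: str) -> int:
    """Estimate token count from character count using the fixed
    4 chars/token factor, with a floor of 20 tokens, mirroring the
    snippet above (the self.wp.things cap is elided)."""
    return max(20, math.ceil(len(generation) / 4))


# A 100-character generation is estimated at 25 tokens; a tokenizer
# that averages only 3 characters per token would actually emit ~34,
# so the worker appears to generate faster than it really does.
estimate = quick_token_estimate("x" * 100)  # 25
```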

Certain tokenizers can beat the fixed factor of "4" (i.e., they average fewer than four characters per token), which leaves the horde believing the worker is generating tokens faster than is possible, when in reality the tokenizer simply produces more tokens than that estimate on average. You can analyze the characters_input / tokens_generated ratio of different tokenizers here:
https://huggingface.co/spaces/Xenova/the-tokenizer-playground

The horde text model reference could have a tokenizer_efficiency field added, and the AI-Horde updated to use it, to reduce this problem. The text reference uses the huggingface names as the canonical names, so the huggingface client library could be used to retrieve each tokenizer.
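Once a tokenizer's files are fetched, the vocabulary size can be read straight out of tokenizer.json. A minimal sketch, assuming the Hugging Face fast-tokenizer layout where the vocabulary lives under model.vocab (a token-to-id mapping for BPE tokenizers, a list for some others; len() covers both):

```python
import json


def vocab_size_from_tokenizer_json(path: str) -> int:
    """Scrape tokenizer_vocab_size from a downloaded tokenizer.json.

    Assumes the standard fast-tokenizer file layout: the vocabulary
    sits under the "model" object's "vocab" entry.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return len(data["model"]["vocab"])
```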

I propose the following process:

  • A tokenizer_efficiency and a tokenizer_vocab_size field (the latter for posterity) are added to the model reference.
  • A script is written to download all of the tokenizers and their configurations (each is on the order of megabytes; de-duplicating may also be possible). The tokenizer_vocab_size is scraped from tokenizer.json (it is the length of the vocab list in the model object).
  • A large amount of random (read: representative) text is generated and saved as a fixed dataset to be run against all tokenizers. The character count of this dataset is also saved.
  • Each tokenizer tokenizes the fixed dataset and the resulting tokens are counted:
    tokenizer_efficiency = characters_input / tokens_generated
  • The existing text model reference is updated with these fields.
  • A CI workflow automates the collection and enforcement of these new fields for https://github.com/Haidra-Org/AI-Horde-text-model-reference.
  • The relevant AI-Horde code is updated to utilize it.
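The measurement step and its downstream use can be sketched end to end. The tokenize callable below stands in for a real Hugging Face tokenizer (a naive whitespace split, purely for illustration), and all function names are hypothetical; the corrected estimate divides by the measured per-model efficiency instead of the hardcoded factor of 4:

```python
import math


def measure_efficiency(dataset: str, tokenize) -> float:
    """tokenizer_efficiency = characters_input / tokens_generated,
    measured once over the fixed reference dataset."""
    tokens = tokenize(dataset)
    return len(dataset) / len(tokens)


def token_estimate(generation: str, efficiency: float) -> int:
    """Same estimate as the current code, but with the per-model
    efficiency in place of the hardcoded 4 chars/token."""
    return max(20, math.ceil(len(generation) / efficiency))


# Illustration only: whitespace splitting as a stand-in tokenizer.
dataset = "the quick brown fox jumps over the lazy dog"
eff = measure_efficiency(dataset, str.split)
```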

A potential alternative approach would be downloading and running the tokenizers API-side (as a microservice?), but I suspect this would introduce enormous and unnecessary complications, as well as add an unacceptable delay to generations.

Labels: enhancement (New feature or request)