
Improved text token counting #463

@tazlin

Description


The central issue revolves around the following function:

def get_things_count(self, generation=None):
    if generation is None:
        if self.generation is None:
            return 0
        generation = self.generation
    quick_token_count = math.ceil(len(generation) / 4)
    if quick_token_count < 20:
        quick_token_count = 20
    if self.wp.things > quick_token_count:
        # logger.debug([self.wp.things, quick_token_count])
        return quick_token_count
    return self.wp.things
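Stripped of the worker-payload cap, the estimate above boils down to a fixed 4-characters-per-token factor with a floor of 20. A standalone sketch (names are illustrative, not from the AI-Horde codebase):

```python
import math


def quick_token_estimate(generation: str) -> int:
    """Estimate token count from character count using the fixed
    4 chars/token factor, with a floor of 20 tokens, mirroring the
    snippet above (the self.wp.things cap is elided)."""
    return max(20, math.ceil(len(generation) / 4))


# A 100-character generation is estimated at 25 tokens; a tokenizer
# that averages only 3 characters per token would actually emit ~34,
# so the worker appears to generate faster than it really does.
estimate = quick_token_estimate("x" * 100)  # 25
```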

Certain tokenizers can beat the fixed factor of "4" (i.e., they average fewer than four characters per token), which leaves the horde believing the worker is generating tokens faster than is possible, when in reality the tokenizer simply produces more tokens than that estimate on average. You can analyze the characters_input / tokens_generated ratio of different tokenizers here:
https://huggingface.co/spaces/Xenova/the-tokenizer-playground

The horde text model reference could have a tokenizer_efficiency field added, and the AI-Horde updated to use it, to reduce this problem. The text reference uses the huggingface names as the canonical names, so the huggingface client library could be used to retrieve each tokenizer.
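Once a tokenizer's files are fetched, the vocabulary size can be read straight out of tokenizer.json. A minimal sketch, assuming the Hugging Face fast-tokenizer layout where the vocabulary lives under model.vocab (a token-to-id mapping for BPE tokenizers, a list for some others; len() covers both):

```python
import json


def vocab_size_from_tokenizer_json(path: str) -> int:
    """Scrape tokenizer_vocab_size from a downloaded tokenizer.json.

    Assumes the standard fast-tokenizer file layout: the vocabulary
    sits under the "model" object's "vocab" entry.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return len(data["model"]["vocab"])
```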

I propose the following process:

  • A tokenizer_efficiency and a tokenizer_vocab_size field (the latter for posterity) are added to the model reference.
  • A script is written to download all of the tokenizers and their configurations (each is on the order of megabytes; de-duplicating may also be possible). The tokenizer_vocab_size is scraped from tokenizer.json (it is the length of the vocab list in the model object).
  • A large amount of random (read: representative) text is generated and saved as a fixed dataset to be run against all tokenizers. The character count of this dataset is also saved.
  • Each tokenizer tokenizes the fixed dataset and the resulting tokens are counted:
    tokenizer_efficiency = characters_input / tokens_generated
  • The existing text model reference is updated with these fields.
  • A CI workflow automates the collection and enforcement of these new fields for https://github.com/Haidra-Org/AI-Horde-text-model-reference.
  • The relevant AI-Horde code is updated to utilize it.
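The measurement step and its downstream use can be sketched end to end. The tokenize callable below stands in for a real Hugging Face tokenizer (a naive whitespace split, purely for illustration), and all function names are hypothetical; the corrected estimate divides by the measured per-model efficiency instead of the hardcoded factor of 4:

```python
import math


def measure_efficiency(dataset: str, tokenize) -> float:
    """tokenizer_efficiency = characters_input / tokens_generated,
    measured once over the fixed reference dataset."""
    tokens = tokenize(dataset)
    return len(dataset) / len(tokens)


def token_estimate(generation: str, efficiency: float) -> int:
    """Same estimate as the current code, but with the per-model
    efficiency in place of the hardcoded 4 chars/token."""
    return max(20, math.ceil(len(generation) / efficiency))


# Illustration only: whitespace splitting as a stand-in tokenizer.
dataset = "the quick brown fox jumps over the lazy dog"
eff = measure_efficiency(dataset, str.split)
```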

A potential alternative approach would be downloading and running the tokenizers API-side (as a microservice?), but I suspect this would introduce enormous and unnecessary complications, as well as add an unacceptable delay to generations.

Labels: enhancement (New feature or request)