Labels: feature request, keep-open, multi-modality (#4194)
🚀 The feature, motivation and pitch
This issue keeps track of recurring Whisper feature requests, together with any linked ongoing efforts to support them.
When a feature request has no linked PR, feel free to claim the work here if you want to help!
- Support different `response_format` values (https://platform.openai.com/docs/api-reference/audio/createTranscription); see the client-side sketch after this list.
  - Related issues: [Usage]: How to let Whisper return timestamps in transcript? #19556, [Usage]: Vllm whisper model response_format verbose_json not working #14818, [Feature]: Implement SRT generation for audio transcription in vLLM #24302
  - PR(s): [Frontend] add 'verbose_json' and 'timestamp' feature on Whisper Transcription/Translation #24209 (`verbose_json`, help needed with other formats)
- Support timestamp granularities:
  - Context: Very much related to the above. Unfortunately, outputting by `word` requires aligning encoder latents (usually extrapolated from the cross-attention layers) with decoder ones. I feel a lot of these Whisper-specific techniques bring added complexity to vLLM. However, I think we're open to exploring this direction if we can come up with a less invasive solution. Some references to get started: https://github.com/m-bain/whisperX (WhisperX: word-level timestamps, diarization, batched inference within a file) and openai/whisper#684.
- Automatic language detection (see the masking sketch after this list):
  - Context: one should be able to let the decoder predict part of its "preamble" prompt, including the language and task tokens, conditioned on the encoder output. This is effectively using Whisper's built-in automatic language detection feature. Note that it would be ideal to constrain the output tokens to valid language tokens. Accuracy remains to be evaluated.
  - Related issues: [Feature]: will whisper add language detection? #14174
- Beam search:
  - PR(s): [WIP][Whisper] beam search for whisper #13758; this one needs reviving, so feel free to claim the work here.
- Feed previous chunk context to improve accuracy (see the chunk-chaining sketch after this list):
  - PR(s): [Frontend] add previous context to whisper transcription over 30s audio #20249 (first attempt, needs to be redone)
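
For reference, here is a minimal client-side sketch of what the `response_format` / `timestamp_granularities` requests would look like once supported, using the official `openai` Python SDK pointed at a vLLM OpenAI-compatible server. The base URL, API key, model name, and audio file below are placeholders; `timestamp_granularities` follows the OpenAI spec and only applies together with `response_format="verbose_json"`.

```python
# Client-side sketch (not yet fully supported server-side; see the PRs above).
# base_url, api_key, model name, and audio path are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM OpenAI-compatible server
    api_key="EMPTY",                      # vLLM does not require a real key by default
)

with open("sample.wav", "rb") as audio:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio,
        response_format="verbose_json",               # also: json, text, srt, vtt
        timestamp_granularities=["segment", "word"],  # verbose_json only, per the OpenAI spec
    )

# verbose_json responses carry per-segment (and, if requested, per-word) timestamps.
print(transcription.segments)
```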
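
On the language-detection item, here is a minimal sketch of the "constrain the output tokens to valid language tokens" idea, written in plain PyTorch with stand-in tensors (no vLLM internals assumed): the decoder logits at the step right after `<|startoftranscript|>` are masked down to the language tokens, and the softmax over that restricted set doubles as a confidence estimate. The vocabulary size and token IDs below are illustrative, not tied to a specific tokenizer version.

```python
# Conceptual sketch of guided language detection: restrict the first decoded
# token to Whisper's language tokens. All tensors here are dummies; in a real
# integration the logits would come from one decoder forward pass conditioned
# on the encoder output.
import torch

vocab_size = 51866                                        # illustrative Whisper vocab size
language_token_ids = torch.tensor([50259, 50260, 50261])  # e.g. <|en|>, <|zh|>, <|de|> (illustrative IDs)

logits = torch.randn(vocab_size)  # stand-in for decoder logits after <|startoftranscript|>

# Mask everything except the language tokens, then renormalize.
masked = torch.full_like(logits, float("-inf"))
masked[language_token_ids] = logits[language_token_ids]
language_probs = torch.softmax(masked, dim=-1)[language_token_ids]

best = language_token_ids[language_probs.argmax()]
print(f"predicted language token id: {best.item()}, confidence: {language_probs.max().item():.2f}")
```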
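
For the previous-context item, the OpenAI transcription API already defines a `prompt` field for exactly this purpose, so the client side could look like the sketch below once the server honors it. The base URL, model name, and chunk file names are placeholders, and the audio splitting itself is left out.

```python
# Sketch: chain >30s audio by passing each chunk's transcript as the prompt
# for the next chunk, mirroring Whisper's condition-on-previous-text behaviour.
# base_url, model name, and chunk file names are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chunk_paths = ["chunk_000.wav", "chunk_001.wav", "chunk_002.wav"]
previous_text = ""
full_transcript = []

for path in chunk_paths:
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(
            model="openai/whisper-large-v3",
            file=audio,
            prompt=previous_text,  # previous chunk's text conditions this one
        )
    # The default json response carries the transcript in `.text`.
    previous_text = result.text
    full_transcript.append(previous_text)

print(" ".join(full_transcript))
```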
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.