audio-visual-understanding

Here are 2 public repositories matching this topic...

bytedance / SALMONN

SALMONN family: A suite of advanced multi-modal LLMs

audio music research video speech speech-recognition multi-modal video-understanding audio-processing tsinghua-university bytedance large-language-models iclr2024 icml-2024 audio-visual-understanding

Updated Sep 28, 2025

bytedance / video-SALMONN-2

video-SALMONN 2 is a powerful audio-visual large language model (LLM), developed by the Department of Electronic Engineering at Tsinghua University and ByteDance, that generates high-quality audio-visual video captions.

research video speech multi-modal video-understanding audio-processing tsinghua-university bytedance large-language-models llm audio-visual-understanding

Updated Sep 28, 2025
Python

