OpenGVLab

93 repositories

RIVER
Public
[ICLR 2026] RIVER: A Real-Time Interaction Benchmark for Video LLMs
Python
•0•8•1•0•Updated Apr 20, 2026
EfficientQAT
Public
[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
Python
•
MIT License
•30•337•13•0•Updated Apr 10, 2026
MMT-Bench
Public
[ICML 2024] MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Python
•4•119•0•0•Updated Apr 6, 2026
V2PE
Public
[ICCV2025] V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
Python
•
MIT License
•2•60•2•0•Updated Apr 4, 2026
GenExam
Public
GenExam: A Multidisciplinary Text-to-Image Exam
benchmark image-generation text-to-image-generation
Python
•
MIT License
•4•65•0•0•Updated Mar 29, 2026
InternVideo
Public
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
benchmark action-recognition video-understanding video-data self-supervised multimodal video-dataset open-set-recognition video-retrieval video-question-answering
Python
•
Apache License 2.0
•146•2.2k•136•4•Updated Mar 25, 2026
InternVL-U
Public
InternVL-U is a 4B-parameter unified multimodal model (UMM) that brings multimodal understanding, reasoning, image generation, image editing into a single frame…
Python
•
MIT License
•14•269•5•0•Updated Mar 21, 2026
Vlaser
Public
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
Python
•
MIT License
•0•45•1•0•Updated Mar 18, 2026
GenEditEvalKit
Public
The first unified, efficient, and extensible evaluation toolkit for evaluating image generation and editing models across multiple benchmarks.
Jupyter Notebook
•
MIT License
•4•0•0•0•Updated Mar 7, 2026
VKnowU
Public
Python
•1•11•0•0•Updated Feb 3, 2026
MetaCaptioner
Public
Python
•4•50•2•0•Updated Jan 27, 2026
ScaleCUA
Public
[ICLR 2026 Oral] ScaleCUA provides open-source computer use agents that can operate in cross-platform environments (Windows, macOS, Ubuntu, Android).
data models gui-agents computer-use-agents scalecua online-evaluation-suite
Python
•
Apache License 2.0
•78•1.1k•14•1•Updated Jan 7, 2026
GUI-Odyssey
Public
[ICCV 2025] GUIOdyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUIOdyssey consists of 8,834 episodes from 6 mobile d…
Python
•9•156•10•0•Updated Jan 3, 2026
SDLM
Public
Sequential Diffusion Language Model (SDLM) enhances pre-trained autoregressive language models by adaptively determining generation length and maintaining KV-ca…
gpt language-model diffusion-models llm
Python
•
MIT License
•4•97•0•0•Updated Dec 27, 2025
SID-VLN
Public
Official implementation of: Learning Goal-Oriented Language-Guided Navigation with Self-Improving Demonstrations at Scale
Python
•
MIT License
•2•12•0•0•Updated Nov 29, 2025
vinci
Public
Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model
Python
•2•89•2•0•Updated Nov 27, 2025
OmniQuant
Public
[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
quantization large-language-models llm
Python
•
MIT License
•78•891•30•2•Updated Nov 26, 2025
VideoChat-Flash
Public
[ICLR2026] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Python
•
MIT License
•19•518•10•0•Updated Nov 18, 2025
ExpVid
Public
0•9•0•0•Updated Oct 28, 2025
VideoChat-R1
Public
[NIPS2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning
Python
•10•267•24•0•Updated Oct 18, 2025
NaViL
Public
Python
•
MIT License
•7•92•0•0•Updated Oct 10, 2025
PonderV2
Public
[T-PAMI 2025] PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm
3d-vision pretraining foundation-models
Python
•
MIT License
•8•372•0•0•Updated Sep 30, 2025
InternVL
Public
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model with performance approaching GPT-4o.
image-classification gpt multi-modal semantic-segmentation video-classification image-text-retrieval llm vision-language-model gpt-4v vit-6b
Python
•
MIT License
•769•10k•303•11•Updated Sep 22, 2025
EgoExoLearn
Public
[CVPR 2024] Data and benchmark code for the EgoExoLearn dataset
Python
•
MIT License
•2•82•4•0•Updated Aug 26, 2025
VRBench
Public
[ICCV 2025] A Benchmark for Multi-Step Reasoning in Long Narrative Videos
benchmark dataset video-understanding vlm evaluation-kit multi-step-reasoning video-reasoning llm
Python
•
Apache License 2.0
•0•26•1•0•Updated Aug 8, 2025
PIIP
Public
[NeurIPS 2024 Spotlight ⭐️ & TPAMI 2025] Parameter-Inverted Image Pyramid Networks (PIIP)
computer-vision image-classification object-detection semantic-segmentation instance-segmentation vision-transformer multimodal-large-language-models vision-language-models
Python
•
MIT License
•5•113•2•0•Updated Aug 5, 2025
LORIS
Public
[ICML2023] Long-Term Rhythmic Video Soundtracker
music-generation pytorch-implementation multi-modality diffusion-models aigc
Python
•
MIT License
•1•62•1•0•Updated Jul 28, 2025
TPO
Public
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
Jupyter Notebook
•6•65•1•0•Updated Jul 22, 2025
Docopilot
Public
[CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding
Python
•
MIT License
•1•37•2•0•Updated Jul 22, 2025
Mono-InternVL
Public
[CVPR 2025] Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Python
•
MIT License
•0•108•7•0•Updated Jul 18, 2025
