Molmo 2
SOTA video understanding, pointing, and tracking VLM
About
Molmo 2 is a new family of open multimodal models from Ai2, built for state-of-the-art video understanding, pointing, and tracking. Building on the original Molmo's success in image understanding, Molmo 2 extends those capabilities to video and multi-image inputs.

The family comes in three variants (8B, 4B, and an Olmo-backed 7B) optimized for different needs, and it outperforms its predecessors and some proprietary systems on key benchmarks such as video tracking, image and multi-image reasoning, and video grounding. Molmo 2 supports advanced features including video pointing, multi-object tracking with persistent IDs, dense video captioning, anomaly detection, and subtitle-aware QA.

Its open, extensible architecture pairs a vision encoder with a language model backbone (Qwen 3 or Olmo) and was trained on a carefully curated, video-centric multimodal corpus of over 9 million examples, including nine new datasets for dense captioning, long-form QA, and grounded pointing/tracking. Molmo 2 is intended for research and educational use.
Similar Products
Clear for Slack
Clear messages get answered quicker
Griply 2026
Achieve your goals with a goal-oriented task manager
HappyMail
We made email simple again
Blober.io
The easiest way to transfer files between cloud providers.
Supaguard
Scan, Detect & Protect Your Supabase Data
Timelines Time Tracking 4
Track your time to achieve your New Year's resolutions.
SoftReveal — Reveal less. Engage more.
Hide Content, Reveal on Click
Reword
Rewrite messages without leaving your workflow
MoovAI
Launch viral AI ads & pro social content in minutes
Resell AI
Reselling workflow with market-based price suggestions
Qwen-Image-2512
SOTA open-source T2I model with even greater realism
Friendware
Tab-to-complete everywhere on macOS