What is a Vision-Language Model (VLM)?
A multimodal AI model that can process and reason about both images and text simultaneously.
VLMs such as GPT-4V, Claude (with vision input), and LLaVA pair a visual encoder with a language model, so a single model can interpret screenshots, diagrams, documents, and photos alongside a text prompt. They enable use cases such as UI analysis, document extraction, and visual question answering.
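To make "combine visual encoders with language models" concrete, here is a minimal sketch of one common open design, loosely in the style of LLaVA: image patches are encoded, projected into the language model's embedding space, and placed in front of the text tokens as one input sequence. Everything here is a toy assumption for illustration (the patch statistics standing in for a real vision encoder, the random weights, the dimensions, and all function names); it is not how any particular production model is implemented.

```python
import random

EMBED_DIM = 4   # width of the language model's token embeddings (toy value)
VISION_DIM = 3  # width of the vision encoder's patch features (toy value)


def encode_patches(image, patch_size):
    """Toy stand-in for a vision encoder: split a 2D grayscale image into
    square patches and summarize each patch as [mean, min, max]."""
    features = []
    for top in range(0, len(image), patch_size):
        for left in range(0, len(image[0]), patch_size):
            pixels = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            features.append([sum(pixels) / len(pixels), min(pixels), max(pixels)])
    return features


def project(features, weights):
    """Linear projection from vision feature space into text embedding space.
    `weights` holds EMBED_DIM rows, each of length VISION_DIM."""
    return [
        [sum(f * w for f, w in zip(feat, row)) for row in weights]
        for feat in features
    ]


def embed_text(tokens, table):
    """Look up a (hypothetical) embedding vector for each text token."""
    return [table[tok] for tok in tokens]


def build_input_sequence(image, prompt_tokens, patch_size, weights, table):
    """Image embeddings first, then text embeddings, as one sequence that a
    language model could attend over."""
    image_embeddings = project(encode_patches(image, patch_size), weights)
    return image_embeddings + embed_text(prompt_tokens, table)


rng = random.Random(0)  # fixed seed so the toy weights are reproducible
weights = [[rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(EMBED_DIM)]
vocab = ["what", "is", "shown", "?"]
table = {tok: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for tok in vocab}

image = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]  # 4x4 toy image
sequence = build_input_sequence(image, vocab, 2, weights, table)
print(len(sequence), len(sequence[0]))  # 8 4: four image patches plus four text tokens
```

The point of the sketch is the shape of the data: once the image has been turned into vectors of the same width as the text embeddings, the language model can treat both as a single sequence, which is what lets it answer a question about a screenshot or document.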