Multimodal Vision Systems

The chapters in this module, in reading order.

#   Chapter
00  Image & Video Models: Plain English First
01  Why dense pixels fail first — the network forgets the picture before it learns
02  Vision Transformers in plain sight — how small squares become visual tokens
03  CLIP and shared meaning — pull images and words into one coordinate system
04  Vision-Language Models — where seeing meets speaking
05  LLaVA and Frontier VLMs — the simple recipe behind smart image assistants
06  Training VLMs: Failure Points — where strong demos still crack
07  Image generation families — four ways to paint on the same canvas
08  Text-to-image pipeline — from prompt words to painted pixels
09  Image editing and control — steering the painter without repainting everything
10  Video tokenization and temporal modeling — time turns one picture into a crowd
11  Text-to-video systems — how moving clips are actually built
12  Production multimodal systems — reality begins when the bill arrives
13  Honest admission — what still feels unresolved in image and video models