Course Overview
From LLMs to Multimodal Foundation Models and Large Physical Models
This course introduces multimodal foundation models as a key stage in the evolution of modern AI. Large language models mainly take text as input and generate text as output. Multimodal foundation models extend this paradigm by connecting language with images, audio, video, and other sensory signals, enabling AI systems to understand, generate, and reason across multiple forms of information.
Looking forward, multimodal foundation models are expected to evolve into large physical models that understand geometry, motion, actions, and real-world constraints. The course covers frontier techniques not yet found in existing textbooks, including recent architectures, training methods, and future directions.
Course Team
Lectures
Assignments
Reviews
Review Materials
This section provides review materials for exams and course preparation.
Rebuttal Guideline: The rebuttal must be written in English and must not exceed one page.
Download Review
Download Rebuttal Template