Multimodal Foundation Models

Course Slides and Learning Materials

Towards Large Physical Models

Course Overview

From LLMs to Multimodal Foundation Models and Large Physical Models

Development path from large language models to multimodal foundation models and large physical models

This course introduces multimodal foundation models as a key stage in the evolution of modern AI. Large language models mainly take text as input and generate text as output. Multimodal foundation models extend this paradigm by connecting language with images, audio, video, and other sensory signals, enabling AI systems to understand, generate, and reason across multiple forms of information.

Looking forward, multimodal foundation models are expected to evolve into large physical models that understand geometry, motion, actions, and real-world constraints. The course covers frontier techniques not yet found in existing textbooks, including recent architectures, training methods, and future directions.

Course Team

Guangrun Wang

Course instructor

Xiao Li

Teaching assistant

Xiaoxin Lin

Teaching assistant

Jiaying Zhou

Teaching assistant

Lectures

Lecture 7: Next-Generation AI Architectures

This lecture introduces next-generation AI architectures.

Download Slides

Assignments

Homework Assignment

This section provides homework assignments for the course.

Download Assignment

Reviews

Review Materials

This section provides review materials for exams and course preparation.

Rebuttal Guideline: The rebuttal must be written in English and must not exceed one page.

Download Review
Download Rebuttal Template