Recent advances in machine learning are driven by training scalable models on Internet-scale data (e.g., billions of image-text pairs or trillions of text tokens). This gives rise to foundation models that demonstrate strong capabilities across diverse tasks. In this course, we will study techniques that enable such machine learning systems. We will cover foundation models for language, vision, and other modalities.
Jia-Bin Huang (jbhuang@umd.edu)
Office: 4234 IRB building
Hadi Alzayer (hadi@umd.edu), Yi-Ting Chen (ytchen@umd.edu), Yue Feng (yuefeng@umd.edu), Ji-Ze Jang (gjang@umd.edu), Yao-Chih Lee (yclee@umd.edu)
College calculus, linear algebra, and probability and statistics. Prior courses in machine learning, natural language processing, and computer vision are helpful but not required.
We will have two in-class midterm exams throughout the semester. Detailed information will be made available.
Students will work in groups of 2-3 on projects related to multimodal foundation models.
We will provide a list of recommended paper readings starting from the third lecture. For each lecture, students will turn in a one-page paper review. The review should have two sections: 1) paper summary and 2) your critiques (strengths/weaknesses of the paper, interesting insights, or questions worth discussing). The paper review is due before the class (11:00 AM on Tuesday or Thursday). No late submissions are allowed. Students must submit at least 20 paper reviews to receive the full score (40%).
Tuesday/Thursday 11:00 AM - 12:15 PM at IRB 0318
No lecture recordings. The instructor will post edited/summarized videos on selected topics for students to review. These will be posted shortly after the lectures.
We will use Piazza as the primary platform for communication. Please do not send individual emails to the TAs or the instructor, as they are difficult to track.
Several courses offered at UMD also overlap with this course.