Vision and Language

COMP 648: Computer Vision Seminar: Vision and Language | Fall 2023

Instructor: Vicente Ordóñez-Román (vicenteor at rice.edu)

Class Time: Tuesdays from 4pm to 5:15pm Central Time (Sewall Hall 562).

Discussion Forum: link

Course Description: This seminar will explore and analyze the current literature in computer vision, with a focus on computational methods for visual recognition. Our topics include image classification and understanding, object detection, image segmentation, and other high-level perceptual tasks. In particular, this semester we will explore recent topics such as contrastive learning (e.g., SimCLR, MoCo v2), vision-language Transformers (e.g., ALBEF, BLIP, BLIP-2, CLIP, OpenCLIP, GLIP), diffusion models (e.g., DALL·E 2, Imagen, Phenaki, Stable Diffusion, ControlNet, DreamBooth), learning with synthetic data (e.g., Hypersim, ThreeDWorld), and combining vision models with LLMs (e.g., VisualGPT, LLaVA, Flamingo, VisProg, ViperGPT), among others.

Recommended Pre-requisites: COMP 547 (Computer Vision) or COMP 646 (Deep Learning for Vision and Language) or COMP 546/ELEC 546 (Intro to Computer Vision) or COMP 576 (Intro to Deep Learning) or COMP 647 (Deep Learning) or research experience in any of these topics.

Schedule

Date	Topic
Aug 22nd	Introduction and Preliminaries: Welcome and Introductions. Computer Vision: Cameras, Light, Photography and the Human Eye.
Aug 29th	Introduction and Preliminaries: Deep Learning: CNNs, Transformers, Diffusion. Challenges: Classification, Segmentation, Grounding, Synthesis, Reasoning.
Sep 5th	Vision-Language Pre-Training: Led by Ziyan Yang CLIP: Learning Transferable Visual Models From Natural Language Supervision. ICML 2021. [paper] [github] [OpenCLIP] ALBEF: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. NeurIPS 2021. [paper] [github]
Sep 12th	Large Scale Vision-Language Dataset Collection: Led by Matias Romero LAION-5B: An open large-scale dataset for training next generation image-text models. NeurIPS 2022 (Datasets and Benchmarks). [paper] [link] [demo] DataComp: In search of the next generation of multimodal datasets. arXiv 2023. [paper] [link]
Sep 19th	Rice Colloquium. Speaker: Julia Stoyanovich, New York University. Title: Responsible Data Management. https://events.rice.edu/event/351819-colloquium-responsible-data-management
Sep 26th	Generative AI: Vision + Large Language Models (LLMs): Led by Jason Zhang BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023. [paper] [github] LLaVA: Large Language and Vision Assistant: Visual Instruction Tuning. arXiv 2023. [paper] [demo] [github]
Oct 3rd	GPT-4V is Here: Group Discussion The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). arXiv 2023. [paper]
Oct 10th	MIDTERM RECESS (NO SCHEDULED CLASSES)
Oct 17th	Generative AI: Text-to-Image Synthesis: Led by Moayed Haji Ali Imagen: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. NeurIPS 2022. [paper] Stable Diffusion: High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022. [paper] [github] [StableDiffusionv2]
Oct 24th	Rice Colloquium. Speaker: Scott Huffman, Google. Title: What to Make of Large Language Models? https://events.rice.edu/event/351313-distinguished-lecture-scott-huffman
Oct 31st	Vision + Language + Localization: Led by Bill Qian and Zilin Xiao AMC: Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations. CVPR 2023. [paper] [demo] [github] SAM: Segment Anything. ICCV 2023. [paper] [demo] [github]
Nov 7th	Learning Visual Representations from Synthetic Procedurally Generated Images [video] Learning to See by Looking at Noise. NeurIPS 2021. [paper] Procedural Image Programs for Representation Learning. NeurIPS 2022. [paper]
Nov 14th	Visual Programming: Vision + Symbolic Generation from LLMs: Led by Jaywon Koo Visual Programming: Compositional Visual Reasoning without Training [paper] [github] [project page] ViperGPT: Visual Inference via Python Execution for Reasoning [paper] [github] [project page]
Nov 21st	Controllable Text-to-Image Synthesis: Led by Bill Qian and Sheng Cheng DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. CVPR 2023. [paper] [project page] ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models. ICCV 2023 (Marr Prize). [paper] [github]
Nov 28th	More on Generative AI: GLIGEN: Open-Set Grounded Text-to-Image Generation. 2023. [paper] [github] [demo] CoDi: Any-to-Any Generation via Composable Diffusion. 2023. [project page] Visual Programming for Text-to-Image Generation and Evaluation [paper] [github] [project page]

Disclaimer: The topics on this list are tentative and subject to adjustments throughout the semester as interests in the group evolve.

Logistics: This is a seminar with a Satisfactory/Unsatisfactory (S/U) grade. To earn 1 credit, registered students are required to participate and to present a recent work on a topic of interest to the seminar at least once during the semester. Students who register for more than 1 credit must also complete an original research project with deliverables in the form of a project page, a technical research report, GitHub code, and a demo. Registering for more than 1 credit is subject to the instructor's availability and will be rare; most students will register for 1 credit. A Satisfactory grade requires presenting a paper at least once during the semester, actively participating in discussions throughout the semester, and, only for students registered for more than 1 credit, completing the project.

Honor Code and Academic Integrity: "In this course, all students will be held to the standards of the Rice Honor Code, a code that you pledged to honor when you matriculated at this institution. If you are unfamiliar with the details of this code and how it is administered, you should consult the Honor System Handbook at http://honor.rice.edu/honor-system-handbook/. This handbook outlines the University's expectations for the integrity of your academic work, the procedures for resolving alleged violations of those expectations, and the rights and responsibilities of students and faculty members throughout the process." In addition, you are allowed to use AI tools as long as you take responsibility for the accuracy of anything you submit and your use does not contradict Rice's Honor Code. For instance, fabricating images, plots, or numbers that are presented as results of your method is clearly deceptive and is a serious violation of the Honor Code. As a rule of thumb, including the output of any AI model without first reading and understanding it is an Honor Code violation. Using AI for clarification, writing assistance, polishing sentences, or automating data pre-processing and common data manipulation routines are examples of perfectly acceptable use.

Title IX Support: Rice University cares about your wellbeing and safety. Rice encourages any student who has experienced an incident of harassment, pregnancy discrimination or gender discrimination or relationship, sexual, or other forms of interpersonal violence to seek support through The SAFE Office. Students should be aware when seeking support on campus that most employees, including myself, as the instructor/TA, are required by Title IX to disclose all incidents of non-consensual interpersonal behaviors to Title IX professionals on campus who can act to support that student and meet their needs. For more information, please visit safe.rice.edu or email titleixsupport@rice.edu.

Disability Resource Center: "If you have a documented disability or other condition that may affect academic performance you should: 1) make sure this documentation is on file with the Disability Resource Center (Allen Center, Room 111 / adarice@rice.edu / x5841) to determine the accommodations you need; and 2) talk with me to discuss your accommodation needs."