Vision and Language

CS 6501/4501: Vision and Language | Fall 2020

Instructor: Vicente Ordóñez-Román (vicente at virginia.edu), Office Hours: Fridays between 1pm and 3pm (ET)

Teaching Assistant: Ziyan Yang (zy3cx at virginia.edu), Office Hours: Wednesdays between 1pm and 2pm (ET)

Teaching Assistant: Paola Cascante-Bonilla (pc9za at virginia.edu), Office Hours: Fridays between 4pm and 5pm (ET)

Teaching Assistant: Anshuman Suri (as9rw at virginia.edu), Office Hours: Mondays between 1pm and 2pm (ET)

Class Time: Tuesdays & Thursdays between 3:30PM and 4:45PM (ET).

Discussion Forum: https://campuswire.com/c/G8B171FE1

Course Description: Visual recognition and language understanding are two challenging tasks in AI. In this course we will study and acquire the skills to build machine learning and deep learning models that can reason about images and text for generating image descriptions, visual question answering, image retrieval, and other tasks. On the technical side we will leverage models such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformer networks (e.g. BERT), among others.

Learning Objectives: (a) Develop intuitions about the connections between language and vision, (b) Understand foundational concepts of representation learning for both images and text, (c) Become familiar with state-of-the-art models for tasks in vision and language, (d) Obtain practical experience in the implementation of these models.

Prerequisites: It is recommended to have had a prior class in any of the following: Machine Learning, Computer Vision, Deep Learning for Visual Recognition, Natural Language Processing, Artificial Intelligence, or similar. Students are encouraged to complete the following activities before the first lecture: this [Primer on Image Processing], and the tutorial and assignment on [Image Classification] from the Deep Learning for Visual Recognition class.

Schedule

Date	Topic
Tue, Aug 25th	Introduction to Vision and Language [pptx] [pdf]
Thu, Aug 27th	Machine Learning Primer [pptx] [pdf]
Tue, Sep 1st	Machine Learning Primer -- Continuation
Thu, Sep 3rd	Computer Vision Introduction [pptx] [pdf]
Assignment on Text and Image Classification [Colab]. Due September 20th 5pm EST.
Tue, Sep 8th	Computer Vision Introduction -- Continuation
Thu, Sep 10th	Natural Language Processing Introduction [pptx] [pdf]
Tue, Sep 15th	Recurrent Neural Networks and Neural Image Captioning [pptx] [pdf]
Thu, Sep 17th	Transformer Models and Self-Attention (e.g. BERT, GPT, XLNet) [pptx] [pdf]
Tue, Sep 22nd	Guest Lecture: Vision and Language Transformers (UNITER, ViLBERT, VisualBERT). Licheng Yu, Research Scientist at Facebook AI. Previously a researcher at Microsoft and part of the team behind UNITER and MAttNet.
Thu, Sep 24th	Walkthrough on Image, Text Classification and Processing / Optimization and Regularization.
Assignment on Multimodal Movie Analysis [Colab]. Due Sunday October 11th 5pm EST.
Tue, Sep 29th	Guest Lecture: Visually Grounded Explanations for Physical Tasks. Nazneen Rajani, Research Scientist at Salesforce Research. Previously at UT Austin, with extensive work on NLP, machine learning, and explainable AI.
Thu, Oct 1st	More on Convolutional Neural Network Architectures [pptx] [pdf]
Tue, Oct 6th	Guest Lecture: Biases in Vision and Language: Visual Question Answering. Kushal Kafle, Research Scientist at Adobe Research. Previously at RIT, with extensive work in vision and language and VQA.
Thu, Oct 8th	CNNs for Detection and Segmentation. Visually Grounded Question Answering and Navigation. [pptx] [pdf]
Tue, Oct 13th	Guest Lecture: Vision-and-Language Navigation (VLN). Peter Anderson, Research Scientist at Google Research. Previously at Georgia Tech, with extensive work in vision and language and VLN.
Thu, Oct 15th	The Rivanna Interactive Environment and Practical Session on Recurrent Neural Networks for Text Generation
Assignment on Text Generation and Image Captioning [Colab]. Due Friday October 30th 5pm EST.
Tue, Oct 20th	Guest Lecture: Cross-Modality Personalization for Image Retrieval. Nils Murrugarra-Llerena, Research Scientist at Snap Research. Previously at the University of Pittsburgh, with extensive work in multi-modal deep learning.
Thu, Oct 22nd	Video Representations and Video and Language Tasks [pptx] [pdf]
Tue, Oct 27th	Guest Lecture: Grounding Vision to Sound and Audio. Yipin Zhou, Research Scientist at Facebook AI. Previously at UNC, with extensive work with images, sound, and video.
Thu, Oct 29th	Multimodal Machine Translation [pptx] [pdf]
Tue, Nov 3rd	Election Day
Thu, Nov 5th	Visually Grounded Dialog [pptx] [pdf]
Tue, Nov 10th	Guest Lecture: When is Grounding Helpful for Language and Vision Tasks? Lisa Anne Hendricks, Research Scientist at DeepMind. Previously at UC Berkeley, with extensive work with deep learning, computer vision, and vision and language.
Thu, Nov 12th	Entry-level Categories and Naming [pptx] [pdf]
Tue, Nov 17th	Practical Session on Deploying Machine Learning-based Vision and Language Models.
Thu, Nov 19th	Open Problems on Vision and Language Research [pptx] [pdf]
Tue, Nov 24th	Course Recap and Final Class Activity

Disclaimer: The professor reserves the right to make changes to the syllabus, including assignment due dates. These changes will be announced as early as possible.

CS4501 Grading: Assignments: 300pts (2 assignments: 150pts + 150pts), Class Project: 600pts, Peer Reviews/Participation: 100pts. Default grade cutoffs: A+ (1000pts), A (930pts), A- (900pts), B+ (870pts), B (830pts), B- (800pts), C+ (770pts), C (730pts), C- (700pts), D+ (670pts), D (630pts), D- (600pts).

CS6501 Grading: Assignments: 300pts (3 assignments: 100pts + 100pts + 100pts), Class Project: 600pts, Peer Reviews/Participation: 100pts. Default grade cutoffs: A+ (1000pts), A (930pts), A- (900pts), B+ (870pts), B (830pts), B- (800pts), C+ (770pts), C (730pts), C- (700pts), D+ (670pts), D (630pts), D- (600pts).
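As a rough illustration of how the default cutoffs above map a point total to a letter grade, here is a minimal Python sketch. It is not an official grade calculator: the function name `letter_grade` is hypothetical, it assumes each cutoff is the minimum total required for that letter, and it assumes totals below 600pts map to "F", which the syllabus does not state.

```python
# Default cutoffs from the syllabus: (minimum points, letter), highest first.
CUTOFFS = [
    (1000, "A+"), (930, "A"), (900, "A-"),
    (870, "B+"), (830, "B"), (800, "B-"),
    (770, "C+"), (730, "C"), (700, "C-"),
    (670, "D+"), (630, "D"), (600, "D-"),
]


def letter_grade(assignments, project, participation):
    """Return (total points, letter) under the default cutoffs.

    Assumptions (not stated in the syllabus): a cutoff is the minimum
    total needed for that letter, and totals below 600 map to "F".
    """
    total = sum(assignments) + project + participation
    for minimum, letter in CUTOFFS:
        if total >= minimum:
            return total, letter
    return total, "F"


# CS6501 example: three 100pt assignments, 600pt project, 100pt participation.
print(letter_grade([90, 95, 100], project=540, participation=95))

# CS4501 example: two 150pt assignments instead of three.
print(letter_grade([140, 150], project=560, participation=90))
```

The same function covers both sections because both totals are out of 1000pts; only the number and weight of the assignments differ.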

Late Submission Policy: No late assignments will be accepted in this class unless the student has procured special accommodations for warranted circumstances. We will also be accommodating in exceptional circumstances, but this is a large class, so please make sure your request is truly warranted and contact us as soon as possible.

Recording of Lectures: I will be recording every lecture in order to accommodate students who will be learning remotely. However, there might be small discussions before and after the lecture that are not recorded; if these take place, they are not considered essential and they will be communicated through other means (e.g. email or UVA Collab). Because lectures include fellow students, you and they may be personally identifiable on the recordings. We might set aside some time at the end for questions that will not be recorded; this will be announced when it takes place. These recordings may only be used for the purpose of individual or group study with other students enrolled in this class during this semester. You may not distribute them in whole or in part through any other platform or to any persons outside of this class, nor may you make your own recordings of this class unless written permission has been obtained from the Instructor and all participants in the class have been informed that recording will occur. If you want additional details on this, please see Provost Policy 008, which is expected to be updated for the Fall 2020 semester. If you notice that I have failed to activate the recording feature, please remind me!

Academic Integrity Statement: "The School of Engineering and Applied Science relies upon and cherishes its community of trust. We firmly endorse, uphold, and embrace the University’s Honor principle that students will not lie, cheat, or steal, nor shall they tolerate those who do. We recognize that even one honor infraction can destroy an exemplary reputation that has taken years to build. Acting in a manner consistent with the principles of honor will benefit every member of the community both while enrolled in the Engineering School and in the future. Students are expected to be familiar with the university honor code, including the section on academic fraud." In summary, if assignments are individual then no collaboration is expected, and no two students should submit the same source code. Regardless of circumstances, I will assume that any source code, text, or images submitted alongside reports or projects are of the authorship of the students unless otherwise explicitly stated through appropriate means. Any missing information regarding sources will potentially be regarded as a failure to abide by the academic integrity statement, even if that was not the intent. Please be careful about citing sources and clearly stating what is your original work and what is not in all assignments and projects. In particular, avoid vague statements such as "we built our model based on X"; instead, be explicit, e.g. "we downloaded X and modified the encoder so that it can work with videos instead of images by adding three more layers". Vague statements make it difficult to distinguish what you did from what was done before.

Discrimination and power-based violence: The University of Virginia is dedicated to providing a safe and equitable learning environment for all students. To that end, it is vital that you know two values that I and the University hold as critically important: (1) Power-based personal violence will not be tolerated. (2) Everyone has a responsibility to do their part to maintain a safe community on Grounds. If you or someone you know has been affected by power-based personal violence, more information can be found on the UVA Sexual Violence website that describes reporting options and resources available - www.virginia.edu/sexualviolence. As your professor and as a person, know that I care about you and your well-being and stand ready to provide support and resources as I can. As a faculty member, I am a responsible employee, which means that I am required by University policy and federal law to report what you tell me to the University's Title IX Coordinator. The Title IX Coordinator's job is to ensure that the reporting student receives the resources and support that they need, while also reviewing the information presented to determine whether further action is necessary to ensure survivor safety and the safety of the University community. If you wish to report something that you have seen, you can do so at the Just Report It portal (http://justreportit.virginia.edu/). The worst possible situation would be for you or your friend to remain silent when there are so many here willing and able to help.

Anti-racism commitment: I acknowledge that racism and white supremacy are baked into the history of UVA as an institution. I believe that my pedagogical philosophies and practices can either reinforce inequities or work to eliminate them. I am committed and actively working to be a better, more careful listener; continuing to learn about the ways systemic injustices disadvantage Black students and colleagues and other students and colleagues of color in and out of the classroom; and advocating for and implementing anti-racist educational practices. I will hold myself accountable, encourage you to help me do so, and invite you to join me in this work.

Accessibility Statement: "The University of Virginia strives to provide accessibility to all students. If you require an accommodation to fully access this course, please contact the Student Disability Access Center (SDAC) at (434) 243-5180 or sdac@virginia.edu. If you are unsure if you require an accommodation, or to learn more about their services, you may contact the SDAC at the number above or by visiting their website at https://www.studenthealth.virginia.edu/student-disability-access-center/about-sdac." If you need any specific accommodations in the format of the lectures, videos, etc., please communicate this to the instructor as soon as possible.