
SelfEQ helps computers ‘see’ more accurately and consistently

New training method improves visual grounding

Rice University Computer Science Ph.D. student Catherine (Ruozhen) He

Rice University Computer Science Ph.D. student Catherine (Ruozhen) He presented her paper, “Improved Visual Grounding through Self-Consistent Explanations,” at the 2024 Conference on Computer Vision and Pattern Recognition (CVPR) this summer in Seattle.

The paper outlines a method for improving the ability of vision-and-language models (VLMs) to localize objects described by text within images. The process, known as “visual grounding,” is key to helping machine learning models understand the world.

Catherine He was advised on this paper by Associate Professor of Computer Science Vicente Ordóñez-Román. Recently graduated Ph.D. students Paola Cascante-Bonilla and Ziyan Yang also contributed to this research, with collaboration from UC Irvine professor Alex Berg.

How VLMs Help Machines Perceive The World

The AI sub-field of computer vision focuses on helping computers see and make sense of the world the way a human would. We have a lifetime of experience learning to parse and contextualize the input from our five senses; computers do not.

VLMs learn by training on large collections of images and text. Their goal is to help a computer connect visual and textual information, such as localizing objects within an image based on a text input. Someone could, for example, provide the phrase “frisbee” and a photo of people playing at the park, and the VLM would try to find the frisbee within that picture.

Finding an object in an image based on a text description is known as visual grounding, and it’s the focus of Catherine He’s research outlined in the paper.

Catherine He’s research group proposed a method called Self-consistency EQuivalence Tuning (SelfEQ). In the paper, they show it can improve the visual grounding capabilities of VLMs and help demystify the way AI works for people who don’t study it.

Our VLM was tasked with pointing to the area that contains "flowing water" in the first picture and "a cactus" in the second picture.

Improving A VLM’s Self-Consistency and Working Vocabulary

The goal of SelfEQ is to tune a VLM on text paraphrases to strengthen its self-consistency and broaden the vocabulary it can use to localize objects. Other VLMs struggle with different phrases for the same object, like “frisbee” and “disc,” but a model tuned with SelfEQ learns that they refer to the same thing.

First, a large language model (LLM) was used to generate paraphrases of the phrases describing each object. That data was then used to tune the VLM. Because the LLM already encodes the relationship between equivalent phrases, the VLM tuned with SelfEQ was able to learn that different words can refer to the same object.
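As a rough illustration, here is a minimal Python sketch of how such paraphrase pairs might be assembled for tuning; the helper names and the toy paraphrase table are hypothetical stand-ins rather than the pipeline used in the paper.

```python
# Toy stand-in for querying an LLM for an equivalent phrase.
_EXAMPLE_PARAPHRASES = {
    "frisbee": "disc",
    "flowing water": "running water",
}

def llm_paraphrase(phrase: str) -> str:
    """Return an equivalent phrase for the input (placeholder for an LLM call)."""
    return _EXAMPLE_PARAPHRASES.get(phrase, phrase)

def build_training_triples(image_phrase_pairs):
    """Attach a paraphrase to each (image, phrase) example so the model can later
    be tuned to localize both phrasings to the same region."""
    return [(image, phrase, llm_paraphrase(phrase))
            for image, phrase in image_phrase_pairs]

# Example: an image described by the phrase "frisbee" is paired with "disc".
triples = build_training_triples([("park_photo.jpg", "frisbee")])
```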

“Language models know that frisbee and disc are related concepts, so by using the knowledge from the large language model we can also give this knowledge to the visual model,” explained Professor Ordóñez-Román.

SelfEQ and Improved Visual Grounding

Catherine He’s group takes advantage of an intermediate step in the process a VLM uses to map text to an object within an image. When trying to localize an object, a VLM will generate a gradient-based visual explanation called a heatmap, which highlights the regions of the image the model associates with the text.
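To make the idea concrete, here is a minimal PyTorch sketch of a GradCAM-style heatmap computed from an image-text matching score; it illustrates the general technique, and the exact layer and formulation used in the paper may differ.

```python
import torch

def gradcam_style_heatmap(feature_map: torch.Tensor,
                          match_score: torch.Tensor) -> torch.Tensor:
    """Weight spatial features by the gradient of an image-text matching score,
    producing a heatmap over the image (simplified GradCAM-style illustration).

    feature_map: (C, H, W) activations from a visual layer, with requires_grad=True
    match_score: scalar similarity the VLM assigns to the image-text pair
    """
    (grads,) = torch.autograd.grad(match_score, feature_map, retain_graph=True)
    channel_weights = grads.mean(dim=(1, 2), keepdim=True)  # pool gradients per channel
    heatmap = torch.relu((channel_weights * feature_map).sum(dim=0))
    return heatmap / (heatmap.max() + 1e-8)                 # normalize to [0, 1]
```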

Many visual grounding models are trained with sets of coordinates called bounding boxes that show the model where the desired object is in an image. SelfEQ instead uses only images and text, augmented by paraphrases from the LLM. The resulting gradient explanation maps are then compared to check how well, and how consistently, the model matched the text to the image.
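One way such a consistency check could be expressed as a training objective is sketched below; this is an illustrative loss term, assuming the heatmap function above, and not necessarily the paper's exact equivalence objective.

```python
import torch.nn.functional as F

def consistency_loss(heatmap_phrase, heatmap_paraphrase):
    """Encourage the model to produce the same heatmap for a phrase (e.g. "frisbee")
    and its paraphrase (e.g. "disc") on the same image by penalizing the mean
    squared difference between the two maps."""
    return F.mse_loss(heatmap_phrase, heatmap_paraphrase)
```

During training, a term like this would be added to the model's usual image-text objective so that localization stays accurate while becoming consistent across paraphrases.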

Catherine He’s group hopes this method of training will improve the model’s self-consistency, making it more reliable. 

“It’s literally how the model thinks about the text and the image,” He explained. “We say AI is like a black box, and after ChatGPT getting more and more popular there are many fierce debates on whether we should trust AI and whether AI is safe for use. So I would say making AI more interpretable and more explainable is very important for trustworthy AI.”
 

Catherine He and Vicente Ordóñez after the poster session at CVPR 2024 in Seattle.

What Comes Next

Catherine He explained that the method could have multiple downstream applications depending on how it is used.

“For visual grounding, there should be many downstream paths, like object detection or image captioning…In our method there’s no specific output, we just visualize the intermediate layer inside the model,” she said.

Improved visual grounding by VLMs could, for example, help a robot identify a specific object and perform an action with that object. 

“If three mugs are on a table and only one of them has coffee, you tell the robot to pick up the mug with the coffee and it can choose the right one,” said Ordóñez-Román.

 

John Bogna, contributing writer