3rd Workshop on Advances in Language and Vision Research (ALVR)

Time (ICT) Event Who
8:10-8:20 Opening Remarks Workshop Organizers
8:20-9:10 Invited Talk 1: Foundations of Multimodal Interactions and Multisensory Foundation Models
Abstract: Building multisensory AI systems that learn from multiple sensory inputs such as text, speech, video, real-world sensors, wearable devices, and medical data holds great promise for impact in many scientific areas, with practical benefits such as supporting human health and well-being, enabling multimedia content processing, and enhancing real-world autonomous agents. In this talk, I will discuss my research on the machine learning principles of multisensory intelligence, as well as practical methods for building multisensory foundation models over many modalities and tasks. In the first half, I will present a theoretical framework formalizing how modalities interact with one another to give rise to new information for a task. These interactions are the basic building blocks of all multimodal problems, and quantifying them enables users to understand their multimodal datasets and design principled approaches to learning these interactions. In the second half, I will discuss our collaborative efforts in scaling AI to many modalities and tasks for real-world impact on affective computing, mental health, and cancer prognosis.
Paul Liang
9:10-10:00 Invited Talk 2: Connecting 3D and Language
Abstract: People communicate about objects, scenes, and spatial relations in the real world using natural language. In this talk, I will give an overview of recent work that explores how to endow computational systems with the ability to connect natural language and 3D representations. Concretely, I will summarize projects that develop neural models for localizing and describing objects in 3D scenes using natural language and for generating 3D content from text, as well as other recent work exploring tasks that require connections between natural language and 3D representations.
Angel Chang
10:00-10:30 Coffee Break
10:30-11:20 Invited Talk 3: Planning and Reasoning for Multi-Agent Embodied Tasks: Benchmarking and Evaluation
Abstract: Language is never used in isolation. It is articulated, understood, and contextualized against the speaker’s history, actions, and environment. To this end, we have developed a series of benchmarks to evaluate language comprehension in dynamic, embodied settings where robots execute instructions given by humans. First, I will discuss Situated Instruction Following, where the meaning of instructions is revealed through the past actions and anticipated future behaviors of the human speaker. Then, I will introduce PARTNR, a multi-agent benchmark that involves a human and their robotic assistant collaboratively completing tasks under various constraints, including spatial, temporal, and heterogeneity challenges. Lastly, I will present GOAT-Bench, a robot navigation benchmark in a continual learning framework where the navigation target is defined through multiple modalities, including images and natural language descriptions of objects. I will share our analysis of these benchmarks, demonstrating how state-of-the-art LLMs and large-scale multimodal models struggle with tracking task progression, managing ambiguities, and dividing tasks effectively.
Roozbeh Mottaghi
11:20-12:10 Invited Talk 4: The Role of Joint Embodiment in Situated Language-Based Interactions
Abstract: Large-scale pretraining has become the standard solution to automated reasoning over text and/or visual perception. But how far does this approach get us toward systems that generalize to language use in realistic multi-agent situated interactions? First, I will talk about existing work on evaluating the spatial and compositional reasoning capabilities of current multimodal LMs. Then, I will talk about how these benchmarks miss a key aspect of real-world situated interactions: joint embodiment. I will discuss how joint embodiment in a shared world supports perspective-taking, an overlooked aspect of situated reasoning, and introduce a new environment and benchmark for studying the influence of perspective-taking on language use in interaction.
Alane Suhr
12:10-13:30 Lunch
13:30-14:20 Invited Talk 5: Toward Vision and Richer Language(s)
Abstract: How rich is the language in vision and language research? Arguably, for some time, visual understanding has been the first-class citizen of this research direction. In this talk, I will present research projects from the past several years that aim to elevate the role of language(s). This spans improving specificity and informativeness, reasoning about the language in pixels, as well as going beyond declarative, literal, and English languages. Collectively, this moves us toward tightly connecting vision with rich(er) languages.
Soravit "Beer" Changpinyo
14:20-15:00 Poster Highlights
15:00-15:50 Poster Session and Coffee Break
15:50-16:40 Invited Talk 6: Representing Illustrative Visual Semantics with Descriptive Language
Abstract: Contemporary visual semantic representations predominantly revolve around commonplace objects found in everyday images and videos, ranging from ladybugs and bunnies to airplanes. However, crucial visual cues extend beyond mere object recognition and interaction. They encompass a spectrum of richer semantics, including vector graphics (e.g., angles, mazes), fine-grained attributes, and affordances. Moreover, they entail intricate visual dynamics, such as object interactions, actions, and activities. Regrettably, traditional visual representations relying solely on pixels and regions fail to fully encapsulate these nuances. In this talk, I propose designing intermediate symbolic semantic representations to precisely describe and aggregate these low-level visual signals. This augmentation promises to enhance their utility as inputs to large language models or vision-language models, thereby facilitating high-level knowledge reasoning and discovery tasks. I will present several applications ranging from playful maze solving and fine-grained concept recognition to video activity detection.
Heng Ji
16:40-17:20 Panel Session
Xin Eric Wang, Paul Liang, Roozbeh Mottaghi, Soravit "Beer" Changpinyo, Heng Ji