Language and vision research has attracted great attention from both natural language processing (NLP) and computer vision (CV) researchers. Gradually, this area is shifting from passive perception, templated language, and synthetic imagery/environments to active perception, natural language, and photo-realistic simulation or real-world deployment.
This workshop covers (but is not limited to) the following topics:
Long papers may consist of up to 8 pages of content, plus unlimited pages for references and an appendix; final versions of long papers will be given one additional page of content (up to 9 pages) so that reviewers’ comments can be considered.
Short papers may consist of up to 4 pages of content, plus unlimited pages for references and an appendix. Upon acceptance, short papers will be given 5 content pages in the proceedings; authors are encouraged to use this additional page to address reviewers' comments in their final versions.
We also include a non-archival track to allow dual submission of work to ALVR 2024 and other conferences or journals. Space permitting, authors of these submissions will still be able to present their work at the workshop, and the papers will be hosted on the workshop website, but they will not be included in the official proceedings. Please use the ACL format and submit through OpenReview, indicating that the paper is a cross-submission (non-archival) at the bottom of the submission form.
The submission website is https://openreview.net/group?id=aclweb.org/ACL/2024/Workshop/ALVR.
Time (ICT) | Event | Who |
---|---|---|
8:10-8:20 | Opening Remarks | Workshop Organizers |
8:20-9:10 | Invited Talk 1: Foundations of Multimodal Interactions and Multisensory Foundation Models. Abstract: Building multisensory AI systems that learn from multiple sensory inputs such as text, speech, video, real-world sensors, wearable devices, and medical data holds great promise for impact in many scientific areas with practical benefits, such as supporting human health and well-being, enabling multimedia content processing, and enhancing real-world autonomous agents. In this talk, I will discuss my research on the machine learning principles of multisensory intelligence, as well as practical methods for building multisensory foundation models over many modalities and tasks. In the first half, I will present a theoretical framework formalizing how modalities interact with each other to give rise to new information for a task. These interactions are the basic building blocks in all multimodal problems, and their quantification enables users to understand their multimodal datasets and design principled approaches to learn these interactions. In the second half, I will discuss our collaborative efforts in scaling AI to many modalities and tasks for real-world impact on affective computing, mental health, and cancer prognosis. | Paul Liang |
9:10-10:00 | Invited Talk 2: Connecting 3D and Language. Abstract: People communicate about objects, scenes, and spatial relations in the real world using natural language. In this talk, I will give an overview of recent work that explores how to endow computational systems with the ability to connect natural language and 3D representations. Concretely, I will summarize projects that develop neural models for localizing and describing objects in 3D scenes using natural language, generating 3D content from text, and other recent work exploring tasks that require connections between natural language and 3D representations. | Angel Chang |
10:00-10:30 | Coffee Break | |
10:30-11:20 | Invited Talk 3: Planning and Reasoning for Multi-Agent Embodied Tasks: Benchmarking and Evaluation. Abstract: Language is never used in isolation. It is articulated, understood, and contextualized against the speaker's history, actions, and environment. To this end, we have developed a series of benchmarks to evaluate language comprehension in dynamic, embodied settings where robots execute instructions given by humans. First, I will discuss Situated Instruction Following, where the meaning of instructions is revealed through the past actions and anticipated future behaviors of the human speaker. Then, I will introduce PARTNR, a multi-agent benchmark that involves a human and their robotic assistant collaboratively completing tasks under various constraints, including spatial, temporal, and heterogeneity challenges. Lastly, I will present GOAT-Bench, a robot navigation benchmark in a continual learning framework where the navigation target is defined through multiple modalities, including images and natural language descriptions of objects. I will share our analysis of these benchmarks, demonstrating how state-of-the-art LLMs and large-scale multi-modal models struggle with tracking task progression, managing ambiguities, and dividing tasks effectively. | Roozbeh Mottaghi |
11:20-12:10 | Invited Talk 4: The Role of Joint Embodiment in Situated Language-Based Interactions. Abstract: Large-scale pretraining has become the standard solution to automated reasoning over text and/or visual perception. But how far does this approach get us to systems that generalize to language use in realistic multi-agent situated interactions? First, I will talk about existing work in evaluating the spatial and compositional reasoning capabilities of current multimodal LMs. Then, I will talk about how these benchmarks miss a key aspect of real-world situated interactions: joint embodiment. I will discuss how joint embodiment in a shared world supports perspective-taking, an often overlooked aspect of situated reasoning, and introduce a new environment and benchmark for studying the influence of perspective-taking on language use in interaction. | Alane Suhr |
12:10-13:30 | Lunch | |
13:30-14:20 | Invited Talk 5: Toward Vision and Richer Language(s). Abstract: How rich is the language in vision and language research? Arguably, for some time, visual understanding has been the first-class citizen of this research direction. In this talk, I will present research projects in the past several years that aim at elevating the role of language(s). This spans improving specificity and informativeness, reasoning about the language in pixels, as well as going beyond declarative, literal, and English languages. Collectively, this moves us toward tightly connecting vision with rich(er) languages. | Soravit "Beer" Changpinyo |
14:20-15:00 | Poster Highlight | |
15:00-15:50 | Poster Session, Coffee Break | |
15:50-16:40 | Invited Talk 6: Representing Illustrative Visual Semantics with Descriptive Language. Abstract: Contemporary visual semantic representations predominantly revolve around commonplace objects found in everyday images and videos, ranging from ladybugs and bunnies to airplanes. However, crucial visual cues extend beyond mere object recognition and interaction. They encompass a spectrum of richer semantics, including vector graphics (e.g., angles, mazes) and fine-grained attributes and affordances. Moreover, they entail intricate visual dynamics, such as object interactions, actions, and activities. Regrettably, traditional visual representations relying solely on pixels and regions fail to fully encapsulate these nuances. In this talk, I propose to design intermediate symbolic semantic representations to precisely describe and aggregate these low-level visual signals. This augmentation promises to enhance their utility as inputs for large language models or vision-language models, thereby facilitating high-level knowledge reasoning and discovery tasks. I will present several applications ranging from playful maze solving and fine-grained concept recognition to video activity detection. | Heng Ji |
16:40-17:20 | Panel Session | Xin Eric Wang, Paul Liang, Roozbeh Mottaghi, Soravit "Beer" Changpinyo, Heng Ji |
Dr. Paul Liang is an Assistant Professor at the MIT Media Lab and MIT EECS, whose research focuses on advancing the foundations of multisensory artificial intelligence to enhance human experiences across various domains. Recognized with prestigious awards including the Siebel Scholars Award, the Waibel Presidential Fellowship, the Facebook PhD Fellowship, and multiple best paper awards, Liang's work explores the integration of diverse sensory channels such as text, speech, audio, video, and physiological signals to create AI systems that interact seamlessly with the world. His research interests range from enhancing human physical, emotional, and social well-being through AI to augmenting human creativity with multimedia generative AI and addressing critical issues in real-world human-AI interaction such as fairness, trust, and privacy. Beyond research, he has also been honored with the Alan J. Perlis Graduate Student Teaching Award for his contributions to multimodal machine learning education.
Dr. Angel Chang is an Assistant Professor at Simon Fraser University. Prior to this, she was a visiting research scientist at Facebook AI Research and a research scientist at Eloquent Labs working on dialogue. She received her Ph.D. in Computer Science from Stanford, where she was part of the Natural Language Processing Group and advised by Chris Manning. Her research focuses on connecting language to 3D representations of shapes and scenes and on grounding language for embodied agents in indoor environments. She has worked on methods for synthesizing 3D scenes and shapes from natural language, as well as various datasets for 3D scene understanding. In general, she is interested in the semantics of shapes and scenes, the representation and acquisition of common sense knowledge, and reasoning using probabilistic models.
Dr. Roozbeh Mottaghi is a Senior Research Scientist Manager at FAIR and an Affiliate Associate Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. Prior to joining FAIR, he was the Research Manager of the Perceptual Reasoning and Interaction Research (PRIOR) group at the Allen Institute for AI (AI2). He obtained his PhD in Computer Science in 2013 from the University of California, Los Angeles, and then joined the Computer Science Department at Stanford University as a postdoctoral researcher. His research mainly focuses on embodied AI, reasoning via perception, and learning via interaction, and his work on large-scale Embodied AI received the Outstanding Paper Award at NeurIPS 2022.
Dr. Alane Suhr is an Assistant Professor at UC Berkeley EECS. She received her PhD in Computer Science from Cornell University, where she was based at Cornell Tech in New York, NY, and advised by Yoav Artzi. Afterwards, she spent about a year in Seattle, WA, at AI2 as a Young Investigator on the Mosaic team (led by Yejin Choi). Her research spans natural language processing, machine learning, and computer vision. She builds systems that use language to interact with people, e.g., in collaborative interactions (like CerealBar). She also designs models and datasets that address and represent problems in language grounding (e.g., NLVR), and develops learning algorithms for systems that learn language through interaction.
Dr. Soravit "Beer" Changpinyo is a researcher at Google DeepMind, specializing in computer vision and natural language processing, with a broad interest in machine learning and artificial intelligence. His work has earned him numerous accolades, including the prestigious Annenberg Graduate Fellowship from the University of Southern California and multiple Outstanding Reviewer Awards from top-tier conferences such as CVPR and NeurIPS. His contributions are further underscored by impactful research on zero-shot learning, transfer learning, multi-task learning, and neural network optimization, which has applications in both academic and industrial settings. His recent work includes the Gemini series.
Dr. Heng Ji is a professor in the Computer Science Department, and an affiliated faculty member of the Electrical and Computer Engineering Department and the Coordinated Science Laboratory, at the University of Illinois Urbana-Champaign. She is an Amazon Scholar and the Founding Director of the Amazon-Illinois Center on AI for Interactive Conversational Experiences (AICE). She received her B.A. and M.A. in Computational Linguistics from Tsinghua University, and her M.S. and Ph.D. in Computer Science from New York University. Her research interests focus on natural language processing, especially multimedia and multilingual information extraction, knowledge-enhanced large language models, and vision-language models. She was selected as a "Young Scientist" by the World Laureates Association in 2023 and 2024, and as a "Young Scientist" and a member of the Global Future Council on the Future of Computing by the World Economic Forum in 2016 and 2017. She was named one of the Women Leaders of Conversational AI (Class of 2023) by Project Voice. Her other awards include two Outstanding Paper Awards at NAACL 2024, the "AI's 10 to Watch" Award from IEEE Intelligent Systems in 2013, an NSF CAREER award in 2009, the PACLIC 2012 Best Paper runner-up, the "Best of ICDM 2013" paper award, the "Best of SDM 2013" paper award, an ACL 2018 Best Demo Paper nomination, the ACL 2020 Best Demo Paper Award, the NAACL 2021 Best Demo Paper Award, Google Research Awards in 2009 and 2014, IBM Watson Faculty Awards in 2012 and 2014, and Bosch Research Awards in 2014-2018. She was invited to testify to the U.S. House Cybersecurity, Data Analytics, & IT Committee as an AI expert in 2023, and was selected to participate in DARPA AI Forward in 2023. She was invited by the Secretary of the U.S. Air Force and AFRL to join the Air Force Data Analytics Expert Panel to inform the Air Force Strategy 2030, and was invited to speak at the Federal Information Integrity R&D Interagency Working Group (IIRD IWG) briefing in 2023. She leads many multi-institution projects and tasks, including the U.S. ARL projects on information fusion and knowledge network construction, the DARPA ECOLE MIRACLE team, the DARPA KAIROS RESIN team, and the DARPA DEFT Tinker Bell team, and she coordinated the NIST TAC Knowledge Base Population task from 2010 to 2020. She is the Chief Editor of the Data Intelligence journal, served as an associate editor for IEEE/ACM Transactions on Audio, Speech, and Language Processing, and was Program Committee Co-Chair of many conferences including NAACL-HLT 2018 and AACL-IJCNLP 2022. She was elected secretary of the North American Chapter of the Association for Computational Linguistics (NAACL) for 2020-2023. Her research has been widely supported by U.S. government agencies (DARPA, NSF, DoE, ARL, IARPA, AFRL, DHS) and industry (Amazon, Google, Bosch, IBM, Disney).
Microsoft
Mila
Thomson Reuters
University of Gothenburg
University of California, Santa Cruz
Apple AI/ML
University of Michigan
Northeastern University
University of California, Santa Barbara
National Institute of Technology Silchar, India
Columbia University
Microsoft
University of Southern California
National Institute of Technology Silchar, India
National Institute of Technology Silchar, India
University of Amsterdam
Adobe Research
Northeastern University
University of California, Santa Cruz
University of California, Santa Cruz
University of California, Santa Barbara