Call for Papers

Long papers may consist of up to 8 pages of content, plus unlimited pages for references and an appendix; final versions of long papers will be given one additional page of content (up to 9 pages) so that reviewers’ comments can be considered.

Short papers may consist of up to 4 pages of content, plus unlimited references and an appendix. Short papers will be given 5 content pages in the proceedings upon acceptance. Authors are encouraged to use this additional page to address reviewers’ comments in their final versions.

We are also including a non-archival track to allow dual submission of work to ALVR 2024 and other conferences/journals. Space permitting, authors of these submissions will still present their work at the workshop, and the papers will be hosted on the workshop website but will not be included in the official proceedings. Please use the ACL format and submit through OpenReview, but indicate at the bottom of the submission form that this is a cross-submission (non-archival).

The submission website is https://openreview.net/group?id=aclweb.org/ACL/2024/Workshop/ALVR.

Schedule (ICT, UTC+7)

Time (ICT)	Event	Who
8:10-8:20	Opening Remarks	Workshop Organizers
8:20-9:10	Invited Talk 1: Foundations of Multimodal Interactions and Multisensory Foundation Models. Abstract: Building multisensory AI systems that learn from multiple sensory inputs such as text, speech, video, real-world sensors, wearable devices, and medical data holds great promise for impact in many scientific areas with practical benefits, such as in supporting human health and well-being, enabling multimedia content processing, and enhancing real-world autonomous agents. In this talk, I will discuss my research on the machine learning principles of multisensory intelligence, as well as practical methods for building multisensory foundation models over many modalities and tasks. In the first half, I will present a theoretical framework formalizing how modalities interact with each other to give rise to new information for a task. These interactions are the basic building blocks in all multimodal problems, and their quantification enables users to understand their multimodal datasets and design principled approaches to learn these interactions. In the second part, I will discuss our collaborative efforts in scaling AI to many modalities and tasks for real-world impact on affective computing, mental health, and cancer prognosis.	Paul Liang
9:10-10:00	Invited Talk 2: Connecting 3D and Language. Abstract: People communicate about objects, scenes, and spatial relations in the real world using natural language. In this talk, I will give an overview of recent work that explores how to endow computational systems with the ability to connect natural language and 3D representations. Concretely, I will summarize projects that develop neural models for localizing and describing objects in 3D scenes using natural language, generating 3D content from text, and other recent work exploring tasks that require connections between natural language and 3D representations.	Angel Chang
10:00-10:30	Coffee Break
10:30-11:20	Invited Talk 3: Planning and Reasoning for Multi-Agent Embodied Tasks: Benchmarking and Evaluation. Abstract: Language is never used in isolation. It is articulated, understood, and contextualized against the speaker’s history, actions, and environment. To this end, we have developed a series of benchmarks to evaluate language comprehension in dynamic, embodied settings where robots execute instructions given by humans. First, I will discuss Situated Instruction Following, where the meaning of instructions is revealed through the past actions and anticipated future behaviors of the human speaker. Then, I will introduce PARTNR, a multi-agent benchmark that involves a human and their robotic assistant collaboratively completing tasks under various constraints, including spatial, temporal, and heterogeneity challenges. Lastly, I will present GOAT-Bench, a robot navigation benchmark in a continual learning framework where the navigation target is defined through multiple modalities, including images and natural language descriptions of objects. I will share our analysis of these benchmarks, demonstrating how state-of-the-art LLMs and large-scale multi-modal models struggle with tracking task progression, managing ambiguities, and dividing tasks effectively.	Roozbeh Mottaghi
11:20-12:10	Invited Talk 4: The Role of Joint Embodiment in Situated Language-Based Interactions. Abstract: Large-scale pretraining has become the standard solution to automated reasoning over text and/or visual perception. But how far does this approach get us toward systems that generalize to language use in realistic multi-agent situated interactions? First, I will talk about existing work in evaluating the spatial and compositional reasoning capabilities of current multimodal LMs. Then, I will talk about how these benchmarks miss a key aspect of real-world situated interactions: joint embodiment. I will discuss how joint embodiment in a shared world supports perspective-taking, an overlooked aspect of situated reasoning, and introduce a new environment and benchmark for studying the influence of perspective-taking on language use in interaction.	Alane Suhr
12:10-13:30	Lunch
13:30-14:20	Invited Talk 5: Toward Vision and Richer Language(s). Abstract: How rich is the language in vision and language research? Arguably, for some time, visual understanding has been the first-class citizen of this research direction. In this talk, I will present research projects from the past several years that aim at elevating the role of language(s). This spans improving specificity and informativeness, reasoning about the language in pixels, as well as going beyond declarative, literal, and English languages. Collectively, this moves us toward tightly connecting vision with rich(er) languages.	Soravit "Beer" Changpinyo
14:20-15:00	Poster Highlight
15:00-15:50	Poster Session, Coffee Break
15:50-16:40	Invited Talk 6: Representing Illustrative Visual Semantics with Descriptive Language. Abstract: Contemporary visual semantic representations predominantly revolve around commonplace objects found in everyday images and videos, ranging from ladybugs and bunnies to airplanes. However, crucial visual cues extend beyond mere object recognition and interaction. They encompass a spectrum of richer semantics, including vector graphics (e.g., angles, mazes), fine-grained attributes and affordances. Moreover, they entail intricate visual dynamics, such as object interactions, actions, and activities. Regrettably, traditional visual representations relying solely on pixels and regions fail to fully encapsulate these nuances. In this talk, I propose to design intermediate symbolic semantic representations to precisely describe and aggregate these low-level visual signals. This augmentation promises to enhance their utility as inputs for large language models or vision-language models, thereby facilitating high-level knowledge reasoning and discovery tasks. I will present several applications ranging from playful maze solving and fine-grained concept recognition to video activity detection.	Heng Ji
16:40-17:20	Panel Session	Xin Eric Wang, Paul Liang, Roozbeh Mottaghi, Soravit "Beer" Changpinyo, Heng Ji

Accepted Papers

Archival track:

WISMIR3: A Multi-Modal Dataset to Challenge Text-Image Retrieval Approaches
Florian Schneider, Chris Biemann
mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
Gregor Geigle, Abhay Jain, Radu Timofte, Goran Glavaš
LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-Tailed Multi-Label Visual Recognition
Peng Xia, Di Xu, Ming Hu, Lie Ju, Zongyuan Ge
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models
Holy Lovenia, Wenliang Dai, Samuel Cahyawijaya, Ziwei Ji, Pascale Fung
How and where does CLIP process negation?
Vincent Quantmeyer, Pablo Mosteiro, Albert Gatt
Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation
Malvina Nikandrou, Georgios Pantazopoulos, Ioannis Konstas, Alessandro Suglia
English-to-Japanese Multimodal Machine Translation Based on Image-Text Matching of Lecture Videos
Ayu Teramen, Takumi Ohtsuka, Risa Kondo, Tomoyuki Kajiwara, Takashi Ninomiya
VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool
Yan Wang, Yawen Zeng, Jingsheng Zheng, Xiaofen Xing, Jin Xu, Xiangmin Xu
Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples
Philipp J. Rösch, Norbert Oswald, Michaela Geierhos, Jindřich Libovický
Vision Language Models for Spreadsheet Understanding: Challenges and Opportunities
Shiyu Xia, Junyu Xiong, Haoyu Dong, Jianbo Zhao, Yuzhang Tian, Mengyu Zhou, Yeye He, Shi Han, Dongmei Zhang
SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
Hao Wang, Shuhei Kurita, Shuichiro Shimizu, Daisuke Kawahara
Causal and Temporal Inference in Visual Question Generation by Utilizing Pre-trained Models
Zhanghao Hu, Frank Keller
Improving Vision-Language Cross-Lingual Transfer with Scheduled Unfreezing
Max Reinhardt, Gregor Geigle, Radu Timofte, Goran Glavaš
Automatic Layout Planning for Visually-Rich Documents with Instruction-Following Models
Wanrong Zhu, Ruiyi Zhang, Jennifer Healey, William Yang Wang, Tong Sun
SEA-VQA: Southeast Asian Cultural Context Dataset For Visual Question Answering
Norawit Urailertprasert, Peerat Limkonchotiwat, Supasorn Suwajanakorn, Sarana Nutanong
Wiki-VEL: Visual Entity Linking for Structured Data on Wikimedia Commons
Philipp Bielefeld, Jasmin Geppert, Necdet Güven, Melna Treesa John, Adrian Ziupka, Lucie-Aimée Kaffee, Russa Biswas, Gerard de Melo
VerbCLIP: Improving Verb Understanding in Vision-Language Models with Compositional Structures
Hadi Wazni, Kin Ian Lo, Mehrnoosh Sadrzadeh
Evolutionary Reward Design and Optimization with Multimodal Large Language Models
Ali Emre Narin

Non-archival track:

FaithScore: Fine-grained Evaluations of Hallucinations in Large Vision-Language Models
Liqiang Jing, Ruosen Li, Yunmo Chen, Mengzhao Jia, Xinya Du
HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding
Peng Xia, Xingtong Yu, Ming Hu, Lie Ju, Zhiyong Wang, Peibo Duan, Zongyuan Ge
Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation
Seongyun Lee, Seungone Kim, Sue Hyun Park, Geewook Kim, Minjoon Seo
Can Large Vision Language Models Understand and Reason with Charts? An Empirical Study into the Capabilities and Limitations of LVLMs
Mohammed Saidul Islam, Raian Rahman, Ahmed Masry, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, Enamul Hoque
MM-SOC: Benchmarking Multimodal Large Language Models in Social Media Platforms
Yiqiao Jin, Minje Choi, Gaurav Verma, Jindong Wang, Srijan Kumar
ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models
Kaiwen Zhou, Kwonjoon Lee, Teruhisa Misu, Xin Eric Wang
HateSieve: A Contrastive Learning Framework for Detecting and Segmenting Hateful Content in Multimodal Memes
Xuanyu Su, Yansong Li, Diana Inkpen, Nathalie Japkowicz
ANNA: Abstractive Text-to-Image Synthesis with Filtered News Captions
Aashish Anantha Ramakrishnan, Sharon X Huang, Dongwon Lee
Multimodal Reranking for Knowledge-Intensive Visual Question Answering
Haoyang Wen, Honglei Zhuang, Hamed Zamani, Alexander G Hauptmann, Michael Bendersky
Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)
Avshalom Manevich, Reut Tsarfaty
Natural Language Can Facilitate Sim2Real Transfer
Albert Yu, Adeline Foote, Ray Mooney, Roberto Martín-Martín
Multi-Object Hallucination in Vision-Language Models
Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David Fouhey, Joyce Chai

Invited Speakers

Paul Liang

Dr. Paul Liang is an Assistant Professor at MIT Media Lab and EECS, whose research focuses on advancing the foundations of multisensory artificial intelligence to enhance human experiences across various domains. Recognized with prestigious awards including the Siebel Scholars Award, Waibel Presidential Fellowship, Facebook PhD Fellowship, and multiple best paper awards, Liang's work explores the integration of diverse sensory channels such as text, speech, audio, video, and physiological signals to create AI systems that interact seamlessly with the world. His research interests span from enhancing human physical, emotional, and social well-being through AI, to augmenting human creativity with multimedia generative AI, and addressing critical issues in real-world human-AI interaction such as fairness, trust, and privacy. Beyond research, he has also been honored with the Alan J. Perlis Graduate Student Teaching Award for his contributions to multimodal machine learning education.

Angel Chang

Dr. Angel Chang is an Assistant Professor at Simon Fraser University. Prior to this, she was a visiting research scientist at Facebook AI Research and a research scientist at Eloquent Labs working on dialogue. She received her Ph.D. in Computer Science from Stanford, where she was part of the Natural Language Processing Group and advised by Chris Manning. Her research focuses on connecting language to 3D representations of shapes and scenes and grounding of language for embodied agents in indoor environments. She has worked on methods for synthesizing 3D scenes and shapes from natural language, and various datasets for 3D scene understanding. In general, she is interested in the semantics of shapes and scenes, the representation and acquisition of common sense knowledge, and reasoning using probabilistic models.

Roozbeh Mottaghi

Dr. Roozbeh Mottaghi is a Senior Research Scientist Manager at FAIR and an Affiliate Associate Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. Prior to joining FAIR, he was the Research Manager of the Perceptual Reasoning and Interaction Research (PRIOR) group at the Allen Institute for AI (AI2). He obtained his PhD in Computer Science in 2013 from the University of California, Los Angeles. After his PhD, he joined the Computer Science Department at Stanford University as a post-doctoral researcher. His research mainly focuses on embodied AI, reasoning via perception, and learning via interaction, and his work on large-scale Embodied AI received the Outstanding Paper Award at NeurIPS 2022.

Alane Suhr

Dr. Alane Suhr is an Assistant Professor at UC Berkeley EECS. She received her PhD in Computer Science from Cornell University, where she was based at Cornell Tech in New York, NY, and advised by Yoav Artzi. Afterwards, she spent about a year in Seattle, WA at AI2 as a Young Investigator on the Mosaic team (led by Yejin Choi). Her research spans natural language processing, machine learning, and computer vision. She builds systems that use language to interact with people, e.g., in collaborative interactions (like CerealBar). She also designs models and datasets that address and represent problems in language grounding (e.g., NLVR), and develops learning algorithms for systems that learn language through interaction.

Soravit "Beer" Changpinyo

Dr. Soravit "Beer" Changpinyo is a researcher at Google DeepMind, specializing in computer vision and natural language processing, with a broad interest in machine learning and artificial intelligence. His work has earned him numerous accolades, including the prestigious Annenberg Graduate Fellowship from the University of Southern California and multiple Outstanding Reviewer Awards from top-tier conferences like CVPR and NeurIPS, where he ranked among the top reviewers. His contributions to the field are further underscored by his impactful research on zero-shot learning, transfer learning, multi-task learning, and neural network optimization, which have applications in both academic and industrial settings. Soravit's recent work includes the Gemini series. His innovative work continues to drive advancements in AI, making him a key figure in the research community.

Heng Ji

Heng Ji is a professor in the Computer Science Department, and an affiliated faculty member in the Electrical and Computer Engineering Department and the Coordinated Science Laboratory, at the University of Illinois Urbana-Champaign. She is an Amazon Scholar. She is the Founding Director of the Amazon-Illinois Center on AI for Interactive Conversational Experiences (AICE). She received her B.A. and M.A. in Computational Linguistics from Tsinghua University, and her M.S. and Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing, especially on Multimedia Multilingual Information Extraction, Knowledge-enhanced Large Language Models and Vision-Language Models. She was selected as a "Young Scientist" by the World Laureates Association in 2023 and 2024. She was selected as a "Young Scientist" and a member of the Global Future Council on the Future of Computing by the World Economic Forum in 2016 and 2017. She was named as part of Women Leaders of Conversational AI (Class of 2023) by Project Voice. Her other awards include two Outstanding Paper Awards at NAACL2024, "AI's 10 to Watch" Award by IEEE Intelligent Systems in 2013, NSF CAREER award in 2009, PACLIC2012 Best paper runner-up, "Best of ICDM2013" paper award, "Best of SDM2013" paper award, ACL2018 Best Demo paper nomination, ACL2020 Best Demo Paper Award, NAACL2021 Best Demo Paper Award, Google Research Award in 2009 and 2014, IBM Watson Faculty Award in 2012 and 2014 and Bosch Research Award in 2014-2018. She was invited to testify to the U.S. House Cybersecurity, Data Analytics, & IT Committee as an AI expert in 2023. She was selected to participate in DARPA AI Forward in 2023. She was invited by the Secretary of the U.S. Air Force and AFRL to join the Air Force Data Analytics Expert Panel to inform the Air Force Strategy 2030, and invited to speak at the Federal Information Integrity R&D Interagency Working Group (IIRD IWG) briefing in 2023.
She is the lead of many multi-institution projects and tasks, including the U.S. ARL projects on information fusion and knowledge networks construction, DARPA ECOLE MIRACLE team, DARPA KAIROS RESIN team and DARPA DEFT Tinker Bell team. She coordinated the NIST TAC Knowledge Base Population task from 2010 to 2020. She is the Chief Editor of Data Intelligence Journal. She served as an associate editor for IEEE/ACM Transactions on Audio, Speech, and Language Processing, and as the Program Committee Co-Chair of many conferences including NAACL-HLT2018 and AACL-IJCNLP2022. She was elected as the secretary of the North American Chapter of the Association for Computational Linguistics (NAACL) for 2020-2023. Her research has been widely supported by U.S. government agencies (DARPA, NSF, DoE, ARL, IARPA, AFRL, DHS) and industry (Amazon, Google, Bosch, IBM, Disney).

Organizers

Jing Gu

UC Santa Cruz

Tsu-Jui (Ray) Fu

UC Santa Barbara

Drew Hudson

Google DeepMind

Asli Celikyilmaz

Fundamental AI Research (FAIR) @ Meta

William Wang

UC Santa Barbara

Xin Eric Wang

UC Santa Cruz

Program Committee

Asma Ben Abacha	Microsoft
Shubham Agarwal	Mila
Arjun Akula	Google
Dhivya Chinnappa	Thomson Reuters
Simon Dobnik	University of Gothenburg
Yue Fan	University of California, Santa Cruz
Zhe Gan	Apple AI/ML
Cristina Garbacea	University of Michigan
Huaizu Jiang	Northeastern University
Yujie Lu	University of California, Santa Barbara
Loitongbam Sanayai Meetei	National Institute of Technology, Silchar, India
Yulei Niu	Columbia University
Vikas Raunak	Microsoft
Arka Sadhu	University of Southern California
Thoudam Doren Singh	National Institute of Technology, Silchar, India
Alok Singh	National Institute of Technology, Silchar, India
Ece Takmaz	University of Amsterdam
Hao Tan	Adobe Research
Yiming Xie	Northeastern University
Qianqi Yan	University of California, Santa Cruz
Kaizhi Zheng	University of California, Santa Cruz
Wanrong Zhu	University of California, Santa Barbara

3rd Workshop on Advances in Language and Vision Research (ALVR)

In conjunction with ACL 2024
August 16th, 2024 (Full Day)
Location: Bangkok, Thailand, Lotus Suite 11 (Floor 22)