3rd Workshop on Advances in Language and Vision Research (ALVR)

In conjunction with ACL 2024
August 16th, 2024 (Full Day)
Location: Bangkok, Thailand, Lotus Suite 11 (Floor 22)

Language and vision research has attracted great attention from both natural language processing (NLP) and computer vision (CV) researchers. Gradually, this area is shifting from passive perception, templated language, and synthetic imagery/environments to active perception, natural language, and photo-realistic simulation or real-world deployment. This workshop covers (but is not limited to) the following topics:

  • Self-supervised vision and language pre-training;
  • New tasks and datasets that provide real-world solutions in language and vision;
  • Text-to-image/video generation and text-guided image/video editing;
  • External knowledge integration in visual and language understanding;
  • Visually-grounded natural language understanding and generation;
  • Language-grounded visual recognition and reasoning;
  • Language-grounded embodied agents, e.g., vision-and-language navigation;
  • Visually-grounded multilingual study, e.g., multimodal machine translation;
  • Shortcomings of existing large vision-and-language models on downstream tasks, and potential solutions;
  • Ethics and bias in large vision-and-language models;
  • Multidisciplinary studies that may involve linguistics, cognitive science, robotics, etc.;
  • Explainability and interpretability of large vision-and-language models.

Call for Papers

Long papers may consist of up to 8 pages of content, plus unlimited pages for references and an appendix; final versions of long papers will be given one additional page of content (up to 9 pages) so that reviewers’ comments can be considered.

Short papers may consist of up to 4 pages of content, plus unlimited pages for references and an appendix. Short papers will be given 5 content pages in the proceedings upon acceptance; authors are encouraged to use this additional page to address reviewers’ comments in their final versions.

We also include a non-archival track to allow dual submission of work to ALVR 2024 and other conferences/journals. Space permitting, authors of these submissions will still be able to present their work at the workshop, and the papers will be hosted on the workshop website, but they will not be included in the official proceedings. Please use the ACL format and submit through OpenReview, indicating that the submission is a cross-submission (non-archival) at the bottom of the submission form.

The submission website is https://openreview.net/group?id=aclweb.org/ACL/2024/Workshop/ALVR.

Schedule (ICT, UTC+7)

8:10-8:20 Opening Remarks (Workshop Organizers)
8:20-9:10 Invited Talk 1: Foundations of Multimodal Interactions and Multisensory Foundation Models (Paul Liang)
Abstract: Building multisensory AI systems that learn from multiple sensory inputs such as text, speech, video, real-world sensors, wearable devices, and medical data holds great promise for impact in many scientific areas with practical benefits, such as in supporting human health and well-being, enabling multimedia content processing, and enhancing real-world autonomous agents. In this talk, I will discuss my research on the machine learning principles of multisensory intelligence, as well as practical methods for building multisensory foundation models over many modalities and tasks. In the first half, I will present a theoretical framework formalizing how modalities interact with each other to give rise to new information for a task. These interactions are the basic building blocks in all multimodal problems, and their quantification enables users to understand their multimodal datasets and design principled approaches to learn these interactions. In the second part, I will discuss our collaborative efforts in scaling AI to many modalities and tasks for real-world impact on affective computing, mental health, and cancer prognosis.
9:10-10:00 Invited Talk 2: Connecting 3D and Language (Angel Chang)
Abstract: People communicate about objects, scenes, and spatial relations in the real world using natural language. In this talk, I will give an overview of recent work that explores how to endow computational systems with the ability to connect natural language and 3D representations. Concretely, I will summarize projects that develop neural models for localizing and describing objects in 3D scenes using natural language, generating 3D content from text, and other recent work exploring tasks that require connections between natural language and 3D representations.
10:00-10:30 Coffee Break
10:30-11:20 Invited Talk 3: Planning and Reasoning for Multi-Agent Embodied Tasks: Benchmarking and Evaluation (Roozbeh Mottaghi)
Abstract: Language is never used in isolation. It is articulated, understood, and contextualized against the speaker’s history, actions, and environment. To this end, we have developed a series of benchmarks to evaluate language comprehension in dynamic, embodied settings where robots execute instructions given by humans. First, I will discuss Situated Instruction Following, where the meaning of instructions is revealed through the past actions and anticipated future behaviors of the human speaker. Then, I will introduce PARTNR, a multi-agent benchmark that involves a human and their robotic assistant collaboratively completing tasks under various constraints, including spatial, temporal, and heterogeneity challenges. Lastly, I will present GOAT-Bench, a robot navigation benchmark in a continual learning framework where the navigation target is defined through multiple modalities, including images and natural language descriptions of objects. I will share our analysis of these benchmarks, demonstrating how state-of-the-art LLMs and large-scale multi-modal models struggle with tracking task progression, managing ambiguities, and dividing tasks effectively.
11:20-12:10 Invited Talk 4: The Role of Joint Embodiment in Situated Language-Based Interactions (Alane Suhr)
Abstract: Large-scale pretraining has become the standard solution to automated reasoning over text and/or visual perception. But how far does this approach get us to systems that generalize to language use in realistic multi-agent situated interactions? First, I will talk about existing work in evaluating the spatial and compositional reasoning capabilities of current multimodal LMs. Then, I will talk about how these benchmarks miss a key aspect of real-world situated interactions: joint embodiment. I will discuss how joint embodiment in a shared world supports perspective-taking, an often overlooked aspect of situated reasoning, and introduce a new environment and benchmark for studying the influence of perspective-taking on language use in interaction.
12:10-13:30 Lunch
13:30-14:20 Invited Talk 5: Toward Vision and Richer Language(s) (Soravit "Beer" Changpinyo)
Abstract: How rich is the language in vision and language research? Arguably, for some time, visual understanding has been the first-class citizen of this research direction. In this talk, I will present research projects from the past several years that aim at elevating the role of language(s). This spans improving specificity and informativeness, reasoning about the language in pixels, as well as going beyond declarative, literal, and English languages. Collectively, this moves us toward tightly connecting vision with rich(er) languages.
14:20-15:00 Poster Highlight
15:00-15:50 Poster Session, Coffee Break
15:50-16:40 Invited Talk 6: Representing Illustrative Visual Semantics with Descriptive Language (Heng Ji)
Abstract: Contemporary visual semantic representations predominantly revolve around commonplace objects found in everyday images and videos, ranging from ladybugs and bunnies to airplanes. However, crucial visual cues extend beyond mere object recognition and interaction. They encompass a spectrum of richer semantics, including vector graphics (e.g., angles, mazes) and fine-grained attributes and affordances. Moreover, they entail intricate visual dynamics, such as object interactions, actions, and activities. Regrettably, traditional visual representations relying solely on pixels and regions fail to fully encapsulate these nuances. In this talk, I propose to design intermediate symbolic semantic representations to precisely describe and aggregate these low-level visual signals. This augmentation promises to enhance their utility as inputs for large language models or vision-language models, thereby facilitating high-level knowledge reasoning and discovery tasks. I will present several applications, ranging from playful maze solving and fine-grained concept recognition to video activity detection.
16:40-17:20 Panel Session (Panelists: Xin Eric Wang, Paul Liang, Roozbeh Mottaghi, Soravit "Beer" Changpinyo, Heng Ji)

Accepted Papers

Archival track:

  • WISMIR3: A Multi-Modal Dataset to Challenge Text-Image Retrieval Approaches
    Florian Schneider, Chris Biemann
  • mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
    Gregor Geigle, Abhay Jain, Radu Timofte, Goran Glavaš
  • LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-Tailed Multi-Label Visual Recognition
    Peng Xia, Di Xu, Ming Hu, Lie Ju, Zongyuan Ge
  • Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models
    Holy Lovenia, Wenliang Dai, Samuel Cahyawijaya, Ziwei Ji, Pascale Fung
  • How and where does CLIP process negation?
    Vincent Quantmeyer, Pablo Mosteiro, Albert Gatt
  • Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation
    Malvina Nikandrou, Georgios Pantazopoulos, Ioannis Konstas, Alessandro Suglia
  • English-to-Japanese Multimodal Machine Translation Based on Image-Text Matching of Lecture Videos
    Ayu Teramen, Takumi Ohtsuka, Risa Kondo, Tomoyuki Kajiwara, Takashi Ninomiya
  • VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool
    Yan Wang, Yawen Zeng, Jingsheng Zheng, Xiaofen Xing, Jin Xu, Xiangmin Xu
  • Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples
    Philipp J. Rösch, Norbert Oswald, Michaela Geierhos, Jindřich Libovický
  • Vision Language Models for Spreadsheet Understanding: Challenges and Opportunities
    Shiyu Xia, Junyu Xiong, Haoyu Dong, Jianbo Zhao, Yuzhang Tian, Mengyu Zhou, Yeye He, Shi Han, Dongmei Zhang
  • SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
    Hao Wang, Shuhei Kurita, Shuichiro Shimizu, Daisuke Kawahara
  • Causal and Temporal Inference in Visual Question Generation by Utilizing Pre-trained Models
    Zhanghao Hu, Frank Keller
  • Improving Vision-Language Cross-Lingual Transfer with Scheduled Unfreezing
    Max Reinhardt, Gregor Geigle, Radu Timofte, Goran Glavaš
  • Automatic Layout Planning for Visually-Rich Documents with Instruction-Following Models
    Wanrong Zhu, Ruiyi Zhang, Jennifer Healey, William Yang Wang, Tong Sun
  • SEA-VQA: Southeast Asian Cultural Context Dataset For Visual Question Answering
    Norawit Urailertprasert, Peerat Limkonchotiwat, Supasorn Suwajanakorn, Sarana Nutanong
  • Wiki-VEL: Visual Entity Linking for Structured Data on Wikimedia Commons
    Philipp Bielefeld, Jasmin Geppert, Necdet Güven, Melna Treesa John, Adrian Ziupka, Lucie-Aimée Kaffee, Russa Biswas, Gerard de Melo
  • VerbCLIP: Improving Verb Understanding in Vision-Language Models with Compositional Structures
    Hadi Wazni, Kin Ian Lo, Mehrnoosh Sadrzadeh
  • Evolutionary Reward Design and Optimization with Multimodal Large Language Models
    Ali Emre Narin

Non-archival track:

  • FaithScore: Fine-grained Evaluations of Hallucinations in Large Vision-Language Models
    Liqiang Jing, Ruosen Li, Yunmo Chen, Mengzhao Jia, Xinya Du
  • HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding
    Peng Xia, Xingtong Yu, Ming Hu, Lie Ju, Zhiyong Wang, Peibo Duan, Zongyuan Ge
  • Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation
    Seongyun Lee, Seungone Kim, Sue Hyun Park, Geewook Kim, Minjoon Seo
  • Can Large Vision Language Models Understand and Reason with Charts? An Empirical Study into the Capabilities and Limitations of LVLMs
    Mohammed Saidul Islam, Raian Rahman, Ahmed Masry, Md Tahmid Rahman Laskar, Mir Tafseer Nayeem, Enamul Hoque
  • MM-SOC: Benchmarking Multimodal Large Language Models in Social Media Platforms
    Yiqiao Jin, Minje Choi, Gaurav Verma, Jindong Wang, Srijan Kumar
  • ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models
    Kaiwen Zhou, Kwonjoon Lee, Teruhisa Misu, Xin Eric Wang
  • HateSieve: A Contrastive Learning Framework for Detecting and Segmenting Hateful Content in Multimodal Memes
    Xuanyu Su, Yansong Li, Diana Inkpen, Nathalie Japkowicz
  • ANNA: Abstractive Text-to-Image Synthesis with Filtered News Captions
    Aashish Anantha Ramakrishnan, Sharon X Huang, Dongwon Lee
  • Multimodal Reranking for Knowledge-Intensive Visual Question Answering
    Haoyang Wen, Honglei Zhuang, Hamed Zamani, Alexander G Hauptmann, Michael Bendersky
  • Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)
    Avshalom Manevich, Reut Tsarfaty
  • Natural Language Can Facilitate Sim2Real Transfer
    Albert Yu, Adeline Foote, Ray Mooney, Roberto Martín-Martín
  • Multi-Object Hallucination in Vision-Language Models
    Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David Fouhey, Joyce Chai

Invited Speakers


Dr. Paul Liang is an Assistant Professor at the MIT Media Lab and MIT EECS, whose research focuses on advancing the foundations of multisensory artificial intelligence to enhance human experiences across various domains. He has been recognized with prestigious awards including the Siebel Scholars Award, the Waibel Presidential Fellowship, the Facebook PhD Fellowship, and multiple best paper awards. His work explores the integration of diverse sensory channels such as text, speech, audio, video, and physiological signals to create AI systems that interact seamlessly with the world. His research interests span enhancing human physical, emotional, and social well-being through AI, augmenting human creativity with multimedia generative AI, and addressing critical issues in real-world human-AI interaction such as fairness, trust, and privacy. Beyond research, he has also been honored with the Alan J. Perlis Graduate Student Teaching Award for his contributions to multimodal machine learning education.

Dr. Angel Chang is an Assistant Professor at Simon Fraser University. Prior to this, she was a visiting research scientist at Facebook AI Research and a research scientist at Eloquent Labs working on dialogue. She received her Ph.D. in Computer Science from Stanford, where she was part of the Natural Language Processing Group and advised by Chris Manning. Her research focuses on connecting language to 3D representations of shapes and scenes and on grounding language for embodied agents in indoor environments. She has worked on methods for synthesizing 3D scenes and shapes from natural language, and on various datasets for 3D scene understanding. In general, she is interested in the semantics of shapes and scenes, the representation and acquisition of common sense knowledge, and reasoning using probabilistic models.

Dr. Roozbeh Mottaghi is a Senior Research Scientist Manager at FAIR and an Affiliate Associate Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. Prior to joining FAIR, he was the Research Manager of the Perceptual Reasoning and Interaction Research (PRIOR) group at the Allen Institute for AI (AI2). He obtained his PhD in Computer Science in 2013 from the University of California, Los Angeles, and then joined the Computer Science Department at Stanford University as a post-doctoral researcher. His research mainly focuses on embodied AI, reasoning via perception, and learning via interaction, and his work on large-scale Embodied AI received the Outstanding Paper Award at NeurIPS 2022.

Dr. Alane Suhr is an Assistant Professor at UC Berkeley EECS. She received her PhD in Computer Science from Cornell University, based at Cornell Tech in New York, NY, where she was advised by Yoav Artzi. Afterwards, she spent about a year in Seattle, WA at AI2 as a Young Investigator on the Mosaic team (led by Yejin Choi). Her research spans natural language processing, machine learning, and computer vision. She builds systems that use language to interact with people, e.g., in collaborative interactions (like CerealBar). She also designs models and datasets that address and represent problems in language grounding (e.g., NLVR), and develops learning algorithms for systems that learn language through interaction.

Dr. Soravit "Beer" Changpinyo is a researcher at Google DeepMind, specializing in computer vision and natural language processing, with a broad interest in machine learning and artificial intelligence. His work has earned him numerous accolades, including the prestigious Annenberg Graduate Fellowship from the University of Southern California and multiple Outstanding Reviewer Awards from top-tier conferences like CVPR and NeurIPS, where he ranked among the top reviewers. His contributions to the field are further underscored by his impactful research on zero-shot learning, transfer learning, multi-task learning, and neural network optimization, which have applications in both academic and industrial settings. Soravit's recent work includes the Gemini series. His innovative work continues to drive advancements in AI, making him a key figure in the research community.

Dr. Heng Ji is a professor in the Computer Science Department, and an affiliated faculty member in the Electrical and Computer Engineering Department and the Coordinated Science Laboratory, at the University of Illinois Urbana-Champaign. She is an Amazon Scholar and the Founding Director of the Amazon-Illinois Center on AI for Interactive Conversational Experiences (AICE). She received her B.A. and M.A. in Computational Linguistics from Tsinghua University, and her M.S. and Ph.D. in Computer Science from New York University. Her research interests focus on Natural Language Processing, especially Multimedia Multilingual Information Extraction, Knowledge-enhanced Large Language Models, and Vision-Language Models. She was selected as a "Young Scientist" by the World Laureates Association in 2023 and 2024, and as a "Young Scientist" and a member of the Global Future Council on the Future of Computing by the World Economic Forum in 2016 and 2017. She was named among the Women Leaders of Conversational AI (Class of 2023) by Project Voice. Her other awards include two Outstanding Paper Awards at NAACL 2024, the "AI's 10 to Watch" Award from IEEE Intelligent Systems in 2013, an NSF CAREER award in 2009, the PACLIC 2012 Best Paper runner-up, the "Best of ICDM 2013" paper award, the "Best of SDM 2013" paper award, an ACL 2018 Best Demo Paper nomination, the ACL 2020 Best Demo Paper Award, the NAACL 2021 Best Demo Paper Award, Google Research Awards in 2009 and 2014, IBM Watson Faculty Awards in 2012 and 2014, and Bosch Research Awards in 2014-2018. She was invited to testify to the U.S. House Cybersecurity, Data Analytics, & IT Committee as an AI expert in 2023, was selected to participate in DARPA AI Forward in 2023, was invited by the Secretary of the U.S. Air Force and AFRL to join the Air Force Data Analytics Expert Panel to inform the Air Force Strategy 2030, and was invited to speak at the Federal Information Integrity R&D Interagency Working Group (IIRD IWG) briefing in 2023. She has led many multi-institution projects and tasks, including the U.S. ARL projects on information fusion and knowledge network construction, the DARPA ECOLE MIRACLE team, the DARPA KAIROS RESIN team, and the DARPA DEFT Tinker Bell team, and she coordinated the NIST TAC Knowledge Base Population task from 2010 to 2020. She is the Chief Editor of the Data Intelligence Journal, served as an associate editor for IEEE/ACM Transactions on Audio, Speech, and Language Processing, and was Program Committee Co-Chair of many conferences, including NAACL-HLT 2018 and AACL-IJCNLP 2022. She was elected secretary of the North American Chapter of the Association for Computational Linguistics (NAACL) for 2020-2023. Her research has been widely supported by U.S. government agencies (DARPA, NSF, DoE, ARL, IARPA, AFRL, DHS) and industry (Amazon, Google, Bosch, IBM, Disney).

Organizers

  • Jing Gu, UC Santa Cruz
  • Tsu-Jui (Ray) Fu, UC Santa Barbara
  • Drew Hudson, Google DeepMind
  • Asli Celikyilmaz, Fundamental AI Research (FAIR) @ Meta
  • William Wang, UC Santa Barbara
  • Xin Eric Wang, UC Santa Cruz

Contact

Program Committee

  • Asma Ben Abacha, Microsoft
  • Shubham Agarwal, Mila
  • Arjun Akula, Google
  • Dhivya Chinnappa, Thomson Reuters
  • Simon Dobnik, University of Gothenburg
  • Yue Fan, University of California, Santa Cruz
  • Zhe Gan, Apple AI/ML
  • Cristina Garbacea, University of Michigan
  • Huaizu Jiang, Northeastern University
  • Yujie Lu, University of California, Santa Barbara
  • Loitongbam Sanayai Meetei, National Institute of Technology Silchar, India
  • Yulei Niu, Columbia University
  • Vikas Raunak, Microsoft
  • Arka Sadhu, University of Southern California
  • Thoudam Doren Singh, National Institute of Technology Silchar, India
  • Alok Singh, National Institute of Technology Silchar, India
  • Ece Takmaz, University of Amsterdam
  • Hao Tan, Adobe Research
  • Yiming Xie, Northeastern University
  • Qianqi Yan, University of California, Santa Cruz
  • Kaizhi Zheng, University of California, Santa Cruz
  • Wanrong Zhu, University of California, Santa Barbara