4th Workshop on Advances in Language and Vision Research (ALVR)

In conjunction with ACL 2026
July 3rd, 2026 (Full Day)
Location: San Diego, California

Language and vision research has evolved rapidly in recent years, driven by the emergence of large vision-language models (LVLMs). Earlier paradigms focused on passive perception, annotated data, and templated language, whereas today's research addresses active perception, self-supervised learning, open-ended natural language, and real-world deployment. These advances have had a profound impact both within the NLP and CV research fields and across domains such as robotics, healthcare, and education. This workshop covers (but is not limited to) the following topics:

  • Self-supervised vision and language pre-training;
  • New tasks and datasets that provide real-world solutions in language and vision;
  • Text-to-image/video generation and text-guided image/video editing;
  • 3D/Spatial reasoning and inference with language and vision;
  • Multimodal agents and language-grounded embodied agents;
  • Visually-grounded natural language understanding and generation;
  • Culturally-aware LVLMs and LVLMs for underrepresented cultures;
  • Multilingual LVLMs;
  • External knowledge integration in visual and language understanding;
  • Shortcomings of existing large vision-and-language models on downstream tasks, and solutions to them;
  • Training efficiency and optimization of LVLMs;
  • Post-training frameworks for LVLMs, including alignment and reasoning;
  • Ethics and bias in LVLMs;
  • Multidisciplinary studies involving linguistics, cognitive science, robotics, etc.;
  • Practical applications of LVLMs;
  • Explainability and interpretability of LVLMs.

Important Dates

    • Paper Submission Due Date: March 24, 2026
    • Notification of acceptance: May 5, 2026
    • Camera-ready papers due: May 12, 2026
    • Workshop Date: July 3, 2026

Call for Papers

Long papers may consist of up to 8 pages of content, plus unlimited pages for references and an appendix. Final versions of long papers will be given one additional page of content (up to 9 pages) so that reviewers' comments can be addressed.

Short papers may consist of up to 4 pages of content, plus unlimited references and an appendix. Short papers will be given 5 content pages in the proceedings upon acceptance. Authors are encouraged to use this additional page to address reviewers' comments in their final versions.

We also include a non-archival track to allow dual submission of work to ALVR 2026 and other conferences/journals. Space permitting, authors of these submissions will still present their work at the workshop, and the papers will be hosted on the workshop website but will not be included in the official proceedings. Please use the ACL format and submit through OpenReview, indicating at the bottom of the submission form that the paper is a cross-submission (non-archival).

The submission website is https://openreview.net/group?id=aclweb.org/ACL/2026/Workshop/ALVR.

Call for Reviewers: We are actively recruiting reviewers with expertise in multimodal learning, vision-language models, and related areas. If you are interested, please sign up here: Reviewer Sign-up Form.

Contact the Organizing Committee: alvr_workshop_acl_2026@googlegroups.com.

Accepted Papers

Archival track:

  • [Spotlight] CoSMoEs: Compact Sparse Mixture of Experts
    Patrick Huber, Akshat Shrivastava, Ernie Chang, Chinnadhurai Sankar, Ahmed A Aly, Adithya Sagar
  • [Spotlight] Efficient Visual Grounding in VQA via Question-Guided Sparse Attention
    Prasanth
  • [Spotlight] Scaling Vision–Language Models for Pharmaceutical Long-Form Video Reasoning on Industrial GenAI Platform
    Suyash Mishra, Qiang Li, Srikanth Patil, Satyanarayan Pati, Baddu Narendra
  • GraphicWeaver: Benchmarking Agentic Planning for Graphic Design Generation
    Dayeon Ki, Tianyi Zhou, Marine Carpuat, Gang Wu, Puneet Mathur, Viswanathan Swaminathan
  • Thinking in Pictures: A Diagnostic Study of Visual vs. Textual Chain-of-Thought Reasoning in Vision-Language Models
    Ben Jenkins
  • A Zipfian Analysis of Visual Token Distributions for AI-Generated Images
    Andrew Shin
  • Semantically Aware Optimal Transport for Dense Label Transfer
    Preeti, Kiran Ravish, Ankita Kushwaha, Pawan Kumar
  • PGGA: A Plan-Grounded GUI Agent for Automated Device Support
    Lei Hsiung, Zhiyu Chen, Seonhoon Kim, Qun Liu
  • CAFES: A Collaborative Multi-Agent Framework for Multi-Granular Multimodal Essay Scoring
    Jiamin Su, Yibo Yan, Zhuoran Gao, Han Zhang, Xiang Liu, Huiyu Zhou, Xuming Hu
  • GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning
    Jianghangfan Zhang, Yibo Yan, Kening Zheng, Xin Zou, Song Dai, Xuming Hu
  • Look Where You're Told: Instruction-Consistent Attention for GUI Grounding
    Seonhoon Kim, Zhiyu Chen, Xin Li, Qun Liu
  • From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning
    Alberto Gonzalo Rodriguez Salgado
  • When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
    Philip Wootaek Shin, Ajay Narayanan Sridhar, Lakshmi Sivani Devarapalli, Rui Zhang, Jack Sampson, Vijaykrishnan Narayanan
  • VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
    Md. Mahfuzur Rahman, Marufa Kamal, Fahad Rahman, Sunzida Siddique, Ahmed Rafi Hasan, Mohd Ariful Haque, Kishor Datta Gupta, Roy George
  • Beyond Visual Similarity: Rule-Guided Multimodal Clustering with explicit domain rules
    Kishor Datta Gupta, Mohd Ariful Haque, Marufa Kamal, Ahmed Rafi Hasan, Md. Mahfuzur Rahman, Roy George
  • ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts
    Rongtian Ye
  • Formal Machine Interpretation for the Semasiographic Mixtec Codices of Precolonial and Early Colonial Mesoamerica
    Christopher Driggers-Ellis, Gabriel Ayoubi, Girish Salunke, Christan Grant
  • Temporal-Linguistic Adaptive Streaming for Continuous Sign Language Translation
    Arshia Kermani, Habib Irani, Deautaun Ross, Vangelis Metsis
  • FADE: Probing the Limits of VLMs on fine-grained OCR
    Deep Shah, Nehal Kathrotia, Sanket Badhe
  • Systematic Performance Degradation in Indic Vision-Language Models: Evidence from Hindi and Telugu
    Rishikant Chigrupaatii, Ponnada Sai Tulasi Kanishka, Lalit Chandra Routhu, Martin Patel, Sama Supratheek Reddy, Divyam Gupta, Rajiv Misra, Rohun Tripathi
  • How Fragile Is Vision-Language Alignment? Mapping Concept Disruption Under Text-to-Image Personalization
    Mujtaba Hasan
  • The Compositional Grounding Gap: Why Vision-Language Models Fail at Relational Reasoning and How to Fix It
    Kaustubh S. Bukkapatnam
  • HalluTrace: Causal Attribution and Source-Targeted Decoding for Hallucination in Large Vision-Language Models
    Kaustubh S. Bukkapatnam

Non-archival track:

  • [Spotlight] Theory of Space: Active Exploration and Spatial Belief Probing for Foundation Models
    Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Manling Li
  • [Spotlight] TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots
    Tianyu Liu, Weihao Xuan, Heli Qi, Rui Yang, Sophia Simeng Han, Tinglin Huang, Fang Wu, Nan Liu, Irene Li, Hua Xu
  • [Spotlight] Characterizing Visual Narrative Freedom under Loose Image–Text Alignment
    Yanru Jiang, Gavin Olson, Eugenio Herrera-Berg, Rick Dale, Hongjing Lu, Elisa Kreiss
  • Beyond Blind Retrieval: Adaptive Multimodal RAG for Reliable Medical AI
    Xiaoyu Deng
  • Architectural Enhancement for Safety of Vision-Language Model
    Youngwan Lee, Kangsan Kim, Kwanyong Park, Ilchae Jung, Soojin Jang, Seanie Lee, Yong-Ju Lee, Sung Ju Hwang
  • Evaluation of Multilingual Ability to Use Spatial Deictic Expressions in Vision-Language Models
    Kaito Watanabe, Taisei Yamamoto, Tomoki Doi, Hitomi Yanaka

Invited Speakers


Dr. Mohit Bansal is the John R. & Louise S. Parker Distinguished Professor and the Director of the MURGe-Lab (UNC-AI Group) in the Computer Science department at the University of North Carolina (UNC) Chapel Hill. He received his Ph.D. in 2013 from the University of California at Berkeley (where he was advised by Dan Klein) and his B.Tech. from the Indian Institute of Technology at Kanpur in 2008. His research expertise is in multimodal generative models, reasoning and planning agents, faithful language generation, and interpretable, efficient, and generalizable deep learning. He is an ACL and AAAI Fellow and recipient of the Presidential Early Career Award for Scientists and Engineers (PECASE), IIT Kanpur Young Alumnus Award, DARPA Director's Fellowship, NSF CAREER Award, Google Focused Research Award, Microsoft Investigator Fellowship, Army Young Investigator Award (YIP), DARPA Young Faculty Award (YFA), and outstanding paper awards at ACL, CVPR, EACL, COLING, CoNLL, and TMLR. He has been a keynote speaker at the ECAI 2025, ACM-CODS 2025, AACL-IJCNLP 2023, CoNLL 2023, and INLG 2022 conferences. His service includes EMNLP Program Co-Chair, CoNLL Program Co-Chair, ACL Executive Committee, ACM Doctoral Dissertation Award Committee, ACL Doctoral Dissertation Award Co-Organizer, ACL Mentorship Program Co-Founder, Associate Editor-in-Chief for TPAMI, and Associate Editor for the TACL, CL, IEEE/ACM TASLP, and CSL journals.

Dr. Raymond J. Mooney is a Professor in the Department of Computer Science at the University of Texas at Austin. He received his Ph.D. in 1988 from the University of Illinois at Urbana-Champaign. He is an author of over 160 published research papers, primarily in the areas of machine learning and natural language processing. He was the President of the International Machine Learning Society from 2008 to 2011, program co-chair for AAAI 2006, general chair for HLT-EMNLP 2005, and co-chair for ICML 1990. He is a Fellow of the American Association for Artificial Intelligence, the Association for Computing Machinery, and the Association for Computational Linguistics, and the recipient of best paper awards from AAAI-96, KDD-04, ICML-05, and ACL-07.

Dr. Amir Zadeh is a Staff ML Researcher at Lambda. His focus is on scalable multimodal learning for world modeling. He received his Ph.D. in Artificial Intelligence from Carnegie Mellon University, with a focus on multimodal machine learning. His honors include best paper runner-up awards at ML conferences and a Yahoo Fellowship during his Ph.D. Dr. Zadeh has published more than 50 papers in top machine learning conferences, including NeurIPS, ICLR, CVPR, and ACL. He has served as an organizer, senior area chair, and committee member for top conferences and workshops.

Dr. Lianhui Qin is an Assistant Professor in the Computer Science Department at UC San Diego. Her research interests broadly span natural language processing, machine learning, and artificial intelligence. She received her Ph.D. from the University of Washington (UW NLP), working with Yejin Choi.

Dr. Jiajun Wu is an Assistant Professor of Computer Science and, by courtesy, of Psychology at Stanford University. His group studies physical scene understanding: building machines that see, reason about, and interact with the physical world. Beyond learning algorithms, the group asks what levels of abstraction AI systems need in their representations and where those abstractions come from. Before joining Stanford, he was a Visiting Faculty Researcher at Google Research, working with Noah Snavely. He completed his Ph.D. at MIT, advised by Bill Freeman and Josh Tenenbaum.

Organizers

Qianqi (Jackie) Yan

UC Santa Barbara

Syrielle Montariol

UC Berkeley

Yue Fan

UC Santa Cruz

Jing Gu

xAI

Manling Li

Northwestern University

Parisa Kordjamshidi

Michigan State University

Alane Suhr

UC Berkeley

Xin Eric Wang

UC Santa Barbara

Contact

Contact the Organizing Committee: alvr_workshop_acl_2026@googlegroups.com

Sponsor

Lambda