2nd Workshop on Advances in Language and Vision Research (ALVR)

In conjunction with NAACL 2021
June 11, 2021 (Full Day)
Location: Virtual

Language and vision research has attracted great attention from both natural language processing (NLP) and computer vision (CV) researchers. The area is gradually shifting from passive perception, templated language, and synthetic imagery or environments to active perception, natural language, and photo-realistic simulation or real-world deployment. Thus far, few workshops on language and vision research have been organized by groups from the NLP community. We are organizing the second workshop on Advances in Language and Vision Research (ALVR) to advance the frontier of language and vision research and to bring interested researchers together to discuss how best to tackle real-world problems in this area.

This workshop covers (but is not limited to) the following topics:

  • New tasks and datasets that provide real-world solutions at the intersection of NLP and CV;
  • Language-guided interaction with the real world, such as navigation via instruction following or dialogue;
  • External knowledge integration in visual and language understanding;
  • Visually grounded multilingual study, for example multimodal machine translation;
  • Shortcomings of existing language and vision tasks and datasets;
  • Benefits of using multimodal learning in downstream NLP tasks;
  • Self-supervised representation learning in language and vision;
  • Transfer learning (including few/zero-shot learning) and domain adaptation;
  • Cross-modal learning beyond image understanding, such as video and audio;
  • Multidisciplinary study that may involve linguistics, cognitive science, robotics, etc.

Important Dates

  • Archival track:
    • Paper Submission Due Date: March 22, 2021
    • Notification of acceptance: April 15, 2021
    • Camera-ready papers due: April 26, 2021
  • Non-archival track:
    • Paper Submission Due Date: April 30, 2021
    • Notification of acceptance: May 14, 2021
    • Camera-ready papers due: May 24, 2021
  • Workshop Date: June 11, 2021


The workshop includes an archival and a non-archival track on topics related to language-and-vision research. For both tracks, the reviewing process is single-blind: reviewers know the identities of the authors, but authors do not know who reviewed their paper. Submission is electronic, using the Softconf START conference management system. The submission site will be available at https://www.softconf.com/naacl2021/alvr2021.

Archival Track

The archival track follows the NAACL short paper format (https://2021.naacl.org/calls/papers/#short-papers). Submissions to the archival track may consist of up to 4 pages of content (excluding references) in NAACL format (style sheets are available below), plus extra space for an optional ethics/broader impact statement and unlimited references. Accepted papers will be given 5 content pages for the camera-ready version. Authors are encouraged to use this additional page to address reviewers’ comments in their final versions. Papers accepted to the archival track will be included in the NAACL 2021 Workshop Proceedings. The archival track does not accept double submissions, e.g., previously published papers or papers concurrently submitted to other conferences or workshops.

Papers submitted to the archival track must follow the NAACL Author Guidelines. Style sheets (LaTeX, Word) are available here: https://2021.naacl.org/calls/style-and-formatting/

Non-archival Track

The workshop also includes a non-archival track to allow submission of previously published papers and double submission to ALVR and other conferences or journals. Accepted non-archival papers can still be presented as posters at the workshop.

There are no formatting or page restrictions for non-archival submissions. Papers accepted to the non-archival track will be displayed on the workshop website, but will NOT be included in the NAACL 2021 Workshop Proceedings or otherwise archived.

Accepted Papers

Archival track:

  • Feature-level Incongruence Reduction for Multimodal Translation
    Zhifeng Li, Yu Hong, Yuchen Pan, Jian Tang, Jianmin Yao and Guodong Zhou
  • Error Causal inference for Multi-Fusion models
    Chengxi Li and Brent Harrison
  • Leveraging Partial Dependency Trees to Control Image Captions
    Wenjie Zhong and Yusuke Miyao
  • Grounding Plural Phrases: Countering Evaluation Biases by Individuation
    Julia Suter, Letitia Parcalabescu and Anette Frank
  • PanGEA: The Panoramic Graph Environment Annotation Toolkit
    Alexander Ku, Peter Anderson, Jordi Pont Tuset and Jason Baldridge
  • Learning to Learn Semantic Factors in Heterogeneous Image Classification
    Boyue Fan and Zhenting Liu
  • Reference and coreference in situated dialogue
    Sharid Loáiciga, Simon Dobnik and David Schlangen

Non-archival track:

  • Interactive Learning from Activity Description
    Khanh Nguyen, Dipendra Misra, Robert Schapire, Miro Dudik and Patrick Shafto
  • Towards Multi-Modal Text-Image Retrieval to improve Human Reading
    Florian Schneider, Özge Alacam, Xintong Wang and Chris Biemann
  • Language-based Video Editing via Multi-Modal Multi-Level Transformer
    Tsu-Jui Fu, Xin Wang, Scott Grafton, Miguel Eckstein and William Yang Wang
  • Learning to Select Question-Relevant Relations for Visual Question Answering
    Hwanhee Lee, Jaewoong Lee, Heejoon Lee and Kyomin Jung
  • CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images
    Shailaja Keyur Sampat, Akshay Kumar, Yezhou Yang and Chitta Baral
  • Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
    Wanrong Zhu, Xin Wang, Tsu-Jui Fu, An Yan, Pradyumna Narayana, Kazoo Sone, Sugato Basu and William Yang Wang
  • What is Multimodality?
    Letitia Parcalabescu, Nils Trost and Anette Frank
  • Pathdreamer: A World Model for Indoor Navigation
    Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge and Peter Anderson
  • Visual Goal-Step Inference using wikiHow
    Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar and Chris Callison-Burch
  • Neural Event Semantics for Grounded Language Understanding
    Shyamal Buch, Li Fei-Fei and Noah Goodman

Program (PDT)

Time          Event               Who
8:30-8:35     Opening Remarks     Workshop Organizers
8:35-9:10     Invited Talk & QA   Invited Speaker 1
9:10-9:45     Invited Talk & QA   Invited Speaker 2
9:45-10:10    Poster Highlight
10:10-10:50   Poster Session
10:50-11:25   Invited Talk & QA   Invited Speaker 3
11:25-12:00   Invited Talk & QA   Invited Speaker 4
12:00-13:20   Lunch
13:20-13:55   Invited Talk & QA   Invited Speaker 5
13:55-14:30   Invited Talk & QA   Invited Speaker 6
14:30-15:05   Invited Talk & QA   Invited Speaker 7
15:05-15:45   Poster Session
15:45-16:20   Invited Talk & QA   Invited Speaker 8
16:20-16:55   Invited Talk & QA   Invited Speaker 9
16:55-17:30   Invited Talk & QA   Invited Speaker 10
17:30-18:10   Panel Session

Invited Speakers

Jacob Andreas is the X Consortium Career Development Assistant Professor at MIT. His research focuses on building intelligent systems that can communicate effectively using language and learn from human guidance. Jacob earned his Ph.D. from UC Berkeley, his M.Phil. from Cambridge (where he studied as a Churchill scholar) and his B.S. from Columbia. He has been the recipient of an NSF graduate fellowship, a Facebook fellowship, and paper awards at NAACL and ICML.

Jason Baldridge is a research scientist at Google, where he works on natural language understanding. He was previously an Associate Professor of Computational Linguistics at the University of Texas at Austin. His main research interests include categorial grammars, parsing, semi-supervised learning for NLP, reference resolution and text geolocation. He has long been active in the creation and promotion of open source software for natural language processing, including co-creating the Apache OpenNLP Toolkit and OpenCCG. Jason received his Ph.D. from the University of Edinburgh in 2002, where his doctoral dissertation on Multimodal Combinatory Categorial Grammar was awarded the 2003 Beth Dissertation Prize from the European Association for Logic, Language and Information.

Dr. Mohit Bansal is the John R. & Louise S. Parker Associate Professor and the Director of the MURGe-Lab (UNC-NLP Group) in the Computer Science department at the University of North Carolina (UNC) Chapel Hill. He received his PhD from UC Berkeley in 2013 (where he was advised by Dan Klein) and his BTech from IIT Kanpur in 2008. His research expertise is in statistical natural language processing and machine learning, with a particular focus on multimodal, grounded, and embodied semantics (i.e., language with vision and speech, for robotics), human-like language generation and Q&A/dialogue, and interpretable and generalizable deep learning.

Yonatan Bisk is an Assistant Professor at Carnegie Mellon University. Yonatan’s research area is Natural Language Processing (NLP) with a focus on grounding. In particular, his work broadly falls into: 1. Uncovering the latent structures of natural language, 2. Modeling the semantics of the physical world, and 3. Connecting language to perception and control.

Joyce Y. Chai is a Professor at the University of Michigan. Her research interests are in the area of artificial intelligence, particularly natural language processing, situated dialogue agents, human-robot communication, and intelligent user interfaces. Her recent work has focused on grounded language processing to facilitate situated communication with robots and other artificial agents. Prior to joining UM, she was a professor at MSU, directing the Language and Interaction Research Lab. At UM, she is a member of the Michigan AI Lab and directs the Situated Language and Embodied Dialogue (SLED) research group. She is also affiliated with the Michigan Robotics Institute.

Yejin Choi is an associate professor at the Paul G. Allen School of Computer Science & Engineering at the University of Washington with the Brett Helsel Career Development Professorship, an adjunct of the Linguistics department, and an affiliate of the Center for Statistics and Social Sciences. She is also a senior research manager at the Allen Institute for Artificial Intelligence, overseeing the project Mosaic on Commonsense Intelligence. She is a co-recipient of the AAAI Outstanding Paper Award in 2020 and the Marr Prize (best paper award) at ICCV 2013, a recipient of the Borg Early Career Award (BECA) in 2018, and was named among IEEE AI's 10 to Watch in 2016.

Raymond J. Mooney is a Professor in the Department of Computer Science at the University of Texas at Austin. He received his Ph.D. in 1988 from the University of Illinois at Urbana/Champaign. He is an author of over 160 published research papers, primarily in the areas of machine learning and natural language processing. He was the President of the International Machine Learning Society from 2008 to 2011, program co-chair for AAAI 2006, general chair for HLT-EMNLP 2005, and co-chair for ICML 1990. He is a Fellow of the American Association for Artificial Intelligence, the Association for Computing Machinery, and the Association for Computational Linguistics, and the recipient of best paper awards from AAAI-96, KDD-04, ICML-05 and ACL-07.

Anna Rohrbach is a Research Scientist at UC Berkeley, working with Prof. Trevor Darrell. She completed her PhD at the Max Planck Institute for Informatics under the supervision of Prof. Bernt Schiele. Her research is at the intersection of vision and language. She is interested in a variety of tasks, including image and video description, visual grounding, and visual question answering. Recently, she has been focusing on building explainable models and addressing bias in existing vision and language models.

Kate Saenko is an Associate Professor of Computer Science at Boston University and a consulting professor for the MIT-IBM Watson AI Lab. She leads the Computer Vision and Learning Group at BU, is the founder and co-director of the Artificial Intelligence Research (AIR) initiative, and is a member of the Image and Video Computing research group. Kate received a PhD from MIT and did her postdoctoral training at UC Berkeley and Harvard. Her research interests are in the broad area of Artificial Intelligence with a focus on dataset bias, adaptive machine learning, learning for image and language understanding, and deep learning.

William Wang is the Director of UC Santa Barbara's Natural Language Processing group and Center for Responsible Machine Learning. He is the Duncan and Suzanne Mellichamp Chair in Artificial Intelligence and Designs, and an Assistant Professor in the Department of Computer Science at the University of California, Santa Barbara. He received his PhD from the School of Computer Science at Carnegie Mellon University. He has broad interests in machine learning approaches to data science, including statistical relational learning, information extraction, computational social science, speech, and vision.


Organizing Committee

  • Xin (Eric) Wang, UC Santa Cruz
  • Ronghang Hu, Facebook AI Research
  • Drew Hudson
  • Tsu-Jui Fu, UC Santa Barbara
  • Marcus Rohrbach, Facebook AI Research
  • Daniel Fried, UC Berkeley


Contact the Organizing Committee: alvr2021_naacl2021@softconf.com

Program Committee

  • Shubham Agarwal, Heriot-Watt University
  • Arjun Akula, University of California, Los Angeles
  • Asma Ben Abacha
  • Luciana Benotti, Universidad Nacional de Córdoba
  • Khyathi Raghavi Chandu, Carnegie Mellon University
  • Angel Chang, Stanford University
  • Dhivya Chinnappa, Thomson Reuters
  • Abhishek Das, Facebook AI
  • Simon Dobnik, University of Gothenburg
  • Thoudam Doren Singh, National Institute of Technology Silchar, India
  • Hamed Firooz, Facebook AI
  • Zhe Gan, Microsoft
  • Cristina Garbacea, University of Michigan
  • Jack Hessel, AI2
  • Gabriel Ilharco, University of Washington
  • Shailza Jolly, TU Kaiserslautern, Germany
  • Marimuthu Kalimuthu, Saarland University, Saarland Informatics Campus
  • Noriyuki Kojima, Cornell University
  • Christopher Kümmel, Beuth University of Applied Sciences Berlin
  • Loitongbam Sanayai Meetei, National Institute of Technology Silchar, India
  • Khanh Nguyen, University of Maryland
  • Yulei Niu, Renmin University of China
  • Aishwarya Padmakumar, University of Texas, Austin
  • Hamid Palangi, Microsoft Research
  • Shruti Palaskar, Carnegie Mellon University
  • Vikas Raunak, Carnegie Mellon University
  • Arka Sadhu, University of Southern California
  • Alok Singh, National Institute of Technology Silchar, India
  • Alane Suhr, Cornell University
  • Hao Tan, University of North Carolina
  • Xiangru Tang, University of the Chinese Academy of Sciences, China
  • Ece Takmaz, University of Amsterdam