2nd Workshop on Advances in Language and Vision Research (ALVR)

In conjunction with NAACL 2021
June 11th, 2021 (Full Day)
Location: Virtual



Language and vision research has attracted great attention from both natural language processing (NLP) and computer vision (CV) researchers. Gradually, this area is shifting from passive perception, templated language, and synthetic imagery/environments to active perception, natural language, and photo-realistic simulation or real-world deployment. This workshop covers (but is not limited to) the following topics:

  • New tasks and datasets that provide real-world solutions at the intersection of NLP and CV;
  • Language-guided interaction with the real world, such as navigation via instruction following or dialogue;
  • External knowledge integration in visual and language understanding;
  • Visually grounded multilingual study, for example multimodal machine translation;
  • Shortcomings of existing language and vision tasks and datasets;
  • Benefits of using multimodal learning in downstream NLP tasks;
  • Self-supervised representation learning in language and vision;
  • Transfer learning (including few/zero-shot learning) and domain adaptation;
  • Cross-modal learning beyond image understanding, such as video and audio;
  • Multidisciplinary study that may involve linguistics, cognitive science, robotics, etc.


Schedule (PDT, UTC-7)

Time (PDT) Event Who
8:30-8:35 Opening Remarks Workshop Organizers
8:35-9:10 Invited Talk 1: Instructions, Abstraction, and Theory-of-Mind Abstract
This talk focuses on an overview of our recent environments and benchmarks: ALFRED and ALFWorld for instruction following in embodied and abstract action spaces. The goal is to help move the community towards building agents that connect language to action and understand abstract plans. As we move towards systems which interact with the world, we also need to think about how they interact with other agents. I close with a quick preview of upcoming work at ICML on Theory-of-Mind agents based in the ALFWorld environment.
Yonatan Bisk
9:10-9:45 Invited Talk 2: Generalization in Vision and Language Reasoning Abstract
A key challenge for Artificial Intelligence research is to go beyond static observational data, and consider more challenging settings that involve dynamic actions and incremental decision-making. In this talk, I will introduce our recent work on visually-grounded language reasoning via the studies of vision-and-language navigation. In particular, I will emphasize two main benefits of self-supervised learning that improve generalization by (1) creating counterfactuals to augment observational data and (2) enabling transfer learning for challenging settings. I will present our empirical results on indoor and outdoor navigation datasets, and demonstrate the effectiveness of our proposed Adversarial Path Sampler and Multimodal Text Style Transfer approaches for vision-and-language navigation.
William Wang
9:45-10:10 Poster Highlight
10:10-10:50 Poster Session 1
10:50-11:25 Invited Talk 3: If Bears were Bees and Cats were Researchers Abstract
Human knowledge and use of language is inextricably connected to perception, action and the organization of the brain, yet natural language processing is still dominated by text! More research involving language (including speech) in the context of other modalities and environments is needed, and there has never been a better time to do it. In my talk, I’ll share some of my team’s work on these topics, particularly around language-image retrieval, text-to-image generation and vision and language navigation.

That said, this talk will mostly be a Complaining Discussion about vision and language research. Rather than giving a glossy and triumphant overview of some papers, I’d like to include additional observations and questions regarding current research business-as-usual, injecting my own split perspective as an empiricist in natural language processing and as a person with both anthropological and linguistic training, perspectives and predilections. My hope is that through some Complaining and Introspection, I can provide useful reminders of why this research area is still so hard, so wide open, so important and so exciting.
Jason Baldridge
11:25-12:00 Invited Talk 4: Action Learning and Justification Using Language Abstract
Task learning from natural language instructions involves understanding, learning, and justification of perceived actions. In this talk, I will talk about some work done in my lab that incorporates commonsense knowledge in language grounding, action learning and justification. I will also introduce our recent work on hierarchical task learning using the ALFRED dataset and discuss key challenges and opportunities.
Joyce Chai
12:00-12:30 Panel Session 1
Jason Baldridge, Joyce Chai, Kate Saenko, William Wang
12:30-13:20 Lunch
13:20-13:55 Invited Talk 5: Separating Skills and Concepts for Novel Visual Question Answering Kate Saenko
13:55-14:30 Invited Talk 6: Natural Language Explanations of Deep Networks Abstract
Despite major efforts in recent years to improve explainability of deep neural networks, the tools we use for communicating explanations have largely remained the same: visualizations of representative inputs, salient input regions, and local model approximations. But when humans describe complex decision rules, we often use a different explanatory tool: natural language. I'll describe recent work on explaining models for computer vision tasks by automatically constructing natural language descriptions of individual neurons. These descriptions ground prediction in meaningful perceptual and linguistic abstractions, and can be used to surface unexpected model behaviors, identify adversarial vulnerabilities, and even guide text-based image editing. These results show that fine-grained, automatic annotation of deep network models is both possible and practical: rich, language-based explanations produced by automated annotation procedures can surface meaningful and actionable information about model behavior.
Jacob Andreas
14:30-15:05 Invited Talk 7: Sherlock and Merlot: Multimodal Abductive Reasoning and Neural Script Knowledge Abstract
In this talk, I will present Sherlock, a new dataset and tasks for multimodal abductive reasoning, and Merlot, a new multimodal model for neural script knowledge that achieves SOTA on 12+ video-based QA benchmarks.
Yejin Choi
15:05-15:45 Poster Session 2
15:45-16:20 Invited Talk 8: Explanations for Visual Question Answering Abstract
This talk will review several aspects of our recent work on explanation for VQA. First, we have developed a VQA system that can elucidate its answers with multi-modal natural-language and visual explanations that faithfully reflect important aspects of its underlying reasoning while capturing the style of comprehensible human explanations. Crowd-sourced human evaluation of these explanations has demonstrated the advantages of our approach. Second, we have developed methods that use human-provided visual or textual explanations to aid the training of VQA systems and improve their robustness to changing problem distributions. Finally, we have developed a novel framework that constructs explanations for multiple potential answers for a VQA problem and analyzes and compares these competing explanations to improve both the accuracy of the system and the quality of its explanations as evaluated by human judges.
Ray Mooney
16:20-16:55 Invited Talk 9: New Frontiers in Vision and Language Research Abstract
Vision and Language research is rapidly evolving, in terms of both the methods and the applications. I will start by presenting a benchmarking study on the benefits of CLIP, a recent powerful multimodal model, for the traditional Vision and Language tasks. Then, I will talk about several new application scenarios for Vision and Language research. First is our work on incorporating language into a formerly “vision-only” task of video summarization. Second, I will discuss how we can leverage language to address bias in visual classifiers. Lastly, I will talk about automatically generating and detecting out-of-context multimodal media, an emerging misinformation threat scenario.
Anja Rohrbach
16:55-17:30 Invited Talk 10: Knowledgeable and Spatio-Temporal Vision+Language Mohit Bansal
17:30-18:00 Panel Session 2
Jacob Andreas, Mohit Bansal, Yejin Choi, Ray Mooney, Anja Rohrbach

Accepted Papers

Archival track:

  • Feature-level Incongruence Reduction for Multimodal Translation (PDF | Video)
    Zhifeng Li, Yu Hong, Yuchen Pan, Jian Tang, Jianmin Yao and Guodong Zhou
  • Error Causal inference for Multi-Fusion models (PDF | Video)
    Chengxi Li and Brent Harrison
  • Leveraging Partial Dependency Trees to Control Image Captions (PDF | Video)
    Wenjie Zhong and Yusuke Miyao
  • Grounding Plural Phrases: Countering Evaluation Biases by Individuation (PDF | Video)
    Julia Suter, Letitia Parcalabescu and Anette Frank
  • PanGEA: The Panoramic Graph Environment Annotation Toolkit (PDF)
    Alexander Ku, Peter Anderson, Jordi Pont Tuset and Jason Baldridge
  • Learning to Learn Semantic Factors in Heterogeneous Image Classification (PDF | Video)
    Boyue Fan and Zhenting Liu
  • Reference and coreference in situated dialogue (PDF | Video)
    Sharid Loáiciga, Simon Dobnik and David Schlangen

Proceedings: https://www.aclweb.org/anthology/volumes/2021.alvr-1/

Non-archival track:

  • Interactive Learning from Activity Description (PDF)
    Khanh Nguyen, Dipendra Misra, Robert Schapire, Miro Dudik and Patrick Shafto
  • Towards Multi-Modal Text-Image Retrieval to improve Human Reading (PDF)
    Florian Schneider, Özge Alacam, Xintong Wang and Chris Biemann
  • Language-based Video Editing via Multi-Modal Multi-Level Transformer (PDF | Video)
    Tsu-Jui Fu, Xin Wang, Scott Grafton, Miguel Eckstein and William Yang Wang
  • Learning to Select Question-Relevant Relations for Visual Question Answering
    Hwanhee Lee, Jaewoong Lee, Heejoon Lee and Kyomin Jung
  • CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images (PDF)
    Shailaja Keyur Sampat, Akshay Kumar, Yezhou Yang and Chitta Baral
  • Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation (PDF | Video)
    Wanrong Zhu, Xin Wang, Tsu-Jui Fu, An Yan, Pradyumna Narayana, Kazoo Sone, Sugato Basu and William Yang Wang
  • What is Multimodality? (PDF | Video)
    Letitia Parcalabescu, Nils Trost and Anette Frank
  • Pathdreamer: A World Model for Indoor Navigation (PDF | Video)
    Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge and Peter Anderson
  • Visual Goal-Step Inference using wikiHow
    Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar and Chris Callison-Burch
  • Neural Event Semantics for Grounded Language Understanding (PDF | Video)
    Shyamal Buch, Li Fei-Fei and Noah Goodman

Invited Speakers

Jacob Andreas is the X Consortium Career Development Assistant Professor at MIT. His research focuses on building intelligent systems that can communicate effectively using language and learn from human guidance. Jacob earned his Ph.D. from UC Berkeley, his M.Phil. from Cambridge (where he studied as a Churchill scholar) and his B.S. from Columbia. He has been the recipient of an NSF graduate fellowship, a Facebook fellowship, and paper awards at NAACL and ICML.

Jason Baldridge is a research scientist at Google, where he works on natural language understanding. He was previously an Associate Professor of Computational Linguistics at the University of Texas at Austin. His main research interests include categorial grammars, parsing, semi-supervised learning for NLP, reference resolution and text geolocation. He has long been active in the creation and promotion of open source software for natural language processing, including co-creating the Apache OpenNLP Toolkit and OpenCCG. Jason received his Ph.D. from the University of Edinburgh in 2002, where his doctoral dissertation on Multimodal Combinatory Categorial Grammar was awarded the 2003 Beth Dissertation Prize from the European Association for Logic, Language and Information.

Dr. Mohit Bansal is the John R. & Louise S. Parker Associate Professor and the Director of the MURGe-Lab (UNC-NLP Group) in the Computer Science department at the University of North Carolina (UNC) Chapel Hill. He received his PhD from UC Berkeley in 2013 (where he was advised by Dan Klein) and his BTech from IIT Kanpur in 2008. His research expertise is in statistical natural language processing and machine learning, with a particular focus on multimodal, grounded, and embodied semantics (i.e., language with vision and speech, for robotics), human-like language generation and Q&A/dialogue, and interpretable and generalizable deep learning.

Yonatan Bisk is an Assistant Professor in the Language Technologies Institute at Carnegie Mellon University. Prior to that he received his PhD from the University of Illinois at Urbana-Champaign and spent time at USC’s ISI, University of Washington and Microsoft Research. His research focus is on grounded and embodied natural language understanding.

Joyce Y. Chai is a Professor at the University of Michigan. Her research interests are in the area of artificial intelligence, particularly natural language processing, situated dialogue agents, human-robot communication, and intelligent user interfaces. Her recent work has focused on grounded language processing to facilitate situated communication with robots and other artificial agents. Prior to joining UM, she was a professor at MSU directing the Language and Interaction Research Lab. At UM, she is a member of the Michigan AI Lab and directs the Situated Language and Embodied Dialogue (SLED) research group. She is also affiliated with the Michigan Robotics Institute.

Yejin Choi is an associate professor at the Paul G. Allen School of Computer Science & Engineering at the University of Washington with the Brett Helsel Career Development Professorship, an adjunct of the Linguistics department, and an affiliate of the Center for Statistics and Social Sciences. She is also a senior research manager at the Allen Institute for Artificial Intelligence overseeing the project Mosaic on Commonsense Intelligence. She is a co-recipient of the AAAI Outstanding Paper Award in 2020 and the Marr Prize (best paper award) at ICCV 2013, a recipient of the Borg Early Career Award (BECA) in 2018, and was named among IEEE AI's 10 to Watch in 2016.

Raymond J. Mooney is a Professor in the Department of Computer Science at the University of Texas at Austin. He received his Ph.D. in 1988 from the University of Illinois at Urbana/Champaign. He is an author of over 160 published research papers, primarily in the areas of machine learning and natural language processing. He was the President of the International Machine Learning Society from 2008 to 2011, program co-chair for AAAI 2006, general chair for HLT-EMNLP 2005, and co-chair for ICML 1990. He is a Fellow of the American Association for Artificial Intelligence, the Association for Computing Machinery, and the Association for Computational Linguistics, and the recipient of best paper awards from AAAI-96, KDD-04, ICML-05 and ACL-07.

Anja Rohrbach is a Research Scientist at UC Berkeley, working with Prof. Trevor Darrell. She completed her PhD at the Max Planck Institute for Informatics under the supervision of Prof. Bernt Schiele. Her research is at the intersection of vision and language. She is interested in a variety of tasks, including image and video description, visual grounding, and visual question answering. Recently, she has been focusing on building explainable models and addressing bias in existing vision and language models.

Kate Saenko is an Associate Professor of Computer Science at Boston University and a consulting professor for the MIT-IBM Watson AI Lab. She leads the Computer Vision and Learning Group at BU, is the founder and co-director of the Artificial Intelligence Research (AIR) initiative, and is a member of the Image and Video Computing research group. Kate received a PhD from MIT and did her postdoctoral training at UC Berkeley and Harvard. Her research interests are in the broad area of Artificial Intelligence with a focus on dataset bias, adaptive machine learning, learning for image and language understanding, and deep learning.

William Wang is the Director of UC Santa Barbara's Natural Language Processing group and Center for Responsible Machine Learning. He is the Duncan and Suzanne Mellichamp Chair in Artificial Intelligence and Designs, and an Assistant Professor in the Department of Computer Science at the University of California, Santa Barbara. He received his PhD from the School of Computer Science, Carnegie Mellon University. He has broad interests in machine learning approaches to data science, including statistical relational learning, information extraction, computational social science, speech, and vision.


Xin (Eric) Wang

UC Santa Cruz

Ronghang Hu

Facebook AI Research

Drew Hudson


Tsu-Jui Fu

UC Santa Barbara

Marcus Rohrbach

Facebook AI Research

Daniel Fried

UC Berkeley


Contact the Organizing Committee: alvr2021_naacl2021@softconf.com

Program Committee

  • Shubham Agarwal, Heriot-Watt University
  • Arjun Akula, University of California, Los Angeles
  • Asma Ben Abacha
  • Luciana Benotti, Universidad Nacional de Córdoba
  • Khyathi Raghavi Chandu, Carnegie Mellon University
  • Angel Chang, Stanford University
  • Dhivya Chinnappa, Thomson Reuters
  • Abhishek Das, Facebook AI
  • Simon Dobnik, University of Gothenburg
  • Thoudam Doren Singh, National Institute of Technology Silchar, India
  • Hamed Firooz, Facebook AI
  • Zhe Gan, Microsoft
  • Cristina Garbacea, University of Michigan
  • Jack Hessel, AI2
  • Gabriel Ilharco, University of Washington
  • Shailza Jolly, TU Kaiserslautern, Germany
  • Marimuthu Kalimuthu, Saarland University, Saarland Informatics Campus
  • Noriyuki Kojima, Cornell University
  • Christopher Kümmel, Beuth University of Applied Sciences Berlin
  • Loitongbam Sanayai Meetei, National Institute of Technology Silchar, India
  • Khanh Nguyen, University of Maryland
  • Yulei Niu, Renmin University of China
  • Aishwarya Padmakumar, University of Texas, Austin
  • Hamid Palangi, Microsoft Research
  • Shruti Palaskar, Carnegie Mellon University
  • Vikas Raunak, Carnegie Mellon University
  • Arka Sadhu, University of Southern California
  • Alok Singh, National Institute of Technology Silchar, India
  • Alane Suhr, Cornell University
  • Hao Tan, University of North Carolina
  • Xiangru Tang, University of the Chinese Academy of Sciences, China
  • Ece Takmaz, University of Amsterdam