ALVR 2021

Schedule (PDT, UTC-7)

Time (PDT)	Event	Who
8:30-8:35	Opening Remarks	Workshop Organizers
8:35-9:10	Invited Talk 1: Instructions, Abstraction, and Theory-of-Mind Abstract This talk focuses on an overview of our recent environments and benchmarks: ALFRED and ALFWorld for instruction following in embodied and abstract action spaces. The goal is to help move the community towards building agents that connect language to action and understand abstract plans. As we move towards systems which interact with the world, we also need to think about how they interact with other agents. I close with a quick preview of upcoming work at ICML on Theory-of-Mind agents based in the ALFWorld environment.	Yonatan Bisk
9:10-9:45	Invited Talk 2: Generalization in Vision and Language Reasoning Abstract A key challenge for Artificial Intelligence research is to go beyond static observational data, and consider more challenging settings that involve dynamic actions and incremental decision-making. In this talk, I will introduce our recent work on visually-grounded language reasoning via the studies of vision-and-language navigation. In particular, I will emphasize two main benefits of self-supervised learning that improve generalization by (1) creating counterfactuals to augment observational data; (2) enables transfer learning for challenging settings. I will present our empirical results on indoor and outdoor navigation datasets, and demonstrate the effectiveness of our proposed Adversarial Path Sampler and Multimodal Text Style Transfer approaches for vision-and-language navigation.	William Wang
9:45-10:10	Poster Highlight
10:10-10:50	Poster Session 1
10:50-11:25	Invited Talk 3: If Bears were Bees and Cats were Researchers Abstract Human knowledge and use of language is inextricably connected to perception, action and the organization of the brain, yet natural language processing is still dominated by text! More research involving language---including speech---in the context of other modalities and environments is needed, and there has never been a better time to do it. In my talk, I’ll share some of my team’s work on these topics, particularly around language-image retrieval, text-to-image generation and vision and language navigation. That said, this talk will mostly be a Complaining Discussion about vision and language research. Rather than giving a glossy and triumphant overview of some papers, I’d like to include additional observations and questions regarding current research business-as-usual, injecting my own split perspective as an empiricist in natural language processing and as a person with both anthropological and linguistic training, perspectives and predilections. My hope is that through some Complaining and Introspection, I can provide useful reminders of why this research area is still so hard, so wide open, so important and so exciting.	Jason Baldridge
11:25-12:00	Invited Talk 4: Action Learning and Justification Using Language Abstract Task learning from natural language instructions involves understanding, learning, and justification of perceived actions. In this talk, I will talk about some work done in my lab that incorporates commonsense knowledge in language grounding, action learning and justification. I will also introduce our recent work on hierarchical task learning using the ALFRED dataset and discuss key challenges and opportunities.	Joyce Chai
12:00-12:30	Panel Session 1	Jason Baldridge, Joyce Chai, Kate Saenko, William Wang
12:30-13:20	Lunch
13:20-13:55	Invited Talk 5: Separating Skills and Concepts for Novel Visual Question Answering	Kate Saenko
13:55-14:30	Invited Talk 6: Natural Language Explanations of Deep Networks Abstract Despite major efforts in recent years to improve explainability of deep neural networks, the tools we use for communicating explanations have largely remained the same: visualizations of representative inputs, salient input regions, and local model approximations. But when humans describe complex decision rules, we often use a different explanatory tool: natural language. I'll describe recent work on explaining models for computer vision tasks by automatically constructing natural language descriptions of individual neurons. These descriptions ground prediction in meaningful perceptual and linguistic abstractions, and can be used to surface unexpected model behaviors, identify adversarial vulnerabilities, and even guide text-based image editing. These results show that fine-grained, automatic annotation of deep network models is both possible and practical: rich, language-based explanations produced by automated annotation procedures can surface meaningful and actionable information about model behavior.	Jacob Andreas
14:30-15:05	Invited Talk 7: Sherlock and Merlot: Multimodal Abductive Reasoning and Neural Script Knowledge Abstract In this talk, I will present Sherlock, a new dataset and tasks for multimodal abductive reasoning, and Merlot, a new multimodal model for neural script knowledge that achieves SOTA on 12+ video-based QA benchmarks.	Yejin Choi
15:05-15:45	Poster Session 2
15:45-16:20	Invited Talk 8: Explanations for Visual Question Answering Abstract This talk will review several aspects of our recent work on explanation for VQA. First, we have developed a VQA system that can elucidate its answers with multi-modal natural-language and visual explanations that faithfully reflect important aspects of its underlying reasoning while capturing the style of comprehensible human explanations. Crowd-sourced human evaluation of these explanations have demonstrated the advantages of our approach. Second, we have developed methods that use human-provided visual or textual explanations to aid the training of VQA systems and improve their robustness to changing problem distributions. Finally, we have developed a novel framework that constructs explanations for multiple potential answers for a VQA problem and analyzes and compares these competing explanations to improve both the accuracy of the system as well as the quality of its explanations as evaluated by human judges.	Ray Mooney
16:20-16:55	Invited Talk 9: New Frontiers in Vision and Language Research Abstract Vision and Language research is rapidly evolving, in terms of both the methods and the applications. I will start by presenting a benchmarking study on the benefits of CLIP, a recent powerful multimodal model, for the traditional Vision and Language tasks. Then, I will talk about several new application scenarios for Vision and Language research. First, is our work on incorporating language into a formerly “vision-only” task of video summarization. Second, I will discuss how we can leverage language to address bias in visual classifiers. Lastly, I will talk about automatically generating and detecting out-of-context multimodal media, an emerging misinformation threat scenario.	Anja Rohrbach
16:55-17:30	Invited Talk 10: Knowledgeable and Spatio-Temporal Vision+Language	Mohit Bansal
17:30-18:00	Panel Session 2	Jacob Andreas, Mohit Bansal, Yejin Choi, Ray Mooney, Anja Rohrbach

Accepted Papers

Archival track:

Feature-level Incongruence Reduction for Multimodal Translation (PDF | Video)
Zhifeng Li, Yu Hong, Yuchen Pan, Jian Tang, Jianmin Yao and Guodong Zhou
Error Causal inference for Multi-Fusion models (PDF | Video)
Chengxi Li and Brent Harrison
Leveraging Partial Dependency Trees to Control Image Captions (PDF | Video)
Wenjie Zhong and Yusuke Miyao
Grounding Plural Phrases: Countering Evaluation Biases by Individuation (PDF | Video)
Julia Suter, Letitia Parcalabescu and Anette Frank
PanGEA: The Panoramic Graph Environment Annotation Toolkit (PDF)
Alexander Ku, Peter Anderson, Jordi Pont Tuset and Jason Baldridge
Learning to Learn Semantic Factors in Heterogeneous Image Classification (PDF | Video)
Boyue Fan and Zhenting Liu
Reference and coreference in situated dialogue (PDF | Video)
Sharid Loáiciga, Simon Dobnik and David Schlangen

Proceedings: https://www.aclweb.org/anthology/volumes/2021.alvr-1/

Non-archival track:

Interactive Learning from Activity Description (PDF)
Khanh Nguyen, Dipendra Misra, Robert Schapire, Miro Dudik and Patrick Shafto
Towards Multi-Modal Text-Image Retrieval to improve Human Reading (PDF)
Florian Schneider, Özge Alacam, Xintong Wang and Chris Biemann
Language-based Video Editing via Multi-Modal Multi-Level Transformer (PDF | Video)
Tsu-Jui Fu, Xin Wang, Scott Grafton, Miguel Eckstein and William Yang Wang
Learning to Select Question-Relevant Relations for Visual Question Answering
Hwanhee Lee, Jaewoong Lee, Heejoon Lee and Kyomin Jung
CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images (PDF)
Shailaja Keyur Sampat, Akshay Kumar, Yezhou Yang and Chitta Baral
Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation (PDF | Video)
Wanrong Zhu, Xin Wang, Tsu-Jui Fu, An Yan, Pradyumna Narayana, Kazoo Sone, Sugato Basu and William Yang Wang
What is Multimodality? (PDF | Video)
Letitia Parcalabescu, Nils Trost and Anette Frank
Pathdreamer: A World Model for Indoor Navigation (PDF | Video)
Jing Yu Koh, Honglak Lee, Yinfei Yang, Jason Baldridge and Peter Anderson
Visual Goal-Step Inference using wikiHow
Yue Yang, Artemis Panagopoulou, Qing Lyu, Li Zhang, Mark Yatskar and Chris Callison-Burch
Neural Event Semantics for Grounded Language Understanding (PDF | Video)
Shyamal Buch, Li Fei-Fei and Noah Goodman

Invited Speakers

Jacob Andreas

Jacob Andreas is the X Consortium Career Development Assistant Professor at MIT. His research focuses on building intelligent systems that can communicate effectively using language and learn from human guidance. Jacob earned his Ph.D. from UC Berkeley, his M.Phil. from Cambridge (where he studied as a Churchill scholar) and his B.S. from Columbia. He has been the recipient of an NSF graduate fellowship, a Facebook fellowship, and paper awards at NAACL and ICML.

Jason Baldridge

Jason is a research scientist at Google, where he works on natural language understanding. He was previously an Associate Professor of Computational Linguistics at the University of Texas at Austin. His main research interests include categorial grammars, parsing, semi-supervised learning for NLP, reference resolution and text geolocation. He has long been active in the creation and promotion of open source software for natural language processing, including co-creating the Apache OpenNLP Toolkit and OpenCCG. Jason received his Ph.D. from the University of Edinburgh in 2002, where his doctoral dissertation on Multimodal Combinatory Categorial Grammar was awarded the 2003 Beth Dissertation Prize from the European Association for Logic, Language and Information.

Mohit Bansal

Dr. Mohit Bansal is the John R. & Louise S. Parker Associate Professor and the Director of the MURGe-Lab (UNC-NLP Group) in the Computer Science department at University of North Carolina (UNC) Chapel Hill. He received his PhD from UC Berkeley in 2013 (where he was advised by Dan Klein) and his BTech from IIT Kanpur in 2008. His research expertise is in statistical natural language processing and machine learning, with a particular focus on multimodal, grounded, and embodied semantics (i.e., language with vision and speech, for robotics), human-like language generation and Q&A/dialogue, and interpretable and generalizable deep learning.

Yonatan Bisk

Yonatan Bisk is an Assistant Professor in the Language Technologies Institute at Carnegie Mellon University. Prior to that he received his PhD from the University of Illinois at Urbana-Champaign and spent time at USC’s ISI, University of Washington and Microsoft Research. His research focus is on grounded and embodied natural language understanding.

Joyce Y. Chai

Joyce Y. Chai is a Professor at University of Michigan. Her research interests are in the area of artificial intelligence, particularly on natural language processing, situated dialogue agents, human-robot communication, and intelligent user interfaces. Her recent work has focused on grounded language processing to facilitate situated communication with robots and other artificial agents. Prior to joining UM, she was a professor at MSU directing the Language and Interaction Research Lab . At UM, she is a member of Michigan AI Lab and directing the Situated Language and Embodied Dialogue (SLED) research group. She is also affiliated with Michigan Robotics Institute.

Yejin Choi

Yejin Choi is an associate professor of Paul G. Allen School of Computer Science & Engineering at the University of Washington with the Brett Helsel Career Development Professorship, adjunct of the Linguistics department, and affiliate of the Center for Statistics and Social Sciences. She is also a senior research manager at the Allen Institute for Artificial Intelligence overseeing the project Mosaic on Commonsense Intelligence. She is a co-recepient of the AAAI Outstanding Paper Award in 2020, the Marr Prize (best paper award) at ICCV 2013, a recepient of Borg Early Career Award (BECA) in 2018, and named among IEEE AI's 10 to Watch in 2016.

Raymond J. Mooney

Raymond J. Mooney is a Professor in the Department of Computer Science at the University of Texas at Austin. He received his Ph.D. in 1988 from the University of Illinois at Urbana/Champaign. He is an author of over 160 published research papers, primarily in the areas of machine learning and natural language processing. He was the President of the International Machine Learning Society from 2008-2011, program co-chair for AAAI 2006, general chair for HLT-EMNLP 2005, and co-chair for ICML 1990. He is a Fellow of the American Association for Artificial Intelligence, the Association for Computing Machinery, and the Association for Computational Linguistics and the recipient of best paper awards from AAAI-96, KDD-04, ICML-05 and ACL-07.

Anja Rohrbach

Anja Rohrbach is a Research Scientist at UC Berkeley, working with Prof. Trevor Darrell. She has completed her PhD at Max Planck Institute for Informatics under supervision of Prof. Bernt Schiele. Her research is at the intersection of vision and language. She is interested in a variety of tasks, including image and video description, visual grounding, visual question answering, etc. Recently, she is focusing on building explainable models and addressing bias in existing vision and language models.

Kate Saenko

Kate is an Associate Professor of Computer Science at Boston University and a consulting professor for the MIT-IBM Watson AI Lab. She leads the Computer Vision and Learning Group at BU, is the founder and co-director of the Artificial Intelligence Research (AIR) initiative, and member of the Image and Video Computing research group. Kate received a PhD from MIT and did her postdoctoral training at UC Berkeley and Harvard. Her research interests are in the broad area of Artificial Intelligence with a focus on dataset bias, adaptive machine learning, learning for image and language understanding, and deep learning.

William Wang

William Wang is the Director of UC Santa Barbara's Natural Language Processing group and Center for Responsible Machine Learning. He is the Duncan and Suzanne Mellichamp Chair in Artificial Intelligence and Designs, and an Assistant Professor in the Department of Computer Science at the University of California, Santa Barbara. He received his PhD from School of Computer Science, Carnegie Mellon University. He has broad interests in machine learning approaches to data science, including statistical relational learning, information extraction, computational social science, speech, and vision.

Shubham Agarwal	Heriot-Watt University
Arjun Akula	University of California, Los Angeles
Asma Ben Abacha	NIH/NLM
Luciana Benotti	Universidad Nacional de Córdoba
Khyathi Raghavi Chandu	Carnegie Mellon University
Angel Chang	Stanford University
Dhivya Chinnappa	Thomson Reuters
Abhishek Das	Facebook AI
Simon Dobnik	University of Gothenburg
Thoudam Doren Singh	National Institute of Technology, Silchar, India
Hamed Firooz	Facebook AI
Zhe Gan	Microsoft
Cristina Garbacea	University of Michigan
Jack Hessel	AI2
Gabriel Ilharco	University of Washington
Shailza Jolly	TU Kaiserslautern Germany
Marimuthu Kalimuthu	Saarland University, Saarland Informatics Campus
Noriyuki Kojima	Cornell University
Christopher Kümmel	Beuth University of Applied Sciences Berlin
Loitongbam Sanayai Meetei	National Institute of Technology Silchar, India
Khanh Nguyen	University of Maryland
Yulei Niu	Renmin University of China
Aishwarya Padmakumar	University of Texas, Austin
Hamid Palangi	Microsoft Research
Shruti Palaskar	Carnegie Mellon University
Vikas Raunak	Carnegie Mellon University
Arka Sadhu	University of Southern California
Alok Singh	National Institute of Technology, Silchar India
Alane Suhr	Cornell University
Hao Tan	University of North Carolina
Xiangru Tang	University of the Chinese Academy of Sciences, China
Ece Takmaz	University of Amsterdam

2^nd Workshop on Advances in Language and Vision Research (ALVR)

In conjunction with NAACL 2021
June 11^st 2021 (Full Day)
Location: Virtual

2^nd Workshop on Advances in Language and Vision Research

Schedule (PDT, UTC-7)

Accepted Papers

Archival track:

Non-archival track:

Invited Speakers

Jacob Andreas

Jason Baldridge

Mohit Bansal

Yonatan Bisk

Joyce Y. Chai

Yejin Choi

Raymond J. Mooney

Anja Rohrbach

Kate Saenko

William Wang

Organizers

Xin (Eric) Wang

UC Santa Cruz

Ronghang Hu

Facebook AI Research

Drew Hudson

Stanford

Tsu-Jui Fu

UC Santa Barbara

Marcus Rohrbach

Facebook AI Research

Daniel Fried

UC Berkeley

Contact

Program Committee

2nd Workshop on Advances in Language and Vision Research (ALVR)

In conjunction with NAACL 2021 June 11st 2021 (Full Day) Location: Virtual

2nd Workshop on Advances in Language and Vision Research

Schedule (PDT, UTC-7)

Accepted Papers

Archival track:

Non-archival track:

Invited Speakers

Organizers

UC Santa Cruz

Facebook AI Research

Stanford

UC Santa Barbara

Facebook AI Research

UC Berkeley

Contact

Program Committee

2^nd Workshop on Advances in Language and Vision Research (ALVR)

In conjunction with NAACL 2021
June 11^st 2021 (Full Day)
Location: Virtual

2^nd Workshop on Advances in Language and Vision Research