Language and vision research has attracted great attention from both natural language processing (NLP) and computer vision (CV) researchers. Gradually, this area is shifting from passive perception, templated language, and synthetic imagery/environments to active perception, natural language, and photo-realistic simulation or real-world deployment.
This workshop covers (but is not limited to) the following topics:
|8:30-8:35||Opening Remarks||Workshop Organizers|
|8:35-9:10||Invited Talk 1: Instructions, Abstraction, and Theory-of-Mind
This talk focuses on an overview of our recent environments and benchmarks: ALFRED and ALFWorld for instruction following in embodied and abstract action spaces. The goal is to help move the community towards building agents that connect language to action and understand abstract plans. As we move towards systems which interact with the world, we also need to think about how they interact with other agents. I close with a quick preview of upcoming work at ICML on Theory-of-Mind agents based in the ALFWorld environment.
|9:10-9:45||Invited Talk 2: Generalization in Vision and Language Reasoning
A key challenge for Artificial Intelligence research is to go beyond static observational data and consider more challenging settings that involve dynamic actions and incremental decision-making. In this talk, I will introduce our recent work on visually-grounded language reasoning through studies of vision-and-language navigation. In particular, I will emphasize two main benefits of self-supervised learning that improve generalization: (1) creating counterfactuals to augment observational data, and (2) enabling transfer learning for challenging settings. I will present our empirical results on indoor and outdoor navigation datasets, and demonstrate the effectiveness of our proposed Adversarial Path Sampler and Multimodal Text Style Transfer approaches for vision-and-language navigation.
|10:10-10:50||Poster Session 1
|10:50-11:25||Invited Talk 3: If Bears were Bees and Cats were Researchers
Human knowledge and use of language is inextricably connected to perception, action and the organization of the brain, yet natural language processing is still dominated by text! More research involving language, including speech, in the context of other modalities and environments is needed, and there has never been a better time to do it. In my talk, I’ll share some of my team’s work on these topics, particularly around language-image retrieval, text-to-image generation and vision and language navigation.
That said, this talk will mostly be a Complaining Discussion about vision and language research. Rather than giving a glossy and triumphant overview of some papers, I’d like to include additional observations and questions regarding current research business-as-usual, injecting my own split perspective as an empiricist in natural language processing and as a person with both anthropological and linguistic training, perspectives and predilections. My hope is that through some Complaining and Introspection, I can provide useful reminders of why this research area is still so hard, so wide open, so important and so exciting.
|11:25-12:00||Invited Talk 4: Action Learning and Justification Using Language
Task learning from natural language instructions involves understanding, learning, and justification of perceived actions. In this talk, I will talk about some work done in my lab that incorporates commonsense knowledge in language grounding, action learning and justification. I will also introduce our recent work on hierarchical task learning using the ALFRED dataset and discuss key challenges and opportunities.
|12:00-12:30||Panel Session 1
||Jason Baldridge, Joyce Chai, Kate Saenko, William Wang|
|13:20-13:55||Invited Talk 5: Separating Skills and Concepts for Novel Visual Question Answering||Kate Saenko|
|13:55-14:30||Invited Talk 6: Natural Language Explanations of Deep Networks
Despite major efforts in recent years to improve explainability of deep neural networks, the tools we use for communicating explanations have largely remained the same: visualizations of representative inputs, salient input regions, and local model approximations. But when humans describe complex decision rules, we often use a different explanatory tool: natural language. I'll describe recent work on explaining models for computer vision tasks by automatically constructing natural language descriptions of individual neurons. These descriptions ground prediction in meaningful perceptual and linguistic abstractions, and can be used to surface unexpected model behaviors, identify adversarial vulnerabilities, and even guide text-based image editing. These results show that fine-grained, automatic annotation of deep network models is both possible and practical: rich, language-based explanations produced by automated annotation procedures can surface meaningful and actionable information about model behavior.
|14:30-15:05||Invited Talk 7: Sherlock and Merlot: Multimodal Abductive Reasoning and Neural Script Knowledge
In this talk, I will present Sherlock, a new dataset and tasks for multimodal abductive reasoning, and Merlot, a new multimodal model for neural script knowledge that achieves SOTA on 12+ video-based QA benchmarks.
|15:05-15:45||Poster Session 2
|15:45-16:20||Invited Talk 8: Explanations for Visual Question Answering
This talk will review several aspects of our recent work on explanation for VQA. First, we have developed a VQA system that can elucidate its answers with multi-modal natural-language and visual explanations that faithfully reflect important aspects of its underlying reasoning while capturing the style of comprehensible human explanations. Crowd-sourced human evaluation of these explanations has demonstrated the advantages of our approach. Second, we have developed methods that use human-provided visual or textual explanations to aid the training of VQA systems and improve their robustness to changing problem distributions. Finally, we have developed a novel framework that constructs explanations for multiple potential answers for a VQA problem and analyzes and compares these competing explanations to improve both the accuracy of the system and the quality of its explanations as evaluated by human judges.
|16:20-16:55||Invited Talk 9: New Frontiers in Vision and Language Research
Vision and Language research is rapidly evolving, in terms of both the methods and the applications. I will start by presenting a benchmarking study on the benefits of CLIP, a recent powerful multimodal model, for the traditional Vision and Language tasks. Then, I will talk about several new application scenarios for Vision and Language research. First is our work on incorporating language into a formerly “vision-only” task of video summarization. Second, I will discuss how we can leverage language to address bias in visual classifiers. Lastly, I will talk about automatically generating and detecting out-of-context multimodal media, an emerging misinformation threat scenario.
|16:55-17:30||Invited Talk 10: Knowledgeable and Spatio-Temporal Vision+Language||Mohit Bansal|
|17:30-18:00||Panel Session 2
||Jacob Andreas, Mohit Bansal, Yejin Choi, Ray Mooney, Anja Rohrbach|
Jacob Andreas is the X Consortium Career Development Assistant Professor at MIT. His research focuses on building intelligent systems that can communicate effectively using language and learn from human guidance. Jacob earned his Ph.D. from UC Berkeley, his M.Phil. from Cambridge (where he studied as a Churchill scholar) and his B.S. from Columbia. He has been the recipient of an NSF graduate fellowship, a Facebook fellowship, and paper awards at NAACL and ICML.
Jason Baldridge is a research scientist at Google, where he works on natural language understanding. He was previously an Associate Professor of Computational Linguistics at the University of Texas at Austin. His main research interests include categorial grammars, parsing, semi-supervised learning for NLP, reference resolution and text geolocation. He has long been active in the creation and promotion of open source software for natural language processing, including co-creating the Apache OpenNLP Toolkit and OpenCCG. Jason received his Ph.D. from the University of Edinburgh in 2002, where his doctoral dissertation on Multimodal Combinatory Categorial Grammar was awarded the 2003 Beth Dissertation Prize from the European Association for Logic, Language and Information.
Dr. Mohit Bansal is the John R. & Louise S. Parker Associate Professor and the Director of the MURGe-Lab (UNC-NLP Group) in the Computer Science department at University of North Carolina (UNC) Chapel Hill. He received his PhD from UC Berkeley in 2013 (where he was advised by Dan Klein) and his BTech from IIT Kanpur in 2008. His research expertise is in statistical natural language processing and machine learning, with a particular focus on multimodal, grounded, and embodied semantics (i.e., language with vision and speech, for robotics), human-like language generation and Q&A/dialogue, and interpretable and generalizable deep learning.
Yonatan Bisk is an Assistant Professor in the Language Technologies Institute at Carnegie Mellon University. Prior to that he received his PhD from the University of Illinois at Urbana-Champaign and spent time at USC’s ISI, University of Washington and Microsoft Research. His research focus is on grounded and embodied natural language understanding.
Joyce Y. Chai is a Professor at the University of Michigan. Her research interests are in the area of artificial intelligence, particularly natural language processing, situated dialogue agents, human-robot communication, and intelligent user interfaces. Her recent work has focused on grounded language processing to facilitate situated communication with robots and other artificial agents. Prior to joining UM, she was a professor at MSU directing the Language and Interaction Research Lab. At UM, she is a member of the Michigan AI Lab and directs the Situated Language and Embodied Dialogue (SLED) research group. She is also affiliated with the Michigan Robotics Institute.
Yejin Choi is an associate professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington with the Brett Helsel Career Development Professorship, an adjunct of the Linguistics department, and an affiliate of the Center for Statistics and Social Sciences. She is also a senior research manager at the Allen Institute for Artificial Intelligence overseeing the project Mosaic on Commonsense Intelligence. She is a co-recipient of the AAAI Outstanding Paper Award in 2020 and the Marr Prize (best paper award) at ICCV 2013, a recipient of the Borg Early Career Award (BECA) in 2018, and was named among IEEE AI's 10 to Watch in 2016.
Raymond J. Mooney is a Professor in the Department of Computer Science at the University of Texas at Austin. He received his Ph.D. in 1988 from the University of Illinois at Urbana/Champaign. He is an author of over 160 published research papers, primarily in the areas of machine learning and natural language processing. He was the President of the International Machine Learning Society from 2008-2011, program co-chair for AAAI 2006, general chair for HLT-EMNLP 2005, and co-chair for ICML 1990. He is a Fellow of the American Association for Artificial Intelligence, the Association for Computing Machinery, and the Association for Computational Linguistics and the recipient of best paper awards from AAAI-96, KDD-04, ICML-05 and ACL-07.
Anja Rohrbach is a Research Scientist at UC Berkeley, working with Prof. Trevor Darrell. She completed her PhD at the Max Planck Institute for Informatics under the supervision of Prof. Bernt Schiele. Her research is at the intersection of vision and language. She is interested in a variety of tasks, including image and video description, visual grounding, and visual question answering. Recently, she has been focusing on building explainable models and addressing bias in existing vision and language models.
Kate Saenko is an Associate Professor of Computer Science at Boston University and a consulting professor for the MIT-IBM Watson AI Lab. She leads the Computer Vision and Learning Group at BU, is the founder and co-director of the Artificial Intelligence Research (AIR) initiative, and is a member of the Image and Video Computing research group. Kate received a PhD from MIT and did her postdoctoral training at UC Berkeley and Harvard. Her research interests are in the broad area of Artificial Intelligence with a focus on dataset bias, adaptive machine learning, learning for image and language understanding, and deep learning.
William Wang is the Director of UC Santa Barbara's Natural Language Processing group and Center for Responsible Machine Learning. He is the Duncan and Suzanne Mellichamp Chair in Artificial Intelligence and Designs, and an Assistant Professor in the Department of Computer Science at the University of California, Santa Barbara. He received his PhD from the School of Computer Science at Carnegie Mellon University. He has broad interests in machine learning approaches to data science, including statistical relational learning, information extraction, computational social science, speech, and vision.
Contact the Organizing Committee: email@example.com
|University of California, Los Angeles|
|Universidad Nacional de Córdoba|
|Carnegie Mellon University|
|University of Gothenburg|
|National Institute of Technology, Silchar, India|
|University of Michigan|
|University of Washington|
|TU Kaiserslautern, Germany|
|Saarland University, Saarland Informatics Campus|
|Beuth University of Applied Sciences Berlin|
|National Institute of Technology, Silchar, India|
|University of Maryland|
|Renmin University of China|
|University of Texas, Austin|
|Carnegie Mellon University|
|Carnegie Mellon University|
|University of Southern California|
|National Institute of Technology, Silchar, India|
|University of North Carolina|
|University of the Chinese Academy of Sciences, China|
|University of Amsterdam|