Workshop on Advances in Language and Vision Research (ALVR)

In conjunction with ACL 2020

July 9th 2020 (Full Day)

Location: Virtual

ACL 2020 Workshop on Advances in Language and Vision Research

Language and vision research has attracted great attention from both natural language processing (NLP) and computer vision (CV) researchers. Gradually, this area is shifting from passive perception, templated language, and synthetic imagery/environments to active perception, natural language, and photo-realistic simulation or real-world deployment. Thus far, few workshops on language and vision research have been organized by groups from the NLP community. We propose the first workshop on Advances in Language and Vision Research (ALVR) to promote the frontier of language and vision research and to bring interested researchers together to discuss how best to tackle and solve real-world problems in this area.

This workshop covers (but is not limited to) the following topics:

  • New tasks and datasets that provide real-world solutions at the intersection of NLP and CV;
  • Language-guided interaction with the real world, such as navigation via instruction following or dialogue;
  • External knowledge integration in visual and language understanding;
  • Visually grounded multilingual study, for example multimodal machine translation;
  • Shortcomings of existing language and vision tasks and datasets;
  • Benefits of using multimodal learning in downstream NLP tasks;
  • Self-supervised representation learning in language and vision;
  • Transfer learning (including few/zero-shot learning) and domain adaptation;
  • Cross-modal learning beyond image understanding, such as video and audio;
  • Multidisciplinary study that may involve linguistics, cognitive science, robotics, etc.

Video-guided Machine Translation (VMT) Challenge

We also hold the first Video-guided Machine Translation (VMT) Challenge. This challenge aims to benchmark progress towards models that translate a source-language sentence into the target language, using video information as additional spatiotemporal context. The challenge is based on the recently released large-scale multilingual video description dataset, VATEX. The VATEX dataset contains over 41,250 videos and 825,000 high-quality captions in both English and Chinese, half of which are English-Chinese translation pairs.

Winners will be announced and awarded at the workshop.

Please refer to the VMT Challenge website for additional details (such as participation, dates, and starter code)!

REVERIE Challenge

The objective of the REVERIE Challenge is to benchmark the state of the art for the remote object grounding task defined in our CVPR paper, in the hope that it might drive progress towards more flexible and powerful human interactions with robots.

The REVERIE task requires an intelligent agent to correctly localise a remote target object (one that cannot be observed from the starting location) specified by a concise high-level natural language instruction, such as 'bring me the blue cushion from the sofa in the living room'. Since the target object is in a different room from the starting one, the agent first needs to navigate to the goal location. When the agent decides to stop, it should select one object from a list of candidates provided by the simulator. The agent can attempt to localise the target at any step; when to do so is entirely up to the algorithm design. However, the agent may output an answer only once per episode, so it gets a single guess in each run.
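The single-answer rule described above can be illustrated with a small toy sketch. This is not the REVERIE simulator or its API: the `Episode` class, the `run_episode` helper, the object IDs, and the policy signature are all hypothetical names invented for illustration, and navigation is reduced to a step counter. The candidate list is also kept fixed for the whole episode, which is a simplification.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Episode:
    """Hypothetical stand-in for one run; NOT the real REVERIE simulator API."""
    instruction: str
    candidates: List[str]   # object IDs offered by the toy "simulator"
    target: str             # ground-truth object ID
    answer: Optional[str] = field(default=None, init=False)

    def submit(self, object_id: str) -> None:
        """Record the agent's one and only output for this episode."""
        if self.answer is not None:
            raise RuntimeError("only one output is allowed per episode")
        if object_id not in self.candidates:
            raise ValueError("answer must be one of the provided candidates")
        self.answer = object_id

    def success(self) -> bool:
        return self.answer == self.target


# A policy returns an object ID to stop and answer, or None to keep navigating.
Policy = Callable[[int, Episode], Optional[str]]


def run_episode(episode: Episode, policy: Policy, max_steps: int = 10) -> bool:
    """Step the policy until it answers or the (assumed) step budget runs out."""
    for step in range(max_steps):
        choice = policy(step, episode)
        if choice is not None:
            episode.submit(choice)  # the single guess; the episode ends here
            break
    return episode.success()


episode = Episode(
    instruction="bring me the blue cushion from the sofa in the living room",
    candidates=["cushion_blue", "cushion_red", "lamp"],
    target="cushion_blue",
)
# Toy policy: "navigate" for two steps, then commit to an answer.
print(run_episode(episode, lambda step, ep: "cushion_blue" if step == 2 else None))
```

Because `submit` raises on a second call, an agent that answers too early cannot correct itself later, which is the trade-off the task description leaves to the algorithm designer.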

Please visit the REVERIE Challenge website for more details!


The workshop includes an archival and a non-archival track on topics related to language-and-vision research. For both tracks, the reviewing process is single-blind. That is, the reviewer will know the authors but not the other way around. Submission is electronic, using the Softconf START conference management system. The submission site will be available at

If you are interested in taking a more active part in the workshop, we also encourage you to apply to join the program committee and participate in reviewing submissions following this link: Qualified reviewers will be selected based on prior reviewing experience and publication record.

Archival Track

The archival track follows the ACL short paper format. Submissions to the archival track may consist of up to 4 pages of content in ACL format (style sheets are available below), plus unlimited references. Accepted papers will be given 5 content pages for the camera-ready version. Authors are encouraged to use this additional page to address reviewers’ comments in their final versions. Papers accepted to the archival track will be included in the ACL 2020 Workshop proceedings. The archival track does not accept double submissions, i.e., no previously published papers or concurrent submissions to other conferences or workshops.

Papers submitted to the archival track must follow the ACL Author Guidelines. Style sheets (LaTeX, Word) are available here. An Overleaf template is also available here.

Non-archival Track

The workshop also includes a non-archival track to allow submission of previously published papers and double submission to ALVR and other conferences or journals. Accepted non-archival papers can still be presented as posters at the workshop.

There are no formatting or page restrictions for non-archival submissions. The accepted papers to the non-archival track will be displayed on the workshop website, but will NOT be included in the ACL 2020 Workshop proceedings or otherwise archived.

Important Dates

  • Submission Deadline (extended): April 16, 2020 (11:59pm Anywhere on Earth time, UTC-12)
  • Notification: May 11, 2020
  • Camera Ready deadline: May 21, 2020 (11:59pm Anywhere on Earth time, UTC-12)
  • Workshop Day: July 9, 2020


Papers accepted to the archival track:

  • Extending ImageNet to Arabic using Arabic WordNet - Abdulkareem Alsudais (PDF | Recorded Talk)
  • Toward General Scene Graph: Integration of Visual Semantic Knowledge with Entity Synset Alignment - Woo Suk Choi, Kyoung-Woon On, Yu-Jung Heo and Byoung-Tak Zhang (PDF | Recorded Talk)
  • Visual Question Generation from Radiology Images - Mourad Sarrouti, Asma Ben Abacha and Dina Demner-Fushman (PDF | Recorded Talk | Slides & Video)
  • On the role of effective and referring questions in GuessWhat?! - Mauricio Mazuecos, Alberto Testoni, Raffaella Bernardi and Luciana Benotti (PDF | Recorded Talk)
  • Latent Alignment of Procedural Concepts in Multimodal Recipes - Hossein Rajaby Faghihi, Roshanak Mirzaee, Sudarshan Paliwal and Parisa Kordjamshidi (PDF | Recorded Talk)

Papers accepted to the non-archival track:

  • Pix2R: Guiding Reinforcement Learning using Natural Language by Mapping Pixels to Rewards - Prasoon Goyal, Scott Niekum and Raymond Mooney (PDF | Recorded Talk)
  • TextCaps: a Dataset for Image Captioning with Reading Comprehension - Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach and Amanpreet Singh (PDF | Recorded Talk)
  • Improving VQA and its Explanations by Comparing Competing Explanations - Jialin Wu, Liyan Chen and Raymond Mooney (PDF)
  • Bridging Languages through Images with Deep Partial Canonical Correlation Analysis - Guy Rotman, Ivan Vulić and Roi Reichart (PDF)
  • Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling - Tsu-Jui Fu, Xin Wang, Matthew Peterson, Scott Grafton, Miguel Eckstein and William Yang Wang (PDF | Recorded Talk | Slides)
  • Measuring Social Biases in Grounded Vision and Language Embeddings - Candace Ross, Boris Katz and Andrei Barbu (PDF)
  • Exploring Phrase Grounding without Training: Contextualisation and Extension to Text-Based Image Retrieval - Letitia Parcalabescu and Anette Frank (PDF | Recorded Talk)
  • What is Learned in Visually Grounded Neural Syntax Acquisition - Noriyuki Kojima, Hadar Averbuch-Elor, Alexander Rush and Yoav Artzi (PDF | Recorded Talk)
  • Learning to Map Natural Language Instructions to Physical Quadcopter Control Using Simulated Flight - Valts Blukis, Yannick Terme, Eyvind Niklasson, Ross Knepper and Yoav Artzi (PDF | Demo Video | Poster)
  • Learning Latent Graph Representations for Relational VQA - Liyan Chen and Raymond Mooney (PDF)
  • Entity Skeletons for Visual Storytelling - Khyathi Raghavi Chandu, Ruo-Ping Dong and Alan W Black (PDF)
  • Keyframe Segmentation and Positional Encoding for Video-guided Machine Translation Challenge 2020 - Tosho Hirasawa, Zhishen Yang, Mamoru Komachi, Naoaki Okazaki (PDF | Recorded Talk)
  • DeepFuse: HKU’s Multimodal Machine Translation System for VMT’20 - Zhiyong Wu (PDF | Recorded Talk)
  • Enhancing Neural Machine Translation with Multimodal Rewards - Yuqing Song, Shizhe Chen, Qin Jin (PDF | Recorded Talk)

Program (PDT)

ATTENTION! The video recording of the whole workshop is located at Some pre-recorded talks are listed below, while others are included in the workshop recording.

8:20-8:25 Opening Remarks [slides] Workshop Organizers
8:25-9:10 Grounding Natural Language to 3D [slides]
Invited Talk & QA
Angel Chang
9:10-9:55 Challenges in Evaluating Vision and Language Tasks
Invited Talk & QA
Lucia Specia
9:55-10:40 Multimodal AI: Understanding Human Behaviors [slides]
Invited Talk & QA
Louis-Philippe Morency
10:50-11:35 Robot Control in Situated Instruction Following [video] [slides]
Invited Talk & QA
Yoav Artzi
11:35-11:45 Video-guided Machine Translation (VMT) Challenge [slides] Xin Wang
11:45-12:10 VMT Challenge Talk:
  • Keyframe Segmentation and Positional Encoding for Video-guided Machine Translation [video] - Tosho Hirasawa et al.
  • DeepFuse: HKU’s Multimodal Machine Translation System for VMT’20 [video] - Zhiyong Wu
  • Enhancing Neural Machine Translation with Multimodal Rewards [video] - Yuqing Song et al.
12:10-12:20 VMT Challenge Live QA All the Challenge Presenters
13:30-14:15 Augment Machine Intelligence with Multimodal Information [video]
Invited Talk & QA
Zhou Yu
14:15-15:00 Dungeons and DQNs: Grounding Language in Shared Experience [video (SlidesLive)] [video (YouTube)]
Invited Talk & QA
Mark Riedl
15:00-15:15 REVERIE Challenge [video] Yuankai Qi
15:15-15:35 REVERIE Challenge Winner Talk:
  • Distance-aware and Robust Network with Wandering Reducing Strategy for REVERIE [video] - Chen Gao et al.
15:35-15:45 REVERIE Challenge Live QA All the Challenge Presenters
16:00-16:45 Vision+Language Research: Self-supervised Learning, Adversarial Training, Multimodal Inference and Explainability [video] [slides]
Invited Talk & QA
Jingjing Liu
16:45-17:10 Archival Track Recorded Talks [link]
17:10-17:45 Poster Session and QA All the Workshop Paper Authors

Invited Speakers

(presentation order)

Angel Chang is an Assistant Professor at Simon Fraser University. Prior to this, she was a visiting research scientist at Facebook AI Research and a research scientist at Eloquent Labs working on dialogue. She received her Ph.D. in Computer Science from Stanford, where she was part of the Natural Language Processing Group and advised by Chris Manning. Her research focuses on connecting language to 3D representations of shapes and scenes and addresses grounding of language for embodied agents in indoor environments. In general, she is interested in the semantics of shapes and scenes, the representation and acquisition of common sense knowledge, and reasoning using probabilistic models.

Lucia Specia is a Professor in the Department of Computing, Faculty of Engineering, Imperial College London. Her research focuses on various aspects of data-driven approaches to natural language processing, with a particular interest in multimodal and multilingual context models and work at the intersection of language and vision. Her work has various applications such as machine translation, image captioning, quality estimation and text adaptation.

Louis-Philippe Morency is a Leonardo Associate Professor at the CMU Language Technologies Institute, where he leads the Multimodal Communication and Machine Learning Laboratory (MultiComp Lab). He was previously Research Faculty at the USC Computer Science Department. He received his Ph.D. in Computer Science from the MIT Computer Science and Artificial Intelligence Laboratory. His research focuses on building the computational foundations that enable computers to analyze, recognize and predict subtle human communicative behaviors during social interactions. Central to this research effort is the technical challenge of multimodal machine learning: a mathematical foundation for studying heterogeneous multimodal data and the contingency often found between modalities.

Yoav Artzi is an Assistant Professor in the Department of Computer Science and Cornell Tech at Cornell University. He received an NSF CAREER award and paper awards at EMNLP 2015, ACL 2017, and NAACL 2018. Previously, he received a B.Sc. summa cum laude from Tel Aviv University and a Ph.D. from the University of Washington. He works at the intersection of natural language processing, machine learning, vision, and robotics. His current main research focus is algorithms for natural language understanding, with a specific interest in situated interactions.

Zhou Yu is an Assistant Professor in the Computer Science Department at the University of California, Davis, where she is the director of the Davis NLP Lab. She received her PhD from the Language Technologies Institute in the School of Computer Science, Carnegie Mellon University, in 2017. She designs algorithms for real-time intelligent interactive systems that coordinate with user actions beyond spoken language, including non-verbal behaviors, to achieve effective and natural communication. In particular, she optimizes human-machine communication via studies of multimodal sensing and analysis, speech and natural language processing, machine learning and human-computer interaction.

Mark Riedl is an Associate Professor in the Georgia Tech School of Interactive Computing and director of the Entertainment Intelligence Lab. Mark's research focuses on the intersection of artificial intelligence, virtual worlds, and storytelling. Mark earned a PhD degree in 2004 from North Carolina State University. From 2004 to 2007, Mark was a Research Scientist at the University of Southern California Institute for Creative Technologies, where he researched and developed interactive, narrative-based training systems. His research is supported by the NSF, DARPA, the U.S. Army, Google, and Disney. Mark was the recipient of a DARPA Young Faculty Award and an NSF CAREER Award.

Jingjing (JJ) Liu is a Senior Principal Research Manager at Microsoft, leading a research group in Multimodal AI. Her current research interests center on Vision+Language Multimodal Intelligence, the intersection between NLP and Computer Vision. Dr. Liu received her PhD degree in Computer Science from MIT EECS. She also holds an MBA degree from Judge Business School (JBS) at the University of Cambridge. Before joining MSR, Dr. Liu was a Research Scientist at MIT CSAIL and Director of Product at the unicorn startup Mobvoi Inc.


  • Xin (Eric) Wang - UC Santa Cruz
  • Jesse Thomason - University of Washington
  • Ronghang Hu - UC Berkeley
  • Xinlei Chen - Facebook AI Research
  • Peter Anderson - Google Research
  • Qi Wu - University of Adelaide
  • Asli Celikyilmaz - Microsoft Research
  • Jason Baldridge - Google Research
  • William Wang - UC Santa Barbara

Program Committee

  • Jacob Andreas - MIT
  • Angel Chang - Simon Fraser University
  • Devendra Chaplot - CMU
  • Abhishek Das - Georgia Tech
  • Daniel Fried - UC Berkeley
  • Zhe Gan - Microsoft
  • Christopher Kanan - Rochester Institute of Technology
  • Jiasen Lu - Georgia Tech
  • Ray Mooney - University of Texas, Austin
  • Khanh Nguyen - University of Maryland
  • Aishwarya Padmakumar - University of Texas, Austin
  • Hamid Palangi - Microsoft Research
  • Alessandro Suglia - Heriot-Watt University
  • Vikas Raunak - CMU
  • Volkan Cirik - CMU
  • Parminder Bhatia - Amazon
  • Khyathi Raghavi Chandu - CMU
  • Asma Ben Abacha
  • Thoudam Doren Singh - National Institute of Technology, Silchar, India
  • Dhivya Chinnappa - Thomson Reuters
  • Shailza Jolly - TU Kaiserslautern
  • Alok Singh - National Institute of Technology, Silchar, India
  • Mohamed Elhoseiny
  • Marimuthu Kalimuthu - Saarland University
  • Simon Dobnik - University of Gothenburg
  • Shruti Palaskar - CMU
  • Yuankai Qi - University of Adelaide


Contact the Organizing Committee: