Workshop on Advances in Language and Vision Research (ALVR)

In conjunction with ACL 2020

July 9th 2020 (Full Day)

Location: Virtual

ACL 2020 Workshop on Advances in Language and Vision Research

Language and vision research has attracted great attention from both natural language processing (NLP) and computer vision (CV) researchers. Gradually, this area is shifting from passive perception, templated language, and synthetic imagery/environments to active perception, natural language, and photo-realistic simulation or real-world deployment. Thus far, few workshops on language and vision research have been organized by groups from the NLP community. We propose the first workshop on Advances in Language and Vision Research (ALVR) in order to push the frontier of language and vision research and to bring interested researchers together to discuss how to best tackle and solve real-world problems in this area.

This workshop covers (but is not limited to) the following topics:

  • New tasks and datasets that provide real-world solutions in the intersection of NLP and CV;
  • Language-guided interaction with the real world, such as navigation via instruction following or dialogue;
  • External knowledge integration in visual and language understanding;
  • Visually grounded multilingual study, such as multimodal machine translation;
  • Shortcomings of existing language and vision tasks and datasets;
  • Benefits of using multimodal learning in downstream NLP tasks;
  • Self-supervised representation learning in language and vision;
  • Transfer learning (including few/zero-shot learning) and domain adaptation;
  • Cross-modal learning beyond image understanding, such as video and audio;
  • Multidisciplinary study that may involve linguistics, cognitive science, robotics, etc.

Video-guided Machine Translation (VMT) Challenge

We also hold the first Video-guided Machine Translation (VMT) Challenge. This challenge aims to benchmark progress towards models that translate a source-language sentence into the target language, using video information as additional spatiotemporal context. The challenge is based on the recently released large-scale multilingual video description dataset, VATEX. The VATEX dataset contains over 41,250 videos and 825,000 high-quality captions in both English and Chinese, half of which are English-Chinese translation pairs.
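As a rough illustration of the task setup only, the sketch below shows one plausible way to condition a Transformer translation model on precomputed video features. The module names, dimensions, and the simple concatenation-based fusion are assumptions made for this sketch; it is not the official VATEX/VMT baseline.

```python
# Minimal sketch: fusing video features with a text encoder-decoder for
# video-guided MT. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class VideoGuidedTranslator(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=256, video_dim=1024):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        # Project precomputed clip-level video features into the model dimension.
        self.video_proj = nn.Linear(video_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_tokens, video_feats, tgt_tokens):
        # Encode source tokens and projected video segments as one joint memory,
        # so the decoder can attend to both modalities. Masking is omitted for brevity.
        memory = self.encoder(torch.cat(
            [self.src_embed(src_tokens), self.video_proj(video_feats)], dim=1))
        hidden = self.decoder(self.tgt_embed(tgt_tokens), memory)
        return self.out(hidden)
```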

Winners will be announced and awarded at the workshop.

Please refer to the VMT Challenge website for additional details (such as participation, dates, and starter code)!

REVERIE Challenge

The objective of the REVERIE Challenge is to benchmark the state of the art for the remote object grounding task defined in our CVPR paper, in the hope that it might drive progress towards more flexible and powerful human interactions with robots.

The REVERIE task requires an intelligent agent to correctly localise a remote target object (one that cannot be observed from the starting location) specified by a concise, high-level natural language instruction, such as 'bring me the blue cushion from the sofa in the living room'. Since the target object is in a different room from the starting one, the agent first needs to navigate to the goal location. When the agent decides to stop, it should select one object from a list of candidates provided by the simulator. The agent can attempt to localise the target at any step, which is entirely up to the algorithm design, but it is allowed only one output per episode, i.e., it can guess the answer only once in a single run.
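For readers unfamiliar with this kind of setup, the following minimal sketch illustrates the episode protocol described above (navigate, stop, then commit to a single object guess). The `sim` and `agent` interfaces and their method names are hypothetical placeholders, not the official REVERIE API.

```python
# Minimal sketch of a REVERIE-style episode: navigate until the agent stops,
# then output exactly one object guess from the simulator's candidate list.
def run_episode(sim, agent, instruction, max_steps=30):
    obs = sim.reset(instruction)            # start far from the target object
    for _ in range(max_steps):
        action = agent.act(obs, instruction)
        if action == "STOP":                 # the agent decides when to stop
            break
        obs = sim.step(action)               # move to an adjacent viewpoint
    # Exactly one grounding output is allowed per episode.
    candidates = sim.get_candidate_objects()
    return agent.select_object(candidates, instruction)
```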

Please visit the REVERIE Challenge website for more details!

Submission

The workshop includes an archival and a non-archival track on topics related to language-and-vision research. For both tracks, the reviewing process is single-blind: reviewers will know the authors' identities, but not vice versa. Submission is electronic, using the Softconf START conference management system. The submission site will be available at https://www.softconf.com/acl2020/alvr.

If you are interested in taking a more active part in the workshop, we also encourage you to apply to join the program committee and participate in reviewing submissions via this link: https://forms.gle/voyxjQLFb8duYM5e7. Qualified reviewers will be selected based on prior reviewing experience and publication records.

Archival Track

The archival track follows the ACL short paper format. Submissions may consist of up to 4 pages of content in ACL format (style sheets are available below), plus unlimited references. Accepted papers will be given 5 content pages for the camera-ready version, and authors are encouraged to use the additional page to address reviewers’ comments in their final versions. Papers accepted to the archival track will be included in the ACL 2020 Workshop proceedings. The archival track does not accept double submissions, i.e., previously published papers or papers under concurrent submission to other conferences or workshops.

Papers submitted to the archival track must follow the ACL Author Guidelines. Style sheets (LaTeX, Word) are available here, and an Overleaf template is also available here.

Non-archival Track

The workshop also includes a non-archival track to allow submission of previously published papers as well as papers concurrently submitted to other conferences or journals. Accepted non-archival papers can still be presented as posters at the workshop.

There are no formatting or page restrictions for non-archival submissions. The accepted papers to the non-archival track will be displayed on the workshop website, but will NOT be included in the ACL 2020 Workshop proceedings or otherwise archived.

Important Dates

  • Submission Deadline (extended): April 16, 2020 (11:59pm Anywhere on Earth time, UTC-12)
  • Notification: May 11, 2020
  • Camera Ready deadline: May 21, 2020 (11:59pm Anywhere on Earth time, UTC-12)
  • Workshop Day: July 9, 2020

Proceedings

Papers accepted to the archival track:

  • Extending ImageNet to Arabic using Arabic WordNet - Abdulkareem Alsudais (PDF | Recorded Talk)
  • Toward General Scene Graph: Integration of Visual Semantic Knowledge with Entity Synset Alignment - Woo Suk Choi, Kyoung-Woon On, Yu-Jung Heo and Byoung-Tak Zhang (PDF | Recorded Talk)
  • Visual Question Generation from Radiology Images - Mourad Sarrouti, Asma Ben Abacha and Dina Demner-Fushman (PDF | Recorded Talk | Slides & Video)
  • On the role of effective and referring questions in GuessWhat?! - Mauricio Mazuecos, Alberto Testoni, Raffaella Bernardi and Luciana Benotti (PDF | Recorded Talk)
  • Latent Alignment of Procedural Concepts in Multimodal Recipes - Hossein Rajaby Faghihi, Roshanak Mirzaee, Sudarshan Paliwal and Parisa Kordjamshidi (PDF | Recorded Talk)


Papers accepted to the non-archival track:

  • Pix2R: Guiding Reinforcement Learning using Natural Language by Mapping Pixels to Rewards - Prasoon Goyal, Scott Niekum and Raymond Mooney (PDF | Recorded Talk)
  • TextCaps: a Dataset for Image Captioning with Reading Comprehension - Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach and Amanpreet Singh (PDF | Recorded Talk)
  • Improving VQA and its Explanations by Comparing Competing Explanations - Jialin Wu, Liyan Chen and Raymond Mooney (PDF)
  • Bridging Languages through Images with Deep Partial Canonical Correlation Analysis - Guy Rotman, Ivan Vulić and Roi Reichart (PDF)
  • Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling - Tsu-Jui Fu, Xin Wang, Matthew Peterson, Scott Grafton, Miguel Eckstein and William Yang Wang (PDF | Recorded Talk | Slides)
  • Measuring Social Biases in Grounded Vision and Language Embeddings - Candace Ross, Boris Katz and Andrei Barbu (PDF)
  • Exploring Phrase Grounding without Training: Contextualisation and Extension to Text-Based Image Retrieval - Letitia Parcalabescu and Anette Frank (PDF | Recorded Talk)
  • What is Learned in Visually Grounded Neural Syntax Acquisition - Noriyuki Kojima, Hadar Averbuch-Elor, Alexander Rush and Yoav Artzi (PDF | Recorded Talk)
  • Learning to Map Natural Language Instructions to Physical Quadcopter Control Using Simulated Flight - Valts Blukis, Yannick Terme, Eyvind Niklasson, Ross Knepper and Yoav Artzi (PDF | Demo Video | Poster)
  • Learning Latent Graph Representations for Relational VQA - Liyan Chen and Raymond Mooney (PDF)
  • Entity Skeletons for Visual Storytelling - Khyathi Raghavi Chandu, Ruo-Ping Dong and Alan W Black (PDF)
  • Keyframe Segmentation and Positional Encoding for Video-guided Machine Translation Challenge 2020 - Tosho Hirasawa, Zhishen Yang, Mamoru Komachi, Naoaki Okazaki (PDF | Recorded Talk)
  • DeepFuse: HKU’s Multimodal Machine Translation System for VMT’20 - Zhiyong Wu (PDF | Recorded Talk)
  • Enhancing Neural Machine Translation with Multimodal Rewards - Yuqing Song, Shizhe Chen, Qin Jin (PDF | Recorded Talk)

Program (PDT)

ATTENTION! The video recording of the whole workshop is located at https://slideslive.com/38931798/w9-alvr-live-stream. Some pre-recorded talks are listed below, while others are included in the workshop recording.

8:20-8:25 Opening Remarks [slides] - Workshop Organizers
8:25-9:10 Invited Talk & QA: Grounding Natural Language to 3D [slides] - Angel Chang
9:10-9:55 Invited Talk & QA: Challenges in Evaluating Vision and Language Tasks - Lucia Specia
9:55-10:40 Invited Talk & QA: Multimodal AI: Understanding Human Behaviors [slides] - Louis-Philippe Morency
Break
10:50-11:35 Invited Talk & QA: Robot Control in Situated Instruction Following [video] [slides] - Yoav Artzi
11:35-11:45 Video-guided Machine Translation (VMT) Challenge [slides] - Xin Wang
11:45-12:10 VMT Challenge Talks:
  • Keyframe Segmentation and Positional Encoding for Video-guided Machine Translation [video] - Tosho Hirasawa et al.
  • DeepFuse: HKU’s Multimodal Machine Translation System for VMT’20 [video] - Zhiyong Wu
  • Enhancing Neural Machine Translation with Multimodal Rewards [video] - Yuqing Song et al.
12:10-12:20 VMT Challenge Live QA - All the Challenge Presenters
Break
13:30-14:15 Invited Talk & QA: Augment Machine Intelligence with Multimodal Information [video] - Zhou Yu
14:15-15:00 Invited Talk & QA: Dungeons and DQNs: Grounding Language in Shared Experience [video (SlidesLive)] [video (YouTube)] - Mark Riedl
15:00-15:15 REVERIE Challenge [video] - Yuankai Qi
15:15-15:35 REVERIE Challenge Winner Talk: Distance-aware and Robust Network with Wandering Reducing Strategy for REVERIE [video] - Chen Gao et al.
15:35-15:45 REVERIE Challenge Live QA - All the Challenge Presenters
Break
16:00-16:45 Invited Talk & QA: Vision+Language Research: Self-supervised Learning, Adversarial Training, Multimodal Inference and Explainability [video] [slides] - Jingjing Liu
16:45-17:10 Archival Track Recorded Talks [link]
17:10-17:45 Poster Session and QA - All the Workshop Paper Authors

Invited Speakers

(presentation order)


Angel Chang is an Assistant Professor at Simon Fraser University. Prior to this, she was a visiting research scientist at Facebook AI Research and a research scientist at Eloquent Labs working on dialogue. She received her Ph.D. in Computer Science from Stanford, where she was part of the Natural Language Processing Group and was advised by Chris Manning. Her research focuses on connecting language to 3D representations of shapes and scenes and addresses grounding of language for embodied agents in indoor environments. In general, she is interested in the semantics of shapes and scenes, the representation and acquisition of common-sense knowledge, and reasoning using probabilistic models.

Lucia Specia is a Professor in the Department of Computing, Faculty of Engineering, Imperial College London. Her research focuses on various aspects of data-driven approaches to natural language processing, with a particular interest in multimodal and multilingual context models and work at the intersection of language and vision. Her work has various applications such as machine translation, image captioning, quality estimation and text adaptation.

Louis-Philippe Morency is the Leonardo Associate Professor at the CMU Language Technologies Institute, where he leads the Multimodal Communication and Machine Learning Laboratory (MultiComp Lab). He was previously Research Faculty in the USC Computer Science Department. He received his Ph.D. in Computer Science from the MIT Computer Science and Artificial Intelligence Laboratory. His research focuses on building the computational foundations that enable computers to analyze, recognize and predict subtle human communicative behaviors during social interactions. Central to this research effort is the technical challenge of multimodal machine learning: the mathematical foundations for studying heterogeneous multimodal data and the contingency often found between modalities.

Yoav Artzi is an Assistant Professor in the Department of Computer Science and Cornell Tech at Cornell University. He has received an NSF CAREER award and paper awards at EMNLP 2015, ACL 2017, and NAACL 2018. Previously, he received a B.Sc. summa cum laude from Tel Aviv University and a Ph.D. from the University of Washington. He works at the intersection of natural language processing, machine learning, vision, and robotics. His current main research focus is algorithms for natural language understanding, with a specific interest in situated interactions.

Zhou Yu is an Assistant Professor in the Computer Science Department at the University of California, Davis, where she directs the Davis NLP Lab. She received her PhD from the Language Technologies Institute, School of Computer Science, Carnegie Mellon University, in 2017. She designs algorithms for real-time intelligent interactive systems that coordinate with user actions beyond spoken language, including non-verbal behaviors, to achieve effective and natural communication. In particular, she optimizes human-machine communication via studies of multimodal sensing and analysis, speech and natural language processing, machine learning and human-computer interaction.

Mark Riedl is an Associate Professor in the Georgia Tech School of Interactive Computing and director of the Entertainment Intelligence Lab. Mark's research focuses on the intersection of artificial intelligence, virtual worlds, and storytelling. Mark earned a PhD in 2004 from North Carolina State University. From 2004 to 2007, he was a Research Scientist at the University of Southern California Institute for Creative Technologies, where he researched and developed interactive, narrative-based training systems. His research is supported by the NSF, DARPA, the U.S. Army, Google, and Disney. Mark is the recipient of a DARPA Young Faculty Award and an NSF CAREER Award.

Jingjing (JJ) Liu is a Senior Principal Research Manager at Microsoft, leading a research group on Multimodal AI. Her current research interests center on vision-and-language multimodal intelligence, the intersection of NLP and computer vision. Dr. Liu received her PhD in Computer Science from MIT EECS. She also holds an MBA from the Judge Business School (JBS) at the University of Cambridge. Before joining MSR, Dr. Liu was a Research Scientist at MIT CSAIL and Director of Product at the unicorn startup Mobvoi Inc.

Organizers

  • Xin (Eric) Wang - UC Santa Cruz
  • Jesse Thomason - University of Washington
  • Ronghang Hu - UC Berkeley
  • Xinlei Chen - Facebook AI Research
  • Peter Anderson - Google Research
  • Qi Wu - University of Adelaide
  • Asli Celikyilmaz - Microsoft Research
  • Jason Baldridge - Google Research
  • William Wang - UC Santa Barbara

Program Committee

  • Jacob Andreas - MIT
  • Angel Chang - Simon Fraser University
  • Devendra Chaplot - CMU
  • Abhishek Das - Georgia Tech
  • Daniel Fried - UC Berkeley
  • Zhe Gan - Microsoft
  • Christopher Kanan - Rochester Institute of Technology
  • Jiasen Lu - Georgia Tech
  • Ray Mooney - University of Texas, Austin
  • Khanh Nguyen - University of Maryland
  • Aishwarya Padmakumar - University of Texas, Austin
  • Hamid Palangi - Microsoft Research
  • Alessandro Suglia - Heriot-Watt University
  • Vikas Raunak - CMU
  • Volkan Cirik - CMU
  • Parminder Bhatia - Amazon
  • Khyathi Raghavi Chandu - CMU
  • Asma Ben Abacha - NIH/NLM
  • Thoudam Doren Singh - National Institute of Technology, Silchar, India
  • Dhivya Chinnappa - Thomson Reuters
  • Shailza Jolly - TU Kaiserslautern
  • Alok Singh - National Institute of Technology, Silchar, India
  • Mohamed Elhoseiny - KAUST
  • Marimuthu Kalimuthu - Saarland University
  • Simon Dobnik - University of Gothenburg
  • Shruti Palaskar - CMU
  • Yuankai Qi - University of Adelaide

Contacts

Contact the Organizing Committee: alvr-2020@googlegroups.com