Language and vision research has attracted great attention from both natural language processing (NLP) and computer vision (CV) researchers. Gradually, this area is shifting from passive perception, templated language, and synthetic imagery/environments to active perception, natural language, and photo-realistic simulation or real-world deployment. Thus far, few workshops on language and vision research have been organized by groups from the NLP community. We propose the first workshop on Advances in Language and Vision Research (ALVR) in order to promote the frontier of language and vision research and to bring interested researchers together to discuss how best to tackle and solve real-world problems in this area.
This workshop covers (but is not limited to) the following topics:
We also hold the first Video-guided Machine Translation (VMT) Challenge. This challenge aims to benchmark progress towards models that translate a source-language sentence into the target language with video information as additional spatiotemporal context. The challenge is based on the recently released large-scale multilingual video description dataset, VATEX. The VATEX dataset contains over 41,250 videos and 825,000 high-quality captions in both English and Chinese, half of which are English-Chinese translation pairs.
Winners will be announced and awarded at the workshop.
Please refer to the VMT Challenge website for additional details (such as participation, dates, and starter code)!
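To make the task concrete, here is a minimal sketch of what a single VMT example and model interface might look like. The field names, feature layout, and translate() stub are purely illustrative assumptions, not the official VATEX data format or the challenge starter code; please use the actual starter code for the real interfaces.

```python
# Illustrative sketch only: the field names and the translate() stub below are
# assumptions for exposition, not the official VATEX schema or VMT starter code.
from dataclasses import dataclass
from typing import List


@dataclass
class VMTExample:
    video_id: str                       # identifier of the VATEX video clip
    video_features: List[List[float]]   # precomputed per-segment visual features (assumed layout)
    source: str                         # English caption (source sentence)
    target: str                         # Chinese reference translation


def translate(example: VMTExample) -> str:
    """Placeholder for a video-guided MT model: a real system should condition on
    both the source sentence and the video features (spatiotemporal context)."""
    raise NotImplementedError("Plug in your multimodal translation model here.")


if __name__ == "__main__":
    toy = VMTExample(
        video_id="video_0001",
        video_features=[[0.0] * 1024],  # e.g., one pooled feature vector per segment (assumed)
        source="A man is playing the guitar on a street corner.",
        target="一个男人在街角弹吉他。",
    )
    # A real submission would translate every test example and write the hypotheses
    # to a file in the format required by the challenge evaluation server.
    print(toy.source, "->", toy.target)
```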
The objective of the REVERIE Challenge is to benchmark the state of the art for the remote object grounding task defined in our CVPR paper, in the hope that it will drive progress towards more flexible and powerful human-robot interaction.
The REVERIE task requires an intelligent agent to correctly localise a remote target object (one that cannot be observed from the starting location) specified by a concise, high-level natural language instruction, such as 'bring me the blue cushion from the sofa in the living room'. Since the target object is in a different room from the starting one, the agent first needs to navigate to the goal location. When the agent decides to stop, it should select one object from a list of candidates provided by the simulator. The agent can attempt to localise the target at any step; when to do so is entirely up to the algorithm design. However, the agent may output an answer only once per episode, i.e., it gets a single guess in each run.
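To illustrate the episode protocol described above, the sketch below shows the overall control flow: the agent navigates, decides when to stop, and then commits to exactly one object from the candidates provided by the simulator. The simulator and agent interfaces here are hypothetical stand-ins, not the actual REVERIE simulator API.

```python
# Hypothetical sketch of a single REVERIE episode; the simulator and agent
# interfaces below are assumptions for illustration, not the official REVERIE API.

def run_episode(simulator, agent, instruction, max_steps=30):
    """Navigate according to a high-level instruction, then guess the target
    object exactly once (the agent gets a single guess per episode)."""
    observation = simulator.reset(instruction)

    # Navigation phase: the agent moves until it decides the target is nearby.
    for _ in range(max_steps):
        action = agent.act(observation, instruction)
        if action == "stop":
            break
        observation = simulator.step(action)  # e.g., move to an adjacent viewpoint

    # Grounding phase: the simulator provides candidate objects at the current
    # location, and the agent must commit to exactly one of them as its answer.
    candidates = simulator.get_candidate_objects()
    prediction = agent.ground(observation, instruction, candidates)

    # Only one output is allowed per episode; success is judged on this guess.
    return simulator.submit(prediction)
```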
Please visit the REVERIE Challenge website for more details!
The workshop includes an archival and a non-archival track on topics related to language-and-vision research. For both tracks, the reviewing process is single-blind: reviewers will know the authors' identities, but not the other way around. Submission is electronic, using the Softconf START conference management system. The submission site will be available at https://www.softconf.com/acl2020/alvr.
If you are interested in taking a more active part in the workshop, we also encourage you to apply to join the program committee and participate in reviewing submissions via this link: https://forms.gle/voyxjQLFb8duYM5e7. Qualified reviewers will be selected based on prior reviewing experience and publication record.
The archival track follows the ACL short paper format. Submissions to the archival track may consist of up to 4 pages of content (excluding references) in ACL format (style sheets are available below), plus unlimited references. Accepted papers will be given 5 content pages for the camera-ready version; authors are encouraged to use this additional page to address reviewers' comments in their final versions. Papers accepted to the archival track will be included in the ACL 2020 Workshop proceedings. The archival track does not accept double submissions, i.e., previously published papers and concurrent submissions to other conferences or workshops are not allowed.
The format of papers submitted to the archival track must follow the ACL Author Guidelines. Style sheets (LaTeX, Word) are available here, and an Overleaf template is also available here.
The workshop also includes a non-archival track to allow submission of previously published papers and double submission to ALVR and other conferences or journals. Accepted non-archival papers can still be presented as posters at the workshop.
There are no formatting or page restrictions for non-archival submissions. Papers accepted to the non-archival track will be displayed on the workshop website, but will NOT be included in the ACL 2020 Workshop proceedings or otherwise archived.
ATTENTION! The video recording of the whole workshop is located at https://slideslive.com/38931798/w9-alvr-live-stream. Some pre-recorded talks are listed below, while others are included in the workshop recording.
8:20-8:25 | Opening Remarks [slides] | Workshop Organizers
8:25-9:10 | Grounding Natural Language to 3D [slides] (Invited Talk & QA) | Angel Chang
9:10-9:55 | Challenges in Evaluating Vision and Language Tasks (Invited Talk & QA) | Lucia Specia
9:55-10:40 | Multimodal AI: Understanding Human Behaviors [slides] (Invited Talk & QA) | Louis-Philippe Morency
Break
10:50-11:35 | Robot Control in Situated Instruction Following [video] [slides] (Invited Talk & QA) | Yoav Artzi
11:35-11:45 | Video-guided Machine Translation (VMT) Challenge [slides] | Xin Wang
11:45-12:10 | VMT Challenge Talks | Tosho Hirasawa et al.; Zhiyong Wu; Yuqing Song et al.
12:10-12:20 | VMT Challenge Live QA | All the Challenge Presenters
Break
13:30-14:15 | Augment Machine Intelligence with Multimodal Information [video] (Invited Talk & QA) | Zhou Yu
14:15-15:00 | Dungeons and DQNs: Grounding Language in Shared Experience [video (SlidesLive)] [video (YouTube)] (Invited Talk & QA) | Mark Riedl
15:00-15:15 | REVERIE Challenge [video] | Yuankai Qi
15:15-15:35 | REVERIE Challenge Winner Talk: Distance-aware and Robust Network with Wandering Reducing Strategy for REVERIE [video] | Chen Gao et al.
15:35-15:45 | REVERIE Challenge Live QA | All the Challenge Presenters
Break
16:00-16:45 | Vision+Language Research: Self-supervised Learning, Adversarial Training, Multimodal Inference and Explainability [video] [slides] (Invited Talk & QA) | Jingjing Liu
16:45-17:10 | Archival Track Recorded Talks [link] |
17:10-17:45 | Poster Session and QA | All the Workshop Paper Authors (presentation order)
Angel Chang is an Assistant Professor at Simon Fraser University. Prior to this, she was a visiting research scientist at Facebook AI Research and a research scientist at Eloquent Labs working on dialogue. She received her Ph.D. in Computer Science from Stanford, where she was part of the Natural Language Processing Group and was advised by Chris Manning. Her research focuses on connecting language to 3D representations of shapes and scenes and addresses grounding of language for embodied agents in indoor environments. In general, she is interested in the semantics of shapes and scenes, the representation and acquisition of common-sense knowledge, and reasoning using probabilistic models.
Lucia Specia is a Professor in the Department of Computing, Faculty of Engineering, Imperial College London. Her research focuses on various aspects of data-driven approaches to natural language processing, with a particular interest in multimodal and multilingual context models and work at the intersection of language and vision. Her work has applications in machine translation, image captioning, quality estimation, and text adaptation.
Louis-Philippe Morency is a Leonardo Associate Professor at the CMU Language Technologies Institute, where he leads the Multimodal Communication and Machine Learning Laboratory (MultiComp Lab). He was previously Research Faculty in the USC Computer Science Department. He received his Ph.D. in Computer Science from the MIT Computer Science and Artificial Intelligence Laboratory. His research focuses on building the computational foundations that enable computers to analyze, recognize, and predict subtle human communicative behaviors during social interactions. Central to this research effort is the technical challenge of multimodal machine learning: the mathematical foundations for studying heterogeneous multimodal data and the contingency often found between modalities.
Yoav Artzi is an Assistant Professor in the Department of Computer Science and Cornell Tech at Cornell University. He has received an NSF CAREER award and paper awards at EMNLP 2015, ACL 2017, and NAACL 2018. Previously, he received a B.Sc. summa cum laude from Tel Aviv University and a Ph.D. from the University of Washington. He works at the intersection of natural language processing, machine learning, vision, and robotics. His current main research focus is algorithms for natural language understanding, with a specific interest in situated interactions.
Zhou Yu is an Assistant Professor in the Computer Science Department at the University of California, Davis, where she directs the Davis NLP Lab. She received her Ph.D. from the Language Technologies Institute in the School of Computer Science at Carnegie Mellon University in 2017. She designs algorithms for real-time intelligent interactive systems that coordinate with user actions beyond spoken language, including non-verbal behaviors, to achieve effective and natural communication. In particular, she optimizes human-machine communication through studies of multimodal sensing and analysis, speech and natural language processing, machine learning, and human-computer interaction.
Mark Riedl is an Associate Professor in the Georgia Tech School of Interactive Computing and director of the Entertainment Intelligence Lab. Mark's research focuses on the intersection of artificial intelligence, virtual worlds, and storytelling. Mark earned his Ph.D. in 2004 from North Carolina State University. From 2004 to 2007, he was a Research Scientist at the University of Southern California Institute for Creative Technologies, where he researched and developed interactive, narrative-based training systems. His research is supported by the NSF, DARPA, the U.S. Army, Google, and Disney. Mark is the recipient of a DARPA Young Faculty Award and an NSF CAREER Award.
Jingjing (JJ) Liu is a Senior Principal Research Manager at Microsoft, leading a research group on multimodal AI. Her current research interests center on Vision+Language multimodal intelligence, at the intersection of NLP and computer vision. Dr. Liu received her Ph.D. in Computer Science from MIT EECS. She also holds an MBA from the Judge Business School (JBS) at the University of Cambridge. Before joining MSR, Dr. Liu was a Research Scientist at MIT CSAIL and Director of Product at the unicorn startup Mobvoi Inc.
MIT
Simon Fraser University
CMU
Georgia Tech
UC Berkeley
Microsoft
Rochester Institute of Technology
Georgia Tech
University of Texas, Austin
University of Maryland
University of Texas, Austin
Microsoft Research
Heriot-Watt University
CMU
CMU
Amazon
CMU
NIH/NLM
National Institute of Technology, Silchar, India
Thomson Reuters
TU Kaiserslautern
National Institute of Technology, Silchar, India
KAUST
Saarland University
University of Gothenburg
CMU
University of Adelaide
Contact the Organizing Committee: alvr-2020@googlegroups.com