Shortcomings in (some) Vision and Language Models

Desmond Elliott
Assistant Professor, University of Copenhagen

Thursday 27 June 2019
11:15 - 12:15
Room EM 1.83


I will discuss three projects on understanding and addressing the shortcomings of multimodal translation and image captioning models. In multimodal translation, I will argue that we should evaluate system performance in the presence of congruent or incongruent images, and I will present a new benchmark dataset in which the source-language verbs have multiple possible senses in the target language. Finally, I will discuss some recent work on compositional generalisation to unseen concept pairs in image captioning.


Desmond is an Assistant Professor in the Department of Computer Science at the University of Copenhagen. His research interests include image captioning, multimodal machine translation, and more broadly, multilingual multimodal learning. He co-presented a tutorial on Multimodal Learning and Reasoning at ACL 2016, co-organised three shared tasks on Multimodal Translation between 2016 and 2018, and co-organised the 2018 Frederick Jelinek Memorial Summer Workshop on Grounded Sequence-to-Sequence Transduction.

Host: Yannis Konstas