Data Release: 27 March 2017

NLG Challenge

Entry Submission Deadline: 31 October 2017


Natural language generation plays a critical role in conversational agents, as it strongly shapes a user’s impression of the system. This shared task focuses on recent end-to-end (E2E), data-driven NLG methods, which jointly learn sentence planning and surface realisation from non-aligned data, e.g. (Wen et al., 2015; Mei et al., 2016; Dusek and Jurcicek, 2016; Lampouras and Vlachos, 2016).

So far, E2E NLG approaches have been limited to small, delexicalised data sets such as BAGEL, SF Hotels/Restaurants, or RoboCup. For this shared challenge, we provide a new crowd-sourced data set of 50k instances in the restaurant domain, as described in (Novikova, Lemon and Rieser, 2016). Each instance consists of a dialogue-act-based meaning representation (MR) and up to 5 references in natural language. In contrast to previously used data, our data set poses additional challenges, such as an open vocabulary, complex syntactic structures and diverse discourse phenomena. For example:
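Delexicalisation, as used in the earlier data sets mentioned above, replaces slot values in the text with placeholder tokens so the generator never sees the raw values. A minimal sketch of the idea (the placeholder scheme and slot names are illustrative, not the format of any specific data set):

```python
# Minimal delexicalisation sketch: replace each slot value from an MR
# with a placeholder token, e.g. "The Eagle" -> "X-name".
# The "X-<slot>" naming is an illustrative convention, not a standard.

def delexicalise(text, mr):
    """mr is a dict of slot -> value, e.g. {"name": "The Eagle"}."""
    for slot, value in mr.items():
        text = text.replace(value, f"X-{slot}")
    return text

mr = {"name": "The Eagle", "near": "Burger King"}
out = delexicalise("The Eagle is near Burger King.", mr)
# out == "X-name is near X-near."
```

The reverse mapping (re-lexicalisation) is applied to the generated output, which is why such systems struggle with open-vocabulary data like ours.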


name[The Eagle],
eatType[coffee shop],
near[Burger King]


“The three star coffee shop, The Eagle, gives families a mid-priced dining experience featuring a variety of wines and cheeses. Find The Eagle near Burger King.”
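The MR format shown above is a comma-separated list of slot[value] pairs. A small sketch of how such a string could be parsed into slot–value pairs (the regular expression is inferred from the example; the released data files may differ in detail):

```python
# Parse an MR string of the form "slot[value], slot[value], ..." into a
# dict. The pattern is inferred from the example above and is illustrative.
import re

def parse_mr(mr_string):
    return dict(re.findall(r"(\w+)\[([^\]]*)\]", mr_string))

mr = parse_mr("name[The Eagle], eatType[coffee shop], near[Burger King]")
# mr == {"name": "The Eagle", "eatType": "coffee shop", "near": "Burger King"}
```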

The full data set can be obtained by registering below. A sample of the data can be obtained here. A detailed description of the data can be found in our upcoming SIGDIAL paper, which is available on arXiv.

This challenge follows on from previous successful shared tasks on generation, e.g. SemEval’17 task 9 on text generation from AMR, and Generation Challenges 2008–11. However, this is the first NLG shared task to concentrate on (1) generation from dialogue acts and (2) the use of semantically unaligned data.

Proposed Task

The task is to generate an utterance from a given MR which is (a) similar to human-generated reference texts and (b) highly rated by humans. Similarity will be assessed using standard metrics, such as BLEU and METEOR. Human ratings will be obtained using a mixture of crowd-sourcing and expert annotations. We will also test a suite of novel metrics to estimate the quality of a generated utterance.

The metrics used for automatic evaluation are available on GitHub.
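Word-overlap metrics such as BLEU rest on modified n-gram precision between the generated utterance and the references. A purely illustrative sketch of that core computation (single reference, no brevity penalty; the official evaluation scripts differ):

```python
# Illustrative sketch of modified n-gram precision, the core of BLEU.
# Single reference, no brevity penalty -- NOT the official metric code.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(hyp, ref, n):
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    # Each hypothesis n-gram is credited at most as often as it occurs
    # in the reference ("clipped" counts).
    overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

hyp = "the eagle is a coffee shop near burger king".split()
ref = "the eagle is a coffee shop located near burger king".split()
p1 = ngram_precision(hyp, ref, 1)  # -> 1.0 (all unigrams covered)
p2 = ngram_precision(hyp, ref, 2)  # -> 0.875 ("shop near" is unmatched)
```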

Download Training Data

Please register here to download the training and development data.

The description of the data is available here. A paper with a detailed description of the dataset will appear at SIGDIAL and is now available on arXiv.

To cite the dataset and/or challenge, use:

@inproceedings{novikova2017e2e,
  title={The {E2E} Dataset: New Challenges for End-to-End Generation},
  author={Novikova, Jekaterina and Du{\v{s}}ek, Ondrej and Rieser, Verena},
  booktitle={Proceedings of the 18th Annual Meeting of the Special Interest Group on Discourse and Dialogue},
  address={Saarbr\"ucken, Germany},
  year={2017}
}

Baseline System

We used TGen (Dusek and Jurcicek, 2016) as the baseline system for the challenge. It is a sequence-to-sequence model with attention (Bahdanau et al., 2015), extended with beam search and a reranker that penalises outputs deviating from the input MR. The baseline scores on the development set are as follows:

[Table: baseline automatic-metric scores on the development set]
The full baseline system outputs for the development set can be downloaded here (one instance per line). If you want to run the baseline yourself, basic instructions are provided in the TGen GitHub repository.
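The reranking idea in the baseline — preferring beam-search candidates that stay close to the input MR — can be sketched as follows. TGen's actual reranker is learned; this simple string-matching version, with illustrative names and penalty, only conveys the intuition:

```python
# Illustrative reranking sketch: among beam-search candidates, penalise
# outputs whose text fails to mention the MR's slot values. TGen's real
# reranker is a learned classifier; this is only a toy stand-in.

def rerank(candidates, mr, penalty=1.0):
    """candidates: list of (text, model_score); mr: dict slot -> value."""
    def adjusted(item):
        text, score = item
        missing = sum(1 for v in mr.values() if v.lower() not in text.lower())
        return score - penalty * missing
    return max(candidates, key=adjusted)

mr = {"name": "The Eagle", "near": "Burger King"}
cands = [("The Eagle is a nice place.", 0.9),
         ("The Eagle is near Burger King.", 0.7)]
best = rerank(cands, mr)
# best is the second candidate: it covers both slots despite its lower score
```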

The scripts used for evaluation are available on GitHub.

Important Dates

13 March 2017: Registration opens
27 March 2017: Training and development data released (MRs + references)
27 June 2017: Baseline system released
16 October 2017: Test data released (MRs only)
31 October 2017: Entry submission deadline
15 November 2017: Evaluation results released
15 December 2017: Participants submit a paper describing their systems
February 2018: Results presented at a workshop


Organising Committee

Jekaterina Novikova
Ondrej Dusek
Verena Rieser

Heriot-Watt University, Edinburgh, UK.

Contact Details

Advisory Committee

Mohit Bansal, University of North Carolina at Chapel Hill
Ehud Reiter, University of Aberdeen
Amanda Stent, Bloomberg
Andreas Vlachos, University of Sheffield
Marilyn Walker, University of California Santa Cruz
Matthew Walter, Toyota Technological Institute at Chicago
Tsung-Hsien Wen, University of Cambridge
Luke Zettlemoyer, University of Washington