E2E NLG Challenge

Data Release: 27 March 2017
Entry Submission Deadline: 31 October 2017


Natural language generation plays a critical role in conversational agents, as it has a significant impact on a user's impression of the system. This shared task focuses on recent end-to-end (E2E), data-driven NLG methods, which jointly learn sentence planning and surface realisation from non-aligned data (e.g. Wen et al., 2015; Mei et al., 2016; Dusek and Jurcicek, 2016; Lampouras and Vlachos, 2016).

So far, E2E NLG approaches have been limited to small, delexicalised data sets, e.g. BAGEL, SF Hotels/Restaurants, or RoboCup. For this shared challenge, we provide a new crowd-sourced data set of 50k instances in the restaurant domain, as described in Novikova, Lemon and Rieser (2016). Each instance consists of a dialogue-act-based meaning representation (MR) and up to 5 references in natural language. In contrast to previously used data, our data set includes additional challenges, such as open vocabulary, complex syntactic structures and diverse discourse phenomena. For example:


name[The Eagle],
eatType[coffee shop],
food[French],
priceRange[moderate],
customerRating[3/5],
area[riverside],
kidsFriendly[yes],
near[Burger King]


“The three star coffee shop, The Eagle, gives families a mid-priced dining experience featuring a variety of wines and cheeses. Find The Eagle near Burger King.”
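The MR side is machine-readable by design. As a minimal illustration (our own helper, not part of the official tooling), the following Python sketch turns an MR string in the attribute[value] format shown above into an attribute-value dictionary:

    import re

    def parse_mr(mr):
        """Parse an MR string such as
        'name[The Eagle], eatType[coffee shop], near[Burger King]'
        into an attribute -> value dictionary."""
        return {m.group(1).strip(): m.group(2)
                for m in re.finditer(r'([^,]+?)\[(.*?)\]', mr)}

    print(parse_mr('name[The Eagle], eatType[coffee shop], near[Burger King]'))
    # {'name': 'The Eagle', 'eatType': 'coffee shop', 'near': 'Burger King'}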

The full data set can now be downloaded here. A detailed description of the data can be found in our SIGDIAL 2017 paper.

This challenge follows on from previous successful shared tasks on generation, e.g. SemEval'17 Task 9 on text generation from AMR and the Generation Challenges 2008-11. However, this is the first NLG task to concentrate on (1) generation from dialogue acts and (2) learning from semantically unaligned data.

The Task

The task is to generate an utterance from a given MR which is (a) similar to human-generated reference texts and (b) highly rated by humans. Similarity will be assessed using standard metrics, such as BLEU and METEOR. Human ratings will be obtained using a mixture of crowd-sourcing and expert annotations. We will also test a suite of novel metrics for estimating the quality of a generated utterance.

The metrics used for automatic evaluation are available on Github.
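For a rough sense of how the similarity metrics work, here is a hedged NLTK-based sketch of multi-reference BLEU. The official scripts linked above are the authoritative implementation; NLTK's smoothing and tokenisation will not match them exactly:

    # Illustrative only: the official metrics scripts on Github are authoritative.
    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    # Each generated utterance is scored against all references for its MR
    # (the data set provides up to 5 references per meaning representation).
    references = [
        [['the', 'eagle', 'is', 'a', 'coffee', 'shop', 'near', 'burger', 'king'],
         ['near', 'burger', 'king', 'you', 'will', 'find', 'the', 'eagle']],
    ]
    hypotheses = [['the', 'eagle', 'is', 'a', 'coffee', 'shop', 'by', 'burger', 'king']]

    bleu = corpus_bleu(references, hypotheses,
                       smoothing_function=SmoothingFunction().method3)
    print('BLEU: %.4f' % bleu)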

Download Training and Test Data

The full E2E dataset is now available for download here. The package includes a description of the data format. A paper with a detailed description of the dataset appeared at SIGDIAL 2017 and is also available on arXiv.

To cite the dataset and/or challenge, use:

  @inproceedings{novikova2017e2e,
    title={The {E2E} Dataset: New Challenges for End-to-End Generation},
    author={Novikova, Jekaterina and Du{\v{s}}ek, Ondrej and Rieser, Verena},
    booktitle={Proceedings of the 18th Annual Meeting
               of the Special Interest Group on Discourse and Dialogue},
    address={Saarbr\"ucken, Germany},
    year={2017}
  }

Baseline System

We used TGen (Dusek and Jurcicek, 2016) as the baseline system for the challenge. It is a sequence-to-sequence (seq2seq) model with attention (Bahdanau et al., 2015), extended with beam search and a reranker that penalises outputs straying from the input MR. The baseline scores on the development set are as follows:

  BLEU   | NIST   | METEOR | ROUGE_L | CIDEr
  0.6925 | 8.4781 | 0.4703 | 0.7257  | 2.3987

The full baseline system outputs can be downloaded here for both the development and test sets (one instance per line). If you want to run the baseline yourself, basic instructions are provided in the TGen Github repository.
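To make the reranking idea concrete, here is a schematic Python sketch. TGen's actual reranker uses a trained classifier to detect which dialogue-act slots an output realises; the substring matching below is a deliberately simplified stand-in:

    def rerank(beam, mr_values, penalty=5.0):
        """Pick the best hypothesis from a beam of (text, log_probability)
        pairs, penalising outputs for every MR value they fail to mention.
        Simplified: TGen detects realised slots with a trained classifier
        rather than substring matching."""
        def score(text, logprob):
            missing = sum(1 for v in mr_values if v.lower() not in text.lower())
            return logprob - penalty * missing
        return max(beam, key=lambda h: score(*h))

    beam = [('The Eagle is a coffee shop.', -1.2),
            ('The Eagle is a coffee shop near Burger King.', -1.5)]
    print(rerank(beam, ['The Eagle', 'coffee shop', 'Burger King'])[0])
    # The Eagle is a coffee shop near Burger King.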

The scripts used for evaluation are available on Github.

Important Dates

13 March 2017: Registration opens
27 March 2017: Training and development data released (MRs + references)
27 June 2017: Baseline system released
16 October 2017: Test data released (MRs only)
31 October 2017: Entry submission deadline
15 November 2017: Evaluation results released
15 December 2017: System description papers due
1 March 2018: Final versions of the description papers due
Late 2018: Results presented at INLG (TBA)

Evaluation Results

We are happy to announce that interest in the E2E NLG shared task has far exceeded our expectations. In this first edition of the challenge, organised by Heriot-Watt University, we received a total of 60 submissions from 16 institutions, with about a third of these submissions coming from industry. In comparison, the well-established Conference on Machine Translation (WMT'17, running since 2006) received submissions from 31 institutions across a total of 8 tasks.

Automatic Metrics

The automatic evaluation results below were obtained using the metrics scripts provided with the baseline.

Submitter | Affiliation | System name | P? | BLEU | NIST | METEOR | ROUGE_L | CIDEr
BASELINE | Heriot-Watt University | Baseline | | 0.6593 | 8.6094 | 0.4483 | 0.6850 | 2.2338
Biao Zhang | Xiamen University | bzhang_submit | yes | 0.6545 | 8.1840 | 0.4392 | 0.7083 | 2.1012
Chen Shuang | Harbin Institute of Technology | Abstract-beam1 | | 0.5854 | 5.4691 | 0.3977 | 0.6747 | 1.6391
Chen Shuang | Harbin Institute of Technology | Abstract-beam2 | | 0.5916 | 5.9477 | 0.3974 | 0.6701 | 1.6513
Chen Shuang | Harbin Institute of Technology | Abstract-beam3 | | 0.6150 | 6.8029 | 0.4068 | 0.6750 | 1.7870
Chen Shuang | Harbin Institute of Technology | Abstract-greedy | | 0.6635 | 8.3977 | 0.4312 | 0.6909 | 2.0788
Chen Shuang | Harbin Institute of Technology | NonAbstract-beam2 | | 0.5860 | 6.1602 | 0.3833 | 0.6619 | 1.6133
Chen Shuang | Harbin Institute of Technology | NonAbstract-beam3 | | 0.6088 | 6.9790 | 0.3899 | 0.6628 | 1.7015
Chen Shuang | Harbin Institute of Technology | Primary_NonAbstract-beam1 | yes | 0.5859 | 5.4383 | 0.3836 | 0.6714 | 1.5790
ZHAW | Zurich University of Applied Sciences | base | | 0.6544 | 8.3391 | 0.4448 | 0.6783 | 2.1438
ZHAW | Zurich University of Applied Sciences | primary_1 | yes | 0.5864 | 8.0212 | 0.4322 | 0.5998 | 1.8173
ZHAW | Zurich University of Applied Sciences | primary_2 | yes | 0.6004 | 8.1394 | 0.4388 | 0.6119 | 1.9188
FORGe | Pompeu Fabra University | E2E_UPF_1 | yes | 0.4207 | 6.5139 | 0.3685 | 0.5437 | 1.3106
FORGe | Pompeu Fabra University | E2E_UPF_2 | | 0.4113 | 6.3293 | 0.3686 | 0.5593 | 1.2467
FORGe | Pompeu Fabra University | E2E_UPF_3 | yes | 0.4599 | 7.1092 | 0.3858 | 0.5611 | 1.5586
Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var1 | yes | 0.6015 | 8.3075 | 0.4405 | 0.6778 | 2.1775
Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var2 | | 0.6233 | 8.1751 | 0.4378 | 0.6887 | 2.2840
Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var3 | | 0.5690 | 8.0382 | 0.4202 | 0.6348 | 2.0956
Sheffield NLP | University of Sheffield | sheffield_primarySystem1_var4 | | 0.5799 | 7.9163 | 0.4310 | 0.6670 | 2.0691
Sheffield NLP | University of Sheffield | sheffield_primarySystem2_var1 | yes | 0.5436 | 5.7462 | 0.3561 | 0.6152 | 1.4130
Sheffield NLP | University of Sheffield | sheffield_primarySystem2_var2 | | 0.5356 | 7.8373 | 0.3831 | 0.5513 | 1.5825
HarvardNLP & Henry Elder | Harvard SEAS & Adapt | main_1_support_1 | | 0.6581 | 8.5719 | 0.4409 | 0.6893 | 2.1065
HarvardNLP & Henry Elder | Harvard SEAS & Adapt | main_1_support_2 | | 0.6618 | 8.6025 | 0.4571 | 0.7038 | 2.3371
HarvardNLP & Henry Elder | Harvard SEAS & Adapt | main_1_support_3 | | 0.6737 | 8.6061 | 0.4523 | 0.7084 | 2.3056
HarvardNLP & Henry Elder | Harvard SEAS & Adapt | Primary_main_1 | yes | 0.6496 | 8.5268 | 0.4386 | 0.6872 | 2.0850
Heng Gong | Harbin Institute of Technology | Primary_test_2 | yes | 0.6422 | 8.3453 | 0.4469 | 0.6645 | 2.2721
Heng Gong | Harbin Institute of Technology | test_1 | | 0.6396 | 8.3111 | 0.4466 | 0.6620 | 2.2272
Heng Gong | Harbin Institute of Technology | test_3 | | 0.6395 | 8.3127 | 0.4457 | 0.6628 | 2.2442
Heng Gong | Harbin Institute of Technology | test_4 | | 0.6395 | 8.3127 | 0.4457 | 0.6628 | 2.2442
<anonymous 5> | <anonymous 5> | primary_submission-temperature_1.1 | yes | 0.5092 | 7.1954 | 0.4025 | 0.5872 | 1.5039
<anonymous 5> | <anonymous 5> | supporting_submission-temperature_0.9 | | 0.5573 | 7.7013 | 0.4154 | 0.6130 | 1.8110
<anonymous 5> | <anonymous 5> | supporting_submission-temperature_1.0 | | 0.5265 | 7.3991 | 0.4095 | 0.5992 | 1.6488
<anonymous 1> | <anonymous 1> | <anonymous 1 combined> | | 0.2921 | 4.7690 | 0.2515 | 0.4361 | 0.6674
<anonymous 1> | <anonymous 1> | <anonymous 1 primary> | yes | 0.4723 | 6.1938 | 0.3170 | 0.5616 | 1.2127
Shubham Agarwal | NLE | submission_primary | yes | 0.6534 | 8.5300 | 0.4435 | 0.6829 | 2.1539
Shubham Agarwal | NLE | submission_second | | 0.6669 | 8.5388 | 0.4484 | 0.6991 | 2.2239
Shubham Agarwal | NLE | submission_third | | 0.6676 | 8.5416 | 0.4485 | 0.6991 | 2.2276
UCSC-Slug2Slug | UC Santa Cruz | Slug2Slug | yes | 0.6619 | 8.6130 | 0.4454 | 0.6772 | 2.2615
UCSC-Slug2Slug | UC Santa Cruz | Slug2Slug-alt (late submission) | yes | 0.6035 | 8.3954 | 0.4369 | 0.5991 | 2.1019
Thomson Reuters NLG | Thomson Reuters | NonPrimary_1_test_output_model_11_post | | 0.6536 | 8.3293 | 0.4550 | 0.6805 | 2.1050
Thomson Reuters NLG | Thomson Reuters | NonPrimary_2_test_output_model_13_post | | 0.6562 | 8.3942 | 0.4571 | 0.6876 | 2.1706
Thomson Reuters NLG | Thomson Reuters | NonPrimary_3_test_output_beam_5_model_11_post | | 0.6805 | 8.7777 | 0.4462 | 0.6928 | 2.3195
Thomson Reuters NLG | Thomson Reuters | NonPrimary_4_test_output_beam_5_model_13_post | | 0.6742 | 8.6590 | 0.4499 | 0.6983 | 2.3018
Thomson Reuters NLG | Thomson Reuters | NonPrimary_5_submission_6 | | 0.6208 | 8.0632 | 0.4417 | 0.6692 | 2.1127
Thomson Reuters NLG | Thomson Reuters | NonPrimary_6_submission_4_beam | | 0.6201 | 8.0938 | 0.4419 | 0.6740 | 2.1251
Thomson Reuters NLG | Thomson Reuters | NonPrimary_7_submission_4 | | 0.6182 | 8.0616 | 0.4417 | 0.6729 | 2.0783
Thomson Reuters NLG | Thomson Reuters | NonPrimary_8_test_train_only | | 0.4111 | 6.7541 | 0.3970 | 0.5435 | 1.4096
Thomson Reuters NLG | Thomson Reuters | Primary_1_submission_6_beam | yes | 0.6336 | 8.1848 | 0.4322 | 0.6828 | 2.1425
Thomson Reuters NLG | Thomson Reuters | Primary_2_test_train_dev | yes | 0.4202 | 6.7686 | 0.3968 | 0.5481 | 1.4389
<anonymous 3> | <anonymous 3> | System 1/Primary-Sys1 | yes | 0.6561 | 8.5105 | 0.4517 | 0.6839 | 2.2183
<anonymous 3> | <anonymous 3> | System 1/Sys1-Model1 | | 0.6476 | 8.4301 | 0.4508 | 0.6795 | 2.1233
<anonymous 3> | <anonymous 3> | System 2/Primary-Sys2 | yes | 0.6502 | 8.5211 | 0.4396 | 0.6853 | 2.1670
<anonymous 3> | <anonymous 3> | System 2/Sys2-Model1 | | 0.6606 | 8.6223 | 0.4439 | 0.6772 | 2.1997
<anonymous 3> | <anonymous 3> | System 2/Sys2-Model2 | | 0.6563 | 8.5482 | 0.4482 | 0.6835 | 2.1953
<anonymous 3> | <anonymous 3> | System 2/Sys2-Model3 | | 0.3681 | 6.6004 | 0.3846 | 0.5259 | 1.5205
UIT-DANGNT | VNU-HCM University of Information Technology | test_e2e_result_2 final_TSV | yes | 0.5990 | 7.9277 | 0.4346 | 0.6634 | 2.0783
UKP-TUDA | Technische Universität Darmstadt | test_e2e-Puzikov | yes | 0.5657 | 7.4544 | 0.4529 | 0.6614 | 1.8206
Note: “P?” denotes primary submissions.

Human Evaluation (updated results)

The human evaluation was conducted on the 20 primary systems and the baseline using the CrowdFlower platform. Crowd workers were presented with five randomly selected outputs of different systems corresponding to a single meaning representation and were asked to rank these outputs from best to worst, with ties permitted. A single human-authored reference was provided for comparison. We collected separate ranks for quality and naturalness.

Quality is defined as the overall quality of the utterance, in terms of its grammatical correctness, fluency, adequacy and other important factors. When collecting quality ratings, system outputs were presented to crowd workers together with the corresponding meaning representation.

Naturalness is defined as the extent to which the utterance could have been produced by a native speaker. When collecting naturalness ratings, system outputs were presented to crowd workers without the corresponding meaning representation.

For a real-life NLG system, quality would be considered the primary measure.

The final evaluation results were produced using the TrueSkill algorithm (Sakaguchi et al., 2014). For naturalness, the algorithm performed 1,890 pairwise comparisons per system (37,800 comparisons in total); for quality, 1,260 comparisons per system (25,200 in total). In the results tables below, systems are ordered by their inferred TrueSkill scores and clustered; systems within a cluster are considered tied. The clusters were created using bootstrap resampling at a significance level of p ≤ 0.05.
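For intuition, the sketch below shows how a single pairwise judgement updates system scores, using the open-source trueskill Python package. The official results used the adapted TrueSkill of Sakaguchi et al. (2014) together with bootstrap clustering, so this is only an approximation of that pipeline:

    import trueskill

    env = trueskill.TrueSkill(draw_probability=0.1)  # ranking ties were permitted
    ratings = {'sysA': env.create_rating(), 'sysB': env.create_rating()}

    # One pairwise comparison extracted from a crowd worker's 5-way ranking:
    # sysA was ranked above sysB.
    ratings['sysA'], ratings['sysB'] = env.rate_1vs1(ratings['sysA'],
                                                     ratings['sysB'])

    for name, r in ratings.items():
        # mu is the inferred skill; sigma is its remaining uncertainty
        print(name, round(r.mu, 3), round(r.sigma, 3))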


Quality:

Cluster | TrueSkill | Rank range | System name | Submitter
1 | 0.300 | (1.0, 1.0) | Slug2Slug | UCSC-Slug2Slug
2 | 0.228 | (2.0, 4.0) | ukp-tuda | UKP-TUDA
2 | 0.213 | (2.0, 5.0) | Primary_test_2 | Heng Gong
2 | 0.184 | (3.0, 5.0) | test_e2e_result_2 final_TSV | UIT-DANGNT
2 | 0.184 | (3.0, 6.0) | Baseline | BASELINE
2 | 0.136 | (5.0, 7.0) | Slug2Slug-alt (late submission) | UCSC-Slug2Slug
2 | 0.117 | (6.0, 8.0) | primary_2 | ZHAW
2 | 0.084 | (7.0, 10.0) | System 1/Primary-Sys1 | <anonymous 3>
2 | 0.065 | (8.0, 10.0) | System 2/Primary-Sys2 | <anonymous 3>
2 | 0.048 | (8.0, 12.0) | submission_primary | NLE
2 | 0.018 | (10.0, 13.0) | primary_1 | ZHAW
2 | 0.014 | (10.0, 14.0) | E2E_UPF_1 | FORGe
2 | -0.012 | (11.0, 14.0) | sheffield_primarySystem1_var1 | Sheffield NLP
2 | -0.012 | (11.0, 14.0) | Primary_main_1 | HarvardNLP & Henry Elder
3 | -0.078 | (15.0, 16.0) | Primary_2_test_train_dev | Thomson Reuters NLG
3 | -0.083 | (15.0, 16.0) | E2E_UPF_3 | FORGe
4 | -0.152 | (17.0, 19.0) | primary_submission-temperature_1.1 | <anonymous 5>
4 | -0.185 | (17.0, 19.0) | Primary_1_submission_6_beam | Thomson Reuters NLG
4 | -0.186 | (17.0, 19.0) | bzhang_submit | Biao Zhang
5 | -0.426 | (20.0, 21.0) | Primary_NonAbstract-beam1 | Chen Shuang
5 | -0.457 | (20.0, 21.0) | sheffield_primarySystem2_var1 | Sheffield NLP


Naturalness:

Cluster | TrueSkill | Rank range | System name | Submitter
1 | 0.211 | (1.0, 1.0) | sheffield_primarySystem2_var1 | Sheffield NLP
2 | 0.171 | (2.0, 3.0) | Slug2Slug | UCSC-Slug2Slug
2 | 0.154 | (2.0, 4.0) | Primary_NonAbstract-beam1 | Chen Shuang
2 | 0.126 | (3.0, 6.0) | Primary_main_1 | HarvardNLP & Henry Elder
2 | 0.105 | (4.0, 8.0) | submission_primary | NLE
2 | 0.101 | (4.0, 8.0) | Baseline | BASELINE
2 | 0.091 | (5.0, 8.0) | test_e2e_result_2 final_TSV | UIT-DANGNT
2 | 0.077 | (5.0, 10.0) | ukp-tuda | UKP-TUDA
2 | 0.060 | (7.0, 11.0) | System 2/Primary-Sys2 | <anonymous 3>
2 | 0.046 | (9.0, 12.0) | Primary_test_2 | Heng Gong
2 | 0.027 | (9.0, 12.0) | System 1/Primary-Sys1 | <anonymous 3>
2 | 0.027 | (10.0, 12.0) | bzhang_submit | Biao Zhang
3 | -0.053 | (13.0, 16.0) | Primary_1_submission_6_beam | Thomson Reuters NLG
3 | -0.073 | (13.0, 17.0) | Slug2Slug-alt (late submission) | UCSC-Slug2Slug
3 | -0.077 | (13.0, 17.0) | sheffield_primarySystem1_var1 | Sheffield NLP
3 | -0.083 | (13.0, 17.0) | primary_2 | ZHAW
3 | -0.104 | (15.0, 17.0) | primary_1 | ZHAW
4 | -0.144 | (18.0, 19.0) | E2E_UPF_1 | FORGe
4 | -0.164 | (18.0, 19.0) | primary_submission-temperature_1.1 | <anonymous 5>
5 | -0.243 | (20.0, 21.0) | Primary_2_test_train_dev | Thomson Reuters NLG
5 | -0.255 | (20.0, 21.0) | E2E_UPF_3 | FORGe

Proceedings (Full System Descriptions)

All submitters participating in human evaluation provided a description of their primary systems as a technical paper. The papers are linked below:

Submitter / System
Anonymous 3, System 1
Anonymous 3, System 2
Anonymous 5
Chen Shuang
FORGe (both systems)
HarvardNLP & Henry Elder
Heng Gong (final paper pending)
Sheffield NLP (both systems)
Thomson Reuters NLG, System 1
Thomson Reuters NLG, System 2
Biao Zhang
ZHAW (both systems)


Organising Committee

Jekaterina Novikova
Ondrej Dusek
Verena Rieser

Heriot-Watt University, Edinburgh, UK.


Advisory Committee

Mohit Bansal, University of North Carolina at Chapel Hill
Ehud Reiter, University of Aberdeen
Amanda Stent, Bloomberg
Andreas Vlachos, University of Sheffield
Marilyn Walker, University of California Santa Cruz
Matthew Walter, Toyota Technological Institute at Chicago
Tsung-Hsien Wen, University of Cambridge
Luke Zettlemoyer, University of Washington